Phylogenetics: MrBayes Lab

From EEBedia
Revision as of 22:06, 17 March 2007 by PaulLewis (Talk | contribs) (Starting MrBayes)

Jump to: navigation, search
Under construction.png This article is still under construction.
Expect it to change frequently until this notice is removed.
Adiantum.png EEB 349: Phylogenetics
The goal of this lab exercise is to introduce you to the most important software for conducting Bayesian phylogenetic analyses, MrBayes.

Download and save the data file

Save the contents of the file algaemb.nex to a file. This is the same 16S data you have used before (under the name algae.nex), but the original file had to be changed to accommodate MrBayes' eccentricities. While MrBayes does a fair job of reading Nexus files, it chokes on certain constructs. The information about what I had to change in order to get it to work is in a comment at the top of the file (this might be helpful in converting other nexus files to work with MrBayes in the future).

Create a shortcut to the MrBayes executable

When MrBayes begins, it expects the data file to be in the same folder as itself. One way to allow your data to reside in a different place than MrBayes is to create a Windows shortcut to the program. Open the folder where you saved your data file, and right-click on empty space (i.e. not on the name of an existing file). In the menu that appears, select New > Shortcut. This should bring up the Create Shortcut dialog box. Click the Browse... button and locate and choose the mrbayes.exe file. Now click the Next button and choose a name for your shortcut (or accept the default name). Click the Finish button to close the dialog box and create the shortcut. Now right-click your shortcut and choose Properties from the menu that appears. Specify the full path to the directory containing the algaemb.nex data file in the Start in: box, then click the Ok button.

Starting MrBayes

Start MrBayes by double-clicking the shortcut you just created. When you see the MrBayes > prompt, type the following and then press the Enter key:

exe algaemb.nex

MrBayes should respond with some information, ending with the encouraging words "Successfully read matrix".

Creating a MRBAYES block

The next step is to set up an MCMC analysis. There are three commands in particular that you will need to know about in order to set up a typical run: lset, prset and mcmc. The command mcmcp is identical to mcmc except that it does not actually start a run. For each of these commands you can obtain online information by typing help followed by the command name: for example, help prset. This may spew more information than will fit on your screen, however.

Create a MRBAYES block in your Nexus data file. MrBayes does not have an editor, but you can use PAUP* to edit the algaemb.nex data file. Add the following at the bottom of the file to begin creating the MRBAYES block:

begin mrbayes;
end; 

Note that commands in MrBayes are (intentionally) similar to those in PAUP*, but the differences can be frustrating. For instance, lset ? in PAUP* gives you information about the current likelihood settings, but this does not work in MrBayes. Instead, you type help lset. Also, the lset command in MrBayes has many options not present in PAUP*, and vice versa. Get used to typing help followed by the name of a command in MrBayes to see what options are allowed for a particular command.

Specifying priors

It is fine to copy from the web page and paste into MrBayes, but be aware that in order to paste into the MrBayes window you must right-click and choose the Paste option from the popup menu: The normal Ctrl-V key combination will not work in a Windows console window.

Prior on branch lengths

prset brlenspr=unconstrained:exp(10.0)

The above command specifies that branch lengths are to be unconstrained (i.e. a molecular clock is not assumed) and the prior distribution to be applied to each branch length is an exponential distribution with mean 1/10. Note that the value you specify for unconstrained:exp is the inverse of the mean.

Prior on (discrete gamma rate heterogeneity) shape parameter

prset shapepr=exp(1.0)

This command specifies an exponential distribution with mean 1.0 for the shape parameter of the gamma distribution we will use to model rate heterogeneity. Note that we have not yet told MrBayes that we wish to assume that substitution rates are variable - we will do that using the lset command below.

Prior on kappa

prset tratiopr=beta(1.0,1.0)

The command above says to use a Beta(1,1) distribution as the prior for the transition/transversion rate ratio. This may seem like a strange distribution to use for a parameter that ranges from 0 to infinity, but the reason for this is that MrBayes treats this parameter as if it were a proportion. That is, the rate of transitions (alpha) and the rate of transversions (beta) are normalized so that they add to 1.0. The rate ratio kappa can always be obtained as alpha/beta, so the constraint that alpha + beta = 1 does not affect the likelihood calculations.

Prior on base frequencies

 prset statefreqpr=dirichlet(1.0,1.0,1.0,1.0)

The above command states that a flat dirichlet distribution is to be used for base frequencies. The Dirichlet distribution is like the Beta distribution, except that it is applicable to combinations of parameters. Like the Beta distribution, the distribution is symmetrical if all the parameters of the distribution are equal, and the distribution is flat if all the parameters of the distribution are equal to 1.0. Using the command above specifies a flat Dirichlet prior, which says that any combination of base frequencies will be given equal prior weight. This means that (0.01, 0.99, 0.0, 0.0) is just as probable, a priori, as (0.25, 0.25, 0.25, 0.25). If you wanted base frequencies to not stray much from (0.25, 0.25, 0.25, 0.25), you could specify, say, statefreqpr=dirichlet(10.0,10.0,10.0,10.0) instead.

Specifying the structure of the model

lset nst=2 rates=gamma ngammacat=4 

We are finished setting priors now, so the command above finishes our specification of the model by telling MrBayes that we would like a 2-parameter substitution matrix (i.e. the rate matrix has only two substitution rates, the transition rate and the transversion rate). It also specifies that we would like rates to vary across sites according to a gamma distribution with 4 categories.

Specifying MCMC options

mcmcp ngen=10000 samplefreq=10 printfreq=100 nchains=2 startingtree=random savebrlens=yes

The command above specifies most of the remaining details of the analysis. ngen=10000 tells MrBayes that its robots should each take 10,000 steps. You should ordinarily use much larger values for ngen than this (the default is 1 million steps). We're keeping it small here because some of the computers we are using are excruciatingly slow. samplefreq=10 says to only save parameter values and the tree topology every 10 steps. printfreq=100 says that we would like a progress report every 100 steps. nchains=2 says that we would like to have one heated chain running in addition to the cold chain. Ordinarily you would use at least 3 heated chains; again, we're skimping here to keep things fast while you are learning the ropes. startingtree=random means that we wish to begin from a random starting tree (as opposed to starting with a specific pre-determined tree topology). Finally, savebrlens=yes tells MrBayes that we would like it to save branch lengths when it saves the sampled tree topologies.

Finalizing the MRBAYES block

outgroup Anacystis_nidulans

This merely affects the display of trees. It says we want trees to be rooted between the taxon Anacystis_nidulans and everything else.

set autoclose=yes

This tells MrBayes that we will not want to continue the run beyond the 10,000 generations specified. If you leave this out, MrBayes will ask you whether you wish to continue running the chains after the specified number of generations is finished.

Starting the analysis

mcmc

This command starts the run. While MrBayes runs, it shows one-line progress reports. The first column is the step number. The next two columns show the log-likelihoods of the separate chains that are running, with the cold chain indicated by square brackets rather than parentheses. The last complete column is a prediction of the time remaining until the run completes. The columns consisting of only -- are simply separators, they have no meaning.

  • Do you see evidence that the cold chain and heated chain are swapping with each other?
  • Could you determine from this output the total number of times the cold chain and heated chain successfully swapped with each other?

BEGIN AGAIN HERE

When it has finished, it will report various statistics about the run, such as the percentage of the time it was able to accept proposed changes of various sorts. These percentages should, idealy, all be between about 20% and 40%, but as long as they are not extreme (e.g. 1% or 99%) then things went well.

 The section entitled State exchange information reports the number of times the cold chain attempted to swap with the heated chain (lower left) and the proportion of time such attempts were successful (upper right). In our case, since there were only 2 chains, the number of attempted swaps should be the same as the number of generations, i.e. 10,000. If you had run 4 chains, each generation a random pair of chains would attempt a swap, so none of the numbers in the lower triangle would be as large as the number of generations specified.
 MrBayes saves information in several files. Only two of these will concern us today.
   One of them will be called algaemb.nex.out.p. This is 
   the file in which the sampled  parameter values were saved. This file is saved in tab-delimited
   format so that it is easy to import into a spreadsheet program such as Excel. We will examine this file graphically in a moment, but first let's get MrBayes to summarize its contents for us.
   
   At the MrBayes prompt, type the command sump. This will generate a crude graph showing the log-likelihood as a function of time. Note that the log-likelihood starts out low on the left (you started from a random tree, remember), then quickly climbs to a constant value.
   
   Below the graph, MrBayes provides the arithmetic mean and harmonic mean of the marginal likelihood. The harmonic mean is useful in calculating Bayes factors, which are in turn useful for deciding among different models. We will talk about how to use this value in lecture.
   
   The table at the end is quite useful. It shows the posterior mean, median, variance and 95% credible interval for each parameter in your model based on the samples taken during the run. The credible interval shows the range of values of a parameter that account for the middle 95% of its marginal posterior distribution. If the credible interval for kappa is 3.8 to 6.8, then you can say that there is a 95% chance that kappa is between 3.8 and 6.8 given your data (and of course the model). The parameter TL represents the sum of all the branch lengths. Rather than reported every branch length individually, MrBayes just keeps track of their sum. 


 Now type the command sumt. This will summarize the trees that have been saved in the file algaemb.nex.out.t. 
   
   The output of this command includes a bipartition (=split) table, showing posterior probabilities for every splits found in any tree sampled during the run. After the bipartition table is shown a majority-rule consensus tree containing all splits that had posterior probability 0.5 or above. 
   
   If you chose to save branch lengths (and we did), MrBayes shows a second tree in which each branch is displayed in such a way that branch lengths are proportional to their posterior mean. What does this bit of mumbo-jumbo mean? MrBayes keeps a running sum of the branch lengths for particular splits it finds in trees as it reads the file algaemb.nex.out.t. Before displaying this tree, it divides the sum for each split by the total number of times it encountered the split to get a simple average branch length for each split. It then draws the tree so that branch lengths are proportional to these mean branch lengths.
   
   Finally, the last thing the sumt command does is tell you how many tree topologies are in credible sets of various sizes. For example, in my run, it said that the 99% credible set contained 18 trees. What does this tell us? MrBayes orders tree topologies from most frequent to least frequent (where frequency refers to the number of times they appear in algaemb.nex.out.t). To construct the 99% credible set of trees, it begins by adding the most frequent tree to the set. If that tree accounts for 99% or more of the posterior probability (i.e. at least 99% of all the trees in the algaemb.nex.out.t file have this topology), then MrBayes would say that the 99% credible set contains 1 tree. If the most frequent tree topology was not that frequent, then MrBayes would add the next most frequent tree topology to the set. If the combined posterior probability of both trees was at least 0.99, it would say that the 99% credible set contains 2 trees. In our case, it had to add the top 18 trees to get the total posterior probability up to 99%. 


 Type quit (or just q), to quit MrBayes now.


Using Tracer to summarize MCMC results


     Instead of using Excel to examine the contents of this file, we will use a program called Tracer designed specifically for this purpose. Tracer can be downloaded free from the <a href="http://evolve.zoo.ox.ac.uk/software.html">Oxford Evolutionary Biology Group web site</a>, which by the way is a good place to find many useful programs for phylogenetic analysis.
       
       Double-click the shortcut for Tracer on your desktop to start the program. Choose File | Import... (or use the key combination Ctrl-I) to choose a parameter sample file to display. Select the algaemb.nex.out.p in your working folder, then click the Open button to read it. 
   
   
      You should now see 
         8 rows of values in the table labeled Traces on the left side of the main window. The first row (LnL) is selected by default, and Tracer shows a histogram of log-likelihood values on the right, with summary statistics above the histogram.
         
         A histogram is perhaps not the most useful plot to make with the LnL values. Click the Trace tab to see the kind of plot I termed a history plot in lecture.
       Tracer determines the burn-in period using an undocumented algorithm. You may wish to be more conservative than Tracer. Opinions vary about burn-in. Some Bayesians feel it is important to exclude the first few samples because it is obvious that the chains have not reached stationarity at this point. Other Bayesians feel that if you are worried about the effect of the earliest samples, then you definitely have not run your chains long enough!
   
   Click the Estimates tab again at the top, then click the row labeled kappa on the left. 
  • What is the posterior mean of kappa?
  • What is the 95% credible interval for kappa? Click the row labeled alpha on the left. This is the shape parameter of the gamma distribution governing rates across sites.
  • What is the posterior mean of alpha?
  • What is the 95% credible interval for alpha?
  • Is there rate heterogeneity among site, or are all sites evolving at nearly the same rate? Click on the row labeled TL on the left (the Tree Length "parameter").
  • What is the posterior mean tree length?
  • If all branches were equal in length, what would be the mean length of any individual edge? (Hint: divide the tree length by the number of edges, which is 2n-3 if n is the number of taxa.) Running MrBayes with no data Why would you want to run MrBayes with no data. Here's a reason. You discover that MrBayes assumes, by default, the following branch length prior: exp(10). What does the 10 mean here? Is this an exponential distribution with mean 10 or is 10 the so-called "hazard" parameter (a common way to parameterize the exponential distribution)? If 10 is correctly interpreted as the hazard parameter, then the mean of the distribution is 1/hazard, or 0.1. Ideally, the documentation of MrBayes should tell you this sort of information, but often such details are accidentally left out. Also, it is not possible to place prior distributions directly on some quantities of interest. For example, while you can specify a flat prior on topologies, it is not possible to place a prior on a particular split you are interested in. This is because the prior distribution of splits is an induced prior; that is, it is determined indirectly by the prior you place on topologies. Running a Bayesian MCMC program without data is a good way to make sure you know what priors you are actually placing on the quantities of interest. Even if you think you know, running dataless provides a good sanity check. If there is no information in the data, the likelihood is 1.0 (and the log-likelihood is thus 0.0) for every parameter combination and tree topology. Running MrBayes without data thus effectively takes the likelihood out of Bayes formula, and the posterior distribution equals the prior distribution. An MCMC analysis thus provides an approximation of the prior. Let's see how this works... Execute the file blank.nex in MrBayes. Here is what this file looks like (feel free to look at it in PAUP*'s editor): #nexus begin data; dimensions ntax=4 nchar=1; format datatype=dna missing=? gap=-; matrix A ? B ? C ? D ?  ; end; begin mrbayes; prset brlenspr=unconstrained:exp(10.0); prset tratiopr=beta(1.0,1.0); prset statefreqpr=Dirichlet(1.0,1.0,1.0,1.0); prset shapepr=exp(1.0); lset nst=2 rates=gamma ngammacat=4; mcmcp ngen=1000000 samplefreq=100 printfreq=10000; end; Note that the data matrix is composed of just one character, and that character is composed of only missing data! Type mcmc to perform the analysis. Because calculation of the likelihood is the most time-consuming part of a Bayesian analysis, this analysis will go quickly because the likelihood is very simple in this case. Import the output file blank.nex.out.p in Tracer. Look first at the histogram of alpha, the shape parameter of the gamma distribution.
  • What is the mean you expected for alpha based on the prset shapepr=exp(1.0) command in the blank.nex file?
  • What is the posterior mean actually estimated by MrBayes (and presented by Tracer)?
  • An exponential distribution always starts high and approaches zero as you consider higher and higher values. The highest point of the exponential density function is 1/mean for an exponential distribution. If you look at the approximated density plot (click on the Density tab), does it appear to approach 1/mean at the value alpha=0.0? Now look at the histogram of TL, the tree length.
  • What is the posterior mean of TL, as reported by Tracer?
  • What value did you expect based on the prset brlenspr=unconstrained:exp(10) command?
  • Does the approximated posterior distribution of TL appear to be an exponential distribution? The second and third questions are a bit tricky, so I'll just give you the explanation. Please make sure this explanation makes sense to you, however, and ask us to explain further if it doesn't make sense. We told MrBayes to place an exponential prior with mean 0.1 on each branch. There are 5 branches in a 4-taxon, unrooted tree. Thus, 5 times 0.1 equals 0.5, which should be close to the posterior mean you obtained for TL. That part is fairly straightforward. The marginal distribution of TL does not look at all like an exponential distribution, despite the fact that TL should be the sum of 5 exponential distributions. The famous central limit theorem in statistics says that if you add together several independent random quantities, the distribution of the sum will approximate a normal distribution. The approximation gets better the more items there are in the sum. Also, it doesn't even matter how the individual random quantities are distributed , their sum will have a normal distribution (or something close to a normal distribution). If you have trouble believing this, play with the <a href="../../../applets/clt/cltdemo.html">Java applet</a> I wrote to demonstrate (to myself!) the Central Limit Theorem. That's it for the lab today. You can look at plots of the other parameters if you like. You should also spend some time opening the other output files MrBayes produces in PAUP*'s editor to make sure you understand what information is saved in these files. Note that some of MrBayes' output files are actually Nexus tree files, which you can open in TreeView. For example, algaemb.nex.con contains the consensus tree from the Bayesian analysis, and algaemb.nex.t contains the sampled trees. The file algaemb.nex.trprobs contains all distinct tree topologies, sorted from highest to lowest posterior probability. You may wish to add the suffix ".tre" to these files so that TreeView recognizes them as tree files (allowing you to open them in TreeView with a double-click).