Phylogenetics: MrBayes Lab
This article is still under construction. Expect it to change frequently until this notice is removed. |
EEB 349: Phylogenetics | |
The goal of this lab exercise is to introduce you to the most important software for conducting Bayesian phylogenetic analyses, MrBayes. |
Contents
Download and save the data file
Save the contents of the file algaemb.nex to a file. This is the same 16S data you have used before (under the name algae.nex), but the original file had to be changed to accommodate MrBayes' eccentricities. While MrBayes does a fair job of reading Nexus files, it chokes on certain constructs. The information about what I had to change in order to get it to work is in a comment at the top of the file (this might be helpful in converting other nexus files to work with MrBayes in the future).
Create a shortcut to the MrBayes executable
When MrBayes begins, it expects the data file to be in the same folder as itself. One way to allow your data to reside in a different place than MrBayes is to create a Windows shortcut to the program. Open the folder where you saved your data file, and right-click on empty space (i.e. not on the name of an existing file). In the menu that appears, select New > Shortcut. This should bring up the Create Shortcut dialog box. Click the Browse... button and locate and choose the mrbayes.exe file. Now click the Next button and choose a name for your shortcut (or accept the default name). Click the Finish button to close the dialog box and create the shortcut. Now right-click your shortcut and choose Properties from the menu that appears. Specify the full path to the directory containing the algaemb.nex data file in the Start in: box, then click the Ok button.
Starting MrBayes
Start MrBayes by double-clicking the shortcut you just created. When you see the MrBayes > prompt, type the following and then press the Enter key:
exe algaemb.nex
MrBayes should respond with some information, ending with the encouraging words "Successfully read matrix".
Creating a MRBAYES block
The next step is to set up an MCMC analysis. There are three commands in particular that you will need to know about in order to set up a typical run: lset, prset and mcmc. The command mcmcp is identical to mcmc except that it does not actually start a run. For each of these commands you can obtain online information by typing help followed by the command name: for example, help prset. This may spew more information than will fit on your screen, however.
Create a MRBAYES block in your Nexus data file. MrBayes does not have an editor, but you can use PAUP* to edit the algaemb.nex data file. Add the following at the bottom of the file to begin creating the MRBAYES block:
begin mrbayes; end;
Note that commands in MrBayes are (intentionally) similar to those in PAUP*, but the differences can be frustrating. For instance, lset ? in PAUP* gives you information about the current likelihood settings, but this does not work in MrBayes. Instead, you type help lset. Also, the lset command in MrBayes has many options not present in PAUP*, and vice versa. Get used to typing help followed by the name of a command in MrBayes to see what options are allowed for a particular command.
Specifying the prior on branch lengths
prset brlenspr=unconstrained:exp(10.0)
The above command specifies that branch lengths are to be unconstrained (i.e. a molecular clock is not assumed) and the prior distribution to be applied to each branch length is an exponential distribution with mean 1/10. Note that the value you specify for unconstrained:exp is the inverse of the mean.
Specifying the prior on the gamma shape parameter
prset shapepr=exp(1.0)
This command specifies an exponential distribution with mean 1.0 for the shape parameter of the gamma distribution we will use to model rate heterogeneity. Note that we have not yet told MrBayes that we wish to assume that substitution rates are variable - we will do that using the lset command below.
Specifying the prior on kappa
prset tratiopr=beta(1.0,1.0)
The command above says to use a Beta(1,1) distribution as the prior for the transition/transversion rate ratio. This may seem like a strange distribution to use for a parameter that ranges from 0 to infinity, but the reason for this is that MrBayes treats this parameter as if it were a proportion. That is, the rate of transitions (alpha) and the rate of transversions (beta) are normalized so that they add to 1.0. The rate ratio kappa can always be obtained as alpha/beta, so the constraint that alpha + beta = 1 does not affect the likelihood calculations.
Specifying a prior on base frequencies
prset statefreqpr=dirichlet(1.0,1.0,1.0,1.0)
The above command states that a flat dirichlet distribution is to be used for base frequencies. The Dirichlet distribution is like the Beta distribution, except that it is applicable to combinations of parameters. Like the Beta distribution, the distribution is symmetrical if all the parameters of the distribution are equal, and the distribution is flat if all the parameters of the distribution are equal to 1.0. Using the command above specifies a flat Dirichlet prior, which says that any combination of base frequencies will be given equal prior weight. This means that (0.01, 0.99, 0.0, 0.0) is just as probable, a priori, as (0.25, 0.25, 0.25, 0.25). If you wanted base frequencies to not stray much from (0.25, 0.25, 0.25, 0.25), you could specify, say, statefreqpr=dirichlet(10.0,10.0,10.0,10.0) instead.
The lset command
lset nst=2 rates=gamma ngammacat=4
We are finished setting priors now, so the command above finishes our specification of the model by telling MrBayes that we would like a 2-parameter substitution matrix (i.e. the rate matrix has only two substitution rates, the transition rate and the transversion rate). It also specifies that we would like rates to vary across sites according to a gamma distribution with 4 categories.
Specifying MCMC options
mcmcp ngen=10000 samplefreq=10 printfreq=100 nchains=2 startingtree=random savebrlens=yes
The command above specifies most of the remaining details of the analysis. ngen=10000 tells MrBayes that its robots should each take 10,000 steps. You should ordinarily use much larger values for ngen than this (the default is 1 million steps). We're keeping it small here because some of the computers we are using are excruciatingly slow. samplefreq=10 says to only save parameter values and the tree topology every 10 steps. printfreq=100 says that we would like a progress report every 100 steps. nchains=2 says that we would like to have one heated chain running in addition to the cold chain. Ordinarily you would use at least 3 heated chains; again, we're skimping here to keep things fast while you are learning the ropes. startingtree=random means that we wish to begin from a random starting tree (as opposed to starting with a specific pre-determined tree topology). Finally, savebrlens=yes tells MrBayes that we would like it to save branch lengths when it saves the sampled tree topologies.
Specifying an outgroup
outgroup Anacystis_nidulans
This merely affects the display of trees. It says we want trees to be rooted between the taxon Anacystis_nidulans and everything else.
The autoclose command
set autoclose=yes
This tells MrBayes that we will not want to continue the run beyond the 10,000 generations specified. If you leave this out, MrBayes will ask you whether you wish to continue running the chains after the specified number of generations is finished.
Running MrBayes and interpreting the results
mcmc
This command starts the run. While MrBayes runs, it shows one-line progress reports. The first column is the step number. The next two columns show the log-likelihoods of the separate chains that are running, with the cold chain indicated by square brackets rather than parentheses. The last complete column is a prediction of the time remaining until the run completes. The columns consisting of only -- are simply separators, they have no meaning.
- Do you see evidence that the cold chain and heated chain are swapping with each other?
- Could you determine from this output the total number of times the cold chain and heated chain successfully swapped with each other?
Interpreting the results
When it has finished, it will report various statistics about the run, such as the percentage of time MrBayes was able to accept proposed changes of various sorts. These percentages should, idealy, all be between about 20% and 40%, but as long as they are not extreme (e.g. 1% or 99%) then things went well.
The section entitled "State exchange information" reports the number of times the cold chain attempted to swap with the heated chain (lower left) and the proportion of time such attempts were successful (upper right). In our case, since there were only 2 chains, the number of attempted swaps should be the same as the number of generations, i.e. 10,000. If you had run 4 chains, each generation a random pair of chains would attempt a swap, so none of the numbers in the lower triangle would be as large as the number of generations specified.
The sump command
MrBayes saves information in several files. Only two of these will concern us today. One of them will be called algaemb.nex.out.p. This is the file in which the sampled parameter values were saved. This file is saved in tab-delimited format so that it is easy to import into a spreadsheet program such as Excel. We will examine this file graphically in a moment, but first let's get MrBayes to summarize its contents for us.
At the MrBayes prompt, type the command sump. This will generate a crude graph showing the log-likelihood as a function of time. Note that the log-likelihood starts out low on the left (you started from a random tree, remember), then quickly climbs to a constant value.
Below the graph, MrBayes provides the arithmetic mean and harmonic mean of the marginal likelihood. The harmonic mean is used in estimating Bayes factors, which are in turn useful for deciding which among different models fits the data best on average. We will talk about how to use this value in lecture.
The table at the end is quite useful. It shows the posterior mean, median, variance and 95% credible interval for each parameter in your model based on the samples taken during the run. The credible interval shows the range of values of a parameter that account for the middle 95% of its marginal posterior distribution. If the credible interval for kappa is 3.8 to 6.8, then you can say that there is a 95% chance that kappa is between 3.8 and 6.8 given your data (and of course the model). The parameter TL represents the sum of all the branch lengths. Rather than reported every branch length individually, MrBayes just keeps track of their sum.
The sump command
Now type the command sumt. This will summarize the trees that have been saved in the file algaemb.nex.out.t.
The output of this command includes a bipartition (=split) table, showing posterior probabilities for every splits found in any tree sampled during the run. After the bipartition table is shown a majority-rule consensus tree containing all splits that had posterior probability 0.5 or above. If you chose to save branch lengths (and we did), MrBayes shows a second tree in which each branch is displayed in such a way that branch lengths are proportional to their posterior mean. What does this bit of mumbo-jumbo mean? MrBayes keeps a running sum of the branch lengths for particular splits it finds in trees as it reads the file algaemb.nex.out.t. Before displaying this tree, it divides the sum for each split by the total number of times it encountered the split to get a simple average branch length for each split. It then draws the tree so that branch lengths are proportional to these mean branch lengths. Finally, the last thing the sumt command does is tell you how many tree topologies are in credible sets of various sizes. For example, in my run, it said that the 99% credible set contained 18 trees. What does this tell us? MrBayes orders tree topologies from most frequent to least frequent (where frequency refers to the number of times they appear in algaemb.nex.out.t). To construct the 99% credible set of trees, it begins by adding the most frequent tree to the set. If that tree accounts for 99% or more of the posterior probability (i.e. at least 99% of all the trees in the algaemb.nex.out.t file have this topology), then MrBayes would say that the 99% credible set contains 1 tree. If the most frequent tree topology was not that frequent, then MrBayes would add the next most frequent tree topology to the set. If the combined posterior probability of both trees was at least 0.99, it would say that the 99% credible set contains 2 trees. In our case, it had to add the top 18 trees to get the total posterior probability up to 99%.
Type quit (or just q), to quit MrBayes now.
Using Tracer to summarize MCMC results
Instead of using Excel to examine the contents of this file, we will use a program called Tracer designed specifically for this purpose. Tracer can be downloaded free from the <a href="http://evolve.zoo.ox.ac.uk/software.html">Oxford Evolutionary Biology Group web site</a>, which by the way is a good place to find many useful programs for phylogenetic analysis. Double-click the shortcut for Tracer on your desktop to start the program. Choose File | Import... (or use the key combination Ctrl-I) to choose a parameter sample file to display. Select the algaemb.nex.out.p in your working folder, then click the Open button to read it. You should now see 8 rows of values in the table labeled Traces on the left side of the main window. The first row (LnL) is selected by default, and Tracer shows a histogram of log-likelihood values on the right, with summary statistics above the histogram. A histogram is perhaps not the most useful plot to make with the LnL values. Click the Trace tab to see the kind of plot I termed a history plot in lecture. Tracer determines the burn-in period using an undocumented algorithm. You may wish to be more conservative than Tracer. Opinions vary about burn-in. Some Bayesians feel it is important to exclude the first few samples because it is obvious that the chains have not reached stationarity at this point. Other Bayesians feel that if you are worried about the effect of the earliest samples, then you definitely have not run your chains long enough! Click the Estimates tab again at the top, then click the row labeled kappa on the left.