Phylogenetics: Morphology and Partitioning in MrBayes

From eebedia

Revision as of 10:03, 31 March 2011 by PaulLewis (Talk | contribs)
Jump to: navigation, search
Adiantum.png EEB 349: Phylogenetics
The goal of this lab is twofold. First, you will learn how to analyze discrete morphological character data in MrBayes. Second, you will learn how to combine morphological with molecular data in a partitioned analysis in which each data type is assigned an appropriate model of evolution.

Contents

The Nylander et al. study

The data for this lab comes from a paper by Nylander et al. (2004) that has already become a landmark study in combining data within a Bayesian framework. The full citation is:

Nylander, J., F. Ronquist, J. P. Huelsenbeck, and J. Nieves-Aldrey. 2004. Bayesian phylogenetic analysis of combined data. Systematic Biology 53:47-67

If you have access, you can download the pdf of this paper.

Downloading the data file

The data from the paper is available in TreeBase in the form of five separate files. While you can download these separately (search by Study Accession number for S970), putting them together into a single MrBayes data file is tricky (for one thing, the sequences are not in the same order in all five files!), so I have done that for you. Download the file by clicking here and save a copy to your local hard drive. This nylander.nex data file contains three of the five data sets analyzed by Nylander et al. (2004). At the end of the file, note that there is the beginnings of a mrbayes block:

begin mrbayes;
 charset morph  = 1-166;
 charset 18S    = 167-1320;
 charset COI    = 1321-2397; 
end;

Each of the three lines already in the mrbayes block defines a charset, a meaningful set of characters. Each charset identifies one of the sources of data used in the Nylander et al. (2004) study. The first charset is named morph. While you can use any name you like to identify charsets, this name is appropriate because these are the discrete morphological characters. The second charset is named 18S because it contains 18S rRNA gene sequences. The third charset is named COI. This is a protein-coding gene in the mitochondrion that encodes part of the electron transport chain.

In this lab, you will build on this mrbayes block, but be sure to keep the three charset lines at the top.

Starting a log file

It is always a good idea to start a log file so that you can review the output produced by MrBayes at a later time:

log start file=output.txt replace;

Here, I've named my log file output.txt and instructed MrBayes to replace this file without asking if it already exists.

Reducing the size of the data set

In order to have time for an analysis during the lab period, we will need to delete some taxa:

 delete 1-11 19-21 26-32;

In addition, three morphological characters seem to cause problems for some models, so we will exclude these three characters:

 exclude 29 56 58;

Creating a data partition

Your first task is to tell MrBayes how to partition the data. Used as a verb, partition means to erect walls or dividers. The correct term for the data between two partitions (i.e. dividers) is subset, but data subsets are often, confusingly, also called partitions! Even more confusing, the entire collection of subsets is known as a partition! Add the following lines to the end of your mrbayes block to tell MrBayes that you want each of the 3 defined charsets to be a separate component (subset) of the partition:

partition mine = 3:morph,18S,COI;

The first number (3) is the number of subsets composing the partition. The name by which the partition is known in MrBayes is up to you: here I've chosen the name "mine".

Just defining a partition is not enough to get MrBayes to use it! You must tell MrBayes that you want to use the partition named "mine". This seems a little redundant, but the idea is that you can set up several different partitions and then easily turn them on or off simply by changing the partition named in the set command:

set partition=mine;

Specifying a model for the morphological data

The main reason for creating a partition is so that we can apply a different model to each subset. The first subset corresponds to the morph charset, so we need to apply a model appropriate to morphology to this subset.

The lset command

For each subset, you will create an lset and prset command. At the end of your existing mrbayes block, type in this lset command:

lset applyto=(1) coding=variable nbetacat=5; 

The applyto=(1) statement says that these settings apply only to the first (1) subset. The coding=variable statement instructs MrBayes to use the likelihood conditional on character variability, rather than the ordinary likelihood, which assumes that there will be an appropriate number of constant characters present. The nbetacat=5 is not really necessary, since 5 is the default for this setting, but I include it so that you know that it exists. This is the number of categories to use for the discrete Beta distribution used for state frequencies when analyzing morphological data under a model that allows state frequencies to vary across characters. It is analogous to the number of discrete gamma rate categories used for among site rate heterogeneity.

The prset command

Now, we need to specify the priors for the parameters in the morphology model. The morphology model we are using is quite simple, so the only parameters are really just the branch lengths and the state frequencies (if those are allowed to vary). If you remember, in lecture I made a distinction between symmetric models (which imply equal state frequencies) and asymmetric models (in which the frequencies can be unequal). I mentioned that for the asymmetric model, one possibility is to allow variation in the equilibrium state frequency according to a discrete beta distribution.

To use a symmetric model in which the frequency of each state is 0.5 in MrBayes, create a Beta prior that is a spike concentrated right at the value 0.5:

prset applyto=(1) symdirihyperpr=fixed(infinity) ratepr=variable;

To allow more flexibility, create an assymetric model using a discrete Beta distribution to allow heterogeneity across characters in the state frequencies. (Note that this will be a Dirichlet distribution rather than a Beta distribution if a character has more than 2 states.) The Beta distribution used is always symmetrical, but you have some control over the variance. Recall that a Beta(2,2) has its mode at 0.5 but gently arcs from 0 to 1, whereas a Beta(100,100) distribution is peaked strongly over the value 0.5. We will create a model in which there is a hyperparameter governing the parameters of the Beta distribution and we will place an Exponential(1.0) hyperprior on that hyperparameter.

[prset applyto=(1) symdirihyperpr=exponential(1.0) ratepr=variable;]

Note that I have commented out this second prset command. This is because our first run will explore the pure symmetric model (where state frequencies are always exactly equal to 0.5 for 2-state characters), then we will do a second run to explore the asymmetric model (which will involve commenting out the first prset command and uncommenting the second).

You may be curious about the ratepr=variable statement. All data subsets will be using the same branch lengths, and because morphological characters are probably evolving at a different rate than DNA sequences, this statement causes MrBayes to give this subset its own relative rate. If the estimated value of this parameter ends up being 1.5, it means that the morphological characters evolve about one and a half times faster than the average character/site in the entire data set. The average substitution rate is what is reflected in the estimated branch lengths. It is always a good idea to include the ratepr=variable statement in each subset of a partitioned model when using MrBayes.

Specifying a model for the 18S and COI data

For the 18S sequences, idealy we would use a secondary structure model, but unfortunately I don't know which sites form stems and loops for these data. Thus, we will have to make do with using a standard DNA sequence model (the GTR+I+G model) for this subset. For the COI data, we would (again, in an ideal world) specify a codon model because this is a protein-coding gene. However, we don't have the time in a 2 hour lab to allow a codon model to run to convergence, so we will also be applying the standard GTR+I+G model to the COI data. We will, however, allow the 18S and COI to occupy different subsets of the partition so that they can have different relative rates of substitution.

The lset command

The lset command is similar to ones you have used before with MrBayes:

 lset applyto=(2,3) nst=6 rates=invgamma;

The nst=6 tells MrBayes to use a 6-substitution-rate model (i.e. the GTR model) and the rates=invgamma part says to add invariable sites and discrete gamma rate heterogeneity onto the GTR base model. The applyto=(2,3) statement means that the remainder of the lset command only applies to the second (18S) and third (COI) subsets.

The prset command

Priors need to be specified for all the parameters of the GTR+I+G model except branch lengths and the gamma shape parameter. Because these (branch length and gamma shape priors) are being applied to all subsets, we'll specify these later.

prset applyto=(2,3) revmatpr=dirichlet(1,2,1,1,2,1) statefreqpr=dirichlet(2,2,2,2) pinvarpr=uniform(0,1) ratepr=variable;

This specifies a Dirichlet(1,2,1,1,2,1) prior for the 6 GTR relative substitution rate parameters. This prior is not flat, but it also not very informative: it specifies a slight preference for transitions over transversions. For the base frequencies, we're specifying a vague Dirichlet(2,2,2,2) prior. Again, slightly informative, attaching very weak rubber bands to each relative base frequency, keeping the relative frequencies from straying too far from 0.25. The pinvarpr=uniform(0,1) applies a flat, uniform prior to the pinvar parameter. The applyto=(2,3) statement applies its prior settings to the second (18S data) and third (COI) data subsets. The last ratepr=variable statement in each case ensures that each subset will have a relative rate parameter that allows it to evolve at different rate (potentially) than other subsets.

Specifying branch length and gamma shape priors

The branch length and gamma shape priors apply to all subsets, so we can specify them like this:

prset applyto=(all) brlenspr=unconstrained:exponential(1.0) shapepr=exponential(1.0);

This says to apply an exponential distribution with mean 1 to all branch lengths and to all shape parameters.

Unlinking parameters

By default, MrBayes tries to create the simplest model possible. If you have specified the same model parameters (and the same priors) for more than one subset, MrBayes will "link" these parameters across subsets. This means that it will assume that the values of these linked parameters apply to all subsets involved in the linking. If you want the gamma shape parameter (for example) to take on potentially different values for the morphology, 18S and COI subsets, you have to make sure MrBayes does not automatically link the shape parameter across all three subsets.

 unlink shape=(all) statefreq=(all) revmat=(all);

The above command ensures that different values will be used for the shape parameter, state frequencies and GTR relative rates for all data subsets.

Finishing the MrBayes block

Finish off the MrBayes block by adding mcmc and sump commands and possibly a quit command as follows:

 mcmc ngen=1000000 samplefreq=1000 printfreq=10000 nchains=1 nruns=1 startingtree=random;
 sump;
 quit; [include this only if you are submitting the run to the cluster]

This says to run a single Markov chain for 1,000,000 generations, saving a sample of the tree and model parameters every 1000 generations and showing progress every 10000 generations. Ordinarily, it would be a good idea to let MrBayes use the default values for nchains and nruns (4 and 2, respectively), but I am skimping here due to the time constraints imposed by a 2 hour lab period.

I have used the mcmc command here rather than the mcmcp command. While it is generally less dangerous to use the mcmcp command in your data file, you must use mcmc if you will be submitting this run to the cluster because the cluster has no way to interact with you after you submit the job using qsub.

Running MrBayes on your laptop

You can either set up your run on your own machine (recommended for today's lab) or use the Bioinformatics Facility cluster. I anticipate that for most real analyses, you will want to use the cluster to avoid tying up your laptop for long periods of time, so in the next section I have provided a short reminder of how to set up your run on the cluster.

You can follow the same procedure you used last week to run MrBayes. If you've forgotten the details, here is a link to last week's lab.

Running MrBayes on the cluster

You do not need to read this section unless you for some reason cannot run MrBayes on your laptop. If you need more info than what follows, go back to the lab where we first used the cluster.

First, login to bbcxsrv1.biotech.uconn.edu.

Second, create (in your home directory) a file named "go" (you can actually use any valid name you like, "go" is simply my personal preference) that looks like this:

#$ -o junk.txt -j y
#$ -m bea
#$ -M paul.lewis@uconn.edu 
cd $HOME/runmb
mb nylander.nex

The first line specifies that all output from MrBayes should go to the file junk.txt, which will appear in your home directory.

The second and third lines are optional, but quite useful. The second line says to send an email notification when the analysis begins (b), ends (e) and/or aborts (a). (So, if you only want an email to be sent when the run ends, you would use this instead: #$ -m e) The third line provides the email address to which to send these notifications. This must be a uconn.edu address (and it should NOT be MY email address; please change it to your email address in your copy!).

The fourth line causes MrBayes to be run from within the subdirectory runmb (you'll create that subdirectory next).

The fifth line actually runs MrBayes (mb), specifying nylander.nex as the file to execute.

Third, create a subdirectory named "runmb" under your home directory using the mkdir command:

cd $HOME
mkdir runmb

(Note that you actually do not need to type $HOME: typing just cd takes you to your home directory.)

Fourth, copy your data file (which should be named nylander.nex) into that runmb directory after transferring it to the cluster using Fugu or FileZilla:

cd $HOME
cp nylander.nex runmb

Fifth, start your run from your home directory using the qsub command:

cd $HOME
qsub go

You can periodically check the status of your run using

qstat

You can view the output that MrBayes would normally send to the screen by typing (from your home directory)

cat junk.txt

You will find the output files that MrBayes generates in the subdirectory runmb.

Analyzing the output

Try to answer these questions based on the log file. If you don't feel you know the answer to any of these, feel free to ask us for help!

  • Find the section labeled "Active parameters" Can you tell from this table that separate gamma shape parameters will be used for 18S and COI genes? Can you tell that the same topology will be applied to all three partition subsets?
  • Why is there a warning that "There are 25 characters incompatible with the specified coding bias"?
  • What is the log of the harmonic mean of the likelihood for this model? (write this down for later reference)
  • What is unusual about the state frequencies of the COI gene? (This is actually typical of genes like COI that are in compartments such as mitochondria and chloroplasts.)
  • Which subset of characters evolves the fastest on average? Which evolves most slowly? (hint: look at the subset relative rate parameters m{1}, m{2} and m{3})

Now open Tracer and load your nylander.nex.p file into the program. Click on the Trace tab to see a plot of parameter values against State (which is how Tracer refers to the generation or iteration).

  • On the left, in the ESS (Effective Sample Size) column, are any values red (<100) or yellow (<200)? Looking at the trace plot, can you suggest something you might do in a future run to increase the EES of this parameter?
  • On the left, select m{1}, m{2} and m{3} at the same time (hold down the shift key as you click them). Which of these subset relative rates are you least certain about (i.e. has the largest marginal posterior variance)
  • Now do the same for pi(A){3}, pi(C){3}, pi(G){3}, and pi(T){3}. This should make the AT bias at the COI gene clear.

Asymmetric model

Now move your output files into a new directory (to keep them from being overwritten) and start a second run after changing the prset command for subset 1 (the morphology partition) to the one commented out before. This will enable an asymmetric morphology model in which each character is allowed to have a different forward/reverse substitution rate (which entails also having unequal state frequencies). Once the run is finished, try answering the following questions using the log file and Tracer:

  • What is the log of the harmonic mean of the likelihood for this model? According to the harmonic mean method (albeit not the best, but all we have in this case), does this model fit better than the symmetric model on average?
  • In Tracer, are there more or fewer quantities with low (red or yellow) ESS values?
  • Take a look at the trace plot of m{1} in Tracer. Can you diagnose the problem here? What could you do to improve your next run?

That's it for today

This has been an unusual lab because we have not even looked at any phylogenetic trees resulting from these analyses! My purpose here is to show you how to set up a partitioned run with a mixture of morphological characters and sequence data, and force you to notice a few things about the output. Because I am not an expert in this group, and because we had to exclude most of the taxa and several genes, any trees produced from our analyses may not mean that much anyway. The original paper is worth reading if you end up combining morphological and molecular characters in a Bayesian analysis. Nylander et al. do a great job of discussing the many issues that arise from partitioning the data and combining morphological characters with DNA sequences, and you should now have the background to understand a paper like this.

Personal tools