Phylogenetics: Morphology and Partitioning in MrBayes

From EEBedia
Revision as of 13:06, 30 March 2009 by PaulLewis (Talk | contribs) (Finishing the MrBayes block)

Jump to: navigation, search
Adiantum.png EEB 349: Phylogenetics
The goal of this lab is twofold. First, you will learn how to analyze discrete morphological character data in MrBayes. Second, you will learn how to combine morphological with molecular data in a partitioned analysis in which each data type is assigned an appropriate model of evolution.

The Nylander et al. study

The data for this lab comes from a paper by Nylander et al. (2004) that has already become a landmark study in combining data within a Bayesian framework. The full citation is:

Nylander, J., F. Ronquist, J. P. Huelsenbeck, and J. Nieves-Aldrey. 2004. 
Bayesian phylogenetic analysis of combined data. Systematic Biology 53:47-67.

If you have access, you can download the pdf of this paper.

Downloading the data file

The data from the paper is available in TreeBase in the form of five separate files. While you can download these separately (search by Study Accession number for S970), putting them together into a single MrBayes data file is tricky (for one thing, the sequences are not in the same order in all five files!), so I have done that for you. Download the file by clicking here and save a copy to your local hard drive. This nylander.nex data file contains three of the five data sets analyzed by Nylander et al. (2004). At the end of the file, note that there is the beginnings of a mrbayes block:

begin mrbayes;
 charset morph  = 1-166;
 charset 18S    = 167-1320;
 charset COI    = 1321-2397; 
end;

Each of the three lines already in the mrbayes block defines a charset, a meaningful set of characters. Each charset identifies one of the sources of data used in the Nylander et al. (2004) study. The first charset is named morph. While you can use any name you like to identify charsets, this name is appropriate because these are the discrete morphological characters. The second charset is named 18S because it contains 18S rRNA gene sequences. The third charset is named COI. This is protein-coding gene in the mitochondrion that encodes part of the electron transport chain.

In this lab, you will build on this mrbayes block, but be sure to keep the three charset lines at the top.

Creating a data partition

Your first task is to tell MrBayes how to partition the data. Used as a verb, partition means to erect walls or dividers. The correct term for the data between two partitions (i.e. dividers) is subset, but data subsets are often, confusingly, also called partitions! Even more confusion, the entire collection of subsets is known as a partition! Add the following lines to the end of your mrbayes block to tell MrBayes that you want each of the 3 defined charsets to be a separate component (subset) of the partition:

partition mine = 3:morph,18S,COI;

The first number (3) is the number of subsets composing the partition. The name by which the partition is known in MrBayes is up to you: here I've chosen the name "mine".

Just defining a partition is not enough to get MrBayes to use it! You must tell MrBayes that you want to use the partition named "mine". This seems a little redundant, but the idea is that you can set up several different partitions and then easily turn them on or off to see the effects simply by changing the partition named in the set command:

set partition=mine;

Which group are you in?

I have assigned each of you to one of three groups. Each group will try a different model, with the hope that by the end of the day we can learn something about the effects of these models. None of the models we will try were used by Nylander et al. (2004), so we're striking out on our own here. In the material that follows, look out for statments like this: "Groups 1 and 2 should use this line..." and act appropriately given the group to which you were assigned.

Specifying a model for the morphological data

The main reason for creating a partition is so that we can apply a different model to each subset. The first subset corresponds to the morph charset, so we need to apply a model appropriate to morphology to this subset.

The lset command

For each subset, you will create an lset and prset command. At the end of your existing mrbayes block, type in this lset command:

lset applyto=(1) coding=variable rates=gamma; 

The applyto=(1) statement says that these settings apply only to the first (1) subset. The coding=variable statement instructs MrBayes to use the likelihood conditional on character variability, rather than the ordinarly likelihood, which assumes that there will be an appropriate number of constant characters present. Finally, the last rates=gamma statement says that we would like the rate of evolution to vary from one character to another according to the usual discrete gamma model of among-site rate heterogeneity.

The prset command

Now, we need to specify the priors for the parameters in the morphology model. The morphology model we are using is quite simple, so the only parameters are really just the branch lengths and the gamma shape parameter. If you remember, in lecture I made a distinction between symmetric models (which imply equal state frequencies) and asymmetric models (in which the frequencies can be unequal). I mentioned that for the asymmetric model, one possibility is to allow variation in the equilibrium state frequency according to a discrete beta distribution.

Groups 1 and 3 will use a symmetric model in which the frequency of each state is 0.5. This is accomplished in MrBayes by making the beta distribution concentrated in a spike right at the value 0.5.

If you are in groups 1 or 3, type this line:

prset applyto=(1) symdirihyperpr=fixed(infinity) ratepr=variable;

Group 2 will allow more flexibility, allowing the discrete beta distribution to be only slightly mounded around 0.5 by specifying a beta(2,2) distribution.

If you are in group 2, type this line:

prset applyto=(1) symdirihyperpr=fixed(2.0) ratepr=variable;

You may be curious about the ratepr=variable statement. All data subsets will be using the same branch lengths, and because morphological characters are probably evolving at a different rate than DNA sequences, this statement causes MrBayes to give this subset its own relative rate. If the estimated value of this parameter ends up being 1.5, it means that the morphological characters evolve about one and a half times faster than the average character/site in the entire data set. The average substitution rate is what is reflected in the estimated branch lengths. We will be including the ratepr=variable statement in each subset because it is unlikely that any particular subset is evolving at the average rate.

Specifying a model for the 18S data

For the 18S sequences, idealy we would use a secondary structure model, but unfortunately I don't know which sites form stems and loops for these data. Thus, we will have to make do with using a standard DNA sequence model (the GTR+I+G model) for this subset.

The lset command

The lset command is similar to ones you have used before with MrBayes:

 lset applyto=(2) nst=6 rates=invgamma;

The nst=6 tells MrBayes to use a 6-substitution-rate model (i.e. the GTR model) and the rates=invgamma part says to add invariable sites and discrete gamma rate heterogeneity onto the GTR base model. The applyto=(2) statement means that the remainder of the lset command only applies to the second subset (i.e. the one corresponding to the charset named 18S).

The prset command

Priors need to be specified for all the parameters of the GTR+I+G model except branch lengths and the gamma shape parameter. Because these (branch length and gamma shape priors) are being applied to all subsets, we'll specify these later.

prset applyto=(2) revmatpr=dirichlet(1,2,1,1,2,1) 
  statefreqpr=dirichlet(2,2,2,2) pinvarpr=uniform(0,1)
  ratepr=variable;

This specifies a Dirichlet(1,2,1,1,2,1) prior for the 6 GTR relative substitution rate parameters. This prior is not flat, but is vague, and specifies a slight preference for transitions over transversions. For the base frequencies, we're specifying a vague Dirichlet(2,2,2,2) prior. Again, not flat, instead attaching very weak rubber bands to each base frequency, keeping it from straying too far from 0.25. The pinvarpr=uniform(0,1) applies a flat, uniform prior to the pinvar parameter. The applyto=(2) again just says to apply these prior settings only to the second data subset (18S data), and the last ratepr=variable statement adds a relative rate parameter that applies to this entire subset.

Specifying a model for the COI data

The COI gene is protein-coding, which means we can apply a codon model here if we like. Groups 1 and 2 will use a codon model for this subset, while group 3 will try using GTR+I+G for comparison.

Using a codon model implies that the data begin at the first codon position. This wasn't true for the COI data I downloaded from TreeBase (Nylander et al. did not apply codon models, so they didn't have to worry about this). You'll notice that I've commented out the first nucleotide site for all the COI sequences because that represents the third codon position (the first two positions of this codon were not provided). Now the first nucleotide site corresponds to the first codon position of the first codon in the dataset. Another tricky thing here is that this gene does not use the universal genetic code. Instead, it uses the invertebrate mitochondrial code!

An important aside about NCBI resources

How do you find out about these things? The NCBI (National Center for Biotechnology Information) web site is invaluable for these sorts of things. Go to this web site now and try the following: Under Search, choose nucleotide, and in the for box type Synophromorpha (one of the organisms for which we have COI data). After clicking the Go button, you should see three entries, the last of which is for the COI gene sequence. Click on the link labeled AY368911 to bring up the GenBank record for this sequence. At the bottom of the page that appears, you will see the DNA sequence and, just above this, and indented, you will find the amino acid sequence. Note the line that is labeled /transl_table=5. This tells you which genetic code applies. Clicking the link labeled with the number 5 will take you to some information about the invertebrate mitochondrial code (translation table 5) used here. Note also the line labeled /codon_start=2. This tells you that the sequence of codons in the DNA sequence begins at the second site. This is how I knew to comment out the first site.

The lset and prset commands

For groups 1 and 2, use these commands to specify the model for subset 3:

lset applyto=(3) nucmodel=codon nst=2 code=metmt rates=gamma;
prset applyto=(3) tratiopr=beta(2,1) statefreqpr=dirichlet(2) ratepr=variable;

Some of this (the applyto=(3), rates=gamma, and ratepr=variable parts at least) should look familiar by now. Let's go through the remaining statements one by one. The nucmodel=codon says to use a codon model. The nst=2 says to use an HKY-like codon model (one that has a kappa parameter governing the ratio of the transition rate to the transversion rate in addition to the omega parameter that governs the ratio of the nonsynonymous to synonymous rates). The code=metmt says to use the invertebrate mitochondrial genetic code when interpreting codons.

In the prset command, the tratio=beta(2,1) applies a beta(2,1) prior to the kappa parameter. This is again a vague prior that slightly prefers a higher rate of transitions than transversions. The statefreqpr=dirichlet(2) applies a vague Dirichlet prior to the base frequencies (the single 2 is a sort of shorthand for the lazy allowed by MrBayes; we could have specified statefreqpr=Dirichlet(2,2,2,2)). Finally, ratepr=variable allows this subset to have its own relative rate.

If you are in group 3, you should apply the same lset and prset commands we used for subset 2:

lset applyto=(3) nst=6 rates=invgamma;
prset applyto=(3) revmatpr=dirichlet(1,2,1,1,2,1) 
  statefreqpr=dirichlet(2,2,2,2) pinvarpr=uniform(0,1) 
  ratepr=variable;

Specifying branch length and gamma shape priors

The branch length and gamma shape priors apply to all subsets, so we can specify them like this:

prset applyto=(all) brlenspr=unconstrained:exponential(1.0) shapepr=exponential(1.0);

This says to apply an exponential distribution with mean 1 to all branch lengths and to all shape parameters.

Unlinking parameters

By default, MrBayes tries to create the simplest model possible. If you have specified the same model parameters (and the same priors) for more than one subset, MrBayes will "link" these parameters across subsets. This means that it will assume that the values of these linked parameters apply to all subsets involved in the linking. If you want the gamma shape parameter (for example) to take on potentially different values for the morphology, 18S and COI subsets, you have to make sure MrBayes does not automatically link the shape parameter across all three subsets.

 unlink shape=(all) statefreq=(all) revmat=(all);

The above command ensures that different values will be used for the shape parameter, state frequencies and GTR relative rates for all data subsets.

Finishing the MrBayes block

Finish off the MrBayes block by adding an mcmc command and a quit command as follows:

 mcmc ngen=1000000 samplefreq=100 printfreq=100 nchains=1 nruns=1;
 quit;

This says to run a single Markov chain for 1 million generations, saving a sample of the tree and model parameters every 100 generations and showing progress also every 100 generations. We'll peek in on MrBayes after about 10,000 generations but go ahead and specify 1 million generations (if necessary, we can see the results in next week's lab).

Note the use of the mcmc command here rather than the mcmcp command. While it is generally less dangerous to use the mcmcp command in your data file, in this case we must use mcmc because we will be submitting this to the cluster, and the cluster has no way to interact with you after you submit the job using qsub.

Running MrBayes on the cluster

Because the codon model takes a long time to run, set up your run on the Bioinformatics Facility cluster. Below is a short reminder of how to do this; if you need more info, go back to the Garli lab where we first used the cluster.

First, login to bbcxsrv1.biotech.uconn.edu using the PuTTY program.

Second, create (in your home directory) a file named "go" (you can actually use any valid name you like, "go" is simply my personal preference) that looks like this:

#$ -o junk.txt -j y
cd $HOME/runmb
mb nylander.nex

The first line specifies that all output from MrBayes should go to the file junk.txt, which will appear in your home directory. The second line causes MrBayes to be run from within the subdirectory runmb (you'll create that subdirectory next). The third line actually runs MrBayes (mb), specifying nylander.nex as the file to execute.

Third, create a subdirectory named "runmb" under your home directory using the mkdir command:

cd $HOME
mkdir runmb

Fourth, copy your data file (which should be named nylander.nex) into that runmb directory by first double-clicking the PSFTP program icon to start it, then typing the following:

open bbcxsrv1.biotech.uconn.edu
cd runmb
put nylander.nex
quit

Fifth, start your run from your home directory using the qsub command:

cd $HOME
qsub go

You can periodically check the status of your run using

qstat

You can view the output that MrBayes would normally send to the screen by typing (from your home directory)

cat junk.txt

You will find the output files that MrBayes generates in the subdirectory runmb.

What can we learn from this exercise

At this point, I'll fire up the LCD projector and show you some things to look for in the MrBayes output. For example, we should confirm that MrBayes has actually applied the models we asked it to apply to each subset (it is easy to accidentally specify the wrong model when partitioning, so it is important to check this before waiting days for your analysis to finish!).

What we will be most interested in today is getting a feeling (after a very short number of generations) which of our three models is best at explaining the data on average:

Morphology model COI model
Group 1's model symmetric codon
Group 2's model asymmetric codon
Group 3's model symmetric GTR+I+G

We'll simply eyeball the log-likelihoods for each model for this part. Our analyses will not proceed far enough today to do anything more sophisticated.