Difference between revisions of "Phylogenetics: Morphology and Partitioning in MrBayes"


Revision as of 02:15, 9 April 2007

EEB 349: Phylogenetics
The goal of this lab is twofold. First, you will learn how to analyze discrete morphological character data in MrBayes. Second, you will learn how to combine morphological with molecular data in a partitioned analysis in which each data type is assigned an appropriate model of evolution.

The Nylander et al. study

The data for this lab comes from a paper by Nylander et al. (2004) that has already become a landmark study in combining data within a Bayesian framework. The full citation is:

Nylander, J., F. Ronquist, J. P. Huelsenbeck, and J. Nieves-Aldrey. 2004. 
Bayesian phylogenetic analysis of combined data. Systematic Biology 53:47-67.

If you have access, you can download the pdf of this paper.

Downloading the data file

The data from the paper are available in TreeBase in the form of five separate files. While you can download these separately (search by Study Accession number for S970), putting them together into a single MrBayes data file is tricky (for one thing, the sequences are not in the same order in all five files!), so I have done that for you. Download the file by clicking here and save a copy to your local hard drive. This nylander.nex data file contains three of the five data sets analyzed by Nylander et al. (2004). At the end of the file, note that there is the beginning of a mrbayes block:

begin mrbayes;
 charset morph  = 1-166;
 charset 18S    = 167-1320;
 charset COI    = 1321-2397; 
end;

Each of the three lines already in the mrbayes block defines a charset, a meaningful set of characters. Each charset identifies one of the sources of data used in the Nylander et al. (2004) study. The first charset is named morph. While you can use any name you like to identify charsets, this name is appropriate because these are the discrete morphological characters. The second charset is named 18S because it contains 18S rRNA gene sequences. The third charset is named COI. COI (cytochrome oxidase subunit I) is a protein-coding gene in the mitochondrion that encodes part of the electron transport chain.

In this lab, you will build on this mrbayes block, but be sure to keep the three charset lines at the top.

Creating a data partition

Your first task is to tell MrBayes how to partition the data. Used as a verb, partition means to erect walls or dividers. The correct term for the data between two partitions (i.e. dividers) is subset, but data subsets are often, confusingly, also called partitions! Even more confusingly, the entire collection of subsets is known as a partition! Add the following line to the end of your mrbayes block to tell MrBayes that you want each of the 3 defined charsets to be a separate component (subset) of the partition:

partition mine = 3:morph,18S,COI;

The first number (3) is the number of subsets composing the partition. The name by which the partition is known in MrBayes is up to you: here I've chosen the name "mine".

Just defining a partition is not enough to get MrBayes to use it! You must tell MrBayes that you want to use the partition named "mine". This seems a little redundant, but the idea is that you can set up several different partitions and then easily turn them on or off to see the effects simply by changing the partition named in the set command:

set partition=mine;
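With these two commands added, your mrbayes block should now look roughly like the sketch below. It simply collects the lines given so far; the text in square brackets is NEXUS comment syntax that I have added here for orientation, and it is not part of the original data file.

```
begin mrbayes;
 charset morph  = 1-166;
 charset 18S    = 167-1320;
 charset COI    = 1321-2397;

 [define a partition named "mine" with 3 subsets, one per charset]
 partition mine = 3:morph,18S,COI;

 [make "mine" the active partition]
 set partition=mine;
end;
```

The order matters: the charsets must be defined before the partition that refers to them, and the partition must be defined before the set command that activates it.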

Which group are you in?

I have assigned each of you to one of three groups. Each group will try a different model, with the hope that by the end of the day we can learn something about the effects of these models. None of the models we will try were used by Nylander et al. (2004), so we're striking out on our own here. In the material that follows, look out for statements like this: "Groups 1 and 2 should use this line..." and act appropriately given the group to which you were assigned.

Specifying a model for the morphological data

The main reason for creating a partition is so that we can apply a different model to each subset. The first subset corresponds to the morph charset, so we need to apply a model appropriate to morphology to this subset.

The lset command

For each subset, you will create an lset and prset command. At the end of your existing mrbayes block, type in this lset command:

lset applyto=(1) coding=variable rates=gamma; 

The applyto=(1) statement says that these settings apply only to the first (1) subset. The coding=variable statement instructs MrBayes to use the likelihood conditional on character variability, rather than the ordinary likelihood, which assumes that there will be an appropriate number of constant characters present. Finally, the last rates=gamma statement says that we would like the rate of evolution to vary from one character to another according to the usual discrete gamma model of among-site rate heterogeneity.
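As a rough sketch of what conditioning on variability means (the symbols here are introduced only for illustration and are not MrBayes notation): let L_i be the ordinary likelihood of character i, and let P_const be the probability, under the same tree and model, that a character comes out constant across all taxa. The conditional likelihood is then

```latex
L_i^{\text{variable}} = \frac{L_i}{1 - P_{\text{const}}}
```

Dividing by 1 - P_const corrects for the fact that constant characters had no chance of being included in the morphological matrix, so the remaining probability is rescaled over the variable patterns that could actually have been observed.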

The prset command

Now, we need to specify the priors for the parameters in the morphology model. The morphology model we are using is quite simple, so the only parameters are really just the branch lengths and the gamma shape parameter. If you remember, in lecture I made a distinction between symmetric models (which imply equal state frequencies) and asymmetric models (in which the frequencies can be unequal). I mentioned that for the asymmetric model, one possibility is to allow variation in the equilibrium state frequency according to a discrete beta distribution.

Groups 1 and 3 will use a symmetric model in which the frequency of each state is 0.5. This is accomplished in MrBayes by making the beta distribution concentrated in a spike right at the value 0.5.

If you are in group 1 or 3, type this line:

prset applyto=(1) symdirihyperpr=fixed(infinity) ratepr=variable;

Group 2 will allow more flexibility, allowing the discrete beta distribution to be only slightly mounded around 0.5 by specifying a beta(2,2) distribution.

If you are in group 2, type this line:

prset applyto=(1) symdirihyperpr=fixed(2.0) ratepr=variable;
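To see how the two settings differ, it may help to look at the continuous symmetric beta distribution with parameter alpha for a two-state character (a sketch only; MrBayes works with a discretized version, and alpha here stands for the value given to symdirihyperpr):

```latex
f(x) = \frac{x^{\alpha-1}(1-x)^{\alpha-1}}{B(\alpha,\alpha)}, \qquad
\mathrm{E}[x] = \tfrac{1}{2}, \qquad
\mathrm{Var}[x] = \frac{1}{4(2\alpha+1)}
```

With alpha = 2 the density is 6x(1-x), which has variance 1/20 (standard deviation about 0.22), so state frequencies well away from 0.5 remain quite plausible. As alpha grows toward infinity the variance shrinks to zero, which is the spike at 0.5 used by groups 1 and 3.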

You may be curious about the ratepr=variable statement. All data subsets will be using the same branch lengths, and because morphological characters are probably evolving at a different rate than DNA sequences, this statement causes MrBayes to give this subset its own relative rate. If the estimated value of this parameter ends up being 1.5, it means that the morphological characters evolve about one and a half times faster than the average character/site in the entire data set. The average substitution rate is what is reflected in the estimated branch lengths. We will be including the ratepr=variable statement in each subset because it is unlikely that any particular subset is evolving at the average rate.
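One way to picture the relative rates, assuming (as I understand MrBayes to do) that they are scaled so that their average, weighted by the number of characters in each subset, equals 1. Here m denotes a subset's relative rate and n its number of characters, both symbols being my own shorthand; the values of n follow from the charset definitions:

```latex
\frac{n_{\text{morph}}\, m_{\text{morph}} + n_{18S}\, m_{18S} + n_{COI}\, m_{COI}}
     {n_{\text{morph}} + n_{18S} + n_{COI}} = 1,
\qquad n_{\text{morph}} = 166,\quad n_{18S} = 1154,\quad n_{COI} = 1077
```

Under that assumption, a morphological rate of 1.5 would force the 2231 sequence sites to average (2397 - 166 × 1.5) / 2231 ≈ 0.96, slightly below the overall mean. The relative rates are therefore not free to all rise or fall together.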

Specifying a model for the 18S data

For the 18S sequences, ideally we would use a secondary structure model, but unfortunately I don't know which sites form stems and loops for these data. Thus, we will have to make do with using a standard DNA sequence model (the GTR+I+G model) for this subset.

The lset command

The lset command is similar to ones you have used before with MrBayes:

lset applyto=(2) nst=6 rates=invgamma;

The nst=6 tells MrBayes to use a 6-substitution-rate model (i.e. the GTR model) and the rates=invgamma part says to add invariable sites and discrete gamma rate heterogeneity onto the GTR base model. The applyto=(2) statement means that the remainder of the lset command only applies to the second subset (i.e. the one corresponding to the charset named 18S).

The prset command

Priors need to be specified for all the parameters of the GTR+I+G model except the branch lengths and the gamma shape parameter. Because the priors for those two are being applied to all subsets, we'll specify them later.

prset applyto=(2) revmatpr=dirichlet(1,2,1,1,2,1) statefreqpr=dirichlet(2,2,2,2) ratepr=variable;

This specifies a Dirichlet(1,2,1,1,2,1) prior for the 6 GTR relative substitution rate parameters. MrBayes orders these rates as A<->C, A<->G, A<->T, C<->G, C<->T, G<->T, so the two 2s fall on the transitions (A<->G and C<->T). This prior is not flat, but it is vague, and it specifies a slight preference for transitions over transversions. For the base frequencies, we're specifying a vague Dirichlet(2,2,2,2) prior. Again, this prior is not flat; instead, it attaches very weak rubber bands to each base frequency, keeping it from straying too far from 0.25. The applyto=(2) again just says to apply these prior settings only to the second data subset (18S data), and the last ratepr=variable statement adds a relative rate parameter that applies to this entire subset.
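To put numbers on "vague" and "weak rubber bands", here are the standard mean and variance of one component of a Dirichlet distribution with parameters alpha_1, ..., alpha_k (a textbook result, written in my own notation rather than anything specific to MrBayes):

```latex
\mathrm{E}[x_i] = \frac{\alpha_i}{\alpha_0}, \qquad
\mathrm{Var}[x_i] = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}, \qquad
\alpha_0 = \sum_j \alpha_j
```

For Dirichlet(1,2,1,1,2,1), alpha_0 = 8, so the prior mean is 2/8 for each of the two transition rates and 1/8 for each of the four transversion rates. For Dirichlet(2,2,2,2), each base frequency has prior mean 0.25 and standard deviation of about 0.14, compared with about 0.19 under a flat Dirichlet(1,1,1,1). The pull toward 0.25 is real but gentle.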