Phylogenetics: Morphology and Partitioning in MrBayes

Revision as of 02:02, 9 April 2007

EEB 349: Phylogenetics
The goal of this lab is twofold. First, you will learn how to analyze discrete morphological character data in MrBayes. Second, you will learn how to combine morphological with molecular data in a partitioned analysis in which each data type is assigned an appropriate model of evolution.

The Nylander et al. study

The data for this lab comes from a paper by Nylander et al. (2004) that has already become a landmark study in combining data within a Bayesian framework. The full citation is:

Nylander, J., F. Ronquist, J. P. Huelsenbeck, and J. Nieves-Aldrey. 2004. 
Bayesian phylogenetic analysis of combined data. Systematic Biology 53:47-67.

If you have access, you can download the pdf of this paper.

Downloading the data file

The data from the paper is available in TreeBase (http://www.treebase.org/treebase/index.html) in the form of five separate files. While you can download these separately (search by Study Accession number for S970), putting them together into a single MrBayes data file is tricky (for one thing, the sequences are not in the same order in all five files!), so I have done that for you. Download the file from http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/nylander.nex and save a copy to your local hard drive. This nylander.nex data file contains three of the five data sets analyzed by Nylander et al. (2004). At the end of the file, note that there is the beginning of a mrbayes block:

begin mrbayes;
 charset morph  = 1-166;
 charset 18S    = 167-1320;
 charset COI    = 1321-2397; 
end;

Each of the three lines already in the mrbayes block defines a charset, a meaningful set of characters. Each charset identifies one of the sources of data used in the Nylander et al. (2004) study. The first charset is named morph. While you can use any name you like to identify charsets, this name is appropriate because these are the discrete morphological characters. The second charset is named 18S because it contains 18S rRNA gene sequences. The third charset is named COI. COI (cytochrome oxidase I) is a protein-coding gene in the mitochondrion that encodes part of the electron transport chain.

In this lab, you will build on this mrbayes block, but be sure to keep the three charset lines at the top.

Creating a data partition

Your first task is to tell MrBayes how to partition the data. Used as a verb, partition means to erect walls or dividers. The correct term for the data between two partitions (i.e. dividers) is subset, but data subsets are often, confusingly, also called partitions! Even more confusingly, the entire collection of subsets is known as a partition! Add the following line to the end of your mrbayes block to tell MrBayes that you want each of the 3 defined charsets to be a separate component (subset) of the partition:

partition mine = 3:morph,18S,COI;

The first number (3) is the number of subsets composing the partition. The name by which the partition is known in MrBayes is up to you: here I've chosen the name "mine".

Just defining a partition is not enough to get MrBayes to use it! You must tell MrBayes that you want to use the partition named "mine". This seems a little redundant, but the idea is that you can set up several different partitions and then easily turn them on or off to see the effects simply by changing the partition named in the set command:

set partition=mine;
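
To make the "turn them on or off" idea concrete, here is a minimal sketch of what the block might look like with a second partition defined alongside "mine". The partition name morphdna is my own invention for illustration (it is not in nylander.nex), and I am assuming that MrBayes accepts plain character ranges in a partition definition; the bracketed text is a NEXUS comment.

```nexus
begin mrbayes;
  charset morph  = 1-166;
  charset 18S    = 167-1320;
  charset COI    = 1321-2397;

  partition mine = 3:morph,18S,COI;

  [Hypothetical alternative: morphology versus all DNA sites lumped together]
  partition morphdna = 2: 1-166, 167-2397;

  [Only the partition named here is actually used]
  set partition=mine;
end;
```

Switching to the two-subset scheme would then only require changing the last command to set partition=morphdna. For this lab, stick with "mine".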

Which group are you in?

I have assigned each of you to one of three groups. Each group will try a different model, with the hope that by the end of the day we can learn something about the effects of these models. None of the models we will try were used by Nylander et al. (2004), so we're striking out on our own here. In the material that follows, look out for statements like this: "Groups 1 and 2 should use this line..." and act appropriately given the group to which you were assigned.

Setting up the morphology model

The main reason for creating a partition is so that we can apply a different model to each subset. The first subset corresponds to the morph charset, so we need to apply a model appropriate to morphology to this subset.

For each subset, you will create an lset and prset command. At the end of your existing mrbayes block, type in this lset command:

lset applyto=(1) coding=variable rates=gamma; 

The applyto=(1) statement says that these settings apply only to the first (1) subset. The coding=variable statement instructs MrBayes to use the likelihood conditional on character variability, rather than the ordinary likelihood, which assumes that there will be an appropriate number of constant characters present. Finally, the rates=gamma statement says that we would like the rate of evolution to vary from one character to another according to the usual discrete gamma model of among-site rate heterogeneity.
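
As a rough sketch of what "conditional on character variability" means, the probability of each character pattern is divided by the probability that a character is variable at all. The symbols below are my own notation, not MrBayes output: x_i is the pattern for character i, theta stands for the tree, branch lengths and model parameters, and k is the number of states.

```latex
\[
P(x_i \mid \text{variable}, \theta)
  = \frac{P(x_i \mid \theta)}{1 - P(\text{constant} \mid \theta)},
\qquad
P(\text{constant} \mid \theta)
  = \sum_{s=1}^{k} P(\text{all taxa have state } s \mid \theta).
\]
```

Because morphologists rarely score characters that do not vary, the denominator corrects for the constant characters that could have evolved but were never recorded.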

Now, we need to specify the priors for the parameters in the morphology model. The morphology model we are using is quite simple, so the only parameters are really just the branch lengths and the gamma shape parameter. If you remember, in lecture I made a distinction between symmetric models (which imply equal state frequencies) and asymmetric models (in which the frequencies can be unequal). I mentioned that for the asymmetric model, one possibility is to allow variation in the equilibrium state frequency according to a discrete beta distribution.

Groups 1 and 3 will use a symmetric model in which the frequency of each state is 0.5. This is accomplished in MrBayes by making the beta distribution concentrated in a spike right at the value 0.5.

If you are in group 1 or 3, type this line:

prset applyto=(1) symdirihyperpr=fixed(infinity) ratepr=variable;

Group 2 will allow more flexibility, allowing the discrete beta distribution to be only slightly mounded around 0.5 by specifying a beta(2,2) distribution.
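
To see why beta(2,2) is "only slightly mounded", it may help to write out the symmetric beta density for the frequency of one state of a binary character. Here pi (the state frequency) and alpha (the symmetric beta parameter) are my own notation for this sketch.

```latex
\[
f(\pi \mid \alpha)
  = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2}\,
    \pi^{\alpha-1}(1-\pi)^{\alpha-1},
\qquad 0 < \pi < 1 .
\]
\[
\alpha = 2: \quad f(\pi) = 6\,\pi(1-\pi),
\qquad
\alpha \to \infty: \quad \text{all mass concentrates at } \pi = 0.5 .
\]
```

With alpha = 2 the density is a gentle parabola peaking at 0.5, so frequencies such as 0.3 or 0.7 remain quite plausible; the spike used by groups 1 and 3 is the limiting case as alpha grows without bound.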

If you are in group 2, type this line:

prset applyto=(1) symdirihyperpr=fixed(2.0) ratepr=variable;
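
At this point your mrbayes block should look something like the sketch below. It uses only the commands introduced so far; the bracketed lines are NEXUS comments that I have added for orientation, and only the prset line for your own group should be left active.

```nexus
begin mrbayes;
  charset morph  = 1-166;
  charset 18S    = 167-1320;
  charset COI    = 1321-2397;

  partition mine = 3:morph,18S,COI;
  set partition=mine;

  [Morphology model for subset 1]
  lset applyto=(1) coding=variable rates=gamma;

  [Groups 1 and 3: symmetric model, state frequencies fixed at 0.5]
  prset applyto=(1) symdirihyperpr=fixed(infinity) ratepr=variable;

  [Group 2 would use this line instead of the one above:
   prset applyto=(1) symdirihyperpr=fixed(2.0) ratepr=variable;]
end;
```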

You may be curious about the ratepr=variable statement. All data subsets will be using the same branch lengths, and because morphological characters are probably evolving at a different rate than DNA sequences, this statement causes MrBayes to give this subset its own relative rate. If the estimated value of this parameter ends up being 1.5, it means that the morphological characters evolve about one and a half times faster than the average character/site in the entire data set. The average substitution rate is what is reflected in the estimated branch lengths. We will be including the ratepr=variable statement in each subset because it is unlikely that any particular subset is evolving at the average rate.
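
The interpretation above implies that the subset relative rates average to 1 when weighted by the number of characters in each subset. As a sketch under that assumption, write m_1, m_2, m_3 for the relative rates of morph, 18S and COI (my notation); the subset sizes follow from the charset definitions (166, 1154 and 1077 characters, 2397 in total).

```latex
\[
\frac{166\,m_1 + 1154\,m_2 + 1077\,m_3}{2397} = 1 .
\]
\[
\text{If } m_1 = 1.5: \quad
1154\,m_2 + 1077\,m_3 = 2397 - 166(1.5) = 2148 ,
\]
```

so a morphology rate above 1 has to be balanced by at least one molecular subset evolving more slowly than average.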