Phylogenetics: Mesquite Lab

From EEBedia
Revision as of 18:40, 24 March 2007 by PaulLewis (Talk | contribs) (Generating more data)

This article is still under construction.
Expect it to change frequently until this notice is removed.
EEB 349: Phylogenetics
The goal of this lab exercise is to introduce you to the versatile program Mesquite. Mesquite is modular and extensible, which means some of the current functionality has been contributed by people other than the main authors, Wayne and David Maddison. We'll begin exploring Mesquite by using it to simulate (rather than analyze) data.

You will use Mesquite today to simulate data matrices with known properties, then use PAUP* to estimate parameters from the simulated data. Why would we want to simulate data? There are a couple of reasons:

  • This exercise will give you a feel for how much data is necessary for attaining a certain degree of precision in parameter estimates. Some parameters (e.g. kappa) require much more data (i.e. more sites) to pin down than other parameters (e.g. base frequencies).
  • Parametric bootstrapping and character mapping, two techniques we will examine later this semester, both require simulating data, so simulation can be important in testing complex hypotheses.

Mesquite requires you to read in a tree with branch lengths and define a substitution model before creating a simulated data matrix. The following tutorial will take you through these steps.

Start Mesquite

The various modules that compose Mesquite are indicated by icons as they are loaded.

Create a tree file

Create a file containing a tree that will serve as the "true tree" for purposes of creating a simulated data set. Copy the following and paste into a new file to create a NEXUS tree file.

#nexus

begin taxa;
 title fake;
 dimensions ntax=5;
 taxlabels A B C D E;
end;

begin trees;
  title truth;
  link taxa = fake;
  tree truetree = (A:0.5,(B:0.05,C:0.05):0.05,(D:0.5,E:0.05):0.05);
end;

This file has a couple of NEXUS commands we have not seen before. The title command simply provides a name for a NEXUS block. The link command is used by Mesquite to connect the information provided in different blocks. The link and title commands were invented for use by Mesquite and are not part of the standard set of NEXUS commands, so be aware that using them may make your NEXUS files unreadable by other programs that read NEXUS files.
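As a quick sanity check on the tree description, the branch lengths can be pulled out of the tree string and summed. The following Python sketch is not part of the lab itself. The regular expressions assume that every branch length is a plain decimal number following a colon and that every tip label is a simple word, which holds for this tree but not for tree descriptions in general.

```python
import re

# The tree description from the trees block above.
newick = "(A:0.5,(B:0.05,C:0.05):0.05,(D:0.5,E:0.05):0.05);"

# Every branch length follows a colon; collect them all as floats.
branch_lengths = [float(x) for x in re.findall(r":([0-9.]+)", newick)]

# Tip labels are the words that directly precede a colon.
taxa = re.findall(r"([A-Za-z_]\w*):", newick)

# Tree length: expected substitutions per site summed over all branches.
tree_length = sum(branch_lengths)

print(taxa)
print(len(branch_lengths), "branches")
print(round(tree_length, 2), "total tree length")
```

The long branches leading to A and D account for most of the tree length, which is worth keeping in mind when looking at the simulated data.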

Open the file

Read this file into Mesquite using the File > Open File... menu command. You should see a new window appear with the words Taxa "fake" in the title bar.

View the tree

To show the tree that was in this file, use the menu command Taxa&Trees > New tree window. You will be presented with a list of choices for a source of trees. You want to see the tree that was in the tree file we just opened, so choose the first item (Stored Trees) and press the Ok button.

Make branch lengths proportional to values in tree file

The tree that appears is not drawn so that branches are proportional to the lengths we supplied. To fix this, choose Branches Proportional to Lengths from the Drawing menu.

Curvograms and other tree forms

Choose Drawing > Tree Form > Curvogram from the menu. Feel free to play with other tree forms.

Line width

Choose Drawing > Line width... and change the width to 2, then press the Ok button.

Define the model that will be used for simulating data

Now you are ready to define a model. Let's create a GTR+G model and simulate a data set containing 2,000 sites from it. The first step is to choose Characters > New Character Model > Composite DNA Simulation Model... from the menu above the tree window. Create a name for your new model (e.g. "test model"), then press Ok.

Creating the model

The Edit model dialog box will appear. The different facets (Mesquite calls them submodels) of the model are placed in sections of the dialog box set off by horizontal lines. The first section has to do with base frequencies. If you do nothing, Mesquite will use equal base frequencies both for the root node and the transition probabilities. If you like, you can choose empirical base frequencies for either of these, or you can create your own base frequencies by selecting User-Specified Nucleotide Frequencies... under the drop down list entitled New (there is only one choice in this list, but you must select it to proceed with creating a different base frequency submodel).

You are free to try anything you want here, but note that this tutorial will assume you left both the root states and equilibrium states set to Equal Frequencies.

Specifying rate heterogeneity

The character rates section has to do with among-site rate heterogeneity. Let's put rate heterogeneity into our simulated data using a gamma distribution (but not invariable sites). To do this, we must create a new submodel by selecting Gamma Rates Model... from the drop down list entitled New. Create a name for the new submodel you are creating (e.g. "ASRV submodel"), and then press Ok.

This will bring up the Gamma Rates Model dialog box. Enter a value for the gamma shape parameter (e.g. 0.1) and uncheck the discrete gamma checkbox. By unchecking this checkbox, we are telling Mesquite that we want it to draw a gamma-distributed relative rate separately for each site, so each site will have its own relative rate. If we left the discrete gamma checkbox checked, Mesquite would instead draw one of the 4 representative relative rates for each site. In that case, approximately one fourth of the sites would evolve according to the representative rate for the first rate category, one fourth of the sites would evolve according to the representative rate for the second category, and so on. In analyzing data, we always (for practical reasons) approximate the gamma distribution by using the discrete gamma approach, but when simulating data you have more flexibility.
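The difference between the two options can be sketched in a few lines of Python. This is an illustration only, not Mesquite's code. It assumes the gamma distribution is scaled to have mean 1, and it approximates each category's representative rate by the average of sampled rates falling in that quarter of the distribution (one common convention; the tutorial does not say which representative rate Mesquite uses). All function names are made up for this example.

```python
import random

def continuous_rates(shape, n_sites, rng):
    """Draw one gamma-distributed relative rate per site (mean 1)."""
    # gammavariate(alpha, beta) has mean alpha * beta, so beta = 1/shape gives mean 1.
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]

def approx_category_rates(shape, n_cat, rng, n_draws=200_000):
    """Approximate the mean rate of each equal-probability category by sampling."""
    draws = sorted(continuous_rates(shape, n_draws, rng))
    size = n_draws // n_cat
    return [sum(draws[k * size:(k + 1) * size]) / size for k in range(n_cat)]

def discrete_rates(shape, n_sites, n_cat, rng):
    """Assign each site one of n_cat representative rates, chosen uniformly."""
    cats = approx_category_rates(shape, n_cat, rng)
    return [rng.choice(cats) for _ in range(n_sites)]

rng = random.Random(1)
cont = continuous_rates(0.1, 2000, rng)
disc = discrete_rates(0.1, 2000, 4, rng)
print(len(set(cont)), "distinct rates with continuous gamma")
print(len(set(disc)), "distinct rates with 4-category discrete gamma")
```

With a shape as small as 0.1, most sites receive a rate very close to zero and a few receive a very high rate, so the continuous draws span many orders of magnitude while the discrete version collapses them to four values.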

Specifying the rate matrix

Finally, we come to the rate matrix model. Let's use the GTR model for our simulated data. In the drop down list entitled New, choose GTR Rate Matrix Model..., then choose a name (e.g. "GTR submodel") and press the Ok button. You are now presented with the GTR Rate Matrix Model dialog box. The GTR model has 6 substitution classes, but you enter a relative rate for only 5 of them: Mesquite forces the last relative rate (G to T) to be 1, so the rates you specify are relative to the GT transversion. I entered, just for fun, 6 for AC, 5 for AG, 4 for AT, 3 for CG and 2 for CT.

Leave the scaling factor set to its default value of 1.0, then press the Ok button to close the Edit model dialog box.
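To see what these six relative rates imply, the following Python sketch builds the corresponding instantaneous rate matrix for equal base frequencies. It follows the usual textbook construction (off-diagonal entry = relative rate times the frequency of the target base) and normalizes the matrix so that the mean substitution rate is 1. Whether Mesquite applies exactly this normalization is an assumption here, and the function name is hypothetical.

```python
BASES = "ACGT"

# Relative rates entered in the tutorial; G<->T is fixed at 1.
REL_RATES = {"AC": 6.0, "AG": 5.0, "AT": 4.0, "CG": 3.0, "CT": 2.0, "GT": 1.0}
FREQS = {b: 0.25 for b in BASES}  # equal base frequencies, as assumed above

def gtr_rate_matrix(rel_rates, freqs):
    """Return Q as a dict of dicts, scaled to a mean rate of one substitution."""
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                key = "".join(sorted(i + j))  # "CA" and "AC" share one rate
                q[i][j] = rel_rates[key] * freqs[j]
        # Each diagonal entry makes its row sum to zero.
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    # Normalize so that the mean rate, -sum_i pi_i * q_ii, equals 1.
    mean_rate = -sum(freqs[i] * q[i][i] for i in BASES)
    return {i: {j: q[i][j] / mean_rate for j in BASES} for i in BASES}

Q = gtr_rate_matrix(REL_RATES, FREQS)
for i in BASES:
    print(i, ["%7.4f" % Q[i][j] for j in BASES])
```

Because only the ratios among the six rates matter after normalization, fixing GT at 1 loses no generality; multiplying all six rates by the same constant gives the same scaled matrix.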

Make it so

You have now specified a tree with branch lengths and a substitution model, so you are ready to simulate. Choose Characters > Make new matrix from > Simulated matrices on Current Tree from the menu attached to the tree window. This will bring up a dialog box asking which kind of model you wish to use. We are simulating DNA data, so choose Evolve DNA Characters and press the Ok button.

Now you are asked to choose a specific DNA character model. You should see the one you defined listed (I called it "test model", but you may have used a different name). Choose the model you defined and press Ok to continue.

Press Ok to confirm use of the current tree for simulating data.

Enter 2000 for the number of characters to simulate, then press Ok.

Enter a name for your simulated data matrix (e.g. "simulated data"), then press Ok.

After a delay, you should see your simulated data matrix appear in a new window. To save this data set to a NEXUS file, choose File > Save File As... in the menu attached to your data matrix window. Choose a file name and click Ok.
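For readers curious about what a simulation like this involves, here is a minimal Python sketch of the general idea: draw a rate for each site, pick a root state, and let the state change at random along each branch of the true tree. It is not Mesquite's implementation. It assumes equal base frequencies, the relative rates used above, a continuous gamma with mean 1, and a rate matrix scaled so that branch lengths are expected substitutions per site. All names are made up for this example.

```python
import random

BASES = "ACGT"
REL_RATES = {"AC": 6.0, "AG": 5.0, "AT": 4.0, "CG": 3.0, "CT": 2.0, "GT": 1.0}
# With equal frequencies the mean rate is 0.25 * 0.25 * 2 * sum(rates); dividing
# by it makes one unit of branch length one expected substitution per site.
SCALE = 0.25 * 0.25 * 2 * sum(REL_RATES.values())

def rate(i, j):
    """Instantaneous rate of change from base i to base j (equal frequencies)."""
    return REL_RATES["".join(sorted(i + j))] * 0.25 / SCALE

# The true tree as (subtree, branch_length) pairs; a tip is just its name.
TREE = [("A", 0.5),
        ([("B", 0.05), ("C", 0.05)], 0.05),
        ([("D", 0.5), ("E", 0.05)], 0.05)]

def evolve(state, length, rng):
    """Wait an exponential time, jump to a new base, repeat until time runs out."""
    remaining = length
    while True:
        targets = [b for b in BASES if b != state]
        weights = [rate(state, b) for b in targets]
        remaining -= rng.expovariate(sum(weights))
        if remaining < 0:
            return state
        state = rng.choices(targets, weights=weights)[0]

def simulate_site(children, state, site_rate, rng, column):
    """Recursively evolve one site down the tree, recording the tip states."""
    for subtree, length in children:
        child_state = evolve(state, length * site_rate, rng)
        if isinstance(subtree, str):
            column[subtree] = child_state
        else:
            simulate_site(subtree, child_state, site_rate, rng, column)

def simulate(n_sites, shape, seed=None):
    """Return a dict mapping each taxon name to its simulated sequence."""
    rng = random.Random(seed)
    seqs = {t: [] for t in "ABCDE"}
    for _ in range(n_sites):
        site_rate = rng.gammavariate(shape, 1.0 / shape)  # continuous gamma, mean 1
        column = {}
        simulate_site(TREE, rng.choice(BASES), site_rate, rng, column)
        for taxon, base in column.items():
            seqs[taxon].append(base)
    return {t: "".join(s) for t, s in seqs.items()}

matrix = simulate(n_sites=60, shape=0.1, seed=1)
for taxon, seq in matrix.items():
    print(taxon, seq)
```

With a shape of 0.1 most columns come out constant, because most sites draw a rate near zero; the few variable columns tend to separate A and D, which sit on the long branches.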

Analyzing simulated data in PAUP*

Important: do not close Mesquite - you will be simulating more data in a minute, and if you close Mesquite at this point, you will need to start over again from scratch!

Open the file you just saved in PAUP*, set up the GTR+G model using the set and lset commands below

set criterion=likelihood;
lset nst=6 basefreq=equal rates=gamma shape=estimate rmatrix=estimate;

and then use lscores 1 to ask PAUP* to estimate the model parameters. How close was PAUP* able to get to the values you entered for the gamma shape parameter (which governs among-site rate heterogeneity) and for the relative rates of the GTR model?

Generating more data

Repeat the exercise, this time creating 10,000 sites rather than only 2,000. This gives PAUP* 5 times more data to work with when estimating parameters. Note that you do not have to start over from scratch: you have already set up the model and tree in Mesquite, so you simply need to instruct it to generate more data. Start this time with the section entitled "Make it so".
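As a rough guide to what 5 times more data buys, consider the simplest possible case: estimating a single base frequency from independent sites. The Python sketch below uses the binomial standard error for that toy case. It is only an illustration of the usual 1/sqrt(n) scaling; parameters such as the gamma shape or the GTR relative rates are not simple proportions, and how closely their estimates follow this scaling is something the exercise lets you explore rather than something this sketch establishes.

```python
import math

def base_freq_std_error(p, n_sites):
    """Binomial standard error of an observed base frequency from n independent sites."""
    return math.sqrt(p * (1.0 - p) / n_sites)

se_2000 = base_freq_std_error(0.25, 2000)
se_10000 = base_freq_std_error(0.25, 10000)

print("SE with  2,000 sites: %.4f" % se_2000)
print("SE with 10,000 sites: %.4f" % se_10000)
print("ratio: %.3f" % (se_2000 / se_10000))  # equals sqrt(5), about 2.236
```

In this toy case, 5 times as many sites shrinks the standard error by a factor of about 2.2, not 5.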