Phylogenetics: Mesquite Lab

From EEBedia
Jump to: navigation, search
Adiantum.png EEB 349: Phylogenetics
The main goal of this lab exercise is to introduce you to the versatile program Mesquite. Mesquite is modular and extensible, which means some of the current functionality has been contributed by people other than the main authors, Wayne and David Maddison. We'll begin exploring Mesquite by using it to simulate (rather than analyze) data. Then, in part B, you will use Mesquite to estimate ancestral states.

Installing Mesquite

Download Mesquite from the Mesquite project download site. Under MacOSX, open the mesquite.dmg file and then drag the Mesquite Folder therein to your Applications folder. To start Mesquite, find Mesquite.app in Applications/Mesquite Folder and double-click it.

Part A: Simulating data

You will use Mesquite today to simulate data matrices with known properties, then use PAUP* to estimate parameters from the simulated data. Why would we want to simulate data? There are a couple of reasons:

  • This exercise will give you a feel for how much data is necessary for attaining a certain degree of precision in parameter estimates. Some parameters (e.g. kappa) require much more data (i.e. more sites) to pin down than other parameters (e.g. base frequencies).
  • Parametric bootstrapping and character mapping, two techniques we will examine later this semester, both require simulating data, so simulation can be important in testing complex hypotheses.

Mesquite requires you to read in a tree with branch lengths and define a substitution model before creating a simulated data matrix. The following tutorial will take you through these steps.

Start Mesquite

The various modules that compose Mesquite are indicated by icons as they are loaded.

Create a tree file

Create a file containing a tree that will serve as the "true tree" for purposes of creating a simulated data set. Copy the following and paste into a new file to create a NEXUS tree file.

#nexus

begin taxa;
 title fake;
 dimensions ntax=5;
 taxlabels A B C D E;
end;

begin trees;
  title truth;
  link taxa = fake;
  tree truetree = (A:0.5,(B:0.05,C:0.05):0.05,(D:0.5,E:0.05):0.05);
 end;  

This file has a couple of nexus commands we have not seen before. The title command simply provides a name to a NEXUS block. The link command is used by Mesquite to connect the information provided in different blocks in various ways. The link and title commands were invented for use by Mesquite, and are not part of the standard set of NEXUS commands, so it is good to be aware that using them may make your NEXUS files unreadable by other programs that read NEXUS files.

Open the file

Read this file into Mesquite using the File > Open File... menu command. You should see a new window appear with the words Taxa "fake" in the title bar.

View the tree

To show the tree that was in this file, use the menu command Taxa&Trees > New tree window. You will be presented a list of choices for a source of trees. You want to see the tree that was in the tree file we just opened, so choose the first item (Stored Trees) and press the Ok button.

Make branch lengths proportional to values in tree file

The tree that appears is not drawn so that branches are proportional to the lengths we supplied. To fix this, choose Branches Proportional to Lengths from the Drawing menu.

Curvograms and other tree forms

Choose Drawing > Tree Form > Curvogram from the menu. Feel free to play with other tree forms.

Line width

Choose Drawing > Line width... and change the width to 2, then press the Ok button.

Define the model that will be used for simulating data

Now you are ready to define a model. Let's create a GTR+I+G model and simulate a data set containing 2,000 sites from it. The first step is to choose Characters > New Character Model > Composite DNA Simulation Model... from the menu above the tree window. Create a name for your new model (e.g. "test model"), then press Ok.

Creating the model

The Edit model dialog box will appear. The different facets (Mesquite calls them submodels) of the model are placed in sections of the dialog box set off by horizontal lines. The first section has to do with base frequencies. If you do nothing, Mesquite will use equal base frequencies both for the root node and the transition probabilities. If you like, you can choose empirical base frequencies for either of these, or you can create your own base frequencies by selecting User-Specified Nucleotide Frequencies... under the drop down list entitled New (there is only one choice in this list, but you must select it to proceed with creating a different base frequency submodel).

You are free to try anything you want here, but note that this tutorial will assume you left both the root states and equilibrium states set to Equal Frequencies.

Specifying rate heterogeneity

The character rates section has to do with among-site rate heterogeneity. Let's put rate heterogeneity into our simulated data using a gamma distribution (but not invariable sites). To do this, we must create a new submodel by selecting Gamma Rates Model... from the drop down list entitled New. Create a name for the new submodel you are creating (e.g. "ASRV submodel"), and then press Ok.

This will bring up the Gamma Rates Model dialog box. Enter a value for the gamma shape parameter (e.g. 0.1) and uncheck the discrete gamma checkbox. By unchecking this checkbox, we are telling Mesquite that we want it to draw a gamma-distributed relative rate separately for each site. Each site will have a different relative rate. If we leave the discrete gamma checkbox checked, Mesquite would draw one of the 4 representative relative rates for each site. In this case, approximately one fourth of the sites would evolve according to the representative rate for the first rate category, one fourth of the sites would evolve according to the representative rate for the second category, and so on. In analyzing data, we always (for practical reasons) approximate the gamma distribution by using the discrete gamma approach, but when simulating data you have more flexibility.

Specifying the rate matrix

Finally, we come to the rate matrix model. Let's use the GTR model for our simulated data. In the drop down list entitled New, choose GTR Rate Matrix Model..., then choose a name (e.g. "GTR submodel") and press the Ok button. You are now presented with the GTR Rate Matrix Model dialog box. Enter a relative rate for each of the 5 substitution classes possible. Mesquite forces the last relative rate (G to T) to be 1, so the other rates you specify will be relative to the GT transversion. I entered, just for fun, 6 for AC, 5 for AG, 4 for AT, 3 for CG and 2 for CT.

Leave the scaling factor set to its default value of 1.0, then press the Ok button to close the Edit model dialog box.

Make it so

You have now specified a tree with branch lengths and a substitution model, so you are ready to simulate. Choose Characters > Make new matrix from > Simulated matrices on Current Tree from the menu attached to the tree window. This will bring up a dialog box asking which kind of model you wish to use. We are simulating DNA data, so choose Evolve DNA Characters and press the Ok button.

Now you are asked to choose a specific DNA character model. You should see the one you defined listed (I called it "test model", but you may have used a different name). Choose the model you defined and press Ok to continue.

Press Ok to confirm use of the current tree for simulating data.

Enter 2000 for the number of characters to simulate, then press Ok.

Enter a name for your simulated data matrix (e.g. "simulated data"), then press Ok.

After a delay, you should see your simulated data matrix appear in a new window. To save this dataset to a nexus file, choose File > Save File As... in the menu attached to your data matrix window. Choose a file name and click Ok.

Analyzing simulated data in PAUP*

Important: do not close Mesquite - you will be simulating more data in a minute, and if you close Mesquite at this point, you will need to start over again from scratch!

Open the file you just saved in PAUP*, set up the GTR+G model using the lset command below

set criterion=likelihood;
lset nst=6 basefreq=equal rates=gamma shape=estimate rmatrix=estimate;

and then use lscores 1 to ask PAUP* to estimate the model parameters.

  • How close was PAUP* able to get to the values you entered for the (among-site rate heterogeneity) shape parameter, or the relative rates of the GTR model?

Save the values reported by the lscores command in a file for later reference. You can copy the values from PAUP*'s output by choosing Edit > Edit Display Buffer from PAUP*'s main menu.

Generating more data

Repeat the exercise, this time creating 10,000 sites rather than only 2,000. This would give PAUP* 5 times more data to work with when estimating parameters. Note that you do not have to start over from scratch: you have already set up the model and tree in Mesquite, you simply need to instructe it to generate more data.

Choose Characters > Make new matrix from > Simulated matrices on Current Tree from the menu attached to the tree window. This will once again bring up a dialog box asking which kind of model you wish to use. Choose Evolve DNA Characters and press the Ok button.

Now you are asked to choose a specific DNA character model. Once again, choose the model you defined and press Ok to continue.

Press Ok to confirm use of the current tree for simulating data.

This time, enter 10000 for the number of characters to simulate, then press Ok.

Enter a different name this time for your simulated data matrix (e.g. "more simulated data"), then press Ok.

After a delay, you should see your simulated data matrix appear in a new window.

Before saving this dataset to a nexus file, it is important to delete the previous data set, which Mesquite still remembers. The problem is that when you save data to a file, Mesquite saves all data it knows about to the file. If you chose File > Save File As... right now, you would end up with a file that contains two data matrices: something that PAUP* will not like. To tell Mesquite to forget about the first (2000-site) data matrix, follow these directions:

  • Find the main Mesquite window. It says Mesquite in large letters at the top, says Mesquite Projects and Files in the title bar, and has a hierarchical diagram the root of which is labeled Current Projects.
  • Now find a brown box labeled something like Character matrix: simulated data
  • Right-click this box, and choose Delete from the popup menu that appears.
  • Click Yes to confirm deletion

Now you can save the file by choosing File > Save File As... from the menu. Follow the directions in the section Analyzing simulated data in PAUP* for analyzing this data set in PAUP*.

  • How do the estimates compare to the true values this time?
  • Which parameters are hardest to nail down? Relative substitution rates or the gamma shape parameter?

Part B: Estimating ancestral states

This week's homework assignment will involve estimating ancestral states. Today you will use Mesquite to do this same excercise. Thus, you will know at least part of the answers even before you start doing the homework!

Create a data set

Create a new nexus data file containing this text:

#NEXUS

begin taxa;
  title fake;
  dimensions ntax=4;
  taxlabels A B C D;
end;

begin trees; 
  title truth;
  link fake;
  tree hw9 = [&U] (A:0.2,B:0.2,(C:0.1,D:0.1):0.1);
end;

begin characters;
  dimensions nchar=2;
  format datatype=standard;
  matrix
    A 01
    B 00
    C 11
    D 11
  ;
end;

Start Mesquite

Exit Mesquite, if you haven't already done so, by closing all windows. Upon closing the last window, you will have to push the Force quit button to get it to actually quit. It is probably not necessary to stop and restart Mesquite, but Mesquite is a complex program, and I tend to start with a clean slate rather than risk having Mesquite remember settings that I have long forgotten about!

Now start it up again and open this newly-created file using the File > Open... menu command. You should see the data matrix appear in a Character Matrix window.

Set up the model

Choose Characters > New character model > Mk1 Model (Markov 1 parameter) from the menu at the top of the Character Matrix window. Choose a name for your model (e.g. "my model").

An Edit model dialog box will appear. Enter 1.0 in the box provided and press the Enter key. The 1 here means that branch lengths will be interpreted in the usual way, as the expected number of substitutions. Mesquite does things a little differently than PAUP*:

  • it displays trees as rooted rather than unrooted
  • it treats branch lengths as representing times only
  • it stores the substitution rate as part of the model rather than getting substitution rates from branch lengths

Mesquite can correctly equate branch lengths with expected number of substitutions if we specify the substitution rate to equal 1.

Set up the tree window

Go back to the Character Matrix window and choose Taxa&Trees > New Tree Window. In the dialog box that appears, choose Stored Trees and press the Ok button.

From the menu at the top of the Tree Window that appears, choose Drawing > Tree Form > Balls & Sticks, and Drawing > Branches Proportional to Lengths.

Estimating ancestral states using likelihood

From the Analysis menu, choose Trace Character History and choose Stored Characters, Likelihood Ancestral States, Stored Probability Model, and "my model" in the succession of dialog boxes that appear.

From the Trace menu, choose Report Likelihoods As > Raw likelihoods, then hover your mouse cursor over the node at the base of the tree. The value associated with state 0 should be shown as 0.02874, and that for state 1 should be 0.01023. Adding these two values together, you get 0.03897, which is the total likelihood for this character (the log-likelihood is ln(0.03897) = -3.24495). The pie diagram at the node thus shows the proportion of the total likelihood associated with each possible state. The state that contributes most to the likelihood is the estimated ancestral state.

The winning threshhold

Technically, state 0 is the winner here, but Mesquite does not consider there to be a clear cut choice in this case. If Mesquite did consider there to be a clear winner, it would adorn that state with an asterisk to indicate that it is a significantly better choice of ancestral state. Mesquite uses a cutoff to determine whether one state is significantly more likely than another state.

The cutoff is applied to the log-likelihood rather than the likelihood. Switch to showing the (negative) log-likelihoods by choosing Trace > Report Likelihoods As > Negative Log Likelihoods. Now state 0 is seen to have a log-likelihood equal to -3.54934 and state 1 has a log-likelihood equal to -4.58274. The difference between these is 1.0334, which is less than the cutoff (2.0) assumed by Mesquite. You can change the cutoff using Trace > Likelihood Decision Threshhold...

Estimating ancestral states using parsimony

You can see what parsimony would say in this case by choosing Analysis > Trace Character History > Stored Characters > Parsimony Ancestral States > Current Parsimony Models. This will create a new Trace Character pane, which will probably be created right on top of your existing one. Use your mouse to move it off the likelihood Trace Character pane, so you can see the two side by side. Note that, unlike likelihood, parsimony has no qualms about choosing 0 as the ancestral state of the basal node for character 1.