Difference between revisions of "Phylogenetics: Simulating sequence data"

From EEBedia
Jump to: navigation, search
(Goals)
(Saving Simulated Data)
 
(15 intermediate revisions by 2 users not shown)
Line 6: Line 6:
 
|
 
|
 
|}
 
|}
by Kevin Keegan
+
by Kevin Keegan and Paul Lewis
  
 
== Goals ==
 
== Goals ==
Line 20: Line 20:
 
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory.  
 
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory.  
  
====Exploring the NEXUS File====
+
====Simulation Template====
  
 
Create an empty text file and add the following lines to it and save it as a .nex file:
 
Create an empty text file and add the following lines to it and save it as a .nex file:
 
  #nexus
 
  #nexus
  [This example demonstrates the dreaded Felsenstein Zone]
+
   
 
  begin paup;
 
  begin paup;
    cd *;
+
     set storebrlens nowarntsave;
     set storebrlens nostatus autoclose=yes warntree=no notifybeep=no;
+
 
  end;
 
  end;
 +
 
  begin taxa;
 
  begin taxa;
 
     dimensions ntax=4;
 
     dimensions ntax=4;
 
     taxlabels A B C D;
 
     taxlabels A B C D;
 
  end;
 
  end;
 +
 
  begin trees;
 
  begin trees;
     tree 1 = [&R] ((A:1.0,B:1.0):1,(C:1.0,D:1.0):1.0);
+
     tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);
 
  end;
 
  end;
 +
 
  begin dnasim;
 
  begin dnasim;
     simdata nchar=(10 100 1000 10000);
+
     simdata nchar=10000;
     lset model=jc nst=1 basefreq=eq;
+
     lset nst=1 basefreq=equal rates=equal pinvar=0;
    sitemodels jc:1;
+
 
     truetree source=memory treenum=1 showtruetree=brlens;
 
     truetree source=memory treenum=1 showtruetree=brlens;
     beginsim nreps=100 seed=0 monitor=y resultsfile=(name=sim4results.txt replace output=means);
+
     beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);
 
         [parsimony]
 
         [parsimony]
 
             set criterion=parsimony;
 
             set criterion=parsimony;
Line 48: Line 49:
 
         [likelihood under JC]
 
         [likelihood under JC]
 
             set criterion=likelihood;
 
             set criterion=likelihood;
            lset basefreq=equal nst=1;
 
 
             alltrees;
 
             alltrees;
 
             tally 'ML-JC';
 
             tally 'ML-JC';
     endsim;
+
     endsim;  
    set monitor=y;
+
end;
  end;
+
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data.
+
begin paup;
 +
  quit;
 +
end;
 +
The initial <tt>paup</tt> block tells PAUP* to store branch lengths of any tree it encounters and to not warn us if there are trees in memory when we try to quit.
 +
* ''What is the reason for including <tt>storebrlens</tt>?''
 +
The <tt>taxa</tt> block contains the number of taxa and the names of the taxa.
 +
* ''Why is this taxa block necessary?''
 +
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.
  
* ''What are the names of the taxa in the tree''? {{title|A, B, C, D|answer}}
+
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>, which is best viewed by pasting its contents into an Excel worksheet. You can also view the results directly in your terminal by typing <tt>column -t results.txt | less</tt>
* ''What's the total distance between taxa A and B''? {{title|1.1 (The branch connecting A to their common ancestor is length 1. The branch connecting B to their common ancestor is 0.1|answer}}
+
  
Notice the <tt>dnasim</tt> block in the NEXUS file. This is where the cutting-edge features are. We are going to use a known phylogeny (<tt>tree 1</tt>) and simulate four matrices of nucelotide data based off of it.
+
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):
 +
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)
 +
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)
 +
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)
  
* ''How many sites are in each of the character matrices''? {{title|10, 100, 1000, and 10000 respectively|answer}}
+
====Execute the NEXUS File====
  
This NEXUS file will tell PAUP* simulate nucleotide data using the Jukes-Cantor model and maximum likelihood. Once it does, it will use each of the data matrices to infer 100 phylogenies using parsimony as the optimality criterion, and 100 phylogenies using the maxmimum likelihood criterion.
+
Log on to the cluster, start a <tt>qlogin</tt> session, and load the current module of PAUP*. If you forget how to load modules, notice the message that was printed right after you logged on to the cluster. Start PAUP*, and execute your NEXUS file.  
  
* ''Do you expect both of these optimality criteria to produce the same tree''?
+
Note that PAUP* quits after performing the simulations (because we told it to in that final paup block). Open the <tt>results.txt</tt> file in a spreadsheet viewer to see the results.
  
====Execute the NEXUS File====
+
* ''Did both optimality criteria get the tree correct most of the time?'' {{title|Too easy!|answer}}
 +
 
 +
==Enter the Felsenstein Zone==
 +
 
 +
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.
 +
 
 +
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.
 +
 
 +
* ''Did both of optimality criteria produce the same tree''? {{title|Nope|answer}}
 +
 
 +
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.
 +
 
 +
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' {{title|ML because it infers the true tree more often as the amount of data increases|answer}}
  
Log on to the cluster, start a <tt>qlogin</tt> session, and load the current module of PAUP*. If you forget how to load modules, notice the message that was printed right after you logged on to the cluster. Start PAUP*, and execute your NEXUS file. A torrent of simulation output will scroll up your screen too fast for you to comprehend. Good thing you told PAUP* to save the results to the tab-delimited file <tt>sim4results.txt</tt>. <br />
+
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).
  
Exit PAUP*. It will ask if you want to quit despite there being an unsaved tree in memory. Answer yes. The tree in memory is just the last simulated tree. Open the results file in a spreadsheet viewer to see the results.
+
* ''Is ML statistically consistent when the model is violated in this way? Why?''
  
The results file has three columns for each inference method, and rows for each of simulated data matrices. The parsimony_correct and ML-JC_correct columns show the proportion of trees produced under each optimality criterion that represent the true phylogeny (remember you made 100 trees from each matrix).
+
==Saving Simulated Data==
  
* ''Did both of these optimality criteria to produce the same tree''? {{title|They should have!|answer}}
+
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequence when it displays it. Do this as a parameter of the <tt>export</tt> function. If you add taxa to your tree and taxa block the effects of the the following changes will be more pronounced.
* ''Which optimality criterion converged on the true tree the quickest? {{title|Maxmimum likelihood was the fastest|answer}}
+
  
==Submitted for Your Approval: The Felsenstein Zone==
+
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''
 +
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''
 +
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''
  
As you've learned  in lecture, maximum likelihood and parsimony criteria may produce different trees given the same nucleotide data. Edit the true tree, without changing the topological relationships among taxa, such that the parsimony inference groups taxa A and D as sister to each other based off the simulated nucleotide data. Save this file as <tt>paupsimFZ.nex</tt>.<br />
+
==Strimmer and Rambaut (2002) Study==
  
Execute <tt>paupsimFZ.nex</tt>.
+
Download this paper, which I'll refer to as SR from now on:
  
Open the new results in a spreadsheet viewer.
+
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142.
  
* ''Did both of these optimality criteria to produce the same tree''? {{title|As long as you edited the true tree appropriately, nope! Parsimony recovered an incorrect tree with taxa A and D grouped together due to convergence in sequence data|answer}}
+
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result prompted Shimodaira to create the AU (Approximately Unbiased) test.
  
==Have Some Fun==
+
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)?
  
====Time Flies====
+
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with
-lots of time has passed<br />
+
generatetrees all model=equiprobable;
-have them make all of the branch lengths very long and see the nucleotide data
+
lscores all / shtest RELL;
 +
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.
  
====Will this Lab Ever End?====
+
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''
-little time has passed<br />
+
-have them make all of the branch lengths teeny and see the nucleotide data
+
  
====You've...Changed====
+
Change to the AU test and see if that produces a different result.
-lots of evolution has occurred for some taxa but not all<br/>
+
-make branch lengths very long for half of the taxa and look at the nucleotide data
+

Latest revision as of 19:17, 23 February 2018

Adiantum.png EEB 5349: Phylogenetics

by Kevin Keegan and Paul Lewis

Goals

The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is still available (and still as useful as it always was).

Introduction

The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill them into few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is by simulating DNA sequence data based on a known phylogeny, and seeing how the model/algorithm performs. If the model/algorithm allows for the recovery of the known or "true" phylogeny then we can rest assured that our model/algorithm is relatively accurate in its distillation of the complexity of the processes it attempts to capture.

Getting Started

We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the help command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the help command to refresh your memory.

Simulation Template

Create an empty text file and add the following lines to it and save it as a .nex file:

#nexus

begin paup;
    set storebrlens nowarntsave;
end;

begin taxa;
    dimensions ntax=4;
    taxlabels A B C D;
end;

begin trees;
    tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);
end;

begin dnasim;
    simdata nchar=10000;
    lset nst=1 basefreq=equal rates=equal pinvar=0;
    truetree source=memory treenum=1 showtruetree=brlens;
    beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);
        [parsimony]
            set criterion=parsimony;
            alltrees;
            tally parsimony;
        [likelihood under JC]
            set criterion=likelihood;
            alltrees;
            tally 'ML-JC';
    endsim; 
end;

begin paup;
  quit;
end;

The initial paup block tells PAUP* to store branch lengths of any tree it encounters and to not warn us if there are trees in memory when we try to quit.

  • What is the reason for including storebrlens?

The taxa block contains the number of taxa and the names of the taxa.

  • Why is this taxa block necessary?

The trees block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure [&R] says that this tree is rooted.

The beginsim...endsim loop in the dnasim block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. The tally commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file results.txt, which is best viewed by pasting its contents into an Excel worksheet. You can also view the results directly in your terminal by typing column -t results.txt | less

For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):

  • TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)
  • TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)
  • TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)

Execute the NEXUS File

Log on to the cluster, start a qlogin session, and load the current module of PAUP*. If you forget how to load modules, notice the message that was printed right after you logged on to the cluster. Start PAUP*, and execute your NEXUS file.

Note that PAUP* quits after performing the simulations (because we told it to in that final paup block). Open the results.txt file in a spreadsheet viewer to see the results.

  • Did both optimality criteria get the tree correct most of the time? answer

Enter the Felsenstein Zone

As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named paupsimFZ.nex. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.

Execute paupsimFZ.nex, then open the new results.txt file in a spreadsheet viewer.

  • Did both of optimality criteria produce the same tree? answer

Change the simdata nchar=10000; line to simdata nchar=(10 100 1000 10000); and change output=allreps to output=meansonly. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.

  • Which (parsimony or ML) appears to be statistically consistent? Why? answer

Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).

  • Is ML statistically consistent when the model is violated in this way? Why?

Saving Simulated Data

Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to interleave the sequence when it displays it. Do this as a parameter of the export function. If you add taxa to your tree and taxa block the effects of the the following changes will be more pronounced.

  • Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect
  • Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect
  • Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect

Strimmer and Rambaut (2002) Study

Download this paper, which I'll refer to as SR from now on:

Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142.

SR simulated data on the tree shown in Figure 1 of their paper and expected the SH test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result prompted Shimodaira to create the AU (Approximately Unbiased) test.

Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)?

To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the set criterion=..., alltrees, and tally commands inside the beginsim...endsim loop, replacing these with

generatetrees all model=equiprobable;
lscores all / shtest RELL;

which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say monitor=yes in your beginsim command.

  • How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?

Change to the AU test and see if that produces a different result.