Phylogenetics: Simulating sequence data

From EEBedia
Revision as of 14:13, 22 February 2018 by Paul Lewis (Talk | contribs) (Goals)

EEB 5349: Phylogenetics

by Kevin Keegan


The goal of this lab is to gain experience simulating DNA sequence data using PAUP*. Simulated data are useful for testing null hypotheses of interest (parametric bootstrapping), for testing the robustness of models to violations of their assumptions, and for testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is still available (and still as useful as it always was).


The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill it into a few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is to simulate DNA sequence data on a known phylogeny and see how the model/algorithm performs. If the model/algorithm recovers the known or "true" phylogeny, then we can be reasonably confident that it is relatively accurate in its distillation of the complexity of the processes it attempts to capture.

Getting Started

We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the help command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the help command to refresh your memory.

Exploring the NEXUS File

Create an empty text file, add the following lines to it, and save it with a .nex extension:

#NEXUS
[This example demonstrates the dreaded Felsenstein Zone]
begin paup;
    cd *;
    set storebrlens nostatus autoclose=yes warntree=no notifybeep=no;
end;
begin taxa;
    dimensions ntax=4;
    taxlabels A B C D;
end;
begin trees;
    tree 1 = [&R] ((A:1.0,B:1.0):1.0,(C:1.0,D:1.0):1.0);
end;
begin dnasim;
    simdata nchar=(10 100 1000 10000);
    lset model=jc nst=1 basefreq=eq;
    sitemodels jc:1;
    truetree source=memory treenum=1 showtruetree=brlens;
    beginsim nreps=100 seed=0 monitor=y resultsfile=(name=sim4results.txt replace output=means);
        [parsimony]
        set criterion=parsimony;
        tally parsimony;
        [likelihood under JC]
        set criterion=likelihood;
        lset basefreq=equal nst=1;
        tally 'ML-JC';
    set monitor=y;
end;

The trees block contains the description of the true tree that we will use to simulate data.

  • What are the names of the taxa in the tree?
  • What's the total distance between taxa A and B?
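
The distance between two taxa is the sum of the branch lengths on the path that connects them. As a way to check your reading of the tree description, here is a minimal Python sketch that computes this path length from a Newick string. The function names `parse_newick` and `patristic_distance` are made up for this illustration and have nothing to do with PAUP*; the tree string is the one from the trees block, without the `[&R]` comment.

```python
import re

def parse_newick(newick):
    """Return (parent, length) dicts describing a Newick tree.

    Tips are keyed by their label, internal nodes by an integer id.
    """
    tokens = re.findall(r"[(),;]|:[^(),;]+|[^(),:;\s]+", newick)
    parent, length = {}, {}
    stack = []      # internal nodes that are still open
    last = None     # most recently completed node (gets the next ':length')
    next_id = 0
    for tok in tokens:
        if tok == "(":
            node = next_id
            next_id += 1
            parent[node] = stack[-1] if stack else None
            length[node] = 0.0
            stack.append(node)
        elif tok == ")":
            last = stack.pop()
        elif tok.startswith(":"):
            length[last] = float(tok[1:])
        elif tok not in (",", ";"):
            parent[tok] = stack[-1]
            length[tok] = 0.0
            last = tok
    return parent, length

def patristic_distance(parent, length, a, b):
    """Sum the branch lengths on the path between tips a and b."""
    # Record the distance from tip a to each of its ancestors.
    dist_from_a = {}
    node, d = a, 0.0
    while node is not None:
        dist_from_a[node] = d
        d += length[node]
        node = parent[node]
    # Climb from tip b until reaching an ancestor shared with a.
    node, d = b, 0.0
    while node not in dist_from_a:
        d += length[node]
        node = parent[node]
    return d + dist_from_a[node]

parent, length = parse_newick("((A:1.0,B:1.0):1,(C:1.0,D:1.0):1.0);")
print("A-B distance:", patristic_distance(parent, length, "A", "B"))
```

Changing the last two arguments gives the distance for any other pair of taxa, which is handy once you start editing branch lengths later in the lab.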

Notice the dnasim block in the NEXUS file. This is where the cutting-edge features are. We are going to use a known phylogeny (tree 1) and simulate nucleotide data matrices of four different lengths from it.

  • How many sites are in each of the character matrices?

This NEXUS file tells PAUP* to simulate nucleotide data under the Jukes-Cantor model. For each of the four sequence lengths, PAUP* simulates 100 replicate data sets (nreps=100) and infers a tree from each replicate twice: once using parsimony as the optimality criterion and once using the maximum likelihood criterion.

  • Do you expect both of these optimality criteria to produce the same tree?
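
To make the simulation step concrete, here is a minimal Python sketch of how sequences can be evolved down the true tree under the Jukes-Cantor model. It is not PAUP*'s implementation. The nested-list tree encoding and the function names are assumptions made for this illustration; the branch lengths are those of tree 1.

```python
import math
import random

BASES = "ACGT"

# True tree ((A:1.0,B:1.0):1.0,(C:1.0,D:1.0):1.0), hand-encoded (an assumption
# of this sketch): an internal node is a list of (child, branch_length) pairs,
# and a tip is its label.
TREE = [([("A", 1.0), ("B", 1.0)], 1.0), ([("C", 1.0), ("D", 1.0)], 1.0)]

def jc_evolve(base, t, rng):
    """Evolve one base along a branch of length t (expected substitutions per site)."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    if rng.random() < p_same:
        return base
    # Under JC the three other bases are equally likely.
    return rng.choice([b for b in BASES if b != base])

def simulate_site(node, state, rng, out):
    """Push a state from a node down to all of its descendants."""
    for child, t in node:
        child_state = jc_evolve(state, t, rng)
        if isinstance(child, str):
            out[child] = child_state
        else:
            simulate_site(child, child_state, rng, out)

def simulate_alignment(tree, nchar, seed=1):
    """Simulate nchar independent sites; return {taxon: sequence}."""
    rng = random.Random(seed)
    columns = {}
    for _ in range(nchar):
        site = {}
        root_state = rng.choice(BASES)  # equal base frequencies at the root
        simulate_site(tree, root_state, rng, site)
        for taxon, base in site.items():
            columns.setdefault(taxon, []).append(base)
    return {taxon: "".join(bases) for taxon, bases in columns.items()}

for taxon, seq in simulate_alignment(TREE, 40).items():
    print(taxon, seq)
```

Each call with a different seed plays the role of one simulation replicate; PAUP* then hands every replicate to the parsimony and likelihood analyses.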

Execute the NEXUS File

Log on to the cluster, start a qlogin session, and load the current module of PAUP*. If you forget how to load modules, notice the message that was printed right after you logged on to the cluster. Start PAUP*, and execute your NEXUS file. A torrent of simulation output will scroll up your screen too fast for you to comprehend. Good thing you told PAUP* to save the results to the tab-delimited file sim4results.txt.

Exit PAUP*. It will ask if you want to quit despite there being an unsaved tree in memory. Answer yes. The tree in memory is just the last simulated tree. Open the results file in a spreadsheet viewer to see the results.

The results file has three columns for each inference method and a row for each of the simulated matrix sizes. The parsimony_correct and ML-JC_correct columns show the proportion of trees inferred under each optimality criterion that match the true phylogeny (remember that 100 trees were inferred under each criterion for each matrix size).

  • Did both of these optimality criteria produce the same tree?
  • Which optimality criterion converged on the true tree most quickly?
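
If you would rather inspect the results from a script than from a spreadsheet, the sketch below pulls out the columns whose names end in `_correct`. It assumes the file is tab-delimited with a single header row, as described above; the exact layout PAUP* writes may differ, and the function name is made up for this example.

```python
import csv

def read_correct_columns(handle):
    """Return (column_names, rows), keeping only the '*_correct' columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    keep = [name for name in reader.fieldnames if name.endswith("_correct")]
    rows = [{name: float(row[name]) for name in keep} for row in reader]
    return keep, rows
```

A possible use, once sim4results.txt exists in the current directory:

```python
with open("sim4results.txt", newline="") as handle:
    names, rows = read_correct_columns(handle)
for row in rows:
    print(row)
```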

Submitted for Your Approval: The Felsenstein Zone

As you've learned in lecture, the maximum likelihood and parsimony criteria may produce different trees from the same nucleotide data. Edit the branch lengths of the true tree, without changing the topological relationships among taxa, so that parsimony infers taxa A and D to be sister to each other from the simulated nucleotide data. Save this file as paupsimFZ.nex.

Execute paupsimFZ.nex.

Open the new results in a spreadsheet viewer.

  • Did both of these optimality criteria produce the same tree?
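
One way to reason about which branch lengths put a tree in the Felsenstein Zone is to compute, rather than simulate, how often each parsimony-informative site pattern is expected to occur. The Python sketch below does this under the Jukes-Cantor model for the true tree ((A,B),(C,D)). It is an illustration only, not something PAUP* runs, and the function names are made up. Because JC is time-reversible, the two branches that meet at the root are treated as a single internal branch whose length is their sum.

```python
import itertools
import math

def jc_prob(i, j, t):
    """JC probability of ending in state j after time t, starting in state i."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def informative_pattern_probs(tA, tB, tC, tD, t_internal):
    """Expected frequency of the site patterns supporting each four-taxon grouping.

    tA..tD are the terminal branch lengths; t_internal separates (A,B) from (C,D).
    """
    support = {"AB|CD": 0.0, "AC|BD": 0.0, "AD|BC": 0.0}
    states = range(4)
    for a, b, c, d in itertools.product(states, repeat=4):
        # Informative patterns for four taxa: two states, each shared by two taxa.
        if a == b and c == d and a != c:
            key = "AB|CD"
        elif a == c and b == d and a != b:
            key = "AC|BD"
        elif a == d and b == c and a != b:
            key = "AD|BC"
        else:
            continue
        # Sum over the unobserved states x and y at the two internal nodes.
        for x, y in itertools.product(states, repeat=2):
            support[key] += (0.25
                             * jc_prob(x, a, tA) * jc_prob(x, b, tB)
                             * jc_prob(x, y, t_internal)
                             * jc_prob(y, c, tC) * jc_prob(y, d, tD))
    return support

# Branch lengths of the original true tree (internal branch = 1.0 + 1.0).
print(informative_pattern_probs(1.0, 1.0, 1.0, 1.0, 2.0))
```

With enough sites, parsimony favors whichever grouping has the largest expected frequency, so passing in the branch lengths of your edited tree shows whether AD|BC overtakes AB|CD before you run the simulation.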

Have Some Fun

Time Flies

- Lots of time has passed.
- Make all of the branch lengths very long and look at the simulated nucleotide data.

Will this Lab Ever End?

- Little time has passed.
- Make all of the branch lengths tiny and look at the simulated nucleotide data.


- Lots of evolution has occurred for some taxa but not all.
- Make the branch lengths very long for half of the taxa and look at the simulated nucleotide data.
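
When looking at the data from these experiments, it helps to know what to expect. Under the Jukes-Cantor model, the expected proportion of sites that differ between two sequences separated by a total path length d is 3/4 (1 - exp(-4d/3)). This approaches 0 for tiny branch lengths and levels off at 3/4 for very long ones. The short Python sketch below evaluates that formula for a few path lengths, chosen arbitrarily for illustration.

```python
import math

def jc_expected_p_distance(d):
    """Expected proportion of differing sites between two sequences at JC distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

for d in (0.01, 0.1, 1.0, 10.0):
    p = jc_expected_p_distance(d)
    print(f"path length {d:>5}: expected proportion of differing sites = {p:.3f}")
```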