Phylogenetics: Introduction to PAUP*

From EEBedia
Revision as of 03:01, 19 January 2014 by Paul Lewis (Talk | contribs)

Jump to: navigation, search
Adiantum.png EEB 5349: Phylogenetics
The goal of this lab exercise is to introduce you to some basic features of PAUP*. You will use PAUP* to create a NEXUS-formatted data file from a set of sequences, then perform analyses of these data using parsimony, likelihood and distance optimality criteria.

Using PAUP to create a NEXUS data file

First, download the file angio35.txt to your hard drive and then upload it to the cluster (instructions in Phylogenetics: Bioinformatics Cluster).

Now login to the cluster ( and type paup to start the PAUP* program.

Important! Ordinarily, you should not run PAUP* directly like this. Only use this method for extremely short-lived activities. To run an analysis on the cluster, you should use the Sun Grid Engine's qsub program to submit a job. Using qsub starts your run on one of the computing nodes (whichever is free at the moment), while typing paup starts PAUP* on the head node, which is the node that everyone logs into to start runs. Bogging down the head node with a long PAUP* run is the quickest way to lose your cluster privileges! That said, what we are doing today will not be computationally demanding and thus should not attract the attention of Jeff Lary (if it does, I will take the blame).

Now type in the following (PAUP) command:

tonexus from=angio35.txt to=angio35.nex datatype=nucleotide format=text;

After the conversion, the file angio35.nex should be present. Type quit to quit PAUP*, then open this Nexus file in the pico editor to see what PAUP* did to convert the original file to Nexus format. (The most important thing PAUP* did was to count the number of nucleotides and set nchar for you.)

Automatically excluding sites using a default exset

Create an assumptions block containing a default exclusion set that excludes the following sites automatically whenever the data file is executed. This should be added to the bottom of the newly-created Nexus file (i.e., after the data). You can use the pico editor for this.

begin assumptions;
  exset * unused = 1-41 234-241 246 506-511 555 681-689 1393-1399 1797-1855 1856-1884 4754-4811;

These numbers represent nucleotide sites that are either missing a lot of data or are difficult to align. The name I gave to this exclusion set is unused, but you could name it anything you like. The asterisk tells PAUP* that you want this exset to be applied automatically every time the file is executed.

Defining sets of sites

Create a sets block comprising the following three charset commands:

  • The first charset should be named 18S and include sites 1 through 1855
  • The second charset should be named rbcL and include sites 1856 through 3283
  • The third charset should be named atpB and include sites 3284 through 4811

This block should be placed after the assumptions block. Look at the description of the sets block and try to do this part on your own.

Now execute the data file:

exe angio35.nex

If your assumptions block is correct, the output should include a statement saying that 219 characters have been excluded. If you set up your sets block correctly you should now be able to enter this command:

exclude all;
include rbcL;

and get no errors. In addition, PAUP* should tell you that 4592 characters were excluded (as a result of the exclude all command) and 1428 were re-included (as a result of the include rbcL command). For the rest of the exercise, we will be working with the data from all 3 genes, so re-include the 18S and atpB data:

include 18S atpB;

PAUP* should now say that there are a total of 4811 included characters.

Setting up a log file

The first item of business in starting an analysis in PAUP* is to begin logging the output to a file. The following command will begin saving all output to the file output.txt. Note that we have chosen to automatically replace the file if it already exists. If you are nervous about this (and would rather have PAUP* ask before overwriting an existing file), either leave off the replace keyword or substitute append, which tells PAUP* to simply add new output to the end of the file if it already exists.

log file=output.txt start replace;

Getting help

Type set ? to get a listing of the general settings. PAUP* has four "settings" commands: set for general settings; pset for settings specifically related to parsimony; lset for settings specifically related to likelihood; and dset for settings specifically related to distance methods.

From the output of the set command, can you determine which optimality criterion PAUP* would use if we were to do a search at this point?

Performing a parsimony search

To perform a parsimony search, first try the alltrees command. This command asks PAUP* to calculate the optimality criterion for every possible tree:

What was PAUP*'s response to this request?

Now try heuristic searching. This approach does not attempt to look at all possible trees, but instead only examines trees that are in the realm of possibility (which can still be a lot of trees!):

What is the parsimony score of the best tree found during the search? (Write down this score somewhere for later reference.) How many trees were examined (look at # Rearrangements tried)?

Showing and saving trees

Now you probably want to take a look at the tree that PAUP* found and is now holding in memory. First, however, choose an outgroup taxon so that the (unrooted) tree will be drawn in a way that looks like it is rooted in a reasonable place, say between the gymnosperms (first 7 taxa) and angiosperms (remaining taxa):

outgroup 1-7;

To make the tree appear to flow downward, which is more pleasing to the eye, tell PAUP* that you would like to use the tree order "right" (this is also commonly known as "ladderizing right"):

set torder=right;

Before doing anything else, we should save this tree in a file so that it will be available later, perhaps for viewing or printing in TreeView. Let's call the treefile pars.tre. The brlens keyword in the command below tells PAUP* that you want to save the branch lengths as well as just the tree topology (almost always a good option to include):

savetrees file=pars.tre brlens;

Scoring trees using the likelihood criterion

You may have noticed that PAUP* found 5 most-parsimonious trees. These 5 trees are all indistinguishable using the parsimony criterion. Let's now use the likelihood criterion to evaluate these 5 trees:

set criterion=likelihood;
lscores all;

These commands ask PAUP* to simply evaluate the likelihood score of the trees in memory. Note that because we arrived at these trees using parsimony, it is quite possible that none of these trees represents the maximum likelihood tree. That is, we may be able to find better trees under the likelihood criterion if we performed a search using the likelihood criterion.

What is the likelihood score of the best tree? (As for parsimony, write this number down for later comparison.) Is the likelihood score the same for all 5 trees? Which tree is best?

Important: PAUP* reports the negative of the natural logarithm of the likelihood score. This means that smaller numbers are better, as smaller numbers represent higher likelihoods. Why does PAUP* do this? For the sake of consistency: in PAUP*, the minimum score is always the best score. You should always report log-likelihoods as negative numbers, however.

Comparing NJ to parsimony and likelihood

Next, we will obtain a neighbor-joining tree. Neighbor-joining (NJ for short) is one of the algorithmic methods: that is, it uses an optimality criterion (the minimum evolution criterion) at each step of the algorithm, but in the end produces an estimate of the phylogeny without actually examining many trees:


Let's see how the NJ tree compares to the tree found by parsimony. First, use the lscores command to compute the log-likelihood of the NJ tree:

lscores all;

Now compute the parsimony score of the NJ tree using the pscores command:

pscores all;
According to the parsimony criterion, is the NJ tree better than any of the trees found by parsimony? According to the likelihood criterion, is the NJ tree better than the best tree you have found thus far? Is it possible to say definitively whether the NJ tree is better or worse (according to the likelihood criterion) than the maximum likelihood tree?

Closing down

That's all for today. The only thing left to do is to close the log file you opened and quit PAUP*:

log stop;

You can use Ctrl-d (or type exit) to quit your session on the cluster.

Sets block

The only commands you need to know at this point from a sets block are the charset and the taxset commands.

begin sets;
  charset trnL_intron = 562-4226;
  taxset gnetales = Ephedra Gnetum Welwitschia;

This sets block defines both a set of characters (in this case the sites composing the trnL intron) and a set of taxa (consisting of the three genera in the seed plant order Gnetales: Ephedra, Gnetum and Welwitschia). We could have used the taxon numbers for the taxset definition (e.g., taxset gnetales = 1-3;) but using the actual names is clearer and less prone to error (just think of what might happen if you decided to reorder your sequences!). These definitions may be used in other blocks. A common use is in commands placed inside a paup block (see below) or typed directly at the PAUP* command prompt.