Phylogenetics: Introduction to PAUP*
|EEB 5349: Phylogenetics|
|The goal of this lab exercise is to introduce you to some basic features of PAUP*. You will use PAUP* to create a NEXUS-formatted data file from a set of sequences, then perform analyses of these data using parsimony, likelihood and distance optimality criteria.|
- 1 Using PAUP to create a NEXUS data file
- 2 Automatically excluding sites using a default exset
- 3 Defining sets of sites
- 4 Setting up a log file
- 5 Getting help
- 6 Performing a parsimony search
- 7 Showing and saving trees
- 8 Scoring trees using the likelihood criterion
- 9 Comparing NJ to parsimony and likelihood
- 10 Closing down
Using PAUP to create a NEXUS data file
Now login to the cluster (ssh firstname.lastname@example.org) and type the following command:
srun --partition=general --qos=general --pty bash
This will find a node that is not currently fully loaded with jobs and allow you to play with programs without bogging down the head node (the node everyone uses to access the cluster). You should always use srun if you are planning to do any computation interactively, as we will do today.
Once the system has moved you to a compute node, type the following commands to start the PAUP* program:
module load paup/4.0a-166 paup
Now type in the following (PAUP) command:
tonexus from=angio35.txt to=angio35.nex datatype=nucleotide format=text;
After the conversion, the file angio35.nex should be present. Type quit to quit PAUP*, then open this Nexus file in the nano editor to see what PAUP* did to convert the original file to Nexus format. (The most important thing PAUP* did was to count the number of nucleotides and set nchar for you.)
Automatically excluding sites using a default exset
Create an assumptions block containing a default exclusion set that excludes the following sites automatically whenever the data file is executed. This should be added to the bottom of the newly-created Nexus file (i.e., after the data). You can use the nano editor for this.
begin assumptions; exset * unused = 1-41 234-241 246 506-511 555 681-689 1393-1399 1797-1855 1856-1884 4754-4811; end;
These numbers represent nucleotide sites that are either missing a lot of data or are difficult to align. The name I gave to this exclusion set is unused, but you could name it anything you like. The asterisk tells PAUP* that you want this exset to be applied automatically every time the file is executed.
Defining sets of sites
Create a sets block comprising the following three charset commands:
- The first charset should be named 18S and include sites 1 through 1855
- The second charset should be named rbcL and include sites 1856 through 3283
- The third charset should be named atpB and include sites 3284 through 4811
This block should be placed after the assumptions block. Look at the description of the sets block and try to do this part on your own.
After saving the file in nano, start paup specifying the data file name on the command line:
If your assumptions block is correct, the output should include a statement saying that 219 characters have been excluded. If you set up your sets block correctly you should now be able to enter this command:
exclude all; include rbcL;
and get no errors. In addition, PAUP* should tell you that 4592 characters were excluded (as a result of the exclude all command) and 1428 were re-included (as a result of the include rbcL command). For the rest of the exercise, we will be working with the data from all 3 genes, so re-include the 18S and atpB data:
include 18S atpB;
PAUP* should now say that there are a total of 4811 included characters.
Setting up a log file
The first item of business in starting an analysis in PAUP* is to begin logging the output to a file. The following command will begin saving all output to the file output.txt. Note that we have chosen to automatically replace the file if it already exists. If you are nervous about this (and would rather have PAUP* ask before overwriting an existing file), either leave off the replace keyword or substitute append, which tells PAUP* to simply add new output to the end of the file if it already exists.
log file=output.txt start replace;
Type set ? to get a listing of the general settings. PAUP* has four "settings" commands: set for general settings; pset for settings specifically related to parsimony; lset for settings specifically related to likelihood; and dset for settings specifically related to distance methods.
Performing a parsimony search
To perform a parsimony search, first try the alltrees command. This command asks PAUP* to calculate the optimality criterion for every possible tree:
Now try heuristic searching. This approach does not attempt to look at all possible trees, but instead only examines trees that are in the realm of possibility (which can still be a lot of trees!):
Showing and saving trees
Now you probably want to take a look at the tree that PAUP* found and is now holding in memory. First, however, choose an outgroup taxon so that the (unrooted) tree will be drawn in a way that looks like it is rooted in a reasonable place, say between the gymnosperms (first 7 taxa) and angiosperms (remaining taxa):
outgroup 1-7; showtree;
To make the tree appear to flow downward, which is more pleasing to the eye, tell PAUP* that you would like to use the tree order "right" (this is also commonly known as "ladderizing right"):
set torder=right; showtree;
Before doing anything else, we should save this tree in a file so that it will be available later, perhaps for viewing or printing in FigTree. Let's call the treefile pars.tre. The brlens keyword in the command below tells PAUP* that you want to save the branch lengths as well as just the tree topology (almost always a good option to include):
savetrees file=pars.tre brlens;
Scoring trees using the likelihood criterion
You may have noticed that PAUP* found 5 most-parsimonious trees. These 5 trees are all indistinguishable using the parsimony criterion. Let's now use the likelihood criterion to evaluate these 5 trees:
set criterion=likelihood; lscores all;
These commands ask PAUP* to simply evaluate the likelihood score of the trees in memory. Note that because we arrived at these trees using parsimony, it is quite possible that none of these trees represents the maximum likelihood tree. That is, we may be able to find better trees under the likelihood criterion if we performed a search using the likelihood criterion.
Important: PAUP* reports the negative of the natural logarithm of the likelihood score. This means that smaller numbers are better, as smaller numbers represent higher likelihoods. Why does PAUP* do this? For the sake of consistency: in PAUP*, the minimum score is always the best score. You should always report log-likelihoods as negative numbers, however.
Comparing NJ to parsimony and likelihood
Next, we will obtain a neighbor-joining tree. Neighbor-joining (NJ for short) is one of the algorithmic methods: that is, it uses an optimality criterion (the minimum evolution criterion) at each step of the algorithm, but in the end produces an estimate of the phylogeny without actually examining many trees:
Let's see how the NJ tree compares to the tree found by parsimony. First, use the lscores command to compute the log-likelihood of the NJ tree:
Now compute the parsimony score of the NJ tree using the pscores command:
That's all for today. The only thing left to do is to close the log file you opened and quit PAUP*:
log stop; quit;
You can use Ctrl-d (or type exit) to quit your session on the cluster.