EEB 349: Phylogenetics

Lectures: MW 11-12:15 (CUE 320)
Lab: M 1-3 (TLS 477)
Instructor: Paul O. Lewis
Lab Instructor: Maxi Polihronakis

Lab 1: Introduction to PAUP* and the NEXUS data file format

Introduction to PAUP* 4.0

PAUP* 4.0 is the successor to PAUP 3.1, which was published in 1993 by David L. Swofford, currently at the School of Computational Science & Information Technology, Florida State University. The name PAUP stands for Phylogenetic Analysis Using Parsimony, because parsimony was the only optimality criterion employed at the time. The asterisk in the name PAUP* means "and other methods." PAUP* is one of the most comprehensive phylogenetic analysis computer programs available, and we will spend much of the first half of the semester learning how to use this program.

PAUP* Home Page

The PAUP* Home Page is the best place to go for up-to-date information about program availability, known problems/workarounds, and help in the form of a FAQ and electronic forum. As of this writing, PAUP* is being sold by Sinauer Associates (price varies according to platform). While it is not a free program, you really do get a lot for your money compared to most other commercial software, as the next section is designed to illustrate.

What can PAUP* do?

PAUP* is capable of performing most of the types of phylogenetic analyses you have already performed using other programs (e.g., Puzzle), as well as many more. The following listing is not exhaustive, but is designed to give you an idea of what PAUP* can currently do:

What can PAUP* not do?

Despite its completeness, there are a few things that PAUP* cannot do for you at the present time:

Typographical conventions

In this and subsequent web pages, I will try to stick to the following typographical conventions:

PAUP* tips

PAUP* is not finished at this point. For the most part, this is not a problem since you can purchase and use it just like a finished product. The primary drawback of PAUP*'s unfinished status is that there is currently not a complete manual for the program. On the PAUP* Download Page you can find a PDF command summary and "Quick Start" tutorial; however, much of the explanatory portion of the manual is not present in any form. There are easy ways to obtain information from the program itself, however. Some of the tips listed below are concerned with getting the program to tell you what commands and command options are available.

Here are some tips to keep in mind while you use PAUP*. This list is not comprehensive; these are just some things that are not immediately apparent but which make your life easier once you know about them.

The Nexus Data File Format

Nexus blocks

PAUP* uses a data file format known as Nexus. This file format is now shared among several programs. Nexus data files always begin with the characters #nexus but are otherwise organized into major units known as blocks. Some blocks are recognized by most of the programs using the Nexus file format, whereas other blocks are private blocks (recognized by only one program). A Nexus block has the following basic structure:
#nexus
...
begin characters;
...
end;
Note that the ellipsis (...) is never used in a Nexus data file; it is used here simply to indicate that some text has been omitted. The name of the Nexus block used as an example above is characters. Because Nexus data files are organized in named blocks, PAUP* and other programs are able to read blocks whose names they recognize and ignore blocks that are not recognized. This allows many different programs to use the same overall format without crashing when they encounter data they cannot interpret.
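
As a rough illustration of how a program can read the blocks it knows and skip the rest, here is a minimal Python sketch. It is not how PAUP* itself is implemented: the function name nexus_block_names and the RECOGNIZED set are made up for this example, and nested comments are not handled.

```python
import re

# Hypothetical set of block names that our imaginary program understands.
RECOGNIZED = {"taxa", "characters", "data", "trees"}

def nexus_block_names(text):
    """Return the block names found in a NEXUS string, in file order."""
    if not text.lstrip().lower().startswith("#nexus"):
        raise ValueError("a NEXUS file must begin with #nexus")
    # Remove [bracketed comments] so commented-out text is not scanned.
    # (Simplification: nested comments are not handled.)
    text = re.sub(r"\[[^\]]*\]", "", text)
    return [m.group(1).lower()
            for m in re.finditer(r"\bbegin\s+(\w+)\s*;", text, re.IGNORECASE)]

example = """#nexus
begin characters;
end;
begin mystery;
end;
"""

names = nexus_block_names(example)
to_read = [n for n in names if n in RECOGNIZED]
to_skip = [n for n in names if n not in RECOGNIZED]
print("read:", to_read)
print("skip:", to_skip)
```

The block named mystery is simply passed over, which is the behavior that lets several programs share one file.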

Nexus commands

Blocks are in turn organized into semicolon-terminated commands. It is very important that you remember to terminate all commands with a semicolon. This is especially hard to remember for very long commands. PAUP* is pretty good about pointing out forgotten semicolons, but sometimes it doesn't realize you've left something out until some distance downstream, which can make the problem point difficult to find. Some common commands will be provided below in the description of the common blocks.

Nexus comments

Comments can be placed in a Nexus file using square brackets. Comments can be placed anywhere, and they are used for many purposes. For example, you can effectively remove some of your data by commenting it out. You can also annotate your sequences using comments. For example, a comment like that below is useful for locating specific sites in your alignment:

            [----+--10|----+--20|----+--30|----+--40|----+--50|----]
Ephedra      TTAAGCCATGCATGTCTAAGTATGAACTAATT-CAAACGGTGAAACTGCGGATG
Gnetum       TTAAGCCATGCATGTCTATGTACGAACTAATC-AGAACGGTGAAACTGCGGATG
Welwitschia  TTAAGCCATGCACGTGTAAGTATGAACTAGTC-GAAACGGTGAAACTGCGGATG
If you would like your comment printed out in the output when PAUP* executes the data file, just insert an exclamation point (!) as the first character inside the opening left square bracket:

[!This is the data file used for my dissertation]
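
If your alignment is longer than the 54 sites shown in the ruler comment earlier in this section, typing the ruler by hand gets tedious. The following Python sketch generates one of any length. The function name site_ruler is made up for this example, and it assumes fewer than 10,000 sites so that each number fits in its ten-column slot.

```python
def site_ruler(nchar):
    """Build a bracketed ruler comment with a tick every 5 sites and a label every 10."""
    pieces = []
    for tens in range(10, nchar + 1, 10):
        # Each group is ten columns wide: "----+", the padded number, then "|".
        pieces.append("----+" + str(tens).rjust(4, "-") + "|")
    # Partial group for whatever is left after the last multiple of ten.
    pieces.append("----+----"[: nchar % 10])
    return "[" + "".join(pieces) + "]"

print(site_ruler(54))
```

Because the result is enclosed in square brackets, it can be pasted above or below the matrix rows without affecting the data.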

Commonly-used Nexus blocks

Here is a list of common Nexus blocks and the most common commands within these blocks. For a complete description of the Nexus file format, take a look at this paper:

Maddison, David R., Swofford, David L. and Maddison, Wayne P. 1997. NEXUS: an extensible file format for systematic information. Systematic Biology 46: 590-621

Taxa block

The purpose of a Taxa block is to provide names for your taxa (i.e., sequences). You may not use a Taxa block very often, since you can also supply names for your taxa directly in the Data block (see below). Here is an example of a Taxa block.
#nexus
...
begin taxa;
  dimensions ntax=5;
  taxlabels 
    Giardia
    Thermus
    Deinococcus
    Sulfolobus
    Halobacterium
  ;
end;
Note that there are four commands in this example of a Taxa block. Can you find the terminating semicolon for each of them?
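
One way to check your semicolons outside of PAUP* is to split the text of a block at each semicolon and look at the pieces. The Python sketch below does this for the Taxa block above; the function name split_commands is made up for this example, and it is only a rough check, not a full Nexus parser.

```python
import re

def split_commands(block_text):
    """Split NEXUS text into semicolon-terminated commands.

    Returns (commands, leftover), where leftover is any text that follows
    the last semicolon and is therefore not a terminated command.
    """
    # Drop [bracketed comments] first (nested comments are not handled).
    text = re.sub(r"\[[^\]]*\]", "", block_text)
    parts = text.split(";")
    commands = [" ".join(p.split()) for p in parts[:-1]]
    leftover = " ".join(parts[-1].split())
    return commands, leftover

taxa_block = """
begin taxa;
  dimensions ntax=5;
  taxlabels
    Giardia
    Thermus
    Deinococcus
    Sulfolobus
    Halobacterium
  ;
end;
"""

commands, leftover = split_commands(taxa_block)
for number, command in enumerate(commands, start=1):
    print(number, command)

# With the semicolon after ntax=5 forgotten, two commands run together.
broken, _ = split_commands("begin taxa; dimensions ntax=5 taxlabels A B; end;")
print(broken)
```

In the broken version the dimensions and taxlabels commands merge into one, which is why a forgotten semicolon is often reported some distance downstream from where it actually belongs.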

Data block

The Data block is the workhorse of Nexus blocks. This is where you place the actual sequence data, and, as mentioned above, this can also be where you define the names of your sequences. Here is an example of a Data block:
#nexus
...
begin data;
  dimensions ntax=5 nchar=54;
  format datatype=dna missing=? gap=-;
  matrix
    Ephedra       TTAAGCCATGCATGTCTAAGTATGAACTAATTCCAAACGGTGAAACTGCGGATG
    Gnetum        TTAAGCCATGCATGTCTATGTACGAACTAATC-AGAACGGTGAAACTGCGGATG
    Welwitschia   TTAAGCCATGCACGTGTAAGTATGAACTAGTC-GAAACGGTGAAACTGCGGATG
    Ginkgo        TTAAGCCATGCATGTGTAAGTATGAACTCTTTACAGACTGTGAAACTGCGAATG
    Pinus         TTAAGCCATGCATGTCTAAGTATGAACTAATTGCAGACTGTGAAACTGCGGATG
                 [----+--10|----+--20|----+--30|----+--40|----+--50|----]
  ;
end;

Some things to note in this example are:

Trees block

A Trees block has the following structure:
#nexus
...
begin trees;
  translate
    1 Ephedra,
    2 Gnetum,
    3 Welwitschia,
    4 Ginkgo,
    5 Pinus
  ;
  tree one = [&U] (1,2,(3,(4,5)));
  tree two = [&U] (1,3,(5,(2,4)));
end;

Some things to note in this example are:

Sets block

The only commands you need to know at this point from a sets block are the charset and the taxset commands.
#nexus
...
begin sets;
  charset trnL_intron = 562-4226;
  taxset gnetales = Ephedra Gnetum Welwitschia;
end;

This sets block defines both a set of characters (in this case the sites composing the trnL intron) and a set of taxa (consisting of the three genera in the seed plant order Gnetales: Ephedra, Gnetum and Welwitschia). We could have used the taxon numbers for the taxset definition (e.g., taxset gnetales = 1-3;) but using the actual names is clearer and less prone to error (just think of what might happen if you decided to reorder your sequences!). These definitions may be used in other blocks. A common use is in commands placed inside a paup block (see below) or typed directly at the PAUP* command prompt.
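
To see why names are safer than numbers in a taxset definition, consider this small Python sketch. It only mimics the idea of looking taxa up by position; the function name taxa_by_number and the reordered list are made up for this example.

```python
GNETALES = {"Ephedra", "Gnetum", "Welwitschia"}

def taxa_by_number(matrix_order, numbers):
    """Map 1-based taxon numbers to names using the order of the data matrix."""
    return {matrix_order[n - 1] for n in numbers}

original = ["Ephedra", "Gnetum", "Welwitschia", "Ginkgo", "Pinus"]
# Hypothetical reordering: the two conifers moved to the top of the matrix.
reordered = ["Ginkgo", "Pinus", "Ephedra", "Gnetum", "Welwitschia"]

print(taxa_by_number(original, [1, 2, 3]) == GNETALES)
print(taxa_by_number(reordered, [1, 2, 3]) == GNETALES)
```

After the reordering, taxa 1-3 no longer correspond to the Gnetales, whereas a definition written with the names would be unaffected.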

Assumptions block

There is only one command I will introduce from the assumptions block (although there are a number of others that exist). The exset command (the word exset stands for exclusion set) is useful for creating a set of characters that are automatically excluded whenever the data file is executed. Given the following block:
#nexus
...
begin assumptions;
  exset* badsites = 1 5 47-.;
end;

PAUP* would automatically exclude characters (i.e., sites) 1, 5, and 47 through the end of the sequence. It is the asterisk after the command name exset that denotes this as the default exclusion set. If you left out the asterisk, PAUP* would define the exclusion set but would not automatically exclude these sites as the data file was being executed.
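
The character list in the exset above mixes single sites, ranges, and the period that stands for the last site. The following Python sketch expands such a list so you can see exactly which sites it covers. It handles only these three forms, the function name expand_sites is made up for this example, and the value nchar=54 is borrowed from the Data block example rather than stated for this exset.

```python
def expand_sites(spec, nchar):
    """Expand a list such as '1 5 47-.' into sorted site numbers.

    A period after the hyphen means 'through the last site' (nchar).
    """
    sites = set()
    for token in spec.split():
        first, _, last = token.partition("-")
        start = int(first)
        if last == ".":
            end = nchar
        else:
            end = int(last or first)  # a single site is a range of length one
        sites.update(range(start, end + 1))
    return sorted(sites)

excluded = expand_sites("1 5 47-.", nchar=54)
print(excluded)
print(len(excluded), "sites excluded")
```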

Paup block

Paup blocks provide a way to give PAUP* commands from within a data file itself. Any command you can type at the command prompt or perform using menu commands you can place in the data file. This allows you to specify an entire analysis right in the data file. For any serious analysis, I always run PAUP* using a paup block. That way I know exactly what I did for a given analysis several days or weeks in the future. Paup blocks are also a handy way to perform certain commands every time the data file is executed. For example, you can set up your favorite likelihood substitution model, delete certain taxa or exclude certain sites from a paup block located just after your data block. Here is an example of a typical paup block:
#nexus
...
begin paup;
  log file=myoutput.txt start replace;
  outgroup Ephedra;
  set criterion=likelihood;
  lset nst=2 basefreq=empir rates=equal tratio=estim variant=hky;
  hsearch swap=tbr addseq=random nreps=100 start=stepwise; 
  describe 1 / plot=phylogram;
  savetrees file=mytrees.tre brlens;
  log stop;
  quit;  
end;

Here is what each line does (but don't worry too much about this since we will be talking much more about individual commands later in lab):

Note that because PAUP* ignores blocks whose names it does not recognize, you can easily "comment out" a paup block by simply adding a character to its name. For example, adding an underscore

#nexus
...
begin _paup;
.
.
.
end;

is enough to cause PAUP* to completely ignore this paup block. This is handy because it allows you to create multiple paup blocks for different purposes and turn them off and on whenever you need them.

You can also "comment out" a portion of a paup block using the leave command. For example, in this paup block, PAUP* will be set up for doing a likelihood analysis but will not actually conduct the search; the leave command causes PAUP* to exit the block early:

#nexus
...
begin paup;
  log file=myoutput.txt start replace;
  outgroup Ephedra;
  set criterion=likelihood;
  lset nst=2 basefreq=empir rates=equal tratio=estim variant=hky; 
  leave;
  hsearch swap=tbr addseq=random nreps=100 start=stepwise; 
  describe 1 / plot=phylogram;
  savetrees file=mytrees.tre brlens;
  log stop;
  quit;  
end;

Today's lab exercise

First, a note about characters blocks versus data blocks: the characters block is essentially a new and improved version of the data block. Feel free to use either one, but be aware that programs such as PAUP* may eventually stop using the data block since the characters block accomplishes the same thing and has features missing in the data block. To convert a data block to a characters block, just change the block name and add the keyword newtaxa to the dimensions command just before the keyword ntax. This tells PAUP* that you will be defining the names of your taxa in the characters block itself (rather than in a preceding taxa block).
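
The two-step conversion just described can be sketched in Python as a pair of text substitutions. This is only an illustration under simplifying assumptions: the file contains a single data block and no taxa block, and the function name data_to_characters is made up for this example.

```python
import re

def data_to_characters(nexus_text):
    """Rename a data block to a characters block and add the newtaxa keyword."""
    # Step 1: change the block name.
    text = re.sub(r"\bbegin\s+data\s*;", "begin characters;",
                  nexus_text, flags=re.IGNORECASE)
    # Step 2: insert newtaxa just before ntax in the dimensions command.
    text = re.sub(r"\bdimensions\s+(?=ntax\b)", "dimensions newtaxa ",
                  text, flags=re.IGNORECASE)
    return text

example = """#nexus
begin data;
  dimensions ntax=5 nchar=54;
  format datatype=dna missing=? gap=-;
end;
"""

converted = data_to_characters(example)
print(converted)
```

In practice it is just as easy to make these two changes by hand in an editor; the sketch simply spells out what changes and what stays the same.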

Questions that should be answered (or exercises that you should do on your own) appear in this style. There is no need to turn in your answers to these exercises. It is up to you to make sure you are comfortable with this material. Please ask questions if anything is unclear. While it is possible to do these exercises outside of the scheduled lab time, working through them in lab is better because we are here to help with questions that arise.

  1. First create a folder with a name that is unique (e.g., base it on your name). Everyone is using the same account, so it is important to do everything within your own folder so that you do not interfere with others!
  2. Copy the angio35.txt file from the data folder into your own newly-created folder. (If you are not in the computer lab, you can download the file by right-clicking here and using your browser's Save Target As... menu option)
  3. Start PAUP* but be careful to not execute the angio35.txt file (it is not yet in Nexus format). Do open the file in edit mode (using File > Open... and clicking on the Edit radio button before selecting the name of the file to open) and note that it comprises 35 DNA sequences. These are rbcL gene sequences from various green plants. The important thing to notice is that the format is quite simple: each line consists of a taxon name followed by at least one blank space, which is followed by the sequence for that taxon. Note that the blank space is important: taxon names cannot contain embedded spaces, because spaces are used to separate taxon names from the corresponding sequences.
  4. Now type in the following command:
    tonexus from=angio35.txt to=angio35.nex datatype=nucleotide format=text;

    After the conversion, the file angio35.nex should be present. Open this Nexus file in edit mode to see what PAUP* did to convert the original file to Nexus format. Do not execute the file just yet because there are some additions we need to make before it is ready for analyzing.

  5. Create an assumptions block containing a default exclusion set that excludes the following sites automatically whenever the data file is executed. This should be added to the bottom of the newly-created Nexus file (i.e., after the data). This may be most easily done using PAUP*'s built-in editor, although you may use any editor you choose (just remember to save the file as plain text).
    begin assumptions;
      exset * unused = 1-41 234-241 246 506-511 555 681-689 1393-1399 1797-1855 1856-1884 4754-4811;
    end;

    These numbers represent nucleotide sites that either are missing a lot of data or are difficult to align. The name I gave to this exclusion set is unused, but you could name it anything you like. The asterisk tells PAUP* that you want this exset applied (i.e. you want these sites excluded) every time the file is executed.

  6. Create a sets block comprising the following three charset commands:
    • The first charset should be named 18S and include sites 1 through 1855
    • The second charset should be named rbcL and include sites 1856 through 3283
    • The third charset should be named atpB and include sites 3284 through 4811

    This block should be placed after the assumptions block. Look at the description above of the sets block and try to do this part on your own.

  7. Now execute the data file. Use File | Open... from the main menu to execute your new angio35.nex file. If your assumptions block is correct, the output should include a statement saying that 219 characters have been excluded. If you set up your sets block correctly, you should be able to enter these commands:
    exclude all;
    include rbcL;

    and get no errors. In addition, PAUP* should tell you that 4592 characters were excluded (as a result of the exclude all command) and 1428 were re-included (as a result of the include rbcL command). For the rest of the exercise, we will be working with the data from all 3 genes, so re-include the 18S and atpB data:

    include 18S atpB;

    PAUP* should now say that there are a total of 4811 included characters.

  8. The first item of business in starting an analysis in PAUP* is to begin logging the output to a file. The following command will begin saving all output to the file output.txt. Note that we have chosen to automatically replace the file if it already exists. If you are nervous about this (and would rather have PAUP* ask before overwriting an existing file), either leave off the replace keyword or substitute append, which tells PAUP* to simply add new output to the end of the file if it already exists.
    log file=output.txt start replace;
  9. Type set ? to get a listing of the general settings. PAUP* has four "settings" commands: set for general settings; pset for settings specifically related to parsimony; lset for settings specifically related to likelihood; and dset for settings specifically related to distance methods.

    From the output of the set command, can you determine which optimality criterion PAUP* would use if we were to do a search at this point?

  10. To perform a parsimony search, first try the alltrees command. This command asks PAUP* to calculate the optimality criterion for every possible tree:
    alltrees;

    Did PAUP* allow you to perform an exhaustive search for 35 taxa?

  11. Now try heuristic searching. This approach does not attempt to look at all possible trees, but instead only examines trees that are in the realm of possibility (which can still be a lot of trees!):
    hsearch;

    The search progress will be displayed in a dialog box. When the button says Close rather than Stop, take a look at the numbers summarizing this search. What is the parsimony score of the best tree found during the search? (Write down this score somewhere for later reference.) How many trees were examined (look at # Rearrangements tried)?

  12. Now you probably want to take a look at the tree that PAUP* found and is now holding in memory. First, however, choose an outgroup taxon so that the (unrooted) tree will be drawn in a way that looks like it is rooted in a reasonable place, say between the gymnosperms (first 7 taxa) and angiosperms (remaining taxa):
    outgroup 1-7;
    showtree;

    To make the tree appear to flow downward, which is more pleasing to the eye, tell PAUP* that you would like to use the tree order "right" (this is also commonly known as "ladderizing right"):

    set torder=right;
    showtree;

    Before doing anything else, we should save this tree in a file so that it will be available later, perhaps for viewing or printing in TreeView. Let's call the treefile pars.tre. The brlens keyword in the command below tells PAUP* that you want to save the branch lengths as well as just the tree topology (almost always a good option to include):

    savetrees file=pars.tre brlens;
  13. You may have noticed that PAUP* found 5 most-parsimonious trees. These 5 trees are all indistinguishable using the parsimony criterion. Let's now use the likelihood criterion to evaluate these 5 trees:
    set criterion=likelihood;
    lscores all;

    These commands ask PAUP* to simply evaluate the likelihood score of the trees in memory. Note that because we arrived at these trees using parsimony, it is quite possible that none of these trees represents the maximum likelihood tree. That is, we may be able to find better trees under the likelihood criterion if we performed a search using the likelihood criterion. What is the likelihood score of the best tree? (As for parsimony, write this number down for later comparison.) Is the likelihood score the same for all 5 trees? Which tree is best? Important: PAUP* reports the negative of the natural logarithm of the likelihood score. This means that smaller numbers are better, as smaller numbers represent higher likelihoods.

  14. Next, we will obtain a neighbor-joining tree. Neighbor-joining (NJ for short) is one of the algorithmic methods: that is, it uses an optimality criterion (the minimum evolution criterion) at each step of the algorithm, but in the end produces a tree without actually examining many trees:
    nj;
  15. Let's see how the NJ tree compares to the tree found by parsimony. First, use the lscores command to compute the log-likelihood of the NJ tree:
    lscores all;
    Now compute the parsimony score of the NJ tree using the pscores command:
    pscores all;
    

    According to the parsimony criterion, is the NJ tree better than any of the trees found by parsimony? According to the likelihood criterion, is the NJ tree better than the best tree you have found thus far? Why is it not possible to say definitively whether the NJ tree is better (according to the likelihood criterion) than the maximum likelihood tree?

  16. You may have noticed that PAUP* does not let you copy text from the output window. It will, however, make a copy of the text currently displayed in the output window and put this in an editor window. Choose Edit | Edit Display Buffer from the main menu. You can now cut/copy/paste text from this window to other applications.
  17. That's all for today. The only thing left to do is to close the log file you opened and quit PAUP*:
    log stop;
    quit;
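
Several of the numbers quoted in the exercise above can be checked independently of PAUP*. The Python sketch below recomputes the character counts from steps 5 through 7 and counts the unrooted, fully bifurcating trees that an exhaustive search on 35 taxa would have to score, using the standard formula (2n-5)!! = 1 x 3 x 5 x ... x (2n-5) for n taxa. The variable and function names are made up for this example, and math.prod requires Python 3.8 or later.

```python
from math import prod

NCHAR = 4811

# Ranges copied from the exset in step 5 (inclusive; they do not overlap).
EXCLUDED = [(1, 41), (234, 241), (246, 246), (506, 511), (555, 555),
            (681, 689), (1393, 1399), (1797, 1855), (1856, 1884),
            (4754, 4811)]

# Gene boundaries from step 6.
GENES = {"18S": (1, 1855), "rbcL": (1856, 3283), "atpB": (3284, 4811)}

def span(first, last):
    """Number of sites in an inclusive range."""
    return last - first + 1

n_excluded = sum(span(a, b) for a, b in EXCLUDED)
print("excluded by the exset:", n_excluded)
print("still included, so newly excluded by 'exclude all':", NCHAR - n_excluded)
print("re-included by 'include rbcL':", span(*GENES["rbcL"]))

def unrooted_tree_count(ntax):
    """Number of unrooted, fully bifurcating trees for ntax >= 3 taxa: (2n-5)!!"""
    return prod(range(3, 2 * ntax - 4, 2))

for ntax in (4, 5, 10, 35):
    print(ntax, "taxa:", unrooted_tree_count(ntax), "trees")
```

The count for 35 taxa exceeds 10^45, which gives a sense of why heuristic searching is used for a data set of this size.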