Phylogenetics: NEXUS Format

From EEBedia
Latest revision as of 18:59, 30 March 2014

EEB 5349: Phylogenetics
The goal here is to explain the most important features of the NEXUS file format commonly used in phylogenetics. There are no lab exercises here, just information. Use this as a reference.


The Nexus Data File Format

Nexus blocks

PAUP* uses a data file format known as Nexus. This file format is now shared among several programs. Nexus data files always begin with the characters #nexus but are otherwise organized into major units known as blocks. Some blocks are recognized by most of the programs using the Nexus file format, whereas other blocks are private blocks (recognized by only one program). A Nexus block has the following basic structure:

#nexus
...
begin characters;
...
end;

Note that the ellipsis (...) is never used in a Nexus data file; it is used here simply to indicate that some text has been omitted. The name of the Nexus block used as an example above is characters. Because Nexus data files are organized in named blocks, PAUP* and other programs are able to read blocks whose names they recognize and ignore blocks that are not recognized. This allows many different programs to use the same overall format without crashing when they encounter data they cannot interpret.
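To make the "read what you recognize, ignore the rest" idea concrete, here is a minimal Python sketch. It is not how PAUP* is implemented; the function names split_blocks and select_blocks are made up for illustration, and the sketch assumes comments have already been removed and that no taxon name or quoted token contains the word "end".

```python
import re

def split_blocks(text):
    """Return a list of (block_name, body) pairs found in NEXUS text.

    Assumption: comments were stripped beforehand and the word 'end'
    does not occur inside a block except in its closing 'end;' command.
    """
    if not text.lstrip().lower().startswith("#nexus"):
        raise ValueError("file does not start with #nexus")
    pattern = re.compile(r"\bbegin\s+(\w+)\s*;(.*?)\bend\s*;",
                         re.IGNORECASE | re.DOTALL)
    return [(m.group(1).lower(), m.group(2).strip())
            for m in pattern.finditer(text)]

def select_blocks(text, recognized):
    """Keep only the blocks whose names a (hypothetical) program knows."""
    return [(name, body) for name, body in split_blocks(text)
            if name in recognized]

example = """#nexus
begin taxa;
  dimensions ntax=2;
  taxlabels A B;
end;
begin _paup;
  quit;
end;
"""

# Every block in the file, in order of appearance.
print([name for name, _ in split_blocks(example)])
# Only the blocks this imaginary program understands; '_paup' is skipped.
print([name for name, _ in select_blocks(example, {"taxa", "data", "paup"})])
```

A program written this way never has to understand the contents of a block it does not care about, which is the property that lets several programs share one file.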

Nexus commands

Blocks are in turn organized into semicolon-terminated commands. It is very important that you remember to terminate all commands with a semicolon. This is especially hard to remember for very long commands. PAUP* is pretty good about pointing out forgotten semicolons, but sometimes it doesn't realize you've left something out until some distance downstream, which can make the problem point difficult to find. Some common commands will be provided below in the description of the common blocks.
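A forgotten semicolon is easier to understand once you see how a reader might split a block into commands. The following Python sketch is only an illustration (split_commands is a hypothetical helper, not part of PAUP*). It assumes that semicolons inside square-bracket comments or single-quoted names do not end a command, and that comments may be nested.

```python
def split_commands(block_body):
    """Split the text between 'begin name;' and 'end;' into commands.

    Returns (commands, leftover). 'leftover' is any trailing text that
    was never terminated by a semicolon, i.e. a likely forgotten ';'.
    """
    commands, current = [], []
    depth = 0          # nesting depth of [ ... ] comments
    in_quote = False   # True while inside a 'single-quoted' token
    for ch in block_body:
        if in_quote:
            in_quote = ch != "'"
        elif depth > 0:
            if ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
        elif ch == "'":
            in_quote = True
        elif ch == "[":
            depth += 1
        elif ch == ";":
            # A semicolon outside comments and quotes ends the command.
            commands.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    return commands, "".join(current).strip()

body = "dimensions ntax=5;\n  taxlabels A B [a; comment] 'C;D'\n"
commands, leftover = split_commands(body)
print(commands)   # the properly terminated commands
print(leftover)   # text still waiting for its semicolon
```

Because the unterminated text simply runs on into whatever follows, a real program may not notice the problem until much later in the file, which is why the reported error location can be far from the actual mistake.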

Nexus comments

Comments can be placed in a Nexus file using square brackets. Comments can be placed anywhere, and they are used for many purposes. For example, you can effectively remove some of your data by commenting it out. You can also annotate your sequences using comments. For example, a comment like that below is useful for locating specific sites in your alignment:

[            ----+--10|----+--20|----+--30|----+--40|----+--50|----]
Ephedra      TTAAGCCATGCATGTCTAAGTATGAACTAATT-CAAACGGTGAAACTGCGGATG
Gnetum       TTAAGCCATGCATGTCTATGTACGAACTAATC-AGAACGGTGAAACTGCGGATG
Welwitschia  TTAAGCCATGCACGTGTAAGTATGAACTAGTC-GAAACGGTGAAACTGCGGATG

If you would like your comment printed out in the output when PAUP* executes the data file, just insert an exclamation point (!) as the first character inside the opening left square bracket:

[!This is the data file used for my dissertation]
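As a rough illustration of how comments could be handled, the Python sketch below removes bracketed comments and collects the ones that begin with an exclamation point. The function name strip_comments is hypothetical, quoted taxon names containing brackets are not handled, and special comments such as those used in tree descriptions would be discarded too, so treat this as a sketch rather than a faithful reader.

```python
def strip_comments(text):
    """Remove [bracketed] comments; return (clean_text, printed_comments).

    Comments whose first character is '!' are collected separately,
    imitating the way such comments are echoed to the output.
    """
    clean, printed, comment = [], [], []
    depth = 0  # how many '[' are currently open
    for ch in text:
        if ch == "[":
            depth += 1
            if depth == 1:
                comment = []      # start collecting a new comment
                continue
        elif ch == "]" and depth > 0:
            depth -= 1
            if depth == 0:
                body = "".join(comment)
                if body.startswith("!"):
                    printed.append(body[1:])
                continue
        if depth > 0:
            comment.append(ch)
        else:
            clean.append(ch)
    return "".join(clean), printed

text = "[!This is the data file]\nEphedra [ruler] TTAA"
clean, printed = strip_comments(text)
print(printed)
print(repr(clean))
```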

Commonly-used Nexus blocks

Here is a list of common Nexus blocks and the most-common commands within these blocks. For a complete description of the Nexus file format, take a look at this paper:

Maddison, David R., Swofford, David L. and Maddison, Wayne P. 1997. NEXUS: an extensible file format for systematic information. Systematic Biology 46: 590-621

Taxa block

The purpose of a taxa block is to provide names for your taxa (i.e., sequences). You may not use a taxa block very often, since you can also supply names for your taxa directly in the data block (see below). Here is an example of a taxa block.

#nexus
...
begin taxa;
  dimensions ntax=5;
  taxlabels 
    Giardia
    Thermus
    Deinococcus
    Sulfolobus
    Halobacterium
  ;
end;

Note that there are four commands in this example of a taxa block. Can you find the terminating semicolon for each of them?

  • the begin command giving the block's name
  • the dimensions command giving the number of taxa
  • the taxlabels command providing the actual taxon labels
  • the end command, telling PAUP* that there are no more commands to process for this block

Data block

The data block is the workhorse of Nexus blocks. This is where you place the actual sequence data, and, as mentioned above, this can also be where you define the names of your sequences. Here is an example of a data block:

#nexus
...
begin data;
  dimensions ntax=5 nchar=54;
  format datatype=dna missing=? gap=-;
  matrix
    Ephedra       TTAAGCCATGCATGTCTAAGTATGAACTAATTCCAAACGGTGAAACTGCGGATG
    Gnetum        TTAAGCCATGCATGTCTATGTACGAACTAATC-AGAACGGTGAAACTGCGGATG
    Welwitschia   TTAAGCCATGCACGTGTAAGTATGAACTAGTC-GAAACGGTGAAACTGCGGATG
    Ginkgo        TTAAGCCATGCATGTGTAAGTATGAACTCTTTACAGACTGTGAAACTGCGAATG
    Pinus         TTAAGCCATGCATGTCTAAGTATGAACTAATTGCAGACTGTGAAACTGCGGATG
                 [----+--10|----+--20|----+--30|----+--40|----+--50|----]
  ;
end;

Some things to note in this example are:

  • The dimensions command comes first in a data block, and specifies the number of sequences (taxa; ntax) and number of sites (characters; nchar).
  • The format command tells PAUP* what kind of data follow (dna, rna, protein, or standard), and provides the symbols used for missing data (?) and gaps (-).
  • The matrix command dominates the data block, providing the sequences themselves (as well as the taxon names). Note the semicolon terminating the matrix command!!!
  • You can use upper or lower case symbols for nucleotides
  • You can place whitespace anywhere except inside a taxon name or keyword (e.g., data type = dna would cause problems because datatype should not have embedded whitespace).
  • If you simply must have a space in one of your taxon names, either use an underscore character in place of the space (e.g., Ginkgo_biloba) or surround the taxon name in single quotes (e.g., 'Ginkgo biloba'). In either case, PAUP* will show the name with a space in its output.
  • One item missing from the format command in the example above but which is quite useful is something known as an equate list. The following format statement will cause all occurrences of T to be changed to C and all occurrences of G to be changed to A as the data are being read into PAUP*:
format datatype=dna missing=? gap=- equate="T=C G=A";

This is like telling PAUP* to do a search-and-replace operation on the sequences before reading them in, except that your original file remains intact. Be careful when using equate, because the replacement is case sensitive (i.e., equate="t=c g=a" would have no effect if all the nucleotides were represented by upper case letters!).

  • PAUP* recognizes all the standard ambiguity codes (e.g., R for purine, Y for pyrimidine, N for undetermined, etc.).
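A mismatch between the dimensions command and the matrix is a common source of errors, so it can help to check a file before running it. The Python sketch below is one hypothetical way to do that (check_matrix is an invented name). It assumes a non-interleaved matrix with one taxon per line, comments already removed, and no quoted names containing spaces. The sequences are shortened versions of the ones above, used purely for illustration.

```python
def check_matrix(matrix_text, ntax, nchar):
    """Compare a matrix against the ntax and nchar values.

    Each non-empty line is expected to look like 'name sequence'; the
    sequence may be broken up by whitespace. Returns a list of problems
    (an empty list means the matrix agrees with the dimensions).
    """
    rows = [line.split() for line in matrix_text.splitlines() if line.strip()]
    problems = []
    if len(rows) != ntax:
        problems.append(f"expected {ntax} taxa, found {len(rows)}")
    for name, *chunks in rows:
        length = len("".join(chunks))
        if length != nchar:
            problems.append(f"{name}: expected {nchar} sites, found {length}")
    return problems

matrix = """
  Ephedra      TTAAGCCA
  Gnetum       TTAAGCC-
  Welwitschia  TTAAGC
"""
# The third sequence is two sites short of nchar=8.
print(check_matrix(matrix, ntax=3, nchar=8))
```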

Trees block

A trees block has the following structure:

#nexus
...
begin trees;
  translate
    1 Ephedra,
    2 Gnetum,
    3 Welwitschia,
    4 Ginkgo,
    5 Pinus
  ;
  tree one = [&U] (1,2,(3,(4,5)));
  tree two = [&U] (1,3,(5,(2,4)));
end;

Some things to note in this example are:

  • The translate command provides short alternatives to the taxon names, making the tree descriptions shorter (so the file takes up fewer bytes of disk space).
  • The translate command is optional, however; it is fine to use the taxon names directly in the tree descriptions.
  • The tree command denotes the start of a tree description, which consists of a tree name (e.g., one and two are used here), followed by an equals sign and then the tree topology in the standard parenthetical notation (often referred to as the Newick or New Hampshire format).
  • The special comment [&U], an ampersand followed by the letter U inside square brackets, tells PAUP* to interpret the tree as unrooted.
  • Files containing only the #nexus line plus a trees block are called tree files.
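Two things are easy to get wrong in a tree description: the parentheses must balance, and the numbers must map back to the right taxa. The Python sketch below illustrates both checks. The helper names parens_balanced and apply_translate are invented for this example, and the translation step assumes that taxon labels appear directly after an opening parenthesis or a comma with no whitespace in between.

```python
import re

def parens_balanced(newick):
    """True if every '(' has a matching ')' and none closes too early."""
    depth = 0
    for ch in newick:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def apply_translate(newick, table):
    """Replace translate-table keys by taxon names in a tree description.

    Only tokens that directly follow '(' or ',' are treated as labels,
    so branch lengths written after ':' are left untouched.
    """
    return re.sub(r"(?<=[(,])([^(),:;\s]+)",
                  lambda m: table.get(m.group(1), m.group(1)),
                  newick)

table = {"1": "Ephedra", "2": "Gnetum", "3": "Welwitschia",
         "4": "Ginkgo", "5": "Pinus"}

print(parens_balanced("(1,2,(3,(4,5)));"))   # balanced description
print(parens_balanced("(1,2,(3,(4,5);"))     # deliberately broken example
print(apply_translate("(1,2,(3,(4,5)));", table))
```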

Sets block

The only commands you need to know at this point from a sets block are the charset and the taxset commands.

#nexus
...
begin sets;
  charset trnL_intron = 562-4226;
  taxset gnetales = Ephedra Gnetum Welwitschia;
end;

This sets block defines both a set of characters (in this case the sites composing the trnL intron) and a set of taxa (consisting of the three genera in the seed plant order Gnetales: Ephedra, Gnetum and Welwitschia). We could have used the taxon numbers for the taxset definition (e.g., taxset gnetales = 1-3;) but using the actual names is clearer and less prone to error (just think of what might happen if you decided to reorder your sequences!). These definitions may be used in other blocks. A common use is in commands placed inside a paup block (see below) or typed directly at the PAUP* command prompt.

Assumptions block

There is only one command I will introduce from the assumptions block (although there are a number of others that exist). The exset command (the word exset stands for "exclusion set") is useful for creating a set of characters that are automatically excluded whenever the data file is executed. Given the following block:

#nexus
...
begin assumptions;
  exset* badsites = 1 5 47-.;
end;

PAUP* would automatically exclude characters (i.e., sites) 1, 5, and 47 through the end of the sequence. It is the asterisk after the command name exset that denotes this as the default exclusion set. If you left out the asterisk, PAUP* would define the exclusion set but would not automatically exclude these sites as the data file was being executed.
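The shorthand used in set definitions (single sites, ranges, and a period for the last site) can be spelled out with a few lines of Python. This is only a sketch: expand_charset is a hypothetical helper, it handles nothing beyond single sites and simple a-b ranges written without spaces around the hyphen, and the value nchar=50 is made up for the example.

```python
def expand_charset(spec, nchar):
    """Expand a set description such as '1 5 47-.' into site numbers.

    A '.' stands for the last character (nchar). Only single sites and
    simple 'a-b' ranges are supported in this sketch.
    """
    sites = []
    for token in spec.replace(";", " ").split():
        start, _, stop = token.partition("-")
        first = nchar if start == "." else int(start)
        if stop == "":
            last = first                  # a single site, e.g. '5'
        elif stop == ".":
            last = nchar                  # range running to the end
        else:
            last = int(stop)
        sites.extend(range(first, last + 1))
    return sites

# With 50 sites, '47-.' means sites 47 through 50.
print(expand_charset("1 5 47-.", nchar=50))
```

The same expansion applies to charset definitions in a sets block, since both commands describe sets of characters.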

Paup block

Paup blocks provide a way to give PAUP* commands from within a data file itself. Any command you can type at the command prompt or perform using menu commands you can place in the data file. This allows you to specify an entire analysis right in the data file. For any serious analysis, I always run PAUP* using a paup block. That way I know exactly what I did for a given analysis several days or weeks in the future. Paup blocks are also a handy way to perform certain commands every time the data file is executed. For example, you can set up your favorite likelihood substitution model, delete certain taxa or exclude certain sites from a paup block located just after your data block. Here is an example of a typical paup block:

#nexus
...
begin paup;
  log file=myoutput.txt start replace;
  outgroup Ephedra;
  set criterion=likelihood;
  lset nst=2 basefreq=empir rates=equal tratio=estim variant=hky;
  hsearch swap=tbr addseq=random nreps=100 start=stepwise; 
  describe 1 / plot=phylogram;
  savetrees file=mytrees.tre brlens;
  log stop;
  quit;  
end;

Here is what each line does:

  • The log command starts a log file (the file will be called myoutput.txt and will be overwritten if it already exists)
  • The outgroup command specifies that the resulting trees should be rooted between Ephedra and everything else (this just affects the appearance of the tree when drawn)
  • The set command changes the optimality criterion from the default (parsimony) to maximum likelihood
  • The lset command sets up PAUP* so that the HKY85 model will be used (number of substitution rates is 2, empirical base frequencies, rates are homogeneous across sites, estimate the transition/transversion ratio, and use the HKY model rather than the other, similar F84 model)
  • The hsearch command causes PAUP* to conduct 100 heuristic search replicates; each replicate starts from a tree built by stepwise addition with the taxa added in a different random order, and this starting tree is then rearranged using the tree bisection/reconnection (TBR) branch swapping method
  • The describe command produces a depiction of the tree (rooted at the specified outgroup) on the output (and in the log file, since we opened a log file earlier); the tree will be shown as a phylogram, which means branch lengths will appear proportional to the average number of nucleotide substitutions per site that were inferred for that branch.
  • The savetrees command saves the best tree found during the search (this is quite important and easy to forget to do!). The brlens keyword tells PAUP* to save branch length information along with the tree topology.
  • The log command stops the logging of output to the file myoutput.txt
  • The quit command causes PAUP* to quit running; if you left out this command, PAUP* would remain running at this point, allowing you to issue other commands

Note that because PAUP* ignores blocks whose names it does not recognize, you can easily "comment out" a paup block by simply adding a character to its name. For example, adding an underscore

#nexus
...
begin _paup;
.
.
.
end;

is enough to cause PAUP* to completely ignore this paup block. This is handy because it allows you to create multiple paup blocks for different purposes and turn them off and on whenever you need them.

You can also "comment out" a portion of a paup block using the leave command. For example, in this paup block, PAUP* will be set up for doing a likelihood analysis but will not actually conduct the search; the leave command causes PAUP* to exit the block early:

#nexus
...
begin paup;
  log file=myoutput.txt start replace;
  outgroup Ephedra;
  set criterion=likelihood;
  lset nst=2 basefreq=empir rates=equal tratio=estim variant=hky; 
  leave;
  hsearch swap=tbr addseq=random nreps=100 start=stepwise; 
  describe 1 / plot=phylogram;
  savetrees file=mytrees.tre brlens;
  log stop;
  quit;  
end;