Phylogenetics: NEXUS Format

From EEBedia
Revision as of 18:14, 21 January 2009 by PaulLewis (Talk | contribs)

EEB 349: Phylogenetics
The goal of this lab exercise is to show you how to easily create a NEXUS-formatted data file from a set of sequences. The NEXUS format is widely used in phylogenetics, and its basic features are described in the second part of this tutorial.

Using PAUP to create a NEXUS data file

First, download the file angio35.txt to your hard drive and then upload it to the cluster (instructions in Phylogenetics: Bioinformatics Cluster).

Now log in to the cluster (bbcxsrv1.biotech.uconn.edu) and type paup to start the PAUP* program.

Important! Ordinarily, you should not run PAUP* directly like this. Only use this method for extremely short-lived activities. To run an analysis on the cluster, you should use the Sun Grid Engine's qsub program to submit a job. Using qsub starts your run on one of the computing nodes (whichever is free at the moment), while typing paup starts PAUP* on the head node, which is the node that everyone logs into to start runs. Bogging down the head node with a long PAUP* run is the quickest way to lose your cluster privileges! That said, what we are doing today will not be computationally demanding and thus should not attract the attention of Jeff Lary (if it does, I will take the blame).

Now type the following PAUP* command:

tonexus from=angio35.txt to=angio35.nex datatype=nucleotide format=text;

After the conversion, the file angio35.nex should be present. Type quit to quit PAUP*, then open this NEXUS file in the pico editor to see what PAUP* did to convert the original file to NEXUS format. (The most important thing PAUP* did was to count the number of nucleotides and set nchar for you.)
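
If you would rather check nchar without scrolling through the file, the following is a minimal Python sketch that pulls ntax and nchar out of a dimensions command. The function name read_dimensions and the two-taxon sample text are hypothetical and made up for illustration; the sample is not the contents of angio35.nex, and the exact layout PAUP* writes may differ.

```python
import re


def read_dimensions(nexus_text):
    """Return the ntax/nchar values found in the first dimensions command."""
    # A NEXUS command runs up to the next semicolon.
    match = re.search(r"dimensions\b([^;]*);", nexus_text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no dimensions command found")
    pairs = re.findall(r"(ntax|nchar)\s*=\s*(\d+)", match.group(1),
                       flags=re.IGNORECASE)
    return {key.lower(): int(value) for key, value in pairs}


# Hypothetical miniature data file, used only to exercise the function.
sample = """#nexus
begin data;
  dimensions ntax=2 nchar=8;
  format datatype=nucleotide;
  matrix
    taxonA ACGTACGT
    taxonB ACGTACGA
  ;
end;
"""

dims = read_dimensions(sample)
print(dims)
```

To try it on the real file, read angio35.nex into a string with open("angio35.nex").read() and pass that string to read_dimensions.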

Create an assumptions block containing a default exclusion set, so that the following sites are excluded automatically whenever the data file is executed. Add this block to the bottom of the newly created NEXUS file (i.e., after the data). You can use the pico editor for this.

begin assumptions;
  exset * unused = 1-41 234-241 246 506-511 555 681-689 1393-1399 1797-1855 1856-1884 4754-4811;
end;

These numbers represent nucleotide sites that are either missing a lot of data or are difficult to align. The name I gave to this exclusion set is unused, but you could name it anything you like. The asterisk tells PAUP* that you want this exset to be applied automatically every time the file is executed.
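
To see how many sites this exset removes, here is a small Python sketch that expands the range list into individual site numbers. The helper names parse_site_ranges and excluded_sites are hypothetical; the sketch only handles the simple "start-end" and single-site tokens used above, not the full range syntax that PAUP* accepts.

```python
def parse_site_ranges(spec):
    """Turn a string such as '1-41 246' into a list of (start, end) pairs."""
    ranges = []
    for token in spec.split():
        first, _, last = token.partition("-")
        start = int(first)
        end = int(last) if last else start  # a lone number is a one-site range
        if end < start:
            raise ValueError(f"range runs backwards: {token}")
        ranges.append((start, end))
    return ranges


def excluded_sites(spec):
    """Return the set of all site numbers covered by the range list."""
    sites = set()
    for start, end in parse_site_ranges(spec):
        sites.update(range(start, end + 1))  # NEXUS ranges are inclusive
    return sites


# The range list copied from the exset command above.
EXSET = ("1-41 234-241 246 506-511 555 681-689 1393-1399 "
         "1797-1855 1856-1884 4754-4811")

unused = excluded_sites(EXSET)
print(len(unused), "sites excluded")
```

Using a set means that a site listed twice would be counted only once. Note that 1797-1855 and 1856-1884 are adjacent ranges, so together they exclude one unbroken stretch of sites.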

Create a sets block comprising the following three charset commands:

  • The first charset should be named 18S and include sites 1 through 1855
  • The second charset should be named rbcL and include sites 1856 through 3283
  • The third charset should be named atpB and include sites 3284 through 4811

This block should be placed after the assumptions block. Look at the description of the sets block below and try to do this part on your own.
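
Before typing the charset commands, it can help to confirm that the three gene regions neither overlap nor leave gaps. The following Python sketch does that check; it is not NEXUS syntax, and the names CHARSETS and check_partition are hypothetical. It assumes the regions are meant to tile the sites from 1 up to 4811, the last site named in the list above.

```python
# Site boundaries copied from the three bullet points above.
CHARSETS = {
    "18S": (1, 1855),
    "rbcL": (1856, 3283),
    "atpB": (3284, 4811),
}


def check_partition(charsets, first_site=1):
    """Return a list of problems; an empty list means the sets tile the sites."""
    problems = []
    expected = first_site  # the site the next charset should start at
    ordered = sorted(charsets.items(), key=lambda item: item[1][0])
    for name, (start, end) in ordered:
        if start > expected:
            problems.append(f"gap before {name}: sites {expected}-{start - 1}")
        elif start < expected:
            problems.append(
                f"{name} overlaps an earlier charset starting at site {start}")
        expected = max(expected, end + 1)
    return problems


problems = check_partition(CHARSETS)
lengths = {name: end - start + 1 for name, (start, end) in CHARSETS.items()}
print(problems or "no gaps or overlaps")
print(lengths)
```

The lengths dictionary gives the number of sites in each region, which is a handy cross-check against what PAUP* reports once the sets block is in place.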

Sets block

At this point, the only sets block commands you need to know are charset and taxset.

#nexus
...
begin sets;
  charset trnL_intron = 562-4226;
  taxset gnetales = Ephedra Gnetum Welwitschia;
end;

This sets block defines both a set of characters (in this case the sites composing the trnL intron) and a set of taxa (consisting of the three genera in the seed plant order Gnetales: Ephedra, Gnetum and Welwitschia). We could have used the taxon numbers for the taxset definition (e.g., taxset gnetales = 1-3;) but using the actual names is clearer and less prone to error (just think of what might happen if you decided to reorder your sequences!). These definitions may be used in other blocks. A common use is in commands placed inside a paup block (see below) or typed directly at the PAUP* command prompt.
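
The risk of numeric taxset definitions can be illustrated with a short Python sketch. This is only a model of the idea, not of how PAUP* stores taxa; the helper names and the placeholder OtherTaxon are hypothetical.

```python
GNETALES = {"Ephedra", "Gnetum", "Welwitschia"}


def taxa_by_number(order, first, last):
    """Pick taxa by 1-based position, as a numeric definition like 1-3 would."""
    return set(order[first - 1:last])


def taxa_by_name(order, names):
    """Pick taxa by name, regardless of where they sit in the file."""
    return {taxon for taxon in order if taxon in names}


# Hypothetical taxon order in the data file, before and after reordering.
original = ["Ephedra", "Gnetum", "Welwitschia", "OtherTaxon"]
reordered = ["OtherTaxon", "Ephedra", "Gnetum", "Welwitschia"]

print(taxa_by_number(original, 1, 3) == GNETALES)    # positions match at first
print(taxa_by_number(reordered, 1, 3) == GNETALES)   # same numbers, wrong taxa
print(taxa_by_name(reordered, GNETALES) == GNETALES)  # names still match
```

After the reordering, positions 1 through 3 pick up OtherTaxon and drop Welwitschia, while the name-based selection is unaffected.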