Difference between revisions of "Phylogenetics: Large Scale Maximum Likelihood Analyses"

From EEBedia
Line 26: Line 26:
 
==== Editing garli.conf ====
 
==== Editing garli.conf ====
  
You will only need to change two lines. Change this line
+
You will only need to change three lines. Change this line
 
  datafname = zakonEtAl2006.11tax.nex
 
  datafname = zakonEtAl2006.11tax.nex
 
so that it looks like this instead
 
so that it looks like this instead
Line 37: Line 37:
 
  rm -f 50taxa*
 
  rm -f 50taxa*
 
without wiping out your data file as well!
 
without wiping out your data file as well!
 +
Finally, change this line
 +
invariantsites = estimate
 +
so that it looks like this instead
 +
invariantsites = none
 +
This causes GARLI to use the GTR+G model rather than the GTR+I+G model.
  
 
Save the <tt>garli.conf</tt> file when you have made these changes.
 
Save the <tt>garli.conf</tt> file when you have made these changes.
Line 75: Line 80:
 
== Part B: Starting a RAxML run on the cluster ==
 
== Part B: Starting a RAxML run on the cluster ==
  
Another excellent ML program for large problems is [http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm RAxML], written by [http://icwww.epfl.ch/~stamatak/ Alexandros Stamatakis].
+
Another excellent ML program for large problems is [http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm RAxML], written by [http://icwww.epfl.ch/~stamatak/ Alexandros Stamatakis]. This program is exceptionally fast, and has been used to estimate maximum likelihood trees for 25,000 taxa! Let's run RAxML on the same data as GARLI and compare results.
 +
 
 +
=== Preparing the data file ===
 +
 
 +
While GARLI reads NEXUS files, RAxML uses a simpler format. It is easy to use the pico editor to make the necessary changes, however. First, make a copy of your rbcL50.nex file:
 +
cp rbcL50.nex rbcL50.dat
 +
 
 +
Open rbcL50.dat in pico and use Ctrl-K repeatedly to remove these initial lines:
 +
#nexus
 +
 
 +
begin data;
 +
  dimensions ntax=50 nchar=1314;
 +
  format datatype=dna gap=- missing=?;
 +
  matrix
 +
 
 +
Add a new first line to the file that looks like this:
 +
50 1314
 +
 
 +
Now use the down arrow to go to the end of the file and remove the last two lines:
 +
;             
 +
end;
 +
 
 +
Save the file using Ctrl-X and you are ready to run RAxML!

Revision as of 20:33, 16 February 2009

EEB 349: Phylogenetics
This lab explores two programs (GARLI and RAxML) designed specifically for maximum likelihood analyses on a large scale (hundreds of taxa).

Part A: Starting a GARLI run on the cluster

GARLI is a program written by Derrick Zwickl for estimating the phylogeny using maximum likelihood, and is currently one of the best programs to use if you have a large problem (i.e. many taxa). GARLI now (as of version 0.96) gives you considerable choice in substitution models: GTR[+I][+G] or codon models for nucleotides, plus several choices for amino acids. The genetic algorithm (or GA, for short) search strategy used by GARLI is like other heuristic search strategies in that it cannot guarantee that the optimal tree will be found. Thus, as with all heuristic searches, it is a good idea to run GARLI several times (using different pseudorandom number seeds) to see if there is any variation in the estimated tree.

Today you will run GARLI on the cluster for a dataset with 50 taxa. This is not a particularly large problem, but then you only have an hour or so to get this done! Instead of each of us running GARLI several times, we will each run it once and compare notes at the end of the lab.

Preparing the GARLI control file

Like many programs, GARLI uses a control file to specify the settings it will use during a run. Most of the default settings are fine, but you will need to change a few of them before running GARLI.

Obtain a copy of the control file

The first step is to obtain a copy of the GARLI default control file. Go to the GARLI download page and download a version of GARLI appropriate for your platform (Mac or Windows). For now, the only reason you are downloading GARLI is to obtain a copy of the default control file. However, because GARLI is multithreaded, you may find that it is faster to run it on your laptop than on the cluster (assuming your laptop has a multi-core Intel processor). Running on the cluster has advantages, even if it is slower. For one, you don't have to dedicate your laptop to a GARLI run for several hours.

Once you have downloaded and unpacked GARLI on your computer, copy the file garli.conf.nuc.defaultSettings to a new file named simply garli.conf and open the copy in your text editor.

Editing garli.conf

You will only need to change three lines. Change this line

datafname = zakonEtAl2006.11tax.nex

so that it looks like this instead

datafname = rbcL50.nex

Then change this line

ofprefix = nuc.GTRIG

so that it looks like this instead

ofprefix = 50taxa

GARLI begins the name of every output file with the ofprefix value. I usually use something different from the data file name here. That way, if you eventually want to delete all of the various files that GARLI creates, you can just type

rm -f 50taxa*

without wiping out your data file as well!

Finally, change this line

invariantsites = estimate

so that it looks like this instead

invariantsites = none

This causes GARLI to use the GTR+G model rather than the GTR+I+G model.

Save the garli.conf file when you have made these changes.
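If you would rather make the three edits from the command line than in a text editor, something like the following sed sketch should work. It assumes each setting appears exactly once in the default control file and starts at the beginning of its line. The data file name uses the capitalization of the downloaded file (rbcL50.nex), because file names on the cluster are case-sensitive. The three-line file created here is only a stand-in so that the example runs on its own; in the lab you would point sed at the real garli.conf.nuc.defaultSettings.

```shell
# Work in a scratch directory so nothing in your real folders is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the default control file: just the three lines we change.
cat > garli.conf.nuc.defaultSettings <<'EOF'
datafname = zakonEtAl2006.11tax.nex
ofprefix = nuc.GTRIG
invariantsites = estimate
EOF

# Rewrite the three settings and save the result as garli.conf.
# The original defaults file is left unchanged.
sed -e 's/^datafname = .*/datafname = rbcL50.nex/' \
    -e 's/^ofprefix = .*/ofprefix = 50taxa/' \
    -e 's/^invariantsites = .*/invariantsites = none/' \
    garli.conf.nuc.defaultSettings > garli.conf

# Show the three settings so you can confirm the changes.
grep -E '^(datafname|ofprefix|invariantsites) = ' garli.conf
```

Writing to a new file rather than editing in place keeps the defaults around in case you want to start over.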

Log into the cluster

Log into the cluster using the command:

ssh bbcxsrv1.biotech.uconn.edu

Go back to the Phylogenetics: Bioinformatics Cluster lab if you've forgotten some details.

Create a folder and a script for the run

Create a directory named garlirun inside your home directory and use your favorite file transfer method (scp, psftp, Fugu, FileZilla, etc.) to get garli.conf into that directory.

Now download the data file into the garlirun directory:

curl http://hydrodictyon.eeb.uconn.edu/eeb5349/rbcL50.nex > garlirun/rbcL50.nex

Finally, create the script file you will hand to the qsub command to start the run. Use the pico editor to create a file named gogarli in your home directory with the following contents:

#$ -o junk.txt -j y
cd $HOME/garlirun
garli garli.conf
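If you prefer not to use pico, the same three-line script can be written with a here-document. This is only a sketch of an alternative; it creates gogarli in whatever directory you run it from, so run it from your home directory to match the instructions above.

```shell
# Write the job script without opening an editor.
# Quoting 'EOF' keeps $HOME literal in the file, so it is expanded
# when the job runs rather than when the file is written.
cat > gogarli <<'EOF'
#$ -o junk.txt -j y
cd $HOME/garlirun
garli garli.conf
EOF

# Display the file to check that it matches the listing above.
cat gogarli
```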

Submit the job

Here is the command to start the job:

qsub gogarli

You should issue this command from your home directory, or wherever you saved the gogarli file.

Check progress every few minutes using the qstat command. This run will take 15 or 20 minutes. If you get bored, you can cd into the garlirun directory and use this command to see the tail end of the log file that GARLI creates automatically:

tail 50taxa.log00.log

The tail command is like the cat command except that it only shows you the last few lines of the file (10 by default), which often is just what you need.
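Here is a small illustration of how tail behaves, using a throwaway 20-line file in place of the GARLI log (the file is a stand-in created only for this example).

```shell
# Make a 20-line demonstration file containing the numbers 1 through 20.
demo_log=$(mktemp)
seq 1 20 > "$demo_log"

# With no options, tail prints the last 10 lines (here, 11 through 20).
tail "$demo_log"

# The -n option chooses how many lines to show (here, 18, 19 and 20).
tail -n 3 "$demo_log"

# On the real log you could also follow new lines as GARLI writes them:
#   tail -f 50taxa.log00.log     (press Ctrl-C to stop following)
```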

Part B: Starting a RAxML run on the cluster

Another excellent ML program for large problems is RAxML, written by Alexandros Stamatakis. This program is exceptionally fast, and has been used to estimate maximum likelihood trees for 25,000 taxa! Let's run RAxML on the same data as GARLI and compare results.

Preparing the data file

While GARLI reads NEXUS files, RAxML uses a simpler format. It is easy to use the pico editor to make the necessary changes, however. First, make a copy of your rbcL50.nex file:

cp rbcL50.nex rbcL50.dat

Open rbcL50.dat in pico and use Ctrl-K repeatedly to remove these initial lines:

#nexus 
begin data;
  dimensions ntax=50 nchar=1314;
  format datatype=dna gap=- missing=?;
  matrix

Add a new first line to the file that looks like this:

50 1314

Now use the down arrow to go to the end of the file and remove the last two lines:

;              
end;

Exit pico using Ctrl-X, answering y when it asks whether to save your changes, and you are ready to run RAxML!
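The same edits can also be scripted. The sketch below is one possible way to do it with sed, shown on a tiny made-up NEXUS file (tiny.nex, with 3 taxa and 8 sites) so that it runs on its own. It assumes the file is laid out like rbcL50.nex: a lowercase matrix line on its own, one taxon per line, and the closing ; and end; each on their own line. For the real data you would replace tiny.nex with rbcL50.nex, tiny.dat with rbcL50.dat, and "3 8" with "50 1314".

```shell
# Work in a scratch directory with a tiny stand-in NEXUS file.
workdir=$(mktemp -d)
cd "$workdir"
cat > tiny.nex <<'EOF'
#nexus

begin data;
  dimensions ntax=3 nchar=8;
  format datatype=dna gap=- missing=?;
  matrix
taxonA ACGTACGT
taxonB ACGTAC-T
taxonC ACG?ACGT
;
end;
EOF

{
  # New first line: number of taxa, then number of sites.
  echo "3 8"
  # Delete everything from line 1 through the "matrix" line,
  # then delete the lone ";" line and the "end;" line.
  sed -e '1,/^[[:space:]]*matrix[[:space:]]*$/d' \
      -e '/^[[:space:]]*;[[:space:]]*$/d' \
      -e '/^[[:space:]]*end;[[:space:]]*$/d' tiny.nex
} > tiny.dat

# Show the converted file: a header line followed by the sequences.
cat tiny.dat
```

Doing the conversion by hand in pico is still worthwhile the first time, since it shows exactly what differs between the two formats.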