Phylogenetics: Large Scale Maximum Likelihood Analyses

From EEBedia
Revision as of 13:34, 16 February 2009 by PaulLewis (Talk | contribs)

Jump to: navigation, search
Adiantum.png EEB 349: Phylogenetics
This lab explores two programs (GARLI and RAxML) designed specifically for maximum likelihood analyses on a large scale (hundreds of taxa).

Part A: Starting a GARLI run on the cluster

GARLI is a program written by Derrick Zwickl for estimating the phylogeny using maximum likelihood, and is currently one of the best programs to use if you have a large problem (i.e. many taxa). I used GARLI to estimate the 738-taxon green plant phylogeny on the poster outside my office door in less than 5 hours. Another excellent ML program for large problems is RAxML, written by Alexandros Stamatakis.

GARLI does not give you much choice in the way of search strategy or substitution model. It uses the GTR+I+G model (General Time Reversible substitution model, with invariable sites and discrete gamma rate heterogeneity), and uses a genetic algorithm search strategy. The genetic algorithm (or GA, for short) search strategy is like other heuristic search strategies in that it cannot guarantee that the optimal tree will be found. Thus, as will all heuristic searches, it is a good idea to run GARLI several times (using different pseudorandom number seeds) to see if there is any variation in the estimated tree.

Today you will run GARLI on the cluster for a dataset with 50 taxa. This is not a particularly large problem, but then you only have an hour or so to get this done!

Preparing the GARLI control file

Like many non-interactive programs, GARLI uses a control file to specify the settings it will use during a run. Here is the control file distributed with GARLI:

datafname = rana.phy
streefname = random
ofprefix = ranaGarli
randseed = -1
megsclamemory = 500
availablememory = 512
logevery = 10
saveevery = 100
refinestart = 1
outputeachbettertopology = 1
enforcetermconditions = 1
genthreshfortopoterm = 20000
scorethreshforterm = .05
significanttopochange = 0.05
outputphyliptree = 0
outputmostlyuselessfiles = 0
dontinferproportioninvariant = 0  

nindivs = 4
holdover = 1
selectionintensity = .5
holdoverpenalty = 0
stopgen = 5000000
stoptime = 5000000

startoptprec = .5
minoptprec = .01
numberofprecreductions = 40
topoweight = 1.0
modweight = .05
brlenweight = .2
randnniweight = .2
randsprweight = .3
limsprweight =  .5
intervallength = 100
intervalstostore = 5

limsprrange = 6
meanbrlenmuts = 5
gammashapebrlen = 1000
gammashapemodel = 1000 

bootstrapreps = 0
inferinternalstateprobs = 0

Most of these settings are fine, but you will need to change a few of them before running GARLI.

Using curl to download the file

The first step is to get this file onto the cluster where you can use pico to edit it. You have already learned several ways to create a file on the cluster:

  • cat - > filename ... Ctrl-d
  • pico filename

Now you will learn a third way: using the curl program. Curl is useful when you know that a file exists on the internet at some URL. Curl stands for "Copy URL" - it allows you to copy a file at a particular address (URL) on the internet to your present working directory. I have placed a copy of the garli.conf file above at the following URL:

You can view it in a web browser if you wish. Here is how to get this file copied from that web address to your home directory on the cluster. I assume you have already logged into the cluster using PuTTY:

curl > garli.conf

Curl acts a lot like cat: it basically spews the file to the console, so if you want to save it to a file, you need to redirect the output of curl to a file, which is what the > garli.conf part on the end is about.

Editing garli.conf with pico

Now fire up pico and edit this file:

pico garli.conf

You will only need to change two lines. Change this line

datafname = rana.phy

so that it looks like this instead

datafname = rbcl50.nex

Then change this line

ofprefix = ranaGarli

so that it looks like this instead

ofprefix = 50taxa

The ofprefix is used by GARLI to begin the name of all output files. I usually use something different than the data file name here. If you eventually want to delete all of the various files that GARLI creates, you can just say

rm -f 50taxa*

If, however, you specify ofprefix = rbcl50, then the command

rm -f rbcl50* 

would not only wipe out the files GARLI created, but would also delete the data file!

Once you have finished changing those two lines, exit pico using Ctrl-x, then create a directory named garlirun and move garli.conf into that directory.

Download the data file using curl

I have placed the data file (rbcL50.nex) at the following address:

so you can use curl to download this file to the garlirun directory as follows:

cd $HOME/garlirun
curl > rbcL50.nex

New.png Three changes were made to this section:

  • The last line above has been corrected (previously it did not include the > rbcl50nex part on the end)
  • I changed rbcl50.nex to rbcL50.nex to avoid confusing the lower case letter L (l) for the number one (1)
  • I substituted a Tomato sequence for one of the two Spinach sequences, so now there is no duplication in the rbcL50.nex data file (but note that the data set has changed)

Preparing the gogarli SGE script

Now return to your home directory (using the cd command) and create a gogarli script that will be fed to qsub to start the analysis. Use either pico or cat to create the file with this text:

#$ -o junk.txt -j y
cd $HOME/garlirun
Garli.94 garli.conf

This file will look very similar to the gopaup script you created in part A. The only difference is that the data and control file are in the directory garlirun (not pauprun), the name of the program is Garli.94 (this means Garli version 0.94), and GARLI expects the name of the control file (garli.conf) on the command line instead of the name of the data file (rbcl50.nex). Remember that the name of the data file was specified inside the control file.

Running GARLI

Run GARLI by issuing the qsub command:

qsub gogarli

Check progress every few minutes using the qstat command. This run will take 15 or 20 minutes. If you get bored, you can cd into the garlirun directory and use this command to see the tail end of the log file that GARLI creates automatically:

tail 50taxa.log00.log

The tail command is like the cat command except that it only shows you the last few lines of the file (which often is just what you need).

Mailing the tree to yourself

After GARLI has finished, you should download the tree file ( using the PSFTP get command, but here is another handy trick: you can email the tree to yourself using this command (issue this from within the garlirun directory where the tree file is located):

mail <

This command will send mail to, and the body of the email message will come from the file!