Phylogenetics: Large Scale Maximum Likelihood Analyses
|EEB 349: Phylogenetics|
|This lab explores two programs (GARLI and RAxML) designed specifically for maximum likelihood analyses on a large scale (hundreds of taxa).|
- 1 Part A: Starting a GARLI run on the cluster
- 2 Part B: Starting a RAxML run on the cluster
Part A: Starting a GARLI run on the cluster
GARLI (Genetic Algorithm for Rapid Likelihood Inference) is a program written by Derrick Zwickl for estimating the phylogeny using maximum likelihood, and is currently one of the best programs to use if you have a large problem (i.e. many taxa). GARLI now (as of version 0.96) gives you considerable choice in substitution models: GTR[+I][+G] or codon models for nucleotides, plus several choices for amino acids. The genetic algorithm (or GA, for short) search strategy used by GARLI is like other heuristic search strategies in that it cannot guarantee that the optimal tree will be found. Thus, as with all heuristic searches, it is a good idea to run GARLI several times (using different pseudorandom number seeds) to see if there is any variation in the estimated tree. By default, GARLI will conduct two independent searches. If you have a multicore processor (newer Intel-based Macs and PCs are duo-core), GARLI can take advantage of this and use all of your CPUs simultaneously.
Today you will run GARLI on the cluster for a dataset with 50 taxa. This is not a particularly large problem, but has the advantage that you will be able to analyze it several times using both GARLI and RAxML within a lab period. Instead of each of us running GARLI several times, we will each run it once and compare notes at the end of the lab.
Preparing the GARLI control file
Like many programs, GARLI uses a control file to specify the settings it will use during a run. Most of the default settings are fine, but you will need to change a few of them before running GARLI.
Obtain a copy of the control file
The first step is to obtain a copy of the GARLI default control file. Go to the GARLI download page and download a version of GARLI appropriate for your platform (Mac or Windows). For now, the only reason you are downloading GARLI is to obtain a copy of the default control file. However, because GARLI is multithreaded, you may find that it is faster to run it on your multi-core laptop than on the cluster. Running on the cluster has advantages, however, even if it is slower. For one, if you have a truly large data set, you using the cluster means that your laptop is not tied up for hours.
Once you have downloaded and unpacked GARLI on your computer, copy the garli.conf.nuc.defaultSettings to a file named simply garli.conf and open it in your text editor.
You will only need to change four lines.
Specify the data file name (note the capital L)
datafname = rbcL50.nex
Specify the prefix for output files
ofprefix = 50taxa
The ofprefix is used by GARLI to begin the name of all output files. I usually use something different than the data file name here. That way, if you eventually want to delete all of the various files that GARLI creates, you can just say
rm -f 50taxa*
without wiping out your data file as well!
Specify no invariable sites so that GARLI uses the GTR+G model rather than the GTR+I+G model
invariantsites = none
Do only one search replicate
searchreps = 1
Save the garli.conf file when you have made these changes.
The tip of the GARLI iceberg
As you can see from the number of entries in the control file, we are not going to learn all there is to know about GARLI in one lab session. One major omission is any discussion about bootstrapping, which is very easy to do in GARLI: just set bootstrapreps to some number other than 0 (e.g. 100) in your garli.conf file. I encourage you to download and read the excellent GARLI manual, especially if you want to use amino acid or codon models.
Log into the cluster
Log into the cluster using the command:
Go back to the Phylogenetics: Bioinformatics Cluster lab if you've forgotten some details.
Create a folder and a script for the run
Create a directory named garlirun inside your home directory and use your favorite file transfer method (scp, psftp, Fugu, FileZilla, etc.) to get garli.conf into that directory.
Now download the data file into the garlirun directory:
curl http://hydrodictyon.eeb.uconn.edu/eeb5349/rbcL50.nex > garlirun
Finally, create the script file you will hand to the qsub command to start the run. Use the pico editor to create a file named gogarli in your home directory with the following contents:
#$ -o junk.txt -j y cd $HOME/garlirun garli garli.conf
Submit the job
Here is the command to start the job:
You should issue this command from your home directory, or where ever you saved the gogarli file.
Check progress every few minutes using the qstat command. This run will take 15 or 20 minutes. If you get bored, you can cd into the garlirun directory and use this command to see the tail end of the log file that GARLI creates automatically:
The tail command is like the cat command except that it only shows you the last few lines of the file (which often is just what you need).
Part B: Starting a RAxML run on the cluster
Another excellent ML program for large problems is RAxML, written by Alexandros Stamatakis. This program is exceptionally fast, and has been used to estimate maximum likelihood trees for 25,000 taxa! Let's run RAxML on the same data as GARLI and compare results.
Preparing the data file
While GARLI reads NEXUS files, RAxML uses a simpler format. It is easy to use the pico editor to make the necessary changes, however. First, make a copy of your rbcL50.nex file:
cp rbcL50.nex rbcL50.dat
Open rbcL50.dat in pico and use Ctrl-K repeatedly to remove these initial lines:
begin data; dimensions ntax=50 nchar=1314; format datatype=dna gap=- missing=?; matrix
Add a new first line to the file that looks like this:
Now use the down arrow to go to the end of the file and remove the last two lines:
Save the file using Ctrl-X and you are ready to run RAxML!
The tip of the RAxML iceberg
As with GARLI, RAxML is full of features that we will not have time to explore today. The manual does a nice job of explaining all the features so I recommend reading it if you use RAxML for your own data.
Running RAxML on the cluster
Hopefully, you have created the rbcL50.dat file in your garlirun directory. If not, go ahead and move it there. Then return to your home directory and use pico to create a gorax script file that contains the following:
#$ -o junk2.txt -j y cd $HOME/garlirun raxml -p 13579 -N 1 -e 0.00001 -m GTRMIX -s rbcL50.dat -n BASIC
You'll note that this is similar to the gogarli script we created earlier, but it is worth discussing each line before submitting the run to the cluster.
The first line is the same except that we specified junk2.txt rather than junk.txt (this is so that our RAxML run will not try to write to the same file as our GARLI run, which is probably still going).
The second line is identical to the second line of our gogarli script. You could, of course, sequester the RAxML results in a different directory if you wanted, but it is safe to use the same folder because none of the RAxML output files will have exactly the same name as any of the GARLI output files.
The third line requires the most explanation. First, RAxML does not use a control file like GARLI, so all options must be specified on the command line when it is invoked. Let's take each option in turn:
- -p 13579 provides a pseudorandom number seed to RAxML to use when it generates its starting tree using (the p presumably stands for parsimony, which is the optimality criterion it uses to obtain a starting tree). It is a good idea to specify some number here so that you can exactly recreate the analysis later (for purposes of reporting a potential bug, for example).
- -N 1 tells RAxML to just perform one search replicate.
- -e 0.00001 sets the precision with which model parameters will be estimated. RAxML will search for better combinations of parameter values until it fails to increase the log-likelihood by more than this amount. Ordinarily, the default value (0.1) is sufficient, but we are making RAxML work harder so that the results are more comparable to GARLI, which does a fairly thorough final parameter optimization.
- -m GTRMIX tells RAxML to use the GTR+CAT model for the search, then to switch to the GTR+G for final optimization of parameters (so that the likelihood is comparable to that produced by other programs).
- -s rbcL50.dat provides the name of the data file.
- -n BASIC supplies a suffix to be appended to all output file names
Start the run by entering this from your home directory (or where ever your gorax file is located):
Bootstrapping with RAxML
After your first RAxML run finishes, start a second, longer run to perform bootstrapping. Modify your gorax file as follows:
#$ -o junk2.txt -j y cd $HOME/garlirun # raxml -p 13579 -N 1 -e 0.00001 -m GTRMIX -s rbcL50.dat -n BASIC raxml -f a -x 12345 -p 13579 -N 100 -m GTRCAT -s rbcL50.dat -n FULL
Note that I've used a # character to comment out our previous raxml line. Feel free to simply delete that line if you wish. Go ahead and start this run using qsub. This one will take awhile. It will conduct a bootstrap analysis involving 100 bootstrap replicates (-N 50) using the GTR+CAT model. The -x 12345 specifies a starting seed for the bootstrap resampling. For every 5 bootstrap replicates performed, RAxML will climb uphill on the original dataset starting from the tree estimated for that bootstrap replicate. This provides a series of searches for the maximum likelihood tree starting from different, but reasonable, starting trees. The -f a on the command line sets up this combination of bootstrapping and ML searching.