Phylogenetics: Large Scale Maximum Likelihood Analyses
|EEB 349: Phylogenetics|
|Warning: Paul is still working on this lab, so some instructions may be woefully wrong until this warning disappears! (After that, some things may still be wrong, but hopefully not woefully wrong.)|
- 1 The data
- 2 Part A: Starting a GARLI run on the cluster
- 2.1 Preparing the GARLI control file
- 2.2 The tip of the GARLI iceberg
- 2.3 Log into the cluster
- 2.4 Create a folder and a script for the run
- 2.5 Submit the job
- 2.6 Files produced by GARLI
- 3 Part B: Starting a RAxML run on the cluster
- 4 Comparing GARLI, RaxML, and FastTree
The databryophytes are yellow, ferns are green, gymnosperms are blue and angiosperms are pink. This history of green plants shows several key innovations: embryos are what gave the first bryophytes an edge over aquatic algae on land; branched sporophytes and vascular tissues allowed the first ferns to grow taller and disperse more spores compared to their bryophyte ancestors; seeds and pollen were the big inventions that led to the rise of gymnosperms; and of course flowers allowed efficient pollination by insects and led the the diversification of the angiosperms.
Part A: Starting a GARLI run on the cluster
GARLI (Genetic Algorithm for Rapid Likelihood Inference) is a program written by Derrick Zwickl for estimating the phylogeny using maximum likelihood, and is currently one of the best programs to use if you have a large problem (i.e. many taxa). GARLI now (as of version 2.01) gives you considerable choice in substitution models: GTR[+I][+G] or codon models for nucleotides, plus several choices for amino acids. The genetic algorithm (or GA, for short) search strategy used by GARLI is like other heuristic search strategies in that it cannot guarantee that the optimal tree will be found. Thus, as with all heuristic searches, it is a good idea to run GARLI several times (using different pseudorandom number seeds) to see if there is any variation in the estimated tree. By default, GARLI will conduct two independent searches. If you have a multicore processor (newer Intel-based Macs and PCs are duo-core), GARLI can take advantage of this and use all of your CPUs simultaneously.
Preparing the GARLI control file
Obtain a copy of the control file
The first step is to obtain a copy of the GARLI default control file. Go to the GARLI download page and download a version of GARLI appropriate for your platform (Mac or Windows). The only reason you are downloading GARLI is to obtain a copy of the default control file; you should use the cluster for all your GARLI runs.
Once you have downloaded and unpacked GARLI on your computer, make a copy of the garli.conf.nuc.defaultSettings file and rename the copy garli.conf, then open it in your text editor.
You will only need to change four lines.
Specify the data file name (note the capital L)
datafname = rbcL50.nex
Specify the prefix for output files
ofprefix = 50
The ofprefix is used by GARLI to begin the name of all output files. I usually use something different than the data file name here. That way, if you eventually want to delete all of the various files that GARLI creates, you can just say rm -f 50* ("remove all files beginning "50", the -f means "force" i.e. don't ask if it is ok, just do it) without wiping out your data file as well! (Sounds like the voice of experience, doesn't it?!)
Specify no invariable sites
invariantsites = none
Do only one search replicate
searchreps = 1
Save the garli.conf file when you have made these changes.
The tip of the GARLI iceberg
Log into the cluster
Log into the cluster using the command:
Go back to the Phylogenetics: Bioinformatics Cluster lab if you've forgotten some details.
Create a folder and a script for the run
Now download the data file into the garlirun directory:
curl http://hydrodictyon.eeb.uconn.edu/eeb5349/rbcL50.nex > rbcL50.nex
Finally, create the script file you will hand to the qsub command to start the run. Use the pico (a.k.a. nano) editor to create a file named gogarli in your home directory with the following contents:
#$ -o junk.txt -j y cd $HOME/garlirun garli garli.conf
Submit the job
Here is the command to start the job:
You should issue this command from your home directory, or where ever you saved the gogarli file.
Files produced by GARLI
This file saves the output that would have been displayed had you been running GARLI on your laptop.
Part B: Starting a RAxML run on the cluster
Preparing the data file
cp rbcL50.nex rbcL50.dat
Open rbcL50.dat in pico and use Ctrl-k repeatedly to remove these initial lines:
#nexus begin data; dimensions ntax=50 nchar=1314; format datatype=dna gap=- missing=?; matrix
Add a new first line to the file that looks like this:
Now use the down arrow to go to the end of the file and remove the last two lines:
Save the file using Ctrl-x and you are ready to run RAxML!
The tip of the RAxML iceberg
Running RAxML on the cluster
The first line is the same except that we specified junk2.txt rather than junk.txt (this is so that our RAxML run will not try to write to the same file as our GARLI run).
- -p 13579 provides a pseudorandom number seed to RAxML to use when it generates its starting tree (the p presumably stands for parsimony, which is the optimality criterion it uses to obtain a starting tree). It is a good idea to specify some number here so that you have the option of exactly recreating the analysis later.
- -N 1 tells RAxML to just perform one search replicate.
- -s rbcL50.dat provides the name of the data file.
- -n BASIC supplies a suffix to be appended to all output file names
Start the run by entering this from your home directory (or where ever your gorax file is located):
Bootstrapping with RAxML
Files produced by RAxML
This file contains some basic information about the run. Use this file to answer these questions:
Comparing GARLI, RaxML, and FastTree
To compare the two programs, use the nano editor to create a tree file named combined.nex containing a trees block with the best tree from both programs and a paup block to compute the likelihoods of these three trees under the GTR+G model. Here I've simply inserted ellipses (...) as placeholders for the actual tree descriptions. It may be easier to construct this tree file on your own laptop and then upload it again to the cluster; use whichever approach is most convenient for you.
#nexus begin paup; exe rbcL50.nex; end; begin trees; utree garli = (...); utree raxml = (...); end; begin paup; set criterion=likelihood; lset nst=6 rmatrix=estimate rates=gamma shape=estimate; lscores all; agree all; end;
Which tree has the best log-likelihood? You will probably find that GARLI's likelihood is slightly better than RaxML. It is perhaps not too surprising that GARLI is best in this comparison since RAxML did its analysis under the GTR-CAT model instead of the GTR+G model, and thus GARLI was the only one of the three that actually searched under the same criterion by which we evaluated the performance of all three methods. Both of these approaches will give different answers if you run them multiple times under different random number seeds, so you should probably do several replicates (or a bootstrap analysis) before making anything too momentous of the results.
The last command given to PAUP was "agree all". This computes an agreement subtree from the results of the three analyses. Can you figure out from PAUP's output how many taxa (out of the 50 total) it had to omit in order to find an agreement subtree?