Difference between revisions of "Phylogenetics: Large Scale Maximum Likelihood Analyses"

Revision as of 01:01, 25 February 2014

	EEB 349: Phylogenetics
	This lab explores two programs (GARLI and RAxML) designed specifically for maximum likelihood analyses on a large scale (hundreds of taxa).
Warning: Paul is still working on this lab, so some instructions may be woefully wrong until this warning disappears! (After that, some things may still be wrong, but hopefully not woefully wrong.)

The data

The data that we will use today comprises 50 rbcL sequences from various green plants: 5 green algae, 5 bryophytes (mosses, liverworts, hornworts), 5 ferns, 9 gymnosperms, and 26 angiosperms (flowering plants; including some you will recognize such as oregano, coffee, tea, spinach, poison ivy, rice and chocolate). The tree on the right shows the tree I obtained from GARLI using the GY94 codon model, color-coded so that bryophytes are yellow, ferns are green, gymnosperms are blue and angiosperms are pink. This history of green plants shows several key innovations: embryos are what gave the first bryophytes an edge over aquatic algae on land; branched sporophytes and vascular tissues allowed the first ferns to grow taller and disperse more spores compared to their bryophyte ancestors; seeds and pollen were the big inventions that led to the rise of gymnosperms; and of course flowers allowed efficient pollination by insects and led the the diversification of the angiosperms.

Part A: Starting a GARLI run on the cluster

GARLI (Genetic Algorithm for Rapid Likelihood Inference) is a program written by Derrick Zwickl for estimating the phylogeny using maximum likelihood, and is currently one of the best programs to use if you have a large problem (i.e. many taxa). GARLI now (as of version 2.01) gives you considerable choice in substitution models: GTR[+I][+G] or codon models for nucleotides, plus several choices for amino acids. The genetic algorithm (or GA, for short) search strategy used by GARLI is like other heuristic search strategies in that it cannot guarantee that the optimal tree will be found. Thus, as with all heuristic searches, it is a good idea to run GARLI several times (using different pseudorandom number seeds) to see if there is any variation in the estimated tree. By default, GARLI will conduct two independent searches. If you have a multicore processor (newer Intel-based Macs and PCs are duo-core), GARLI can take advantage of this and use all of your CPUs simultaneously.

Today you will run GARLI on the cluster for a dataset with 50 taxa. This is not a particularly large problem, but has the advantage that you will be able to analyze it several times using both GARLI and RAxML within a lab period. Instead of each of us running GARLI several times, we will each run it once and compare notes at the end of the lab.

Preparing the GARLI control file

Like many programs, GARLI uses a control file to specify the settings it will use during a run. Most of the default settings are fine, but you will need to change a few of them before running GARLI.

Obtain a copy of the control file

The first step is to obtain a copy of the GARLI default control file. Go to the GARLI download page and download a version of GARLI appropriate for your platform (Mac or Windows). The only reason you are downloading GARLI is to obtain a copy of the default control file; you should use the cluster for all your GARLI runs.

Once you have downloaded and unpacked GARLI on your computer, make a copy of the garli.conf.nuc.defaultSettings file and rename the copy garli.conf, then open it in your text editor.

Editing garli.conf

You will only need to change four lines.

Specify the data file name (note the capital L)

datafname = rbcL50.nex

Specify the prefix for output files

ofprefix = 50

The ofprefix is used by GARLI to begin the name of all output files. I usually use something different than the data file name here. That way, if you eventually want to delete all of the various files that GARLI creates, you can just say rm -f 50* ("remove all files beginning "50", the -f means "force" i.e. don't ask if it is ok, just do it) without wiping out your data file as well! (Sounds like the voice of experience, doesn't it?!)

Specify no invariable sites

invariantsites = none

This will cause GARLI to use the GTR+G model rather than the GTR+I+G model, which will facilitate comparisons with RAxML.

Do only one search replicate

searchreps = 1

Save the garli.conf file when you have made these changes.

The tip of the GARLI iceberg

As you can see from the number of entries in the control file, we are not going to learn all there is to know about GARLI in one lab session. One major omission is any discussion about bootstrapping, which is very easy to do in GARLI: just set bootstrapreps to some number other than 0 (e.g. 100) in your garli.conf file. I encourage you to download and read the excellent GARLI manual, especially if you want to use amino acid or codon models.

Log into the cluster

Log into the cluster using the command:

ssh bbcsrv3.biotech.uconn.edu

Go back to the Phylogenetics: Bioinformatics Cluster lab if you've forgotten some details.

Create a folder and a script for the run

Create a directory named garlirun inside your home directory and use your favorite file transfer method (scp, psftp, Fugu, FileZilla, etc.) to get garli.conf into that directory.

Now download the data file into the garlirun directory:

curl http://hydrodictyon.eeb.uconn.edu/eeb5349/rbcL50.nex > rbcL50.nex

Finally, create the script file you will hand to the qsub command to start the run. Use the nano editor to create a file named gogarli inside the garlirun directory with the following contents:

#$ -cwd
#$ -S /bin/bash
#$ -o out.txt
#$ -e err.txt
#$ -m ea
#$ -M your.name@uconn.edu
garli garli.conf

The lines starting with #$ represent commands that qsub understands.

The command #$ -cwd tells qsub to put output files in the current working directory, the directory in which it was invoked (which will be the garlirun directory).

The command #$ -S /bin/bash tells qsub to use the bash program to interpret the script. You have already been using the bash program; it interprets the commands you type when you are logged into the cluster. This qsub command just tells qsub to use the same program to interpret commands in the gogarli file.

The command #$ -o out.txt tells qsub to save any output into the file out.txt.

The command #$ -e err.txt tells qsub to save any error messages into the file err.txt.

Finally, the command #$ -m ea tells qsub to send you an email when your job ends (e) and if it aborts (a). The command #$ -M your.name@uconn.edu tells qsub what your email address is (be sure to specify your own email address here).

The last line starts garli.

Submit the job

Here is the command to start the job:

qsub gogarli

You should issue this command from inside the garlirun directory, which should contain 3 files: gogarli, rbcL50.nex and garli.conf.

Check progress every few minutes using the qstat command. This run will take about 4 minutes. If you get bored, you can cd into the garlirun directory and use this command to see the tail end of the log file that GARLI creates automatically:

tail 50.log00.log

The tail command is like the cat command except that it only shows you the last few lines of the file (which often is just what you need).

Files produced by GARLI

After your run finishes, you should find these files in your garlirun folder. Download them to your laptop and view them to answer the questions:

`50.screen.log`

This file saves the output that would have been displayed had you been running GARLI on your laptop.

How long did this run take?

(Search for "Time used" following "Final score")

What was the log-likelihood of the best tree?

(Look for "Replicate 1")

How many "generations" did GARLI do before it decided to stop?

(Look at the line just above "Reached termination condition!")

At which generation did it first find the tree that it finally reported?

Assuming each generation takes an equal amount of time, how long did it really take GARLI to find the best tree?

Looking at the model report at the end, what two categories of substitution occur at the fastest rate?

Also, is there a lot of rate heterogeneity in this data set?

What is the ratio of the fastest relative rate to the slowest?

`50.log00.log`

This file shows the best log-likelihood at periodic intervals throughout the run. It would be useful if you wanted to plot the progress of the run either as a function of time or generation.

`50.best.tre`

This is a NEXUS tree file that can be opened in FigTree, TreeView, PAUP*, or a number of other phylogenetic programs. Try using FigTree to open it. The best place to root it is on the branch leading to Nephroselmis. In FigTree, click this branch and use the Reroot tool to change the rooting. I also find that trees look better if you click the Order nodes checkbox, which is inside the Trees tab on the left side panel of FigTree.

Part B: Starting a RAxML run on the cluster

Another excellent ML program for large problems is RAxML, written by Alexandros Stamatakis. This program is exceptionally fast, and has been used to estimate maximum likelihood trees for 25,000 taxa! Let's run RAxML on the same data as GARLI and compare results.

Preparing the data file

While GARLI reads NEXUS files, RAxML uses a simpler format. It is easy to use the pico editor to make the necessary changes, however. First, make a copy of your rbcL50.nex file:

cp rbcL50.nex rbcL50.dat

Open rbcL50.dat in pico and use Ctrl-k repeatedly to remove these initial lines:

#nexus 

begin data;
  dimensions ntax=50 nchar=1314;
  format datatype=dna gap=- missing=?;
  matrix

Add a new first line to the file that looks like this:

50 1314

Now use the down arrow to go to the end of the file and remove the last two lines:

;              
end;

Save the file using Ctrl-x and you are ready to run RAxML!

The tip of the RAxML iceberg

As with GARLI, RAxML is full of features that we will not have time to explore today. The manual does a nice job of explaining all the features so I recommend reading it if you use RAxML for your own data.

Running RAxML on the cluster

Hopefully, you have created the rbcL50.dat file in your garlirun directory. If not, go ahead and move it there. Then return to your home directory and use pico to create a gorax script file that contains the following:

#$ -o junk2.txt -j y
cd $HOME/garlirun
raxml -p 13579 -N 1 -e 0.00001 -m GTRMIX -s rbcL50.dat -n BASIC

You'll note that this is similar to the gogarli script we created earlier, but it is worth discussing each line before submitting the run to the cluster.

The first line is the same except that we specified junk2.txt rather than junk.txt (this is so that our RAxML run will not try to write to the same file as our GARLI run).

The second line is identical to the second line of our gogarli script. You could, of course, sequester the RAxML results in a different directory if you wanted, but it is safe to use the same folder because none of the RAxML output files will have exactly the same name as any of the GARLI output files.

The third line requires the most explanation. First, RAxML does not use a control file like GARLI, so all options must be specified on the command line when it is invoked. Let's take each option in turn:

-p 13579 provides a pseudorandom number seed to RAxML to use when it generates its starting tree (the p presumably stands for parsimony, which is the optimality criterion it uses to obtain a starting tree). It is a good idea to specify some number here so that you have the option of exactly recreating the analysis later.
-N 1 tells RAxML to just perform one search replicate.
-e 0.00001 sets the precision with which model parameters will be estimated. RAxML will search for better combinations of parameter values until it fails to increase the log-likelihood by more than this amount. Ordinarily, the default value (0.1) is sufficient, but we are making RAxML work harder so that the results are more comparable to GARLI, which does a fairly thorough final parameter optimization.
-m GTRMIX tells RAxML to use the GTR+CAT model for the search, then to switch to the GTR+G for final optimization of parameters (so that the likelihood is comparable to that produced by other programs).
-s rbcL50.dat provides the name of the data file.
-n BASIC supplies a suffix to be appended to all output file names

Start the run by entering this from your home directory (or where ever your gorax file is located):

qsub gorax

Bootstrapping with RAxML

After your first RAxML run finishes (probably within 2 minutes), start a second, longer run to perform bootstrapping. Modify your gorax file as follows:

#$ -o junk2.txt -j y
cd $HOME/garlirun
# raxml -p 13579 -N 1 -e 0.00001 -m GTRMIX -s rbcL50.dat -n BASIC
raxml -f a -x 12345 -p 13579 -N 100 -m GTRCAT -s rbcL50.dat -n FULL

Note that I've used a # character to comment out our previous raxml line. (Feel free to simply delete that line if you wish.) Go ahead and start this run using qsub. This one will take longer, but not as long as you might expect (about 10 minutes). It will conduct a bootstrap analysis involving 100 bootstrap replicates (-N 100) using the GTR+CAT model. The -x 12345 specifies a starting seed for the bootstrap resampling. For every 5 bootstrap replicates performed, RAxML will climb uphill on the original dataset starting from the tree estimated for that bootstrap replicate. This provides a series of searches for the maximum likelihood tree starting from different, but reasonable, starting trees. The -f a on the command line sets up this combination of bootstrapping and ML searching.

Files produced by RAxML

`RAxML_info.FULL`

This file contains some basic information about the run. Use this file to answer these questions:

How long did RAxML require to perform 100 bootstrap replicates?

How much time was spent searching for the ML tree?

How many independent searches for the ML tree were performed?

Which takes longer in RAxML: searching for the ML tree or performing a bootstrap replicate? Why is there a difference?

What is the log-likelihood of the best tree found by RAxML?

`RAxML_bestTree.FULL`

This file holds the best tree found. It is not a NEXUS tree file, but simply a tree description; however, FigTree is able to open such files.

`RAxML_bootstrap.FULL`

This file holds the trees resulting from bootstrapping (also not NEXUS format; one tree description per line). These trees do not have branch lengths. You can open this file in FigTree and use the arrow buttons to move from one to the next.

`RAxML_bipartitions.FULL`

This file contains the best tree with bootstrap support values embedded in the tree description. Load this tree into FigTree. FigTree will ask you what name you want to use for the support values. Pick a name such as "bootstraps" and click Ok. Once the tree is visible, check Node labels on the left, chooose "bootstraps" (or whatever you named them) from the Display list, and increase the font size so you can see it (ok, you are probably young enough that you can still see the numbers without magnification!).

Can you tell by viewing this tree in FigTree that this is not the bootstrap majority-rule consensus tree?

Comparing GARLI, RaxML, and FastTree

To compare the two programs, use the nano editor to create a tree file named combined.nex containing a trees block with the best tree from both programs and a paup block to compute the likelihoods of these three trees under the GTR+G model. Here I've simply inserted ellipses (...) as placeholders for the actual tree descriptions. It may be easier to construct this tree file on your own laptop and then upload it again to the cluster; use whichever approach is most convenient for you.

#nexus

begin paup;
 exe rbcL50.nex;
end;

begin trees;
 utree garli = (...);
 utree raxml = (...);
end;

begin paup;
    set criterion=likelihood;
    lset nst=6 rmatrix=estimate rates=gamma shape=estimate;
    lscores all;
    agree all;
end;

Which tree has the best log-likelihood? You will probably find that GARLI's likelihood is slightly better than RaxML. It is perhaps not too surprising that GARLI is best in this comparison since RAxML did its analysis under the GTR-CAT model instead of the GTR+G model, and thus GARLI was the only one of the three that actually searched under the same criterion by which we evaluated the performance of all three methods. Both of these approaches will give different answers if you run them multiple times under different random number seeds, so you should probably do several replicates (or a bootstrap analysis) before making anything too momentous of the results.

The last command given to PAUP was "agree all". This computes an agreement subtree from the results of the three analyses. Can you figure out from PAUP's output how many taxa (out of the 50 total) it had to omit in order to find an agreement subtree?

@@ Line 112: / Line 112: @@
 This file saves the output that would have been displayed had you been running GARLI on your laptop.
-<div style="background-color: #ddddff">How long did this run take?</div>
+<div style="background-color: #ddddff">How long did this run take?</div> (Search for "Time used" following "Final score")
-<div style="background-color: #ddddff">What was the log-likelihood of the best tree?</div>
+<div style="background-color: #ddddff">What was the log-likelihood of the best tree?</div> (Look for "Replicate 1")
-<div style="background-color: #ddddff">How many "generations" did GARLI do before it decided to stop?</div>
+<div style="background-color: #ddddff">How many "generations" did GARLI do before it decided to stop?</div> (Look at the line just above "Reached termination condition!")
 <div style="background-color: #ddddff">At which generation did it first find the tree that it finally reported?</div>
 <div style="background-color: #ddddff">Assuming each generation takes an equal amount of time, how long did it really take GARLI to find the best tree?</div>