Difference between revisions of "Phylogenetics: Large Scale Maximum Likelihood Analyses"

From EEBedia
Jump to: navigation, search
(Preparing to run qsub)
(Challenge)
 
(88 intermediate revisions by 4 users not shown)
Line 2: Line 2:
 
|-
 
|-
 
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]
 
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span>
+
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span>
 
|-
 
|-
| This lab explores two programs (GARLI and RAxML) designed specifically for maximum likelihood analyses on a large scale (hundreds of taxa). If there is time, we will try a third option, FastTree, which is designed for ML analyses of hundreds of thousands of sequences.
+
|
 
|}
 
|}
 +
== Goals ==
 +
 +
Continue learning anout ML inference and introduce you to two programs (GARLI and RAxML) designed specifically for maximum likelihood analyses on a large scale (hundreds of taxa).
  
 
== The data ==
 
== The data ==
  
[[Image:garliCodon.png|right|500px|GARLI best tree under codon model]] The data that we will use today comprises 50 rbcL sequences from various green plants: 5 green algae, 5 bryophytes (mosses, liverworts, hornworts), 5 ferns, 9 gymnosperms, and 26 angiosperms (flowering plants; including some you will recognize such as oregano, coffee, tea, spinach, poison ivy, rice and chocolate). The tree on the right shows the tree I obtained from GARLI using the GY94 codon model, color-coded so that <span style="background-color: black; color: yellow">bryophytes are yellow</span>, <span style="color: green">ferns are green</span>, <span style="color: blue">gymnosperms are blue</span> and <span style="color: magenta">angiosperms are pink</span>. This history of green plants shows several key innovations: '''embryos''' are what gave the first bryophytes an edge over aquatic algae on land; '''branched sporophytes''' and '''vascular tissues''' allowed the first ferns to grow taller and disperse more spores compared to their bryophyte ancestors; '''seeds''' and '''pollen''' were the big inventions that led to the rise of gymnosperms; and of course '''flowers''' allowed efficient pollination by insects and led the the diversification of the angiosperms.
+
[[Image:garliCodon.png|right|500px|GARLI best tree under codon model]] The data that we will use today comprises 50 rbcL gene sequences from various green plants: 5 green algae, 5 bryophytes (mosses, liverworts, hornworts), 5 ferns, 9 gymnosperms, and 26 angiosperms (flowering plants; including some you will recognize such as oregano, coffee, tea, spinach, poison ivy, rice and chocolate). The tree on the right shows the tree I obtained from GARLI using the GY94 codon model, color-coded so that <span style="background-color: black; color: yellow">bryophytes are yellow</span>, <span style="color: green">ferns are green</span>, <span style="color: blue">gymnosperms are blue</span> and <span style="color: magenta">angiosperms are pink</span>. This history of green plants shows several key innovations: '''embryos''' are what gave the first bryophytes an edge over aquatic algae on land; '''branched sporophytes''' and '''vascular tissues''' allowed the first ferns to grow taller and disperse more spores compared to their bryophyte ancestors; '''seeds''' and '''pollen''' were the big inventions that led to the rise of gymnosperms; and of course '''flowers''' allowed efficient pollination by insects and led the the diversification of the angiosperms.
  
 
== Part A: Starting a GARLI run on the cluster ==
 
== Part A: Starting a GARLI run on the cluster ==
  
[http://code.google.com/p/garli/ GARLI] (Genetic Algorithm for Rapid Likelihood Inference) is a program written by [http://phylo.bio.ku.edu/content/derrick-zwickl Derrick Zwickl] for estimating the phylogeny using maximum likelihood, and is currently one of the best programs to use if you have a large problem (i.e. many taxa). GARLI now (as of version 1.0) gives you considerable choice in substitution models: GTR[+I][+G] or codon models for nucleotides, plus several choices for amino acids. The genetic algorithm (or GA, for short) search strategy used by GARLI is like other heuristic search strategies in that it cannot guarantee that the optimal tree will be found. Thus, as with all heuristic searches, it is a good idea to run GARLI several times (using different pseudorandom number seeds) to see if there is any variation in the estimated tree. By default, GARLI will conduct two independent searches. If you have a multicore processor (newer Intel-based Macs and PCs are duo-core), GARLI can take advantage of this and use all of your CPUs simultaneously.
+
[http://code.google.com/p/garli/ GARLI] (Genetic Algorithm for Rapid Likelihood Inference) is a program written by Derrick Zwickl for estimating the phylogeny using maximum likelihood, and is currently one of the best programs to use if you have a large problem (i.e. many taxa). GARLI now (as of version 2.01) gives you considerable choice in substitution models: GTR[+I][+G] or codon models for nucleotides, plus several choices for amino acids. The genetic algorithm (or GA, for short) search strategy used by GARLI is like other heuristic search strategies in that it cannot guarantee that the optimal tree will be found. Thus, as with all heuristic searches, it is a good idea to run GARLI several times (using different pseudorandom number seeds) to see if there is any variation in the estimated tree. By default, GARLI will conduct two independent searches. If you have a multicore processor (newer Intel-based Macs and PCs are duo-core), GARLI can take advantage of this and use all of your CPUs simultaneously.
  
 
Today you will run GARLI on the cluster for a dataset with 50 taxa. This is not a particularly large problem, but has the advantage that you will be able to analyze it several times using both GARLI and RAxML within a lab period. Instead of each of us running GARLI several times, we will each run it once and compare notes at the end of the lab.
 
Today you will run GARLI on the cluster for a dataset with 50 taxa. This is not a particularly large problem, but has the advantage that you will be able to analyze it several times using both GARLI and RAxML within a lab period. Instead of each of us running GARLI several times, we will each run it once and compare notes at the end of the lab.
Line 23: Line 26:
 
==== Obtain a copy of the control file ====
 
==== Obtain a copy of the control file ====
  
The first step is to obtain a copy of the GARLI default control file. Go to the GARLI download page and download a version of GARLI appropriate for your platform (Mac or Windows). For now, the only reason you are downloading GARLI is to obtain a copy of the default control file. However, because GARLI is multithreaded, you may find that it is faster to run it on your multi-core laptop than on the cluster. Running on the cluster has advantages, however, even if it is slower. For one, if you have a truly large data set, using the cluster means that your laptop is not tied up for hours.
+
The first step is to obtain a copy of the GARLI default control file. Go to the GARLI download page and download a version of GARLI appropriate for your platform (Mac or Windows). The only reason you are downloading GARLI is to obtain a copy of the default control file so don't actually install GARLI on your computer; you should use the cluster for all your GARLI runs.
  
Once you have downloaded and unpacked GARLI on your computer, copy the <tt>garli.conf.nuc.defaultSettings</tt> to a file named simply <tt>garli.conf</tt> and open it in your text editor.
+
Once you have downloaded and unpacked GARLI on your computer, make a copy of the <tt>garli.conf.nuc.defaultSettings</tt> file and rename the copy <tt>garli.conf</tt>, then open it in your text editor.
  
 
==== Editing garli.conf ====
 
==== Editing garli.conf ====
Line 36: Line 39:
 
'''Specify the prefix for output files'''
 
'''Specify the prefix for output files'''
 
  ofprefix = 50
 
  ofprefix = 50
The ofprefix is used by GARLI to begin the name of all output files. I usually use something ''different'' than the data file name here. That way, if you eventually want to delete all of the various files that GARLI creates, you can just say <tt>rm -f 50*</tt> without wiping out your data file as well! (Sounds like the voice of experience, doesn't it?!)
+
The ofprefix is used by GARLI to begin the name of all output files. I usually use something ''different'' than the data file name here. That way, if you eventually want to delete all of the various files that GARLI creates, you can just say <tt>rm -f 50*</tt> ("remove all files beginning "50", the -f means "force" i.e. don't ask if it is ok, just do it) without wiping out your data file as well! (Sounds like the voice of experience, doesn't it?!)
 +
 
 +
'''Do only one search replicate'''
 +
searchreps = 1
  
 
'''Specify no invariable sites'''  
 
'''Specify no invariable sites'''  
 
  invariantsites = none
 
  invariantsites = none
 
This will cause GARLI to use the GTR+G model rather than the GTR+I+G model, which will facilitate comparisons with RAxML.
 
This will cause GARLI to use the GTR+G model rather than the GTR+I+G model, which will facilitate comparisons with RAxML.
 
'''Do only one search replicate'''
 
searchreps = 1
 
  
 
Save the <tt>garli.conf</tt> file when you have made these changes.
 
Save the <tt>garli.conf</tt> file when you have made these changes.
Line 49: Line 52:
 
=== The tip of the GARLI iceberg ===
 
=== The tip of the GARLI iceberg ===
  
As you can see from the number of entries in the control file, we are not going to learn all there is to know about GARLI in one lab session. One major omission is any discussion about bootstrapping, which is very easy to do in GARLI: just set <tt>bootstrapreps</tt> to some number other than 0 (e.g. 100) in your <tt>garli.conf</tt> file. I encourage you to download and read the excellent GARLI manual, especially if you want to use amino acid or codon models.
+
As you can see from the number of entries in the control file, we are not going to learn all there is to know about GARLI in one lab session. One major omission is any discussion about bootstrapping, which is very easy to do in GARLI: just set <tt>bootstrapreps</tt> to some number other than 0 (e.g. 100) in your <tt>garli.conf</tt> file ('''but don't do this today'''). I encourage you to download and read the excellent GARLI manual, especially if you want to use amino acid or codon models.
  
 
=== Log into the cluster ===
 
=== Log into the cluster ===
  
 
Log into the cluster using the command:
 
Log into the cluster using the command:
  ssh bbcxsrv1.biotech.uconn.edu
+
  ssh bbcsrv3.biotech.uconn.edu
 
Go back to the [[Phylogenetics: Bioinformatics Cluster]] lab if you've forgotten some details.
 
Go back to the [[Phylogenetics: Bioinformatics Cluster]] lab if you've forgotten some details.
  
 
=== Create a folder and a script for the run ===
 
=== Create a folder and a script for the run ===
Create a directory named <tt>garlirun</tt> inside your home directory and use your favorite file transfer method (scp, psftp, Fugu, FileZilla, etc.) to get <tt>garli.conf</tt> into that directory.
+
Create a directory named <tt>garliraxml</tt> inside your home directory and use your favorite file transfer method (scp, psftp, Fugu, FileZilla, etc.) to get <tt>garli.conf</tt> into that directory.
  
Now download the data file into the <tt>garlirun</tt> directory:
+
Now download the data file into the <tt>garliraxml</tt> directory:
  curl http://hydrodictyon.eeb.uconn.edu/eeb5349/rbcL50.nex > rbcL50.nex
+
  curl http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/rbcL50.nex > rbcL50.nex
  
Finally, create the script file you will hand to the <tt>qsub</tt> command to start the run. Use the pico (a.k.a. nano) editor to create a file named <tt>gogarli</tt> in your home directory with the following contents:
+
Finally, create the script file you will hand to the <tt>qsub</tt> command to start the run. Use the nano editor to create a file named <tt>gogarli</tt> inside the <tt>garliraxml</tt> directory with the following contents:
  #$ -o junk.txt -j y
+
#$ -cwd
  cd $HOME/garlirun
+
#$ -S /bin/bash
 +
  #$ -o out.txt
 +
#$ -e err.txt
 +
  #$ -m ea
 +
#$ -M your.name@uconn.edu
 +
#$ -N jobname
 
  garli garli.conf
 
  garli garli.conf
 +
 +
The lines starting with <tt>#$</tt> represent commands that qsub understands. Any line starting with a hash (#) is interpreted as a comment by all linux/unix interpreters, but the extra dollar sign ($) immediately after the hash is used by the qsub program as a flag, telling it that what follows should be interpreted as a command. All of these could be entered on the qsub command line as well, but it is easy to forget something unless you put it in a script like this.
 +
 +
The command <tt>#$ -cwd</tt> tells qsub to put output files in the current working directory, the directory in which it was invoked (which will be the <tt>garliraxml</tt> directory).
 +
 +
The command <tt>#$ -S /bin/bash</tt> tells qsub to use the bash program to interpret the script. You have already been using the bash program; it interprets the commands you type when you are logged into the cluster. This qsub command just tells qsub to use the same program to interpret commands in the <tt>gogarli</tt> file.
 +
 +
The command <tt>#$ -o out.txt</tt> tells qsub to save any output into the file <tt>out.txt</tt>.
 +
 +
The command <tt>#$ -e err.txt</tt> tells qsub to save any error messages into the file <tt>err.txt</tt>.
 +
 +
The command <tt>#$ -m ea</tt> tells qsub to send you an email when your job ends (e) or aborts (a). The command <tt>#$ -M your.name@uconn.edu</tt> tells qsub what your email address is (be sure to specify your own email address here). This is an extremely helpful feature! This feature may not work if you provide a non-uconn email address (feel free to try your gmail address and let me know if it works).
 +
 +
Finally, the -N command provides a job name that qsub will use to identify your job. This is entirely optional, but I find it very helpful to give short names to my runs so that when I get the email I know which run has finished (most useful if you have several runs going simultaneously). Be sure to replace "jobname" with something more meaningful (keep your job name simple - no embedded spaces or punctuation, and you will find that it is better to use 6 characters or fewer, as longer job names get truncated in qstat output).
 +
 +
The last line starts garli.
  
 
=== Submit the job ===
 
=== Submit the job ===
Line 72: Line 96:
 
Here is the command to start the job:
 
Here is the command to start the job:
 
  qsub gogarli
 
  qsub gogarli
You should issue this command from your home directory, or where ever you saved the <tt>gogarli</tt> file.
+
You should issue this command from inside the <tt>garliraxml</tt> directory, which should contain 3 files: <tt>gogarli</tt>, <tt>rbcL50.nex</tt> and <tt>garli.conf</tt>.
  
Check progress every few minutes using the [http://137.99.46.187/wiki/index.php/Qstat qstat] command. This run will take about 10 minutes. If you get bored, you can <tt>cd</tt> into the <tt>garlirun</tt> directory and use this command to see the tail end of the log file that GARLI creates automatically:
+
Check progress every few minutes using the [http://137.99.46.187/wiki/index.php/Qstat qstat] command. This run will take about 4 minutes. If you get bored, you can use this command to see the tail end of the log file that GARLI creates automatically:
 
  tail 50.log00.log
 
  tail 50.log00.log
The <tt>tail</tt> command is like the <tt>cat</tt> command except that it only shows you the last few lines of the file (which often is just what you need).
+
The <tt>tail</tt> command is like the <tt>cat</tt> command except that it only shows you the last few lines of the file (which often is just what you need). If you need to see more lines at the end  of the file, you can specify the number using the -n option:
 +
tail -n 100 50.log00.log
 +
The above command will show the last 100 lines.
 
<!--  
 
<!--  
 
=== Mailing the tree to yourself ===
 
=== Mailing the tree to yourself ===
  
After GARLI has finished, you should download the tree file (50.best.tre) using the PSFTP get command, but here is another handy trick: you can email the tree to yourself using this command (issue this from within the garlirun directory where the tree file is located):
+
After GARLI has finished, you should download the tree file (50.best.tre) using the PSFTP get command, but here is another handy trick: you can email the tree to yourself using this command (issue this from within the garliraxml directory where the tree file is located):
 
  mail paul.lewis@uconn.edu < 50.best.tre
 
  mail paul.lewis@uconn.edu < 50.best.tre
 
This command will send mail to paul.lewis@uconn.edu, and the body of the email message will come from the file 50.best.tre!
 
This command will send mail to paul.lewis@uconn.edu, and the body of the email message will come from the file 50.best.tre!
Line 87: Line 113:
 
=== Files produced by GARLI ===
 
=== Files produced by GARLI ===
  
After your run finishes, you should find these files in your <tt>garlirun</tt> folder. Download them to your laptop and view them to answer the questions:
+
After your run finishes, you should find these files in your <tt>garliraxml</tt> folder. Download them to your laptop and view them to answer the questions:
  
 
==== <tt>50.screen.log</tt> ====
 
==== <tt>50.screen.log</tt> ====
This file saves the output that would have been displayed had you been running GARLI on your laptop.  
+
This file saves the output that would have been displayed had you been running GARLI on your laptop. (Hover over the phrase "Paul's result" below to see what values I saw in my 50.screen.log file; your run may differ because we all started with a different seed.)
  
<div style="background-color: #ddddff">How long did this run take?</div>
+
<div style="background-color: #ddddff">How long did this run take? (Search for "Time used" following "Final score") {{title|3 minutes and 40 seconds|Paul's result}}</div>
<div style="background-color: #ddddff">What was the log-likelihood of the best tree?</div>
+
<div style="background-color: #ddddff">What was the log-likelihood of the best tree? (Look for "Final score") {{title|-18219.5863|Paul's result}}</div>
<div style="background-color: #ddddff">How many "generations" did GARLI do before it decided to stop?</div>
+
<div style="background-color: #ddddff">How many "generations" did GARLI do before it decided to stop? (Look at the line just above "Reached termination condition!") {{title|26200|Paul's result}}</div>
<div style="background-color: #ddddff">At which generation did it first find the tree that it finally reported?</div>
+
<div style="background-color: #ddddff">At which generation did it first find the tree that it finally reported? (Look at the last value in the line you used to answer the previous question.) {{title|3020|Paul's result}}</div>
<div style="background-color: #ddddff">Assuming each generation takes an equal amount of time, how long did it really take GARLI to find the best tree?</div>
+
<div style="background-color: #ddddff">Assuming each generation takes an equal amount of time, how long did it really take GARLI to find the best tree? {{title|(3020)(220)/26200 equals 25.4 seconds|Paul's result}}</div>
<div style="background-color: #ddddff">Looking at the model report at the end, what two categories of substitution occur at the fastest rate?</div>
+
<div style="background-color: #ddddff">Looking at the model report at the end, what two categories of substitution occur at the fastest rate? {{title|the transitions, A to G and C to T|Paul's result}}</div>
<div style="background-color: #ddddff">Also, is there a lot of rate heterogeneity in this data set?</div>
+
<div style="background-color: #ddddff">Also, is there a lot of rate heterogeneity in this data set? {{title|yes, gamma shape parameter is 0.2732|Paul's result}}</div>
<div style="background-color: #ddddff">What is the ratio of the fastest relative rate to the slowest?</div>
+
<div style="background-color: #ddddff">What is the ratio of the fastest relative rate to the slowest? {{title|9.2|Paul's result}}</div>
  
 
==== <tt>50.log00.log</tt> ====
 
==== <tt>50.log00.log</tt> ====
Line 107: Line 133:
 
==== <tt>50.best.tre</tt> ====
 
==== <tt>50.best.tre</tt> ====
  
This is a NEXUS tree file that can be opened in FigTree, TreeView, PAUP*, or a number of other phylogenetic programs. Try using [http://tree.bio.ed.ac.uk/software/figtree/ FigTree] to open it. The best place to root it is on the branch leading to ''Nephroselmis''. In FigTree, click this branch and use the ''Reroot'' tool to change the rooting. I also find that trees look better if you click the ''Order nodes'' checkbox, which is inside the ''Trees'' tab on the left side panel of FigTree.
+
This is a NEXUS tree file that can be opened in FigTree, TreeView, PAUP*, or a number of other phylogenetic programs. Try using [http://tree.bio.ed.ac.uk/software/figtree/ FigTree] to open it. The best place to root it is on the branch leading to ''Nephroselmis'' (use the search box near the top of the FigTree window to quickly find taxa). In FigTree, click this branch and use the ''Reroot'' tool to change the rooting. I also find that trees look better if you click the ''Order nodes'' checkbox, which is inside the ''Trees'' tab on the left side panel of FigTree.
  
 
== Part B: Starting a RAxML run on the cluster ==
 
== Part B: Starting a RAxML run on the cluster ==
  
Another excellent ML program for large problems is [http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm RAxML], written by [http://wwwkramer.in.tum.de/exelixis/ Alexandros Stamatakis]. This program is exceptionally fast, and has been used to estimate maximum likelihood trees for 25,000 taxa! Let's run RAxML on the same data as GARLI and compare results.
+
Another excellent ML program for large problems is [http://www.exelixis-lab.org/ RAxML], written by Alexandros Stamatakis. This program is exceptionally fast, and has been used to estimate maximum likelihood trees for 25,000 taxa! Let's run RAxML on the same data as GARLI and compare results.
  
 
=== Preparing the data file ===
 
=== Preparing the data file ===
  
While GARLI reads NEXUS files, RAxML uses a simpler format. It is easy to use the pico editor to make the necessary changes, however. First, make a copy of your <tt>rbcL50.nex</tt> file:
+
While GARLI reads NEXUS files, RAxML uses a simpler format. It is easy to use the nano editor to make the necessary changes, however. First, '''make a copy''' of your <tt>rbcL50.nex</tt> file:
 
  cp rbcL50.nex rbcL50.dat
 
  cp rbcL50.nex rbcL50.dat
  
Open <tt>rbcL50.dat</tt> in pico and use Ctrl-k repeatedly to remove these initial lines:
+
Open <tt>rbcL50.dat</tt> in nano and use Ctrl-k repeatedly to remove these initial lines before the taxa and their sequences:
 
  #nexus  
 
  #nexus  
 
   
 
   
Line 129: Line 155:
 
  50 1314
 
  50 1314
  
Now use the down arrow to go to the end of the file and remove the last two lines:
+
Now use the down arrow to go to the end of the file and remove the last two lines (again using Ctrl-k):
 
  ;               
 
  ;               
 
  end;
 
  end;
Line 137: Line 163:
 
=== The tip of the RAxML iceberg ===
 
=== The tip of the RAxML iceberg ===
  
As with GARLI, RAxML is full of features that we will not have time to explore today. The [http://icwww.epfl.ch/~stamatak/index-Dateien/countManual7.0.4.php manual] does a nice job of explaining all the features so I recommend reading it if you use RAxML for your own data.
+
As with GARLI, RAxML is full of features that we will not have time to explore today. Try typing <tt>raxml -h</tt> at the linux command prompt to see a brief description of all the available options. The [http://sco.h-its.org/exelixis/php/countManualNew.php manual] does a nice job of explaining all the features so I recommend reading it if you use RAxML for your own data.
  
 
=== Running RAxML on the cluster ===
 
=== Running RAxML on the cluster ===
  
Hopefully, you have created the <tt>rbcL50.dat</tt> file in your <tt>garlirun</tt> directory. If not, go ahead and move it there. Then return to your home directory and use pico to create a <tt>gorax</tt> script file that contains the following:
+
Hopefully, you have created the <tt>rbcL50.dat</tt> file in your <tt>garliraxml</tt> directory. If not, go ahead and move it there. Then use nano to create a <tt>gorax</tt> script file that contains the following:
  #$ -o junk2.txt -j y
+
#$ -cwd
  cd $HOME/garlirun
+
#$ -S /bin/bash
  raxml -p 13579 -N 1 -e 0.00001 -m GTRMIX -s rbcL50.dat -n BASIC
+
  #$ -o raxout.txt
 +
#$ -e raxerr.txt
 +
  #$ -m ea
 +
#$ -M your.name@uconn.edu
 +
#$ -N jobname
 +
  raxml -p 13579 -N 1 -e 0.00001 -m GTRCAT -s rbcL50.dat -n BASIC
  
You'll note that this is similar to the <tt>gogarli</tt> script we created earlier, but it is worth discussing each line before submitting the run to the cluster.  
+
You'll note that this is similar to the <tt>gogarli</tt> script we created earlier, but please read the explanation below '''before''' submitting the run to the cluster.  
  
The first line is the same except that we specified <tt>junk2.txt</tt> rather than <tt>junk.txt</tt> (this is so that our RAxML run will not try to write to the same file as our GARLI run).
+
The qsub command lines are the same except that we specified <tt>raxout.txt</tt> rather than <tt>out.txt</tt> and  <tt>raxerr.txt</tt> rather than <tt>err.txt</tt> (this is so that our RAxML run will not overwrite the files generated by our GARLI run).
  
The second line is identical to the second line of our <tt>gogarli</tt> script. You could, of course, sequester the RAxML results in a different directory if you wanted, but it is safe to use the same folder because none of the RAxML output files will have exactly the same name as any of the GARLI output files.
+
The last line invokes raxml and requires the most explanation. First, RAxML does not use a control file like GARLI, so all options must be specified on the command line when it is invoked. Let's take each option in turn:
 
+
* '''-p 13579''' provides a pseudorandom number seed to RAxML to use when it generates its starting tree. It is a good idea to specify some number here so that you have the option of exactly recreating the analysis later. By the way, the "p" stands for parsimony, as this seed is used to generate the random addition sequence for constructing a starting tree using parsimony.
The third line requires the most explanation. First, RAxML does not use a control file like GARLI, so all options must be specified on the command line when it is invoked. Let's take each option in turn:
+
* '''-p 13579''' provides a pseudorandom number seed to RAxML to use when it generates its starting tree (the p presumably stands for parsimony, which is the optimality criterion it uses to obtain a starting tree). It is a good idea to specify some number here so that you have the option of exactly recreating the analysis later.
+
 
* '''-N 1''' tells RAxML to just perform one search replicate.
 
* '''-N 1''' tells RAxML to just perform one search replicate.
 
* '''-e 0.00001''' sets the precision with which model parameters will be estimated. RAxML will search for better combinations of parameter values until it fails to increase the log-likelihood by more than this amount. Ordinarily, the default value (0.1) is sufficient, but we are making RAxML work harder so that the results are more comparable to GARLI, which does a fairly thorough final parameter optimization.
 
* '''-e 0.00001''' sets the precision with which model parameters will be estimated. RAxML will search for better combinations of parameter values until it fails to increase the log-likelihood by more than this amount. Ordinarily, the default value (0.1) is sufficient, but we are making RAxML work harder so that the results are more comparable to GARLI, which does a fairly thorough final parameter optimization.
* '''-m GTRMIX''' tells RAxML to use the GTR+CAT model for the search, then to switch to the GTR+G for final optimization of parameters (so that the likelihood is comparable to that produced by other programs).
+
* '''-m GTRCAT''' tells RAxML to use the GTR+CAT model for the search. I will explain in lecture how the CAT model works in RAxML - it is essentially a sites-specific-rates model for rate heterogeneity, but derives the categories and relative rates from the data set itself before the search begins.
 
* '''-s rbcL50.dat''' provides the name of the data file.
 
* '''-s rbcL50.dat''' provides the name of the data file.
 
* '''-n BASIC''' supplies a suffix to be appended to all output file names
 
* '''-n BASIC''' supplies a suffix to be appended to all output file names
  
Start the run by entering this from your home directory (or where ever your <tt>gorax</tt> file is located):
+
Start the run by entering this from within the <tt>garliraxml</tt> directory (where your <tt>rbcL50.dat</tt> and <tt>gorax</tt> files are located):
 
  qsub gorax
 
  qsub gorax
  
 
=== Bootstrapping with RAxML ===
 
=== Bootstrapping with RAxML ===
  
After your first RAxML run finishes (probably within 2 minutes), start a second, longer run to perform bootstrapping. Modify your <tt>gorax</tt> file as follows:
+
After your first RAxML run finishes (probably within a minute), start a second, longer run to perform bootstrapping. Modify the last line of your <tt>gorax</tt> file as follows:
#$ -o junk2.txt -j y
+
  # raxml -p 13579 -N 1 -e 0.00001 -m GTRCAT -s rbcL50.dat -n BASIC
cd $HOME/garlirun
+
  # raxml -p 13579 -N 1 -e 0.00001 -m GTRMIX -s rbcL50.dat -n BASIC
+
 
  raxml -f a -x 12345 -p 13579 -N 100 -m GTRCAT -s rbcL50.dat -n FULL
 
  raxml -f a -x 12345 -p 13579 -N 100 -m GTRCAT -s rbcL50.dat -n FULL
  
Line 178: Line 205:
  
 
This file contains some basic information about the run. Use this file to answer these questions:
 
This file contains some basic information about the run. Use this file to answer these questions:
<div style="background-color: #ddddff">How long did RAxML require to perform 100 bootstrap replicates?</div>
+
<div style="background-color: #ddddff">How long did RAxML require to perform 100 bootstrap replicates? {{title|116.512911 seconds|Paul's results}}</div>
<div style="background-color: #ddddff">How much time was spent searching for the ML tree?</div>
+
<div style="background-color: #ddddff">What is the log-likelihood of the best tree found by RAxML? {{title|-18329.025632|Paul's results}}</div>
<div style="background-color: #ddddff">How many independent searches for the ML tree were performed?</div>
+
<div style="background-color: #ddddff">Which takes longer in RAxML: searching for the ML tree or performing a bootstrap replicate? Why is there a difference?</div>
+
<div style="background-color: #ddddff">What is the log-likelihood of the best tree found by RAxML?</div>
+
  
 
==== <tt>RAxML_bestTree.FULL</tt> ====
 
==== <tt>RAxML_bestTree.FULL</tt> ====
Line 196: Line 220:
 
This file contains the best tree with bootstrap support values embedded in the tree description. Load this tree into FigTree. FigTree will ask you what name you want to use for the support values. Pick a name such as "bootstraps" and click Ok. Once the tree is visible, check ''Node labels'' on the left, chooose "bootstraps" (or whatever you named them) from the ''Display'' list, and increase the font size so you can see it (ok, you are probably young enough that you can still see the numbers without magnification!).
 
This file contains the best tree with bootstrap support values embedded in the tree description. Load this tree into FigTree. FigTree will ask you what name you want to use for the support values. Pick a name such as "bootstraps" and click Ok. Once the tree is visible, check ''Node labels'' on the left, chooose "bootstraps" (or whatever you named them) from the ''Display'' list, and increase the font size so you can see it (ok, you are probably young enough that you can still see the numbers without magnification!).
  
<div style="background-color: #ddddff">Can you tell by viewing this tree in FigTree that this is ''not'' the bootstrap majority-rule consensus tree?</div>
+
<div style="background-color: #ddddff">Can you tell by viewing this tree in FigTree that this is ''not'' the bootstrap majority-rule consensus tree? What evidence did you use? {{title|some support values are less than 50, which would not be the case for a majority rule tree|answer}}</div>
  
== Part C: Running FastTree on the cluster ==
+
We could have instructed RAxML to create the majority rule tree using the <tt>"­-J MR</tt> option, but by default it maps bootstrap support values onto the splits of the maximum likelihood tree topology.
  
[http://www.microbesonline.org/fasttree/ FastTree] is probably your best option if you have truly huge data sets. RaxML and/or GARLI are probably better options (more accurate) for data sets of only hundreds to thousands of taxa, but if you have hundreds of thousands of sequences, GARLI and RaxML will be slower than FastTree. FastTree is provided ready to run for Windows and Linux, but for Macs, you will need to compile it yourself. The cluster we've been using is composed of Macs, so your first step will be to download and compile FastTree. There is a lot of phylogenetic software out there, and much of it is available as source code only. Thus, this provides an opportunity to learn how to compile such programs yourself if you do not already know how to do this.
+
<!--
 +
== Computing the RAxML majority rule consensus tree using PAUP* ==
  
=== Downloading FastTree ===
+
RAxML performed bootstrapping but did not provide a majority rule consensus tree. Although RAxML could have been asked to create the majority rule tree (using the <tt>"­-J MR</tt> option), I did not have you do this so that I could show you how to do it in PAUP (this is a useful thing to know how to do). RAxML created the file <tt>RAxML_bootstrap.FULL</tt> containing the best tree from each of the 100 bootstrap replicates. Your goal in this section is to get PAUP to create the majority rule consensus tree from this file. To do this, create a nexus file named <tt>majrule.nex</tt> containing this text:
Begin by downloading the FastTree.c file to your home directory on <tt>bbcxsrv1.biotech.uconn.edu</tt>:
+
  #nexus
  curl http://www.microbesonline.org/fasttree/FastTree.c > FastTree.c
+
The <tt>curl</tt> command copies a file from a web site to your terminal, and we are using the operator <tt>></tt> to redirect the output of curl to a file named <tt>FastTree.c</tt>.  
+
begin paup;
 +
  exe rbcL50.nex;
 +
  gettrees file=RAxML_bootstrap.FULL;
 +
  contree all / nostrict majrule treefile=raxmlmajrule.tre;
 +
end;
 +
PAUP requires that you read in the data file before reading in the trees (so that it has a list of valid taxon names), hence the initial <tt>exe</tt> command. The <tt>gettrees</tt> command should be straightforward: this just tells PAUP to read in all the trees in the file <tt>RAxML_bootstrap.FULL</tt>. The last <tt>contree</tt> command instructs PAUP to create a majority rule consensus tree (<tt>majrule</tt>) from the trees it now has in memory, but not to create the strict consensus tree (<tt>nostrict</tt>) that it would ordinarily do by default. Finally, the <tt>treefile</tt> command provides the name of a file in which to save the majority rule consensus tree. As always, if you want to see all the options for a command like <tt>contree</tt>, just put a question mark after the command name in PAUP (and use the <tt>help</tt> command to get a list of all commands):
 +
paup> contree ?;
 +
 +
Usage: ConTree [tree-list] [/ options...] ;
 +
 +
Available options:
 +
 +
Keyword ------- Option type --------------------- Current setting ----------
 +
strict          no|yes                            no
 +
semistrict      no|yes                            no
 +
adams          no|yes                            no
 +
majRule        no|yes                            yes
 +
percent        <integer-value>                  50
 +
LE50            no|yes                            no
 +
useTreeWts      no|yes                            no
 +
grpFreq        no|yes                            yes
 +
indices        no|yes                            no
 +
showTree        no|yes                            yes
 +
treeFile        <tree-file-name>                  <none>
 +
saveSupport    no|brlens|nodeLabels|Both        nodeLabels
 +
replace        no|yes                          *no
 +
append          no|yes                          *no
 +
tCompress      no|yes                            no
 +
rootMethod      outgroup|lundberg|midpoint        outgroup
 +
outRoot        polytomy|paraphyl|monophyl        polytomy
 +
forcePolyt      no|yes                            no
 +
                                                  *Option is nonpersistent
 +
Run the <tt>majrule.nex</tt> in PAUP as follows (no need to qlogin for this, but do exit PAUP if you it is still running):
 +
paup majrule.nex
 +
and note that the majority rule tree was saved in the file <tt>raxmlmajrule.tre</tt>.
 +
-->
  
=== About compilers ===
+
== Comparing GARLI and RAxML Using PAUP* ==
To create an executable file, you need to run a C compiler. Compilers translate source code (in this case written in the computer language C) to binary (i.e. machine language). The Gnu compiler <tt>gcc</tt> is ubiquitous on all platforms except Windows, and it is present on the cluster as well. To check, type
+
which gcc
+
The <tt>which gcc</tt> command shows you what command (computer programs are called commands in unix), if any, would be run if you typed <tt>gcc</tt> at the unix prompt. The fact that the <tt>which gcc</tt> command yields "/usr/bin/gcc" means that the <tt>gcc</tt> program exists. Try typing <tt>which doofus</tt> to see what response is provided when a program does not exist.
+
  
If you want to try this on your Mac, you may need to install the [http://developer.apple.com/technologies/tools/ Developer Tools] in order to get gcc (try the which trick to see if it is already available on your mac). To compile under Windows, you will need to install either the [http://software.intel.com/en-us/articles/intel-compilers/ Intel compiler] (which will cost you something) or [http://www.microsoft.com/visualstudio/en-us/products/2010-editions/express Windows Visual Studio Express] (free). For today, we will just compile on the cluster, which is normally what you will want to do because that will allow you to run your analyses on the cluster rather than tying up your own computer.
+
To compare the two programs, use the nano editor to create a tree file named <tt>combined.nex</tt> and specify the best tree from both programs and a paup block to compute the likelihoods of these two trees under the GTR+G model. Your <tt>combined.nex</tt> command block should contain the following:
 +
 +
  #nexus
 +
        begin paup;
 +
        exe rbcL50.nex;
 +
        gettrees file=RAxML_bestTree.FULL;
 +
        deroot;
 +
        gettrees file=50.best.tre mode = 7;
 +
        set criterion=likelihood;
 +
        lset nst=6 rmatrix=estimate rates=gamma shape=estimate;
 +
        lscores all;
 +
        agree all;
 +
    end;
  
=== Compiling FastTree.c ===
+
Here, <tt>gettrees</tt> command reads the two tree files (RAxML_bestTree.FULL, and 50.best2.tre) you generated using the programs RaxML and Garli. The <tt>mode=7</tt> tells PAUP to keep the RaxML tree (RAxML_bestTree.FULL) in the memory while loading the Garli tree (50.best.tre). You should have two tree files and a data file (rbcL50.nex) in the same folder to execute combined.nex. The comment [&U] in the Garli tree file tells PAUP that the tree is unrooted. Therefore, to compare it with RAxML tree (which does not contain any rooting information in the file) you can add the command <tt>deroot</tt> (after gettrees file=RAxML_bestTree.FULL;).
To compile, follow the directions on the FastTree web site by typing the following at the prompt:
+
gcc -lm -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c
+
While gcc is working, read the following breakdown of the options we've given it:
+
* '''-lm''' tells gcc that it should link in the math library (this will be necessary for any program that does anything remotely resembling math, such as using the log or exp functions)
+
* '''-O3''' tells gcc to use the highest level of optimization (more than -O1 or -O2, for example)
+
* '''-finline-functions''' tells gcc to try to inline as many functions as possible. ''Inlining'' a function involves replacing calls to the function with the code for that function, which saves the overhead of making the function call (small but adds up if a function is called many times)
+
* '''-funroll-loops''' tells gcc to unrolll for loops if possible. ''Unrolling'' a loop means replacing a loop over, say, the four bases with four separate pieces of code, one for each of the four bases. This allows compiler optimizations that would not otherwise be possible.
+
* '''-Wall''' tells gcc to show all warnings
+
* '''-o FastTree''' tells gcc to name the resulting executable file "FastTree"
+
* '''FastTree.c''' tells gcc that the source code is in the file "FastTree.c"
+
  
=== Preparing to run qsub ===
+
Which tree has the best log-likelihood? You will probably find that GARLI's likelihood is slightly different than RAxML, but both will be nearly the same value. Both of these approaches will give different answers if you run them multiple times under different random number seeds, so you should probably do several replicates (or a bootstrap analysis) before making anything too momentous of the results.
  
If you have created FastTree in your home directory, create a new directory named <tt>ftree</tt> and move the <tt>FastTree</tt> executable into it. Also, copy the <tt>rbcL50.dat</tt> file into the fasttree directory as well.
+
The last command given to PAUP was "agree all". This computes an agreement subtree from the results of the two analyses.  
cd $HOME
+
mkdir ftree
+
mv FastTree ftree
+
cp garlirun/rbcL50.dat ftree
+
  
Now create a qsub script named <tt>goftree</tt> using pico:
+
<div style="background-color: #ddddff">Can you figure out from PAUP's output how many taxa (out of the 50 total) it had to omit in order to find an agreement subtree? {{title|5 taxa were deleted, 45 taxa are in the agreement subtree|answer}}</div>
#$ -o junk3.txt -j y
+
cd $HOME/ftree
+
./FastTree -gtr -nt -gamma rbcL50.dat > output.txt
+
  
Before you run this file, there are a few points you should note. Note first the "./" before the name of the executable file. This is often necessary when you are invoking a program you compiled yourself. The problem is that the system does not look in the current directory for programs! Try typing <tt>which FastTree</tt> while inside the <tt>ftree</tt> directory for example; the system will respond "FastTree: Command not found." -- it doesn't see the FastTree executable even though it is right under its nose! By prefacing the name of the executable by "./", you are explicitly telling it to use the FastTree executable inside the current directory. Try typing this now just to check that this works:
+
== Challenge ==
./FastTree
+
Modify your <tt>combined.nex</tt> file above to estimate parameters under the following models. Create a table like that below and fill in the maximized log-likelihoods and parameter estimates reported by PAUP*. Note that you can specify <tt>lscores 1</tt> to use just the RAxML tree (no need to carry out this exercise on both trees).
FastTree will run, but since you haven't given it anything to work on, it simply spits out a list of command line options.
+
{|border=1 cellpadding=8
 
+
| Model || maximized log-likelihood || parameter estimates
Note also that I have asked you to add the <tt>-gamma</tt> option. This will slow down FastTree, but will allow us to compare log-likelihood values with GARLI and RaxML.
+
|-
 +
| GTR    ||  || (nothing here)
 +
|-
 +
| GTR+I ||    || pinvar=?
 +
|-
 +
| GTR+G ||  || shape=?
 +
|-
 +
| GTR+I+G ||  || pinvar=?,shape=?
 +
|-
 +
| GTR+SS3 ||  || firstpos=?, secondpos=?, thirdpos=?
 +
|-
 +
| GTR+SS2 ||  || firstsecondpos=?, thirdpos=?
 +
|}
 +
The first 4 models are straightforward and you can refer to the previous lab (Likelihood) to learn how to estimate pinvar and shape if you've forgotten. The SS3 and SS2 stand for "site-specific rates (each of the 3 codon positions is a separate category)" and "site-specific rates (1+2 codon positions in one category, 3rd codon positions in the second category)". Search the Frequently Asked Questions page at [http://paup.phylosolutions.com/ the PAUP web site] for the term "site-specific" to find out how to set up PAUP* to perform a site-specific rates analysis.  
  
Go ahead and submit your qsub script:
+
<div style="background-color: #ddddff">Which rate heterogeneity submodels provide the largest boost to the likelihood?</div>
qsub goftree
+
<div style="background-color: #ddddff">Why do pinvar and shape change the way they do when estimated jointly versus separately?</div>
  
 
[[Category:Phylogenetics]]
 
[[Category:Phylogenetics]]

Latest revision as of 19:32, 9 February 2018

Adiantum.png EEB 5349: Phylogenetics

Goals

Continue learning anout ML inference and introduce you to two programs (GARLI and RAxML) designed specifically for maximum likelihood analyses on a large scale (hundreds of taxa).

The data

GARLI best tree under codon model
The data that we will use today comprises 50 rbcL gene sequences from various green plants: 5 green algae, 5 bryophytes (mosses, liverworts, hornworts), 5 ferns, 9 gymnosperms, and 26 angiosperms (flowering plants; including some you will recognize such as oregano, coffee, tea, spinach, poison ivy, rice and chocolate). The tree on the right shows the tree I obtained from GARLI using the GY94 codon model, color-coded so that bryophytes are yellow, ferns are green, gymnosperms are blue and angiosperms are pink. This history of green plants shows several key innovations: embryos are what gave the first bryophytes an edge over aquatic algae on land; branched sporophytes and vascular tissues allowed the first ferns to grow taller and disperse more spores compared to their bryophyte ancestors; seeds and pollen were the big inventions that led to the rise of gymnosperms; and of course flowers allowed efficient pollination by insects and led the the diversification of the angiosperms.

Part A: Starting a GARLI run on the cluster

GARLI (Genetic Algorithm for Rapid Likelihood Inference) is a program written by Derrick Zwickl for estimating the phylogeny using maximum likelihood, and is currently one of the best programs to use if you have a large problem (i.e. many taxa). GARLI now (as of version 2.01) gives you considerable choice in substitution models: GTR[+I][+G] or codon models for nucleotides, plus several choices for amino acids. The genetic algorithm (or GA, for short) search strategy used by GARLI is like other heuristic search strategies in that it cannot guarantee that the optimal tree will be found. Thus, as with all heuristic searches, it is a good idea to run GARLI several times (using different pseudorandom number seeds) to see if there is any variation in the estimated tree. By default, GARLI will conduct two independent searches. If you have a multicore processor (newer Intel-based Macs and PCs are duo-core), GARLI can take advantage of this and use all of your CPUs simultaneously.

Today you will run GARLI on the cluster for a dataset with 50 taxa. This is not a particularly large problem, but has the advantage that you will be able to analyze it several times using both GARLI and RAxML within a lab period. Instead of each of us running GARLI several times, we will each run it once and compare notes at the end of the lab.

Preparing the GARLI control file

Like many programs, GARLI uses a control file to specify the settings it will use during a run. Most of the default settings are fine, but you will need to change a few of them before running GARLI.

Obtain a copy of the control file

The first step is to obtain a copy of the GARLI default control file. Go to the GARLI download page and download a version of GARLI appropriate for your platform (Mac or Windows). The only reason you are downloading GARLI is to obtain a copy of the default control file so don't actually install GARLI on your computer; you should use the cluster for all your GARLI runs.

Once you have downloaded and unpacked GARLI on your computer, make a copy of the garli.conf.nuc.defaultSettings file and rename the copy garli.conf, then open it in your text editor.

Editing garli.conf

You will only need to change four lines.

Specify the data file name (note the capital L)

datafname = rbcL50.nex

Specify the prefix for output files

ofprefix = 50

The ofprefix is used by GARLI to begin the name of all output files. I usually use something different than the data file name here. That way, if you eventually want to delete all of the various files that GARLI creates, you can just say rm -f 50* ("remove all files beginning "50", the -f means "force" i.e. don't ask if it is ok, just do it) without wiping out your data file as well! (Sounds like the voice of experience, doesn't it?!)

Do only one search replicate

searchreps = 1

Specify no invariable sites

invariantsites = none

This will cause GARLI to use the GTR+G model rather than the GTR+I+G model, which will facilitate comparisons with RAxML.

Save the garli.conf file when you have made these changes.

The tip of the GARLI iceberg

As you can see from the number of entries in the control file, we are not going to learn all there is to know about GARLI in one lab session. One major omission is any discussion about bootstrapping, which is very easy to do in GARLI: just set bootstrapreps to some number other than 0 (e.g. 100) in your garli.conf file (but don't do this today). I encourage you to download and read the excellent GARLI manual, especially if you want to use amino acid or codon models.

Log into the cluster

Log into the cluster using the command:

ssh bbcsrv3.biotech.uconn.edu

Go back to the Phylogenetics: Bioinformatics Cluster lab if you've forgotten some details.

Create a folder and a script for the run

Create a directory named garliraxml inside your home directory and use your favorite file transfer method (scp, psftp, Fugu, FileZilla, etc.) to get garli.conf into that directory.

Now download the data file into the garliraxml directory:

curl http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/rbcL50.nex > rbcL50.nex

Finally, create the script file you will hand to the qsub command to start the run. Use the nano editor to create a file named gogarli inside the garliraxml directory with the following contents:

#$ -cwd
#$ -S /bin/bash
#$ -o out.txt
#$ -e err.txt
#$ -m ea
#$ -M your.name@uconn.edu
#$ -N jobname
garli garli.conf

The lines starting with #$ represent commands that qsub understands. Any line starting with a hash (#) is interpreted as a comment by all linux/unix interpreters, but the extra dollar sign ($) immediately after the hash is used by the qsub program as a flag, telling it that what follows should be interpreted as a command. All of these could be entered on the qsub command line as well, but it is easy to forget something unless you put it in a script like this.

The command #$ -cwd tells qsub to put output files in the current working directory, the directory in which it was invoked (which will be the garliraxml directory).

The command #$ -S /bin/bash tells qsub to use the bash program to interpret the script. You have already been using the bash program; it interprets the commands you type when you are logged into the cluster. This qsub command just tells qsub to use the same program to interpret commands in the gogarli file.

The command #$ -o out.txt tells qsub to save any output into the file out.txt.

The command #$ -e err.txt tells qsub to save any error messages into the file err.txt.

The command #$ -m ea tells qsub to send you an email when your job ends (e) or aborts (a). The command #$ -M your.name@uconn.edu tells qsub what your email address is (be sure to specify your own email address here). This is an extremely helpful feature! This feature may not work if you provide a non-uconn email address (feel free to try your gmail address and let me know if it works).

Finally, the -N command provides a job name that qsub will use to identify your job. This is entirely optional, but I find it very helpful to give short names to my runs so that when I get the email I know which run has finished (most useful if you have several runs going simultaneously). Be sure to replace "jobname" with something more meaningful (keep your job name simple - no embedded spaces or punctuation, and you will find that it is better to use 6 characters or fewer, as longer job names get truncated in qstat output).

The last line starts garli.

Submit the job

Here is the command to start the job:

qsub gogarli

You should issue this command from inside the garliraxml directory, which should contain 3 files: gogarli, rbcL50.nex and garli.conf.

Check progress every few minutes using the qstat command. This run will take about 4 minutes. If you get bored, you can use this command to see the tail end of the log file that GARLI creates automatically:

tail 50.log00.log

The tail command is like the cat command except that it only shows you the last few lines of the file (which often is just what you need). If you need to see more lines at the end of the file, you can specify the number using the -n option:

tail -n 100 50.log00.log

The above command will show the last 100 lines.

Files produced by GARLI

After your run finishes, you should find these files in your garliraxml folder. Download them to your laptop and view them to answer the questions:

50.screen.log

This file saves the output that would have been displayed had you been running GARLI on your laptop. (Hover over the phrase "Paul's result" below to see what values I saw in my 50.screen.log file; your run may differ because we all started with a different seed.)

How long did this run take? (Search for "Time used" following "Final score") Paul's result
What was the log-likelihood of the best tree? (Look for "Final score") Paul's result
How many "generations" did GARLI do before it decided to stop? (Look at the line just above "Reached termination condition!") Paul's result
At which generation did it first find the tree that it finally reported? (Look at the last value in the line you used to answer the previous question.) Paul's result
Assuming each generation takes an equal amount of time, how long did it really take GARLI to find the best tree? Paul's result
Looking at the model report at the end, what two categories of substitution occur at the fastest rate? Paul's result
Also, is there a lot of rate heterogeneity in this data set? Paul's result
What is the ratio of the fastest relative rate to the slowest? Paul's result

50.log00.log

This file shows the best log-likelihood at periodic intervals throughout the run. It would be useful if you wanted to plot the progress of the run either as a function of time or generation.

50.best.tre

This is a NEXUS tree file that can be opened in FigTree, TreeView, PAUP*, or a number of other phylogenetic programs. Try using FigTree to open it. The best place to root it is on the branch leading to Nephroselmis (use the search box near the top of the FigTree window to quickly find taxa). In FigTree, click this branch and use the Reroot tool to change the rooting. I also find that trees look better if you click the Order nodes checkbox, which is inside the Trees tab on the left side panel of FigTree.

Part B: Starting a RAxML run on the cluster

Another excellent ML program for large problems is RAxML, written by Alexandros Stamatakis. This program is exceptionally fast, and has been used to estimate maximum likelihood trees for 25,000 taxa! Let's run RAxML on the same data as GARLI and compare results.

Preparing the data file

While GARLI reads NEXUS files, RAxML uses a simpler format. It is easy to use the nano editor to make the necessary changes, however. First, make a copy of your rbcL50.nex file:

cp rbcL50.nex rbcL50.dat

Open rbcL50.dat in nano and use Ctrl-k repeatedly to remove these initial lines before the taxa and their sequences:

#nexus 

begin data;
  dimensions ntax=50 nchar=1314;
  format datatype=dna gap=- missing=?;
  matrix

Add a new first line to the file that looks like this:

50 1314

Now use the down arrow to go to the end of the file and remove the last two lines (again using Ctrl-k):

;              
end;

Save the file using Ctrl-x and you are ready to run RAxML!

The tip of the RAxML iceberg

As with GARLI, RAxML is full of features that we will not have time to explore today. Try typing raxml -h at the linux command prompt to see a brief description of all the available options. The manual does a nice job of explaining all the features so I recommend reading it if you use RAxML for your own data.

Running RAxML on the cluster

Hopefully, you have created the rbcL50.dat file in your garliraxml directory. If not, go ahead and move it there. Then use nano to create a gorax script file that contains the following:

#$ -cwd
#$ -S /bin/bash
#$ -o raxout.txt
#$ -e raxerr.txt
#$ -m ea
#$ -M your.name@uconn.edu
#$ -N jobname
raxml -p 13579 -N 1 -e 0.00001 -m GTRCAT -s rbcL50.dat -n BASIC

You'll note that this is similar to the gogarli script we created earlier, but please read the explanation below before submitting the run to the cluster.

The qsub command lines are the same except that we specified raxout.txt rather than out.txt and raxerr.txt rather than err.txt (this is so that our RAxML run will not overwrite the files generated by our GARLI run).

The last line invokes raxml and requires the most explanation. First, RAxML does not use a control file like GARLI, so all options must be specified on the command line when it is invoked. Let's take each option in turn:

  • -p 13579 provides a pseudorandom number seed to RAxML to use when it generates its starting tree. It is a good idea to specify some number here so that you have the option of exactly recreating the analysis later. By the way, the "p" stands for parsimony, as this seed is used to generate the random addition sequence for constructing a starting tree using parsimony.
  • -N 1 tells RAxML to just perform one search replicate.
  • -e 0.00001 sets the precision with which model parameters will be estimated. RAxML will search for better combinations of parameter values until it fails to increase the log-likelihood by more than this amount. Ordinarily, the default value (0.1) is sufficient, but we are making RAxML work harder so that the results are more comparable to GARLI, which does a fairly thorough final parameter optimization.
  • -m GTRCAT tells RAxML to use the GTR+CAT model for the search. I will explain in lecture how the CAT model works in RAxML - it is essentially a sites-specific-rates model for rate heterogeneity, but derives the categories and relative rates from the data set itself before the search begins.
  • -s rbcL50.dat provides the name of the data file.
  • -n BASIC supplies a suffix to be appended to all output file names

Start the run by entering this from within the garliraxml directory (where your rbcL50.dat and gorax files are located):

qsub gorax

Bootstrapping with RAxML

After your first RAxML run finishes (probably within a minute), start a second, longer run to perform bootstrapping. Modify the last line of your gorax file as follows:

# raxml -p 13579 -N 1 -e 0.00001 -m GTRCAT -s rbcL50.dat -n BASIC
raxml -f a -x 12345 -p 13579 -N 100 -m GTRCAT -s rbcL50.dat -n FULL

Note that I've used a # character to comment out our previous raxml line. (Feel free to simply delete that line if you wish.) Go ahead and start this run using qsub. This one will take longer, but not as long as you might expect (about 10 minutes). It will conduct a bootstrap analysis involving 100 bootstrap replicates (-N 100) using the GTR+CAT model. The -x 12345 specifies a starting seed for the bootstrap resampling. For every 5 bootstrap replicates performed, RAxML will climb uphill on the original dataset starting from the tree estimated for that bootstrap replicate. This provides a series of searches for the maximum likelihood tree starting from different, but reasonable, starting trees. The -f a on the command line sets up this combination of bootstrapping and ML searching.

Files produced by RAxML

RAxML_info.FULL

This file contains some basic information about the run. Use this file to answer these questions:

How long did RAxML require to perform 100 bootstrap replicates? Paul's results
What is the log-likelihood of the best tree found by RAxML? Paul's results

RAxML_bestTree.FULL

This file holds the best tree found. It is not a NEXUS tree file, but simply a tree description; however, FigTree is able to open such files.

RAxML_bootstrap.FULL

This file holds the trees resulting from bootstrapping (also not NEXUS format; one tree description per line). These trees do not have branch lengths. You can open this file in FigTree and use the arrow buttons to move from one to the next.

RAxML_bipartitions.FULL

This file contains the best tree with bootstrap support values embedded in the tree description. Load this tree into FigTree. FigTree will ask you what name you want to use for the support values. Pick a name such as "bootstraps" and click Ok. Once the tree is visible, check Node labels on the left, chooose "bootstraps" (or whatever you named them) from the Display list, and increase the font size so you can see it (ok, you are probably young enough that you can still see the numbers without magnification!).

Can you tell by viewing this tree in FigTree that this is not the bootstrap majority-rule consensus tree? What evidence did you use? answer

We could have instructed RAxML to create the majority rule tree using the "­-J MR option, but by default it maps bootstrap support values onto the splits of the maximum likelihood tree topology.


Comparing GARLI and RAxML Using PAUP*

To compare the two programs, use the nano editor to create a tree file named combined.nex and specify the best tree from both programs and a paup block to compute the likelihoods of these two trees under the GTR+G model. Your combined.nex command block should contain the following:

  #nexus
       begin paup;
       exe rbcL50.nex;
       gettrees file=RAxML_bestTree.FULL;
       deroot;
       gettrees file=50.best.tre mode = 7; 
       set criterion=likelihood;
       lset nst=6 rmatrix=estimate rates=gamma shape=estimate;
       lscores all;
       agree all;
   end;

Here, gettrees command reads the two tree files (RAxML_bestTree.FULL, and 50.best2.tre) you generated using the programs RaxML and Garli. The mode=7 tells PAUP to keep the RaxML tree (RAxML_bestTree.FULL) in the memory while loading the Garli tree (50.best.tre). You should have two tree files and a data file (rbcL50.nex) in the same folder to execute combined.nex. The comment [&U] in the Garli tree file tells PAUP that the tree is unrooted. Therefore, to compare it with RAxML tree (which does not contain any rooting information in the file) you can add the command deroot (after gettrees file=RAxML_bestTree.FULL;).

Which tree has the best log-likelihood? You will probably find that GARLI's likelihood is slightly different than RAxML, but both will be nearly the same value. Both of these approaches will give different answers if you run them multiple times under different random number seeds, so you should probably do several replicates (or a bootstrap analysis) before making anything too momentous of the results.

The last command given to PAUP was "agree all". This computes an agreement subtree from the results of the two analyses.

Can you figure out from PAUP's output how many taxa (out of the 50 total) it had to omit in order to find an agreement subtree? answer

Challenge

Modify your combined.nex file above to estimate parameters under the following models. Create a table like that below and fill in the maximized log-likelihoods and parameter estimates reported by PAUP*. Note that you can specify lscores 1 to use just the RAxML tree (no need to carry out this exercise on both trees).

Model maximized log-likelihood parameter estimates
GTR (nothing here)
GTR+I pinvar=?
GTR+G shape=?
GTR+I+G pinvar=?,shape=?
GTR+SS3 firstpos=?, secondpos=?, thirdpos=?
GTR+SS2 firstsecondpos=?, thirdpos=?

The first 4 models are straightforward and you can refer to the previous lab (Likelihood) to learn how to estimate pinvar and shape if you've forgotten. The SS3 and SS2 stand for "site-specific rates (each of the 3 codon positions is a separate category)" and "site-specific rates (1+2 codon positions in one category, 3rd codon positions in the second category)". Search the Frequently Asked Questions page at the PAUP web site for the term "site-specific" to find out how to set up PAUP* to perform a site-specific rates analysis.

Which rate heterogeneity submodels provide the largest boost to the likelihood?
Why do pinvar and shape change the way they do when estimated jointly versus separately?