Difference between revisions of "Phylogenetics: Large Scale Maximum Likelihood Analyses"
Revision as of 19:37, 16 February 2009
EEB 349: Phylogenetics
This lab explores two programs (GARLI and RAxML) designed specifically for maximum likelihood analyses on a large scale (hundreds of taxa).
Contents
- 1 Part A: Starting a GARLI run on the cluster
- 2 Part B: Starting a RAxML run on the cluster
Part A: Starting a GARLI run on the cluster
GARLI is a program written by Derrick Zwickl for estimating the phylogeny using maximum likelihood, and is currently one of the best programs to use if you have a large problem (i.e. many taxa). GARLI now (as of version 0.96) gives you considerable choice in substitution models: GTR[+I][+G] or codon models for nucleotides, plus several choices for amino acids. The genetic algorithm (or GA, for short) search strategy used by GARLI is like other heuristic search strategies in that it cannot guarantee that the optimal tree will be found. Thus, as with all heuristic searches, it is a good idea to run GARLI several times (using different pseudorandom number seeds) to see if there is any variation in the estimated tree.
Today you will run GARLI on the cluster for a dataset with 50 taxa. This is not a particularly large problem, but then you only have an hour or so to get this done! Instead of each of us running GARLI several times, we will each run it once and compare notes at the end of the lab.
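Outside of this lab, one way to set up several runs with different seeds is to write one control file per seed. The sketch below is only an illustration: it assumes the control file (described in the next section) has a randseed line and an ofprefix line written as "name = value", and the seed values, the two-line stand-in garli.conf, and the output file names are all made up for the example.

```shell
# Work in a scratch directory with a stand-in control file (illustration only;
# a real garli.conf has many more settings).
WORK=$(mktemp -d)
cd "$WORK"
printf 'ofprefix = 50taxa\nrandseed = -1\n' > garli.conf

# Write one control file per seed, each with its own output prefix so the
# runs do not overwrite each other's files. The seeds are arbitrary.
for seed in 101 202 303; do
    sed -e "s/^randseed = .*/randseed = $seed/" \
        -e "s/^ofprefix = .*/ofprefix = 50taxa_seed$seed/" \
        garli.conf > "garli_seed$seed.conf"
done

# List the generated control files.
ls garli_seed*.conf
```

Each generated file could then be passed to GARLI in place of garli.conf, and the resulting trees compared.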
Preparing the GARLI control file
Like many programs, GARLI uses a control file to specify the settings it will use during a run. Most of the default settings are fine, but you will need to change a few of them before running GARLI.
Obtain a copy of the control file
The first step is to obtain a copy of the GARLI default control file. Go to the GARLI download page and download a version of GARLI appropriate for your platform (Mac or Windows). For now, the only reason you are downloading GARLI is to obtain a copy of the default control file. However, because GARLI is multithreaded, you may find that it is faster to run it on your laptop than on the cluster (assuming your laptop has a multi-core Intel processor). Running on the cluster has advantages, even if it is slower. For one, you don't have to dedicate your laptop to a GARLI run for several hours.
Once you have downloaded and unpacked GARLI on your computer, copy the file garli.conf.nuc.defaultSettings to a new file named garli.conf and open the copy in your text editor.
Editing garli.conf
You will only need to change two lines. Change this line
datafname = zakonEtAl2006.11tax.nex
so that it looks like this instead
datafname = rbcL50.nex
Then change this line
ofprefix = nuc.GTRIG
so that it looks like this instead
ofprefix = 50taxa
The ofprefix is used by GARLI to begin the name of all output files. I usually use something different than the data file name here. That way, if you eventually want to delete all of the various files that GARLI creates, you can just say
rm -f 50taxa*
without wiping out your data file as well!
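To see why a distinct prefix helps, here is a small sketch you can try safely in a scratch directory. The files are empty stand-ins with names borrowed from this lab; nothing here touches your real run.

```shell
# Scratch directory with empty stand-in files (names taken from this lab).
WORK=$(mktemp -d)
cd "$WORK"
touch rbcL50.nex garli.conf 50taxa.best.tre 50taxa.log00.log

# Preview what the pattern matches before deleting anything.
ls 50taxa*

# Remove only the GARLI output files; the data and control files do not match.
rm -f 50taxa*

# Only rbcL50.nex and garli.conf should remain.
ls
```

Had the prefix been rbcL50 instead, the matching cleanup command (rm -f rbcL50*) would have removed the data file too.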
Save the garli.conf file when you have made these changes.
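If you would rather make the two changes from the command line than in a text editor, something like the following sed sketch should work. It assumes both settings are written as "name = value" with single spaces, as shown above; the garli.conf created here is a two-line stand-in, not the full default control file.

```shell
WORK=$(mktemp -d)
cd "$WORK"
# Stand-in for the default control file: only the two lines changed here.
printf 'datafname = zakonEtAl2006.11tax.nex\nofprefix = nuc.GTRIG\n' > garli.conf

# Rewrite both settings, writing to a new file and then replacing the original
# (this avoids sed -i, whose syntax differs between GNU and BSD sed).
sed -e 's/^datafname = .*/datafname = rbcL50.nex/' \
    -e 's/^ofprefix = .*/ofprefix = 50taxa/' \
    garli.conf > garli.conf.new && mv garli.conf.new garli.conf

# Show the two edited lines to confirm the change.
grep -E '^(datafname|ofprefix) =' garli.conf
```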
Log into the cluster
Log into the cluster using the command:
ssh bbcxsrv1.biotech.uconn.edu
Go back to the Phylogenetics: Bioinformatics Cluster lab if you've forgotten some details.
Create a folder and a script for the run
Create a directory named garlirun inside your home directory and use your favorite file transfer method (scp, psftp, Fugu, FileZilla, etc.) to get garli.conf into that directory.
Now download the data file into the garlirun directory (issue this command from your home directory):
curl http://hydrodictyon.eeb.uconn.edu/eeb5349/rbcL50.nex > garlirun/rbcL50.nex
Finally, create the script file you will hand to the qsub command to start the run. Use the pico editor to create a file named gogarli in your home directory with the following contents:
#$ -o junk.txt -j y
cd $HOME/garlirun
garli garli.conf
Submit the job
Here is the command to start the job:
qsub gogarli
You should issue this command from your home directory, or wherever you saved the gogarli file.
Download the data file using curl
I have placed the data file (rbcL50.nex) at the following address:
http://hydrodictyon.eeb.uconn.edu/eeb349/rbcL50.nex
so you can use curl to download this file to the garlirun directory as follows:
cd $HOME/garlirun
curl http://hydrodictyon.eeb.uconn.edu/eeb349/rbcL50.nex > rbcL50.nex
Three changes were made to this section:
- The last line above has been corrected (previously it did not include the > rbcL50.nex part on the end)
- I changed rbcl50.nex to rbcL50.nex to avoid confusing the lower case letter L (l) with the number one (1)
- I substituted a Tomato sequence for one of the two Spinach sequences, so now there is no duplication in the rbcL50.nex data file (but note that the data set has changed)
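Because the download is redirected straight into a file, a failed request can leave an error page saved under the name rbcL50.nex. A quick sanity check is to confirm the file starts with the #NEXUS header that NEXUS files begin with. The helper name is_nexus and the two stand-in files below are hypothetical, used only to show the idea.

```shell
# Return success if the first line of the file is the #NEXUS header.
is_nexus() {
    head -n 1 "$1" | grep -qi '^#NEXUS'
}

# Stand-in files for illustration: one NEXUS-like, one HTML error page.
WORK=$(mktemp -d)
printf '#NEXUS\nbegin data;\nend;\n' > "$WORK/good.nex"
printf '<html><body>404 Not Found</body></html>\n' > "$WORK/bad.nex"

for f in "$WORK/good.nex" "$WORK/bad.nex"; do
    if is_nexus "$f"; then
        echo "$f: looks like a NEXUS file"
    else
        echo "$f: does not start with #NEXUS"
    fi
done
```

On the cluster you would call is_nexus on the downloaded rbcL50.nex in the garlirun directory before submitting the job.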
Preparing the gogarli SGE script
Now return to your home directory (using the cd command) and create a gogarli script that will be fed to qsub to start the analysis. Use either pico or cat to create the file with this text:
#$ -o junk.txt -j y
cd $HOME/garlirun
Garli.94 garli.conf
This file will look very similar to the gopaup script you created in part A. The only differences are that the data and control files are in the directory garlirun (not pauprun), the name of the program is Garli.94 (this means GARLI version 0.94), and GARLI expects the name of the control file (garli.conf) on the command line instead of the name of the data file (rbcL50.nex). Remember that the name of the data file was specified inside the control file.
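If you choose cat rather than pico, a here-document is one way to write the script in a single step. This sketch writes gogarli into a scratch directory so it can be tried anywhere; on the cluster you would run the cat command from your home directory instead.

```shell
# Write the script into a scratch directory here; in the lab it would go in
# your home directory.
WORK=$(mktemp -d)
cd "$WORK"

# Quoting 'EOF' keeps $HOME literal in the file, so it is expanded when the
# job runs rather than when the file is written.
cat > gogarli <<'EOF'
#$ -o junk.txt -j y
cd $HOME/garlirun
Garli.94 garli.conf
EOF

# Display the file to check its contents.
cat gogarli
```

In the first line of the script, -o junk.txt names the file that receives the job's output, and -j y merges the error stream into that same file.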
Running GARLI
Run GARLI by issuing the qsub command:
qsub gogarli
Check progress every few minutes using the qstat command. This run will take 15 or 20 minutes. If you get bored, you can cd into the garlirun directory and use this command to see the tail end of the log file that GARLI creates automatically:
tail 50taxa.log00.log
The tail command is like the cat command except that it only shows you the last few lines of the file (which often is just what you need).
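By default tail prints the last 10 lines, and the -n option lets you ask for a different number. The sketch below demonstrates this on a stand-in log file made of 30 numbered lines; these are placeholders, not real GARLI output.

```shell
WORK=$(mktemp -d)
cd "$WORK"
# Stand-in log with 30 numbered lines (not real GARLI output).
i=1
while [ "$i" -le 30 ]; do
    echo "stand-in log line $i"
    i=$((i + 1))
done > 50taxa.log00.log

# Default: the last 10 lines.
tail 50taxa.log00.log

# Explicit count: the last 3 lines.
tail -n 3 50taxa.log00.log
```

While a run is in progress, tail -f 50taxa.log00.log keeps printing new lines as they are appended to the file; press Ctrl-C to stop watching.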
Mailing the tree to yourself
After GARLI has finished, you should download the tree file (50taxa.best.tre) using the PSFTP get command, but here is another handy trick: you can email the tree to yourself using this command (issue this from within the garlirun directory where the tree file is located):
mail paul.lewis@uconn.edu < 50taxa.best.tre
This command will send mail to paul.lewis@uconn.edu, and the body of the email message will come from the file 50taxa.best.tre!
Part B: Starting a RAxML run on the cluster
Another excellent ML program for large problems is RAxML, written by Alexandros Stamatakis.