Phylogenetics: BayesTraits Lab

Revision as of 16:05, 10 April 2009 by PaulLewis (Talk | contribs) (Do the tutorial)

EEB 349: Phylogenetics
In this lab you will learn how to use the program BayesTraits, written by Andrew Meade and Mark Pagel. BayesTraits can perform several analyses related to evaluating evolutionary correlation in discrete morphological traits. This program is meant to replace the older programs Discrete and Multistate. You will learn not only how to use the program on the Windows-based PCs in the computer lab, but also how to download and use it on the cluster (the cluster is better for long runs).

We will use BayesTraits interactively for a while on the PCs in the computer room (Part 1), then set up a non-interactive run on the cluster (Part 2) so that you know how to do this.

Part 1: Running BayesTraits under Windows

Download BayesTraits

Go to Mark Pagel's web site, click on the "Software" link, then click on the "Description and Downloads" link under "BayesTraits". Download the version specific to your platform. BayesTraits will unpack itself into a folder containing the program along with several tree and data files (PPI.txt, PPI.trees, Primates.txt and Primates.trees). I will hereafter refer to this folder simply as the BayesTraits folder. Then go back to Mark Pagel's web site and download the manual for BayesTraits. This is a PDF file and should open in your browser window.

Download the modified example files

You will be going through the tutorial presented in the manual for the program during this lab, but there are a couple of modifications we need to make to the example data files first:

Use Primates.first.tree instead of Primates.trees

The Primates.trees file that comes with BayesTraits contains 500 trees, which makes any analysis take a very long time. We'll avoid the long waits by using a version of this file that contains only the first tree. Download Primates.first.tree and save it in your BayesTraits folder. Whenever the tutorial refers to the file Primates.trees, use Primates.first.tree instead.

Obtain the missing MatingSystem.txt file

The first part of the tutorial in the manual will not work out of the box because it assumes you have the file MatingSystem.txt, which is not included in the distribution. It turns out that the missing MatingSystem.txt is just Primates.txt with the first of the two characters deleted. I've done the modification for you, so download the MatingSystem.txt file now and save it in your BayesTraits folder.
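
If you would rather make this kind of modification yourself, deleting one character column can be sketched with awk. This is only a sketch: it assumes each line of the data file holds a taxon name followed by two whitespace-separated character states, and the three-line stand-in file and its taxon names are made up for illustration. Check the layout of your real Primates.txt before relying on it.

```shell
# Work in a scratch directory so nothing in the BayesTraits folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for Primates.txt (hypothetical taxa and states, NOT the real data):
# column 1 = taxon name, column 2 = first character, column 3 = second character.
printf 'TaxonA\t0\t1\nTaxonB\t1\t1\nTaxonC\t0\t0\n' > Primates.demo.txt

# Keep the taxon name and the second character; drop the first character.
awk '{ print $1 "\t" $3 }' Primates.demo.txt > MatingSystem.demo.txt

# Show the result.
cat MatingSystem.demo.txt
```

The file names end in .demo.txt so that the sketch cannot overwrite the MatingSystem.txt you downloaded.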

Do the tutorial

Work through the tutorial starting on p. 10 of the BayesTraits draft manual PDF file (but only after reading the Tutorial Notes section below). The heading of the section is "Using MultiState to estimate the model of evolution and ancestral states for a binary trait". Stop when you get to the "Functional Gene Links" section (p. 18 of the manual).

Tutorial Notes

Remember throughout the tutorial to use Primates.first.tree instead of Primates.trees! Note that your output will only correspond to that of tree number 1 in the sample output from the BayesTraits manual.

BayesTraits must be run from the command line, which means you must open a command window to run the program. Simply double-clicking the BayesTraits executable file will cause it to run, but not for long! The problem is that when you double-click the program, you have no way to tell it which tree and data files to use, so it simply quits immediately. Open a Terminal window (Mac) or console window (Windows) in the directory where the BayesTraits executable is located. You can then start the program as follows:

BayesTraits Primates.first.tree MatingSystem.txt

Note, however, that the tutorial later switches to using the data file Primates.txt, so at that point you should replace MatingSystem.txt with Primates.txt on the command line.

One final note: the default number of MCMC iterations is 5,050,000, which will take some time to run. For our purposes, it is fine to reduce this number. For example, to tell BayesTraits to run for only 550,000 iterations, type the following command before you type run:

it 550000

There is a listing of all commands recognized by BayesTraits at the end of the manual.

Part 2: Running BayesTraits on the cluster

We will now switch to using the cluster to run BayesTraits. BayesTraits is not installed on the cluster, so you will need to download and unpack it into your home directory in order to use it. Using PuTTY, connect to bbcxsrv1.biotech.uconn.edu to get a command prompt.

Downloading and unpacking BayesTraits

The full URL to the OS X PPC version of BayesTraits is

http://www.evolution.reading.ac.uk/Files/BayesTraits-OSX-PPC-V1.0.tar.gz

If you had a browser open and typed in this URL, your browser would save the file BayesTraits-OSX-PPC-V1.0.tar.gz on your hard drive. But you are not using a browser on the cluster; you are logged in using the Secure Shell client program PuTTY.

You could download the file to your PC, then upload it to the cluster using PSFTP, but let's instead use the curl command:

 curl -o BayesTraits-OSX-PPC-V1.0.tar.gz http://www.evolution.reading.ac.uk/Files/BayesTraits-OSX-PPC-V1.0.tar.gz

This tells curl to access the specified URL (curl will stand in for a web browser) and save the resulting file as BayesTraits-OSX-PPC-V1.0.tar.gz.

Once you have the file in your home directory (use the ls command to check), you will need to unpack it using the tar command:

 tar zxvf BayesTraits-OSX-PPC-V1.0.tar.gz

The file extension .tar.gz is very common for software targeting Linux and Mac OS X systems. This extension has a very specific meaning: the .tar part means that the file is actually an archive (bundle) comprising several files saved one after the other. The .gz part means that the archive has been compressed using the gzip program. The tar command can both ungzip the archive (that's what the z in zxvf means) and separate the component files (the x in zxvf stands for extract). The v in zxvf means verbose (tar will show you what it's doing), and the f simply means that the name of the file to extract follows (i.e. BayesTraits-OSX-PPC-V1.0.tar.gz).
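
To get a feel for these switches without touching the BayesTraits download, you can build a small archive of your own and unpack it again. The sketch below works in a scratch directory; the directory and file names (demo, example.trees, example.txt) are placeholders invented for this example.

```shell
# Build a tiny archive in a scratch directory, then unpack it the same way.
workdir=$(mktemp -d)
cd "$workdir"
mkdir demo
echo "tree file placeholder" > demo/example.trees
echo "data file placeholder" > demo/example.txt

# c = create, z = compress with gzip, f = the archive file name follows
tar czf demo.tar.gz demo
rm -r demo

# t = list the contents of the archive without extracting anything
tar tzf demo.tar.gz

# x = extract, z = gunzip first, v = verbose, f = the archive file name follows
tar xzvf demo.tar.gz
ls demo
```

Listing an archive with the t switch before extracting it is a handy way to see which directory it will create.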

Once the tar command has completed, you should have a directory named BayesTraits. Use the cd command to move into that directory, then use the ls command to see what's there. You should see the same 5 files as before, except the executable (BayesTraits) does not have the .exe file name extension this time (that extension typically denotes Windows executables).

Running BayesTraits

We will do a very simple analysis just so you will know how to run BayesTraits in batch mode. Because analyses on the cluster are submitted via the qsub command, no one will be present to answer the questions BayesTraits asks when it runs. We'll therefore supply the answers in a file (arbitrarily named commands.txt) and feed that file to BayesTraits when it is invoked.

Create the commands.txt file

Using the pico editor, create a file named commands.txt inside the BayesTraits directory, and save the following in it:

# choose the model (1 = multistate, 2 = independence, 3 = dependence)
3
# choose the method (1 = ML, 2 = MCMC)
2
ratedev 10
rjhp exp 0 30
run

The lines starting with a pound sign (or hash) are comments and are optional. Otherwise, this file just presents to BayesTraits exactly what you would type if you were running the program interactively.
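
As an alternative to pico, the same file can be written straight from the command prompt with a shell here-document. The sketch below does this in a scratch directory, then uses a small stand-in script (fake_program, which is hypothetical and not part of BayesTraits) to show how the < redirection delivers each line of the file to a program's standard input, just as if it had been typed.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Write commands.txt; quoting 'EOF' stops the shell from altering the text.
cat > commands.txt <<'EOF'
# choose the model (1 = multistate, 2 = independence, 3 = dependence)
3
# choose the method (1 = ML, 2 = MCMC)
2
ratedev 10
rjhp exp 0 30
run
EOF

# Hypothetical stand-in for an interactive program: it echoes every line it
# reads from standard input, the way a program would receive typed answers.
cat > fake_program <<'EOF'
while IFS= read -r line; do
  echo "received: $line"
done
EOF

# The < redirection feeds commands.txt to the program's standard input.
sh fake_program < commands.txt > received.txt
cat received.txt
```

On the cluster you would run only the first cat command, from inside the BayesTraits directory, so that commands.txt ends up next to the executable.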

Create the qsub script

As you know by now, you use the qsub command to submit a job to the cluster. The advantage of using the qsub command is that it will start your job on a processor that is currently idle (assuming there are idle processors available in the cluster). Not using qsub results in your job being started on the head node of the cluster. The head node is the processor that everyone uses to interact with the cluster, and if you start a run there, it has a fair chance of simply being killed as soon as the system administrator (Jeff Lary) notices it!

The qsub command accepts the name of a script file that carries out your wishes, so we must first create this script (use pico to create and save the following in your home directory as a file named btgo, for example):

 #$ -o junk.txt -j y
 cd $HOME/BayesTraits
 ./BayesTraits Primates.trees Primates.txt < commands.txt

You have seen scripts like this before. The first line begins with the sequence of characters #$, which alerts qsub that qsub options follow. The options themselves are the -o junk.txt -j y part, which tells qsub that you want the output of the run saved in a file named junk.txt and that you would like any error messages appended to this same file (the -j y part is indeed cryptic, but stands for "join [standard error to output] yes"). Be sure to delete or rename any file named junk.txt that already exists.

The second line says to cd into the BayesTraits directory (this means that the BayesTraits folder will now be the working directory, and any output files created by the program will be saved there).

The third line actually starts the program. Since we are running this on the cluster and are not so limited by time, we'll use the full Primates.trees file (containing 500 trees). Note the very important < commands.txt part, which feeds the answers to all questions to the BayesTraits executable. Also, note the period-slash (./) before the name of the program. This says that the program can be found in the current directory (one of the crazy things about unix is that the current directory is not searched by default when looking for a program to run!).
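
The effect of the period-slash can be seen with a small experiment. In this sketch, bt_demo_prog is a made-up script standing in for the BayesTraits executable. Typing its bare name normally fails, because the shell searches only the directories listed in the PATH variable, whereas ./bt_demo_prog tells the shell exactly where to find it.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A tiny executable script standing in for the BayesTraits program.
printf '#!/bin/sh\necho "hello from the current directory"\n' > bt_demo_prog
chmod +x bt_demo_prog

# Without ./ the shell searches only the directories listed in PATH, so this
# fails unless the current directory happens to be included in PATH.
if bt_demo_prog > /dev/null 2>&1; then
  bare_result="found"
else
  bare_result="not found"
fi
echo "bt_demo_prog alone: $bare_result"

# With ./ the shell is told exactly where the program is.
dot_result=$(./bt_demo_prog)
echo "./bt_demo_prog: $dot_result"
```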

Submit the script

Now submit the job as follows:

 qsub btgo

As before, you can periodically check to see if it is still running using the qstat command. You can use pico to look at the junk.txt file in your home directory, or look in the BayesTraits directory and examine the files created there as the program runs. A useful command is tail:

 cd $HOME/BayesTraits
 tail Primates.txt.log.txt

This shows you only the last few lines of the file listed. To show the last 100 lines, you can use the -n switch:

 tail -n 100 Primates.txt.log.txt
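
If you want to try tail before your job has produced any output, the sketch below builds a stand-in file of 500 numbered lines (not real BayesTraits output) and shows which line each form of the command starts from.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in log file with 500 numbered lines (not real BayesTraits output).
awk 'BEGIN { for (i = 1; i <= 500; i++) print i }' > demo.log.txt

# Default: the last 10 lines, so the first line shown is 491.
first_default=$(tail demo.log.txt | head -n 1)

# -n 100: the last 100 lines, so the first line shown is 401.
first_hundred=$(tail -n 100 demo.log.txt | head -n 1)

echo "default tail starts at line $first_default"
echo "tail -n 100 starts at line $first_hundred"
```

The tail command also accepts the -f switch, which keeps the file open and prints new lines as they are appended; press Ctrl-C to stop it.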