Phylogenetics: FDPPDiv

From EEBedia
Revision as of 17:32, 20 April 2014 by Paul Lewis (Talk | contribs) (Second, simulate the sequence data)

Jump to: navigation, search
Adiantum.png EEB 5349: Phylogenetics
In this lab you will learn how to use the program FDPPDiv, written by Tracy Heath. FDPPDiv offers two alternative approaches to divergence time estimation. The DPPDiv part refers to the Dirichlet Process Prior (DPP) model for divergence time estimation, and the F prefix (for Fossil) refers to the new Fossil Birth-Death approach. You will also learn a useful skill, how to download, compile, and run a C++ program from within your home directory on the cluster.

Download the C++ source code for FDPPDiv

First, login to the cluster and use qlogin to get off the head node and begin a session on a compute node.

Use the curl (copy url) program to download the source code for FDPPDiv:

curl -L https://github.com/trayc7/FDPPDIV/archive/v1.2.tar.gz > fdppdiv-1.2.tar.gz

The curl manual page - use the command man curl to view, and to exit type :q (yes, that's a colon followed by a lower case letter q) - says this about the -L (location) option: "If the server reports that the requested page has moved to a different location..., this option will make curl redo the request on the new place." It is nearly always a good idea to include the -L option with curl, as many web sites for software perform redirects.

The curl command copies the file v1.2.tar.gz from the specified url and saves it in your home directory as the file fdppdiv-1.2.tar.gz. There is nothing wrong with saving the file under the same name, but the original name v1.2.tar.gz doesn't give us any hint about what is included (other than that it is version 1.2), so I thought it best to add the program name to the file name when it was saved.

The extension tar.gz implies that this is a compressed archive that will unpack into a directory containing many files. The abbreviation tar refers to the tar command, which we will use next to unpack the archive. The command name "tar" stands for "tape archive", which belies the age of this program, which was used in the old days to concatenate all files in a directory into a single file that could be saved onto a magnetic tape. The tar command has survived, whereas magnetic tapes have given way to hard drives.

The gz part of the extension tar.gz tells you that the file is compressed. The tar command just concatenates files, whereas the gzip program compresses a file, adding the suffix gz to the name.

You could uncompress the fdppdiv-1.2.tar.gz to create the (larger) file fdppdiv-1.2.tar (often called a "tarball"), and then in a second operation you could use the tar command to "untar" the tarball. We will instead use tar to both uncompress and untar the file as follows:

tar zxvf fdppdiv-1.2.tar.gz

The option "z" tells tar to uncompress the file, the option "x" tells tar to extract (untar) the resulting tar archive, the option "v" tells tar to be verbose and tell us what is going on (i.e. list each file name as it is extracted), and the option "f" says that the file name to extract is coming next on the command line. You should be sure "f" is the last option given to tar before the file name for obvious reasons. If you just had an uncompressed tarball to extract, you could (and should) leave off the "z" switch.

After executing the tar command, you should find that you have a new directory in your home directory named FDPPDIV-1.2.

cd into that directory and you will find two subdirectories (example and src) as well as a README file.

cat the README file: it gives some literature citations and some web sites you might be interested in visiting.

Now cd into the src directory to prepare for the next step.

Compile FDPPDiv

What you've downloaded is not the FDPPDiv program, but the C++ source code that can be used to create the FDPPDiv program. Computer software comes in many shapes and sizes. In scripting languages like Python and R, the source code is the program. You pass these "scripts" to the Python or R interpreter and they translate them on the fly into binary machine code that the computer understands (1s and 0s). Languages like Java are intermediate: Java source code gets compiled into something like machine code (called bytecode) that can be quickly converted into machine code on whatever operating system it is run under. Finally, there are the compiled languages (Fortran, C, C++, Pascal, etc.). Programs written in these languages must be converted completely (compiled) to machine code before they can be executed. The benefit of compiled languages is that they produce the fastest-running programs. The downside is that they are less convenient: any change made to the source code requires recompilation and the result will run on only one platform (i.e. Windows, Mac, Linux). Because of the operating system specificity, authors of programs that must be compiled often just distribute the source code and let users compile it on whatever system they wish to run it on. That is the case with FDPPDiv.

You should now be in the directory ~/FDPPDIV-1.2/src (the tilde ~ is short for your home directory). If the pwd command says you are somewhere else, navigate to that directory now and type the following:

make dppdiv-seq-sse

This will compile one version of FDPPDiv: there are several versions that could be compiled depending on the capabilities of the compiler and processor, and some of these cannot be built on our cluster because the processors do not have the appropriate technology.

The make command looks for a file named Makefile in the current directory that contains instructions on how to build the software. As make runs, you will see many lines that start with g++:

g++ -DHAVE_CONFIG_H -O2 -fomit-frame-pointer -funroll-loops   -c -o dppdiv.o dppdiv.cpp 
g++ -DHAVE_CONFIG_H -O2 -fomit-frame-pointer -funroll-loops   -c -o Alignment.o Alignment.cpp
g++ -DHAVE_CONFIG_H -O2 -fomit-frame-pointer -funroll-loops   -c -o MbEigensystem.o MbEigensystem.cpp
.
.
.
g++ -o dppdiv-seq-sse -O2 -msse3 -D_TOM_SSE3 dppdiv.o Alignment.o MbEigensystem.o MbMath.o 
MbRandom.o MbTransitionMatrix.o Mcmc.o Parameter.o Parameter_basefreq.o Parameter_exchangeability.o 
Parameter_rate.o Parameter_shape.o Parameter_tree.o Parameter_cphyperp.o Parameter_treescale.o 
Parameter_speciaton.o Parameter_expcalib.o Calibration.o Model.o Model_likelihood-seq-sse.o

Each of these lines represents a command that you could run yourself; make just automates the process for you. The g++ command is the Gnu C++ compiler, which is the standard C++ compiler on Linux operating systems. The first line involves compiling the source code file dppdiv.cpp into an object code file dppdiv.o. Once all source code files are compiled into their ".o" object code equivalents, these object code files are linked together by a type of program called (not surprisingly) a linker. The g++ command serves as both compiler and linker, and the last line of output that I've shown above represents the linking phase, where g++ is handed a bunch of ".o" files and is told to link them together into a single executable named dppdiv-seq-sse (the "-o" switch provides the desired name of the executable).

Once the program is compiled, you should see a file named dppdiv-seq-sse appear in the src directory.

Simulate a tree and sequence data

Before analyzing a real example, let's begin by analyzing a simulated data set. This is always a good idea when using a phylogenetics program for the first time. Because you know the truth in the case of simulated data, this provides a sanity check to make sure you are interpreting the output of the program correctly. Today we will simulate a tree using the Yule pure-birth process, then generate sequence data based on the simulated tree under a strict clock model using the popular Seq-Gen program, and finally analyze the simulated data set using FDPPDiv.

First, simulate a tree

I have written a Python program that simulates a Yule tree for any number of taxa and using a speciation rate that you specify. You can download it to your own laptop by clicking here, but today we will just use it on the cluster, so use curl to get it into your home directory:

curl -L http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/yuletree.py > yuletree.py

Run the program as follows to generate a Yule tree with 20 tips using a speciation rate 1:

Important: replace <seed> below with a random seed of your choice (should be a whole number greater than 0)

python yuletree.py 20 1.0 1 <seed>

Note the expected tree height and expected tree length reported. These two numbers will be important later.

The yuletree.py program saves the simulated tree in two ways:

  1. the yulesim.tre file is a NEXUS tree file that you should download to your laptop and view using FigTree
  2. the yulesim.txt file contains just the tree description. You can view this using cat yulesim.txt

Second, simulate the sequence data

For simulating sequence data, we will use the venerable Seq-Gen program, written by Andrew Rambaut, who is also responsible for BEAUTi, FigTree, Tracer, much of BEAST, and many more useful programs. Like FDPPDiv, Seq-Gen is distributed as a compressed tar archive containing source code, so you will need to compile it in order to use it. The steps below should be somewhat familiar now that you've gone through this process once already for FDPPDiv.

Download the Seq-Gen gzipped tarball using curl:

curl -L http://tree.bio.ed.ac.uk/download.php?id=85 > sg.tar.gz

Unpack sg.tar.gz using tar:

tar zxvf sg.tar.gz

Navigate to the source directory:

cd Seq-Gen.v1.3.3/source

Use the ls command to see what's there:

ls 

You should see a lot of files ending in ".c" and ".h", plus the Makefile. This program is written in the C language (FDPPDiv was written in C++, which is a derivative of C).

Compile Seq-Gen using make:

make
cc  -O3 -c seq-gen.c
cc  -O3 -c global.c
cc  -O3 -c evolve.c
cc  -O3 -c model.c
cc  -O3 -c nucmodels.c
cc  -O3 -c aamodels.c
cc  -O3 -c eigen.c
eigen.c:38: warning: conflicting types for built-in function 'cexp'
cc  -O3 -c treefile.c
cc  -O3 -c progress.c
cc  -O3 -c gamma.c
cc  -O3 -c twister.c
cc  -O3 -o seq-gen seq-gen.o global.o evolve.o model.o nucmodels.o aamodels.o eigen.o treefile.o progress.o gamma.o twister.o -lm

Note that this time make used the C compiler cc rather than the C++ compiler (g++) used previously for compiling FDPPDiv. The last line is the link stage, where all the object files (with ".o" endings) are linked together to create the executable file seq-gen. (You can generally ignore warnings such as the one reported here; if there is an error, however, the linker will not be able to generate the executable.)

You should now see the executable seq-gen in the source directory. Let's use it now to simulate sequence data.

Move the executable to a new directory named sg in your home directory:

mkdir ~/sg
mv seq-gen ~/sg

Now move the yulesim.txt file you generated earlier to this same directory (I am assuming it is still in your home directory):

cd ~
mv yulesim.txt sg

Seq-Gen comes with excellent documentation (download the contents of the documentation directory to your laptop and open Seq-Gen.Manual.html with a web browser to view it), but for the sake of time I will spare you the effort of figuring out how to use the program. Run seq-gen as follows:

 ./seq-gen -mHKY -l5000 -n1 -p1 -s0.1 -f0.25,0.25,0.25,0.25 -t2.0 -z1226548 -op < yulesim.txt > simdata.nex

Here is a summary of the options we used:

  • -mHKY means use the HKY model
  • -l5000 means generate sequences of length 5000 sites
  • -n1 means generate just 1 simulated data set (-n10 would generate 10 data sets)
  • -p1 means place all data in one partition subset (i.e. don't partition)
  • -s0.1 means use 0.1 as the tree scaling factor (this value effectively multiplies all edge lengths by 0.1)
  • -f0.25,0.25,0.25,0.25 specifies the nucleotide frequencies (all equal in this case)
  • -t2.0 specifies the transition/transversion ratio (this is not the rate ratio kappa)
  • -z1226548 specifies the pseudorandom number seed to use
  • -op specifies that the output file should be in phylip format, which is what FDPPDiv expects (use -on for nexus, but not for this exercise!)
  • < yulesim.txt provides the tree description to seq-gen (the redirection operator < makes it appear to seq-gen as if the tree description were typed in by you)
  • > simdata.nex saves the output of seq-gen to the file simdata.nex (without this redirection, seq-gen would spit out the sequence data to the screen, which is not very helpful)

After running seq-gen, you should now see the file simdata.nex in your ~/sg directory.

Run FDPPDiv

Assuming you are still in the ~/sg directory, copy (or move) the dppdiv-seq-sse file there:

cd FDPPDIV-1.2/src
mv dppdiv-seq-sse ~/sg
cd ~/sg

We now have a data file (simulated.dat), a tree (yulesim.txt) and the FDPPDiv executable (dppdiv-seq-sse) in place, and the only other thing that is needed is a file describing our "fossil" calibrations. I've put fossil in quotes because we will need to make up fake fossil data. In order to create a fake fossil that makes sense, you will need to download the yulesim.tre file to your laptop and view it with FigTree. Click the scale bar checkbox to see the time scale below the tree and choose a node somewhere close to the middle. If that node is at, say, 50.3, on the scale, and if this node represents the most recent common ancestor (MRCA) of taxon C and taxon D, create a file fake_fossil.cal with the following text:

1
-t	C	D	50.3

This specifies that you have a single fossil calibration node, that the fossil is dated at 50.3 time units, and it is calibrating the MRCA of C and D.

We are now in position to run FDPPDiv on simdata.dat:

./dppdiv-seq-sse -in simdata.dat -tre yulesim.txt -cal fake_fossil.cal -tga -clok -n 10000 -sf 10 -pf 100 -out test_fbd