EEB 5349: Phylogenetics
This lab is still a work in progress: this note will be removed when it is complete. In this lab you will learn how to use the program FDPPDiv, written by Tracy Heath. FDPPDiv offers two alternative approaches to divergence time estimation: the DPPDiv part of the name refers to the Dirichlet Process Prior (DPP) model for divergence time estimation, and the F prefix (for Fossil) refers to the newer Fossil Birth-Death approach. You will also learn a useful skill: how to download, compile, and run a C++ program from within your home directory on the cluster.
- 1 Download the C++ source code for FDPPDiv
- 2 Compile FDPPDiv
- 3 Simulate a tree and sequence data
- 4 Run PAUP*
- 5 Run FDPPDiv
Download the C++ source code for FDPPDiv
First, login to the cluster and use qlogin to get off the head node and begin a session on a compute node.
The latest release of the FDPPDiv program can be found at its GitHub site. You could download the tar.gz file there and then move it to the cluster using, for example, Cyberduck, but instead let's download it directly into your home directory on the cluster using curl. Here is how to use the curl (copy url) program to download the source code for FDPPDiv:
curl -L https://github.com/trayc7/FDPPDIV/archive/v1.2.tar.gz > fdppdiv-1.2.tar.gz
The curl manual page says this about the -L (location) option: "If the server reports that the requested page has moved to a different location..., this option will make curl redo the request on the new place." It is nearly always a good idea to include the -L option with curl, as many web sites for software perform redirects.
The curl command copies the file v1.2.tar.gz from the specified URL and saves it in your home directory as the file fdppdiv-1.2.tar.gz. There would be nothing wrong with saving the file under its original name, but v1.2.tar.gz doesn't give us any hint about what is inside (other than that it is version 1.2), so I added the program name to the file name when it was saved.
What does tar.gz mean?
The extension tar.gz implies that this is a compressed archive that will unpack into a directory containing many files. The abbreviation tar refers to the tar command, which we will use next to unpack the archive. The command name "tar" stands for "tape archive", which betrays the age of this program: in the old days it was used to concatenate all the files in a directory into a single file that could be saved onto a magnetic tape. The tar command has survived, whereas magnetic tapes have given way to hard drives.
The gz part of the extension tar.gz tells you that the file is compressed. The tar command just concatenates files, whereas the gzip program compresses a file, adding the suffix gz to the name.
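If you'd like to see gzip's renaming behavior for yourself, here is a harmless round trip in a scratch directory (notes.txt is a made-up file name for the demo; the parentheses run everything in a subshell, so your working directory is left alone):

```shell
# Optional aside: watch gzip rename a file as it compresses it.
(
  cd "$(mktemp -d)"
  echo "some text" > notes.txt
  gzip notes.txt        # notes.txt is replaced by compressed notes.txt.gz
  ls                    # shows: notes.txt.gz
  gunzip notes.txt.gz   # and back again
  cat notes.txt         # prints: some text
)
```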
You could uncompress the fdppdiv-1.2.tar.gz to create the (larger) file fdppdiv-1.2.tar (often called a "tarball"), and then in a second operation you could use the tar command to "untar" the tarball. We will instead use tar to both uncompress and untar the file as follows:
tar zxvf fdppdiv-1.2.tar.gz
The option "z" tells tar to uncompress the file.
The option "x" tells tar to extract (untar) the resulting tar archive.
The option "v" tells tar to be verbose and tell us what is going on (i.e. list each file name as it is extracted).
The option "f" says that the file name to extract is coming next on the command line. Be sure "f" is the last option given to tar before the file name, because whatever immediately follows "f" is interpreted as the file name.
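To see these options in action before touching the real download, here is a harmless tar round trip in a scratch directory (the file names are made up; the subshell parentheses keep your working directory unchanged):

```shell
# Optional aside: create, compress, and then extract a tiny archive.
(
  cd "$(mktemp -d)"
  mkdir demo
  echo "hello" > demo/readme.txt
  tar czf demo.tar.gz demo   # c = create, z = compress with gzip, f = file name
  rm -r demo
  tar zxvf demo.tar.gz       # z = uncompress, x = extract, v = list file names
  cat demo/readme.txt        # prints: hello
)
```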
After executing the tar command, you should find that you have a new directory in your home directory named FDPPDIV-1.2.
cd into that directory and you will find two subdirectories (example and src) as well as a README file.
cat the README file: it gives some literature citations and some web sites you might be interested in visiting.
Now cd into the src directory to prepare for the next step.
Source code vs. executable programs
What you've downloaded is not the FDPPDiv program, but the C++ source code that can be used to create the FDPPDiv program. Computer software comes in many shapes and sizes.
In scripting languages like Python and R, the source code is the program. You pass these "scripts" to the Python or R interpreter and they translate them on the fly into binary machine code that the computer understands (i.e. a series of 1s and 0s).
Languages like Java are intermediate: Java source code gets compiled into something like machine code (called bytecode) that can be quickly converted into machine code on whatever operating system it is run under.
Finally, there are the compiled languages (Fortran, C, C++, Pascal, etc.). Programs written in these languages must be converted completely (compiled) to machine code before they can be executed. The benefit of compiled languages is that they produce the fastest-running programs. The downside is that they are less convenient: any change made to the source code requires recompilation and the result will run on only one platform (i.e. Windows, Mac, Linux). Because of the operating system specificity, authors of programs that must be compiled often just distribute the source code and let users compile it on whatever system they wish to run it on. That is the case with FDPPDiv.
You should now be in the directory ~/FDPPDIV-1.2/src (the tilde ~ is short for your home directory). If the pwd command says you are somewhere else, navigate to that directory now and type the following:
make
This will compile one version of FDPPDiv: there are several versions that could be compiled depending on the capabilities of the compiler and processor, and some of these cannot be built on our cluster because the processors lack the necessary instruction sets.
The make command looks for a file named Makefile in the current directory that contains instructions on how to build the software. As make runs, you will see many lines that start with g++:
g++ -DHAVE_CONFIG_H -O2 -fomit-frame-pointer -funroll-loops -c -o dppdiv.o dppdiv.cpp
g++ -DHAVE_CONFIG_H -O2 -fomit-frame-pointer -funroll-loops -c -o Alignment.o Alignment.cpp
g++ -DHAVE_CONFIG_H -O2 -fomit-frame-pointer -funroll-loops -c -o MbEigensystem.o MbEigensystem.cpp
.
.
.
g++ -o dppdiv-seq-sse -O2 -msse3 -D_TOM_SSE3 dppdiv.o Alignment.o MbEigensystem.o MbMath.o MbRandom.o MbTransitionMatrix.o Mcmc.o Parameter.o Parameter_basefreq.o Parameter_exchangeability.o Parameter_rate.o Parameter_shape.o Parameter_tree.o Parameter_cphyperp.o Parameter_treescale.o Parameter_speciaton.o Parameter_expcalib.o Calibration.o Model.o Model_likelihood-seq-sse.o
Each of these lines represents a command that you could run yourself; make just automates the process for you. The g++ command is the GNU C++ compiler, which is the standard C++ compiler on Linux operating systems. The first line involves compiling the source code file dppdiv.cpp into an object code file dppdiv.o. Once all source code files are compiled into their ".o" object code equivalents, these object code files are linked together by a type of program called (not surprisingly) a linker. The g++ command serves as both compiler and linker, and the last line of output that I've shown above represents the linking phase, where g++ is handed a bunch of ".o" files and is told to link them together into a single executable named dppdiv-seq-sse (the "-o" switch provides the desired name of the executable).
Once the program is compiled, you should see a file named dppdiv-seq-sse appear in the src directory.
Simulate a tree and sequence data
Before analyzing a real example, let's begin by analyzing a simulated data set. This is always a good idea when using a phylogenetics program for the first time. Because you know the truth in the case of simulated data, this provides a sanity check to make sure you are interpreting the output of the program correctly. Today we will simulate a tree using the Yule pure-birth process, then generate sequence data based on the simulated tree under a strict clock model using the popular Seq-Gen program, and finally analyze the simulated data set using FDPPDiv.
First, simulate a tree
I have written a Python program that simulates a Yule tree for any number of taxa using a speciation rate that you specify. You can download it to your own laptop from the URL below, but today we will just use it on the cluster, so use curl to get it into your home directory:
curl -L http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/yuletree.py > yuletree.py
Run the program as follows to generate a Yule tree with 20 tips using a speciation rate of 1.0:
Important: replace <seed> below with a random seed of your choice (should be a whole number greater than 0)
python yuletree.py 20 1.0 1 <seed>
Note the expected tree height and expected tree length reported. These two numbers will be important later.
The yuletree.py program saves the simulated tree in two ways:
- the yulesim.tre file is a NEXUS tree file that you should download to your laptop and view using FigTree
- the yulesim.txt file contains just the tree description. You can view this using cat yulesim.txt
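Where do those expected values come from? Under the Yule process, the waiting time while there are k lineages is exponential with rate k*lambda, which gives the standard results E[height] = (1/lambda) times the sum of 1/k for k from 2 to n, and E[tree length] = (n - 1)/lambda. The snippet below is just my own arithmetic check of those standard formulas, not part of yuletree.py:

```shell
# Check the expected height and total length of a Yule tree with
# n = 20 tips and speciation rate lam = 1.0 (the values used above).
n=20
lam=1.0
awk -v n="$n" -v lam="$lam" 'BEGIN {
  h = 0
  for (k = 2; k <= n; k++)
    h += 1.0 / (k * lam)   # time spent with k lineages averages 1/(k*lam)
  printf "expected height: %.4f\n", h
  printf "expected length: %.4f\n", (n - 1) / lam
}'
# prints:
# expected height: 2.5977
# expected length: 19.0000
```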
Second, download and compile Seq-Gen
For simulating sequence data, we will use the venerable Seq-Gen program, written by Andrew Rambaut, who is also responsible for BEAUti, FigTree, Tracer, much of BEAST, and many more useful programs. Like FDPPDiv, Seq-Gen is distributed as a compressed tar archive containing source code, so you will need to compile it in order to use it. The steps below should be somewhat familiar now that you've gone through this process once already for FDPPDiv.
Download the Seq-Gen gzipped tarball using curl:
curl -L 'http://tree.bio.ed.ac.uk/download.php?id=85' > sg.tar.gz
Unpack sg.tar.gz using tar:
tar zxvf sg.tar.gz
Navigate to the source directory:
Use the ls command to see what's there:
ls
You should see a lot of files ending in ".c" and ".h", plus the Makefile. This program is written in the C language (FDPPDiv was written in C++, which is a derivative of C).
Compile Seq-Gen using make:
make
This will produce the following output:
cc -O3 -c seq-gen.c
cc -O3 -c global.c
cc -O3 -c evolve.c
cc -O3 -c model.c
cc -O3 -c nucmodels.c
cc -O3 -c aamodels.c
cc -O3 -c eigen.c
eigen.c:38: warning: conflicting types for built-in function 'cexp'
cc -O3 -c treefile.c
cc -O3 -c progress.c
cc -O3 -c gamma.c
cc -O3 -c twister.c
cc -O3 -o seq-gen seq-gen.o global.o evolve.o model.o nucmodels.o aamodels.o eigen.o treefile.o progress.o gamma.o twister.o -lm
Note that this time make used the C compiler cc rather than the C++ compiler (g++) used previously for compiling FDPPDiv. The last line is the link stage, where all the object files (with ".o" endings) are linked together to create the executable file seq-gen. (You can generally ignore warnings such as the one reported here; if there is an error, however, the linker will not be able to generate the executable.)
You should now see the executable seq-gen in the source directory. Let's use it now to simulate sequence data.
Third, simulate the sequence data
Move the executable to a new directory named sg in your home directory (the tilde ~ can always be used as an alias for your home directory):
mkdir ~/sg
mv seq-gen ~/sg
Now move the yulesim.txt and yulesim.tre files you generated earlier to this same directory (I am assuming they are still in your home directory):
cd ~
mv yulesim.txt yulesim.tre sg
Seq-Gen comes with excellent documentation (download the contents of the documentation directory to your laptop and open Seq-Gen.Manual.html with a web browser to view it), but for the sake of time I will spare you the effort of figuring out how to use the program. Run seq-gen as follows:
./seq-gen -mHKY -l5000 -n1 -p1 -s0.1 -f0.25,0.25,0.25,0.25 -t2.0 -z1226548 -op < yulesim.txt > simulated.dat
Here is a summary of the options we used:
- -mHKY means use the HKY model
- -l5000 means generate sequences of length 5000 sites
- -n1 means generate just 1 simulated data set (-n10 would generate 10 data sets)
- -p1 means place all data in one partition subset (i.e. don't partition)
- -s0.1 means use 0.1 as the tree scaling factor (this value effectively multiplies all edge lengths by 0.1)
- -f0.25,0.25,0.25,0.25 specifies the nucleotide frequencies (all equal in this case)
- -t2.0 specifies the transition/transversion ratio (this is not the rate ratio kappa)
- -z1226548 specifies the pseudorandom number seed to use
- -op specifies that the output file should be in phylip format, which is what FDPPDiv expects (use -on for nexus, but not for this exercise!)
- < yulesim.txt provides the tree description to seq-gen (the redirection operator < makes it appear to seq-gen as if the tree description were typed in by you)
- > simulated.dat saves the output of seq-gen to the file simulated.dat (without this redirection, seq-gen would spit out the sequence data to the screen, which is not very helpful)
After running seq-gen, you should now see the file simulated.dat in your ~/sg directory.
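One more note on the -t option: under the HKY model the transition/transversion ratio is tied to kappa through the base frequencies by the standard identity tratio = kappa (pA pG + pC pT) / ((pA + pG)(pC + pT)). The quick awk check below (my own arithmetic, not part of Seq-Gen) plugs in the values from the command above:

```shell
# Convert seq-gen's -t value (transition/transversion ratio) to HKY's kappa,
# using the equal base frequencies specified with -f above.
awk 'BEGIN {
  tratio = 2.0
  pA = 0.25; pC = 0.25; pG = 0.25; pT = 0.25
  # tratio = kappa * (pA*pG + pC*pT) / ((pA+pG) * (pC+pT)), solved for kappa:
  kappa = tratio * (pA + pG) * (pC + pT) / (pA*pG + pC*pT)
  printf "kappa = %.1f\n", kappa   # prints: kappa = 4.0
}'
```

With equal base frequencies this reduces to kappa = 2 * tratio, so -t2.0 corresponds to kappa = 4.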
- Why did we scale the tree by 0.1?
Run PAUP*
Before trying FDPPDiv, let's first run PAUP* to see what maximum likelihood says about our simulated data.
Be sure that the yulesim.tre file you generated using yuletree.py is in the same directory (~/sg) as the simulated.dat file, then create a paup command file named paup.nex (or whatever you prefer) in that same directory containing the following:
#nexus
begin paup;
  tonexus from=simulated.dat to=simulated.nex format=phylip replace;
  exe simulated.nex;
  set criterion=likelihood;
  lset clock nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate pinvar=estimate;
  gettrees file=yulesim.tre;
  describe 1 / brlens plot=phylogram;
end;
This will first convert the phylip-formatted simulated.dat file into a nexus file, then PAUP will execute that nexus file. It will load in the tree you used for the simulation and estimate parameters on that tree using a strict molecular clock and the HKY+I+G model. The describe command will output a list of branch lengths and parameter values, allowing us to check how accurately parameters can be estimated from our simulated data.
Run PAUP* as follows (you should still be in a qlogin session):
paup paup.nex
- What is the tree length reported by PAUP?
- What tree length did you expect?
- What are the base frequencies reported by PAUP?
- What base frequencies did you expect?
- What is the transition/transversion ratio reported by PAUP?
- What transition/transversion ratio did you expect?
- What is the proportion of invariable sites and gamma shape reported by PAUP?
- What proportion of invariable sites and gamma shape did you expect?
Run FDPPDiv
Assuming you are still in the ~/sg directory, copy (or move) the dppdiv-seq-sse file there:
cd ~/FDPPDIV-1.2/src
mv dppdiv-seq-sse ~/sg
cd ~/sg
Run dppdiv-seq-sse once with the command line option -h to show the program options:
./dppdiv-seq-sse -h
Among the output you should see a list of options; take special note of the starred (required) options and of the ones we will specify in the command below:
Here are the available options that you can change (default values are in [square brackets]):
-h    : print this menu **
-hf   : display example file formats **
-in   : Input file name **
-out  : output file name prefix **
-tre  : tree file name **
-pm   : prior mean of number of rate categories [= 1.0]
-ra   : shape for gamma of rates [= 2.0]
-rb   : scale for gamma of rates [= 4.0]
-hsh  : shape for gamma hyper prior on alpha concentration parameter [= 2.0]
-hsc  : scale for gamma hyper prior on alpha concentration parameter [calculated from prior mean on categories]
-n    : Number of MCMC cycles [= 1000000]
-pf   : print frequency [= 100]
-sf   : sample frequency [= 100]
-s1   : seed 1 (use this if you only pass in one seed)
-s2   : seed 2
-ubl  : use input branch lengths
-snm  : single node move is turned on
-rdn  : random-order node moves
-offm : turn off a given move (1,2,3,4,5,6)
-rnp  : return 0.0 for lnl, run under prior
-cal  : file name with internal node calibrations
-npr  : 1=uniform, 2=yule, 3=cbd, 4=cbd fix with given vals
-bdr  : initial diversification rate (lambda - mu)
-bda  : initial relative death rate (mu / lambda)
-soft : turn on soft bounds on calibrated nodes
-clok : run under strict clock (and estimate substitution rate)
-urg  : run under uncorrelated gamma-distributed rates
-exhp : all calibrated nodes are offset exponential and this also turns on the hyperprior on the exp rates
-dphp : all cal nodes have a DPM hyperprior this also gets a value for the expected # of calibration clusters
-ghp  : hyperprior on calibrations from a gamma
-mup  : modify update probabilities mid run
-fxm  : fix some model params
-ihp  : run under independent hyperprior on exp cals
** required
We now have a data file (simulated.dat), a tree (yulesim.txt), and the FDPPDiv executable (dppdiv-seq-sse) in place; the only other thing needed is a file describing our "fossil" calibrations. I've put fossil in quotes because we will need to make up fake fossil data. In order to create a fake fossil that makes sense, you will need to download the yulesim.tre file to your laptop and view it with FigTree. Click the scale bar checkbox to place a time scale below the tree, and check reverse axis so that the time scale starts with 0 at the tips and increases toward the root. Choose a node somewhere close to the middle. If that node is at, say, 1.5 on the scale, and if this node represents the most recent common ancestor (MRCA) of taxon G and taxon K, create a file fake_fossil.cal with the following text:
1
-t G K 1.5
This specifies that you have a single fossil calibration node, that the fossil is dated at 1.5 time units, and it is calibrating the MRCA of G and K.
We are now in a position to run FDPPDiv on simulated.dat:
./dppdiv-seq-sse -in simulated.dat -tre yulesim.txt -cal fake_fossil.cal -tga -clok -n 10000 -sf 10 -pf 100 -out test_fbd