|EEB 5349: Phylogenetics|
|In this lab you will learn how to use the program FDPPDiv, written by Tracy Heath. FDPPDiv offers two alternative approaches to divergence time estimation. The DPPDiv part refers to the Dirichlet Process Prior (DPP) model for divergence time estimation, and the F prefix (for Fossil) refers to the new Fossil Birth-Death approach. You will also learn a useful skill, how to download, compile, and run a C++ program from within your home directory on the cluster.|
Download the C++ source code for FDPPDiv
First, login to the cluster and use qlogin to get off the head node and begin a session on a compute node.
Use the curl (copy url) program to download the source code for FDPPDiv:
curl -L https://github.com/trayc7/FDPPDIV/archive/v1.2.tar.gz > fdppdiv-1.2.tar.gz
The curl manual page - use the command man curl to view, and to exit type :q (yes, that's a colon followed by a lower case letter q) - says this about the -L (location) option: "If the server reports that the requested page has moved to a different location..., this option will make curl redo the request on the new place." It is nearly always a good idea to include the -L option with curl, as many web sites for software perform redirects.
The curl command copies the file v1.2.tar.gz from the specified url and saves it in your home directory as the file fdppdiv-1.2.tar.gz. There is nothing wrong with saving the file under the same name, but the original name v1.2.tar.gz doesn't give us any hint about what is included (other than that it is version 1.2), so I thought it best to add the program name to the file name when it was saved.
The extension tar.gz implies that this is a compressed archive that will unpack into a directory containing many files. The abbreviation tar refers to the tar command, which we will use next to unpack the archive. The command name "tar" stands for "tape archive", which belies the age of this program, which was used in the old days to concatenate all files in a directory into a single file that could be saved onto a magnetic tape. The tar command has survived, whereas magnetic tapes have given way to hard drives.
The gz part of the extension tar.gz tells you that the file is compressed. The tar command just concatenates files, whereas the gzip program compresses a file, adding the suffix gz to the name.
You could uncompress the fdppdiv-1.2.tar.gz to create the (larger) file fdppdiv-1.2.tar (often called a "tarball"), and then in a second operation you could use the tar command to "untar" the tarball. We will instead use tar to both uncompress and untar the file as follows:
tar zxvf fdppdiv-1.2.tar.gz
The option "z" tells tar to uncompress the file, the option "x" tells tar to extract (untar) the resulting tar archive, the option "v" tells tar to be verbose and tell us what is going on (i.e. list each file name as it is extracted), and the option "f" says that the file name to extract is coming next on the command line. You should be sure "f" is the last option given to tar before the file name for obvious reasons. If you just had an uncompressed tarball to extract, you could (and should) leave off the "z" switch.
After executing the tar command, you should find that you have a new directory in your home directory named FDPPDIV-1.2.
cd into that directory and you will find two subdirectories (example and src) as well as a README file.
cat the README file: it gives some literature citations and some web sites you might be interested in visiting.
Now cd into the src directory to prepare for the next step.
What you've downloaded is not the FDPPDiv program, but the C++ source code that can be used to create the FDPPDiv program. Computer software comes in many shapes and sizes. In scripting languages like Python and R, the source code is the program. You pass these "scripts" to the Python or R interpreter and they translate them on the fly into binary machine code that the computer understands (1s and 0s). Languages like Java are intermediate: Java source code gets compiled into something like machine code (called bytecode) that can be quickly converted into machine code on whatever operating system it is run under. Finally, there are the compiled languages (Fortran, C, C++, Pascal, etc.). Programs written in these languages must be converted completely (compiled) to machine code before they can be executed. The benefit of compiled languages is that they produce the fastest-running programs. The downside is that they are less convenient: any change made to the source code requires recompilation and the result will run on only one platform (i.e. Windows, Mac, Linux). Because of the operating system specificity, authors of programs that must be compiled often just distribute the source code and let users compile it on whatever system they wish to run it on. That is the case with FDPPDiv.
You should now be in the directory ~/FDPPDIV-1.2/src (the tilde ~ is short for your home directory). If the pwd command says you are somewhere else, navigate to that directory now and type the following:
This will compile one version of FDPPDiv: there are several versions that could be compiled depending on the capabilities of the compiler and processor, and some of these cannot be built on our cluster because the processors do not have the appropriate technology.
The make command looks for a file named Makefile in the current directory that contains instructions on how to build the software. As make runs, you will see many lines that start with g++:
g++ -DHAVE_CONFIG_H -O2 -fomit-frame-pointer -funroll-loops -c -o dppdiv.o dppdiv.cpp g++ -DHAVE_CONFIG_H -O2 -fomit-frame-pointer -funroll-loops -c -o Alignment.o Alignment.cpp g++ -DHAVE_CONFIG_H -O2 -fomit-frame-pointer -funroll-loops -c -o MbEigensystem.o MbEigensystem.cpp . . . g++ -o dppdiv-seq-sse -O2 -msse3 -D_TOM_SSE3 dppdiv.o Alignment.o MbEigensystem.o MbMath.o MbRandom.o MbTransitionMatrix.o Mcmc.o Parameter.o Parameter_basefreq.o Parameter_exchangeability.o Parameter_rate.o Parameter_shape.o Parameter_tree.o Parameter_cphyperp.o Parameter_treescale.o Parameter_speciaton.o Parameter_expcalib.o Calibration.o Model.o Model_likelihood-seq-sse.o
Each of these lines represents a command that you could run yourself; make just automates the process for you. The g++ command is the Gnu C++ compiler, which is the standard C++ compiler on Linux operating systems. The first line involves compiling the source code file dppdiv.cpp into an object code file dppdiv.o. Once all source code files are compiled into their ".o" object code equivalents, these object code files are linked together by a type of program called (not surprisingly) a linker. The g++ command serves as both compiler and linker, and the last line of output that I've shown above represents the linking phase, where g++ is handed a bunch of ".o" files and is told to link them together into a single executable named dppdiv-seq-sse (the "-o" switch provides the desired name of the executable).
Once the program is compiled, you should see a file named dppdiv-seq-sse appear in the src directory.
Simulate some data
Before analyzing a real example, let's begin by analyzing a simulated data set. This is always a good idea when using a phylogenetics program for the first time. Because you know the truth in the case of simulated data, this provides a sanity check to make sure you are interpreting the output of the program correctly. Today we will simulate a tree using the Yule pure-birth process, then generate sequence data based on the simulated tree under a strict clock model using the popular Seq-Gen program, and finally analyze the simulated data set using FDPPDiv.
First, simulate a tree
I have written a Python program that simulates a Yule tree for any number of taxa and using a speciation rate that you specify. You can download it to your own laptop by clicking here, but today we will just use it on the cluster, so use curl to get it into your home directory:
curl -L http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/yuletree.py > yuletree.py