Phylogenetics: RevBayes Lab
EEB 5349: Phylogenetics
The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using RevBayes. There are other programs that are currently more popular than RevBayes for doing this (notably BEAST2), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
Login to Xanadu
Log in to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash
Once you are transferred to a free node, load the paml, paup, and revbayes modules:
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx
Create a directory
Use the unix mkdir command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab
Simulating and analyzing under the strict clock model
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about substitution rates or divergence times per se; they contain information about the number of substitutions that have occurred, and the number of substitutions is the product of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
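The confounding of rate and time can be illustrated with a minimal sketch. It assumes the Jukes-Cantor (JC69) model purely for simplicity (this is not the model used later in this lab), and the function name is hypothetical. Under JC69 the probability that a site differs along a branch depends on rate and time only through their product, so two very different scenarios with the same product look identical to the data.

```python
import math

def jc69_p_diff(rate, time):
    """Probability that a site differs across a branch under JC69."""
    d = rate * time  # expected substitutions per site along the branch
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Two different scenarios sharing the same product rate * time = 0.1
fast_and_young = jc69_p_diff(rate=1.0, time=0.1)
slow_and_old = jc69_p_diff(rate=0.01, time=10.0)
print(fast_and_young, slow_and_old)
```

Both calls return the same probability, which is why a prior on rate, time, or both is needed to tell the scenarios apart.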
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.
We will each use a different random number seed, so we should all get slightly different answers.
Simulate a tree
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):
- specify that you want to generate a rooted tree by typing 2
- specify 20 species
- specify 1 tree and a random number seed of your choosing
- specify 1 to answer yes to the question about wanting branch lengths
- specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above, which sets the tree height). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of expected height 1.
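One way to see why a birth rate near 2.6 gives an expected height of about 1 is sketched below. It assumes a pure birth process that starts with 2 lineages at the root and stops at the moment a 21st lineage would arise. While k lineages exist, the waiting time to the next speciation is exponential with mean 1/(k × birth rate), so the expected height is the sum of these means for k = 2 through 20. This is a check under those assumptions, not necessarily the calculation used to choose 2.6, and the function name is hypothetical.

```python
def expected_yule_height(n_tips, birth_rate):
    """Sum of mean waiting times while k = 2..n_tips lineages exist."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(n_tips=20, birth_rate=2.6)
print(round(height, 3))
```

Under these assumptions the expected height comes out very close to 1 (about 0.999).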
You should now find a tree description in the file evolver.out. Rename this file tree.txt and edit it so that it contains only the tree description on one line.
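If you would rather not edit the file by hand, the extraction can be sketched in Python. This is only a sketch: it assumes the tree description is the first part of the file that begins with "(" and ends with ";", which you should confirm by looking at your own evolver.out. The function name is hypothetical.

```python
def extract_newick(text):
    """Return the first '(' ... ';' span of text, collapsed onto one line."""
    start = text.find("(")
    end = text.find(";", start)
    if start == -1 or end == -1:
        raise ValueError("no tree description found")
    # Remove line breaks and spaces so the tree sits on a single line
    return "".join(text[start:end + 1].split())

# Made-up file contents, for illustration only
sample = "some header text\n((A: 0.1, B: 0.1): 0.2,\n C: 0.3);\n\n"
print(extract_newick(sample))

# To apply it to the real files:
# with open("evolver.out") as f:
#     tree = extract_newick(f.read())
# with open("tree.txt", "w") as f:
#     f.write(tree + "\n")
```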
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named control.dat with the following contents (2 lines require modification: seed and tree description):
2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4
Here's what each of those lines does (consult the evolver section of the PAML manual for more info about each option):
- line 1: 2 specifies that we want the output as a nexus file
- line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
- line 3: 20 taxa, 10000 sites, 1 data set
- line 4: -1 says to use the branch lengths in the tree description
- line 5: tree description: paste in the tree description you generated from the first evolver run here
- line 6: 4 specifies the HKY model
- line 7: set kappa equal to 5
- line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
- line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
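The nine lines described above can also be assembled programmatically, which makes it harder to misplace the seed or the tree. The following is a minimal sketch: the function name, the seed, and the tiny three-taxon tree are placeholders for illustration, and every other value is taken from the list above.

```python
def build_control(seed, tree):
    """Return the text of control.dat, one option per line."""
    lines = [
        "2",                # line 1: nexus output
        str(seed),          # line 2: random number seed
        "20 10000 1",       # line 3: taxa, sites, data sets
        "-1",               # line 4: use branch lengths in the tree
        tree,               # line 5: tree description
        "4",                # line 6: HKY model
        "5",                # line 7: kappa
        "0 0",              # line 8: gamma shape, rate categories
        "0.1 0.2 0.3 0.4",  # line 9: frequencies in the order T, C, A, G
    ]
    return "\n".join(lines) + "\n"

# Placeholder seed and tree; substitute your own seed and 20-taxon tree
example = build_control(seed=12345, tree="((A:0.5,B:0.5):0.5,C:1.0);")
print(example)
```

Writing the returned string to a file named control.dat would give the same file as typing it by hand.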
Run evolver now using this control file, selecting option (5) from the menu, which is "Simulate nucleotide data sets":
evolver 5 control.dat
You should now find a file named mc.nex containing the sequence data. You will need to manually edit this file and insert #nexus as its first line.
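The same edit can be sketched in Python if you prefer not to open an editor. The function name is hypothetical, and the check for an existing header is an added safeguard so that running it twice does not insert the line twice.

```python
def ensure_nexus_header(text):
    """Prepend '#nexus' unless the text already starts with it."""
    if text.lstrip().lower().startswith("#nexus"):
        return text
    return "#nexus\n" + text

# Made-up file contents, for illustration only
print(ensure_nexus_header("begin data;\nend;\n"))

# To apply it to the real file:
# with open("mc.nex") as f:
#     fixed = ensure_nexus_header(f.read())
# with open("mc.nex", "w") as f:
#     f.write(fixed)
```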
Use RevBayes to estimate the birth rate and clock rate
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!
Create the strict.Rev file
Create a new file named strict.Rev and add the following to it. I'll provide some explanation below the code block.
# Load data and tree
D <- readDiscreteCharacterData(file="mc.nex")
T <- readTrees("tree.txt")[1]
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters
nmoves = 1
nmonitors = 1

# Birth-death tree model
death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate, mu = death_rate, rho = sampling_fraction, rootAge = root_time, samplingStrategy = "uniform", condition = "nTaxa", taxa = taxa)
timetree.setValue(T)
Note that we are assigning only the first tree in tree.txt to the variable T (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the [1] to select the first anyway).
The functions beginning with dn (e.g. dnExponential and dnBDP) are probability distributions. Thus, birth_rate is a parameter that is assigned an Exponential prior distribution having rate 0.01, and timetree is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except lambda, which is set to the free parameter birth_rate.
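To get a feel for what an Exponential prior with rate 0.01 implies, here is a minimal Python sketch (the function name is hypothetical). An exponential distribution with rate r has mean 1/r, so this prior has mean 100 and is quite vague compared with the birth rate of 2.6 used in the simulation.

```python
import math

def exponential_pdf(x, rate):
    """Density of an exponential distribution with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

rate = 0.01
prior_mean = 1.0 / rate
print(prior_mean)

# Prior density at the simulated birth rate and at the prior mean
density_at_truth = exponential_pdf(2.6, rate)
density_at_mean = exponential_pdf(prior_mean, rate)
print(density_at_truth, density_at_mean)
```

The density declines slowly across this range, so the prior does little to favor one plausible birth rate over another.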
The setValue function sets the starting value of a parameter that is allowed to vary.
Each parameter in the model requires a mechanism to propose changes to its value. These are called moves. A vector of moves has been created for you, so you need only add to it. The variable nmoves keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable nmoves so that new moves will not overwrite previously defined moves. This increment is performed by the ++ in nmoves++. The fact that the ++ follows nmoves means that nmoves will be incremented after its value is used. If we had used ++nmoves instead, nmoves would have been incremented and then used, which would be incorrect because vector indices in the Rev Language, as in R, start at 1, not 0.
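Python has no ++ operator, but the difference between the two forms can be mimicked in a short sketch. The function names are hypothetical, and each function simply records which vector index would be used by successive assignments when the counter starts at 1.

```python
def post_increment_indices(start, n):
    """Indices used by n successive moves[nmoves++] assignments."""
    indices = []
    nmoves = start
    for _ in range(n):
        indices.append(nmoves)  # the current value is used first...
        nmoves += 1             # ...and then incremented
    return indices

def pre_increment_indices(start, n):
    """Indices used by n successive moves[++nmoves] assignments."""
    indices = []
    nmoves = start
    for _ in range(n):
        nmoves += 1             # incremented first...
        indices.append(nmoves)  # ...and then used
    return indices

print(post_increment_indices(1, 3))  # [1, 2, 3]
print(pre_increment_indices(1, 3))   # [2, 3, 4]
```

With the counter starting at 1, only the post-increment form fills element 1 of the vector.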
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
- Stochastic nodes are exemplified by birth_rate and timetree; they can be identified by the tilde (~) symbol used to assign a prior distribution.
- Constant nodes are exemplified by death_rate, sampling_fraction, and root_time; they can be identified by the assignment operator <- that fixes their value to a constant.
- Deterministic nodes are exemplified by diversification; they can be identified by the assignment operator :=. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).
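The three node types listed above can be summarized in a small Python sketch of the graph. This is only an illustration of the structure described in this lab, not how RevBayes represents its DAG internally; the dictionary layout and function name are hypothetical.

```python
# Node name -> (node type, parent nodes), following the strict.Rev script
dag = {
    "death_rate": ("constant", []),
    "sampling_fraction": ("constant", []),
    "root_time": ("constant", []),
    "birth_rate": ("stochastic", []),
    "diversification": ("deterministic", ["birth_rate", "death_rate"]),
    "timetree": ("stochastic", ["birth_rate", "death_rate",
                                "sampling_fraction", "root_time"]),
}

def nodes_of_type(kind):
    """Return the sorted names of all nodes of the given type."""
    return sorted(name for name, (node_type, _) in dag.items()
                  if node_type == kind)

for kind in ("constant", "stochastic", "deterministic"):
    print(kind, nodes_of_type(kind))
```

Every parent named in the dictionary is itself a node, and no node is its own ancestor, which is what makes the graph directed and acyclic.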
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.