Phylogenetics: RevBayes Lab
EEB 5349: Phylogenetics
The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using RevBayes. There are other programs that are currently more popular than RevBayes for doing this (notably BEAST2), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is very explicitly defined in RevBayes.
Log in to Xanadu
Log in to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash
Once you are transferred to a free node, type
module load RevBayes/xxx
Create a directory
Use the unix mkdir command to create a directory to play in today:
cd ~    # you can omit this if you are already in your home directory
mkdir rblab
Downloading and compiling indelible
We will use the program indelible to simulate trees and data. Start by filling out the web form and downloading the software from the indelible web site.
Transfer the INDELibleV1.03.tar.gz file to the Xanadu cluster and unpack it using tar:
tar zxvf INDELibleV1.03.tar.gz
Navigate into the folder INDELibleV1.03/src and enter the following command to compile the program:
g++ -o indelible -O4 indelible.cpp
Once this command has finished, you will find a file named 'indelible' in that same src directory. Move that file to your rblab folder as follows:
cd ~
mv INDELibleV1.03/src/indelible rblab
Simulating and analyzing under the strict clock model
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about substitution rates or divergence times per se; they contain information about the number of substitutions that have occurred, and the expected number of substitutions is the product of rate and time. Thus, maximum likelihood methods cannot separate rates from times; separating them requires a Bayesian approach and careful use of priors, which constrain the range of rate and time scenarios considered plausible.
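The confounding of rate and time can be made concrete with a small sketch. It assumes the Jukes-Cantor (JC69) model purely for simplicity (the simulation in this lab uses HKY instead), and the function name jc69_prob_differ is hypothetical. Under JC69, the probability that a site differs between the two ends of a branch depends only on the product of rate and time, so quite different rate/time pairs look identical to the data.

```python
import math

def jc69_prob_differ(rate, time):
    """Probability that a site differs between the two ends of a branch under JC69.

    rate is in substitutions per site per unit time, so rate * time is the
    branch length in expected substitutions per site.
    """
    branch_length = rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))

# Three rate/time scenarios that share the same product (2.0 substitutions per site)
scenarios = [(2.0, 1.0), (1.0, 2.0), (4.0, 0.5)]
probs = [jc69_prob_differ(r, t) for r, t in scenarios]

for (r, t), p in zip(scenarios, probs):
    print(f"rate={r}, time={t}: P(site differs) = {p:.6f}")
```

All three scenarios print the same probability, which is why a prior on rates, on times, or on both is needed to tell them apart.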
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.
Let's simulate data for 10000 sites on a 20-taxon pure-birth (Yule) tree using a strict clock. Because we choose every setting ourselves, we will know everything: the birth rate of the tree-generating process, the "clock" rate (i.e., the substitution rate that applies to the entire tree), and the model used for simulation.
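As a rough illustration of what a pure-birth process looks like, the sketch below simulates the waiting times between speciation events. It is not how indelible generates or rescales its tree; the function name, the choice to start from two lineages at the root, and the reuse of 13579 as a seed are assumptions made for illustration. With k lineages each splitting at birth rate lambda, the waiting time to the next speciation is exponentially distributed with rate k * lambda.

```python
import random

def yule_waiting_times(ntaxa, birth_rate, rng):
    """Waiting times between speciation events, starting from 2 lineages at the root."""
    times = []
    for k in range(2, ntaxa):
        # with k lineages present, the next split arrives at total rate k * birth_rate
        times.append(rng.expovariate(k * birth_rate))
    return times

rng = random.Random(13579)
waits = yule_waiting_times(20, 3.0, rng)

# expected time from the root until the 20th lineage appears: sum of 1 / (k * lambda)
expected_age = sum(1.0 / (k * 3.0) for k in range(2, 20))

print(f"simulated time from root to the 20th lineage: {sum(waits):.3f}")
print(f"expected time from root to the 20th lineage:  {expected_age:.3f}")
```

Lineages accumulate faster and faster as the tree grows, so the waiting times near the tips tend to be much shorter than those near the root.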
Creating a control file for indelible
Indelible requires a control file specifying everything it needs to know to perform your simulation.
Create a control file named control.txt with the following contents:
[TYPE] NUCLEOTIDE 1
[SETTINGS]
  [output] NEXUS
  [randomseed] 13579
[MODEL] mymodel
  [submodel] HKY 5             // HKY model with kappa = 5
  [rates] 0.0 0.5 10           // pinvar, gamma shape, number of categories
  [statefreq] 0.3 0.2 0.3 0.2  // T, C, A, G
[TREE] mytree
  [rooted] 20 3.0 0.0 1.0 2.0  // ntaxa, birth rate, death rate, sampling fraction, clock rate
  [treedepth] 1.0              // tree rescaled to have this depth (root to tip time)
[PARTITIONS] mypartition
  [mytree mymodel 10000]       // 10000 sites using mytree and mymodel
[EVOLVE] mypartition 1 strict  // 1 replicate using mypartition saved to file strict.nex
These options are explained in the online indelible manual, but here is a summary of what these commands do:
- a tree of 20 taxa is generated using a birth-death process with birth rate 3 and death rate 0 (i.e. pure birth).
- the clock rate is set to 2.0 substitutions per unit time per site
- the rooted (ultrametric) tree is rescaled to have height 1 from root to tip
- the substitution model is HKY+G with kappa=5, gamma shape 0.5 (10 rate categories, no invariable sites), and state frequencies equal to A=0.3, C=0.2, G=0.2, and T=0.3
- 10000 sites are simulated and the output is saved to the file strict.nex
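Because the [statefreq] line lists the frequencies in the order T, C, A, G rather than alphabetically, it is easy to misread. Here is a minimal sketch (the variable names are illustrative and not part of indelible) that relabels the values from the control file, checks that they sum to 1, and prints them in the alphabetical order used in the summary:

```python
# values copied from the [statefreq] line of control.txt
statefreq_line = [0.3, 0.2, 0.3, 0.2]
# order given by the comment on that line of the control file
control_file_order = ["T", "C", "A", "G"]

freqs = dict(zip(control_file_order, statefreq_line))

# the four frequencies must form a probability distribution
assert abs(sum(freqs.values()) - 1.0) < 1e-9

# print in alphabetical order for comparison with the summary
for base in "ACGT":
    print(f"{base} = {freqs[base]}")
```

The same check is worth repeating whenever you edit the frequencies, since swapping two values silently changes the simulated base composition.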