Phylogenetics: RevBayes Lab

From EEBedia
Revision as of 20:29, 19 April 2020 by Paul Lewis (Talk | contribs) (Create the strict.Rev file)

Jump to: navigation, search
Adiantum.png EEB 5349: Phylogenetics
The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using RevBayes. There are other programs that are currently more popular than RevBayes for doing this (notably BEAST2), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.

Getting started

Login to Xanadu

Login to Xanadu and request a machine as usual:

srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules

module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

Create a directory

Use the unix mkdir command to create a directory to play in today:

cd ~  # you can omit this line if you are already in your home directory
mkdir rblab

Simulating and analyzing under the strict clock model

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about substitution rates or divergence times per se; they contain information about the number of substitutions that have occurred, and the number of substitutions is the product of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

PAML evolver

Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

Simulate a tree

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

  • specify that you want to generate a rooted tree by typing 2
  • specify 20 species
  • specify 1 tree and a random number seed of your choosing
  • specify 1 to answer yes to the question about wanting branch lengths
  • specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yield a tree of expected height 1.

You should now find a tree description in the file evolver.out. Rename this file tree.txt and edit it so that it contains only the tree description on one line.

Simulate sequences

The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named control.dat with the following contents (2 lines require modification: seed and tree description):

seed goes here
20 10000 1

tree description goes here

0 0 
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the PAML manual for more info about each option):

  • line 1: 2 specifies that we want the output as a nexus file
  • line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
  • line 3: 20 taxa, 10000 sites, 1 data set
  • line 4: -1 says to use the branch lengths in the tree description
  • line 5: tree description: paste in the tree description you generated from the first evolve run here
  • line 6: 4 specifies the HKY model
  • line 7: set kappa equal to 5
  • line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
  • line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".

evolver 5 control.dat

You should now find a file named mc.nex containing the sequence data. You will need to manually edit this file and insert the #nexus at the beginning.

Use RevBayes to estimate the birth rate and clock rate

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

Create the strict.Rev file

Create a new file named strict.Rev and add the following to it for starters:

# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
T <- readTrees("tree.txt")[1]
taxa  <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1
# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate, 
                mu = death_rate, 
                rho = sampling_fraction, 
                rootAge = root_time, 
                samplingStrategy = "uniform", 
                condition = "nTaxa", 
                taxa = taxa)