Difference between revisions of "Phylogenetics: RevBayes Lab"

From EEBedia
Jump to: navigation, search
(Load the PAML module)
(Simulating and analyzing under the strict clock model)
Line 27: Line 27:
 
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.
 
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.
  
 +
==== PAML evolver ====
 
Let's simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, and the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.
 
Let's simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, and the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.
  
 
+
We will use the evolver program, which is part of Ziheng Yang's PAML package:
 
  module load paml/4.9
 
  module load paml/4.9
  
Line 39: Line 40:
 
  birth rate, death rate, sampling fraction, and mutation rate (tree height)?
 
  birth rate, death rate, sampling fraction, and mutation rate (tree height)?
 
  2.6 0.0 1.0 1.0
 
  2.6 0.0 1.0 1.0
 +
 +
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yield a tree of ''expected'' height 1.
  
 
You should now find a tree description in the file ''evolver.out''. Rename this file before running evolver again, otherwise it will be wiped out.
 
You should now find a tree description in the file ''evolver.out''. Rename this file before running evolver again, otherwise it will be wiped out.
  
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named control.dat with the following contents:
+
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (everything after and including the asterisk on each line is a comment):
  
 
  2              * 2 means paup format (mc.nex)  
 
  2              * 2 means paup format (mc.nex)  
Line 54: Line 57:
 
  0.1 0.2 0.3 0.4 * base frequencies T C A G  
 
  0.1 0.2 0.3 0.4 * base frequencies T C A G  
  
If you compare the tree specified here with the one saved in the ''evolver.out'' file, they should be identical. I simply copied the tree description and pasted it into this control.dat file.
+
If you compare the tree specified here with the one saved in the ''evolver.out'' file, they should be identical. I simply copied the tree description and pasted it into this ''control.dat'' file.
 +
 
 +
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
 +
evolver 5 control.dat
 +
 
 +
You should now find a file named ''mc.nex'' containing the sequence data. You will need to manually edit this file and insert the <tt>#nexus</tt> at the beginning. (If you want to avoid this manual step, see the information about the paupbegin, paupend, and paupblock files in the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual].)

Revision as of 17:58, 19 April 2020

Adiantum.png EEB 5349: Phylogenetics
The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using RevBayes. There are other programs that are currently more popular than RevBayes for doing this (notably BEAST2), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is very explicitly defined in RevBayes.

Getting started

Login to Xanadu

Login to Xanadu and request a machine as usual:

srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, type

module load RevBayes/xxx

Create a directory

Use the unix mkdir command to create a directory to play in today:

cd ~            # you can omit this if you are already in your home directory
mkdir rblab

Simulating and analyzing under the strict clock model

Divergence time analyses are the most tricky type of analysis we will do in this course. That's because the sequences do not contain information about substitution rates or divergence times per se; they contain information about the number of substitutions that have occurred, and the number of substitutions is the product of rate and time. Thus, maximum likelihood methods cannot separate rates from times; this requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

PAML evolver

Let's simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, and the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will use the evolver program, which is part of Ziheng Yang's PAML package:

module load paml/4.9

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the following at the prompts:

(2) Get random ROOTED trees 
No. of species: 20
number of trees & random number seed? 1 13579
Want branch lengths from the birth-death process (0/1)? 1
birth rate, death rate, sampling fraction, and mutation rate (tree height)?
2.6 0.0 1.0 1.0

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yield a tree of expected height 1.

You should now find a tree description in the file evolver.out. Rename this file before running evolver again, otherwise it will be wiped out.

The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named control.dat with the following contents (everything after and including the asterisk on each line is a comment):

2               * 2 means paup format (mc.nex) 
367891          * random number seed (odd number)
20 10000 1      * <# seqs> <# nucleotide sites> <# replicates>
-1              * <tree length, use -1 if tree has absolute branch lengths>
((F: 0.446250, (M: 0.333944, ((((Q: 0.036930, P: 0.036930): 0.000319, S: 0.037249): 0.018778, R: 0.056028): 0.162227, B: 0.218255): 0.115689): 0.112306): 0.553750, (((((D: 0.088527, G: 0.088527): 0.134213, (H: 0.017997, (T: 0.013999, O: 0.013999): 0.003998): 0.204743): 0.052342, C: 0.275081): 0.148733, (((E: 0.006469, K: 0.006469): 0.177133, (L: 0.043181, A: 0.043181): 0.140421): 0.182643, N: 0.366246): 0.057569): 0.056860, (I: 0.466020, J: 0.466020): 0.014654): 0.519325);
4               * model: 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85, 5:T92, 6:TN93, 7:REV
5               * kappa or rate parameters in model
0 0             * <alpha> <#categories for discrete gamma>
0.1 0.2 0.3 0.4 * base frequencies T C A G 

If you compare the tree specified here with the one saved in the evolver.out file, they should be identical. I simply copied the tree description and pasted it into this control.dat file.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".

evolver 5 control.dat

You should now find a file named mc.nex containing the sequence data. You will need to manually edit this file and insert the #nexus at the beginning. (If you want to avoid this manual step, see the information about the paupbegin, paupend, and paupblock files in the evolver section of the PAML manual.)