Difference between revisions of "Phylogenetics: RevBayes Lab"

(Simulating and analyzing under the strict clock model)
Line 4: Line 4:
 
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
 
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
 
|-
 
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is very explicitly defined in RevBayes.
+
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
 
|}
 
|}
  
Line 13: Line 13:
 
  srun --pty -p mcbstudent --qos=mcbstudent bash
 
  srun --pty -p mcbstudent --qos=mcbstudent bash
  
Once you are transferred to a free node, type
+
Once you are transferred to a free node, load the paml, paup, and revbayes modules
 +
module load paml/4.9
 +
module load paup/4.0a-166
 
  module load RevBayes/xxx
 
  module load RevBayes/xxx
  
=== Create a directory ===
+
== Create a directory ==
 
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
 
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
  cd ~           # you can omit this if you are already in your home directory
+
  cd ~ # you can omit this line if you are already in your home directory
 
  mkdir rblab
 
  mkdir rblab
  
=== Simulating and analyzing under the strict clock model ===
+
== Simulating and analyzing under the strict clock model ==
  
Divergence time analyses are the most tricky type of analysis we will do in this course. That's because the sequences do not contain information about substitution rates or divergence times per se; they contain information about the number of substitutions that have occurred, and the number of substitutions  is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; this requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
+
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions  is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
  
 
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.
 
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.
  
==== PAML evolver ====
+
=== PAML evolver ===
Let's simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, and the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.
+
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.
  
We will use the evolver program, which is part of Ziheng Yang's PAML package:
+
We will each use a different random number seed, so we should all get slightly different answers.
module load paml/4.9
+
 
 +
==== Simulate a tree ====
 +
 
 +
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):
  
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the following at the prompts:
+
* specify that you want to generate a rooted tree by typing 2
(2) Get random ROOTED trees
+
* specify 20 species
No. of species: 20
+
* specify 1 tree and a random number seed of ''your'' choosing
number of trees & random number seed? 1 13579
+
* specify 1 to answer yes to the question about wanting branch lengths
Want branch lengths from the birth-death process (0/1)? 1
+
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
birth rate, death rate, sampling fraction, and mutation rate (tree height)?
+
2.6 0.0 1.0 1.0
+
  
 
One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
 
One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
Line 45: Line 48:
 
You should now find a tree description in the file ''evolver.out''. Rename this file before running evolver again, otherwise it will be wiped out.
 
You should now find a tree description in the file ''evolver.out''. Rename this file before running evolver again, otherwise it will be wiped out.
  
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (everything after and including the asterisk on each line is a comment):
+
==== Simulate sequences ====
 +
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):
  
  2               * 2 means paup format (mc.nex)
+
  2
  367891          * random number seed (odd number)
+
  seed goes here
  20 10000 1     * <# seqs> <# nucleotide sites> <# replicates>
+
  20 10000 1
  -1             * <tree length, use -1 if tree has absolute branch lengths>
+
  -1
((F: 0.446250, (M: 0.333944, ((((Q: 0.036930, P: 0.036930): 0.000319, S: 0.037249): 0.018778, R: 0.056028): 0.162227, B: 0.218255): 0.115689): 0.112306): 0.553750, (((((D: 0.088527, G: 0.088527): 0.134213, (H: 0.017997, (T: 0.013999, O: 0.013999): 0.003998): 0.204743): 0.052342, C: 0.275081): 0.148733, (((E: 0.006469, K: 0.006469): 0.177133, (L: 0.043181, A: 0.043181): 0.140421): 0.182643, N: 0.366246): 0.057569): 0.056860, (I: 0.466020, J: 0.466020): 0.014654): 0.519325);
+
tree description goes here
  4               * model: 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85, 5:T92, 6:TN93, 7:REV
+
  4  
  5               * kappa or rate parameters in model
+
  5
  0 0             * <alpha> <#categories for discrete gamma>
+
  0 0  
  0.1 0.2 0.3 0.4 * base frequencies T C A G
+
  0.1 0.2 0.3 0.4
  
If you compare the tree specified here with the one saved in the ''evolver.out'' file, they should be identical. I simply copied the tree description and pasted it into this ''control.dat'' file.
+
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
 +
* line 1: 2 specifies that we want the output as a nexus file
 +
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
 +
* line 3: 20 taxa, 10000 sites, 1 data set
 +
* line 4: -1 says to use the branch lengths in the tree description
 +
* line 5: tree description: paste in the tree description you generated from the first evolver run here
 +
* line 6: 4 specifies the HKY model
 +
* line 7: set kappa equal to 5
 +
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
 +
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
  
 
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
 
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
 
  evolver 5 control.dat
 
  evolver 5 control.dat
  
You should now find a file named ''mc.nex'' containing the sequence data. You will need to manually edit this file and insert the <tt>#nexus</tt> at the beginning. (If you want to avoid this manual step, see the information about the paupbegin, paupend, and paupblock files in the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual].)
+
You should now find a file named ''mc.nex'' containing the sequence data. You will need to manually edit this file and insert the <tt>#nexus</tt> at the beginning.

Revision as of 19:56, 19 April 2020

EEB 5349: Phylogenetics
The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using RevBayes. There are other programs that are currently more popular than RevBayes for doing this (notably BEAST2), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.

Getting started

Login to Xanadu

Log in to Xanadu and request a machine as usual:

srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules:

module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

Create a directory

Use the unix mkdir command to create a directory to play in today:

cd ~  # you can omit this line if you are already in your home directory
mkdir rblab

Simulating and analyzing under the strict clock model

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about substitution rates or divergence times per se; they contain information about the number of substitutions that have occurred, and the number of substitutions is the product of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
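
To make the confounding concrete, here is a minimal sketch in shell (the function name expected_subs is made up for this illustration, and awk is used only for the arithmetic). It computes the expected number of substitutions per site as rate × time for two different scenarios:

```shell
# Hypothetical helper: expected substitutions per site = rate * time.
expected_subs() {
    awk -v r="$1" -v t="$2" 'BEGIN { printf "%.2f\n", r * t }'
}

expected_subs 1.0 1.0   # slower clock, older divergence
expected_subs 2.0 0.5   # faster clock, younger divergence
```

Both calls print 1.00: a slow clock on an old tree and a fast clock on a young tree leave the same signal in the sequences, which is why the priors have to do the work of separating rate from time.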

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

PAML evolver

Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

Simulate a tree

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

  • specify that you want to generate a rooted tree by typing 2
  • specify 20 species
  • specify 1 tree and a random number seed of your choosing
  • specify 1 to answer yes to the question about wanting branch lengths
  • specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height

One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of expected height 1.
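
How might one arrive at a birth rate near 2.6? The derivation is not given here, but one simple calculation that lands close to it assumes that the tree starts with 2 lineages at the root and that, while there are k lineages, the waiting time to the next event is exponential with mean 1/(k × birth rate). Summing those expected waiting times for k = 2 through 20 and setting the total to 1 gives a birth rate equal to the sum of 1/k. A sketch of that arithmetic (the function name is hypothetical):

```shell
# Hypothetical helper: birth rate giving expected height 1 for an n-taxon
# pure birth tree, ASSUMING the height is the sum of exponential waiting
# times with mean 1/(k * rate) for k = 2..n lineages.
yule_rate_for_unit_height() {
    awk -v n="$1" 'BEGIN {
        s = 0
        for (k = 2; k <= n; k++) s += 1 / k
        printf "%.4f\n", s
    }'
}

yule_rate_for_unit_height 20
```

For 20 taxa this prints 2.5977, which rounds to the 2.6 used above. Treat it as a plausibility check under the stated assumptions rather than as the calculation actually used.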

You should now find a tree description in the file evolver.out. Rename this file before running evolver again; otherwise it will be overwritten.

Simulate sequences

The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named control.dat with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1

tree description goes here

4 
5
0 0 
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the PAML manual for more info about each option):

  • line 1: 2 specifies that we want the output as a nexus file
  • line 2: enter your own random number seed (it can be the same as the one you used for the tree)
  • line 3: 20 taxa, 10000 sites, 1 data set
  • line 4: -1 says to use the branch lengths in the tree description
  • line 5: tree description: paste in the tree description you generated from the first evolver run here
  • line 6: 4 specifies the HKY model
  • line 7: set kappa equal to 5
  • line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
  • line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
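
If you would rather not paste the tree by hand, the control file can be assembled in the shell. The following is only a sketch: make_control is a hypothetical helper, not part of PAML, and it assumes that the file you saved from the first evolver run has the tree description as its first line containing a semicolon. Check the resulting control.dat by eye before using it.

```shell
# Hypothetical helper: write control.dat from a seed and a saved tree file.
make_control() {
    seed="$1"       # your random number seed
    treefile="$2"   # your renamed copy of evolver.out
    # ASSUMPTION: the tree description is the first line containing ";".
    tree=$(grep ';' "$treefile" | head -n 1)
    cat > control.dat <<EOF
2
$seed
20 10000 1
-1
$tree
4
5
0 0
0.1 0.2 0.3 0.4
EOF
}
```

For example, if you renamed evolver.out to yule.tre (a name chosen here for illustration), typing make_control 13579 yule.tre writes control.dat in the current directory with your seed on line 2 and the tree on line 5.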

Run evolver now using this control file, selecting option 5 from the menu ("Simulate nucleotide data sets"):

evolver 5 control.dat

You should now find a file named mc.nex containing the sequence data. You will need to manually edit this file and insert #nexus as its first line.
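
The same edit can also be made from the command line. A minimal sketch using only standard tools (add_nexus_header is a hypothetical helper name); it leaves the file alone if the first line already starts with #nexus:

```shell
# Hypothetical helper: make "#nexus" the first line of a file.
add_nexus_header() {
    f="$1"
    # Do nothing if the header is already there (case-insensitive match).
    if head -n 1 "$f" | grep -qi '^#nexus'; then
        return 0
    fi
    # Write the header plus the original contents to a temporary file,
    # then replace the original.
    { printf '#nexus\n'; cat "$f"; } > "$f.tmp" && mv "$f.tmp" "$f"
}
```

Call it as add_nexus_header mc.nex from the directory containing the simulated data.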