Difference between revisions of "Phylogenetics: RevBayes Lab"
Paul Lewis (Talk | contribs) (→Simulating and analyzing under the strict clock model) |
Paul Lewis (Talk | contribs) |
||
Line 4: | Line 4: | ||
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span> | |<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span> | ||
|- | |- | ||
− | |The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is | + | |The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes. |
|} | |} | ||
Line 13: | Line 13: | ||
srun --pty -p mcbstudent --qos=mcbstudent bash | srun --pty -p mcbstudent --qos=mcbstudent bash | ||
− | Once you are transferred to a free node, | + | Once you are transferred to a free node, load the paml, paup, and revbayes modules |
+ | module load paml/4.9 | ||
+ | module load paup/4.0a-166 | ||
module load RevBayes/xxx | module load RevBayes/xxx | ||
− | + | == Create a directory == | |
Use the unix <tt>mkdir</tt> command to create a directory to play in today: | Use the unix <tt>mkdir</tt> command to create a directory to play in today: | ||
− | cd ~ | + | cd ~ # you can omit this line if you are already in your home directory |
mkdir rblab | mkdir rblab | ||
− | + | == Simulating and analyzing under the strict clock model == | |
− | Divergence time analyses are the | + | Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible. |
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data. | We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data. | ||
− | + | === PAML evolver === | |
− | Let's simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, | + | Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation. |
− | We will use | + | We will each use a different random number seed, so we should all get slightly different answers. |
− | + | ||
+ | ==== Simulate a tree ===== | ||
+ | |||
+ | First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space): | ||
− | + | * specify that you want to generate a rooted tree by typing 2 | |
− | + | * specify 20 species | |
− | + | * specify 1 tree and a random number seed of ''your'' choosing | |
− | + | * specify 1 to answer yes to the question about wanting branch lengths | |
− | + | * specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height | |
− | + | ||
− | + | ||
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yield a tree of ''expected'' height 1. | One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yield a tree of ''expected'' height 1. | ||
Line 45: | Line 48: | ||
You should now find a tree description in the file ''evolver.out''. Rename this file before running evolver again, otherwise it will be wiped out. | You should now find a tree description in the file ''evolver.out''. Rename this file before running evolver again, otherwise it will be wiped out. | ||
− | The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents ( | + | ==== Simulate sequences ==== |
+ | The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description): | ||
− | 2 | + | 2 |
− | + | seed goes here | |
− | 20 10000 1 | + | 20 10000 1 |
− | -1 | + | -1 |
− | + | tree description goes here | |
− | 4 | + | 4 |
− | 5 | + | 5 |
− | 0 0 | + | 0 0 |
− | 0.1 0.2 0.3 0.4 | + | 0.1 0.2 0.3 0.4 |
− | + | Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option): | |
+ | * line 1: 2 specifies that we want the output as a nexus file | ||
+ | * line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree) | ||
+ | * line 3: 20 taxa, 10000 sites, 1 data set | ||
+ | * line 4: -1 says to use the branch lengths in the tree description | ||
+ | * line 5: tree description: paste in the tree description you generated from the first evolve run here | ||
+ | * line 6: 4 specifies the HKY model | ||
+ | * line 7: set kappa equal to 5 | ||
+ | * line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity) | ||
+ | * line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!) | ||
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets". | Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets". | ||
evolver 5 control.dat | evolver 5 control.dat | ||
− | You should now find a file named ''mc.nex'' containing the sequence data. You will need to manually edit this file and insert the <tt>#nexus</tt> at the beginning. | + | You should now find a file named ''mc.nex'' containing the sequence data. You will need to manually edit this file and insert the <tt>#nexus</tt> at the beginning. |
Revision as of 19:56, 19 April 2020
EEB 5349: Phylogenetics | |
The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using RevBayes. There are other programs that are currently more popular than RevBayes for doing this (notably BEAST2), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes. |
Contents
Getting started
Login to Xanadu
Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash
Once you are transferred to a free node, load the paml, paup, and revbayes modules
module load paml/4.9 module load paup/4.0a-166 module load RevBayes/xxx
Create a directory
Use the unix mkdir command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory mkdir rblab
Simulating and analyzing under the strict clock model
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about substitution rates or divergence times per se; they contain information about the number of substitutions that have occurred, and the number of substitutions is the product of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.
PAML evolver
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.
We will each use a different random number seed, so we should all get slightly different answers.
Simulate a tree =
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):
- specify that you want to generate a rooted tree by typing 2
- specify 20 species
- specify 1 tree and a random number seed of your choosing
- specify 1 to answer yes to the question about wanting branch lengths
- specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yield a tree of expected height 1.
You should now find a tree description in the file evolver.out. Rename this file before running evolver again, otherwise it will be wiped out.
Simulate sequences
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named control.dat with the following contents (2 lines require modification: seed and tree description):
2 seed goes here 20 10000 1 -1
tree description goes here
4 5 0 0 0.1 0.2 0.3 0.4
Here's what each of those lines does (consult the evolver section of the PAML manual for more info about each option):
- line 1: 2 specifies that we want the output as a nexus file
- line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
- line 3: 20 taxa, 10000 sites, 1 data set
- line 4: -1 says to use the branch lengths in the tree description
- line 5: tree description: paste in the tree description you generated from the first evolve run here
- line 6: 4 specifies the HKY model
- line 7: set kappa equal to 5
- line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
- line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat
You should now find a file named mc.nex containing the sequence data. You will need to manually edit this file and insert the #nexus at the beginning.