Phylogenetics: RevBayes Lab (revision of 2020-05-12T23:38:12Z)<p>Paul Lewis: </p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Log in to Xanadu ==<br />
<br />
Log in to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and RevBayes modules:<br />
module load paml/4.9<br />
module load RevBayes/1.0.13<br />
<br />
There are currently some issues with filesystems on Xanadu, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
cd rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
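<br />
The confounding of rate and time can be illustrated with a minimal Python sketch. It assumes the Jukes-Cantor (JC69) model purely for simplicity (the lab itself simulates under HKY), and the function name is made up for this illustration:<br />
<br />
```python
import math

def jc69_prob_different(rate, time):
    """Probability that the two ends of a branch differ at a site under JC69.

    The result depends on rate and time only through their product,
    the expected number of substitutions per site.
    """
    d = rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Two very different scenarios with the same product rate * time = 0.2
fast_and_short = jc69_prob_different(rate=2.0, time=0.1)
slow_and_long = jc69_prob_different(rate=0.1, time=2.0)
print(fast_and_short, slow_and_long)
```
<br />
Both calls print the same probability, so the likelihood alone cannot tell a fast rate over a short time from a slow rate over a long time.<br />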
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified tree height (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
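<br />
One way to see where a value near 2.6 could come from is sketched below in Python. This is a reconstruction under stated assumptions, not a derivation taken from the PAML documentation: it assumes the process starts with 2 lineages at the root and that the interval with k lineages lasts 1/(k × birth rate) on average for k = 2, ..., 20. The function name is hypothetical.<br />
<br />
```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root-to-tip height of a pure-birth (Yule) tree.

    Assumption: the tree starts with 2 lineages at the root, and the
    interval during which k lineages exist has expected length
    1 / (k * birth_rate), for k = 2, ..., n_tips.
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(n_tips=20, birth_rate=2.6)
print(round(height, 3))
```
<br />
Under these assumptions, a birth rate of 2.6 gives an expected height of about 0.999 for 20 taxa.<br />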
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
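<br />
If you would rather generate the control file than edit it by hand, the nine lines described above can be assembled as in this Python sketch. The function name, the seed, and the toy 3-taxon tree are hypothetical placeholders; substitute your own seed and the tree description from ''tree.txt''.<br />
<br />
```python
def make_control_file(seed, newick_tree):
    """Return the text of an evolver control file laid out as described above."""
    lines = [
        "2",                # line 1: output format (nexus)
        str(seed),          # line 2: random number seed
        "20 10000 1",       # line 3: taxa, sites, data sets
        "-1",               # line 4: use branch lengths from the tree description
        newick_tree,        # line 5: tree description from the first evolver run
        "4",                # line 6: HKY model
        "5",                # line 7: kappa
        "0 0",              # line 8: gamma shape, number of rate categories
        "0.1 0.2 0.3 0.4",  # line 9: frequencies of T, C, A, G
    ]
    return "\n".join(lines) + "\n"

# Hypothetical seed and toy tree, for illustration only
text = make_control_file(12345, "((A:0.5,B:0.5):0.5,C:1.0);")
print(text)
```
<br />
Writing the returned text to ''control.dat'' (for example with <tt>open("control.dat", "w")</tt>) produces a file with the same layout as the template above.<br />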
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, so the first move would have been stored at index 2 and index 1 left empty. That would be incorrect because vector indices in the Rev language, as in R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
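<br />
The sliding window proposal just described can be sketched in a few lines of Python. This only illustrates the idea of a window of width <tt>delta</tt> centered on the current value; the function name is hypothetical, and RevBayes's own implementation may differ in details such as how it treats proposals that fall outside a parameter's support.<br />
<br />
```python
import random

def slide_proposal(current, delta, rng=random):
    """Propose a value drawn uniformly from a window of width delta
    centered on the current value."""
    return current + delta * (rng.random() - 0.5)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
current = 1.0
proposals = [slide_proposal(current, delta=1.0, rng=rng) for _ in range(1000)]
print(min(proposals), max(proposals))
```
<br />
With <tt>delta=1.0</tt> and a current value of 1.0, every proposal falls between 0.5 and 1.5. A smaller <tt>delta</tt> makes the move more conservative and usually raises the acceptance rate.<br />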
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
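<br />
A proposal that changes all elements of a simplex at once while keeping their sum at 1 can be sketched in Python as follows. The sketch assumes the new value is drawn from a Dirichlet distribution whose parameters are <tt>alpha</tt> times the current value, which is a common way to build such a move; whether this matches RevBayes's implementation exactly is an assumption, and the function name is hypothetical.<br />
<br />
```python
import random

def dirichlet_simplex_proposal(current, alpha, rng):
    """Draw a new simplex from Dirichlet(alpha * current).

    Uses the standard construction: independent Gamma draws
    divided by their sum.
    """
    draws = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
freqs = [0.1, 0.2, 0.3, 0.4]
for alpha in (10.0, 1000.0):
    proposed = dirichlet_simplex_proposal(freqs, alpha, rng)
    biggest_change = max(abs(p - f) for p, f in zip(proposed, freqs))
    print(alpha, round(sum(proposed), 6), biggest_change)
```
<br />
Each proposed vector sums to 1. Under this construction, larger values of <tt>alpha</tt> concentrate the proposal around the current value, so the proposed changes tend to be smaller.<br />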
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
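<br />
To check the exchangeability estimates against the simulation model, it helps to express HKY as a special case of GTR. In the Python sketch below (the function name is hypothetical), the two transitions get relative rate kappa, the four transversions get relative rate 1, and the six values are normalized to sum to 1, as is the case for a parameter with a Dirichlet prior.<br />
<br />
```python
def hky_as_gtr_exchangeabilities(kappa):
    """Express HKY as GTR exchangeabilities normalized to sum to 1.

    Order: AC, AG, AT, CG, CT, GT (AG and CT are the transitions).
    """
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]

rates = hky_as_gtr_exchangeabilities(kappa=5.0)
print([round(r, 3) for r in rates])
```
<br />
With kappa = 5, each transition has expected exchangeability 5/14 (about 0.357) and each transversion 1/14 (about 0.071).<br />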
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This distinguishes the approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 edge rate parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model. Because the edge rates replace clock_rate, we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the 9 that remain after removing clock_rate, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
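<br />
The conversion from the log scale to the linear scale uses the standard formulas for the lognormal distribution: the mean is exp(mu + sigma²/2) and the standard deviation is the mean times sqrt(exp(sigma²) - 1). A small Python helper (the function name is hypothetical) makes the distinction concrete:<br />
<br />
```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of a lognormal variable whose
    logarithm has mean mu and standard deviation sigma."""
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

# mu = 0 and sigma = 1 on the log scale do NOT give mean 0 and sd 1
mean, sd = lognormal_mean_sd(mu=0.0, sigma=1.0)
print(mean, sd)
```
<br />
When sigma is small, the linear-scale mean is close to exp(mu) and the linear-scale standard deviation is close to sigma times that mean.<br />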
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values, and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
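<br />
To see what these weights mean in practice, the Python sketch below converts weights into selection probabilities (weight divided by the sum of all weights, as described earlier for the strict clock moves). It assumes ''divtime.Rev'' contains exactly the moves built up in this lab; the labels are descriptive names invented for this illustration, and the 38 branch rate sliders are lumped into one entry.<br />
<br />
```python
def move_probabilities(weights):
    """Map each move name to weight / (sum of all weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Weights as defined in this lab's scripts (labels are illustrative)
weights = {
    "birth_rate slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "branch_rates slides (38 moves)": 38.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}
probs = move_probabilities(weights)
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```
<br />
The weights sum to 68, so the node time slider is chosen with probability 10/68 and each topology move with probability 5/68, while any single branch rate slider is chosen with probability 1/68.<br />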
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a summary tree (the maximum a posteriori, or MAP, tree) showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size if we had performed the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
 rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.<br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41660Phylogenetics: RevBayes Lab2020-04-23T16:32:54Z<p>Paul Lewis: /* Divergence times */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and RevBayes modules:<br />
module load paml/4.9<br />
module load RevBayes/1.0.13<br />
<br />
There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
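<br />
To make the rate-times-time confounding concrete, here is a minimal Python sketch. It assumes the simple JC69 model (not the HKY model used later in this lab) only because its formula is short; the function name <tt>jc69_diff_prob</tt> is made up for this illustration. Two very different rate and time combinations with the same product give exactly the same probability of observing a difference at a site, so the data alone cannot tell them apart.<br />

```python
import math

def jc69_diff_prob(rate, time):
    """Probability that a site differs between the two ends of a branch
    under the JC69 model (an assumption made for this illustration)."""
    d = rate * time  # expected number of substitutions per site
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# A fast rate over a short time versus a slow rate over a long time
fast_short = jc69_diff_prob(rate=2.0, time=0.25)
slow_long = jc69_diff_prob(rate=0.5, time=1.0)

# Both products equal 0.5, so the two probabilities are identical
print(fast_short, slow_long)
```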
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
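<br />
One way to see why a birth rate near 2.6 gives an expected height near 1 is sketched below in Python. The assumption here (mine, not stated above) is that the tree starts with 2 lineages at the root and that the waiting time while k lineages are present is exponential with rate k times the birth rate, counting every interval up to and including the one with 20 lineages. The function name <tt>expected_yule_height</tt> is hypothetical.<br />

```python
def expected_yule_height(birth_rate, n_tips):
    """Expected root-to-tip time of a pure birth tree, summing the expected
    waiting times while 2, 3, ..., n_tips lineages are present.
    Assumes the final interval (with n_tips lineages) is included."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(birth_rate=2.6, n_tips=20)
print(round(height, 3))  # very close to 1
```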
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
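<br />
The weight-to-probability calculation described above can be sketched in a few lines of Python. This only illustrates the arithmetic, not RevBayes code; the function name <tt>move_probabilities</tt> and the weight-5 entry <tt>some_heavier_move</tt> are hypothetical and are included only to show what an unequal weight does.<br />

```python
def move_probabilities(weights):
    """Convert move weights into selection probabilities by dividing
    each weight by the sum of all weights."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

weights = {
    "birth_rate": 1.0,         # mvSlide defined in the tree submodel
    "clock_rate": 1.0,         # mvSlide defined in the strict clock submodel
    "some_heavier_move": 5.0,  # hypothetical move, for illustration only
}

probs = move_probabilities(weights)
for name, p in probs.items():
    print(name, round(p, 3))
```

With these weights the hypothetical heavier move is chosen 5 times as often as either slide move.<br />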
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
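<br />
To get a feel for what tuning during burnin accomplishes, here is a toy Python sketch of a sliding window move on a standard normal target, with the window width adjusted every 100 iterations toward a 40% acceptance rate. Everything in it is an assumption made for illustration: the target density, the function names, and especially the adjustment rule in <tt>tune</tt>, which is a plausible stand-in rather than a description of what RevBayes actually does internally.<br />

```python
import math
import random

def log_target(x):
    """Log density (up to a constant) of a standard normal toy posterior."""
    return -0.5 * x * x

def slide_step(x, delta, rng):
    """Propose uniformly within a window of width delta centered on x,
    then accept or reject with the Metropolis rule."""
    proposal = x + (rng.random() - 0.5) * delta
    log_ratio = log_target(proposal) - log_target(x)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True
    return x, False

def tune(delta, accept_rate, target):
    """Illustrative rule (an assumption): widen the window when accepting
    too often, narrow it when accepting too rarely."""
    if accept_rate > target:
        return delta * (1.0 + (accept_rate - target) / (1.0 - target))
    return delta / (2.0 - accept_rate / target)

def run(n_iter, x, delta, rng, tuning_interval=None, target=0.4):
    """Run n_iter iterations; tune delta every tuning_interval if given."""
    accepted_total = 0
    accepted_window = 0
    for i in range(1, n_iter + 1):
        x, accepted = slide_step(x, delta, rng)
        accepted_total += accepted
        accepted_window += accepted
        if tuning_interval and i % tuning_interval == 0:
            delta = tune(delta, accepted_window / tuning_interval, target)
            accepted_window = 0
    return x, delta, accepted_total / n_iter

rng = random.Random(12345)
# Burnin: 1000 iterations, tuning every 100, starting from delta = 1.0
x, tuned_delta, _ = run(1000, 0.0, 1.0, rng, tuning_interval=100)
# Sampling: 10000 iterations with the tuned window held fixed
x, _, accept_rate = run(10000, x, tuned_delta, rng)
print(tuned_delta, accept_rate)
```

Starting from <tt>delta=1.0</tt> the window is too timid for this target, so the tuned value ends up considerably wider.<br />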
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
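<br />
The exchangeability values quoted in the answers above can be compared with what the simulation model implies. The Python sketch below assumes that the 6 exchangeabilities are ordered AC, AG, AT, CG, CT, GT and that they are normalized to sum to 1 (as a Dirichlet-distributed vector must be); under HKY with kappa = 5, the two transitions (AG and CT) get relative rate 5 and the four transversions get relative rate 1. The function name <tt>hky_exchangeabilities</tt> is hypothetical.<br />

```python
def hky_exchangeabilities(kappa):
    """Normalized GTR exchangeabilities implied by an HKY model.
    Assumed order: AC, AG, AT, CG, CT, GT (AG and CT are transitions)."""
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]

rates = hky_exchangeabilities(5.0)
print([round(r, 3) for r in rates])
# → [0.071, 0.357, 0.071, 0.071, 0.357, 0.071]
```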
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 edge rate parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the 38 edge rates, which replace clock_rate, plus the 2 hyperparameters ucln_mu and ucln_sigma and the 9 remaining original parameters).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
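The translation from the log scale to the linear scale asked about above uses the standard lognormal identities: mean = exp(mu + sigma^2/2) and standard deviation = mean * sqrt(exp(sigma^2) - 1). Here is a minimal Python sketch of that arithmetic. The function name is made up for illustration, and the inputs are the posterior means quoted in the answer above (your own run will give slightly different numbers).

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Return (mean, sd) on the linear scale for a Lognormal(mu, sigma) variable,
    where mu and sigma are the mean and sd of the LOG of the variable."""
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

# Posterior means of ucln_mu and ucln_sigma quoted in the answer above
mean, sd = lognormal_mean_sd(0.0116, 0.01828)
print(round(mean, 4), round(sd, 5))
```

The mean lands near 1 and the standard deviation near 0, which is what a strict clock with rate 1.0 should look like when viewed through a relaxed clock model.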
<br />
== Divergence times ==<br />
<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times as often, and the node time slider move 10 times as often, as each of the other moves we've defined (all of which have weight 1).<br />
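To see what these weights mean in practice, here is a small Python sketch. It assumes, as described in the strict clock section of this lab, that the chance of a move being chosen is its weight divided by the sum of all weights. The dictionary keys are just labels invented for this sketch, not RevBayes names.

```python
# Weights of the moves defined in divtime.Rev (keys are labels for this sketch only)
weights = {
    "birth_rate": 1.0,
    "ucln_mu": 1.0,
    "ucln_sigma": 1.0,
    "exchangeabilities": 1.0,
    "state_freqs": 1.0,
    "node_time_slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}

# One slide move (weight 1) for each of the 2*20 - 2 = 38 branch rates
n_branches = 2 * 20 - 2
for i in range(1, n_branches + 1):
    weights["branch_rate_%d" % i] = 1.0

total_weight = sum(weights.values())
prob = {name: w / total_weight for name, w in weights.items()}

tree_share = sum(prob[m] for m in ("node_time_slide", "NNI", "Narrow", "FNPR"))
print("sum of weights:", total_weight)
print("P(NNI):", round(prob["NNI"], 4))
print("share of proposals spent on the tree:", round(tree_share, 4))
```

Even with the extra weight, the four tree moves together account for well under half of all proposals, because there are so many branch rate moves.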
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
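The mean and variance quoted above for the birth_rate prior follow directly from the Exponential(0.01) prior: an exponential distribution with rate r has mean 1/r and variance 1/r^2. The Python sketch below checks this and also computes the central 95% prior interval from the exponential quantile function, -ln(1 - p)/r. The variable names are made up for this illustration.

```python
import math

rate = 0.01  # rate parameter of the dnExponential prior on birth_rate

prior_mean = 1.0 / rate
prior_var = 1.0 / rate**2

def exp_quantile(p, rate):
    """Quantile function of an Exponential(rate) distribution."""
    return -math.log(1.0 - p) / rate

lower = exp_quantile(0.025, rate)
upper = exp_quantile(0.975, rate)
print(prior_mean, prior_var)
print(round(lower, 2), round(upper, 2))
```

The interval spans more than two orders of magnitude, and the value used in the simulation (2.6) sits right at its lower edge, which is why trees drawn from this prior would mostly look nothing like the simulated tree.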
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
 rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41659Phylogenetics: RevBayes Lab2020-04-23T16:32:38Z<p>Paul Lewis: /* Obtaining credible intervals under the prior */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Log in to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and RevBayes modules:<br />
module load paml/4.9<br />
module load RevBayes/1.0.13<br />
<br />
There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
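Here is one rough way to convince yourself that 2.6 is a sensible value. This is a back-of-the-envelope sketch under stated assumptions, not necessarily how the value was derived: in a pure birth process with rate lambda, while there are k lineages the waiting time to the next speciation is exponential with mean 1/(k*lambda). Starting from 2 lineages at the root and summing these expected waiting times up to and including the period with 20 lineages gives an approximate expected tree height.

```python
def expected_yule_height(n_taxa, birth_rate):
    """Approximate expected root-to-tip height of a pure birth (Yule) tree.

    Assumption for this sketch: the process starts with 2 lineages at the root
    and the height is the sum of the expected waiting times 1/(k*birth_rate)
    for k = 2, 3, ..., n_taxa lineages.
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))

height = expected_yule_height(20, 2.6)
print(round(height, 4))
```

Under these assumptions the expected height for 20 taxa and a birth rate of 2.6 comes out very close to 1.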
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
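For the exchangeabilities question above, it helps to work out what values to expect. The data were simulated under HKY with kappa = 5, which is a special case of GTR in which the two transitions have relative rate kappa and the four transversions have relative rate 1. Because the exchangeabilities in our model are constrained to sum to 1, the expected values are these relative rates divided by their sum. The Python sketch below does that arithmetic; the pair labels (AC, AG, ...) are used here only for illustration.

```python
kappa = 5.0  # transition/transversion rate ratio specified in control.dat

# Unnormalized relative rates for the six nucleotide pairs under HKY
pairs = ["AC", "AG", "AT", "CG", "CT", "GT"]
transitions = {"AG", "CT"}  # purine<->purine and pyrimidine<->pyrimidine
raw = {p: (kappa if p in transitions else 1.0) for p in pairs}

# Normalize so the six exchangeabilities sum to 1
raw_sum = sum(raw.values())  # equals 2*kappa + 4
expected = {p: r / raw_sum for p, r in raw.items()}

for p in pairs:
    print(p, round(expected[p], 4))
```

The two transitions come out at 5/14 and the four transversions at 1/14, in line with the values reported in the answer.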
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
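To get a feel for why a larger alpha means smaller proposed changes, here is a Python sketch. It rests on an assumption about how the move works: that the proposed simplex is drawn from a Dirichlet distribution whose parameters are alpha times the current values. Under that assumption, a component with current value x has proposal standard deviation sqrt(x(1-x)/(alpha+1)). The function name is invented for this sketch.

```python
import math

def dirichlet_proposal_sd(x, alpha):
    """Standard deviation of one component of a Dirichlet(alpha * current) draw,
    where x is the current value of that component (the components sum to 1).
    ASSUMPTION: the proposal is centered on the current value in this way."""
    return math.sqrt(x * (1.0 - x) / (alpha + 1.0))

current = 0.1  # e.g. a state frequency currently at 0.1
for alpha in (10.0, 100.0, 1000.0, 10000.0):
    print(alpha, round(dirichlet_proposal_sd(current, alpha), 5))
```

Each tenfold increase in alpha shrinks the typical step by roughly a factor of sqrt(10), so a sharply peaked posterior needs a large alpha before most proposals stay within the region of high posterior density.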
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 50 parameters (the original 10 plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
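<br />
The translation between the two scales can be checked with a few lines of Python. This is a minimal sketch using the standard formulas for the mean and standard deviation of a lognormal variable; the function name <tt>lognormal_mean_sd</tt> and the example values (mu = 0, sigma = 1) are chosen only for illustration.<br />
<br />
```python
import math

def lognormal_mean_sd(mu, sigma):
    """Return (mean, sd) on the linear scale for a Lognormal(mu, sigma) variable.

    mu and sigma are the mean and standard deviation of the log of the variable.
    """
    mean = math.exp(mu + 0.5 * sigma**2)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

# Illustrative values only: mu = 0 does NOT give a mean of 1 on the linear scale
mean, sd = lognormal_mean_sd(0.0, 1.0)
print(f"mean = {mean:.4f}, sd = {sd:.4f}")
```
<br />
Plugging in the posterior means of ucln_mu and ucln_sigma from your own run gives the linear-scale values asked about in the review questions for the relaxed clock analysis.<br />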
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence because they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
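<br />
To get a feel for what these weights mean, the sketch below tallies the moves in ''divtime.Rev'' and applies the rule described in the strict clock section (probability of a move = its weight divided by the sum of all weights). It assumes a 20-taxon tree and that your script contains exactly the moves introduced in this lab; the labels used as dictionary keys are just names made up for this example.<br />
<br />
```python
# Weights of the moves defined in divtime.Rev (assumed: 20 taxa, no other moves)
n_taxa = 20
n_branches = 2 * n_taxa - 2  # 38 branch rate slide moves

weights = {
    "birth_rate slide": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}
for i in range(1, n_branches + 1):
    weights[f"branch_rates[{i}] slide"] = 1.0

# Selection probability of each move = weight / sum of all weights
total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

print("total weight:", total)
for name in ("node time slide", "NNI", "birth_rate slide"):
    print(f"{name}: {probs[name]:.4f}")
```
<br />
Because there are 38 separate branch rate moves, the tree moves still account for well under half of all attempted moves even with their extra weight.<br />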
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the ''divtime.Rev'' file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a maximum a posteriori (MAP) summary tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate at their true values but keep the priors for ucln_mu, ucln_sigma, and branch_rates as they were. This means that we will only be looking at the prior on rates, not node times.<br />
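<br />
The mean and variance quoted for the birth_rate prior follow directly from the rate of the Exponential distribution. A quick Python check (the function name is made up for this example):<br />
<br />
```python
def exponential_mean_var(rate):
    """Mean and variance of an Exponential distribution with the given rate."""
    return 1.0 / rate, 1.0 / rate**2

# birth_rate ~ dnExponential(0.01)
mean, var = exponential_mean_var(0.01)
print(mean, var)
```
<br />
The same arithmetic applies to the ucln_sigma hyperprior, which also uses rate 0.01.<br />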
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41658Phylogenetics: RevBayes Lab2020-04-23T16:32:08Z<p>Paul Lewis: /* Relaxed clocks */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Log in to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and RevBayes modules:<br />
module load paml/4.9<br />
module load RevBayes/1.0.13<br />
<br />
There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.<br />
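<br />
Here is one way to see where a value near 2.6 can come from. This is a sketch under stated assumptions, not necessarily the calculation used to choose 2.6: it assumes the tree starts with 2 lineages at the root, that the waiting time while there are k lineages is exponential with rate k times the birth rate, and that the height is measured through the end of the interval with 20 lineages. The function names are made up for this example.<br />
<br />
```python
def expected_yule_height(n_tips, birth_rate):
    """Expected time from the root split (2 lineages) through the end of the
    interval with n_tips lineages under a pure birth process (assumed definition)."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

def birth_rate_for_height(n_tips, height):
    """Birth rate that gives the requested expected height under the same assumptions."""
    return sum(1.0 / k for k in range(2, n_tips + 1)) / height

print(f"birth rate for expected height 1: {birth_rate_for_height(20, 1.0):.4f}")
print(f"expected height with birth rate 2.6: {expected_yule_height(20, 2.6):.4f}")
```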
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
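<br />
If you would rather generate the control file than type it, the nine lines described above can be assembled with a short Python function. This is only a sketch: the function name <tt>build_control</tt>, the seed 12345, and the three-taxon tree are hypothetical placeholders, and you would substitute your own seed and the tree description from your first evolver run.<br />
<br />
```python
def build_control(seed, tree_newick, n_taxa=20, n_sites=10000, kappa=5.0,
                  freqs=(0.1, 0.2, 0.3, 0.4)):
    """Return the text of an evolver control file laid out as described above."""
    lines = [
        "2",                                # line 1: nexus output
        str(seed),                          # line 2: random number seed
        f"{n_taxa} {n_sites} 1",            # line 3: taxa, sites, data sets
        "-1",                               # line 4: use branch lengths in the tree
        tree_newick.strip(),                # line 5: tree description
        "4",                                # line 6: HKY model
        f"{kappa:g}",                       # line 7: kappa
        "0 0",                              # line 8: no rate heterogeneity
        " ".join(f"{f:g}" for f in freqs),  # line 9: frequencies in T C A G order
    ]
    return "\n".join(lines) + "\n"

# Hypothetical placeholder seed and tree, for illustration only
example_tree = "((A:0.5,B:0.5):0.5,C:1.0);"
print(build_control(12345, example_tree, n_taxa=3), end="")
```
<br />
Writing the returned string to a file named ''control.dat'' gives the same result as creating the file by hand.<br />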
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
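<br />
As a reminder of how a sliding window move works, here is a minimal Python sketch. It is not RevBayes's implementation: the function names are made up, and the tuning rule (grow or shrink the window by 10%) is an illustrative assumption that only captures the idea of making the move bolder or more conservative.<br />
<br />
```python
import random

def slide_proposal(current, delta, rng):
    """Propose a value drawn uniformly from a window of width delta centered on current."""
    return current + (rng.random() - 0.5) * delta

def tune_delta(delta, acceptance_rate, target=0.4):
    """Illustrative tuning rule (assumed): widen the window when too many
    proposals are accepted, narrow it when too few are."""
    return delta * (1.1 if acceptance_rate > target else 0.9)

rng = random.Random(2020)  # fixed seed so the example is repeatable
clock_rate = 1.0
proposed = slide_proposal(clock_rate, delta=1.0, rng=rng)
# A proposal below zero has zero prior density under the Exponential prior,
# so it would simply be rejected.
print(proposed)
print(tune_delta(1.0, acceptance_rate=0.8))  # too many acceptances: bolder
print(tune_delta(1.0, acceptance_rate=0.1))  # too few acceptances: more conservative
```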
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
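<br />
The exchangeability estimates can be checked with a little arithmetic. The data were simulated under HKY with kappa = 5, so in GTR terms the 2 transition exchangeabilities should be kappa times larger than the 4 transversion exchangeabilities, and all 6 must sum to 1. A small sketch of that calculation (the function name is made up for illustration):<br />

```bash
# Expected normalized exchangeabilities when the true model is HKY:
# 2 transitions have relative rate kappa, 4 transversions have relative rate 1
hky_exchangeabilities() {
    awk -v kappa="$1" 'BEGIN {
        total = 2 * kappa + 4
        printf "transition=%.4f transversion=%.4f\n", kappa / total, 1 / total
    }'
}

hky_exchangeabilities 5
```

With kappa = 5 this works out to 5/14 and 1/14, so the marginal densities in Tracer should fall into two groups centered near those values.<br />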
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This distinguishes the approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus the clock_rate that the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
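<br />
The conversion from the log scale to the linear scale uses the standard lognormal formulas: mean = exp(mu + sigma^2/2) and standard deviation = mean * sqrt(exp(sigma^2) - 1). Here is a minimal sketch using awk; the function name <tt>lognormal_to_linear</tt> is invented for this example.<br />

```bash
# Convert lognormal parameters (mu, sigma on the log scale) to the
# mean and standard deviation on the linear scale
lognormal_to_linear() {
    awk -v mu="$1" -v sigma="$2" 'BEGIN {
        mean = exp(mu + sigma * sigma / 2)
        sd   = mean * sqrt(exp(sigma * sigma) - 1)
        printf "mean=%.4f sd=%.4f\n", mean, sd
    }'
}

lognormal_to_linear 0 1
```

With mu = 0 and sigma = 1, the log of the variable has mean 0 and standard deviation 1, yet the variable itself has mean about 1.65 and standard deviation about 2.16, which is exactly the trap described above. The same function can be applied to any pair of ucln_mu and ucln_sigma values.<br />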
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence because they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
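<br />
To make the effect of the weights concrete, here is a small tally, assuming a move is selected with probability equal to its weight divided by the sum of all weights. The weights are tallied from the ''divtime.Rev'' script as built in this lab (41 slide moves and 2 simplex moves of weight 1, one node time move of weight 10, and 3 topology moves of weight 5 each); adjust the numbers if your script differs. The function name is made up for this sketch.<br />

```bash
# Selection probability of a move = its weight / sum of all weights
move_probs() {
    awk 'BEGIN {
        n_unit = 43   # weight-1 moves: 41 slide moves + 2 simplex moves
        w_time = 10   # mvNodeTimeSlideUniform
        w_topo = 5    # each of mvNNI, mvNarrow, mvFNPR
        total  = n_unit + w_time + 3 * w_topo
        printf "total weight: %d\n", total
        printf "node time slide: %.3f\n", w_time / total
        printf "each topology move: %.3f\n", w_topo / total
        printf "each weight-1 move: %.3f\n", 1 / total
    }'
}

move_probs
```

Because there are so many weight-1 moves (one per branch rate), the tree moves still account for well under half of all attempts even with their larger weights.<br />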
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
 rb divtime.Rev<br />
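<br />
If you would rather not edit the output file names by hand before running, the prefix change can be scripted. This sketch assumes the file names appear in the script exactly as shown in the monitors and dot file lines earlier in the lab (an <tt>output/</tt> path for the log and tree files, and a quoted dot file name); <tt>retarget_outputs</tt> is a hypothetical helper.<br />

```bash
# Replace the output prefix (e.g. relaxed -> divtime) in file names only,
# leaving comments and variable names untouched
retarget_outputs() {
    local old="$1" new="$2" script="$3"
    sed -e "s#output/${old}#output/${new}#g" \
        -e "s#\"${old}\.dot\"#\"${new}.dot\"#g" \
        "$script" > "${script}.tmp" && mv "${script}.tmp" "$script"
}

# Example: retarget_outputs relaxed divtime divtime.Rev
```

Check the result with <tt>grep divtime divtime.Rev</tt> before starting the run, since a missed file name means overwriting the relaxed clock results.<br />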
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and the birth_rate to their true values but keep the priors on ucln_mu, ucln_sigma, and the branch_rates as they were. This means that we will only be looking at the prior on rates, not node times.<br />
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
 rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41657Phylogenetics: RevBayes Lab2020-04-23T16:31:47Z<p>Paul Lewis: /* Run RevBayes */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Log in to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and RevBayes modules:<br />
module load paml/4.9<br />
module load RevBayes/1.0.13<br />
<br />
There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
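<br />
If you prefer not to paste the tree by hand, the control file can be assembled with a short shell function. This is only a sketch: it assumes the tree description sits on a single line of ''tree.txt'' (the first line containing an opening parenthesis is used), and <tt>make_control</tt> is a name invented here, not part of PAML.<br />

```bash
# Write control.dat for evolver from a seed and a file holding the tree
make_control() {
    local seed="$1" treefile="$2"
    local tree
    # Take the first line that looks like a Newick tree description
    tree=$(grep -m 1 '(' "$treefile") || return 1
    {
        echo "2"                # nexus output
        echo "$seed"            # random number seed
        echo "20 10000 1"       # taxa, sites, data sets
        echo "-1"               # use branch lengths from the tree
        echo "$tree"            # tree description
        echo "4"                # HKY model
        echo "5"                # kappa
        echo "0 0"              # no rate heterogeneity
        echo "0.1 0.2 0.3 0.4"  # T C A G frequencies
    } > control.dat
}

# Example: make_control 12345 tree.txt
```

Open the resulting ''control.dat'' in an editor afterwards to confirm that line 5 really is your tree.<br />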
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing <tt>mymodel</tt>, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but setting it up now saves me having to explain it later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
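As a quick sanity check on these settings, the following Python sketch (not RevBayes code; the function names are hypothetical) works out roughly how many samples the log file should contain and how many tuning rounds the burnin allows. It ignores any extra line that may be written for the starting state.<br />

```python
def n_samples(generations, printgen):
    """Approximate number of sampled states when saving every printgen-th generation."""
    return generations // printgen

def n_tuning_rounds(burnin_generations, tuning_interval):
    """Number of times the moves can be retuned during burnin."""
    return burnin_generations // tuning_interval

if __name__ == "__main__":
    print("samples in log file:", n_samples(10000, 10))
    print("lines of screen output:", n_samples(10000, 100))
    print("tuning rounds during burnin:", n_tuning_rounds(1000, 100))
```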
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
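Once you have answered the questions above, you can check the exchangeabilities against what the simulation model implies. The data were simulated under HKY with kappa equal to 5, which is a special case of GTR in which the 2 transitions share one relative rate and the 4 transversions share another. This Python sketch (not RevBayes code; the function name is hypothetical) computes those relative rates after normalizing all 6 to sum to 1, as the <tt>exchangeabilities</tt> simplex does.<br />

```python
def hky_exchangeabilities(kappa):
    """Return (transition, transversion) relative rates normalized so that all six sum to 1."""
    total = 2.0 * kappa + 4.0  # 2 transitions at rate kappa, 4 transversions at rate 1
    return kappa / total, 1.0 / total

if __name__ == "__main__":
    ts, tv = hky_exchangeabilities(5.0)
    print("transition rate:", ts)
    print("transversion rate:", tv)
    print("ratio:", ts / tv)
```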
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
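To see why a larger alpha means smaller proposed changes, here is a short Python sketch (not RevBayes code). It assumes the proposed simplex is drawn from a Dirichlet distribution with parameters alpha times the current values; under that assumption the standard deviation of a proposed element around its current value x is sqrt(x(1-x)/(alpha+1)). The function name is hypothetical.<br />

```python
import math

def proposal_sd(x, alpha):
    """Standard deviation of one element proposed from Dirichlet(alpha * current), current value x."""
    return math.sqrt(x * (1.0 - x) / (alpha + 1.0))

if __name__ == "__main__":
    # Typical step size for a state frequency currently at 0.4
    for alpha in (10.0, 100.0, 1000.0, 10000.0):
        print(alpha, proposal_sd(0.4, alpha))
```

Each tenfold increase in alpha shrinks the typical step by roughly a factor of 3, so very sharp posteriors need very large values of alpha before a reasonable fraction of proposals is accepted.<br />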
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This distinguishes the approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so 2*20-2=38 edge rate parameters take the place of the single clock_rate. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which no longer appears in the model, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
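The conversion from mu and sigma (log scale) to the mean and standard deviation of the lognormally-distributed variable itself (linear scale) uses the standard formulas mean = exp(mu + sigma^2/2) and sd = mean * sqrt(exp(sigma^2) - 1). Here is a small Python sketch of that conversion (not RevBayes code; the function name is hypothetical and the values of mu and sigma are only illustrative).<br />

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Convert log-scale parameters (mu, sigma) to the linear-scale mean and standard deviation."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

if __name__ == "__main__":
    # Illustrative values only: mu = 0 does not give a mean of exactly 1 unless sigma is 0
    mean, sd = lognormal_mean_sd(mu=0.0, sigma=0.5)
    print("mean:", mean)
    print("sd:", sd)
```

You can use the same function later to translate your estimates of ucln_mu and ucln_sigma to the linear scale.<br />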
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values, and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence because they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
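Because RevBayes chooses the next move with probability equal to its weight divided by the sum of all weights, you can work out how often each move should be attempted. The Python sketch below (not RevBayes code) does this arithmetic, assuming your ''divtime.Rev'' contains exactly the moves defined in this lab; the function names and the labels used for the moves are hypothetical.<br />

```python
def divtime_weights(n_branches=38):
    """Weights of the moves defined in this lab's divtime.Rev (labels are just for display)."""
    weights = {
        "node time slide": 10.0,
        "NNI": 5.0,
        "Narrow": 5.0,
        "FNPR": 5.0,
        "birth_rate slide": 1.0,
        "ucln_mu slide": 1.0,
        "ucln_sigma slide": 1.0,
        "exchangeabilities simplex": 1.0,
        "state_freqs simplex": 1.0,
    }
    # One slide move per branch rate, each with weight 1
    for i in range(1, n_branches + 1):
        weights["branch_rates[%d] slide" % i] = 1.0
    return weights

def selection_probabilities(weights):
    """Probability of choosing each move: its weight divided by the sum of all weights."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

if __name__ == "__main__":
    probs = selection_probabilities(divtime_weights())
    for name in ("node time slide", "NNI", "birth_rate slide"):
        print(name, probs[name])
```

Note that although each tree move is individually favored, the 38 branch rate moves together still account for more than half of all attempts.<br />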
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained from an MCMC analysis performed on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>
Paul Lewis
http://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41656
Phylogenetics: RevBayes Lab
2020-04-23T16:31:19Z
<p>Paul Lewis: /* Login to Xanadu */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and RevBayes modules:<br />
module load paml/4.9<br />
module load RevBayes/1.0.13<br />
<br />
There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
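The confounding of rate and time can be illustrated with a small Python sketch. For simplicity it uses the Jukes-Cantor (JC69) model rather than the models used in this lab, and the function name is hypothetical. Under JC69 the expected proportion of sites that differ between two sequences depends only on the product of rate and time, so very different combinations of rate and time predict exactly the same data.<br />

```python
import math

def expected_p_distance(rate, time):
    """Expected proportion of differing sites under JC69 after rate * time substitutions per site."""
    d = rate * time  # expected number of substitutions per site
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

if __name__ == "__main__":
    # Three rate/time scenarios with the same product
    for rate, time in ((1.0, 1.0), (2.0, 0.5), (0.5, 2.0)):
        print(rate, time, expected_p_distance(rate, time))
```

Because the three scenarios yield identical expectations, the likelihood alone cannot choose among them; only the priors on rates and times can.<br />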
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
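One way to check that a birth rate near 2.6 gives an expected height of about 1 for 20 taxa is sketched below in Python. It assumes the expected height of a pure birth tree is the sum of the expected waiting times 1/(k * birth rate) while there are k = 2, 3, ..., 20 lineages; this is my assumption about how the value could be derived, and the function name is hypothetical.<br />

```python
def expected_yule_height(n_tips, birth_rate):
    """Sum of expected waiting times 1/(k * birth_rate) for k = 2, ..., n_tips lineages."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

if __name__ == "__main__":
    print(expected_yule_height(20, 2.6))
```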
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect: the first move would be stored at index 2, leaving index 1 empty, and vector indices in Rev Language, like R, start at 1, not 0. <br />
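The difference between incrementing after use and incrementing before use can be mimicked in Python, which has no <tt>++</tt> operator. This sketch is not Rev code: it uses a dictionary to stand in for a vector whose indices start at 1, and the function names are hypothetical.<br />

```python
def add_moves_post_increment(move_names):
    """Mimic moves[nmoves++] = ...: use the counter as the index, then increment it."""
    moves = {}
    nmoves = 1
    for name in move_names:
        moves[nmoves] = name
        nmoves += 1
    return moves

def add_moves_pre_increment(move_names):
    """Mimic moves[++nmoves] = ...: increment the counter first, then use it as the index."""
    moves = {}
    nmoves = 1
    for name in move_names:
        nmoves += 1
        moves[nmoves] = name
    return moves

if __name__ == "__main__":
    names = ["slide birth_rate", "slide clock_rate"]
    print(add_moves_post_increment(names))  # indices start at 1
    print(add_moves_pre_increment(names))   # index 1 is never filled
```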
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
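<br />
The two ideas in the paragraph above (a window of width <tt>delta</tt> centered on the current value, and move selection in proportion to weight) can be sketched in a few lines of Python. This is only an illustration, not Rev code and not RevBayes's implementation; the function names and the move labels are made up for the example.<br />

```python
import random

def slide_proposal(current, delta, rng=random):
    """Propose a value drawn uniformly from a window of width delta centered on current."""
    return current + delta * (rng.random() - 0.5)

def move_probabilities(weights):
    """Convert move weights into selection probabilities (weight / sum of all weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical labels for the two slide moves defined so far, each with weight 1.0
probs = move_probabilities({"birth_rate slide": 1.0, "clock_rate slide": 1.0})
```

With only two moves of equal weight, each is chosen half the time; a move with weight 5.0 would be chosen five times as often as a move with weight 1.0.<br />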
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
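<br />
To see how a proposal can change every element of a simplex while keeping the sum at 1, here is a minimal Python sketch. It assumes the new simplex is drawn from a Dirichlet distribution whose parameters are <tt>alpha</tt> times the current values; that is one common way to build such a move and may differ in detail from what mvDirichletSimplex actually does. The function names are hypothetical.<br />

```python
import random

def dirichlet_draw(params, rng=random):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def simplex_proposal(current, alpha, rng=random):
    """Propose a new simplex centered on the current one (assumed Dirichlet(alpha * current))."""
    return dirichlet_draw([alpha * x for x in current], rng)

# Example: propose new state frequencies starting from (0.1, 0.2, 0.3, 0.4)
proposed = simplex_proposal([0.1, 0.2, 0.3, 0.4], alpha=10.0)
```

Under this assumption a larger <tt>alpha</tt> concentrates the proposal more tightly around the current values, so proposed changes are smaller.<br />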
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but setting it up now saves me from having to explain it later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
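<br />
The tuning that happens during burnin can be pictured with the following Python sketch. The update rule shown is an assumption chosen for illustration (make the window wider when the acceptance rate is above the target, narrower when it is below); it is not guaranteed to match the rule RevBayes uses internally, and <tt>tune_delta</tt> is a made-up name.<br />

```python
def tune_delta(delta, accepted, tried, target=0.4):
    """Return an adjusted window width given the acceptance record of one tuning interval."""
    rate = accepted / tried
    if rate > target:
        # Too many acceptances: proposals are too timid, so widen the window
        return delta * (1.0 + (rate - target) / (1.0 - target))
    # Too few acceptances: proposals are too bold, so narrow the window
    return delta / (2.0 - rate / target)

# Example: 90 of 100 proposals accepted, so the window is widened
new_delta = tune_delta(1.0, accepted=90, tried=100)
```

With <tt>tuningInterval=100</tt> and 1000 burnin generations, an adjustment like this would be applied up to 10 times per move.<br />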
<br />
== Run RevBayes ==<br />
<br />
Download the singularity image for RevBayes as follows:<br />
<br />
cd ~/rblab # just to make sure you are in the right place<br />
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
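<br />
If you want to check your reasoning about the exchangeabilities, the Python sketch below computes the values that an HKY model with a given kappa implies once the six relative rates are normalized to sum to 1 (as they are for a simplex). The ordering AC, AG, AT, CG, CT, GT is an assumption made for the example, and <tt>hky_exchangeabilities</tt> is a hypothetical helper, not part of RevBayes or PAML.<br />

```python
def hky_exchangeabilities(kappa):
    """Normalized exchangeabilities implied by HKY, in the assumed order AC, AG, AT, CG, CT, GT."""
    # Transitions (AG and CT) get relative rate kappa; transversions get relative rate 1
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]

# The data were simulated with kappa = 5
rates = hky_exchangeabilities(5.0)
```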
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
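<br />
For intuition about why many rejected proposals lead to a low ESS, here is a rough Python sketch of one common ESS estimator: the number of samples divided by one plus twice the sum of the positive autocorrelations. This is an illustration under that assumption; Tracer's own calculation may differ in its details.<br />

```python
def effective_sample_size(samples):
    """Estimate ESS as n / (1 + 2 * sum of autocorrelations up to the first non-positive lag)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    if var == 0.0:
        return float(n)
    rho_sum = 0.0
    for lag in range(1, n):
        cov = sum((samples[i] - mean) * (samples[i + lag] - mean)
                  for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0.0:
            break  # stop once the autocorrelation is no longer positive
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

# A chain that sits at one value for a long time is strongly autocorrelated
sticky = [0.0] * 50 + [1.0] * 50
sticky_ess = effective_sample_size(sticky)
```

A chain whose proposals are mostly rejected repeats the same value over and over, which is exactly the kind of "sticky" behavior that drives the estimate far below the number of samples.<br />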
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus the clock_rate that the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
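<br />
The conversion between the two scales uses the standard lognormal formulas: the mean on the linear scale is exp(mu + sigma^2/2), and the standard deviation is that mean times sqrt(exp(sigma^2) - 1). A small Python helper (the name <tt>lognormal_mean_sd</tt> is made up for this example) makes the calculation easy to repeat:<br />

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of a lognormal variable whose log has mean mu and sd sigma."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

# Example: mu = 0 and sigma = 1 on the log scale
mean, sd = lognormal_mean_sd(0.0, 1.0)
```

Note that with mu = 0 the mean on the linear scale is exp(0.5), not 1, which illustrates why mu and sigma should not be read as the mean and standard deviation of the rates themselves.<br />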
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the end of the divtime.Rev file, just before the final <tt>quit()</tt> line (anything placed after <tt>quit()</tt> will never be executed). This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from trees.txt into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals turned out to be exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
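<br />
The mean and variance quoted above for the birth_rate prior follow directly from the rate parameter of the Exponential distribution: the mean is 1/rate and the variance is 1/rate^2. A two-line Python check (with a made-up function name):<br />

```python
def exponential_mean_variance(rate):
    """Mean and variance of an Exponential distribution with the given rate parameter."""
    return 1.0 / rate, 1.0 / rate ** 2

# The prior used for birth_rate in this lab is Exponential with rate 0.01
prior_mean, prior_variance = exponential_mean_variance(0.01)
```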
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
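<br />
The node bars you compared are 95% HPD intervals. One simple way to compute such an interval from a set of MCMC samples is to find the shortest interval that contains 95% of the sorted values, as in the Python sketch below. This is an illustration of the idea only; the exact method used by RevBayes, Tracer, or FigTree may differ, and <tt>hpd_interval</tt> is a hypothetical name.<br />

```python
import math

def hpd_interval(samples, mass=0.95):
    """Return the shortest interval containing at least the requested fraction of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples the interval must cover
    # Slide a window of k consecutive sorted samples and keep the narrowest one
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]

# Example: one extreme value does not stretch the interval
lower, upper = hpd_interval(list(range(100)) + [1000])
```

Applying a function like this to the sampled values of one branch rate from the ''divprior'' and ''divtime'' log files would let you compare interval widths numerically rather than by eye.<br />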
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41655Phylogenetics: RevBayes Lab2020-04-23T16:31:08Z<p>Paul Lewis: /* Login to Xanadu */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and RevBayes modules:<br />
module load paml/4.9<br />
module load RevBayes/1.0.13<br />
<br />
<!-- module load singularity/3.5.2 <br />
<br />
The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.<br />
--><br />
<br />
There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
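<br />
To make the confounding of rate and time concrete, the Python sketch below uses the Jukes-Cantor (JC69) model, chosen here only because its formula is simple: the expected proportion of sites that differ along a branch depends on rate and time only through their product. The function name is made up for the example.<br />

```python
import math

def jc69_p_distance(rate, time):
    """Expected proportion of differing sites along a branch under JC69."""
    return 0.75 * (1.0 - math.exp(-4.0 * rate * time / 3.0))

# A fast rate over a short time and a slow rate over a long time look the same
fast_short = jc69_p_distance(rate=2.0, time=0.25)
slow_long = jc69_p_distance(rate=0.5, time=1.0)
```

Both calls give the same value because rate times time equals 0.5 in each case, so the sequences alone cannot tell these two scenarios apart.<br />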
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
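<br />
One way to see where a value near 2.6 could come from is the Python sketch below. It assumes the tree starts with 2 lineages at the root and that the waiting time while k lineages are present is exponential with rate k times the birth rate, summed for k = 2 through 20. Whether this is exactly the conditioning behind the 2.6 used here is an assumption, and <tt>expected_yule_height</tt> is a made-up name.<br />

```python
def expected_yule_height(n_taxa, birth_rate):
    """Sum of expected waiting times 1/(k * birth_rate) for k = 2, ..., n_taxa lineages."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))

# Expected height for 20 taxa and birth rate 2.6 under the assumptions above
height = expected_yule_height(20, 2.6)
```

Under these assumptions the result is very close to 1.<br />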
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
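<br />
If you would rather not paste the seed and tree description in by hand, the nine lines described above can be assembled with a short script. This is a hypothetical helper, not part of PAML: the function names are invented here, and it assumes that ''tree.txt'' contains nothing but the tree description.<br />
<br />
```python
def make_control(seed, newick, n_taxa=20, n_sites=10000, n_datasets=1):
    """Return the text of an evolver control file with the nine lines used in this lab."""
    lines = [
        "2",                                  # line 1: nexus output
        str(seed),                            # line 2: random number seed
        f"{n_taxa} {n_sites} {n_datasets}",   # line 3: taxa, sites, data sets
        "-1",                                 # line 4: use branch lengths in the tree
        newick.strip(),                       # line 5: tree description
        "4",                                  # line 6: HKY model
        "5",                                  # line 7: kappa
        "0 0",                                # line 8: gamma shape, rate categories
        "0.1 0.2 0.3 0.4",                    # line 9: T, C, A, G frequencies
    ]
    return "\n".join(lines) + "\n"


def write_control(seed, tree_path="tree.txt", out_path="control.dat"):
    """Read the tree description (assumed to be the whole file) and write the control file."""
    with open(tree_path) as handle:
        newick = handle.read()
    with open(out_path, "w") as handle:
        handle.write(make_control(seed, newick))
```
<br />
Calling <tt>write_control(12345)</tt> from the ''rblab'' directory would produce ''control.dat''; substitute your own seed.<br />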
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, selecting option 5 from the menu, which is "Simulate nucleotide data sets":<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
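<br />
Python has no <tt>++</tt> operator, but the difference between the two forms is easy to mimic. In this sketch a dictionary keyed by 1-based index stands in for a Rev vector, and the two helper functions are invented purely for illustration.<br />
<br />
```python
def add_post_increment(vector, counter, item):
    """Mimic vector[counter++] = item: use the counter, then increment it."""
    vector[counter] = item
    return counter + 1


def add_pre_increment(vector, counter, item):
    """Mimic vector[++counter] = item: increment the counter, then use it."""
    counter += 1
    vector[counter] = item
    return counter


# Post-increment, as in the lab script: slots 1 and 2 are filled in order.
moves = {}
nmoves = 1
nmoves = add_post_increment(moves, nmoves, "slide on birth_rate")
nmoves = add_post_increment(moves, nmoves, "slide on clock_rate")

# Pre-increment: the first move lands in slot 2 and slot 1 is never filled.
skipped = {}
n = 1
n = add_pre_increment(skipped, n, "slide on birth_rate")

print(sorted(moves), sorted(skipped))
```
<br />
With post-increment the counter always points at the next free slot; with pre-increment and a counter starting at 1, the first slot is skipped.<br />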
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
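<br />
The three node types can be mimicked with a few small Python classes. This is a toy sketch for intuition only; the class names are invented here and this is not how RevBayes is implemented.<br />
<br />
```python
import math


class ConstantNode:
    """Fixed value, like death_rate <- 0.0."""

    def __init__(self, value):
        self.value = value


class StochasticNode:
    """Value with a prior density, like birth_rate ~ dnExponential(0.01)."""

    def __init__(self, log_prior, value):
        self.log_prior = log_prior
        self.value = value

    def log_prob(self):
        return self.log_prior(self.value)


class DeterministicNode:
    """Function of other nodes, like diversification := birth_rate - death_rate."""

    def __init__(self, func, *parents):
        self.func = func
        self.parents = parents

    @property
    def value(self):
        # Recomputed from the parent nodes every time it is read.
        return self.func(*(p.value for p in self.parents))


def exponential_log_density(rate):
    """Log density of an Exponential(rate) distribution on x >= 0."""
    return lambda x: (math.log(rate) - rate * x) if x >= 0 else float("-inf")


death_rate = ConstantNode(0.0)
birth_rate = StochasticNode(exponential_log_density(0.01), value=1.0)
diversification = DeterministicNode(lambda b, d: b - d, birth_rate, death_rate)

birth_rate.value = 2.6          # what an accepted MCMC move would do
print(diversification.value)    # follows birth_rate without being updated itself
```
<br />
Only the stochastic node carries a prior and can be changed by a move; the deterministic node simply reports a function of its parents.<br />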
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
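<br />
As a reminder of how a sliding window move behaves, here is a minimal Metropolis sampler in Python. It is a generic sketch, not RevBayes code: the target density is a toy stand-in for a posterior (an unnormalized Gamma with shape 3 and rate 2), and all function names are made up for this example.<br />
<br />
```python
import math
import random


def log_target(x):
    """Toy unnormalized log density on x > 0, standing in for a posterior."""
    if x <= 0.0:
        return float("-inf")
    return 2.0 * math.log(x) - 2.0 * x


def slide_proposal(current, delta, rng):
    """Draw uniformly from a window of width delta centered on the current value."""
    return current + delta * (rng.random() - 0.5)


def run_chain(delta, n_iter=5000, start=1.0, seed=1):
    """Run a Metropolis chain and return the samples and the acceptance rate."""
    rng = random.Random(seed)
    x = start
    accepted = 0
    samples = []
    for _ in range(n_iter):
        proposed = slide_proposal(x, delta, rng)
        # The window is symmetric, so the acceptance ratio is just the density ratio.
        log_ratio = log_target(proposed) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposed
            accepted += 1
        samples.append(x)
    return samples, accepted / n_iter


narrow_samples, narrow_rate = run_chain(delta=0.1)
wide_samples, wide_rate = run_chain(delta=10.0)
print(narrow_rate, wide_rate)
```
<br />
A narrow window is accepted almost every time but barely moves; a wide window proposes bold changes that are usually rejected. Tuning toward a target acceptance rate is a way of searching for a window width between these extremes.<br />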
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions. The <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution (all 1s means a flat prior).<br />
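<br />
To make the roles of the two parameter vectors concrete, here is a sketch of how a GTR rate matrix can be assembled in Python. The conventions are assumptions on my part (states ordered A, C, G, T; exchangeabilities ordered AC, AG, AT, CG, CT, GT; matrix scaled to an expected rate of 1), and the function names are invented for this example. It also shows that the HKY model we simulated under is the special case in which the two transition exchangeabilities are kappa times the other four.<br />
<br />
```python
def gtr_rate_matrix(exchangeabilities, freqs):
    """Build a normalized GTR rate matrix from 6 exchangeabilities and 4 frequencies.

    Off-diagonal rate q[i][j] = r_ij * freqs[j]; each diagonal entry makes its row
    sum to zero; the matrix is scaled so the expected substitution rate is 1.
    """
    n = len(freqs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    r = [[0.0] * n for _ in range(n)]
    for (i, j), value in zip(pairs, exchangeabilities):
        r[i][j] = r[j][i] = value
    q = [[r[i][j] * freqs[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    scale = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[q[i][j] / scale for j in range(n)] for i in range(n)]


def hky_exchangeabilities(kappa):
    """Exchangeabilities (summing to 1) that make GTR behave like HKY."""
    weights = [1.0, kappa, 1.0, 1.0, kappa, 1.0]   # AG and CT are transitions
    total = sum(weights)
    return [w / total for w in weights]


# Simulation settings from this lab, reordered from T, C, A, G to A, C, G, T.
freqs = [0.3, 0.2, 0.4, 0.1]
Q = gtr_rate_matrix(hky_exchangeabilities(5.0), freqs)
for row in Q:
    print([round(value, 4) for value in row])
```
<br />
Because the exchangeabilities are constrained to sum to 1, only their relative sizes matter, which is why a Dirichlet prior is a natural choice for them.<br />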
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
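<br />
A common way to propose a new simplex, and the one I am assuming here for illustration, is to draw it from a Dirichlet distribution whose parameters are the current values multiplied by a concentration <tt>alpha</tt>. The sketch below uses only the Python standard library; the function names are invented, and the RevBayes documentation should be consulted for the details of the actual move.<br />
<br />
```python
import random


def dirichlet_sample(params, rng):
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def simplex_proposal(current, alpha, rng):
    """Propose a new simplex centered on the current one (assumed form of the move)."""
    return dirichlet_sample([alpha * c for c in current], rng)


def mean_shift(current, alpha, n=200, seed=7):
    """Average total absolute change between the current and proposed simplex."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        proposed = simplex_proposal(current, alpha, rng)
        total += sum(abs(p - c) for p, c in zip(proposed, current))
    return total / n


current_freqs = [0.3, 0.2, 0.4, 0.1]
example = simplex_proposal(current_freqs, 10.0, random.Random(42))
print(example, sum(example))
print(mean_shift(current_freqs, 10.0), mean_shift(current_freqs, 1000.0))
```
<br />
Every proposal still sums to 1, and a larger <tt>alpha</tt> keeps the proposed values closer to the current ones.<br />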
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
Download the singularity image for RevBayes as follows:<br />
<br />
cd ~/rblab # just to make sure you are in the right place<br />
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
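<br />
Tracer reports 95% HPD intervals for you, but it helps to know what is being computed. The sketch below takes the HPD interval to be the shortest interval containing 95% of the sampled values, which may differ slightly from Tracer's own calculation. The helper for reading a column assumes the log file is tab-delimited with a single header row of parameter names; both function names are invented for this example.<br />
<br />
```python
import csv
import math


def hpd_interval(samples, mass=0.95):
    """Return the shortest interval containing the requested fraction of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    n_inside = max(1, int(math.ceil(mass * n)))
    best = None
    # Slide a window of n_inside consecutive sorted values and keep the narrowest.
    for start in range(n - n_inside + 1):
        lower = ordered[start]
        upper = ordered[start + n_inside - 1]
        if best is None or upper - lower < best[1] - best[0]:
            best = (lower, upper)
    return best


def read_column(path, name):
    """Read one named column from a tab-delimited log file (assumed layout)."""
    with open(path) as handle:
        rows = csv.DictReader(
            (line for line in handle if not line.startswith("#")), delimiter="\t"
        )
        return [float(row[name]) for row in rows]


# Synthetic example: 1000 evenly spaced values plus one extreme outlier.
demo = [i / 100.0 for i in range(1000)] + [1000.0]
print(hpd_interval(demo))
```
<br />
With the real output you would call something like <tt>hpd_interval(read_column("output/strict.log", "clock_rate"))</tt>, assuming the column is named after the parameter.<br />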
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
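<br />
To get a feel for how much the tuning parameter matters, suppose (this is an assumption for illustration, not a statement about RevBayes internals) that the proposed simplex is drawn from a Dirichlet distribution with parameters alpha times the current values. For that distribution, a component whose current value is p has proposal standard deviation sqrt(p(1-p)/(alpha+1)), which the following Python sketch evaluates for several values of alpha.<br />
<br />
```python
import math


def proposal_sd(p, alpha):
    """Standard deviation of one component proposed from Dirichlet(alpha * current),
    where p is that component's current value."""
    return math.sqrt(p * (1.0 - p) / (alpha + 1.0))


# A state frequency currently equal to 0.1, under increasing concentration.
for alpha in (10, 100, 1000, 10000):
    print(alpha, proposal_sd(0.1, alpha))
```
<br />
Under this assumption, each 100-fold increase in alpha shrinks the typical proposed change roughly 10-fold, so a cap of 100 can leave the proposals too bold for a very sharply peaked posterior.<br />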
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 edge rate parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model. Because the 38 edge rates replace the single clock_rate, we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (1 birth_rate, 3 state_freqs, 5 exchangeabilities, 38 edge rates, and the 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
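<br />
The conversion between the two scales uses the standard lognormal formulas: the mean on the linear scale is exp(mu + sigma²/2) and the standard deviation is the mean times sqrt(exp(sigma²) - 1). Here is a small Python sketch of that conversion; the function name is invented for this example.<br />
<br />
```python
import math


def lognormal_mean_sd(mu, sigma):
    """Convert lognormal parameters (mean and sd of the log) to the linear-scale mean and sd."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd


# With mu = 0 and sigma = 1 the linear-scale mean is well above exp(mu) = 1.
print(lognormal_mean_sd(0.0, 1.0))
```
<br />
Note that the mean on the linear scale depends on sigma as well as mu, so mu = 0 does not imply a mean rate of 1 unless sigma is small.<br />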
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
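<br />
Using the description of weights given earlier (a move's chance of being tried is its weight divided by the sum of all weights), the effect of these extra weights can be sketched in Python. The dictionary below lists only the four tree moves plus one representative weight-1 move, and the function name is invented for this example.<br />
<br />
```python
def selection_probabilities(weights):
    """Convert move weights to selection probabilities: weight / sum of all weights."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# The four tree moves added above, plus one ordinary weight-1 slide move.
weights = {
    "mvNodeTimeSlideUniform": 10.0,
    "mvNNI": 5.0,
    "mvNarrow": 5.0,
    "mvFNPR": 5.0,
    "mvSlide on birth_rate": 1.0,
}
probs = selection_probabilities(weights)
for name, p in probs.items():
    print(name, round(p, 3))
```
<br />
Adding every other weight-1 move in the script to the dictionary lowers all of the probabilities, but the 10:5:1 ratios among the moves stay the same.<br />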
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size if we had performed the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41654Phylogenetics: RevBayes Lab2020-04-23T16:30:46Z<p>Paul Lewis: /* Login to Xanadu */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and RevBayes modules:<br />
module load paml/4.9<br />
module load RevBayes/1.0.13<br />
<br />
<!-- module load singularity/3.5.2 <br />
<br />
The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.<br />
--><br />
<br />
There are currently some issues with filesystems on Xanadu, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
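<br />
A quick sanity check can catch a malformed tree before it goes into the control file. The sketch below relies on the fact that a Newick description with n tips contains n-1 commas, and it assumes that the file holds a single tree whose taxon names contain no commas. The helper name <tt>count_tips</tt> is made up for this illustration, and the demo uses a tiny 4-taxon stand-in written to a scratch file rather than your real ''tree.txt''.<br />

```shell
# count_tips FILE: hypothetical helper that counts commas and adds 1
count_tips() {
    commas=$(tr -cd ',' < "$1" | wc -c)
    echo $((commas + 1))
}

# Demo on a 4-taxon stand-in tree (not the output of evolver)
demo=$(mktemp)
echo "((A:0.5,B:0.5):0.5,(C:0.3,D:0.3):0.7);" > "$demo"
ntips=$(count_tips "$demo")
echo "tips: $ntips"
rm -f "$demo"
```

Running <tt>count_tips tree.txt</tt> on your own file should report 20 if the simulation went as planned.<br />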
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
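<br />
If you would rather not paste the tree by hand, the nine lines listed above can be assembled with a here-document. This is only a sketch: it assumes the tree description is the first line of ''tree.txt'' containing an opening parenthesis, the seed 12345 is a placeholder for your own, and the demo works in a scratch directory with a 4-taxon stand-in tree (so the 20 on line 3 would not match it in a real run).<br />

```shell
# Scratch directory with a stand-in tree; in the lab, tree.txt comes from evolver
workdir=$(mktemp -d)
echo "((A:0.5,B:0.5):0.5,(C:0.3,D:0.3):0.7);" > "$workdir/tree.txt"

seed=12345                                   # placeholder: use your own seed
tree=$(grep -m 1 '(' "$workdir/tree.txt")    # assumed layout: first line with "("

# Write the 9-line control file described in the list above
cat > "$workdir/control.dat" <<EOF
2
$seed
20 10000 1
-1
$tree
4
5
0 0
0.1 0.2 0.3 0.4
EOF

nlines=$(wc -l < "$workdir/control.dat")
echo "control.dat has $nlines lines"
```

Check the result with <tt>cat control.dat</tt> before handing it to evolver.<br />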
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
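<br />
To see the difference between <tt>></tt>, <tt>>></tt>, and touch without disturbing your lab files, the same commands can be tried in a scratch directory. This is an illustrative sketch only; the appended second line is a made-up NEXUS comment and is not needed for the lab.<br />

```shell
scratch=$(mktemp -d)

echo "#nexus" > "$scratch/paupstart"             # > creates (or overwrites) the file
echo "[appended line]" >> "$scratch/paupstart"   # >> adds a second line to the same file
touch "$scratch/paupblock" "$scratch/paupend"    # creates two empty files

start_lines=$(wc -l < "$scratch/paupstart")
echo "paupstart has $start_lines lines"
ls -l "$scratch"
```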
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect: vector indices in the Rev language, like R, start at 1, not 0 (which is why we initialized <tt>nmoves</tt> to 1), so the first move would land in element 2 and element 1 would be left empty. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
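<br />
The weight rule just described is simple arithmetic, and it can be checked with a few lines of awk. The sketch below uses the four moves that ''strict.Rev'' will contain once the substitution model is added (all with weight 1.0) and follows the description given here; it is not taken from the RevBayes source code.<br />

```shell
# Each input line: move name and weight; output: name and weight / sum of weights
probs=$(printf '%s\n' \
    "birth_rate 1.0" \
    "clock_rate 1.0" \
    "exchangeabilities 1.0" \
    "state_freqs 1.0" |
  awk '{ name[NR] = $1; w[NR] = $2; total += $2 }
       END { for (i = 1; i <= NR; i++) printf "%s %.3f\n", name[i], w[i] / total }')
echo "$probs"
```

Changing one of the weights to 10.0 and rerunning shows how strongly a single heavily weighted move dominates the schedule.<br />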
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
Download the singularity image for RevBayes as follows:<br />
<br />
cd ~/rblab # just to make sure you are in the right place<br />
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus the clock_rate that the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
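<br />
The conversion between the two scales uses the standard lognormal formulas: the mean on the linear scale is exp(mu + sigma^2/2), and the standard deviation is that mean times sqrt(exp(sigma^2) - 1). A minimal sketch using awk follows; the function name <tt>lognormal_moments</tt> is made up for this illustration, and the demo values mu=0 and sigma=1 are arbitrary.<br />

```shell
# lognormal_moments MU SIGMA: print mean and standard deviation on the linear scale
lognormal_moments() {
    awk -v mu="$1" -v sigma="$2" 'BEGIN {
        mean = exp(mu + sigma * sigma / 2)
        sd   = mean * sqrt(exp(sigma * sigma) - 1)
        printf "%.4f %.4f\n", mean, sd
    }'
}

# Arbitrary demo values: mu = 0, sigma = 1 on the log scale
moments=$(lognormal_moments 0 1)
echo "$moments"
```

You can call the function again later with your own estimates of ucln_mu and ucln_sigma to translate them to the linear scale.<br />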
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev<br />
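<br />
If you would rather not rename the output files by hand before running the command above, sed can do it. The sketch below is only an illustration: it works on a scratch copy containing a few representative lines, keeps a ''.bak'' backup, and rewrites only the quoted file names so that the comment line mentioning "relaxed clock" is left alone.<br />

```shell
scratch=$(mktemp -d)

# Stand-in for the relevant lines of divtime.Rev (copied from the relaxed script)
cat > "$scratch/divtime.Rev" <<'EOF'
# Uncorrelated Lognormal relaxed clock
mymodel.graph("relaxed.dot", TRUE, "white")
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
EOF

# Rename quoted output file names only; -i.bak edits in place and keeps a backup
sed -i.bak \
    -e 's#"output/relaxed\.#"output/divtime.#' \
    -e 's#"relaxed\.dot"#"divtime.dot"#' \
    "$scratch/divtime.Rev"

# Count quoted strings that still mention "relaxed" (should be none)
remaining=$(grep -c '"[^"]*relaxed' "$scratch/divtime.Rev" || true)
echo "quoted names still containing relaxed: $remaining"
```

Afterwards, <tt>grep divtime divtime.Rev</tt> is a quick way to confirm that every output file name was updated.<br />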
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
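<br />
The mean and variance quoted above for the birth_rate prior follow directly from the rate (0.01) of the Exponential distribution. Here is a minimal Python sketch of that arithmetic; the function name is made up for illustration and is not part of RevBayes:<br />
<br />
```python
def exponential_mean_variance(rate):
    """Mean and variance of an Exponential distribution with the given rate."""
    mean = 1.0 / rate            # mean is the reciprocal of the rate
    variance = 1.0 / rate ** 2   # variance is the squared mean
    return mean, variance

# The birth_rate prior in this lab is dnExponential(0.01)
prior_mean, prior_variance = exponential_mean_variance(0.01)
print(prior_mean, prior_variance)
```
<br />
A prior this diffuse says very little about the birth rate, which is why trees drawn from it rarely resemble the tree simulated with birth rate 2.6.<br />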
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' in FigTree and set the node bars to show the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
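<br />
Tracer and FigTree compute the 95% HPD intervals for you, but it may help to see what is being computed. The sketch below assumes the usual definition (the narrowest interval containing the requested fraction of the sampled values); the function name and the toy samples are hypothetical and are not output from this lab:<br />
<br />
```python
import math

def hpd_interval(samples, mass=0.95):
    """Return the narrowest interval that contains `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    # Number of samples the interval must cover; the small tolerance guards
    # against floating-point round-up in mass * n
    k = max(1, math.ceil(mass * n - 1e-9))
    best_lo, best_hi = xs[0], xs[k - 1]
    for i in range(1, n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi

# Toy example: half of these four values fit in the interval (10, 11)
print(hpd_interval([0, 10, 11, 30], mass=0.5))
```
<br />
Comparing the width of such an interval with and without the data is exactly the comparison the question above asks you to make.<br />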
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41653Phylogenetics: RevBayes Lab2020-04-23T03:18:04Z<p>Paul Lewis: /* Obtaining credible intervals under the prior */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and singularity modules:<br />
module load paml/4.9<br />
module load singularity/3.5.2<br />
<br />
The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
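<br />
To see why rate and time cannot be separated, here is a small Python sketch. It uses the Jukes-Cantor (JC69) model purely because its formula is short; that is a simplification chosen for illustration, not the model used later in this lab. The probability depends only on the product of rate and time:<br />
<br />
```python
import math

def jc69_prob_same(rate, time):
    """Probability that both ends of a branch show the same base under JC69."""
    expected_subs = rate * time   # only this product enters the formula
    return 0.25 + 0.75 * math.exp(-4.0 * expected_subs / 3.0)

# A fast rate over a short time and a slow rate over a long time
fast_and_young = jc69_prob_same(rate=2.0, time=0.5)
slow_and_old = jc69_prob_same(rate=0.5, time=2.0)
print(fast_and_young, slow_and_old)
```
<br />
Both scenarios give the same probability, so the likelihood alone cannot prefer one over the other; only the priors can.<br />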
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
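<br />
One way to check that 2.6 is a reasonable choice is sketched below in Python. It rests on an assumption that is not stated above: the tree starts with 2 lineages at the root and is stopped just before the speciation that would create a 21st lineage, so the expected waiting time with k lineages, 1/(k × birth rate), is summed for k = 2 to 20. The function name is hypothetical:<br />
<br />
```python
def expected_yule_height(n_taxa, birth_rate):
    """Expected root-to-tip time of a pure birth tree grown from 2 lineages.

    Assumption: waiting times with k = 2..n_taxa lineages are summed, each
    having mean 1 / (k * birth_rate).
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))

# Expected height for 20 taxa and the birth rate used in this lab
print(round(expected_yule_height(20, 2.6), 3))
```
<br />
Under that assumption the expected height comes out very close to 1.<br />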
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
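<br />
If you would rather generate the control file than type it, the line-by-line description above can be sketched in Python as follows. The function name, the seed, and the 3-taxon tree are placeholders for illustration only; you would substitute your own seed and the 20-taxon tree description from ''tree.txt'':<br />
<br />
```python
def make_evolver_control(seed, newick_tree):
    """Return the text of the 9-line control file described above."""
    lines = [
        "2",                # output format: nexus
        str(seed),          # random number seed
        "20 10000 1",       # taxa, sites, data sets
        "-1",               # use the branch lengths in the tree description
        newick_tree,        # tree description from the first evolver run
        "4",                # HKY model
        "5",                # kappa
        "0 0",              # gamma shape, number of rate categories
        "0.1 0.2 0.3 0.4",  # state frequencies in the order T, C, A, G
    ]
    return "\n".join(lines) + "\n"

# Placeholder seed and tree, shown only to illustrate the layout
text = make_evolver_control(12345, "((A:0.5,B:0.5):0.5,C:1.0);")
print(text, end="")
```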
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to birth_rate.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect: <tt>nmoves</tt> starts at 1 because vector indices in the Rev Language, like R, start at 1, not 0, so the first move would be stored at index 2 and index 1 would be left empty. <br />
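<br />
The difference between the two forms of increment can be mimicked in Python (which has no <tt>++</tt> operator, so the increment is written out). This is only an illustration of the indexing; the function names are made up, and a dict keyed by 1-based index stands in for the Rev vector:<br />
<br />
```python
def fill_with_post_increment(n_items):
    """Mimic moves[nmoves++] = ...: use the counter, then increment it."""
    vector = {}    # dict keyed by 1-based index, standing in for a Rev vector
    counter = 1
    for item in range(n_items):
        vector[counter] = f"move{item + 1}"
        counter += 1
    return vector

def fill_with_pre_increment(n_items):
    """Mimic moves[++nmoves] = ...: increment the counter, then use it."""
    vector = {}
    counter = 1
    for item in range(n_items):
        counter += 1
        vector[counter] = f"move{item + 1}"
    return vector

print(sorted(fill_with_post_increment(3)))   # indices used: [1, 2, 3]
print(sorted(fill_with_pre_increment(3)))    # indices used: [2, 3, 4]
```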
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
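<br />
The two ideas in the paragraph above (a window of width delta centered on the current value, and move selection in proportion to weight) can be sketched in Python. This is a simplified illustration under those stated descriptions, not RevBayes source code, and the function names are hypothetical:<br />
<br />
```python
def slide_proposal(current, delta, u):
    """Propose a value from a window of width delta centered on current.

    u is a Uniform(0,1) random number, passed in so the example is repeatable.
    """
    return current + delta * (u - 0.5)

def move_probabilities(weights):
    """Probability of choosing each move: its weight over the sum of weights."""
    total = sum(weights)
    return [w / total for w in weights]

print(slide_proposal(1.0, delta=1.0, u=0.75))   # 1.25
print(move_probabilities([1.0, 1.0, 2.0]))      # [0.25, 0.25, 0.5]
```
<br />
With delta=1.0 a proposal can land anywhere within 0.5 of the current value; a move with twice the weight of the others is tried twice as often.<br />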
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
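<br />
The tuning that happens during burnin can be pictured with the following Python sketch. The rule shown (multiply or divide delta by a fixed factor) is a generic illustration chosen for simplicity; it is an assumption made here and is not the formula RevBayes actually uses:<br />
<br />
```python
def tune_delta(delta, accepted, attempted, target=0.4, factor=1.5):
    """Illustrative tuning rule (not RevBayes's actual formula)."""
    rate = accepted / attempted
    if rate > target:
        return delta * factor   # accepting too often: propose bolder moves
    if rate < target:
        return delta / factor   # accepting too rarely: propose smaller moves
    return delta

# 80 of 100 proposals accepted: the window is widened
print(tune_delta(1.0, accepted=80, attempted=100))
# 10 of 100 proposals accepted: the window is narrowed
print(tune_delta(1.5, accepted=10, attempted=100))
```
<br />
Repeating this every <tt>tuningInterval</tt> iterations nudges each move toward its <tt>tuneTarget</tt> acceptance rate.<br />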
<br />
== Run RevBayes ==<br />
<br />
Download the singularity image for RevBayes as follows:<br />
<br />
cd ~/rblab # just to make sure you are in the right place<br />
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
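<br />
If you want to check your reasoning about the exchangeabilities, the following Python sketch converts the kappa used in the simulation (5) into the six normalized exchangeabilities of the equivalent GTR model. The ordering AC, AG, AT, CG, CT, GT is assumed here for illustration, and the function name is hypothetical:<br />
<br />
```python
def hky_exchangeabilities(kappa):
    """Normalized GTR exchangeabilities implied by HKY with the given kappa.

    Order assumed here: AC, AG, AT, CG, CT, GT (AG and CT are transitions).
    """
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]

rates = hky_exchangeabilities(5.0)
print([round(r, 3) for r in rates])
```
<br />
The two transition values should be 5 times the four transversion values, which is the pattern to look for in the Tracer densities.<br />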
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
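<br />
To see why a larger alpha means smaller proposed changes, here is a Python sketch. It assumes (this is an assumption, not something taken from the RevBayes documentation) that the new simplex is drawn from a Dirichlet distribution whose parameters are alpha times the current values. Under that assumption, a coordinate with current value x has proposal variance x(1-x)/(alpha+1):<br />
<br />
```python
import math

def proposal_sd(current_value, alpha):
    """SD of one coordinate proposed from Dirichlet(alpha * current simplex).

    Assumption: the proposal is centered on the current simplex with total
    concentration alpha, giving variance x * (1 - x) / (alpha + 1).
    """
    x = current_value
    return math.sqrt(x * (1.0 - x) / (alpha + 1.0))

# Spread of proposals for a state frequency currently equal to 0.4
for alpha in (10.0, 100.0, 1000.0, 10000.0):
    print(alpha, proposal_sd(0.4, alpha))
```
<br />
Each tenfold increase in alpha shrinks the typical step by roughly a factor of 3, which is what a very sharp posterior needs.<br />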
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
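<br />
The "uncorrelated" idea can be sketched in Python using only the standard library: each edge gets its own rate, drawn independently from the same lognormal distribution. The function name, the seed, and the values of mu and sigma are arbitrary choices for illustration:<br />
<br />
```python
import random

def draw_branch_rates(n_branches, mu, sigma, seed=1):
    """Draw one independent lognormal rate per branch."""
    rng = random.Random(seed)   # seeded so the example is repeatable
    return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]

# A rooted 20-taxon tree has 2*20 - 2 = 38 edges
rates = draw_branch_rates(n_branches=2 * 20 - 2, mu=0.0, sigma=0.1)
print(len(rates), min(rates), max(rates))
```
<br />
In a correlated model, by contrast, the rate drawn for an edge would depend on the rate of its parent edge.<br />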
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
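<br />
To put rough numbers on this, here is a small worked example. It assumes that moves are chosen in proportion to their weights, and that ''divtime.Rev'' still contains the 43 weight-1 moves set up so far (birth_rate, ucln_mu, ucln_sigma, the 38 branch rates, exchangeabilities, and state_freqs). The weights then sum to 43 + 10 + 3&times;5 = 68, so:<br />
<math>\Pr(\text{node time slide}) = \frac{10}{68} \approx 0.147, \qquad \Pr(\text{each topology move}) = \frac{5}{68} \approx 0.074, \qquad \Pr(\text{each other move}) = \frac{1}{68} \approx 0.015</math><br />
Under those assumptions, the four tree moves together account for 25/68 (about 37%) of all proposals, and the remaining 63% are still spent on the rate and substitution model parameters.<br />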
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file (before the final <tt>quit()</tt> line, if your script has one). This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a MAP (maximum a posteriori) summary tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev<br />
<br />
Be prepared to wait a little longer this time; we've added a lot of extra work to the analysis.<br />
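<br />
If you would rather not hunt for every output file name by hand, the renaming step mentioned above (done before you run the script) can be sketched with sed. This is only a sketch: <tt>rename_outputs</tt> is a hypothetical helper of my own, not part of RevBayes, and it assumes your file names follow the <tt>output/relaxed...</tt> and <tt>"relaxed.dot"</tt> patterns used in this lab. It leaves the "# Uncorrelated Lognormal relaxed clock" comment alone and keeps a backup copy with a ''.bak'' extension.<br />
<br />
```shell
# Hypothetical helper (not part of RevBayes): point a copied script's output
# files at a new prefix. A backup of the script is kept as <script>.bak.
rename_outputs() {
    script=$1
    old=$2
    new=$3
    sed -i.bak \
        -e "s|output/$old|output/$new|g" \
        -e "s|\"$old\.dot\"|\"$new.dot\"|g" \
        "$script"
}

# Usage, run inside ~/rblab after the cp step:
if [ -f divtime.Rev ]; then
    rename_outputs divtime.Rev relaxed divtime
    # Anything still listed here should be a comment, not a file name
    grep -n 'relaxed' divtime.Rev || true
fi
```
<br />
It is still worth opening the file afterwards to confirm that the mnModel, mnFile, and graph lines now say ''divtime''.<br />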
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate at their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41652Phylogenetics: RevBayes Lab2020-04-23T03:17:36Z<p>Paul Lewis: /* warning: this section is a work in progress */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Log in to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and singularity modules:<br />
module load paml/4.9<br />
module load singularity/3.5.2<br />
<br />
The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
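<br />
If you are curious where 2.6 comes from, here is a quick check. It is a sketch under two assumptions of mine: that while k lineages exist in a pure birth process with rate lambda the waiting time to the next speciation has mean 1/(k*lambda), and that the tree height runs from the root split (2 lineages) through the interval with 20 lineages. The expected height is then (1/2 + 1/3 + ... + 1/20)/lambda, so the birth rate giving expected height 1 is just that sum. The variable names are mine, not evolver's.<br />
<br />
```shell
# Sum 1/k for k = 2..n_taxa. This is the expected tree height when the birth
# rate is 1, and therefore also the birth rate that gives expected height 1.
n_taxa=20
rate=$(awk -v n="$n_taxa" 'BEGIN {
    s = 0
    for (k = 2; k <= n; k++) s += 1 / k
    printf "%.4f", s
}')
echo "birth rate giving expected height 1: $rate"
```
<br />
For 20 taxa the sum is 2.5977, which rounds to the 2.6 used above.<br />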
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except lambda, which is set to birth_rate.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
Download the singularity image for RevBayes as follows:<br />
<br />
cd ~/rblab # just to make sure you are in the right place<br />
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
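<br />
For reference, such a slurm script might look something like the sketch below. The partition and qos are simply copied from the <tt>srun</tt> command used at the start of the lab; the script name ''strict.slurm'', the job name, and the output file names are my own choices, and your cluster may want additional <tt>#SBATCH</tt> settings.<br />
<br />
```shell
# Write a minimal slurm script (file name and job name are arbitrary choices)
cat > strict.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=strict
#SBATCH -p mcbstudent
#SBATCH --qos=mcbstudent
#SBATCH -o strict_%j.out
#SBATCH -e strict_%j.err

module load singularity/3.5.2
cd ~/rblab
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev
EOF

# Check the script for shell syntax errors without running it
bash -n strict.slurm && echo "strict.slurm looks OK"

# To submit it for real:
#   sbatch strict.slurm
```
<br />
The <tt>%j</tt> in the output file names is replaced by the job number, so repeated submissions do not overwrite each other's screen output.<br />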
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
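<br />
If you happen to have Graphviz installed on the machine where ''strict.dot'' lives, you can also make the PDF yourself with the <tt>dot</tt> program instead of using an online viewer. This is just a sketch: it assumes <tt>dot</tt> is on your PATH (I have not checked whether it is available on Xanadu) and falls back to a reminder otherwise. The variable <tt>dag_status</tt> is mine.<br />
<br />
```shell
# Render strict.dot to a PDF when Graphviz's dot program is available;
# otherwise the online viewers listed above remain the way to go.
if command -v dot >/dev/null 2>&1 && [ -f strict.dot ]; then
    dot -Tpdf strict.dot -o strict.pdf
    dag_status="wrote strict.pdf"
else
    dag_status="dot or strict.dot not found; use an online viewer"
fi
echo "$dag_status"
```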
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also notice that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
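<br />
Here is a rough way to see why a larger alpha means smaller steps. It rests on my reading of how mvDirichletSimplex works (check the RevBayes documentation before relying on it): the proposed value is assumed to be drawn from a Dirichlet distribution centered on the current value with concentration alpha. Under that assumption, an element whose current value is <math>x_i</math> is proposed a new value <math>x_i'</math> with standard deviation<br />
<math>\mathrm{sd}(x_i') = \sqrt{\frac{x_i (1 - x_i)}{\alpha + 1}}</math><br />
For a state frequency near 0.1, that is about 0.030 when alpha is 100 but only about 0.003 when alpha is 10000, so each tenfold reduction in step size costs a hundredfold increase in alpha.<br />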
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model (while dropping clock_rate, which the edge rates replace). Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've changed the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10 minus clock_rate, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
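<br />
The conversion from (mu, sigma) on the log scale to the mean and standard deviation on the linear scale can be sketched as below. It uses the standard lognormal formulas mean = exp(mu + sigma^2/2) and sd = mean * sqrt(exp(sigma^2) - 1); the function name <tt>lognormal_moments</tt> is hypothetical, and awk is used only because it is available at the bash prompt.<br />
<br />
```shell
# Hypothetical helper: convert lognormal parameters (mu, sigma), which live on
# the log scale, to the mean and standard deviation on the linear scale.
lognormal_moments() {
    # $1 = mu, $2 = sigma
    awk -v mu="$1" -v sigma="$2" 'BEGIN {
        mean = exp(mu + sigma * sigma / 2)
        sd   = mean * sqrt(exp(sigma * sigma) - 1)
        printf "mean=%.4f sd=%.5f\n", mean, sd
    }'
}

# Example: mu = 0 and sigma = 1 do NOT give mean 0 and sd 1 on the linear scale
lognormal_moments 0 1
```
<br />
You can call the same function with your own posterior means for ucln_mu and ucln_sigma once you have them from Tracer.<br />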
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far from a reasonable constellation of parameter values, and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence because they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
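<br />
To see how "crazy" such starting values can be, here is a small sketch. It assumes the second argument of dnNormal(0.0, 100) is a standard deviation, so a draw of ucln_mu one standard deviation from its prior mean is +100 or -100; the median of a Lognormal(mu, sigma) variable is exp(mu). The function name <tt>median_rate</tt> is hypothetical.<br />
<br />
```shell
# Hypothetical helper: median of Lognormal(mu, sigma) on the linear scale is exp(mu).
# ASSUMPTION: dnNormal(0.0, 100) means prior mean 0 and standard deviation 100.
median_rate() {
    # $1 = value of ucln_mu
    awk -v mu="$1" 'BEGIN { printf "%.3e\n", exp(mu) }'
}

# Median branch rate implied by ucln_mu at -1 sd, the prior mean, and +1 sd
for mu in -100 0 100; do
    echo "ucln_mu=$mu median branch rate=$(median_rate "$mu")"
done
```
<br />
Rates dozens of orders of magnitude away from 1 are entirely ordinary draws under this hyperprior, which is why fixing the starting values with setValue helps.<br />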
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
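<br />
What these weights mean in practice can be sketched as follows, assuming a move is chosen with probability equal to its weight divided by the sum of all weights, and assuming the move list written in the comment below (your script may differ). The function name <tt>move_prob</tt> is hypothetical.<br />
<br />
```shell
# Hypothetical helper: probability that a move with weight $1 is chosen next,
# given the sum of all move weights $2.
move_prob() {
    awk -v w="$1" -v total="$2" 'BEGIN { printf "%.4f\n", w / total }'
}

# ASSUMED move list for divtime.Rev: birth_rate, exchangeabilities, state_freqs,
# ucln_mu, ucln_sigma (weight 1 each), 38 branch rates (weight 1 each),
# one node time slider (weight 10), three topology moves (weight 5 each).
total=$((5 * 1 + 38 * 1 + 10 + 3 * 5))

echo "sum of weights:      $total"
echo "node time slider:    $(move_prob 10 "$total")"
echo "each topology move:  $(move_prob 5 "$total")"
echo "each remaining move: $(move_prob 1 "$total")"
```
<br />
Under these assumptions the four tree moves together account for 25 of the 68 weight units, so a bit more than a third of all proposals touch the tree.<br />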
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from trees.txt into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals turned out to be exactly the same size as those from an MCMC analysis performed on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
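<br />
Edits 1, 2, and 4 above are mechanical, so they could be scripted with sed, as in the sketch below; edits 3 and 5 are still best done by hand. This assumes the lines in your ''divtime.Rev'' look exactly like the ones shown in this lab, and the function name <tt>make_divprior</tt> is made up for this illustration.<br />
<br />
```shell
# Hypothetical helper: apply edits 1, 2 and 4 while copying divtime.Rev.
# ASSUMPTION: the script lines match the ones shown in this lab exactly.
make_divprior() {
    # $1 = input script, $2 = output script
    sed -e 's/divtime/divprior/g' \
        -e '/mvNNI(/ s/^/#/' \
        -e '/mvNarrow(/ s/^/#/' \
        -e '/mvFNPR(/ s/^/#/' \
        -e '/mnScreen(/ s/printgen=100)/printgen=10000)/' \
        "$1" > "$2"
}

if [ -f divtime.Rev ]; then
    make_divprior divtime.Rev divprior.Rev
    # Review every change before running (diff exits non-zero when files differ)
    diff divtime.Rev divprior.Rev || true
fi
```
<br />
Checking the diff output is worthwhile because a pattern that fails to match will silently leave that line unchanged.<br />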
<br />
Now run the file as usual:<br />
rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41651Phylogenetics: RevBayes Lab2020-04-23T03:16:57Z<p>Paul Lewis: /* Run RevBayes */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and singularity modules:<br />
module load paml/4.9<br />
module load singularity/3.5.2<br />
<br />
The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above, which the list above calls the tree height). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.<br />
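<br />
One way to check that a birth rate of 2.6 gives an expected height near 1 for 20 taxa is sketched below. It assumes the height is the sum of the expected waiting times while the tree has k = 2, 3, ..., 20 lineages, each waiting time being exponential with rate k times the birth rate; this is an assumption about where 2.6 comes from, and the function name <tt>expected_yule_height</tt> is hypothetical.<br />
<br />
```shell
# Hypothetical helper: expected root-to-tip height of a pure birth tree.
# ASSUMPTION: height = sum of expected waiting times while the tree has
# k = 2, 3, ..., n lineages, each waiting time exponential with rate k*lambda.
expected_yule_height() {
    # $1 = number of tips, $2 = birth rate
    awk -v n="$1" -v lambda="$2" 'BEGIN {
        h = 0
        for (k = 2; k <= n; k++) h += 1 / (k * lambda)
        printf "%.4f\n", h
    }'
}

expected_yule_height 20 2.6
```
<br />
Trying other birth rates (say 1.0 or 5.0) shows how quickly the expected height moves away from 1.<br />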
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to our <tt>birth_rate</tt> parameter.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
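<br />
The difference between the two forms of <tt>++</tt> can be tried out at the bash prompt with awk, which happens to support both forms. This is awk, not Rev, so only the increment semantics described above carry over; the function name <tt>increment_demo</tt> and the strings stored in the arrays are purely illustrative.<br />
<br />
```shell
# Hypothetical demo: post-increment (n++) versus pre-increment (++m) as array indices.
increment_demo() {
    awk 'BEGIN {
        # Post-increment: the index is read first, then the counter is bumped.
        n = 1
        post[n++] = "mvSlide(birth_rate)"
        post[n++] = "mvSlide(clock_rate)"

        # Pre-increment: the counter is bumped first, so slot 1 is skipped.
        m = 1
        pre[++m] = "mvSlide(birth_rate)"
        pre[++m] = "mvSlide(clock_rate)"

        printf "n++ : slot 1 filled = %s, counter ends at %d\n", ((1 in post) ? "yes" : "no"), n
        printf "++m : slot 1 filled = %s, counter ends at %d\n", ((1 in pre) ? "yes" : "no"), m
    }'
}

increment_demo
```
<br />
Both counters end at the same value, but only the post-increment form starts filling the vector at index 1.<br />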
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
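<br />
A single sliding window proposal can be sketched as below: the proposed value is the current value plus delta times (u - 0.5), where u is a uniform random number between 0 and 1. This is the textbook version of the move; how RevBayes' mvSlide treats boundaries (for example, a proposed negative rate) is not covered here, and the function name <tt>slide_proposal</tt> is hypothetical.<br />
<br />
```shell
# Hypothetical helper: one sliding-window proposal (no boundary handling).
# $1 = current value, $2 = delta (window width), $3 = uniform(0,1) random number
slide_proposal() {
    awk -v cur="$1" -v delta="$2" -v u="$3" 'BEGIN { printf "%.4f\n", cur + delta * (u - 0.5) }'
}

slide_proposal 1.0 1.0 0.75   # wide window
slide_proposal 1.0 0.1 0.75   # same random number, narrower window
```
<br />
Tuning during burnin amounts to adjusting delta: a narrower window proposes values closer to the current one, which tends to raise the acceptance rate.<br />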
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions. The <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution (all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
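<br />
Because the data were simulated under HKY but are analyzed under GTR, it helps to know what exchangeabilities to expect. The sketch below treats HKY as the special case of GTR in which the 2 transitions share one exchangeability that is kappa times the value shared by the 4 transversions, and normalizes the 6 values to sum to 1 as a simplex requires. The function name <tt>hky_exchangeabilities</tt> is hypothetical.<br />
<br />
```shell
# Hypothetical helper: normalized GTR exchangeabilities implied by HKY with
# transition/transversion rate ratio kappa (2 transitions, 4 transversions).
hky_exchangeabilities() {
    # $1 = kappa
    awk -v kappa="$1" 'BEGIN {
        total = 2 * kappa + 4
        printf "transition=%.4f transversion=%.4f\n", kappa / total, 1 / total
    }'
}

hky_exchangeabilities 2
```
<br />
Calling the function with the kappa used in ''control.dat'' gives the values the posterior means should be close to.<br />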
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
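<br />
If Graphviz happens to be installed where you are working (an assumption; it may only be on your laptop), the dot file can be turned into a PDF from the command line with the <tt>dot</tt> program. The wrapper function <tt>dot_to_pdf</tt> below is made up for this illustration and simply skips the conversion when <tt>dot</tt> is missing.<br />
<br />
```shell
# Hypothetical helper: render a dot file to PDF with the Graphviz command-line tool.
# ASSUMPTION: Graphviz is installed where you run this (e.g. on your laptop).
dot_to_pdf() {
    # $1 = dot file, $2 = output PDF
    if command -v dot >/dev/null 2>&1; then
        dot -Tpdf "$1" -o "$2"
    else
        echo "Graphviz 'dot' not found; use an online viewer instead" >&2
    fi
}

if [ -f strict.dot ]; then
    dot_to_pdf strict.dot strict.pdf
fi
```
<br />
The same function works for any later dot file; only the two file names change.<br />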
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
Download the singularity image for RevBayes as follows:<br />
<br />
cd ~/rblab # just to make sure you are in the right place<br />
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
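<br />
Tracer reports the 95% HPD interval for you, but it may help to see how such an interval can be computed from MCMC samples. The Python sketch below uses a common empirical approach (the narrowest window containing 95% of the sorted samples); it is an illustration under that assumption, not Tracer's actual code, and the function name <tt>hpd_interval</tt> and the simulated samples are made up for the example.<br />
<br />
```python
import math
import random


def hpd_interval(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples (empirical HPD)."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples inside the interval
    # slide a window of k consecutive sorted samples and keep the narrowest one
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


if __name__ == "__main__":
    rng = random.Random(1)
    # stand-in for one column of a log file: a right-skewed sample
    fake = [rng.gammavariate(4.0, 0.5) for _ in range(1000)]
    lo, hi = hpd_interval(fake)
    print("95% HPD: ({:.3f}, {:.3f})".format(lo, hi))
```
<br />
For a skewed posterior such as that of birth_rate, the HPD interval is generally not the same as the interval between the 2.5% and 97.5% quantiles.<br />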
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
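<br />
To see why a larger tuning parameter means smaller steps, the Python sketch below assumes that the simplex move proposes a new vector from a Dirichlet distribution whose parameters are alpha times the current values. That is an assumption about how this kind of move works, not a description of the RevBayes source, and the helper names are invented for the example.<br />
<br />
```python
import random


def propose_simplex(current, alpha, rng):
    """Draw a proposal from Dirichlet(alpha * current) using gamma variates."""
    draws = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(draws)
    return [d / total for d in draws]


def mean_step(current, alpha, n=2000, seed=42):
    """Average absolute change per coordinate over n proposals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        new = propose_simplex(current, alpha, rng)
        total += sum(abs(a - b) for a, b in zip(new, current)) / len(current)
    return total / n


if __name__ == "__main__":
    freqs = [0.1, 0.2, 0.3, 0.4]  # the state frequencies used in the simulation
    for alpha in (10.0, 100.0, 1000.0, 10000.0):
        print("alpha = {:>7}: mean step = {:.5f}".format(alpha, mean_step(freqs, alpha)))
```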
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10 minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
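<br />
The conversion between the two scales follows from the standard lognormal moment formulas: the mean is exp(mu + sigma^2/2) and the standard deviation is the mean times sqrt(exp(sigma^2) - 1). The Python sketch below applies them in both directions; the function names and the example values (mu = 0, sigma = 0.5) are arbitrary choices for illustration.<br />
<br />
```python
import math


def lognormal_mean_sd(mu, sigma):
    """Linear-scale mean and standard deviation of a Lognormal(mu, sigma) variable."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd


def lognormal_mu_sigma(mean, sd):
    """Inverse conversion: log-scale mu and sigma from the linear-scale mean and sd."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


if __name__ == "__main__":
    mean, sd = lognormal_mean_sd(0.0, 0.5)  # arbitrary example values
    print("mean = {:.4f}, sd = {:.4f}".format(mean, sd))
    print("round trip:", lognormal_mu_sigma(mean, sd))
```
<br />
Note that mu = 0 does not give a mean of 1 on the linear scale unless sigma is also 0.<br />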
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
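<br />
The Python sketch below gives a feel for how extreme such prior-drawn starting values can be. It assumes that the second argument of <tt>dnNormal(0.0, 100)</tt> is a standard deviation and that <tt>dnExponential(.01)</tt> takes a rate (so its mean is 100); the function name <tt>draw_start</tt> is made up for the example. Branch rates are reported on the log10 scale because exponentiating them can overflow a floating point number.<br />
<br />
```python
import math
import random


def draw_start(rng):
    """Draw (ucln_mu, ucln_sigma, log of one branch rate) from the priors in relaxed.Rev."""
    mu = rng.gauss(0.0, 100.0)       # ucln_mu ~ Normal(mean 0, sd 100); sd is assumed
    sigma = rng.expovariate(0.01)    # ucln_sigma ~ Exponential(rate 0.01), mean 100
    log_rate = rng.gauss(mu, sigma)  # log of a Lognormal(mu, sigma) branch rate
    return mu, sigma, log_rate


if __name__ == "__main__":
    rng = random.Random(2020)
    for _ in range(5):
        mu, sigma, log_rate = draw_start(rng)
        print("mu = {:8.1f}  sigma = {:8.1f}  log10(rate) = {:8.1f}".format(
            mu, sigma, log_rate / math.log(10.0)))
```
<br />
A true rate of 1.0 corresponds to log10(rate) = 0, so most of these draws start many orders of magnitude away from the truth.<br />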
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
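<br />
If a move is chosen with probability equal to its weight divided by the sum of all weights, the effect of these weights can be tallied as in the Python sketch below. The list of moves is my reconstruction of what ''divtime.Rev'' should contain at this point (an assumption; your script may differ), and the labels are just descriptive strings.<br />
<br />
```python
def move_probabilities(weights):
    """Map each move name to weight / (sum of all weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


if __name__ == "__main__":
    n_taxa = 20
    weights = {
        "birth_rate slide": 1.0,
        "exchangeabilities simplex": 1.0,
        "state_freqs simplex": 1.0,
        "ucln_mu slide": 1.0,
        "ucln_sigma slide": 1.0,
        "all branch_rates slides": 1.0 * (2 * n_taxa - 2),  # 38 moves of weight 1
        "node time slide": 10.0,
        "NNI": 5.0,
        "Narrow": 5.0,
        "FNPR": 5.0,
    }
    for name, p in move_probabilities(weights).items():
        print("{:28s} {:.3f}".format(name, p))
```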
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we would have to conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41650Phylogenetics: RevBayes Lab2020-04-23T03:16:39Z<p>Paul Lewis: /* Relaxed clocks */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Log in to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and singularity modules:<br />
module load paml/4.9<br />
module load singularity/3.5.2<br />
<br />
The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
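<br />
The confounding of rate and time is easy to see with a few numbers. In the Python sketch below (the scenarios are hypothetical values chosen for illustration), three quite different combinations of rate and time all imply the same expected number of substitutions per site, which is all the sequences can tell us about.<br />
<br />
```python
def expected_substitutions(rate, time):
    """Expected substitutions per site along an edge: the product of rate and time."""
    return rate * time


if __name__ == "__main__":
    # three hypothetical (rate, time) scenarios that the sequences cannot tell apart
    scenarios = [(1.0, 0.5), (0.5, 1.0), (5.0, 0.1)]
    for rate, time in scenarios:
        print("rate = {:.1f}, time = {:.1f} -> {:.2f} substitutions/site".format(
            rate, time, expected_substitutions(rate, time)))
```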
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.<br />
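<br />
The choice of 2.6 can be checked with a little arithmetic. In a pure birth process with k lineages and birth rate lambda, the waiting time to the next speciation event is exponentially distributed with mean 1/(k*lambda). The Python sketch below sums those expected waiting times for k = 2 through the number of taxa; counting the final interval (with all 20 lineages present) in full is an assumption of this sketch, chosen because it reproduces a value close to 2.6. The function names are made up for the example.<br />
<br />
```python
def expected_yule_height(n_taxa, birth_rate):
    """Expected tree height: sum of mean waiting times with k = 2..n_taxa lineages."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))


def birth_rate_for_height(n_taxa, height=1.0):
    """Birth rate giving the requested expected height under the same assumption."""
    return sum(1.0 / k for k in range(2, n_taxa + 1)) / height


if __name__ == "__main__":
    print("expected height with rate 2.6: {:.4f}".format(expected_yule_height(20, 2.6)))
    print("rate giving expected height 1: {:.4f}".format(birth_rate_for_height(20)))
```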
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
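<br />
As a reminder of how a sliding window move works, here is a minimal Python sketch of a single update. To keep it short it targets the Exponential(0.01) prior alone and ignores the likelihood, which is an assumption of the sketch; the function names are invented, and this is not RevBayes code.<br />
<br />
```python
import math
import random


def log_exponential_density(x, rate):
    """Log density of Exponential(rate); minus infinity outside the support."""
    return math.log(rate) - rate * x if x > 0.0 else float("-inf")


def slide_move(x, delta, rng, rate=0.01):
    """One sliding-window Metropolis step targeting the Exponential(rate) prior alone.

    Returns (new value, accepted flag). The proposal is symmetric, so the
    acceptance ratio is just the ratio of target densities.
    """
    proposal = x + (rng.random() - 0.5) * delta  # uniform on a window of width delta
    log_ratio = log_exponential_density(proposal, rate) - log_exponential_density(x, rate)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True
    return x, False  # rejected: stay at the current value


if __name__ == "__main__":
    rng = random.Random(7)
    x, accepted = 1.0, 0
    for _ in range(1000):
        x, ok = slide_move(x, delta=1.0, rng=rng)
        accepted += ok
    print("acceptance rate:", accepted / 1000)
```
<br />
Because the prior is nearly flat over a window of width 1, almost every proposal is accepted here; with the likelihood included, acceptance would be much lower, which is why tuning toward 40% matters.<br />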
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions. The <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution (all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
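<br />
The following Python sketch (not Rev) illustrates how a proposal can change every element of a simplex at once while keeping the sum equal to 1. It assumes the new value is drawn from a Dirichlet distribution whose parameters are <tt>alpha</tt> times the current value; that is a common way to build such a move, but treat it as an assumption rather than a description of the RevBayes source code. The function names are hypothetical.<br />
<br />
```python
import random

def dirichlet_draw(params, rng=random):
    """Draw one point from a Dirichlet distribution by normalizing
    independent Gamma(shape=a, scale=1) variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def simplex_proposal(current, alpha, rng=random):
    """Propose a new simplex centered on the current one.
    Assumption: the proposal distribution is Dirichlet(alpha * current)."""
    return dirichlet_draw([alpha * x for x in current], rng)

if __name__ == "__main__":
    rng = random.Random(2020)            # arbitrary seed
    state_freqs = [0.1, 0.2, 0.3, 0.4]   # frequencies used in the simulation
    proposed = simplex_proposal(state_freqs, alpha=10.0, rng=rng)
    print(proposed, sum(proposed))       # all four values change; the sum stays 1
```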
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing <tt>mymodel</tt>, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the trees saved in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations. (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative.) Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and how often it succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
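<br />
The idea behind tuning during burnin can be sketched in a few lines of Python (not Rev). The update rule below is hypothetical: it widens the window when the observed acceptance rate exceeds the target and narrows it otherwise, and it leaves the window unchanged when the target is hit exactly. RevBayes applies its own rule internally, so treat this only as an illustration of the principle.<br />
<br />
```python
def tune_delta(delta, n_accepted, n_tried, target=0.4):
    """Return an adjusted window width after one tuning interval.
    Hypothetical rule: bolder moves if acceptance is too high,
    more conservative moves if it is too low."""
    if n_tried == 0:
        return delta                      # nothing learned in this interval
    rate = n_accepted / n_tried
    if rate > target:
        return delta * (1.0 + (rate - target) / (1.0 - target))
    return delta / (2.0 - rate / target)

if __name__ == "__main__":
    burnin_generations = 1000
    tuning_interval = 100
    n_rounds = burnin_generations // tuning_interval
    print(f"{n_rounds} tuning rounds during burnin")
    # Example: only 5 of 100 proposals accepted, so the window shrinks
    print(tune_delta(1.0, n_accepted=5, n_tried=100))
```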
<br />
== Run RevBayes ==<br />
<br />
Download the singularity image for RevBayes as follows:<br />
<br />
cd ~/rblab # just to make sure you are in the right place<br />
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
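<br />
If you are curious what a 95% HPD interval is computationally, the Python sketch below finds the shortest interval that contains 95% of a set of sampled values. This is a generic way to estimate an HPD interval from MCMC samples; it is offered as an illustration and is not taken from Tracer's source code. The function name <tt>hpd_interval</tt> is hypothetical.<br />
<br />
```python
import math

def hpd_interval(samples, mass=0.95):
    """Return (low, high) for the shortest interval containing
    the requested fraction of the samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    # Number of samples the interval must contain (small guard against rounding)
    k = max(1, math.ceil(mass * n - 1e-9))
    best_lo, best_hi = xs[0], xs[k - 1]
    # Slide a window of k consecutive sorted samples and keep the narrowest one
    for i in range(1, n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi

if __name__ == "__main__":
    # Toy "posterior sample": mostly small values with a few extreme ones
    toy = [0.0] * 50 + [1.0] * 45 + [100.0] * 5
    print(hpd_interval(toy))
```
<br />
Unlike an equal-tailed interval, the HPD interval is free to sit wherever the samples are densest, which matters for skewed posteriors such as the one for birth_rate.<br />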
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
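<br />
To see why a larger alpha means smaller proposed changes, suppose (as an assumption about how the simplex move works) that the proposed value is drawn from a Dirichlet distribution with parameters equal to alpha times the current value. The standard deviation of any one component is then sqrt(m(1-m)/(alpha+1)), where m is that component's current value. The Python sketch below (not Rev) evaluates this for several values of alpha; the function name is hypothetical.<br />
<br />
```python
import math

def proposal_sd(current_value, alpha):
    """Standard deviation of one component of a Dirichlet(alpha * current)
    proposal, where current_value is that component's current value."""
    return math.sqrt(current_value * (1.0 - current_value) / (alpha + 1.0))

if __name__ == "__main__":
    # A component currently at 0.25, e.g. one of four state frequencies
    for alpha in (10.0, 100.0, 1000.0, 10000.0):
        print(f"alpha={alpha:>8.0f}  sd={proposal_sd(0.25, alpha):.4f}")
```
<br />
Each hundredfold increase in alpha shrinks the typical step by roughly a factor of 10, which is the kind of fine step needed when the posterior is extremely narrow.<br />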
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 50 parameters (the original 10 plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
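<br />
The conversion between the two scales uses the standard lognormal identities: the mean is exp(mu + sigma^2/2) and the standard deviation is the mean times sqrt(exp(sigma^2) - 1). The Python sketch below (not Rev) applies them in both directions; the function names are hypothetical and the example values are arbitrary.<br />
<br />
```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of a lognormal variable, given the
    mean (mu) and standard deviation (sigma) of its logarithm."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

def lognormal_mu_sigma(mean, sd):
    """Inverse conversion: recover mu and sigma from the mean and
    standard deviation on the linear scale."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)

if __name__ == "__main__":
    mean, sd = lognormal_mean_sd(mu=0.0, sigma=0.5)   # arbitrary example values
    print(f"linear scale: mean={mean:.4f}, sd={sd:.4f}")
    print("back on the log scale:", lognormal_mu_sigma(mean, sd))
```
<br />
Note that mu = 0 does not give a mean of exactly 1 unless sigma is 0; the mean is pulled upward by the long right tail.<br />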
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
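<br />
As a rough check on what these weights mean, the Python sketch below (not Rev) tallies the moves that ''divtime.Rev'' should contain at this point and converts each weight into a selection probability, assuming a move is chosen with probability equal to its weight divided by the sum of all weights. The labels in the dictionary are just descriptive strings, and the tally assumes you have followed the script exactly as written in this lab.<br />
<br />
```python
def divtime_weights(n_branches=38):
    """Weights of the moves defined so far in divtime.Rev
    (labels are descriptive strings, not Rev identifiers)."""
    weights = {
        "slide birth_rate": 1.0,
        "simplex exchangeabilities": 1.0,
        "simplex state_freqs": 1.0,
        "slide ucln_mu": 1.0,
        "slide ucln_sigma": 1.0,
        "node time slide": 10.0,
        "NNI": 5.0,
        "Narrow": 5.0,
        "FNPR": 5.0,
    }
    for i in range(1, n_branches + 1):
        weights[f"slide branch_rates[{i}]"] = 1.0   # one move per branch rate
    return weights

def move_probabilities(weights):
    """Probability of choosing each move: weight / sum of all weights."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

if __name__ == "__main__":
    probs = move_probabilities(divtime_weights())
    for name in ("node time slide", "NNI", "slide birth_rate"):
        print(f"{name:>20s}: {probs[name]:.4f}")
```
<br />
Because there are 38 separate branch rate moves, the rate moves collectively are still attempted more often than the tree moves, even though each individual tree move has a larger weight.<br />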
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size if we had performed the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41649Phylogenetics: RevBayes Lab2020-04-23T03:15:58Z<p>Paul Lewis: /* Run RevBayes */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and singularity modules:<br />
module load paml/4.9<br />
module load singularity/3.5.2<br />
<br />
The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
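<br />
The confounding of rate and time is easy to demonstrate numerically. The Python sketch below uses the Jukes-Cantor (JC69) model, chosen here only because its formula is short (the lab itself uses HKY and GTR), to show that the probability of observing the same base at both ends of a branch depends only on the product of rate and time. The function name is hypothetical.<br />
<br />
```python
import math

def jc69_prob_same(rate, time):
    """Probability that a site has the same base at both ends of a branch
    under JC69, for a given substitution rate and elapsed time."""
    d = rate * time                       # expected substitutions per site
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

if __name__ == "__main__":
    # Three rate/time combinations with the same product (1.0)
    for rate, time in ((1.0, 1.0), (2.0, 0.5), (0.5, 2.0)):
        print(f"rate={rate}, time={time}: P(same)={jc69_prob_same(rate, time):.6f}")
```
<br />
All three combinations give the same probability, so the likelihood cannot tell them apart; only the prior can favor one combination over another.<br />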
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
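<br />
One way to see where a value near 2.6 could come from is sketched below in Python. It assumes the pure birth process starts with 2 lineages at the root and is stopped at the instant the next lineage would arise after reaching 20 tips; under that assumption the expected waiting time while there are k lineages is 1/(k × birth rate), and the expected height is the sum of those waiting times for k = 2 to 20. This is an illustration under stated assumptions, not a description of how evolver works internally.<br />
<br />
```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root-to-tip height of a pure birth tree, assuming the
    process starts with 2 lineages and each interval with k lineages
    lasts 1 / (k * birth_rate) on average, for k = 2 .. n_tips."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

if __name__ == "__main__":
    print(f"{expected_yule_height(20, 2.6):.4f}")   # close to 1 under these assumptions
```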
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
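<br />
Because HKY is a special case of GTR, the kappa value on line 7 implies a particular set of GTR exchangeabilities, which is handy to know when the simulated data are later analyzed under GTR. The Python sketch below normalizes them to sum to 1 (as a Dirichlet-distributed parameter must). The ordering AC, AG, AT, CG, CT, GT is an assumed convention, and the function name is hypothetical.<br />
<br />
```python
def hky_as_gtr_exchangeabilities(kappa):
    """GTR exchangeabilities implied by HKY with the given kappa,
    normalized to sum to 1. Assumed order: AC, AG, AT, CG, CT, GT,
    where AG and CT are the two transitions."""
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]

if __name__ == "__main__":
    for label, value in zip(("AC", "AG", "AT", "CG", "CT", "GT"),
                            hky_as_gtr_exchangeabilities(5.0)):
        print(f"{label}: {value:.4f}")
```
<br />
With kappa equal to 5, the two transitions each get 5/14 and the four transversions each get 1/14 of the total.<br />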
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.<br />
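<br />
To get a feel for how vague this prior is, recall that an Exponential distribution with rate r has mean 1/r and variance 1/r^2. Here is a minimal sketch of that arithmetic using awk at the bash prompt; the function name <tt>exp_prior_moments</tt> is my own invention and is not part of RevBayes or PAML:<br />

```shell
# Print the mean and variance of an Exponential(rate) distribution.
# exp_prior_moments is a hypothetical helper name, not a RevBayes command.
exp_prior_moments() {
    awk -v r="$1" 'BEGIN { printf "%g %g\n", 1 / r, 1 / (r * r) }'
}

exp_prior_moments 0.01   # prints: 100 10000
```

A prior mean of 100 with a variance of 10000 says very little about the birth rate, which is the point of calling this prior vague.<br />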
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
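<br />
If the difference between <tt>nmoves++</tt> and <tt>++nmoves</tt> is new to you, the following sketch may help. It is only an analogy written in awk (which happens to have the same two operators), not Rev code, and the function name <tt>increment_demo</tt> and the array name <tt>skipped</tt> are made up for illustration:<br />

```shell
# Analogy only: awk has the same post- and pre-increment operators as Rev.
# increment_demo is a hypothetical name; nothing here is RevBayes syntax.
increment_demo() {
    awk 'BEGIN {
        nmoves = 1
        moves[nmoves++] = "mvSlide(birth_rate)"   # stored at index 1, then nmoves becomes 2
        moves[nmoves++] = "mvSlide(clock_rate)"   # stored at index 2, then nmoves becomes 3

        n = 1
        skipped[++n] = "mvSlide(birth_rate)"      # n becomes 2 first, so index 1 stays empty

        printf "post-increment: first move at index 1? %s\n", ((1 in moves) ? "yes" : "no")
        printf "pre-increment:  first move at index 1? %s\n", ((1 in skipped) ? "yes" : "no")
        printf "next free slot after two post-increments: %d\n", nmoves
    }'
}

increment_demo
```

With post-increment the first move lands in slot 1 and the counter always points at the next free slot; with pre-increment slot 1 would be left empty.<br />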
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
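<br />
The weight arithmetic described above can be sketched in a couple of lines of shell. This simply follows the description in the previous paragraph (weight divided by the sum of all weights); <tt>move_prob</tt> is a hypothetical helper name of mine, not something RevBayes provides:<br />

```shell
# move_prob WEIGHT ALL_WEIGHTS...
# Share of proposals a move with the given weight receives, computed as
# its weight divided by the sum of the weights of all moves defined.
# move_prob is a hypothetical helper, not a RevBayes function.
move_prob() {
    w="$1"
    shift
    echo "$w $*" | awk '{ s = 0; for (i = 2; i <= NF; i++) s += $i; printf "%.4f\n", $1 / s }'
}

# Two moves so far (birth_rate and clock_rate), each with weight 1.0
move_prob 1.0 1.0 1.0   # prints: 0.5000
```

As more moves are added to the script, the share for each weight-1.0 move shrinks accordingly.<br />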
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
Download the singularity image for RevBayes as follows:<br />
<br />
cd ~/rblab # just to make sure you are in the right place<br />
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg<br />
<br />
To run RevBayes, enter the following at the command prompt with the name of your revscript file last:<br />
 singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
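<br />
For reference, a slurm script for a longer run might look something like the sketch below. The partition and qos are the ones from the srun command used in this lab, the singularity module version is an assumption, and the file name ''strict.slurm'', the job name, and the output file name are my own choices; check with the cluster documentation before relying on any of it:<br />

```shell
#!/bin/bash
#SBATCH --job-name=strict
#SBATCH --partition=mcbstudent
#SBATCH --qos=mcbstudent
#SBATCH --ntasks=1
#SBATCH --output=strict-%j.out

# Assumed module version; adjust to whatever "module avail" reports.
module load singularity/3.5.2

cd ~/rblab
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev
```

You would save this as ''strict.slurm'' and submit it with <tt>sbatch strict.slurm</tt>.<br />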
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
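<br />
If you happen to have Graphviz installed on your own machine, you can skip the online viewers and make a PDF directly. This is just a sketch that assumes the <tt>dot</tt> program is on your path (it may well not be on the cluster); <tt>render_dag</tt> is a helper name I made up:<br />

```shell
# Render a RevBayes .dot file to PDF if Graphviz is installed locally.
# render_dag is a hypothetical helper name; "dot -Tpdf in -o out" is the
# usual Graphviz command-line usage.
render_dag() {
    dotfile="$1"
    pdffile="${dotfile%.dot}.pdf"
    if ! command -v dot >/dev/null 2>&1; then
        echo "dot not found; paste $dotfile into an online viewer instead"
    elif dot -Tpdf "$dotfile" -o "$pdffile"; then
        echo "wrote $pdffile"
    else
        echo "dot could not render $dotfile"
    fi
}

# Usage (run in the directory containing strict.dot):
# render_dag strict.dot
```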
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
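<br />
Before downloading, it can be worth checking that the log file contains about as many samples as you expect. With 10000 generations and <tt>printgen = 10</tt> you should see something on the order of 10000/10 = 1000 sampled rows (possibly one more if the starting state is also logged). The sketch below assumes the log has a single header line followed by one row per sample; <tt>count_samples</tt> is a hypothetical helper:<br />

```shell
# Count sampled rows in a tab-delimited log file, skipping blank lines,
# comment lines starting with '#', and the single header line.
# count_samples is a hypothetical helper, not part of RevBayes.
count_samples() {
    awk 'NF > 0 && $0 !~ /^#/ { n++ } END { print (n > 0 ? n - 1 : 0) }' "$1"
}

# Usage:
# count_samples output/strict.log
```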
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
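<br />
If you want to check the exchangeability estimates against the simulation settings, remember that the data were simulated under HKY with kappa = 5, which is a special case of GTR in which the 2 transitions have relative rate kappa and the 4 transversions have relative rate 1. The sketch below normalizes those 6 rates so they sum to 1. It assumes the exchangeabilities are ordered AC, AG, AT, CG, CT, GT (so AG and CT are the transitions), and <tt>hky_exchangeabilities</tt> is a name I made up:<br />

```shell
# Normalized GTR exchangeabilities implied by an HKY model with the given kappa.
# Assumed order: AC AG AT CG CT GT (AG and CT are the transitions).
# hky_exchangeabilities is a hypothetical helper name.
hky_exchangeabilities() {
    awk -v k="$1" 'BEGIN {
        total = 4 + 2 * k          # four transversions plus two transitions
        tv = 1 / total
        ts = k / total
        printf "%.4f %.4f %.4f %.4f %.4f %.4f\n", tv, ts, tv, tv, ts, tv
    }'
}

hky_exchangeabilities 5   # prints: 0.0714 0.3571 0.0714 0.0714 0.3571 0.0714
```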
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so the single clock_rate has been replaced by 2*20-2=38 edge rate parameters. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (1 birth_rate, 3 state_freqs, 5 exchangeabilities, 38 edge rates, and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
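<br />
The conversion from the log scale to the linear scale is easy to do yourself. If the log of a variable is Normal(mu, sigma), then the variable itself has mean exp(mu + sigma^2/2) and standard deviation equal to that mean times sqrt(exp(sigma^2) - 1). Here is a minimal sketch using awk; the function name <tt>lognormal_moments</tt> is mine, and the example values (mu = 0, sigma = 1) are arbitrary:<br />

```shell
# Convert lognormal parameters (mean and sd of the log) to the mean and
# standard deviation of the variable itself on the linear scale.
# lognormal_moments is a hypothetical helper name.
lognormal_moments() {
    awk -v mu="$1" -v sigma="$2" 'BEGIN {
        m = exp(mu + sigma * sigma / 2)
        s = m * sqrt(exp(sigma * sigma) - 1)
        printf "%.4f %.4f\n", m, s
    }'
}

lognormal_moments 0.0 1.0   # prints: 1.6487 2.1612
```

Note that even with mu = 0 the mean on the linear scale is not exp(0) = 1; the variance on the log scale pushes the mean upward.<br />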
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
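<br />
To see what these weights mean in practice, you can tally up all of the moves in ''divtime.Rev''. The sketch below treats each move's share of proposals as its weight divided by the total weight, as described earlier in this lab. The table reflects my reading of the script as built up in this lab (the labels in the first column and the function names are just mine), so adjust it if your script differs:<br />

```shell
# Each line: label, weight of one such move, number of such moves in divtime.Rev.
# The labels and both function names are hypothetical, for illustration only.
move_table() {
    cat <<'EOF'
mvSlide_birth_rate 1 1
mvNodeTimeSlideUniform 10 1
mvNNI 5 1
mvNarrow 5 1
mvFNPR 5 1
mvSlide_ucln_hyperparameter 1 2
mvSlide_branch_rate 1 38
mvDirichletSimplex 1 2
EOF
}

# Print the share of proposals that a single move of each kind receives.
move_shares() {
    move_table | awk '{ name[NR] = $1; w[NR] = $2; total += $2 * $3 }
        END { for (i = 1; i <= NR; i++) printf "%s %.4f\n", name[i], w[i] / total }'
}

move_shares
```

The total weight is 68, so the node time slider gets 10/68 of the proposals (about 15%), each topology move gets 5/68 (about 7%), and each individual branch rate gets only 1/68 (about 1.5%).<br />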
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained from an MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41648Phylogenetics: RevBayes Lab2020-04-23T03:11:14Z<p>Paul Lewis: /* Login to Xanadu */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml and singularity modules:<br />
module load paml/4.9<br />
module load singularity/3.5.2<br />
<br />
The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
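<br />
In case you are wondering where 2.6 comes from: in a pure birth process with k lineages, the waiting time to the next speciation event is exponential with rate k times the birth rate, so its expected length is 1/(k*lambda). Summing these expected waiting times from k = 2 (the two lineages descending from the root) up to k = 20 gives the expected tree height. The sketch below does that sum, assuming the tree is stopped just before a 21st lineage would arise; <tt>yule_expected_height</tt> is a name I made up:<br />

```shell
# Expected height of a pure birth (Yule) tree with n tips and birth rate
# lambda, measured from the root: sum over k = 2..n of 1 / (k * lambda).
# Assumes the process stops just before lineage n+1 would arise.
# yule_expected_height is a hypothetical helper name.
yule_expected_height() {
    awk -v n="$1" -v lambda="$2" 'BEGIN {
        h = 0
        for (k = 2; k <= n; k++) h += 1 / (k * lambda)
        printf "%.4f\n", h
    }'
}

yule_expected_height 20 2.6   # prints: 0.9991
```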
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated in the first evolver run (now saved in ''tree.txt'') here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
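<br />
If you would rather not paste the tree by hand, the nine lines described above can be assembled by a small Python sketch like the one below. It assumes that ''tree.txt'' holds nothing but the tree description; the function name and the seed 13579 are placeholders invented for this example.<br />
<br />
```python
from pathlib import Path

def make_evolver_control(seed, newick_tree, n_taxa=20, n_sites=10000):
    """Return the text of an evolver control file laid out as described above."""
    lines = [
        "2",                      # line 1: write output in nexus format
        str(seed),                # line 2: random number seed
        f"{n_taxa} {n_sites} 1",  # line 3: taxa, sites, number of data sets
        "-1",                     # line 4: use branch lengths from the tree
        newick_tree.strip(),      # line 5: tree description
        "4",                      # line 6: HKY model
        "5",                      # line 7: kappa
        "0 0",                    # line 8: gamma shape, rate categories
        "0.1 0.2 0.3 0.4",        # line 9: frequencies of T, C, A, G
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    tree_file = Path("tree.txt")
    if tree_file.exists():
        # 13579 is an arbitrary placeholder seed; substitute your own.
        text = make_evolver_control(13579, tree_file.read_text())
        Path("control.dat").write_text(text)
    else:
        print("tree.txt not found; run evolver to simulate the tree first")
```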
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
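<br />
The two ideas in the paragraph above, the sliding window and the weight-based choice of moves, can be sketched in a few lines of Python. This is an illustration of the description given here, not RevBayes code: the function names are invented, and the sketch ignores details such as tuning and what happens when a proposal for a positive parameter falls below zero.<br />
<br />
```python
import random

def slide_proposal(current, delta, rng):
    """Propose a value uniformly from a window of width delta centered on current."""
    return current + delta * (rng.random() - 0.5)

def move_probabilities(weights):
    """Map each move name to weight / (sum of all weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

if __name__ == "__main__":
    rng = random.Random(1)
    # Weights of the four moves that strict.Rev ends up with
    weights = {"birth_rate": 1.0, "clock_rate": 1.0,
               "exchangeabilities": 1.0, "state_freqs": 1.0}
    print(move_probabilities(weights))
    # One proposal for clock_rate starting from 1.0 with delta = 1.0
    print(slide_proposal(1.0, 1.0, rng))
```
<br />
With four moves of weight 1.0 each, every move has the same chance of being chosen at each step; raising one weight raises that move's share at the expense of the others.<br />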
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
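<br />
To get a feel for what a simplex proposal does, here is a Python sketch. It assumes (and this is my reading, not something taken from the RevBayes source) that the new simplex is drawn from a Dirichlet distribution whose parameters are <tt>alpha</tt> times the current values; all function names are invented for illustration.<br />
<br />
```python
import random

def dirichlet_draw(params, rng):
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def propose_simplex(current, alpha, rng):
    """Assumed proposal: Dirichlet centered on the current simplex, sharpened by alpha."""
    return dirichlet_draw([alpha * x for x in current], rng)

def mean_step(current, alpha, rng, reps=2000):
    """Average absolute change per coordinate over many proposals."""
    total = 0.0
    for _ in range(reps):
        proposed = propose_simplex(current, alpha, rng)
        total += sum(abs(p - c) for p, c in zip(proposed, current)) / len(current)
    return total / reps

if __name__ == "__main__":
    rng = random.Random(42)
    freqs = [0.1, 0.2, 0.3, 0.4]
    new_freqs = propose_simplex(freqs, 10.0, rng)
    print(new_freqs, sum(new_freqs))   # the proposal still sums to 1
    for alpha in (10.0, 100.0, 1000.0):
        print(alpha, mean_step(freqs, alpha, rng))
```
<br />
Under this assumption, every coordinate changes at once, the sum stays at 1, and larger values of <tt>alpha</tt> produce smaller steps.<br />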
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
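<br />
The exchangeabilities you should expect can be worked out from the simulation settings. The Python sketch below assumes the usual view of HKY as a special case of GTR in which the two transitions (A↔G and C↔T) have relative rate kappa and the four transversions have relative rate 1, normalized to sum to 1 as a Dirichlet-distributed vector must. The function name and the ordering of the pairs are my own choices for illustration.<br />
<br />
```python
def hky_as_gtr_exchangeabilities(kappa):
    """Exchangeabilities (summing to 1) of the GTR model equivalent to HKY.

    Order assumed here: AC, AG, AT, CG, CT, GT. AG and CT are transitions.
    """
    raw = {"AC": 1.0, "AG": kappa, "AT": 1.0, "CG": 1.0, "CT": kappa, "GT": 1.0}
    total = sum(raw.values())
    return {pair: rate / total for pair, rate in raw.items()}

if __name__ == "__main__":
    # kappa = 5 is the value used in control.dat
    for pair, value in hky_as_gtr_exchangeabilities(5.0).items():
        print(pair, round(value, 4))
```
<br />
Compare the printed values with the means Tracer reports for the six exchangeabilities.<br />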
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The problem is that there is so much information in the data about these parameters. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. The name distinguishes this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
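<br />
Moving between the two scales is a matter of two standard formulas, sketched here in Python. The function names are invented for this example; the formulas are the textbook relationships between a lognormal variable and the normal distribution of its log.<br />
<br />
```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of X when log(X) ~ Normal(mu, sigma)."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

def lognormal_mu_sigma(mean, sd):
    """Inverse conversion: the mu and sigma that give the requested mean and sd."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)

if __name__ == "__main__":
    mean, sd = lognormal_mean_sd(0.0, 0.5)
    print(mean, sd)
    # Converting back recovers mu = 0.0 and sigma = 0.5 up to rounding error
    print(lognormal_mu_sigma(mean, sd))
```
<br />
Note that mu = 0 does not give a mean of 1 on the linear scale: the mean is exp(sigma^2/2), which exceeds 1 whenever sigma is greater than 0.<br />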
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence because they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained had we performed the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate to their true values but keep the priors on ucln_mu, ucln_sigma, and the branch_rates as they were. This means that we will only be looking at the prior on rates, not node times.<br />
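<br />
The mean and variance quoted for the birth_rate prior follow directly from the rate 0.01 given to <tt>dnExponential</tt>. Here is a small Python sketch (function names invented for illustration) that also shows how little prior probability sits near the true value of 2.6:<br />
<br />
```python
import math
import random

def exponential_mean_variance(rate):
    """Mean and variance of an Exponential distribution with the given rate."""
    return 1.0 / rate, 1.0 / rate ** 2

def exponential_prob_below(x, rate):
    """Probability that an Exponential(rate) variable is less than x."""
    return 1.0 - math.exp(-rate * x)

if __name__ == "__main__":
    print(exponential_mean_variance(0.01))
    # Prior probability that birth_rate is below 5
    print(exponential_prob_below(5.0, 0.01))
    # A few draws from the prior, for a sense of scale
    rng = random.Random(7)
    print([round(rng.expovariate(0.01), 1) for _ in range(10)])
```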
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
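<br />
If you are curious how a 95% HPD interval can be obtained from a list of sampled values, the following Python sketch shows one common approach: sort the samples and find the shortest window containing 95% of them. This is an illustration of the idea only; Tracer and RevBayes may differ in details such as rounding, and the function name is my own.<br />
<br />
```python
import random

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing the requested fraction of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    n_inside = max(1, int(round(mass * n)))
    best_lo, best_hi = ordered[0], ordered[n_inside - 1]
    # Slide a window of n_inside consecutive samples and keep the narrowest
    for start in range(1, n - n_inside + 1):
        lo, hi = ordered[start], ordered[start + n_inside - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi

if __name__ == "__main__":
    rng = random.Random(2020)
    # Stand-ins for a diffuse (prior-like) and a sharp (posterior-like) sample
    diffuse = [rng.gauss(1.0, 0.5) for _ in range(5000)]
    sharp = [rng.gauss(1.0, 0.05) for _ in range(5000)]
    for label, values in (("diffuse", diffuse), ("sharp", sharp)):
        lo, hi = hpd_interval(values)
        print(label, lo, hi, "width", hi - lo)
```
<br />
The narrower interval for the sharp sample is the same pattern you should see when comparing node bars with and without data.<br />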
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41647Phylogenetics: RevBayes Lab2020-04-21T23:54:00Z<p>Paul Lewis: /* Obtaining credible intervals under the prior */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and RevBayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
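<br />
To see why rates and times are confounded, here is a minimal Python sketch (not part of the lab's RevBayes workflow; the function name and the three scenarios are made up for illustration). Each scenario pairs a different rate with a different time, yet all three give the same expected number of substitutions per site, which is all the sequences can tell us about:<br />
<br />
```python
# Illustrative only: the expected number of substitutions per site on an
# edge is the product of the substitution rate and the elapsed time.
def expected_substitutions(rate, time):
    return rate * time

# Hypothetical (rate, time) pairs chosen to have the same product
scenarios = [(1.0, 1.0), (2.0, 0.5), (0.5, 2.0)]
products = [expected_substitutions(rate, time) for rate, time in scenarios]

for (rate, time), product in zip(scenarios, products):
    print(f"rate={rate}  time={time}  expected substitutions per site={product}")
```
<br />
The likelihood depends only on the last column, so it cannot favor one of these scenarios over the others; only the priors can.<br />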
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.<br />
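<br />
If you are curious where a value like 2.6 might come from, here is a rough Python sketch. It assumes (this is an assumption of the sketch, not something taken from the evolver documentation) that the height of a pure birth tree is the sum of the waiting times while there are 2, 3, ..., n lineages, each exponentially distributed with rate equal to the number of lineages times the birth rate. The function name is hypothetical:<br />
<br />
```python
def expected_yule_height(n_tips, birth_rate):
    """Sum of expected waiting times 1/(k * birth_rate) for k = 2, ..., n_tips lineages."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(20, 2.6)
print(f"expected height with 20 tips and birth rate 2.6: {height:.4f}")
```
<br />
Under that assumption, 20 tips and a birth rate of 2.6 give an expected height very close to 1.<br />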
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
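<br />
If you would rather not assemble ''control.dat'' by hand, the nine lines described above can be built with a few lines of Python. This is only a sketch: the function name, the seed 12345, and the placeholder tree are hypothetical (the placeholder has just three taxa, so it is not a valid input for a 20-taxon simulation), and you would substitute your own seed and the tree description from ''tree.txt'':<br />
<br />
```python
def make_control_lines(seed, tree_description):
    """Return the nine lines of an evolver control file as described in the list above."""
    return [
        "2",                # line 1: nexus output
        str(seed),          # line 2: random number seed
        "20 10000 1",       # line 3: taxa, sites, data sets
        "-1",               # line 4: use branch lengths from the tree description
        tree_description,   # line 5: tree description
        "4",                # line 6: HKY model
        "5",                # line 7: kappa
        "0 0",              # line 8: no rate heterogeneity
        "0.1 0.2 0.3 0.4",  # line 9: frequencies of T, C, A, G
    ]

# Placeholder values for illustration only
lines = make_control_lines(12345, "((A:0.5,B:0.5):0.5,C:1.0);")
control_text = "\n".join(lines) + "\n"
print(control_text)
```
<br />
Saving <tt>control_text</tt> to a file named ''control.dat'' is left to you.<br />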
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.<br />
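<br />
Here is a quick check on how vague an Exponential prior with rate 0.01 really is. The Python sketch below uses the standard results that an Exponential distribution with rate r has mean 1/r and variance 1/r^2, and that its quantile function is -ln(1 - p)/r; the variable names are just for illustration:<br />
<br />
```python
import math

rate = 0.01
prior_mean = 1.0 / rate
prior_variance = 1.0 / rate**2
# Value below which 95% of the prior probability lies
upper_95 = -math.log(1.0 - 0.95) / rate

print(f"mean = {prior_mean:.1f}, variance = {prior_variance:.1f}")
print(f"95% of the prior mass lies below {upper_95:.1f}")
```
<br />
The birth rate used in the simulation (2.6) sits well inside this range, so the prior does very little to constrain it.<br />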
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
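<br />
For intuition, here is a generic sliding window Metropolis update written in Python. It is a sketch of the general technique, not RevBayes's actual <tt>mvSlide</tt> code: it samples from the Exponential(0.01) prior alone (no likelihood), does no tuning, and all function names are hypothetical:<br />
<br />
```python
import math
import random

def log_exponential_density(x, rate=0.01):
    """Log density of an Exponential(rate) distribution; -inf outside its support."""
    if x < 0.0:
        return float("-inf")
    return math.log(rate) - rate * x

def slide_update(x, delta, rng):
    """Propose uniformly within a window of width delta centered on x, then accept or reject."""
    proposal = x + delta * (rng.random() - 0.5)
    log_ratio = log_exponential_density(proposal) - log_exponential_density(x)
    # The proposal is symmetric, so the acceptance probability is min(1, density ratio)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True
    return x, False

rng = random.Random(1)
x = 1.0                # starting value, as set by setValue in the script
accepted = 0
n_iterations = 1000
for _ in range(n_iterations):
    x, was_accepted = slide_update(x, delta=1.0, rng=rng)
    accepted += was_accepted

acceptance_rate = accepted / n_iterations
print(f"final value = {x:.3f}, acceptance rate = {acceptance_rate:.2f}")
```
<br />
Because a window of width 1.0 is tiny relative to this prior, nearly every proposal is accepted here; a tuning scheme aiming for 40% acceptance would respond by widening the window.<br />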
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
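<br />
To make the idea of a flat Dirichlet prior on a simplex concrete, here is a small Python sketch. It uses a standard construction of a Dirichlet draw (independent Gamma variates divided by their sum); the function name and seed are arbitrary, and this is not a description of how RevBayes implements <tt>dnDirichlet</tt> or <tt>mvDirichletSimplex</tt>:<br />
<br />
```python
import random

def draw_dirichlet(alphas, rng):
    """Draw one point from a Dirichlet(alphas) distribution."""
    gammas = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(42)
state_freqs = draw_dirichlet([1, 1, 1, 1], rng)
exchangeabilities = draw_dirichlet([1, 1, 1, 1, 1, 1], rng)

print("state_freqs:", [round(f, 3) for f in state_freqs], "sum =", round(sum(state_freqs), 6))
print("exchangeabilities:", [round(e, 3) for e in exchangeabilities], "sum =", round(sum(exchangeabilities), 6))
```
<br />
Each draw sums to 1, so a 4-element simplex has only 3 free parameters and a 6-element simplex only 5.<br />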
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
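<br />
The exchangeability question above can be checked with a little arithmetic. Under HKY with kappa = 5, the two transitions (A<->G and C<->T) have relative rate 5 and the four transversions have relative rate 1; because the exchangeabilities in our GTR model form a simplex, we divide by the total. A Python sketch (the variable names are hypothetical):<br />
<br />
```python
kappa = 5.0

# Relative rates implied by HKY: transitions get kappa, transversions get 1
relative_rates = {
    "AC": 1.0, "AG": kappa, "AT": 1.0,
    "CG": 1.0, "CT": kappa, "GT": 1.0,
}

total = sum(relative_rates.values())
expected = {pair: rate / total for pair, rate in relative_rates.items()}

for pair, value in expected.items():
    print(f"{pair}: {value:.4f}")
```
<br />
The transitions come out at 5/14 (about 0.357) and the transversions at 1/14 (about 0.071), which is what the marginal densities in Tracer should be centered near.<br />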
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us a long way from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
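<br />
To get a feel for how wild such starting values could be, here is a Python sketch that mimics the procedure just described: draw ucln_mu from Normal(0, 100), draw ucln_sigma from Exponential(0.01), then draw a log branch rate from Normal(ucln_mu, ucln_sigma). It assumes that the second argument of <tt>dnNormal</tt> is a standard deviation, and it works on the log10 scale to avoid numerical overflow; the function name and seed are arbitrary:<br />
<br />
```python
import math
import random

def draw_log10_starting_rate(rng):
    """Draw hyperparameters from their hyperpriors, then one branch rate (returned as log10)."""
    ucln_mu = rng.gauss(0.0, 100.0)           # assumed: mean 0, standard deviation 100
    ucln_sigma = rng.expovariate(0.01)        # Exponential with rate 0.01
    log_rate = rng.gauss(ucln_mu, ucln_sigma) # log of a Lognormal(ucln_mu, ucln_sigma) draw
    return log_rate / math.log(10.0)

rng = random.Random(2020)
log10_rates = [draw_log10_starting_rate(rng) for _ in range(5)]
for value in log10_rates:
    print(f"starting branch rate is roughly 10^{value:.1f}")
```
<br />
Exponents in the tens or even hundreds (positive or negative) tend to show up, which is a very long way from the true rate of 10^0 = 1.<br />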
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
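<br />
The conversion from the log scale to the linear scale uses the standard formulas for a lognormal distribution: the mean is exp(mu + sigma^2/2) and the standard deviation is the mean times sqrt(exp(sigma^2) - 1). Here is a Python sketch using the example values quoted in the answers above (your own estimates will differ somewhat; the function name is hypothetical):<br />
<br />
```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of a variable whose log is Normal(mu, sigma)."""
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

mean, sd = lognormal_mean_sd(0.0116, 0.01828)
print(f"mean = {mean:.4f}, standard deviation = {sd:.5f}")
```
<br />
When sigma is this small, the linear-scale mean is approximately exp(mu) and the linear-scale standard deviation is approximately sigma times the mean.<br />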
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
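<br />
To see what these weights mean in practice, the sketch below tallies the weights of the moves defined in ''divtime.Rev'' (assuming you have followed the scripts above exactly) and converts each to a selection probability by dividing by the total weight. The dictionary keys are just labels, not RevBayes identifiers:<br />
<br />
```python
n_branch_rates = 38  # one slide move of weight 1.0 per branch rate

move_weights = {
    "birth_rate slide": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
}

total_weight = sum(move_weights.values()) + n_branch_rates * 1.0
selection_prob = {name: weight / total_weight for name, weight in move_weights.items()}
all_branch_rates_prob = n_branch_rates * 1.0 / total_weight

for name, prob in selection_prob.items():
    print(f"{name}: {prob:.3f}")
print(f"all 38 branch rate slides combined: {all_branch_rates_prob:.3f}")
```
<br />
Even with the extra weight on tree moves, the 38 branch rate moves together still account for more than half of all attempted moves.<br />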
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained from an MCMC analysis of the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate at their true values but keep the ucln_mu, ucln_sigma, and branch_rates priors as they were. This means that we will only be looking at the prior on rates, not node times.<br />
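<br />
The mean and variance quoted above follow directly from the rate of the Exponential prior. Here is a small Python sketch (an illustration, not part of the lab scripts) that computes them for rate 0.01, along with the central 95% interval of the prior, which gives a feel for how vague this prior is.<br />
<br />
```python
import math

def exponential_moments(rate):
    """Mean and variance of an Exponential distribution with the given rate."""
    return 1.0 / rate, 1.0 / rate ** 2

def exponential_quantile(rate, p):
    """Value below which a fraction p of the Exponential(rate) distribution lies."""
    return -math.log(1.0 - p) / rate

mean, variance = exponential_moments(0.01)
lower = exponential_quantile(0.01, 0.025)
upper = exponential_quantile(0.01, 0.975)
print("mean:", mean, "variance:", variance)
print("central 95% prior interval:", lower, "to", upper)
```
<br />
The true birth rate (2.6) sits just inside the lower end of that interval, which is why trees drawn under this prior tend to look so unlike the simulated tree.<br />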
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
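<br />
If you are curious how a 95% HPD interval can be computed from MCMC samples, here is a minimal Python sketch. It finds the shortest interval containing 95% of the sampled values. The two sets of "node age" samples are made-up normal draws used only to show a wide (prior-like) and a narrow (posterior-like) interval; they are not output from RevBayes, and the programs you are using may differ in the details of their calculation.<br />
<br />
```python
import math
import random

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = min(n, max(1, math.ceil(mass * n)))  # number of samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest one
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

rng = random.Random(2020)
prior_like = [rng.gauss(0.5, 0.20) for _ in range(5000)]      # synthetic, wide
posterior_like = [rng.gauss(0.5, 0.02) for _ in range(5000)]  # synthetic, narrow

prior_lo, prior_hi = hpd_interval(prior_like)
post_lo, post_hi = hpd_interval(posterior_like)
print("prior-like width:    ", prior_hi - prior_lo)
print("posterior-like width:", post_hi - post_lo)
```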
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41646Phylogenetics: RevBayes Lab2020-04-21T23:46:53Z<p>Paul Lewis: /* Obtaining credible intervals under the prior */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and RevBayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
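<br />
Here is a tiny Python illustration of why rate and time are confounded. It uses the JC69 model (chosen only because its formula is short; it is not the model used later in this lab) to compute the probability that a site has the same base at both ends of an edge. Only the product of rate and time enters the formula, so very different rate and time scenarios give identical probabilities.<br />
<br />
```python
import math

def jc69_prob_same(rate, time):
    """JC69 probability that a site has the same base at both ends of an edge."""
    return 0.25 + 0.75 * math.exp(-4.0 * rate * time / 3.0)

baseline = jc69_prob_same(rate=1.0, time=1.0)
fast_short = jc69_prob_same(rate=2.0, time=0.5)  # twice as fast, half as long
slow_long = jc69_prob_same(rate=0.5, time=2.0)   # half as fast, twice as long
print(baseline, fast_short, slow_long)
```
<br />
All three scenarios are indistinguishable to the likelihood, which is why the prior has to do the work of separating rate from time.<br />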
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
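<br />
One way to arrive at a value near 2.6 is sketched below in Python. The sketch assumes the tree starts with 2 lineages at the root and stops just before a 21st lineage would arise; while there are k lineages, the expected wait until the next speciation event is 1/(k × birth rate). This is a reconstruction under those assumptions, not necessarily the calculation used to choose 2.6.<br />
<br />
```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root-to-tip height of a pure birth tree under the assumptions above."""
    # With k lineages the waiting time to the next split has mean 1 / (k * birth_rate)
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

def birth_rate_for_height(n_tips, height):
    """Birth rate that makes the expected height equal to `height`."""
    return sum(1.0 / k for k in range(2, n_tips + 1)) / height

rate = birth_rate_for_height(20, 1.0)
print("birth rate giving expected height 1:", rate)
print("expected height when birth rate is 2.6:", expected_yule_height(20, 2.6))
```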
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
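<br />
If you would rather generate the control file than edit it by hand, the following Python sketch assembles the same 9 lines described above. The function name, the example seed, and the placeholder tree string are hypothetical; substitute your own seed and the tree description from ''tree.txt''.<br />
<br />
```python
def build_control(seed, newick, n_taxa=20, n_sites=10000, n_datasets=1,
                  kappa=5.0, freqs_tcag=(0.1, 0.2, 0.3, 0.4)):
    """Return the text of an evolver control file laid out as in the 9 lines above."""
    lines = [
        "2",                                      # line 1: nexus output
        str(seed),                                # line 2: random number seed
        f"{n_taxa} {n_sites} {n_datasets}",       # line 3: taxa, sites, data sets
        "-1",                                     # line 4: use branch lengths in tree
        newick,                                   # line 5: tree description
        "4",                                      # line 6: HKY model
        f"{kappa:g}",                             # line 7: kappa
        "0 0",                                    # line 8: no rate heterogeneity
        " ".join(f"{f:g}" for f in freqs_tcag),   # line 9: frequencies in T C A G order
    ]
    return "\n".join(lines) + "\n"

# Placeholder values for illustration only
text = build_control(seed=12345, newick="tree description goes here")
print(text)
```
<br />
To save the result, write <tt>text</tt> to a file named ''control.dat''.<br />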
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to the stochastic node <tt>birth_rate</tt>.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
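<br />
As a reminder of how a sliding window proposal works, here is a minimal Python sketch: the proposed value is drawn uniformly from a window of width <tt>delta</tt> centered on the current value. This is a generic illustration, assuming a plain uniform window; it does not show how RevBayes handles tuning or boundaries internally.<br />
<br />
```python
import random

def slide_proposal(current, delta, rng):
    """Propose a new value uniformly from a window of width delta centered on current."""
    return current + delta * (rng.random() - 0.5)

rng = random.Random(1)
current = 1.0   # starting value used for clock_rate
delta = 1.0     # window width
proposals = [slide_proposal(current, delta, rng) for _ in range(1000)]
print("smallest proposal:", min(proposals))
print("largest proposal: ", max(proposals))
```
<br />
A larger <tt>delta</tt> makes bolder proposals that are accepted less often, and a smaller <tt>delta</tt> makes timid proposals that are accepted more often; tuning adjusts <tt>delta</tt> to steer the acceptance rate toward the target.<br />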
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions. The <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution (all 1s means a flat prior).<br />
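<br />
Because the data were simulated under HKY with kappa equal to 5, and HKY is a special case of GTR, we can work out what the six exchangeabilities should be when they are normalized to sum to 1 (as they are under a Dirichlet prior). The Python sketch below does this; the ordering of the six rates (AC, AG, AT, CG, CT, GT) is an assumption made for illustration.<br />
<br />
```python
def hky_as_gtr_exchangeabilities(kappa):
    """Exchangeabilities implied by HKY, normalized to sum to 1."""
    # Assumed order: AC, AG, AT, CG, CT, GT (AG and CT are the transitions)
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]

rates = hky_as_gtr_exchangeabilities(5.0)
for name, r in zip(["AC", "AG", "AT", "CG", "CT", "GT"], rates):
    print(name, round(r, 4))
```
<br />
The two transition rates each equal 5/14 and the four transversion rates each equal 1/14, so a transition rate is 5 times a transversion rate.<br />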
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
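<br />
To make the idea of a simplex concrete, here is a short Python sketch that draws one set of state frequencies from a flat Dirichlet distribution by normalizing independent gamma draws (a standard way to sample a Dirichlet). It is only an illustration of the prior, not of how RevBayes implements it.<br />
<br />
```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one point on the simplex from a Dirichlet with the given parameters."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)
freqs = sample_dirichlet([1, 1, 1, 1], rng)  # flat prior on 4 state frequencies
print(freqs)
print("sum:", sum(freqs))
```
<br />
Because the four values must sum to 1, only 3 of them are free parameters; similarly, only 5 of the 6 exchangeabilities are free.<br />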
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also notice that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
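<br />
To see why a larger alpha means smaller proposed changes, here is a Python sketch that assumes the proposal is a draw from a Dirichlet distribution centered on the current values with concentration alpha (that is, with parameters alpha times the current simplex). This is an assumption about the move made for illustration, so check the RevBayes documentation for the exact definition. Under that assumption, the standard deviation of a proposed component with current value p is the square root of p(1-p)/(alpha+1).<br />
<br />
```python
import math

def proposal_sd(current_value, alpha):
    """SD of one component proposed from Dirichlet(alpha * current simplex)."""
    return math.sqrt(current_value * (1.0 - current_value) / (alpha + 1.0))

alphas = [10, 100, 1000, 10000]
sds = [proposal_sd(0.3, a) for a in alphas]  # 0.3 is the true frequency of A
for a, sd in zip(alphas, sds):
    print(f"alpha = {a:6d}   proposal SD = {sd:.5f}")
```
<br />
Each 100-fold increase in alpha shrinks the typical proposed change roughly 10-fold, which is what a very sharp posterior density calls for.<br />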
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma, plus 9 of the original 10, because clock_rate is no longer in the model).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
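<br />
The conversion from mu and sigma (log scale) to the mean and standard deviation of the lognormally-distributed variable itself (linear scale) uses standard formulas, sketched here in Python. The example values of mu and sigma are arbitrary; once you have estimates of ucln_mu and ucln_sigma you can plug them in instead.<br />
<br />
```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and SD on the linear scale, given mu and sigma on the log scale."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

mean, sd = lognormal_mean_sd(mu=0.0, sigma=0.1)  # arbitrary example values
print("mean on linear scale:", mean)
print("SD on linear scale:  ", sd)
```
<br />
Note that even with mu equal to 0, the mean on the linear scale is slightly greater than 1 because of the sigma term in the exponent.<br />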
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
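<br />
The conversion from the log scale to the linear scale asked about in the box above can also be checked numerically. The Python sketch below is not part of the lab scripts; the function name <tt>lognormal_mean_sd</tt> is made up for illustration, and it relies on the standard lognormal identities mean = exp(mu + sigma^2/2) and standard deviation = mean * sqrt(exp(sigma^2) - 1):<br />

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Return (mean, sd) on the linear scale of a Lognormal(mu, sigma) variable.

    mu and sigma are the mean and standard deviation of the LOG of the variable.
    """
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

# Posterior means quoted in the answer above; your own run will differ slightly
mean, sd = lognormal_mean_sd(0.0116, 0.01828)
print(round(mean, 4), round(sd, 4))
```

Substituting your own posterior means for ucln_mu and ucln_sigma should give values close to those quoted in the answer, as long as your run behaved like mine.<br />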
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
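<br />
To make the effect of these weights concrete, here is a small Python sketch (not part of the lab scripts) that turns weights into selection probabilities, following the description given earlier in this lab that a move is chosen with probability equal to its weight divided by the sum of all weights. The move labels are just strings chosen for illustration, and the list assumes ''divtime.Rev'' contains exactly the moves defined in this lab (5 parameter moves of weight 1, 38 branch rate moves of weight 1, and the 4 tree moves added above):<br />

```python
def selection_probabilities(weights):
    """Map each move name to weight / (sum of all weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Moves assumed to be present in divtime.Rev (labels are illustrative only)
weights = {
    "birth_rate slide": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}
for i in range(1, 39):  # one slide move per branch rate, 38 in total
    weights[f"branch_rates[{i}] slide"] = 1.0

probs = selection_probabilities(weights)
for name in ("node time slide", "NNI", "birth_rate slide"):
    print(name, round(probs[name], 4))
```

Under these assumptions the weights sum to 68, so the node time slider is chosen with probability 10/68, each topology move with probability 5/68, and each remaining move with probability 1/68.<br />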
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size if we had performed the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}<br />
</div><br />
<br />
== Where to continue ==<br />
<br />
If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.<br />
<br />
== What to turn in ==<br />
<br />
Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41645Phylogenetics: RevBayes Lab2020-04-21T23:35:25Z<p>Paul Lewis: /* Obtaining credible intervals under the prior */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and RevBayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
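<br />
The confounding of rate and time can be seen in a tiny Python sketch (not part of the lab scripts). For simplicity it uses the Jukes-Cantor (JC69) model rather than the HKY model we simulate under below, and the function name is made up for illustration. The probability that a site differs between the two ends of a branch depends only on the product of rate and time, so quite different rate and time combinations are indistinguishable from the sequences alone:<br />

```python
import math

def jc69_prob_different(rate, time):
    """Probability that a site differs across a branch under JC69.

    Only the product rate * time (expected substitutions per site) matters.
    """
    d = rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Three different (rate, time) pairs that share the same product, 1.0
for rate, time in [(1.0, 1.0), (2.0, 0.5), (4.0, 0.25)]:
    print(rate, time, jc69_prob_different(rate, time))
```

All three lines print the same probability, which is why a prior is needed to favor some rate and time combinations over others.<br />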
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
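<br />
For anyone who wants a reminder of how a sliding window move works, here is a minimal Python sketch (not part of the lab scripts, and not RevBayes's actual implementation). It samples from the Exponential(0.01) prior alone, with no data, using a window of width <tt>delta</tt> centered on the current value. The function names, the fixed seed, and the choice to simply reject proposals that fall at or below zero are all assumptions made for illustration:<br />

```python
import math
import random

def log_exponential_density(x, rate):
    """Log density of Exponential(rate); -inf outside the support (x <= 0)."""
    if x <= 0.0:
        return float("-inf")
    return math.log(rate) - rate * x

def sliding_window_mcmc(log_target, start, delta, n_iter, seed=1):
    """Metropolis sampler whose proposal is uniform on a window of width delta."""
    rng = random.Random(seed)
    x = start
    log_p = log_target(x)
    samples = []
    accepted = 0
    for _ in range(n_iter):
        # Propose a value uniformly within delta/2 on either side of x
        proposal = x + (rng.random() - 0.5) * delta
        log_p_new = log_target(proposal)
        # The proposal is symmetric, so the acceptance ratio is just the
        # ratio of target densities; proposals outside the support are rejected
        if math.log(1.0 - rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_iter

samples, acceptance = sliding_window_mcmc(
    lambda x: log_exponential_density(x, 0.01), start=1.0, delta=1.0, n_iter=2000)
print("acceptance rate:", acceptance)
```

Because this prior is nearly flat over a window of width 1, almost every proposal is accepted here; tuning toward a 40% acceptance rate would therefore call for a larger <tt>delta</tt> in this toy setting.<br />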
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
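<br />
Tracer reports HPD intervals for you, but it may help to see how such an interval can be obtained from a column of MCMC samples. The Python sketch below is not part of the lab scripts and is not necessarily the exact procedure Tracer uses; it takes the common approach of finding the shortest interval that contains the requested fraction of the sorted samples. The function name <tt>hpd_interval</tt> and the toy sample are made up for illustration:<br />

```python
import math

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    # Number of samples the interval must cover (small tolerance guards
    # against floating point error in mass * n)
    k = max(1, math.ceil(mass * n - 1e-9))
    best_lo, best_hi = xs[0], xs[k - 1]
    # Slide a window of k consecutive sorted samples and keep the narrowest
    for i in range(1, n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi

# Toy sample with one extreme value; the 90% HPD interval excludes it
toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(hpd_interval(toy, mass=0.9))
```

With real output you would pass in one column of ''strict.log'' (after discarding burnin) rather than the toy list used here.<br />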
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
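<br />
The claim that a larger alpha produces smaller proposed changes can be illustrated with a short Python sketch (not part of the lab scripts). It assumes that the move proposes a new simplex from a Dirichlet distribution whose parameters are alpha times the current values; that is my understanding of how this kind of move works, but check the RevBayes documentation for the details of mvDirichletSimplex. The function names and the fixed seed are made up for illustration:<br />

```python
import random

def propose_simplex(current, alpha, rng):
    """Draw a new simplex from Dirichlet(alpha * current) via normalized gamma draws."""
    draws = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(draws)
    return [d / total for d in draws]

def mean_abs_change(current, alpha, n_draws=500, seed=1):
    """Average absolute change per coordinate over many proposals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        proposal = propose_simplex(current, alpha, rng)
        total += sum(abs(p - c) for p, c in zip(proposal, current)) / len(current)
    return total / n_draws

freqs = [0.1, 0.2, 0.3, 0.4]  # the state frequencies used in the simulation
for alpha in (10.0, 100.0, 10000.0):
    print(alpha, mean_abs_change(freqs, alpha))
```

The average change shrinks as alpha grows, so a very sharp posterior needs a large alpha before a reasonable fraction of proposals is accepted.<br />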
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus the clock_rate that the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
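<br />
If you would rather compute than read values off the figure, here is a small Python sketch of the conversion. It relies only on the standard lognormal formulas (mean = exp(mu + sigma^2/2), variance = (exp(sigma^2) - 1) * exp(2*mu + sigma^2)); the function name <tt>lognormal_mean_sd</tt> and the example values are mine, chosen just for illustration.<br />
<br />
```python
import math

def lognormal_mean_sd(mu, sigma):
    """Convert log-scale parameters (mu, sigma) of a lognormal distribution
    into the mean and standard deviation on the linear scale."""
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

# Illustrative values only: the log of the variable has mean 0 and sd 0.5
mean, sd = lognormal_mean_sd(0.0, 0.5)
print(f"linear-scale mean = {mean:.4f}, sd = {sd:.4f}")
```
<br />
Note that even with mu = 0 the linear-scale mean is greater than 1, because the sigma term pushes the mean upward.<br />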
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
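<br />
To get a feel for how wild those default starting values could be, here is a rough Python sketch (not RevBayes code) that mimics the two-stage draw. I am assuming that the second argument of <tt>dnNormal</tt> is a standard deviation and that the argument of <tt>dnExponential</tt> is a rate; the seed and the threshold of 10 log units are arbitrary choices of mine. The sketch works on the log scale because exponentiating these draws would often overflow.<br />
<br />
```python
import random

random.seed(1)  # arbitrary seed so the sketch is repeatable

def draw_log_rate():
    """Draw the natural log of one starting branch rate from the hierarchical prior."""
    mu = random.gauss(0.0, 100.0)      # mimics ucln_mu ~ Normal(mean 0, sd 100) (assumed parameterization)
    sigma = random.expovariate(0.01)   # mimics ucln_sigma ~ Exponential(rate 0.01), i.e. mean 100
    return random.gauss(mu, sigma)     # log of a rate drawn from Lognormal(mu, sigma)

log_rates = [draw_log_rate() for _ in range(1000)]

# Count draws whose rate lies outside the already generous range (e^-10, e^10)
extreme = sum(1 for x in log_rates if abs(x) > 10.0)
print(f"{extreme} of {len(log_rates)} starting rates fall outside (e^-10, e^10)")
```
<br />
Most of the draws land many orders of magnitude away from 1.0, which is why setting the starting values by hand is worthwhile here.<br />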
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
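<br />
Here is a quick Python tally of what these weights could mean in practice, assuming RevBayes chooses each move with probability equal to its weight divided by the sum of all weights. The list of moves is my own inventory of what ''divtime.Rev'' should contain at this point (the clock_rate move is gone because the strict clock section was replaced); adjust it if your script differs.<br />
<br />
```python
# Hypothetical inventory of the moves in divtime.Rev and their weights
weights = {
    "birth_rate slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "branch_rates slides (38 moves)": 38 * 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}

total = sum(weights.values())

# Probability of choosing each move (or group of moves) on any given attempt
probs = {name: w / total for name, w in weights.items()}

for name, p in probs.items():
    print(f"{name:32s} {p:.3f}")
```
<br />
Under these assumptions the 38 branch rate moves together still claim more than half of all attempts, so the tree moves are favored individually but do not dominate the run.<br />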
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the end of the divtime.Rev file (if your script ends with a <tt>quit()</tt> line, place the new lines before it so that they are executed). This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.<br />
<br />
Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. Comment out the existing lines setting up the prior for birth_rate and replace with a single line making the birth_rate a constant node:<br />
<br />
#birth_rate ~ dnExponential(0.01)<br />
#birth_rate.setValue(1.0)<br />
birth_rate <- 2.6<br />
<br />
4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
rb divprior.Rev<br />
<br />
Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41644Phylogenetics: RevBayes Lab2020-04-21T18:11:44Z<p>Paul Lewis: /* Obtaining credible intervals under the prior */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Log in to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and RevBayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
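<br />
One way to see where a value near 2.6 could come from is the following back-of-the-envelope Python sketch. It assumes that the height of a Yule tree is the sum of the waiting times spent with k = 2, 3, ..., 20 lineages, each of which is exponentially distributed with mean 1/(k * birth_rate). That accounting is my assumption, not something taken from the evolver documentation.<br />
<br />
```python
n_taxa = 20
birth_rate = 2.6

# With k lineages present, the wait until the next speciation event has
# mean 1/(k * birth_rate); sum these means from the root (k = 2) up to k = n_taxa.
expected_height = sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))

print(f"expected tree height = {expected_height:.4f}")
```
<br />
Under this accounting the expected height comes out very close to 1, which is consistent with the choice of 2.6 for the birth rate.<br />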
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
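<br />
As a reminder of what a sliding window proposal looks like, here is a minimal Python sketch based only on the description above (a window of width delta centered on the current value). It is not RevBayes's actual implementation; the function name <tt>slide_proposal</tt> is hypothetical, and details such as how proposals outside the valid range of a parameter are handled are left out.<br />
<br />
```python
import random

def slide_proposal(current, delta):
    """Propose a new value drawn uniformly from a window of width delta
    centered on the current value."""
    return current + delta * (random.random() - 0.5)

random.seed(42)  # arbitrary seed so the sketch is repeatable

# Five proposals starting from clock_rate = 1.0 with delta = 1.0
proposals = [slide_proposal(1.0, delta=1.0) for _ in range(5)]
for p in proposals:
    print(f"{p:.4f}")
```
<br />
A larger delta makes the proposals bolder (and more likely to be rejected), while a smaller delta makes them more conservative; tuning during burnin adjusts delta toward the target acceptance rate.<br />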
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
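<br />
If you want to check your reading of the exchangeability densities, this short Python sketch shows what the HKY model used for simulation (kappa = 5) implies once the six relative rates are normalized to sum to 1, as they must when stored in a simplex. The ordering of the six rates in the list is an assumption made for illustration and may not match the column order in your log file.<br />
<br />
```python
kappa = 5.0  # transition/transversion rate ratio used in the evolver control file

# Relative rates under HKY: transitions (A<->G, C<->T) get kappa, transversions get 1.
# Assumed order: A<->C, A<->G, A<->T, C<->G, C<->T, G<->T
raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]

# Normalize so that the six exchangeabilities sum to 1
total = sum(raw)
exchangeabilities = [r / total for r in raw]

for value in exchangeabilities:
    print(f"{value:.4f}")
```
<br />
The two transition rates should each come out 5 times larger than each of the four transversion rates.<br />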
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
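<br />
To see why a larger alpha means smaller steps, here is a Python sketch under the assumption that the move proposes a new simplex from a Dirichlet distribution whose parameters are alpha times the current values (I have not checked this against the RevBayes source). In that case each proposed component has standard deviation sqrt(x(1-x)/(alpha+1)), where x is its current value. The function name <tt>proposal_sd</tt> and the value 0.357 are just for illustration.<br />
<br />
```python
import math

def proposal_sd(x, alpha):
    """Standard deviation of one proposed component when the proposal is
    Dirichlet(alpha * current simplex) and x is the component's current value."""
    return math.sqrt(x * (1.0 - x) / (alpha + 1.0))

# How the typical step size for a component near 0.357 shrinks as alpha grows
for alpha in (10, 100, 1000, 10000):
    print(f"alpha = {alpha:6d}  proposal sd = {proposal_sd(0.357, alpha):.5f}")
```
<br />
Each tenfold increase in alpha shrinks the typical step by roughly a factor of 3, which is the kind of reduction needed when the posterior is very narrow.<br />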
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This distinguishes this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 50 parameters (the original 10 plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
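<br />
If you want to do the conversion numerically rather than reading it off the figure, the standard lognormal formulas are mean = exp(mu + sigma^2/2) and standard deviation = mean × sqrt(exp(sigma^2) - 1). The Python sketch below (Python, not Rev; the function name <tt>lognormal_mean_sd</tt> is just for illustration) applies them to an arbitrary example, mu = 0 and sigma = 1.<br />
<br />
```python
import math

def lognormal_mean_sd(mu, sigma):
    """Convert the log-scale parameters (mu, sigma) of a lognormal
    distribution to the mean and standard deviation on the linear scale."""
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

# Arbitrary example values: the log of the variable is Normal(0, 1)
mean, sd = lognormal_mean_sd(0.0, 1.0)
print("linear-scale mean:", round(mean, 4))
print("linear-scale sd:  ", round(sd, 4))
```
<br />
Note that even with mu = 0 the linear-scale mean is not exp(0) = 1: the long right tail of the lognormal pulls the mean above the median.<br />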
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
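<br />
To get a feel for what these weights mean, the Python sketch below (Python, not Rev) tallies the weights and converts each to a selection probability using the rule described earlier in this lab (weight divided by the sum of all weights). The tally assumes your ''divtime.Rev'' script contains exactly the moves introduced in this lab and nothing else; the labels in the dictionary are just descriptive names, not RevBayes identifiers.<br />
<br />
```python
n_taxa = 20
n_branches = 2 * n_taxa - 2  # 38 branch rate parameters, one slide move each

# Total weight contributed by each group of moves (ASSUMED inventory)
weights = {
    "birth_rate slide": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "branch_rates slides (all 38)": 1.0 * n_branches,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}

total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

print("total weight:", total)
for name, p in probs.items():
    print(f"{name:30s} {p:.4f}")
```
<br />
Under these assumptions the three topology moves together are chosen a bit more than a fifth of the time, and any single branch rate is updated only about once in every 68 proposals.<br />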
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by running the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
This has been a long lab, but there is one more thing I want you to try before you go. Let's see what the credible interval sizes are under the prior. We should not change the tree topology this time, as the prior on tree topology is flat across all possible tree topologies, so we will end up with a star tree if we allow the topology to be modified.<br />
<br />
You should copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:<br />
<br />
1. change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);<br />
<br />
2. comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;<br />
<br />
3. change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and<br />
<br />
4. change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):<br />
<br />
mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)<br />
mymcmc.run(generations=1000000, underPrior=TRUE)<br />
<br />
Now run the file as usual:<br />
rb divprior.Rev<br />
<br />
Open the ''divpriorMAP.tre'' file and make the node bars show the 95% HPD intervals. This time it should look quite different!</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41643Phylogenetics: RevBayes Lab2020-04-21T15:45:02Z<p>Paul Lewis: /* Review results of the divergence time analysis */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and RevBayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
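<br />
A tiny numerical illustration of this confounding, written in Python (not Rev): the probability of observing the same base at both ends of an edge depends on rate and time only through their product. The sketch uses the Jukes-Cantor (JC69) model purely because its formula is short; that choice is an assumption made for illustration and is not the model used later in this lab.<br />
<br />
```python
import math

def jc69_prob_same(rate, time):
    """Probability that a site has the same base at both ends of an edge
    under JC69; note that rate and time enter only as the product rate*time."""
    return 0.25 + 0.75 * math.exp(-4.0 * rate * time / 3.0)

# Three very different (rate, time) scenarios with the same product 0.5
scenarios = [(1.0, 0.5), (0.5, 1.0), (2.0, 0.25)]
for rate, time in scenarios:
    print(rate, time, rate * time, jc69_prob_same(rate, time))
```
<br />
All three scenarios give identical probabilities, so the likelihood cannot prefer one over another; only the priors on rates and times can.<br />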
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
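<br />
Here is one way to check that 2.6 is a sensible choice, sketched in Python (not Rev). It assumes that the tree starts with 2 lineages at the root, that the waiting time while k lineages exist is exponential with rate k times the birth rate (so its expectation is 1/(k × birth rate)), and that the tree is stopped just before a 21st lineage would arise. These are assumptions of the sketch; the reasoning actually used to arrive at 2.6 is not spelled out here. The function name <tt>expected_yule_height</tt> is made up for this illustration.<br />
<br />
```python
def expected_yule_height(n_taxa, birth_rate):
    """Expected root-to-tip height of a pure birth tree, ASSUMING it starts
    with 2 lineages and each interval with k lineages lasts 1/(k*birth_rate)
    on average, for k = 2, ..., n_taxa."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))

height = expected_yule_height(20, 2.6)
print("expected height for 20 taxa, birth rate 2.6:", round(height, 4))
```
<br />
Under these assumptions the expected height comes out very close to 1.<br />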
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
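<br />
If you are curious what an HPD interval actually is, here is a minimal Python sketch of one common way to compute it from MCMC samples: sort the samples and find the narrowest window containing 95% of them. I am not claiming this is exactly what Tracer does; the function name <tt>hpd_interval</tt> is made up for this example.<br />

```python
def hpd_interval(samples, mass=0.95):
    """Return (lower, upper) for the narrowest interval holding `mass` of the samples."""
    sorted_samples = sorted(samples)
    n = len(sorted_samples)
    # Number of samples that must fall inside the interval.
    n_inside = max(1, round(mass * n))
    best = None
    for start in range(n - n_inside + 1):
        lower = sorted_samples[start]
        upper = sorted_samples[start + n_inside - 1]
        # Keep the first window that is strictly narrower than the best so far.
        if best is None or upper - lower < best[1] - best[0]:
            best = (lower, upper)
    return best


# A skewed toy sample: the narrowest 80% interval excludes the outlier.
print(hpd_interval([0.0, 0.1, 0.2, 0.3, 10.0], mass=0.8))
```

To try this on your own results, you could read one column of ''strict.log'' into a list and pass it to the function.<br />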
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these parameters are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), so proposals that move the values very far from the optimum are rejected. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
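<br />
Here is a small Python sketch of why a larger alpha means smaller proposed changes. It assumes the proposal is a draw from a Dirichlet distribution whose parameters are alpha times the current simplex values (my reading of how this kind of move works, not something confirmed here); the function names are made up for this example.<br />

```python
import random


def propose_simplex(current, alpha, rng):
    """Draw from Dirichlet(alpha * current) using normalized gamma variates.

    Assumption: the simplex move proposes from a Dirichlet centered on the
    current values, with alpha acting as a concentration parameter.
    """
    draws = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(draws)
    return [d / total for d in draws]


def mean_abs_change(current, alpha, n_draws=2000, seed=1):
    """Average absolute change per simplex element over many proposals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        proposal = propose_simplex(current, alpha, rng)
        total += sum(abs(p - c) for p, c in zip(proposal, current)) / len(current)
    return total / n_draws


state_freqs = [0.1, 0.2, 0.3, 0.4]  # the frequencies used in the evolver simulation
for alpha in (10, 100, 1000, 10000):
    print(alpha, mean_abs_change(state_freqs, alpha))
```

The average change shrinks steadily as alpha grows, which is why a cap of 100 leaves the proposals too bold for parameters that the data pin down this tightly.<br />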
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This distinguishes the approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 50 parameters (the original 10 plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
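<br />
If you prefer formulas to figures, the standard conversion from the log-scale parameters (mu, sigma) to the mean and standard deviation on the linear scale can be sketched in Python as follows. The function name <tt>lognormal_mean_sd</tt> is made up for this example.<br />

```python
import math


def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of X when log(X) ~ Normal(mu, sigma)."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd


# With mu = 0 and sigma = 1 the mean on the linear scale is exp(0.5), not 1.
mean, sd = lognormal_mean_sd(0.0, 1.0)
print(mean, sd)
```

You can plug in your own posterior means of ucln_mu and ucln_sigma later to answer the questions about the relaxed clock results.<br />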
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
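<br />
To get a feel for just how crazy those starting values can be, here is a Python sketch that mimics the draw-from-the-prior procedure described above. It assumes that the second argument of dnNormal is a standard deviation and that the argument of dnExponential is a rate; it works on the log10 scale so that huge values don't overflow. The function name is made up for this example.<br />

```python
import math
import random


def prior_draw_log10_rates(n_draws=10000, seed=1):
    """Draw branch rates from the hierarchical prior; return log10(rate) values."""
    rng = random.Random(seed)
    log10_rates = []
    for _ in range(n_draws):
        ucln_mu = rng.gauss(0.0, 100.0)       # assumed: Normal(mean 0, sd 100)
        ucln_sigma = rng.expovariate(0.01)    # assumed: Exponential(rate 0.01), mean 100
        log_rate = rng.gauss(ucln_mu, ucln_sigma)  # log of a Lognormal(mu, sigma) draw
        log10_rates.append(log_rate / math.log(10.0))
    return log10_rates


draws = prior_draw_log10_rates()
reasonable = sum(1 for v in draws if abs(v) <= 1.0) / len(draws)
print("fraction of starting rates between 0.1 and 10:", reasonable)
```

Only a small fraction of the draws land within a factor of 10 of the true rate, which is why setting the starting values by hand is worth the extra line.<br />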
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
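<br />
As a quick check on what these weights mean, here is a Python sketch that tallies the moves in ''divtime.Rev'' and converts weights to selection probabilities, assuming (as in the description of move weights for the strict clock analysis) that a move's probability of being chosen is its weight divided by the sum of all weights. The dictionary keys are just labels I made up.<br />

```python
# Weights of the moves defined in divtime.Rev (labels are hypothetical).
n_taxa = 20
n_branches = 2 * n_taxa - 2

move_weights = {
    "birth_rate slide": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
}
# One sliding-window move of weight 1 for each branch rate.
for i in range(1, n_branches + 1):
    move_weights["branch_rates[%d] slide" % i] = 1.0

total_weight = sum(move_weights.values())
# Assumption: selection probability = weight / sum of all weights.
move_probs = {name: w / total_weight for name, w in move_weights.items()}

print("total weight:", total_weight)
print("P(node time slide):", move_probs["node time slide"])
print("P(NNI):", move_probs["NNI"])
```

Even with the extra weight, the tree moves together account for well under half of all attempted moves because there are so many branch rate moves.<br />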
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer. <br />
<br />
Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}<br />
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}<br />
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data are used|answer}}<br />
</div><br />
<br />
== Obtaining credible intervals under the prior ==<br />
<br />
This has been a long lab, but there is one more thing I want you to try before you go. Let's see what the credible interval sizes are under the prior. We should not change the tree topology this time, as the prior on tree topology is flat across all possible tree topologies, so we will end up with a star tree if we allow the topology to be modified.<br />
<br />
You should copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and change all the file names to have prefix ''divprior'' so as not to overwrite your previous results, but other than that, the only thing that needs to be changed is the mcmc command:</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41642Phylogenetics: RevBayes Lab2020-04-21T15:25:31Z<p>Paul Lewis: /* Divergence times */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and RevBayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
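<br />
Here is a tiny Python sketch of the rate-times-time problem. For simplicity it uses the Jukes-Cantor (JC69) formula for the expected proportion of sites that differ across a branch, not the model we will simulate under; the function name is made up for this example. The point is that the data can only depend on the product of rate and time.<br />

```python
import math


def jc69_expected_difference(rate, time):
    """Expected proportion of differing sites across a branch under JC69.

    Only the product rate * time (expected substitutions per site) enters.
    """
    distance = rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * distance / 3.0))


# A fast rate over a short time looks the same as a slow rate over a long time.
print(jc69_expected_difference(1.0, 1.0))
print(jc69_expected_difference(2.0, 0.5))
print(jc69_expected_difference(0.5, 2.0))
```

All three scenarios predict exactly the same data, so only the priors can tell them apart.<br />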
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.<br />
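<br />
Where might 2.6 come from? Here is one plausible reconstruction as a Python sketch (an assumption on my part, not necessarily the calculation originally used): in a pure birth process the expected waiting time while there are k lineages is 1/(k*lambda), so summing over k = 2 to 20 lineages gives an expected root-to-tip height. The function name is made up for this example.<br />

```python
def expected_yule_height(n_taxa, birth_rate):
    """Expected tree height from summing expected waiting times.

    Assumption: the tree starts with 2 lineages at the root and the time
    spent with k lineages has expectation 1 / (k * birth_rate), k = 2..n_taxa.
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))


print(expected_yule_height(20, 2.6))
```

Under this assumption a birth rate of 2.6 gives an expected height very close to 1 for 20 taxa.<br />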
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda (our <tt>birth_rate</tt>).<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the last section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and how often it was accepted.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
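As a quick sanity check on the settings above, here is a small Python sketch (not part of the revscript) that works out how many tuning rounds and how many logged samples to expect. It assumes one line is written every <tt>printgen</tt> generations and ignores any extra line RevBayes may write for the starting state.<br />
<br />
```python
# Bookkeeping for the MCMC settings used in strict.Rev.
# Assumption: a monitor writes one line every `printgen` generations;
# any extra line for the starting state is not counted here.
burnin_generations = 1000
tuning_interval = 100
run_generations = 10000
printgen_file = 10      # mnModel and mnFile
printgen_screen = 100   # mnScreen

tuning_rounds = burnin_generations // tuning_interval
file_samples = run_generations // printgen_file
screen_lines = run_generations // printgen_screen

print("tuning rounds during burnin:", tuning_rounds)
print("samples in each output file:", file_samples)
print("progress lines on screen:   ", screen_lines)
```
<br />
If the number of rows you see in Tracer differs by one from the count printed here, the starting state is the likely reason.<br />
<br />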
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306, 4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
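<br />
The exchangeability values in the answers above can be checked by hand. The Python sketch below (an illustration, not part of the lab scripts) converts the HKY kappa used in the simulation into the GTR exchangeabilities we would expect to recover. It assumes the six exchangeabilities are listed in the order AC, AG, AT, CG, CT, GT and are normalized to sum to 1, as they must be under a Dirichlet prior.<br />
<br />
```python
# Expected GTR exchangeabilities for data simulated under HKY with kappa = 5.
# Assumptions: order AC, AG, AT, CG, CT, GT; values normalized to sum to 1.
kappa = 5.0
pairs = ["AC", "AG", "AT", "CG", "CT", "GT"]
transitions = {"AG", "CT"}  # purine<->purine and pyrimidine<->pyrimidine

# Transitions get relative rate kappa, transversions get relative rate 1.
raw = [kappa if pair in transitions else 1.0 for pair in pairs]
total = sum(raw)
expected = {pair: rate / total for pair, rate in zip(pairs, raw)}

for pair in pairs:
    print(pair, round(expected[pair], 4))
```
<br />
The two transitions each come out at 5/14 and the four transversions at 1/14, which is the pattern the marginal densities in Tracer should show.<br />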
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these parameters are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far away from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
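<br />
To see why a larger alpha means a more timid proposal, here is a rough Python sketch. It rests on an assumption about how the move works: that the proposed simplex is drawn from a Dirichlet distribution whose parameters are alpha times the current values. Under that assumption, the standard deviation of the proposed value for a component currently equal to x is sqrt(x(1-x)/(alpha+1)).<br />
<br />
```python
import math

def proposal_sd(x, alpha):
    """Std. dev. of one component of a Dirichlet(alpha * current) proposal.

    Assumption: the proposal is Dirichlet with parameters alpha * current
    simplex, so the parameters sum to alpha and the component with current
    value x has variance x * (1 - x) / (alpha + 1).
    """
    return math.sqrt(x * (1.0 - x) / (alpha + 1.0))

# Spread of proposals for a state frequency currently at 0.4
for alpha in (10, 100, 1000, 10000):
    print(alpha, proposal_sd(0.4, alpha))
```
<br />
Each tenfold increase in alpha shrinks the spread of proposed values by roughly a factor of 3, which is what a very sharp posterior needs.<br />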
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This distinguishes the approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10 minus clock_rate, which the branch rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
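<br />
If you'd rather compute than read values off the figure, the standard conversion from (mu, sigma) on the log scale to the mean and standard deviation on the linear scale can be sketched in Python as follows (the function names are just for illustration):<br />
<br />
```python
import math

def lognormal_mean(mu, sigma):
    """Mean of X when log(X) ~ Normal(mu, sigma)."""
    return math.exp(mu + 0.5 * sigma ** 2)

def lognormal_sd(mu, sigma):
    """Standard deviation of X when log(X) ~ Normal(mu, sigma)."""
    return lognormal_mean(mu, sigma) * math.sqrt(math.exp(sigma ** 2) - 1.0)

# Example: mu = 0 does NOT give a mean of 0 (or even 1) on the linear scale
mu, sigma = 0.0, 0.5
print("linear-scale mean:", lognormal_mean(mu, sigma))
print("linear-scale sd:  ", lognormal_sd(mu, sigma))
```
<br />
You can plug your own posterior means of ucln_mu and ucln_sigma into these two functions when you review the results.<br />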
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
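<br />
To get a feel for what these weights mean in practice, the Python sketch below tallies the moves defined in ''divtime.Rev'' and converts weights to selection probabilities. It assumes a move is chosen with probability equal to its weight divided by the sum of all weights, and that the script contains exactly the moves listed (20 taxa, so 38 branch rate sliders, and no clock_rate move because the strict clock section was replaced).<br />
<br />
```python
# Selection probabilities for the moves in divtime.Rev.
# Assumption: probability of a move = its weight / sum of all weights.
n_taxa = 20
n_branches = 2 * n_taxa - 2  # 38 branch rate parameters

weights = {
    "mvSlide(birth_rate)": 1.0,
    "mvSlide(ucln_mu)": 1.0,
    "mvSlide(ucln_sigma)": 1.0,
    "mvSlide(branch_rates), all branches combined": 1.0 * n_branches,
    "mvDirichletSimplex(exchangeabilities)": 1.0,
    "mvDirichletSimplex(state_freqs)": 1.0,
    "mvNodeTimeSlideUniform(timetree)": 10.0,
    "mvNNI(timetree)": 5.0,
    "mvNarrow(timetree)": 5.0,
    "mvFNPR(timetree)": 5.0,
}

total_weight = sum(weights.values())
probabilities = {name: w / total_weight for name, w in weights.items()}

for name, prob in probabilities.items():
    print(f"{name:48s} {prob:.3f}")
```
<br />
Note that even with the extra weight, tree moves account for only a bit more than a third of all attempted moves, because the 38 branch rate sliders collectively soak up most of the rest.<br />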
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:<br />
# Summarize divergence times<br />
<br />
tt = readTreeTrace("output/divtime.trees", "clock")<br />
tt.summarize()<br />
mapTree(tt, "output/divtimeMAP.tre")<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''2nd question'' {{title|xxx|answer}}<br />
* ''3rd question'' {{title|xxx|answer}}<br />
* ''4th question'' {{title|xxx|answer}}<br />
* ''5th question'' {{title|xxx|answer}}<br />
</div></div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41641Phylogenetics: RevBayes Lab2020-04-21T14:56:32Z<p>Paul Lewis: /* Divergence times */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and revbayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
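<br />
One back-of-the-envelope calculation that lands close to 2.6 is sketched below in Python. It is an assumption about where the number might come from, not something reported by evolver: the tree starts with 2 lineages at the root, the waiting time while there are k lineages is exponential with rate k times the birth rate, and the height is the sum of the expected waiting times for k = 2 through 20.<br />
<br />
```python
# Expected height of a 20-taxon pure birth (Yule) tree.
# Assumptions: the root has 2 lineages; with k lineages the waiting time to
# the next event is exponential with mean 1 / (k * birth_rate); the final
# interval (k = n_taxa) is included in the height.
n_taxa = 20
birth_rate = 2.6

expected_height = sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))
print("expected tree height:", expected_height)

# Birth rate that would give an expected height of exactly 1 under the
# same assumptions.
rate_for_unit_height = sum(1.0 / k for k in range(2, n_taxa + 1))
print("birth rate for expected height 1:", rate_for_unit_height)
```
<br />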
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is supplied by <tt>birth_rate</tt>.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and how often it succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
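To make the tuning step concrete, here is a minimal Python sketch (not Rev, and not RevBayes source code) of a sliding window move on a single positive parameter whose window width is retuned every 100 iterations toward a 40% acceptance rate. The target density (Exponential with rate 1), the function names, and the particular rescaling rule in <tt>tune_delta</tt> are all assumptions chosen for illustration; RevBayes may adjust <tt>delta</tt> differently.<br />

```python
import math
import random


def tune_delta(delta, accept_rate, target=0.4):
    """Hypothetical tuning rule: widen the window if we accept too often,
    narrow it if we accept too rarely."""
    if accept_rate > target:
        return delta * (1.0 + (accept_rate - target) / (1.0 - target))
    return delta / (2.0 - accept_rate / target)


def log_exponential(x, rate=1.0):
    """Log density of an Exponential(rate) target; -inf outside its support."""
    return math.log(rate) - rate * x if x > 0.0 else -math.inf


def burnin(log_density, x, delta, generations=1000, tuning_interval=100, seed=1):
    """Run a Metropolis sampler with a sliding window move, retuning delta
    after every tuning_interval iterations."""
    rng = random.Random(seed)
    accepted = 0
    for gen in range(1, generations + 1):
        # Propose uniformly within a window of width delta centered on x
        proposal = x + delta * (rng.random() - 0.5)
        log_ratio = log_density(proposal) - log_density(x)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
        if gen % tuning_interval == 0:
            delta = tune_delta(delta, accepted / tuning_interval)
            accepted = 0
    return x, delta


x, delta = burnin(log_exponential, x=1.0, delta=1.0)
print(f"tuned delta = {delta:.3f}")
```

The window widens when moves are accepted more often than the target (the move is too timid) and narrows when they are accepted less often (the move is too bold).<br />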
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306, 4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
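The exchangeabilities you should expect can be worked out from the simulation settings. The data were simulated under HKY with kappa = 5, which is a special case of GTR in which the 2 transitions have relative rate kappa and the 4 transversions have relative rate 1. The Python sketch below (an illustration, not part of the revscript) normalizes those 6 relative rates to sum to 1, as a Dirichlet-distributed vector must. The ordering AC, AG, AT, CG, CT, GT is an assumption made for the example.<br />

```python
def hky_as_gtr_exchangeabilities(kappa):
    """Return the 6 GTR exchangeabilities implied by HKY, normalized to sum to 1.

    Assumed order: A<->C, A<->G, A<->T, C<->G, C<->T, G<->T.
    A<->G and C<->T are the transitions; the other 4 are transversions.
    """
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]


rates = hky_as_gtr_exchangeabilities(5.0)
transition = rates[1]     # 5/14
transversion = rates[0]   # 1/14
print(f"transition = {transition:.4f}, transversion = {transversion:.4f}")
```

With kappa = 5 the transitions come out at 5/14 and the transversions at 1/14, which is what the marginal densities in Tracer should be centered near.<br />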
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these parameters are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values of these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter (alpha) for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
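Why does a larger alpha mean smaller proposed changes? The Python sketch below assumes that the simplex move proposes a new vector from a Dirichlet distribution whose parameters are alpha times the current values; that is my reading of how this kind of move works, not a statement about the RevBayes implementation. The function names are made up for the example.<br />

```python
import random


def dirichlet_draw(params, rng):
    """Draw one Dirichlet variate by normalizing independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]


def propose_simplex(current, alpha, rng):
    """Assumed proposal: Dirichlet centered on the current simplex,
    with concentration controlled by alpha."""
    return dirichlet_draw([alpha * x for x in current], rng)


def mean_step(current, alpha, n=2000, seed=1):
    """Average size of the largest coordinate change over n proposals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        proposal = propose_simplex(current, alpha, rng)
        total += max(abs(p - c) for p, c in zip(proposal, current))
    return total / n


freqs = [0.1, 0.2, 0.3, 0.4]
for alpha in (10.0, 100.0, 10000.0):
    print(f"alpha = {alpha:>7.0f}: mean largest change = {mean_step(freqs, alpha):.4f}")
```

Every proposal still sums to 1, but the proposals crowd closer to the current values as alpha grows, which is what a sharply peaked posterior needs.<br />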
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model. The strict clock model had 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities); the relaxed clock model has 49 (the 38 edge rates replace clock_rate, leaving 9 of the original 10, plus the 38 edge rates and the 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
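The translation from the log scale to the linear scale uses the standard lognormal formulas: the mean is exp(mu + sigma^2/2) and the standard deviation is the mean times sqrt(exp(sigma^2) - 1). The short Python sketch below applies them to the posterior means quoted in the answers above; your own values will differ somewhat because your seed was different.<br />

```python
import math


def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of X when log(X) ~ Normal(mu, sigma)."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd


# Posterior means of ucln_mu and ucln_sigma quoted in the answers above
mean, sd = lognormal_mean_sd(0.0116, 0.01828)
print(f"mean = {mean:.4f}, sd = {sd:.5f}")
```

Note that when sigma is this small, the linear-scale standard deviation is nearly equal to sigma itself, and the mean is nearly exp(mu).<br />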
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
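To see what these weights amount to, recall the rule described for the strict clock script: the chance that a move is selected next is its weight divided by the sum of all weights. The Python sketch below tallies the moves that ''divtime.Rev'' should contain if you followed the steps above (the dictionary keys are just labels for this example, and the tally assumes you have not added or removed any moves).<br />

```python
def move_probabilities(weights):
    """Convert move weights to selection probabilities (weight / sum of weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


weights = {
    "mvNodeTimeSlideUniform": 10.0,
    "mvNNI": 5.0,
    "mvNarrow": 5.0,
    "mvFNPR": 5.0,
    "birth_rate": 1.0,
    "ucln_mu": 1.0,
    "ucln_sigma": 1.0,
    "exchangeabilities": 1.0,
    "state_freqs": 1.0,
}
# One sliding window move per branch rate: 2*20 - 2 = 38 of them
for i in range(1, 39):
    weights[f"branch_rates[{i}]"] = 1.0

probs = move_probabilities(weights)
print(f"total weight         = {sum(weights.values()):.0f}")
print(f"node time slider     = {probs['mvNodeTimeSlideUniform']:.4f}")
print(f"each topology move   = {probs['mvNNI']:.4f}")
print(f"each remaining move  = {probs['birth_rate']:.4f}")
```

Even with a weight of 10, the node time slider accounts for well under a quarter of all attempted moves, because the 38 branch rate moves collectively carry most of the total weight.<br />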
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
Be prepared to wait a while longer this time; we've added a lot of extra work to the analysis.<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''2nd question'' {{title|xxx|answer}}<br />
* ''3rd question'' {{title|xxx|answer}}<br />
* ''4th question'' {{title|xxx|answer}}<br />
* ''5th question'' {{title|xxx|answer}}<br />
</div></div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41640Phylogenetics: RevBayes Lab2020-04-21T14:51:27Z<p>Paul Lewis: /* Review results of the divergence time analysis */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and RevBayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.<br />
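One way to see where a value near 2.6 could come from (this is my reconstruction, not necessarily the calculation used to choose it): in a pure birth process, the waiting time while there are k lineages is exponentially distributed with mean 1/(k * birth_rate). The Python sketch below starts with 2 lineages at the root and sums these expected waiting times up to and including the interval with 20 lineages. Including that last interval is an assumption of the sketch.<br />

```python
def expected_yule_height(n_taxa, birth_rate):
    """Expected root-to-tip height of a pure birth tree, summing the expected
    waiting times 1/(k * birth_rate) for k = 2, ..., n_taxa lineages."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))


height = expected_yule_height(20, 2.6)
# Setting birth_rate = 1 gives the rate at which the expected height equals 1
rate_for_unit_height = expected_yule_height(20, 1.0)
print(f"expected height at birth rate 2.6 = {height:.4f}")
print(f"birth rate giving expected height 1 = {rate_for_unit_height:.4f}")
```

Under these assumptions the expected height at a birth rate of 2.6 is within a fraction of a percent of 1.<br />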
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and how often it succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
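For a longer analysis, a batch script along the following lines might be used. This is a minimal sketch, not a tested Xanadu recipe: the file name ''strict.slurm'', the job name, and the output file pattern are hypothetical, and the partition, qos, and module version are simply copied from the interactive commands used earlier in this lab.<br />
<br />
```shell
# Write a hypothetical batch script; nothing is submitted by this block.
cat > strict.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=strict
#SBATCH --partition=mcbstudent
#SBATCH --qos=mcbstudent
#SBATCH --ntasks=1
#SBATCH --output=strict-%j.out

module load RevBayes/1.0.13
rb strict.Rev
EOF

# To submit the job you would then type:  sbatch strict.slurm
```
<br />
The <tt>%j</tt> in the output file name is replaced by the job number, so repeated runs do not overwrite each other's screen output.<br />
<br />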
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
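If Graphviz happens to be installed on the machine you are using (an assumption; it may not be available on Xanadu), the dot file can also be turned into a pdf from the command line. The helper name <tt>render_dag</tt> below is made up for this sketch.<br />
<br />
```shell
# render_dag FILE.dot  ->  writes FILE.pdf if Graphviz's dot program is available
render_dag() {
  if ! command -v dot >/dev/null 2>&1; then
    echo "dot not installed; paste $1 into an online viewer instead"
  elif [ ! -f "$1" ]; then
    echo "no such file: $1"
  else
    # ${1%.dot} strips the .dot suffix so strict.dot becomes strict.pdf
    dot -Tpdf "$1" -o "${1%.dot}.pdf" && echo "wrote ${1%.dot}.pdf"
  fi
}
```
<br />
After the run has produced the dot file, <tt>render_dag strict.dot</tt> would create ''strict.pdf'' in the same directory.<br />
<br />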
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
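Before downloading, you can get a rough posterior mean for any column straight from the command line. The sketch below assumes the log is tab-delimited with a single header row naming the columns (as the <tt>separator = TAB</tt> argument suggests); the function name <tt>col_mean</tt> is hypothetical, and Tracer remains the better tool for HPD intervals and ESS values.<br />
<br />
```shell
# col_mean FILE COLUMN  ->  prints the mean of the named column
col_mean() {
  awk -F'\t' -v name="$2" '
    /^#/ { next }                       # skip comment lines, if any
    col == 0 {                          # first non-comment line is the header
      for (i = 1; i <= NF; i++) if ($i == name) col = i
      if (col == 0) exit 1              # requested column is not in the file
      next
    }
    { sum += $col; n++ }                # accumulate sampled values
    END { if (n > 0) printf "%.4f\n", sum / n }
  ' "$1"
}
```
<br />
For example, <tt>col_mean output/strict.log clock_rate</tt> should print a value close to the mean that Tracer reports for clock_rate.<br />
<br />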
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
<br />
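The exchangeabilities you should expect follow directly from the HKY model used for simulation: the 2 transitions have relative rate kappa, the 4 transversions have relative rate 1, and the 6 exchangeabilities are constrained to sum to 1. The sketch below just does that arithmetic (the function name <tt>exch_from_kappa</tt> is made up for illustration).<br />
<br />
```shell
# exch_from_kappa KAPPA  ->  prints "transition transversion" exchangeabilities
exch_from_kappa() {
  awk -v kappa="$1" 'BEGIN {
    total = 2 * kappa + 4          # 2 transitions at rate kappa, 4 transversions at rate 1
    printf "%.4f %.4f\n", kappa / total, 1 / total
  }'
}

exch_from_kappa 5
```
<br />
With kappa equal to 5, each transition exchangeability is 5/14 and each transversion exchangeability is 1/14, which is what the marginal densities in Tracer should be centered over.<br />
<br />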
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
<br />
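To get a feel for how alpha controls the size of proposed changes, the sketch below assumes that the proposal draws the new simplex from a Dirichlet distribution whose parameters are alpha times the current values (check the RevBayes documentation for mvDirichletSimplex before relying on this). Under that assumption, a component currently equal to x has proposal standard deviation sqrt(x(1-x)/(alpha+1)). The function name <tt>proposal_sd</tt> is hypothetical.<br />
<br />
```shell
# proposal_sd X ALPHA  ->  sd of one component of a Dirichlet(ALPHA * current) draw
proposal_sd() {
  awk -v x="$1" -v alpha="$2" 'BEGIN {
    printf "%.5f\n", sqrt(x * (1 - x) / (alpha + 1))
  }'
}

# Typical proposal size for a transition exchangeability near 0.358
for alpha in 10 100 1000 10000; do
  echo "alpha=$alpha sd=$(proposal_sd 0.358 "$alpha")"
done
```
<br />
Each 100-fold increase in alpha shrinks the proposal standard deviation by roughly a factor of 10, which is why a cap of 100 can leave proposals too bold for very sharp posteriors.<br />
<br />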
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 50 parameters (the original 10 plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
<br />
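The conversion from the log scale to the linear scale uses the standard lognormal formulas: mean = exp(mu + sigma^2/2) and standard deviation = mean * sqrt(exp(sigma^2) - 1). Here is a small sketch of that calculation (the function name <tt>lognormal_moments</tt> is made up for illustration):<br />
<br />
```shell
# lognormal_moments MU SIGMA  ->  prints "mean sd" on the linear scale
lognormal_moments() {
  awk -v mu="$1" -v sigma="$2" 'BEGIN {
    s2   = sigma * sigma
    mean = exp(mu + s2 / 2)            # mean of the lognormal variable
    sd   = mean * sqrt(exp(s2) - 1)    # its standard deviation
    printf "%.4f %.4f\n", mean, sd
  }'
}

# Example: mu = 0 and sigma = 1 on the log scale
lognormal_moments 0 1
```
<br />
Once your relaxed clock analysis has finished, you can plug the posterior means of ucln_mu and ucln_sigma into this function to see what they imply about the branch rates themselves.<br />
<br />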
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
<br />
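To see what these weights mean in practice, recall that a move is selected with probability equal to its weight divided by the sum of all weights. The tally below is my reading of the moves that ''divtime.Rev'' should contain at this point (birth_rate, the 2 UCLN hyperparameters, 38 branch rates, the 2 simplex moves, the node time slider, and 3 topology moves); adjust it if your script differs. The function name <tt>move_prob</tt> is made up for illustration.<br />
<br />
```shell
# move_prob WEIGHT TOTAL  ->  probability that a move with this weight is chosen
move_prob() {
  awk -v w="$1" -v t="$2" 'BEGIN { printf "%.3f\n", w / t }'
}

# Sum of weights: birth_rate + (ucln_mu, ucln_sigma) + 38 branch rates
#                 + (exchangeabilities, state_freqs) + node time slider + 3 topology moves
total=$((1 + 2 + 38 + 2 + 10 + 3 * 5))

echo "sum of weights:            $total"
echo "node time slider:          $(move_prob 10 "$total")"
echo "each topology move:        $(move_prob 5 "$total")"
echo "each weight-1 move:        $(move_prob 1 "$total")"
```
<br />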
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
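The prefix change described above can be made by hand in an editor, or in one pass with sed before you run the script. This is a sketch: the function name <tt>rename_prefix</tt> is hypothetical, and it assumes the prefix appears in your script only in the form "relaxed." followed by a file extension (as in ''output/relaxed.log''). A backup copy with the extension ''.bak'' is kept in case something goes wrong.<br />
<br />
```shell
# rename_prefix SCRIPT OLD NEW  ->  replaces every "OLD." with "NEW." in SCRIPT
rename_prefix() {
  # -i.bak edits the file in place and saves the original as SCRIPT.bak
  sed -i.bak "s/$2\./$3./g" "$1"
}
```
<br />
For example, <tt>rename_prefix divtime.Rev relaxed divtime</tt> followed by <tt>grep divtime divtime.Rev</tt> should list the two output files and the dot file under their new names.<br />
<br />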
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}<br />
* ''2nd question'' {{title|xxx|answer}}<br />
* ''3rd question'' {{title|xxx|answer}}<br />
* ''4th question'' {{title|xxx|answer}}<br />
* ''5th question'' {{title|xxx|answer}}<br />
</div></div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41639Phylogenetics: RevBayes Lab2020-04-21T14:46:04Z<p>Paul Lewis: </p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and revbayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
<br />
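Here is one way to see where a value like 2.6 could come from. This is my own back-of-the-envelope sketch, not something taken from the evolver documentation: it assumes that, in a pure birth process with rate lambda, the waiting time while there are k lineages is exponential with mean 1/(k*lambda), and that the tree runs from 2 lineages at the root until the moment a 21st lineage would arise. The function name <tt>yule_expected_height</tt> is made up for illustration.<br />
<br />
```shell
# yule_expected_height LAMBDA NTAXA  ->  expected root-to-tip height under the assumptions above
yule_expected_height() {
  awk -v lambda="$1" -v n="$2" 'BEGIN {
    h = 0
    for (k = 2; k <= n; k++) h += 1 / (k * lambda)   # mean wait while k lineages exist
    printf "%.4f\n", h
  }'
}

yule_expected_height 2.6 20
```
<br />
Under these assumptions a birth rate of 2.6 gives a 20-taxon tree whose expected height is very close to 1.<br />
<br />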
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
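If you prefer not to paste the tree description by hand, the 9 lines described above can be assembled from the command line. This sketch assumes that ''tree.txt'' contains nothing but the single-line tree description; the function name <tt>make_control</tt> and its arguments are hypothetical.<br />
<br />
```shell
# make_control SEED TREEFILE [OUTFILE]  ->  writes the evolver control file
make_control() {
  out=${3:-control.dat}
  {
    printf '%s\n' "2"                # line 1: nexus output
    printf '%s\n' "$1"               # line 2: random number seed
    printf '%s\n' "20 10000 1"       # line 3: taxa, sites, data sets
    printf '%s\n' "-1"               # line 4: use branch lengths in the tree
    cat "$2"                         # line 5: tree description
    printf '%s\n' "4"                # line 6: HKY model
    printf '%s\n' "5"                # line 7: kappa
    printf '%s\n' "0 0"              # line 8: no rate heterogeneity
    printf '%s\n' "0.1 0.2 0.3 0.4"  # line 9: frequencies of T, C, A, G
  } > "$out"
}
```
<br />
For example, <tt>make_control 12345 tree.txt</tt> would write ''control.dat'' using 12345 as the seed; check the result with <tt>cat control.dat</tt> before running evolver.<br />
<br />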
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also notice that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
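<br />
The effect of alpha on proposal size can be sketched in a few lines of Python. This assumes that the simplex move draws its proposal from a Dirichlet distribution whose parameters are alpha times the current values (an assumption about the move, not something taken from the RevBayes source); the function name <tt>proposal_sd</tt> is hypothetical.<br />
<br />
```python
import math

def proposal_sd(current, alpha):
    # Under Dirichlet(alpha * current), the parameters sum to alpha
    # (because current sums to 1), so component i has mean current[i]
    # and variance current[i] * (1 - current[i]) / (alpha + 1).
    return [math.sqrt(x * (1.0 - x) / (alpha + 1.0)) for x in current]

# State frequencies used when simulating the data
state_freqs = [0.1, 0.2, 0.3, 0.4]
for alpha in (10.0, 100.0, 1000.0, 10000.0):
    sds = proposal_sd(state_freqs, alpha)
    print(alpha, [round(s, 4) for s in sds])
```
<br />
Under this assumption, each tenfold increase in alpha shrinks the typical proposed change by roughly a factor of 3, which is why a larger alpha suits a very sharp posterior.<br />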
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
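<br />
As a small worked example of that relationship, the Python sketch below converts the log-scale parameters mu and sigma into the mean and standard deviation of the lognormally-distributed variable itself, using the standard lognormal formulas. The function name <tt>lognormal_mean_sd</tt> is just an illustrative choice.<br />
<br />
```python
import math

def lognormal_mean_sd(mu, sigma):
    # mu and sigma are the mean and standard deviation of log(X);
    # the returned values are the mean and standard deviation of X
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

# mu = 0 and sigma = 1 on the log scale do NOT give mean 0 and sd 1
mean, sd = lognormal_mean_sd(0.0, 1.0)
print(round(mean, 4), round(sd, 4))
```
<br />
With mu = 0 and sigma = 1, the variable itself has mean about 1.65 and standard deviation about 2.16, which shows how different the two scales can be.<br />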
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div><br />
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:<br />
rb divtime.Rev<br />
<br />
== Review results of the divergence time analysis ==<br />
<br />
Open the ''divtime.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''1st question'' {{title|xxx|answer}}<br />
* ''2nd question'' {{title|xxx|answer}}<br />
* ''3rd question'' {{title|xxx|answer}}<br />
* ''4th question'' {{title|xxx|answer}}<br />
* ''5th question'' {{title|xxx|answer}}<br />
</div></div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41638Phylogenetics: RevBayes Lab2020-04-21T14:42:27Z<p>Paul Lewis: /* Relaxed clocks */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and RevBayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
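<br />
One way to check that 2.6 is a sensible value is the following Python sketch. It assumes that tree height is measured from the root split (2 lineages) to the moment a 21st lineage would arise, and that with k lineages the waiting time to the next speciation is exponential with mean 1/(k × birth rate). The function name <tt>expected_yule_height</tt> and this convention are assumptions made for illustration, not a description of how evolver works internally.<br />
<br />
```python
def expected_yule_height(n_tips, birth_rate):
    # Sum the expected waiting times while there are 2, 3, ..., n_tips lineages
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

print(round(expected_yule_height(20, 2.6), 3))  # prints 0.999
```
<br />
Under that convention the expected height of a 20-taxon tree with birth rate 2.6 is very nearly 1.<br />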
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
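<br />
The weight arithmetic described above can be sketched in Python as follows. The dictionary of moves lists the four moves that ''strict.Rev'' ends up defining, each with weight 1.0; the function name <tt>move_probabilities</tt> is hypothetical and this is only an illustration of the calculation, not RevBayes code.<br />
<br />
```python
def move_probabilities(weights):
    # Probability of choosing a move = its weight / sum of all weights
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

weights = {
    "birth_rate": 1.0,
    "clock_rate": 1.0,
    "exchangeabilities": 1.0,
    "state_freqs": 1.0,
}
for name, prob in move_probabilities(weights).items():
    print(name, prob)
```
<br />
With four equally weighted moves, each is chosen with probability 0.25; raising one weight to, say, 5.0 would make that move 5 times as likely to be chosen as each of the others.<br />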
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
<br />
You may have noticed that our effective sample sizes (ESSs) for the exchangeabilities and state_freqs parameters are pretty low, even though these parameters are being estimated quite precisely and accurately. What gives? The problem is that the data contain so much information about these parameters. Their posterior densities are very "sharp" (low variance), so proposals that move the values very far from the optimum are rejected. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
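<br />
To see why a larger alpha means smaller proposed changes, here is a minimal Python sketch. It assumes (this is an assumption about the move, not something checked against the RevBayes source) that the simplex move draws the proposed vector from a Dirichlet distribution whose parameters are alpha times the current values. Under that assumption, the standard deviation of each proposed component shrinks roughly as one over the square root of alpha. The function name <tt>proposal_sd</tt> is hypothetical.<br />

```python
import math

def proposal_sd(x, alpha):
    """Standard deviation of one component of a Dirichlet(alpha * current) draw.

    x     -- current value of that component (between 0 and 1)
    alpha -- concentration (tuning) parameter; ASSUMED to play this role
    """
    # For Dirichlet parameters a_i = alpha * x_i (which sum to alpha),
    # Var(component i) = x_i * (1 - x_i) / (alpha + 1).
    return math.sqrt(x * (1.0 - x) / (alpha + 1.0))

current = 0.357  # illustrative value for one exchangeability
for alpha in (10, 100, 1000, 10000):
    print(alpha, round(proposal_sd(current, alpha), 5))
```

Each tenfold increase in alpha cuts the typical proposal step by a factor of about 3, which is why a sharply peaked posterior calls for a large alpha.<br />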
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data, so it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. The name distinguishes this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus that there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
branch_rates[i].setValue(1.0)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 50 parameters (the original 10 plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
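<br />
The conversion between the two scales can be sketched in a few lines of Python. This uses the standard lognormal formulas (mean = exp(mu + sigma^2/2), variance = (exp(sigma^2) - 1) exp(2 mu + sigma^2)); the function name <tt>lognormal_mean_sd</tt> and the example values are just illustrations, not part of the lab scripts.<br />

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Convert log-scale (mu, sigma) to the mean and standard deviation
    of the lognormally-distributed variable itself (the linear scale)."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

# Example: mu = 0 and sigma = 1 on the log scale
mean, sd = lognormal_mean_sd(0.0, 1.0)
print(round(mean, 4), round(sd, 4))
```

Note that even with mu = 0 the mean on the linear scale is not exp(0) = 1 unless sigma is also 0; the variance on the log scale pushes the linear-scale mean upward.<br />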
<br />
Note this line:<br />
branch_rates[i].setValue(1.0)<br />
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values, and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence because they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).<br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Divergence times ==<br />
<br />
=== warning: this section is a work in progress ===<br />
<br />
So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.<br />
<br />
In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.<br />
<br />
Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':<br />
<br />
cp relaxed.Rev divtime.Rev<br />
<br />
Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":<br />
# Tree moves<br />
<br />
# Add moves that modify all node times except the root node<br />
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)<br />
<br />
# Add several moves that modify the tree topology<br />
moves[nmoves++] = mvNNI(timetree, weight=5.0)<br />
moves[nmoves++] = mvNarrow(timetree, weight=5.0)<br />
moves[nmoves++] = mvFNPR(timetree, weight=5.0) <br />
<br />
Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.<br />
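<br />
To put numbers on this, here is a small Python sketch that turns move weights into selection probabilities using the rule that each move is chosen with probability equal to its weight divided by the sum of all weights. The list of moves is an assumption about what ''divtime.Rev'' contains at this point (one move each for birth_rate, ucln_mu, ucln_sigma, exchangeabilities, and state_freqs, 38 branch rate moves, plus the 4 tree moves above); adjust it if your script differs.<br />

```python
# Weights of the moves ASSUMED to be defined in divtime.Rev
weights = {
    "birth_rate slide": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}
# One sliding-window move per branch rate (38 branches for 20 taxa)
for i in range(1, 39):
    weights["branch_rates[%d] slide" % i] = 1.0

total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

print("total weight:", total)
print("P(node time slide):", round(probs["node time slide"], 4))
print("P(NNI):", round(probs["NNI"], 4))
print("P(birth_rate slide):", round(probs["birth_rate slide"], 4))
```

With these assumed weights the four tree moves together account for 25 of the 68 units of weight, so a bit more than a third of all proposals involve the tree.<br />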
<br />
Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div></div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41637Phylogenetics: RevBayes Lab2020-04-20T22:44:15Z<p>Paul Lewis: /* Review results of the relaxed clock analysis */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and RevBayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.<br />
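<br />
A quick way to check the value 2.6 is sketched below in Python. It assumes the usual pure birth result that the waiting time while k lineages exist is exponentially distributed with rate k times the birth rate, and that the tree height is the sum of these waiting times for k = 2 through 20 (i.e. the final interval with 20 lineages is included). The function name <tt>expected_yule_height</tt> is hypothetical.<br />

```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root-to-tip height of a pure birth (Yule) tree.

    ASSUMES the height is the sum of the expected waiting times
    1 / (k * birth_rate) for k = 2, 3, ..., n_tips lineages.
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(20, 2.6)
print(round(height, 4))
```

Under these assumptions the expected height for 20 tips and birth rate 2.6 comes out within 1% of 1.0.<br />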
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.<br />
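<br />
To get a feel for how spread out an Exponential prior with rate 0.01 is, here is a short Python sketch using the standard facts that an Exponential(rate) distribution has mean and standard deviation 1/rate and quantile function -ln(1 - p)/rate. The function name <tt>exponential_quantile</tt> is just for illustration.<br />

```python
import math

rate = 0.01  # the rate used for the birth_rate prior

def exponential_quantile(p, rate):
    """Value below which a fraction p of an Exponential(rate) prior lies."""
    return -math.log(1.0 - p) / rate

prior_mean = 1.0 / rate
prior_sd = 1.0 / rate
q95 = exponential_quantile(0.95, rate)

print("prior mean:", prior_mean)
print("prior sd:", prior_sd)
print("95th percentile:", round(q95, 2))
```

A prior mean of 100 is far larger than the birth rate of 2.6 used in the simulation, so this prior places only weak constraints on the parameter.<br />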
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
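<br />
The sliding window proposal itself is simple enough to sketch in Python. This is a generic illustration of the idea described above (a uniform draw from a window of width delta centered on the current value), not RevBayes code; the function name <tt>slide_proposal</tt> is hypothetical.<br />

```python
import random

def slide_proposal(current, delta, rng=random):
    """Propose a new value uniformly from a window of width delta
    centered on the current value."""
    return current + delta * (rng.random() - 0.5)

rng = random.Random(12345)  # fixed seed so the sketch is repeatable
current = 1.0
delta = 1.0
proposals = [slide_proposal(current, delta, rng) for _ in range(5)]
print([round(p, 3) for p in proposals])
```

A large delta proposes bold changes that are often rejected, while a small delta proposes timid changes that are usually accepted but explore slowly; tuning adjusts delta to balance the two. For a parameter with an Exponential prior, any proposed value below zero has prior density zero and is therefore rejected.<br />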
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
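<br />
Because the data were simulated under an HKY model with kappa equal to 5, it helps to know what the 6 GTR exchangeabilities should look like if the analysis recovers the truth. The Python sketch below assumes that the exchangeabilities are normalized to sum to 1 (consistent with giving them a Dirichlet prior), that the 2 transitions share one value, and that the 4 transversions share another. The function name <tt>hky_as_gtr_exchangeabilities</tt> is hypothetical.<br />

```python
def hky_as_gtr_exchangeabilities(kappa):
    """Return (transition, transversion) exchangeabilities, normalized so
    that 2 transitions + 4 transversions sum to 1."""
    total = 2.0 * kappa + 4.0
    return kappa / total, 1.0 / total

ts, tv = hky_as_gtr_exchangeabilities(5.0)
print("each transition:", round(ts, 4))
print("each transversion:", round(tv, 4))
print("ratio:", round(ts / tv, 2))
```

You can compare these expected values with the posterior means you see in Tracer once the analysis has finished.<br />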
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the trees in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306, 4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
<br />
You may have noticed that our effective sample sizes (ESSs) for the exchangeabilities and state_freqs parameters are pretty low, even though these parameters are being estimated quite precisely and accurately. What gives? The problem is that the data contain so much information about these parameters. Their posterior densities are very "sharp" (low variance), so proposals that move the values very far from the optimum are rejected. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is strange in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume; they are the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
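The conversion between the two scales uses the standard lognormal formulas: the mean is exp(mu + sigma^2/2) and the variance is (exp(sigma^2) - 1) times exp(2*mu + sigma^2). Here is a minimal sketch; the function name and the example values of mu and sigma are made up for illustration.<br />

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of X when log(X) ~ Normal(mu, sigma)."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    # sd = mean * sqrt(exp(sigma^2) - 1); expm1 keeps precision when sigma is small
    sd = mean * math.sqrt(math.expm1(sigma ** 2))
    return mean, sd

# Example: mu = 0 and sigma = 0.5 on the log scale
mean, sd = lognormal_mean_sd(0.0, 0.5)
print(round(mean, 4), round(sd, 4))
```

Note that even with mu equal to 0 the mean on the linear scale is greater than 1, because exponentiating stretches the upper tail more than the lower tail.<br />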
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}<br />
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}<br />
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}<br />
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}<br />
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}<br />
</div></div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41636Phylogenetics: RevBayes Lab2020-04-20T22:27:43Z<p>Paul Lewis: /* Reviewing results of the relaxed clock analysis */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and RevBayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
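Here is a small numerical illustration of why rate and time cannot be separated by the likelihood alone. It uses the Jukes-Cantor (JC69) model only because its transition probability has a simple closed form; that choice is an assumption made for this sketch and is not the model simulated in this lab.<br />

```python
import math

def jc69_prob_same(rate, time):
    """JC69 probability that a site has the same state at both ends of an edge.

    Only the product rate * time (expected substitutions per site) enters the formula.
    """
    return 0.25 + 0.75 * math.exp(-4.0 * rate * time / 3.0)

# Two different (rate, time) pairs with the same product
print(jc69_prob_same(1.0, 1.0))
print(jc69_prob_same(2.0, 0.5))
# A pair with a different product gives a different probability
print(jc69_prob_same(1.0, 0.5))
```

The first two lines of output are identical, so data generated on such an edge cannot tell "slow and old" from "fast and young"; only the prior can.<br />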
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
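As a rough check on the value 2.6, the sketch below computes the expected height of a pure birth tree under one particular set of assumptions: the process starts with 2 lineages at the root, the waiting time while there are k lineages is exponential with rate k times the birth rate, and the process is stopped at the speciation event that would create lineage 21. These assumptions are a guess at where 2.6 comes from and are not taken from evolver itself.<br />

```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root-to-tip time for a pure birth tree (assumptions described above).

    With k lineages the waiting time to the next speciation has mean 1 / (k * birth_rate),
    so the expected height is the sum of those means for k = 2, ..., n_tips.
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

print(expected_yule_height(20, 2.6))
```

Under these assumptions the expected height for 20 taxa and birth rate 2.6 comes out very close to 1.<br />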
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda (<tt>birth_rate</tt>).<br />
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
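The two ideas in the paragraph above (selecting a move in proportion to its weight, and proposing from a window of width delta) can be sketched in a few lines. This is a simplified illustration with made-up function names, not RevBayes code; in particular it ignores what happens when a proposal falls below 0 for a parameter that must be positive.<br />

```python
import random

def move_probabilities(weights):
    """Turn move weights into selection probabilities (weight / sum of all weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def slide_proposal(current, delta, rng):
    """Propose a value uniformly from a window of width delta centered on the current value."""
    return current + (rng.random() - 0.5) * delta

# The two moves defined so far, each with weight 1.0
probs = move_probabilities({"birth_rate": 1.0, "clock_rate": 1.0})
print(probs)

# Hypothetical: giving clock_rate weight 3.0 would make it 3 times as likely to be chosen
print(move_probabilities({"birth_rate": 1.0, "clock_rate": 3.0}))

# Five proposals for a parameter currently at 1.0 with delta = 1.0
rng = random.Random(42)
proposals = [slide_proposal(1.0, 1.0, rng) for _ in range(5)]
print(proposals)
```

Every proposal lands within delta/2 of the current value, so tuning delta downward makes the move more conservative and tuning it upward makes the move bolder.<br />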
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the trees saved to ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.<br />
<br />
# MCMC<br />
<br />
mymcmc = mcmc(mymodel, monitors, moves, nruns=1)<br />
mymcmc.burnin(generations=1000, tuningInterval=100)<br />
mymcmc.run(generations=10000)<br />
mymcmc.operatorSummary()<br />
<br />
quit()<br />
<br />
== Run RevBayes ==<br />
<br />
To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:<br />
rb strict.Rev<br />
<br />
If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.<br />
<br />
== Reviewing the strict clock results ==<br />
<br />
First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.<br />
<br />
Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}<br />
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}<br />
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}<br />
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}<br />
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}<br />
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}<br />
</div><br />
<br />
You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), so proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.<br />
<br />
== Relaxed clocks ==<br />
<br />
It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.<br />
<br />
Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':<br />
cp strict.Rev relaxed.Rev<br />
<br />
Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:<br />
<br />
# Uncorrelated Lognormal relaxed clock<br />
<br />
# Add hyperparameters mu and sigma<br />
ucln_mu ~ dnNormal(0.0, 100)<br />
ucln_sigma ~ dnExponential(.01)<br />
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
# Create a vector of stochastic nodes representing branch rate parameters<br />
n_branches <- 2*n_taxa - 2<br />
for(i in 1:n_branches) {<br />
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)<br />
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)<br />
}<br />
<br />
You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")<br />
<br />
[[Image:Lognormal.png|thumb|right]]<br />
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).<br />
<br />
To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is strange in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume; they are the mean and standard deviation of the ''log'' of the lognormally-distributed variable! <br />
<br />
You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
And don't forget to change the name of the dot file:<br />
<br />
mymodel.graph("relaxed.dot", TRUE, "white")<br />
<br />
Now run the new model:<br />
rb relaxed.Rev<br />
<br />
== Review results of the relaxed clock analysis ==<br />
<br />
If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!<br />
<br />
Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_RevBayes_Lab&diff=41635Phylogenetics: RevBayes Lab2020-04-20T22:23:23Z<p>Paul Lewis: /* Relaxed clocks */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]] <br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.<br />
|}<br />
<br />
== Getting started ==<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
Once you are transferred to a free node, load the paml, paup, and RevBayes modules:<br />
module load paml/4.9<br />
module load paup/4.0a-166<br />
module load RevBayes/xxx<br />
<br />
== Create a directory ==<br />
Use the unix <tt>mkdir</tt> command to create a directory to play in today:<br />
cd ~ # you can omit this line if you are already in your home directory<br />
mkdir rblab<br />
<br />
== Simulating and analyzing under the strict clock model ==<br />
<br />
Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.<br />
<br />
We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.<br />
<br />
=== PAML evolver ===<br />
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.<br />
<br />
We will each use a different random number seed, so we should all get slightly different answers.<br />
<br />
==== Simulate a tree ====<br />
<br />
First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):<br />
<br />
* specify that you want to generate a rooted tree by typing 2<br />
* specify 20 species<br />
* specify 1 tree and a random number seed of ''your'' choosing<br />
* specify 1 to answer yes to the question about wanting branch lengths<br />
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height<br />
* press 0 to quit<br />
<br />
One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1. <br />
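<br />
One way to see where a value near 2.6 could come from is the Python sketch below. It rests on an assumption that the lab does not state: the pure birth process starts with 2 lineages at the root and is stopped just before a 21st lineage would arise, so the waiting time with k lineages has mean 1/(k × birth rate). The function name <tt>expected_yule_height</tt> is made up for this illustration.<br />

```python
def expected_yule_height(birth_rate, n_taxa):
    """Expected root-to-tip height of a pure birth (Yule) tree.

    Assumption (not from the lab text): the process starts with 2 lineages
    at the root and stops just before lineage n_taxa + 1 would arise, so
    the waiting time while there are k lineages is exponential with mean
    1 / (k * birth_rate) for k = 2, ..., n_taxa.
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))

height = expected_yule_height(birth_rate=2.6, n_taxa=20)
print(f"expected height with birth rate 2.6 and 20 taxa: {height:.4f}")
```

Under that assumption the expected height is very close to 1, which is consistent with the choice of 2.6 for a 20-taxon tree.<br />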
<br />
You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):<br />
mv evolver.out tree.txt<br />
<br />
==== Simulate sequences ====<br />
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):<br />
<br />
2<br />
seed goes here<br />
20 10000 1<br />
-1<br />
tree description goes here<br />
4 <br />
5<br />
0 0 <br />
0.1 0.2 0.3 0.4<br />
<br />
Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):<br />
* line 1: 2 specifies that we want the output as a nexus file<br />
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)<br />
* line 3: 20 taxa, 10000 sites, 1 data set<br />
* line 4: -1 says to use the branch lengths in the tree description<br />
* line 5: tree description: paste in the tree description you generated from the first evolver run here<br />
* line 6: 4 specifies the HKY model<br />
* line 7: set kappa equal to 5<br />
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)<br />
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)<br />
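<br />
For readers who want to see what lines 6, 7, and 9 of the control file imply, the following Python sketch builds an unnormalized HKY rate matrix from kappa = 5 and the state frequencies above, using evolver's T, C, A, G ordering. It is only an illustration of the model; it is not how evolver itself is implemented, and the function name <tt>hky_rate_matrix</tt> is hypothetical.<br />

```python
# State order follows evolver's convention: T, C, A, G.
states = ["T", "C", "A", "G"]
freqs = {"T": 0.1, "C": 0.2, "A": 0.3, "G": 0.4}
kappa = 5.0

# Transitions are T<->C (pyrimidines) and A<->G (purines);
# every other change is a transversion.
transitions = {("T", "C"), ("C", "T"), ("A", "G"), ("G", "A")}

def hky_rate_matrix(freqs, kappa):
    """Return the unnormalized HKY rate matrix as a dict of dicts."""
    Q = {i: {j: 0.0 for j in states} for i in states}
    for i in states:
        for j in states:
            if i != j:
                scale = kappa if (i, j) in transitions else 1.0
                Q[i][j] = scale * freqs[j]
        # Each diagonal element makes its row sum to zero.
        Q[i][i] = -sum(Q[i][j] for j in states if j != i)
    return Q

Q = hky_rate_matrix(freqs, kappa)
for i in states:
    print(i, "  ".join(f"{Q[i][j]:6.2f}" for j in states))
```

Each off-diagonal rate is proportional to the frequency of the target state, and transitions are kappa times faster than transversions to the same target state.<br />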
<br />
When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:<br />
echo "#nexus" > paupstart<br />
touch paupblock<br />
touch paupend<br />
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.<br />
<br />
Now run evolver with this control file, selecting option 5 ("Simulate nucleotide data sets") from its menu:<br />
evolver 5 control.dat<br />
<br />
If you get ''Error: err tree...'' it means that you did not follow the directions above ;)<br />
<br />
You should now find a file named ''mc.nex'' containing the sequence data.<br />
<br />
== Use RevBayes to estimate the birth rate and clock rate ==<br />
<br />
In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.<br />
<br />
RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!<br />
<br />
=== Set up the tree submodel ===<br />
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):<br />
# Load data and tree<br />
<br />
D <- readDiscreteCharacterData(file="mc.nex")<br />
n_sites <- D.nchar()<br />
<br />
T <- readTrees("tree.txt")[1]<br />
n_taxa <- T.ntips()<br />
taxa <- T.taxa()<br />
<br />
# Initialize move (nmoves) and monitor (nmonitors) counters<br />
<br />
nmoves = 1<br />
nmonitors = 1<br />
<br />
# Birth-death tree model<br />
<br />
death_rate <- 0.0<br />
birth_rate ~ dnExponential(0.01)<br />
birth_rate.setValue(1.0)<br />
diversification := birth_rate - death_rate<br />
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
sampling_fraction <- 1.0<br />
root_time <- T.rootAge()<br />
timetree ~ dnBDP(lambda = birth_rate, <br />
mu = death_rate, <br />
rho = sampling_fraction, <br />
rootAge = root_time, <br />
samplingStrategy = "uniform", <br />
condition = "nTaxa", <br />
taxa = taxa)<br />
timetree.setValue(T)<br />
<br />
Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).<br />
<br />
The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.<br />
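<br />
To get a feel for how vague an Exponential prior with rate 0.01 is, the short Python sketch below evaluates its density at a few values. The values chosen for evaluation are arbitrary and serve only as an illustration.<br />

```python
import math

def exponential_pdf(x, rate):
    """Density of an Exponential(rate) distribution at x."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

rate = 0.01
prior_mean = 1.0 / rate  # the mean of an Exponential distribution is 1/rate

print(f"prior mean = {prior_mean:.1f}")
for x in (0.1, 1.0, 2.6, 10.0, 100.0):
    print(f"x = {x:6.1f}   density = {exponential_pdf(x, rate):.5f}")
```

The density changes very little between 0.1 and 10, so over that range the prior is nearly flat and the data should dominate the posterior.<br />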
<br />
The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary. <br />
<br />
Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, so the first move would have been stored at index 2 rather than index 1. That would be incorrect because vector indices in the Rev language, as in R, start at 1, not 0. <br />
<br />
Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.<br />
<br />
The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:<br />
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.<br />
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.<br />
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).<br />
<br />
I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.<br />
<br />
=== Set up the strict clock submodel ===<br />
<br />
Add the following 3 lines to your growing revscript:<br />
<br />
# Strict clock<br />
<br />
clock_rate ~ dnExponential(0.01)<br />
clock_rate.setValue(1.0) <br />
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)<br />
<br />
This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.<br />
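<br />
The two ideas in this paragraph (a window of width delta centered on the current value, and move selection in proportion to weight) can be sketched in Python as follows. This is a simplified illustration, not RevBayes code: the function names are invented, and the weights dictionary simply assumes that all four moves defined in this script keep weight 1.0.<br />

```python
import random

def slide_proposal(current, delta, rng):
    """Propose a value drawn uniformly from a window of width delta
    centered on the current value."""
    return current + delta * (rng.random() - 0.5)

def selection_probabilities(weights):
    """Convert move weights into the probability of choosing each move."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

rng = random.Random(12345)  # arbitrary seed, for reproducibility only

# Five proposals starting from the value 1.0 with delta = 1.0.
proposals = [slide_proposal(1.0, delta=1.0, rng=rng) for _ in range(5)]
print([round(p, 3) for p in proposals])

# Assumed weights: one entry per move defined in this script.
weights = {
    "birth_rate": 1.0,
    "clock_rate": 1.0,
    "exchangeabilities": 1.0,
    "state_freqs": 1.0,
}
probs = selection_probabilities(weights)
print(probs)
```

With equal weights each move is chosen with probability 0.25. Tuning changes <tt>delta</tt>: a narrower window raises the acceptance rate, and a wider one lowers it.<br />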
<br />
=== Set up the substitution submodel ===<br />
<br />
Now let's set up a GTR substitution model:<br />
<br />
# GTR model<br />
<br />
state_freqs ~ dnDirichlet(v(1,1,1,1))<br />
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))<br />
Q := fnGTR(exchangeabilities, state_freqs)<br />
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)<br />
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)<br />
<br />
The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).<br />
<br />
I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.<br />
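<br />
A simplex proposal of this kind can be sketched in Python as below. The sketch assumes the common scheme in which the proposed value is drawn from a Dirichlet distribution whose parameters are <tt>alpha</tt> times the current value; see the RevBayes documentation for the exact behavior of <tt>mvDirichletSimplex</tt>. The function names are hypothetical.<br />

```python
import random

def dirichlet_sample(params, rng):
    """Draw one sample from a Dirichlet distribution by normalizing
    independent Gamma(param, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def dirichlet_simplex_proposal(current, alpha, rng):
    """Propose a new simplex from Dirichlet(alpha * current).

    Assumption: larger alpha concentrates proposals near the current
    value (a more conservative move); smaller alpha makes bolder moves.
    """
    return dirichlet_sample([alpha * x for x in current], rng)

rng = random.Random(2020)  # arbitrary seed, for reproducibility only
current = [0.1, 0.2, 0.3, 0.4]
proposed = dirichlet_simplex_proposal(current, alpha=10.0, rng=rng)

print("current :", current)
print("proposed:", [round(x, 3) for x in proposed])
print("sum     :", round(sum(proposed), 6))
```

All elements change at once, yet the proposed values still sum to 1. A complete Metropolis-Hastings move would also need the Hastings ratio, because this proposal is not symmetric; that part is omitted here.<br />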
<br />
=== Finalize the PhyloCTMC ===<br />
<br />
It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.<br />
<br />
# PhyloCTMC<br />
<br />
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")<br />
phySeq.clamp(D)<br />
mymodel = model(exchangeabilities)<br />
mymodel.graph("strict.dot", TRUE, "white")<br />
<br />
The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.<br />
<br />
The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.<br />
<br />
=== Set up monitors ===<br />
<br />
Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:<br />
<br />
# Monitors<br />
<br />
monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB) <br />
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)<br />
monitors[nmonitors++] = mnScreen(printgen=100)<br />
<br />
The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.<br />
<br />
Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.<br />
<br />
=== Set up MCMC ===<br />
<br />
Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves wa