Phylogenetics: BEAST2 Lab
|EEB 5349: Phylogenetics|
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the strict clock, random local clocks or the uncorrelated lognormal model fits the data best using estimates of the log marginal likelihood.|
- 1 Getting started
- 1.1 (q)login to the cluster
- 1.2 Create a directory
- 1.3 Download and save the data file
- 1.4 Download BEAST2
- 1.5 Divergence Dating Tutorial
- 1.6 Marginal Likelihood Estimation
- 1.7 Further Information (no need to read this for today's lab)
(q)login to the cluster
Login to the Bioinformatics Facility Dell cluster:
Then use the qlogin command to find a free node:
Once you are transferred to a free node, type
module load java/1.8.0 module load beast/2.5.0
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.
Create a directory
Use the unix mkdir command to create a directory to play in today:
mkdir beastlab cd beastlab
Download and save the data file
The tutorial uses a data file named primate-mtDNA.nex that is located on the cluster. Here's how to copy it to your beastlab directory:
cp $BEAST/examples/nexus/primate-mtDNA.nex .
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory).
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. Because BEAUTi is graphical, it is difficult to run on the cluster, so please download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:
(Note that you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)
Divergence Dating Tutorial
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2.
Important: read before starting the tutorial
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:
- strict clock (set priors to what tutorial advises)
- uncorrelated lognormal relaxed clock (set priors to what tutorial advises, except use Gamma with shape=0.001, scale=1000 prior for ucldMean.cclock)
- random local clocks (set priors to what tutorial advises, and use Poisson with Lambda=0.5 for RRateChanges.cclock prior)
It is important that you pay attention and do not simply plow through section 2.3 at full tilt because then everyone will end up doing only the strict clock analysis!
Here is the tutorial
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:
Marginal Likelihood Estimation
Steps for setting up your XML file for steppingstone
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:
- Search for "<run" and change it to "<mcmc"
- Search for "</run>" and change it to "</mcmc>"
- Before "<mcmc", add the following lines
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' deleteOldLogs='true' rootdir='ss'> cd $(dir) java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml
- After "</mcmc>" add the following line:
The sections below show what the file looks like before and after these modifications.
XML file before modification
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <beast ...> . . . <run id="mcmc" spec="MCMC" chainLength="6000000"> . . . </run> </beast>
XML file after modification
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <beast ...> . . . <run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' deleteOldLogs='true' rootdir='ss'> cd $(dir) java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <mcmc id="mcmc" spec="MCMC" chainLength="6000000"> . . . </mcmc> </run> </beast>
The steppingstone method (called path sampling in beast documentation) requires samples from a series of MCMC analysis, each exploring a slightly different distribution (and, strangely, none of these equals the posterior distribution!). Each distribution represents a power posterior: a distribution like the posterior except that the likelihood is raised to a power between 0.0 and 1.0. Each power posterior encloses the next one, like matryoshka dolls, allowing accurate estimate of the ratio of the areas of that pair of distributions. Multiplying successive ratios results in cancellation of the numerator of one with the denominator of the next, so that the end result is an estimate of the ratio of the area of the posterior kernel to the area of the prior. Because the area of the prior is 1.0, this overall ratio equals the marginal likelihood.
The nrOfSteps run attribute specifies the number of these MCMC analyses to perform, and the alpha run attribute determines the spacing of the distributions. The XML file you created with BEAUTi specifies everything needed for each component MCMC analysis, so our modifications to the XML file mainly involve wrapping the original run in a shell that causes it to be executed nrOfSteps times. The PathSampler module specified in the spec run attribute handles putting the results together to obtain an estimate of the marginal likelihood.
The only run attributes I haven't mentioned are deleteOldLogs and rootdir. The deleteOldLogs attribute comes into play if you try to repeat an analysis. The PathSampling module will not start an analysis if old log files are still lying around unless you've told it that it is okay to delete old logs. The rootdir attribute specifies the directory used to store the results of each separate MCMC analysis. These have to be stored somewhere until all are finished. Here I've had you specify rootdir=ss, which causes an ss folder to appear inside your beastlab folder. Inside that ss folder you should see nrOfSteps folders appear, numbered step0, step1, ..., step<nrOfSteps-1>.
Running your XML file
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.
Further Information (no need to read this for today's lab)
Read these section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.
Marginal likelihood estimation on your own laptop
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's File > Manage Packages... menu item. The packages are normally installed in the .beast directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).
For reference, here are all the options that you can include in the <run ...> xml tag. We only used the first two in this lab and let the defaults apply to all other options. Note that the term particles is synonymous with steppingstones or ratios.
- alpha: alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.
- nrOfSteps: the number of steps to use, default 8
- rootdir: root directory for storing particle states and log files (default /tmp)
- mcmc: MCMC analysis used to specify model and operations in each of the particles
- chainLength: number of sample to run a chain for a single step (default 100000L)
- burnInPercentage: burn-In Percentage used for analysing log files (default 50)
- preBurnin: number of samples that are discarded for the first step, but not the others (default 100000)
- value: script for launching a job:
- $(dir) is replaced by the directory associated with the particle]
- $(java.class.path) is replaced by a java class path used to launch this application
- $(java.library.path) is replaced by a java library path used to launch this application
- $(seed) is replaced by a random number seed that differs with every launch
- $(host) is replaced by a host from the list of hosts
- hosts: comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed
- doNotRun: Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false)
- deleteOldLogs: delete existing log files from root dir (default false)
- posterior2prior: whether to do steps from posterior to prior or the other way around. Going from posterior to prior is biased towards over estimates, while from prior to posterior the ML estimate is biased towards under estimates (default true)