Phylogenetics: Phycas Lab

From EEBedia
Revision as of 01:17, 5 April 2009 by PaulLewis (Talk | contribs) (While the MCMC analysis is running...)

Jump to: navigation, search
Adiantum.png EEB 349: Phylogenetics
The goal of this lab is to introduce you to some of the features of Phycas, which is a Bayesian phylogenetic analysis program that I am co-authoring with Mark Holder (University of Kansas) and David Swofford (Duke University). Phycas is a juvenile compared to MrBayes and Beast, but it nevertheless allows you to do some things that neither of those programs do. We will use Phycas for two of these today: (1) perform MCMC analyses that allow polytomies; and (2) obtain accurate estimates of marginal likelihoods that can be used to compute Bayes Factors.

Downloading Phycas

The home page for Phycas is phycas.org (this URL actually points to a section of the EEB department's web site, so you will see the address change to one involving hydrodictyon.eeb.uconn.edu as it loads). In the download section of that web site, you will see links for the Windows version, the Mac version and the PDF manual.

Installing and starting Phycas in interactive mode


Mac logo.jpg
If you have a Mac, you will end up downloading a dmg file that contains the Mac Application Phycas.app (you may not see the .app extension depending whether "Show all file extensions" is checked in your Finder Advanced preferences). Just drag Phycas.app somewhere on your hard drive (the Applications folder is the usual place for such files). Once you have done this, first close the folder that opened when you double-clicked the dmg file and eject the disk image. Now drag Phycas.app onto your dock to make it easy to access during this lab. The reason I asked you to eject your disk image first is so that you do not accidentally drag the Phycas icon there onto your dock (if you do this, confusion will later ensue because the disk image will not be mounted after you reboot).

To start using Phycas interactively, simply click the Phycas icon on the dock. This starts a special version of the iTerm program (called iTerm4Phycas) and, after loading the Phycas libraries, presents you with the python prompt (>>>). We admit that it is somewhat confusing to bundle iTerm with Phycas: for example, all the menu items are iTerm menu items and have nothing to do with Phycas! Phycas is an extension of Python, so you invoke Phycas' capabilities by issuing Python commands at the command prompt, not by using the menus. Due to a glitch in iTerm, you may wish to issue a couple of Command+U commands using your keyboard to eliminate transparency in the background of the iTerm window.


Win logo.png
If you have Windows, download the appropriate installer (according to your version of Python) and run it (right-click and choose "Run as Administrator" if you are using Windows Vista). (If you don't remember which version of Python you have, open a command prompt window and type python -V.) The installer will first look to see whether you have the correct version of Python installed, and then will install the Phycas code in your Python site-packages directory. You should see a Phycas item appear on your All Programs list (visible when you click the Start button).

To start using Phycas interactively, open a command prompt window and type python to start python. Now type

from phycas import *

to import everything Phycas can do into Python. You are now ready to begin issuing Phycas-specific Python commands.

Finding help

Most of the time, you will use Phycas in batch mode: that is, you will create a file containing Python code and simply run that file the way you run any Python script (e.g. python myscript.py). It is handy, however, to run Phycas interactively because the online help facility can be consulted as you build up your script.

First, to see what major function Phycas supports, type the following. Note that in all discussions of using Phycas interactively, I will include the Python prompt (>>>), but you should not type the three greater-than symbols:

>>> help()

In the output that follows, you should see some part that looks like this:

The currently implemented Phycas commands are:
 
like                           randomtree
mcmc                           sim
model                          sump
ps                             sumt

For any of these commands, you can get more information about the command by typing the name of the command followed by a period and the word help. For example,

model.help

produces this output (just the first part of the output is shown:

Help entry for <phycas.Phycas.Model.Model object at 0x746450>:
model
Defines a substitution model.

Available input options:
Attribute                      Explanation
============================== =================================================
type                           Can be 'jc', 'hky' or 'gtr'

update_relrates_separately     If True, GTR relative rates will be individually
                               updated using slice sampling; if False, they will
                               be updated jointly using a Metropolis-Hastings
                               move (generally both faster and better).  

relrate_prior                  The joint prior distribution for all six GTR
                               relative rate parameters. Used only if
                               update_relrates_separately is False. 

relrate_param_prior            The prior distribution for individual GTR
                               relative rate parameters.  Used only if
                               update_relrates_separately is true.

relrates                       The current values for GTR relative rates. These
                               should be specified in this order: A<->C, A<->G,
                               A<->T, C<->G, C<->T, G<->T.
...

To skip the explanation of each setting and instead see a concise table of the current values for each setting, try this:

model.curr

This produces the following output:

Current model input settings:

Attribute                      Current Value
============================== =================================================
type                           'hky'
update_relrates_separately     True
relrate_prior                  Dirichlet((1.00000, 1.00000, 1.00000, 1.00000,
                               1.00000, 1.00000))
relrate_param_prior            Exponential(1.00000)
relrates                       [1.0, 4.0, 1.0, 1.0, 4.0, 1.0]
fix_relrates                   False
kappa_prior                    Exponential(1.00000)
kappa                          4.0
fix_kappa                      False
num_rates                      1
gamma_shape_prior              Exponential(1.00000)
gamma_shape                    0.5
fix_shape                      False
use_inverse_shape              False
pinvar_model                   False
pinvar_prior                   Beta(1.00000, 1.00000)
pinvar                         0.2
fix_pinvar                     False
update_freqs_separately        True
state_freq_prior               Dirichlet((1.00000, 1.00000, 1.00000, 1.00000))
state_freq_param_prior         Exponential(1.00000)
state_freqs                    [1.0, 1.0, 1.0, 1.0]
fix_freqs                      False
edgelen_hyperprior             InverseGamma(2.10000, 0.90909)
fix_edgelen_hyperparam         False
edgelen_hyperparam             0.05
internal_edgelen_prior         Exponential(2.00000)
external_edgelen_prior         Exponential(2.00000)
edgelen_prior                  None 
fix_edgelens                   False
use_flex_model                 False
flex_ncat_move_weight          1
flex_num_spacers               1
flex_phi                       0.25
flex_L                         1.0
flex_lambda                    1.0
flex_prob_param_prior          Exponential(1.00000)
============================== =================================================

Simulating data on a polytomous tree

One strength of Phycas is that it allows topology priors that allow for the possibility that there are polytomies in the true tree. Instead of analyzing actual sequence data, you will simulate a data set using Phycas using a polytomous model tree. You will then analyze the data in Phycas under a topology prior that either allows or disallows polytomies to see if the "star tree paradox" is evident. Each person will choose their own pseudorandom number seed, so at the end we should have results from several different simulation replicates for comparison.

The first step is to create a Python script that will simulate the data. Create a new file named sim.py and copy the following text into it:

from phycas import *
setMasterSeed(13579)
sim.taxon_labels = ['A','B','C','D','E','F','G','H','I','J','K','L','M']
sim.tree_source = TreeCollection(newick='((1,(2,3)),(4,5,(6,7)),(8,(9,10)),(11,(12,13)))')
sim.nchar = 10000
sim.file_name = 'simulated.nex'
sim()

This file accomplishes the following:

  • imports the Phycas module into Python
  • sets the global pseudorandom number seed (please choose your own seed so we don't all end up with identical results)
  • specifies the taxon labels
  • specifies the tree topology that we wish to use as the model tree (the numbers in this tree description correspond to the labels in the taxon_labels list)
  • specifies the number of sites to generate
  • specifies the name of the file to generate
  • says to "Make it so, number 1" (i.e., actually perform the simulation)

Since we did not specify the model, it will use the default model (use the model.curr command to see all the current settings for the model).

Run this file by (Mac users) dragging it from a finder window onto the Phycas icon in your dock or (Windows users) opening a console window in the folder where sim.py is located and typing python sim.py.

Note to Mac users: you should always quit Phycas completely before dragging a new file onto its icon. To quit it completely, click on the icon to bring it to the forefront, then choose "Quit iTerm" from the iTerm4Phycas menu or press Command-Q using the keyboard.

Performing a standard MCMC analysis

Create a second file named standardmcmc.py and copy the following text into it:

from phycas import *

setMasterSeed(13579)

mcmc.data_source = 'simulated.nex'
mcmc.ncycles = 2000
mcmc.sample_every = 1
mcmc.allow_polytomies = False
mcmc()

This file accomplishes the following:

  • the first line imports the Phycas module into Python
  • setMasterSeed sets the seed used by the MCMC analysis
  • mcmc.data_source specifies the data file name
  • mcmc.ncycles specifies the number of update cycles (a cycle in Phycas is analogous to, but quite different than, a generation in MrBayes --- more on this later)
  • mcmc.sample_every specifies the number of cycles before the next sample is saved (in this case, a sample is saved after every cycles, which will result in 2000 lines in the parameter file params.p and 2000 lines in the tree file trees.t)
  • mcmc.allow_polytomies specifies whether or not polytomies will be allowed (the answer is no here, but we will do a second analysis later that allows them)
  • mcmc() says to start the MCMC analysis

While the MCMC analysis is running...

After an initial dump of information about parameters and their priors, Phycas quiets down and only outputs a summary line like the following every 100 cycles:

cycle = 100, lnL = -89198.17244 (20.79998 secs)

The time shown is the elapsed time in seconds. Periodically, Phycas adjusts the efficiency of its slice samplers, and after each such adjustment outputs a summary like the following:

Slice sampler diagnostics:
  mode=0.20992, avgevals=6.730 (edge length hyperparameter)
  mode=0.11629, avgevals=7.850 (edge length hyperparameter)
  mode=3.94956, avgevals=6.620 (trs/trv rate ratio)
  mode=1.08247, avgevals=9.080 (base freq. A)
  mode=1.10869, avgevals=8.770 (base freq. C)
  mode=1.07547, avgevals=8.580 (base freq. G)
  mode=1.09889, avgevals=8.830 (base freq. T)

For each parameter (or hyperparameter) being updated by a slice sampler, the mode and average number of evaluations per update is reported. The mode is simply that value of the parameter associated with the highest posterior density during the time since the slice samplers were last adjusted. This gives you a rough idea of the current location of the parameter. The avgevals value is the average number of times the posterior density needed to be computed in order to obtain a valid new sample of the parameter.

You might have noticed that the base frequencies seem to be out of bounds (all four modes are greater than 1). Base frequency parameters in Phycas are by default left unnormalized unless they are being used in computing the likelihood. This makes it easier for the base frequencies to be independently updated by slice sampling: the slice sampler updating the frequency of A does not have to coordinate with the other three slice samplers to ensure that the sum of the base frequencies is 1.0.

Two of the entries above represent edge length hyperparameters. Rather than placing a prior directly on edge length parameters, Phycas creates two new hyperparameters that represent the prior means of branch lengths. One governs the prior mean of the internal edge lengths, the other governs the prior mean of the external edge lengths. Phycas allows you to override this default behavior if you wish and place a prior directly on either group of edge lengths.

At the end of the run, Phycas tells you how many times the likelihood function was evaluated and how long it took:

283250 likelihood evaluations in 364.22067 seconds
  = 777.68788 likelihood evaluations/sec

Summarizing trees