Phylogenetics: HyPhy Lab

From EEBedia
Revision as of 18:16, 24 February 2020 by Paul Lewis (Talk | contribs)

Jump to: navigation, search
Adiantum.png EEB 349: Phylogenetics


The goal of this lab exercise is to show you how to use the HyPhy program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.

Login to Xanadu

Login to Xanadu and request a machine as usual:

srun --pty -p mcbstudent --qos=mcbstudent bash

Loading modules needed

Load the module needed for this exercise:

module load hyphy/2.3.11
module load paup/4.0a-166

The goal of this lab

In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:

RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531]

In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.

The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.

This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine.

Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.

Creating the HyPhy batch file

HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named in a directory called bats (you can of course use any name you like, but these are the names I will be using). Also create a directory simdata inside your bats directory.

Download the data we will use for this analysis using curl:

cd ~/bats
curl -LO

This is what your directory structure should look like at this point:


Open in your favorite text editor (e.g. nano) and enter the following text:


| Read in the data and store the result in the variable nucleotideSequences.

DataSet nucleotideSequences = ReadDataFile("irbp.nex");

| Filter the data, specifying which sites are to be used. The first 1 means treat each 
| site separately (3 would cause the data to be interpreted as codons). The quoted 
| string (last argument) specifies which sites (where first site = 0) to use (we are 
| excluding sites with a lot of missing data and or alignment issues). This leaves us with
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine
| exactly which sites were excluded in the original study.

DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");

| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means
| unit=1, atom=1, position-specific=1. These settings create one global set of 
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, 
| which is not what we need because we will not be using a codon model).

HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);
fprintf(stdout, observedFreqs);

=== Run

Run your file as follows:


You should see the empirical nucleotide frequencies displayed as follows:


I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (/* comment */) or begin the line with two slashes (// comment). The three lines at the beginning will be explained when each becomes relevant.

The fprintf command also mimics the C programming language. It allows you to print objects to stdout (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable observedFreqs.

=== Define the HKY85 rate matrix

Add the following to the bottom of your file:

| Define the KHY substitution matrix. '*' is used for the diagonal elements that can be
|.computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is 
| declared to be global, meaning it is shared by all edges.
global kappa = 4.224618;
HKY85RateMatrix = 
		{{     *      ,    beta    ,    beta*kappa  ,     beta    }
		 {    beta    ,     *      ,       beta     ,  beta*kappa }
		 { beta*kappa ,    beta    ,         *      ,     beta    }
		 {    beta    , beta*kappa ,       beta     ,       *     }};
fprintf(stdout, HKY85RateMatrix);

Run in HyPhy. Did the HKY85RateMatrix variable have the value you expected? Why or why not? (Hint: the variable beta was initialized to zero.) Set beta = 1; before the fprintf statement and run the batch file again to see the effect. Does the output make sense now?

=== Combine frequencies with rate matrix to create an HKY85 model

| Define the HKY85 model, by combining the substitution matrix with the vector of 
| empirical frequencies. 
Model hky1 = (HKY85RateMatrix, observedFreqs);

Now we have a model variable (hky1) that can be applied to each edge of a tree.

=== Create a tree representing the null hypothesis

The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.

| Define the tree variable, using the tree description read from the data file.
| By default, the last defined model (hky1) is assigned to all edges of the tree. 
constrainedTopology = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0):0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";
Tree constrainedTree = constrainedTopology;