Phylogenetics: HyPhy Lab

From EEBedia
Revision as of 20:54, 24 February 2020 by Paul Lewis (Talk | contribs) (Creating the HyPhy batch file)

Jump to: navigation, search
Adiantum.png EEB 349: Phylogenetics

Goal

The goal of this lab exercise is to show you how to use the HyPhy program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.

Login to Xanadu

Login to Xanadu and request a machine as usual:

srun --pty -p mcbstudent --qos=mcbstudent bash

Loading modules needed

Load the module needed for this exercise:

module load hyphy/2.5.4
module load paup/4.0a-166
module load python/2.7.8

The goal of this lab

In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:

RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]

In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.

The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.

This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine.

Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.

Creating the HyPhy batch file

HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named bats.bf in a directory called bats (you can of course use any name you like, but these are the names I will be using). Also create a directory simdata inside your bats directory.

Download the data we will use for this analysis using curl:

cd ~/bats
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex

This is what your directory structure should look like at this point:

bats/
    simdata/
    bats.bf
    irbp.nex

Open bats.bf in your favorite text editor (e.g. nano) and enter the following text:

AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1;  
ACCEPT_ROOTED_TREES = TRUE;  
ASSUME_REVERSIBLE_MODELS = -1;

/*****************************************************************************************
| Read in the data and store the result in the variable nucleotideSequences.
*/
DataSet nucleotideSequences = ReadDataFile("irbp.nex");

/*****************************************************************************************
| Filter the data, specifying which sites are to be used. The first 1 means treat each 
| site separately (3 would cause the data to be interpreted as codons). The quoted 
| string (last argument) specifies which sites (where first site = 0) to use (we are 
| excluding sites with a lot of missing data and or alignment issues). This leaves us with
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine
| exactly which sites were excluded in the original study.
*/
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");

/*****************************************************************************************
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means
| unit=1, atom=1, position-specific=1. These settings create one global set of 
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, 
| which is not what we need because we will not be using a codon model).
*/
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);
fprintf(stdout, observedFreqs);

Run bats.bf

Run your bats.bf file as follows:

hyphy bats.bf

You should see the empirical nucleotide frequencies displayed as follows:

{
{0.1936054742803209}
{0.3104845052697813}
{0.3106418121755545}
{0.1852682082743432}
}

I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (/* comment */) or begin the line with two slashes (// comment). The three lines at the beginning will be explained when each becomes relevant.

The fprintf command also mimics the C programming language. It allows you to print objects to stdout (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable observedFreqs.

Define the HKY85 rate matrix

Add the following to the bottom of your bats.bf file:

/*****************************************************************************************
| Define the KHY substitution matrix. '*' is used for the diagonal elements that can be
|.computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is 
| declared to be global, meaning it is shared by all edges.
*/
global kappa = 4.224618;
HKY85RateMatrix = 
		{{     *      ,    betat    ,    betat*kappa  ,     betat    }
		 {    betat    ,     *      ,       betat     ,  betat*kappa }
		 { betat*kappa ,    betat    ,         *      ,     betat    }
		 {    betat    , betat*kappa ,       betat     ,       *     }};
fprintf(stdout, HKY85RateMatrix);

Run bats.bf in HyPhy. Did the HKY85RateMatrix variable have the value you expected? Why or why not? (Hint: the variable betat was initialized to zero.) Set betat = 1; before the fprintf statement and run the batch file again to see the effect. Does the output make sense now?

Combine frequencies with rate matrix to create an HKY85 model

/*****************************************************************************************
| Define the HKY85 model, by combining the substitution matrix with the vector of 
| empirical frequencies. 
*/
Model hky1 = (HKY85RateMatrix, observedFreqs);

Now we have a model variable (hky1) that can be applied to each edge of a tree.

Create a tree representing the null hypothesis

The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.

/*****************************************************************************************
| Define the tree variable, using the tree description read from the data file.
| By default, the last defined model (hky1) is assigned to all edges of the tree. 
*/
Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";;

Creating the second model

Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the hky1 model has been applied to every node in the tree. You can verify this using the following HBL code:

GetInformation(modelMap, constrainedTree);
fprintf(stdout, modelMap);

Add these two lines to the bottom of your bats.bf file and run it. You should see this near the bottom of your output:

{
 "Bos":"hky1",
 "Cynocephalus":"hky1",
 "Didelphis":"hky1",
 "Felis":"hky1",
 "Galago":"hky1",
 "Homo":"hky1",
 "Mus":"hky1",
 "Node1":"hky1",
 "Node10":"hky1",
 "Node11":"hky1",
 "Node14":"hky1",
 "Node19":"hky1",
 "Node2":"hky1",
 "Node3":"hky1",
 "Node4":"hky1",
 "Node5":"hky1",
 "Node7":"hky1",
 "Oryctolagus":"hky1",
 "Pteropus":"hky1",
 "Tarsius":"hky1",
 "Tonatia_bidens":"hky1",
 "Tonatia_silvicola":"hky1",
 "Tupaia":"hky1"
}

First, create a second model named hky2 and apply it to four specific edges in the tree:

/*****************************************************************************************
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. 
*/
highATfreqs = {{.45}{.05}{.05}{.45}};
Model hky2 = (HKY85RateMatrix, highATfreqs);
SetParameter(constrainedTree.microbats, MODEL, hky2);
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);
SetParameter(constrainedTree.Pteropus, MODEL, hky2);

If you move your GetInformation call and its accompanying fprintf after these 6 lines, then you should see this in the output:

{
 "Bos":"hky1",
 "Cynocephalus":"hky1",
 "Didelphis":"hky1",
 "Felis":"hky1",
 "Galago":"hky1",
 "Homo":"hky1",
 "Mus":"hky1",
 "Node1":"hky1",
 "Node10":"hky1",
 "Node11":"hky1",
 "Node14":"hky1",
 "Node2":"hky1",
 "Node3":"hky1",
 "Node4":"hky1",
 "Node5":"hky1",
 "Node7":"hky1",
 "Oryctolagus":"hky1",
 "Pteropus":"hky2",
 "Tarsius":"hky1",
 "Tonatia_bidens":"hky2",
 "Tonatia_silvicola":"hky2",
 "Tupaia":"hky1",
 "microbats":"hky2"
}

Don't get too comfortable just yet

It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:

fprintf(stdout, "\n\nCurrent edge lengths:\n");
edgeNames 	= BranchName(constrainedTree,-1); 
edgeLengths	= BranchLength(constrainedTree,-1);
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {
	fprintf(stdout, Format(edgeLengths[k],10,5), "  ", edgeNames[k], "\n");
}

The first line simply skips a couple of lines (\n\n) and prints a header announcing that current edge lengths will follow.

The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The -1 in these functions is somewhat obscure, but means "give me all of them". (If you used 5 rather than -1, it would give you the name of the edge having index 5.)

The last 3 lines are a loop in which the variable k ranges from 0 to the number of edges minus 1. For each value of k, the fprintf statement prints out the length of edge k (the Format command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge k. The newline character ("\n") at the end of the fprintf statement causes a carriage return so that the edge lengths and names do not all end up on the same line of output.

After running bats.bf, what possibly important detail do you notice about the lengths of the edges to which we attached the hky2 model?

Fixing up edge lengths

The problem is that applying a new model to the 4 edge lengths caused the edge length information to be erased! We need to now set the betat parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1 that we placed at the top of the file saves us from having to go through this procedure for all the hky1 edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.

Recall that, under the JC69 model, the edge length equals v = 3*beta*t. The betat parameter that appears in our HKY85RateMatrix is the beta*t part, so to set the betat parameter we need to divide the desired edge length by 3: that is, betat = v/3. The scaling factor 3 becomes more complex for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to betat values.

betat = 1.0;
scalingFactor = 0.0;
for (n1 = 0; n1 < 4; n1 = n1+1) {
    for (n2 = 0; n2 < 4; n2 = n2+1) {
        if (n2!=n1) {
            scalingFactor = scalingFactor + highATfreqs[n1]*highATfreqs[n2]*HKY85RateMatrix[n1][n2];
        }
    }
}

The two nested loops visit every off-diagonal element of the HKY85RateMatrix, multiply that element by the base frequency of its row and the base frequency of its column. This product of 3 terms is then added to the growing sum scalingFactor. Review your lecture notes if you don't remember why scalingFactor is computed in this way. Note that it is important to set betat = 1 before the loop in order to ensure that the scaling factor works correctly.

Now we simply need to set the betat values for the four edges of interest:

constrainedTree.microbats.betat           := 0.097851/scalingFactor;
constrainedTree.Tonatia_bidens.betat   := 0.008252/scalingFactor;
constrainedTree.Tonatia_silvicola.betat := 0.000000/scalingFactor;
constrainedTree.Pteropus.betat             := 0.104663/scalingFactor;

Place the loop that shows edge lengths after these 4 lines in order to check and make sure the edge lengths have been set correctly.

Constructing the likelihood function

/*****************************************************************************************
| Build the likelihood function and print it out, which causes HyPhy to actually compute
| the log-likelihood for the tree.
*/
LikelihoodFunction likelihood = (filteredData, constrainedTree);
fprintf(stdout, likelihood);

If you add the 2 lines above to your growing bats.bf file and run it, you should see that the log-likelihood is (to 6 decimal places) equal to -6472.478318.

The ASSUME_REVERSIBLE_MODELS = -1 that we placed at the beginning of the batch file is needed to prevent HyPhy from assuming it can reroot the tree at any node it wants (which leads to trouble because the nucleotide composition changes across the tree and thus the rooting matters). The ACCEPT_ROOTED_TREES = TRUE at the top of the file prevents HyPhy from automatically converting our tree description (which specifies a rooted tree) into an unrooted tree.

Ready to simulate!

We are now ready to perform the simulations. As soon as you create a LikelihoodFunction object in HyPhy that is capable of computing the log-likelihood, that object can be used to simulate data because it has a tree with branches that have an assigned model of evolution and all parameters of the substitution models have been either specified (as we did here) or estimated.

Here is the simulation loop:

/*****************************************************************************************
| Perform the simulations. 
*/
for (simCounter = 0; simCounter < 10; simCounter = simCounter+1) {
    // Simulate a data set of the same size as the original set
    DataSet simulatedData = SimulateDataSet(likelihood);

    // The filter is necessary, but trivial in this case because alls sites are used 
    DataSetFilter filteredSimData = CreateFilter(simulatedData,1);

    // Save simulated data to a file
    outFile = "simdata/sim"+simCounter+".nex";
    fprintf(outFile, filteredSimData);
}

I have used both styles of comments here: the main comment before the loop is done using the /* ... */ style, and the double-slash style is used for comments within the loop. The LikelihoodFunction object likelihood is passed to the SimulateDataSet function to generate the data. It is always necessary to create a DataSetFilter in HyPhy, even if no filtering occurs. If one prints filteredSimData using fprintf to stdout, then the entire data file would be spewed to the screen. That's not very helpful. Here we are giving fprintf a file name rather than stdout, which causes the simulated data to be saved to that file. The file name is constructed using simCounter so that every simulated file has a name that is different. Note that all simulated data will be saved in the simdata directory that you created early on in this tutorial.

Using PAUP* to estimate parsimony trees for each simulated data set

HyPhy is great for estimating parameters on a tree using quite complex models; however, PAUP* excels at searching for the best tree, so we will leave HyPhy now and turn to PAUP* to finish our work.

Create the following file in the same directory as bats.bf. Call the new file parsimony.nex.

#nexus

begin paup;
    log file=simdata/pauplog.txt start replace;
    set criterion=parsimony warnreset=no warntsave=no;
end;

begin python;
    for i in range(10):
        paup.cmd("exe simdata/sim%d.nex;" % i)
        paup.cmd("hsearch;")
        paup.cmd("contree all;")
        paup.cmd("constraints flyingdna (monophyly) = ((PTEROPUS, 'TONATIA_BIDENS', 'TONATIA_SILVICOLA'));")
        paup.cmd("filter constraints=flyingdna;")
end;

begin paup;
    log stop;
    quit;
end;

This file contains 2 paup blocks with a python block sandwiched in between. That's right: PAUP* can execute python commands, and this comes in handy when you want to do the same thing over and over again, such as process a bunch of simulated data files in the exact same way.

The first paup block starts a log (which will be created in the simdata directory) and sets the optimality criterion to parsimony. It also tells PAUP* to not warn us when data files are reset or when we try to quit when there are still trees that haven't been saved.

The final paup bock just closes the log file and then quits.

The python block is where all the interesting things happen. As with python in general, be sure you have the indentation of lines correct inside the python block, otherwise you will get an error message from Python. You can see that the block is one big loop over simulation replicates. Commands that you want PAUP* to execute have to be created as strings (inside double quotes) and then passed to PAUP* via a paup.cmd function call.

The first paup.cmd executes a file named sim%d.nex inside the simdata directory, where %d is replaced by the value of i. Thus, the loop will, in turn, execute the files simdata/sim0.nex, simdata/sim1.nex, ..., simdata/sim10.nex.

The second paup.cmd performs a heuristic search with all default settings. You could make this more explicit/elaborate if you wished, but the default settings work well in this case.

The third paup.cmd creates (and shows) a strict consensus tree of all most parsimonious trees found during the search. There are often several best trees found during a parsimony search, and this shows us what is common to all these trees.

We could leave it at that, but the last two lines make it easier to tally how many simulation replicates resulted in a parsimony tree in which all bats form a monophyletic group. We first create a monophyly constraint named flyingdna and then filter the trees resulting from the parsimony search using the flyingdna constraint. Trees that satisfy the constraint are kept while all trees in which bats are not monophyletic are discarded. If any trees remain after filtering, we will count 1; if no trees remain after filtering, we will count 0. The total count divided by the number of simulation replicates will give us the y-value for the plot that recreates Figure 4 from the Vandenbussche et al. paper.

Run this file in PAUP* to generate the pauplog.txt file, then look through that file to see how many replicates yielded bat monophyly.

Final exercise

Once you confirm that your scripts are working, run your bats.bf using HyPhy followed by running parsimony.nex in PAUP* a total of 6 times, each with a different value of highATfreqs that reflects one of these AT percentages: 50, 60, 70, 80, 90, 100. You may also wish to bump up the number of simulation replicates to at least 20 or 50 in both bats.bf and parsimony.nex so that you get more accurate y-axis values.

Note that you can use a command like that below to pull out only the lines that report the number of trees retained from the file pauplog.txt:

cat simdata/pauplog.txt | grep "Number of trees retained by filter"

The cat command simply dumps a file to the screen. Instead of sending the output to the console, the (|) causes the output to instead be piped into the grep command, which filters out everything except lines that contain the supplied string. This makes it easy to peform your counts.

How does your plot compare to the one published in Vandenbussche et al. (1998)?