Phylogenetics: HyPhy Lab

{| border="0"
|-
|[[Image:Adiantum.png]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span>
|-
|
|}

== Goals ==
  
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.
  
The lab is divided into two parts, both of which recreate analyses done in past papers.  
  
The '''first part''' is designed to show you '''how easy it is to use HyPhy''' when what you want to do is available as a canned routine. In this example, we recreate part of the paper by [http://dx.doi.org/10.1007/s00239-008-9133-1 Wickett et al. 2008 ] in which the goal was to demonstrate that selection is relaxed in a gene no longer being used. You may remember that  [http://www.chicagobotanic.org/research/staff/wickett Norm Wickett] gave the keynote address at this year's EEB Graduate Student Symposium. This study was done when he was still here at UConn.
  
The '''second part''' shows that HyPhy allows you to test hypotheses that would be nearly impossible to test with any other software, although going outside HyPhy's collection of canned analyses means you will have to deal with some nitty-gritty details. In this case we recreate an analysis done by [https://doi.org/10.1006/mpev.1998.0531 Vandenbussche et al. (1998)] that required software to be written de novo just for that paper. My hope is that you will remember this second analysis when it hits you some day that it would be really nice to test a beautiful hypothesis you've come up with, but you can't find any software designed to test that particular hypothesis, and writing your own software is not within the realm of possibility. With HyPhy you may just be able to do it!
  
== Login to Xanadu ==
  
Login to Xanadu and request a machine as usual:

 srun --pty -p mcbstudent --qos=mcbstudent bash
  
== Loading modules needed ==
  
Load the modules needed for this exercise:

 module load hyphy/2.5.4
 module load paup/4.0a-166
 module load python/2.7.8
  
== Part 1 ==
  
Start by creating a new directory called <tt>norm</tt>. Download the dataset (wickett.nex) and tree (wickett.tre) we will use for this analysis using curl:
  
 mkdir ~/norm
 cd ~/norm
 curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.nex
 curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.tre
  
This dataset contains several sequences of bryophytes, including two from a parasitic bryophyte (the liverwort ''Aneura mirabilis'') that is non-green and does not photosynthesize. The sequences are of a gene important for photosynthesis. The question is, can we detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this)?
  
=== PhyloTree ===
  
The question we are asking is a common one, and HyPhy provides a canned analysis ([http://www.hyphy.org/methods/selection-methods/#relax RELAX]) designed to answer it.

Before running HyPhy, take a look at the wickett.tre file:

 cat wickett.tre
  
Note that the newick tree description has been modified in that three edges in the tree have been decorated with <tt>{Foreground}</tt>. This tells HyPhy which edges of the tree are suspected of being under relaxed selection. HyPhy has an online tool that makes it easy to decorate your tree accurately. Try downloading the [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.tre wickett.tre] file to your own computer and then uploading it to the [http://phylotree.hyphy.org PhyloTree web app]. Use the <tt>Input text</tt> option from the Newick dropdown menu to choose the wickett.tre file you downloaded. Note how 3 edges are labeled with the <tt>Foreground</tt> tag. Try selecting a different set of edges and selecting Export from the Newick dropdown (don't worry, this will not affect the original file). Note how PhyloTree provides you with the correctly decorated tree description that you can simply copy and paste into your tree file. You need not use the tag <tt>Foreground</tt>; you can use whatever name you like as long as you supply that name for the <tt>--test</tt> option when invoking HyPhy.
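
For illustration, a tagged tree description looks something like this (a made-up example, not the actual contents of wickett.tre): the <tt>{Foreground}</tt> tag simply follows the taxon name, or the closing parenthesis of the clade of interest, just before the branch length.

 ((taxonA:0.1,taxonB:0.2):0.05,(taxonC:0.15,taxonD:0.1){Foreground}:0.07,taxonE:0.3);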
  
=== RELAX ===
  
Run HyPhy as follows:

 hyphy relax --alignment wickett.nex --tree wickett.tre --models Minimal --test Foreground

To save time, we will just do the Minimal analysis. Note that <tt>Foreground</tt> is the tag supplied via the <tt>--test</tt> option. If you had used a different name to tag the interesting edges, you should supply that name here rather than <tt>Foreground</tt>.
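
For example, if your tree file had tagged the edges of interest with <tt>{Parasite}</tt> instead (a hypothetical tag name, not the one used in wickett.tre), the command would become:

 hyphy relax --alignment wickett.nex --tree wickett.tre --models Minimal --test Parasite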
  
What does the HyPhy RELAX analysis conclude? Look for the section of output near the bottom that begins

 ## Test for relaxation (or intensification) of selection [RELAX]
  
While the RELAX analysis is not the same test Norm et al. carried out, the conclusion is the same. This gene formerly important for photosynthesis is experiencing much less stabilizing selection along the edges leading to or within the parasite clade compared to the reference edges in the remainder of the tree.
  
== Part 2 ==
  
In this part, we will recreate the very interesting parametric bootstrapping analysis performed in the paper by [https://doi.org/10.1006/mpev.1998.0531 Vandenbussche et al. (1998)]. In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.
  
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.
  
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine.
  
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening? Our goal will be to recreate the parsimony part of Figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.
  
== Creating the HyPhy batch file ==
  
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your  <tt>bats</tt> directory.
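
One way to create both directories at once (a suggestion, assuming you are working from your home directory) is:

 mkdir -p ~/bats/simdata

The <tt>-p</tt> option tells <tt>mkdir</tt> to create the <tt>bats</tt> directory and the <tt>simdata</tt> directory inside it in a single step.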
  
Download the data we will use for this analysis using curl:
  
 cd ~/bats
 curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex
  
This is what your directory structure should look like at this point:

 bats/
     simdata/
     bats.bf
     irbp.nex

Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:

 AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1;
 ACCEPT_ROOTED_TREES = TRUE;
 ASSUME_REVERSIBLE_MODELS = -1;
 
 /*****************************************************************************************
 | Read in the data and store the result in the variable nucleotideSequences.
 */
 DataSet nucleotideSequences = ReadDataFile("irbp.nex");
 
 /*****************************************************************************************
 | Filter the data, specifying which sites are to be used. The first 1 means treat each
 | site separately (3 would cause the data to be interpreted as codons). The quoted
 | string (last argument) specifies which sites (where first site = 0) to use (we are
 | excluding sites with a lot of missing data and/or alignment issues). This leaves us with
 | 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine
 | exactly which sites were excluded in the original study.
 */
 DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");
 
 /*****************************************************************************************
 | Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means
 | unit=1, atom=1, position-specific=1. These settings create one global set of
 | frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies,
 | which is not what we need because we will not be using a codon model).
 */
 HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);
 fprintf(stdout, observedFreqs);

=== Run bats.bf ===
Run your <tt>bats.bf</tt> file as follows:

 hyphy bats.bf

You should see the empirical nucleotide frequencies displayed as follows:

 {
 {0.1936054742803209}
 {0.3104845052697813}
 {0.3106418121755545}
 {0.1852682082743432}
 }

I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.
 
The <tt>fprintf</tt> command also mimics the C programming language. It allows you to print objects to <tt>stdout</tt> (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable <tt>observedFreqs</tt>.
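
For example, a quick sanity check on the filtered data might look like this (assuming, as I believe is the case, that HBL exposes the number of sequences and sites in a filter as <tt>.species</tt> and <tt>.sites</tt>):

 fprintf(stdout, "Sequences: ", filteredData.species, ", sites: ", filteredData.sites, "\n");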
=== Define the HKY85 rate matrix ===
Add the following to the bottom of your <tt>bats.bf</tt> file:

 /*****************************************************************************************
 | Define the HKY85 substitution matrix. '*' is used for the diagonal elements that can be
 | computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is
 | declared to be global, meaning it is shared by all edges.
 */
 global kappa = 4.224618;
 HKY85RateMatrix =
     {{     *      ,    betat    , betat*kappa ,    betat     }
      {    betat    ,     *      ,    betat    , betat*kappa  }
      { betat*kappa ,    betat    ,     *      ,    betat     }
      {    betat    , betat*kappa ,    betat    ,      *      }};
 fprintf(stdout, HKY85RateMatrix);

Run <tt>bats.bf</tt> in HyPhy. Did the <tt>HKY85RateMatrix</tt> variable have the value you expected? Why or why not? (Hint: the variable <tt>betat</tt> was initialized to zero.) Set <tt>betat = 1;</tt> before the <tt>fprintf</tt> statement and run the batch file again to see the effect. Does the output make sense now?
=== Combine frequencies with rate matrix to create an HKY85 model ===

 /*****************************************************************************************
 | Define the HKY85 model, by combining the substitution matrix with the vector of
 | empirical frequencies.
 */
 Model hky1 = (HKY85RateMatrix, observedFreqs);

Now we have a model variable (<tt>hky1</tt>) that can be applied to each edge of a tree.
=== Create a tree representing the null hypothesis ===
The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.

 /*****************************************************************************************
 | Define the tree variable using a newick tree description (topology from Figure 2a of
 | Vandenbussche et al.; edge lengths estimated with PAUP*). By default, the last defined
 | model (hky1) is assigned to all edges of the tree.
 */
 Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";

=== Creating the second model ===
Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the <tt>hky1</tt> model has been applied to every node in the tree. You can verify this using the following HBL code:

 GetInformation(modelMap, constrainedTree);
 fprintf(stdout, modelMap);

Add these two lines to the bottom of your <tt>bats.bf</tt> file and run it. You should see this near the bottom of your output:

 {
  "Bos":"hky1",
  "Cynocephalus":"hky1",
  "Didelphis":"hky1",
  "Felis":"hky1",
  "Galago":"hky1",
  "Homo":"hky1",
  "Mus":"hky1",
  "Node1":"hky1",
  "Node10":"hky1",
  "Node11":"hky1",
  "Node14":"hky1",
  "Node19":"hky1",
  "Node2":"hky1",
  "Node3":"hky1",
  "Node4":"hky1",
  "Node5":"hky1",
  "Node7":"hky1",
  "Oryctolagus":"hky1",
  "Pteropus":"hky1",
  "Tarsius":"hky1",
  "Tonatia_bidens":"hky1",
  "Tonatia_silvicola":"hky1",
  "Tupaia":"hky1"
 }

First, create a second model named <tt>hky2</tt> and apply it to four specific edges in the tree:

 /*****************************************************************************************
 | Define a second AT-rich HKY85 model named hky2 and apply it to selected edges.
 */
 highATfreqs = {{.45}{.05}{.05}{.45}};
 Model hky2 = (HKY85RateMatrix, highATfreqs);
 SetParameter(constrainedTree.microbats, MODEL, hky2);
 SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);
 SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);
 SetParameter(constrainedTree.Pteropus, MODEL, hky2);

If you move your <tt>GetInformation</tt> call and its accompanying <tt>fprintf</tt> after these 6 lines, then you should see this in the output:

 {
  "Bos":"hky1",
  "Cynocephalus":"hky1",
  "Didelphis":"hky1",
  "Felis":"hky1",
  "Galago":"hky1",
  "Homo":"hky1",
  "Mus":"hky1",
  "Node1":"hky1",
  "Node10":"hky1",
  "Node11":"hky1",
  "Node14":"hky1",
  "Node2":"hky1",
  "Node3":"hky1",
  "Node4":"hky1",
  "Node5":"hky1",
  "Node7":"hky1",
  "Oryctolagus":"hky1",
  "Pteropus":"hky2",
  "Tarsius":"hky1",
  "Tonatia_bidens":"hky2",
  "Tonatia_silvicola":"hky2",
  "Tupaia":"hky1",
  "microbats":"hky2"
 }

=== Don't get too comfortable just yet ===
It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:

 fprintf(stdout, "\n\nCurrent edge lengths:\n");
 edgeNames   = BranchName(constrainedTree,-1);
 edgeLengths = BranchLength(constrainedTree,-1);
 for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {
     fprintf(stdout, Format(edgeLengths[k],10,5), "  ", edgeNames[k], "\n");
 }

The first line simply skips a couple of lines (<tt>\n\n</tt>) and prints a header announcing that current edge lengths will follow.
 
The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The <tt>-1</tt> in these functions is somewhat obscure, but means "give me all of them". (If you used <tt>5</tt> rather than <tt>-1</tt>, it would give you the name of the edge having index 5.)
 
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints out the length of edge <tt>k</tt> (the <tt>Format</tt> command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement causes a carriage return so that the edge lengths and names do not all end up on the same line of output.
 
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?
=== Fixing up edge lengths ===
The problem is that applying a new model to those 4 edges caused their edge length information to be erased! We now need to set the <tt>betat</tt> parameter for each of those edges so that it is compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The <tt>AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1</tt> that we placed at the top of the file saves us from having to go through this procedure for all the <tt>hky1</tt> edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.

Recall that, under the JC69 model, the edge length equals <tt>v = 3*beta*t</tt>. The <tt>betat</tt> parameter that appears in our <tt>HKY85RateMatrix</tt> is the <tt>beta*t</tt> part, so to set the <tt>betat</tt> parameter we need to divide the desired edge length by 3: that is, <tt>betat = v/3</tt>. The scaling factor 3 becomes more complex for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to <tt>betat</tt> values.

 betat = 1.0;
 scalingFactor = 0.0;
 for (n1 = 0; n1 < 4; n1 = n1+1) {
     for (n2 = 0; n2 < 4; n2 = n2+1) {
         if (n2 != n1) {
             scalingFactor = scalingFactor + highATfreqs[n1]*highATfreqs[n2]*HKY85RateMatrix[n1][n2];
         }
     }
 }

The two nested loops visit every off-diagonal element of the <tt>HKY85RateMatrix</tt>, multiplying each element by the base frequency of its row and the base frequency of its column. This product of 3 terms is then added to the growing sum <tt>scalingFactor</tt>. Review your lecture notes if you don't remember why <tt>scalingFactor</tt> is computed in this way. Note that it is important to set <tt>betat = 1</tt> before the loop in order to ensure that the scaling factor works correctly.
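
In symbols, letting <math>\pi_i</math> be the frequencies in <tt>highATfreqs</tt> and <math>r_{ij}</math> be the entries of <tt>HKY85RateMatrix</tt> with <tt>betat</tt> set to 1 (so <math>r_{ij}=\kappa</math> for transitions and 1 for transversions), and assuming HyPhy folds the column frequency into each off-diagonal rate when the Model is built, the rescaling amounts to this (a sketch, not a HyPhy formula):

<math>\mathrm{scalingFactor} = \sum_{i \neq j} \pi_i \, \pi_j \, r_{ij}, \qquad v = \mathrm{betat} \times \mathrm{scalingFactor} \quad\Rightarrow\quad \mathrm{betat} = \frac{v}{\mathrm{scalingFactor}}</math>

where <math>v</math> is the desired edge length.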
 
Now we simply need to set the <tt>betat</tt> values for the four edges of interest:

 constrainedTree.microbats.betat         := 0.097851/scalingFactor;
 constrainedTree.Tonatia_bidens.betat    := 0.008252/scalingFactor;
 constrainedTree.Tonatia_silvicola.betat := 0.000000/scalingFactor;
 constrainedTree.Pteropus.betat          := 0.104663/scalingFactor;

Place the loop that shows edge lengths after these 4 lines in order to check and make sure the edge lengths have been set correctly.
=== Constructing the likelihood function ===

 /*****************************************************************************************
 | Build the likelihood function and print it out, which causes HyPhy to actually compute
 | the log-likelihood for the tree.
 */
 LikelihoodFunction likelihood = (filteredData, constrainedTree);
 fprintf(stdout, likelihood);

If you add these lines to the end of your growing <tt>bats.bf</tt> file and run it, you should see that the log-likelihood is (to 6 decimal places) equal to <tt>-6472.478318</tt>.
 
The  <tt>ASSUME_REVERSIBLE_MODELS = -1</tt> that we placed at the beginning of the batch file is needed to prevent HyPhy from assuming it can reroot the tree at any node it wants (which leads to trouble because the nucleotide composition changes across the tree and thus the rooting matters). The <tt>ACCEPT_ROOTED_TREES = TRUE</tt> at the top of the file prevents HyPhy from automatically converting our tree description (which specifies a rooted tree) into an unrooted tree.
=== Ready to simulate! ===
We are now ready to perform the simulations. As soon as you create a <tt>LikelihoodFunction</tt> object in HyPhy that is capable of computing the log-likelihood, that object can be used to simulate data because it has a tree with branches that have an assigned model of evolution and all parameters of the substitution models have been either specified (as we did here) or estimated.
 
Here is the simulation loop:

 /*****************************************************************************************
 | Perform the simulations.
 */
 for (simCounter = 0; simCounter < 10; simCounter = simCounter+1) {
     // Simulate a data set of the same size as the original set
     DataSet simulatedData = SimulateDataSet(likelihood);
 
     // The filter is necessary, but trivial in this case because all sites are used
     DataSetFilter filteredSimData = CreateFilter(simulatedData,1);
 
     // Save simulated data to a file
     outFile = "simdata/sim"+simCounter+".nex";
     fprintf(outFile, filteredSimData);
 }

I have used both styles of comments here: the main comment before the loop is done using the <tt>/* ... */</tt> style, and the double-slash style is used for comments within the loop. The <tt>LikelihoodFunction</tt> object <tt>likelihood</tt> is passed to the <tt>SimulateDataSet</tt> function to generate the data. It is always necessary to create a <tt>DataSetFilter</tt> in HyPhy, even if no filtering occurs. If we printed <tt>filteredSimData</tt> to <tt>stdout</tt> using <tt>fprintf</tt>, the entire data file would be spewed to the screen, which is not very helpful. Here we are instead giving <tt>fprintf</tt> a file name rather than <tt>stdout</tt>, which causes the simulated data to be saved to that file. The file name is constructed using <tt>simCounter</tt> so that every simulated file has a different name. Note that all simulated data will be saved in the <tt>simdata</tt> directory that you created early on in this tutorial.
== Using PAUP* to estimate parsimony trees for each simulated data set ==
HyPhy is great for estimating parameters on a tree using quite complex models; however, PAUP* excels at searching for the best tree, so we will leave HyPhy now and turn to PAUP* to finish our work.
 
Create the following file in the same directory as <tt>bats.bf</tt>. Call the new file <tt>parsimony.nex</tt>.

 #nexus
 
 begin paup;
     log file=simdata/pauplog.txt start replace;
     set criterion=parsimony warnreset=no warntsave=no;
 end;
 
 begin python;
     for i in range(10):
         paup.cmd("exe simdata/sim%d.nex;" % i)
         paup.cmd("hsearch;")
         paup.cmd("contree all;")
         paup.cmd("constraints flyingdna (monophyly) = ((PTEROPUS, 'TONATIA_BIDENS', 'TONATIA_SILVICOLA'));")
         paup.cmd("filter constraints=flyingdna;")
 end;
 
 begin paup;
     log stop;
     quit;
 end;

This file contains 2 paup blocks with a python block sandwiched in between. That's right: PAUP* can execute python commands, and this comes in handy when you want to do the same thing over and over again, such as process a bunch of simulated data files in the exact same way.
 
The first paup block starts a log (which will be created in the <tt>simdata</tt> directory) and sets the optimality criterion to parsimony. It also tells PAUP* to not warn us when data files are reset or when we try to quit when there are still trees that haven't been saved.

The final paup block just closes the log file and then quits.

The python block is where all the interesting things happen. As with python in general, be sure you have the indentation of lines correct inside the python block, otherwise you will get an error message from Python. You can see that the block is one big loop over simulation replicates. Commands that you want PAUP* to execute have to be created as strings (inside double quotes) and then passed to PAUP* via a <tt>paup.cmd</tt> function call.

The first <tt>paup.cmd</tt> executes a file named <tt>sim%d.nex</tt> inside the simdata directory, where <tt>%d</tt> is replaced by the value of <tt>i</tt>. Thus, the loop will, in turn, execute the files simdata/sim0.nex, simdata/sim1.nex, ..., simdata/sim9.nex.

The second <tt>paup.cmd</tt> performs a heuristic search with all default settings. You could make this more explicit/elaborate if you wished, but the default settings work well in this case.
 
The third <tt>paup.cmd</tt> creates (and shows) a strict consensus tree of all most parsimonious trees found during the search. There are often several best trees found during a parsimony search, and this shows us what is common to all these trees.
 
We could leave it at that, but the last two lines make it easier to tally how many simulation replicates resulted in a parsimony tree in which all bats form a monophyletic group. We first create a monophyly constraint named <tt>flyingdna</tt> and then filter the trees resulting from the parsimony search using the <tt>flyingdna</tt> constraint. Trees that satisfy the constraint are kept while all trees in which bats are not monophyletic are discarded. If any trees remain after filtering, we will count 1; if no trees remain after filtering, we will count 0. The total count divided by the number of simulation replicates will give us the y-value for the plot that recreates Figure 4 from the Vandenbussche et al. paper.
 
Run this file in PAUP* to generate the <tt>pauplog.txt</tt> file, then look through that file to see how many replicates yielded bat monophyly.
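
Assuming the module you loaded provides the command-line version of PAUP* as <tt>paup</tt>, you can run the file like this:

 paup parsimony.nex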
== Final exercise ==
Once you confirm that your scripts are working, run your <tt>bats.bf</tt> using HyPhy followed by running <tt>parsimony.nex</tt> in PAUP* a total of 6 times, each with a different value of <tt>highATfreqs</tt> that reflects one of these AT percentages: 50, 60, 70, 80, 90, 100. You may also wish to bump up the number of simulation replicates to at least 20 or 50 in both <tt>bats.bf</tt> and <tt>parsimony.nex</tt> so that you get more accurate y-axis values.
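
Here is one possible (untested) sketch of a bash loop that automates those six runs; it assumes the <tt>highATfreqs</tt> line in <tt>bats.bf</tt> is written exactly as shown earlier, that the PAUP* executable is invoked as <tt>paup</tt>, and that GNU sed and the python module loaded earlier are available:

 # Per-base frequencies: total AT content of 50,60,...,100 percent corresponds to
 # piA = piT = 0.25,0.30,...,0.50 and piC = piG = (1 - 2*piA)/2
 for piAT in 0.25 0.30 0.35 0.40 0.45 0.50; do
     piGC=$(python -c "print((1.0 - 2*$piAT)/2)")
     sed -i "s/^highATfreqs = .*/highATfreqs = {{$piAT}{$piGC}{$piGC}{$piAT}};/" bats.bf
     hyphy bats.bf
     paup parsimony.nex
     mv simdata/pauplog.txt pauplog_$piAT.txt
 done

Each iteration rewrites the <tt>highATfreqs</tt> line, regenerates the simulated data sets, reruns the parsimony analyses, and saves the log under a different name so that the next iteration does not overwrite it.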
 
Note that you can use a command like the one below to pull out, from the file <tt>pauplog.txt</tt>, only the lines that report the number of trees retained by the filter:

 cat simdata/pauplog.txt | grep "Number of trees retained by filter"

The <tt>cat</tt> command simply dumps a file to the screen. Instead of sending the output to the console, the pipe symbol (<tt>|</tt>) causes the output to be piped into the <tt>grep</tt> command, which filters out everything except lines that contain the supplied string. This makes it easy to perform your counts.

How does your plot compare to the one published in Vandenbussche et al. (1998)?
== Literature Cited ==
NJ Wickett, Y Fan, P Lewis, and B Goffinet. 2008. Distribution and evolution of pseudogenes, gene losses, and a gene rearrangement in the plastid genome of the nonphotosynthetic liverwort, Aneura mirabilis (Metzgeriales, Jungermanniopsida). Journal of Molecular Evolution 67:111-122. [http://dx.doi.org/10.1007/s00239-008-9133-1 DOI:10.1007/s00239-008-9133-1]

RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular Phylogenetics and Evolution 13(3): 408-416. [https://doi.org/10.1006/mpev.1998.0531 DOI:10.1006/mpev.1998.0531]

[[Category: Phylogenetics]]

Latest revision as of 12:02, 26 February 2020

Adiantum.png EEB 349: Phylogenetics

Goals

The goal of this lab exercise is to show you how to use the HyPhy program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.

The lab is divided into two parts, both of which recreate analyses done in past papers.

The first part is designed to show you how easy it is to use HyPhy when what you want to do is available as a canned routine. In this example, we recreate part of the paper by Wickett et al. 2008 in which the goal was to demonstrate that selection is relaxed in a gene no longer being used. You may remember that Norm Wickett gave the keynote address at this year's EEB Graduate Student Symposium. This study was done when he was still here at UConn.

The second part shows that HyPhy allows you to test hypotheses that are nearly impossible to do with any other software, although going outside HyPhy's collection of canned analyses means you will have to deal with some nitty-gritty details. In this case we recreate an analysis done by Vandenbussche et al. (1998) that required software to be written de novo just for that paper. My hope is that you will remember this second analysis when it hits you some day that it would be really nice to test this beautiful hypothesis you've come up with, but can't think of any software designed to test that particular hypothesis and writing your own software is not within the realm of possibility. With HyPhy you may just be able to do it!

Login to Xanadu

Login to Xanadu and request a machine as usual:

srun --pty -p mcbstudent --qos=mcbstudent bash

Loading modules needed

Load the module needed for this exercise:

module load hyphy/2.5.4
module load paup/4.0a-166
module load python/2.7.8

Part 1

Start by creating a new directory called norm. Download the dataset (wickett.nex) and tree (wickett.newick) we will use for this analysis using curl:

cd ~/norm
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.nex
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.tre

This dataset contains several sequences of bryophytes, including two from a parasitic bryophyte (the liverwort Aneura mirabilis) that is non-green and does not photosynthesize. The sequences are of a gene important for photosynthesis. The question is, can we detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this)?

PhyloTree

The question we are asking is commonly asked, and is thus available as a canned analysis (RELAX) in HyPhy.

Before running HyPhy, take a look at the wickett.tre file:

cat wickett.tre

Note that the newick tree description has been modified in that three edges in the tree have been decorated with {Foreground}. This tells HyPhy which edges of the tree are suspected of being under relaxed selection. HyPhy has an online tool that makes it easy to decorate your tree accurately. Try downloading the wickett.tre file to your own computer and then uploading it to the PhyloTree web app. Use the Input text option from the Newick dropdown menu to choose the wickett.tre file you downloaded. Note how 3 edges are labeled with the Foreground tag. Try selecting a different set of edges and selecting Export from the Newick dropdown (don't worry, this will not affect the original file). Note how PhyloTree provides you with the correctly decorated tree description that you can simply copy and paste into your tree file. You need not use the tag Foreground; you can use whatever name you like as long as you supply that name for the --test option when invoking HyPhy.

RELAX

Run HyPhy as follows:

hyphy relax --alignment wickett.nex --tree wickett.tre --models Minimal --test Foreground

To save time, we will just do the Minimal analysis. Note that Foreground is the tag supplied via the --test option. If you had used a different name to tag the interesting edges, you should supply that name here rather than Foreground.

What does the HyPhy RELAX analysis conclude? Look for the section of output near the bottom that begins

## Test for relaxation (or intensification) of selection [RELAX]

While the RELAX analysis is not the same test Norm et al. carried out, the conclusion is the same. This gene formerly important for photosynthesis is experiencing much less stabilizing selection along the edges leading to or within the parasite clade compared to the reference edges in the remainder of the tree.

Part 2

In this part, we will recreate the very interesting parametric bootstrapping analysis performed in the paper by Vandenbussche et al. (1998). In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.

The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.

This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine.

Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.

Creating the HyPhy batch file

HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named bats.bf in a directory called bats (you can of course use any name you like, but these are the names I will be using). Also create a directory simdata inside your bats directory.

Download the data we will use for this analysis using curl:

cd ~/bats
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex

This is what your directory structure should look like at this point:

bats/
    simdata/
    bats.bf
    irbp.nex

Open bats.bf in your favorite text editor (e.g. nano) and enter the following text:

AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1;  
ACCEPT_ROOTED_TREES = TRUE;  
ASSUME_REVERSIBLE_MODELS = -1;

/*****************************************************************************************
| Read in the data and store the result in the variable nucleotideSequences.
*/
DataSet nucleotideSequences = ReadDataFile("irbp.nex");

/*****************************************************************************************
| Filter the data, specifying which sites are to be used. The first 1 means treat each 
| site separately (3 would cause the data to be interpreted as codons). The quoted 
| string (last argument) specifies which sites (where first site = 0) to use (we are 
| excluding sites with a lot of missing data and or alignment issues). This leaves us with
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine
| exactly which sites were excluded in the original study.
*/
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");

/*****************************************************************************************
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means
| unit=1, atom=1, position-specific=1. These settings create one global set of 
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, 
| which is not what we need because we will not be using a codon model).
*/
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);
fprintf(stdout, observedFreqs);

Run bats.bf

Run your bats.bf file as follows:

hyphy bats.bf

You should see the empirical nucleotide frequencies displayed as follows:

{
{0.1936054742803209}
{0.3104845052697813}
{0.3106418121755545}
{0.1852682082743432}
}

I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (/* comment */) or begin the line with two slashes (// comment). The three lines at the beginning will be explained when each becomes relevant.

The fprintf command also mimics the C programming language. It allows you to print objects to stdout (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable observedFreqs.

Define the HKY85 rate matrix

Add the following to the bottom of your bats.bf file:

/*****************************************************************************************
| Define the KHY substitution matrix. '*' is used for the diagonal elements that can be
|.computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is 
| declared to be global, meaning it is shared by all edges.
*/
global kappa = 4.224618;
HKY85RateMatrix = 
		{{     *      ,    betat    ,    betat*kappa  ,     betat    }
		 {    betat    ,     *      ,       betat     ,  betat*kappa }
		 { betat*kappa ,    betat    ,         *      ,     betat    }
		 {    betat    , betat*kappa ,       betat     ,       *     }};
fprintf(stdout, HKY85RateMatrix);

Run bats.bf in HyPhy. Did the HKY85RateMatrix variable have the value you expected? Why or why not? (Hint: the variable betat was initialized to zero.) Set betat = 1; before the fprintf statement and run the batch file again to see the effect. Does the output make sense now?

Combine frequencies with rate matrix to create an HKY85 model

/*****************************************************************************************
| Define the HKY85 model, by combining the substitution matrix with the vector of 
| empirical frequencies. 
*/
Model hky1 = (HKY85RateMatrix, observedFreqs);

Now we have a model variable (hky1) that can be applied to each edge of a tree.

Create a tree representing the null hypothesis

The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.

/*****************************************************************************************
| Define the tree variable, using the tree description read from the data file.
| By default, the last defined model (hky1) is assigned to all edges of the tree. 
*/
Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";

Creating the second model

Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the hky1 model has been applied to every node in the tree. You can verify this using the following HBL code:

GetInformation(modelMap, constrainedTree);
fprintf(stdout, modelMap);

Add these two lines to the bottom of your bats.bf file and run it. You should see this near the bottom of your output:

{
 "Bos":"hky1",
 "Cynocephalus":"hky1",
 "Didelphis":"hky1",
 "Felis":"hky1",
 "Galago":"hky1",
 "Homo":"hky1",
 "Mus":"hky1",
 "Node1":"hky1",
 "Node10":"hky1",
 "Node11":"hky1",
 "Node14":"hky1",
 "Node19":"hky1",
 "Node2":"hky1",
 "Node3":"hky1",
 "Node4":"hky1",
 "Node5":"hky1",
 "Node7":"hky1",
 "Oryctolagus":"hky1",
 "Pteropus":"hky1",
 "Tarsius":"hky1",
 "Tonatia_bidens":"hky1",
 "Tonatia_silvicola":"hky1",
 "Tupaia":"hky1"
}

First, create a second model named hky2 and apply it to four specific edges in the tree:

/*****************************************************************************************
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. 
*/
highATfreqs = {{.45}{.05}{.05}{.45}};
Model hky2 = (HKY85RateMatrix, highATfreqs);
SetParameter(constrainedTree.microbats, MODEL, hky2);
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);
SetParameter(constrainedTree.Pteropus, MODEL, hky2);

If you move your GetInformation call and its accompanying fprintf after these 6 lines, then you should see this in the output:

{
 "Bos":"hky1",
 "Cynocephalus":"hky1",
 "Didelphis":"hky1",
 "Felis":"hky1",
 "Galago":"hky1",
 "Homo":"hky1",
 "Mus":"hky1",
 "Node1":"hky1",
 "Node10":"hky1",
 "Node11":"hky1",
 "Node14":"hky1",
 "Node2":"hky1",
 "Node3":"hky1",
 "Node4":"hky1",
 "Node5":"hky1",
 "Node7":"hky1",
 "Oryctolagus":"hky1",
 "Pteropus":"hky2",
 "Tarsius":"hky1",
 "Tonatia_bidens":"hky2",
 "Tonatia_silvicola":"hky2",
 "Tupaia":"hky1",
 "microbats":"hky2"
}

Don't get too comfortable just yet

It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:

fprintf(stdout, "\n\nCurrent edge lengths:\n");
edgeNames 	= BranchName(constrainedTree,-1); 
edgeLengths	= BranchLength(constrainedTree,-1);
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {
	fprintf(stdout, Format(edgeLengths[k],10,5), "  ", edgeNames[k], "\n");
}

The first line simply skips a couple of lines (\n\n) and prints a header announcing that current edge lengths will follow.

The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The -1 in these functions is somewhat obscure, but means "give me all of them". (If you used 5 rather than -1, it would give you the name of the edge having index 5.)

The last 3 lines are a loop in which the variable k ranges from 0 to the number of edges minus 1. For each value of k, the fprintf statement prints out the length of edge k (the Format command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge k. The newline character ("\n") at the end of the fprintf statement causes a carriage return so that the edge lengths and names do not all end up on the same line of output.

After running bats.bf, what possibly important detail do you notice about the lengths of the edges to which we attached the hky2 model?

Fixing up edge lengths

The problem is that applying a new model to the 4 edge lengths caused the edge length information to be erased! We need to now set the betat parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1 that we placed at the top of the file saves us from having to go through this procedure for all the hky1 edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.

Recall that, under the JC69 model, the edge length equals v = 3*beta*t. The betat parameter that appears in our HKY85RateMatrix is the beta*t part, so to set the betat parameter we need to divide the desired edge length by 3: that is, betat = v/3. The scaling factor 3 becomes more complex for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to betat values.
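To make the algebra explicit (this is a sketch, assuming that the off-diagonal entries of HKY85RateMatrix equal kappa*betat for transitions and betat for transversions, and that HyPhy multiplies each entry by the equilibrium frequency of the column's nucleotide when the Model is assembled from the rate matrix and the frequency vector), the expected number of substitutions per site on an edge is

v = \beta t \sum_{i} \sum_{j \ne i} \pi_i \, \pi_j \, r_{ij}, \qquad r_{ij} = \begin{cases} \kappa & \text{if } i \rightarrow j \text{ is a transition} \\ 1 & \text{if } i \rightarrow j \text{ is a transversion} \end{cases}

so the value we need for each edge is betat = v/scalingFactor, where scalingFactor is the double sum computed (with betat set to 1) by the loop below.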

betat = 1.0;
scalingFactor = 0.0;
for (n1 = 0; n1 < 4; n1 = n1+1) {
    for (n2 = 0; n2 < 4; n2 = n2+1) {
        if (n2!=n1) {
            scalingFactor = scalingFactor + highATfreqs[n1]*highATfreqs[n2]*HKY85RateMatrix[n1][n2];
        }
    }
}

The two nested loops visit every off-diagonal element of the HKY85RateMatrix, multiplying that element by the base frequency of its row and the base frequency of its column. Each product of 3 terms is then added to the growing sum scalingFactor. Review your lecture notes if you don't remember why scalingFactor is computed in this way. Note that it is important to set betat = 1 before the loop in order to ensure that the scaling factor works correctly.

Now we simply need to set the betat values for the four edges of interest:

constrainedTree.microbats.betat         := 0.097851/scalingFactor;
constrainedTree.Tonatia_bidens.betat    := 0.008252/scalingFactor;
constrainedTree.Tonatia_silvicola.betat := 0.000000/scalingFactor;
constrainedTree.Pteropus.betat          := 0.104663/scalingFactor;

Place the loop that shows edge lengths after these 4 lines in order to check and make sure the edge lengths have been set correctly.

== Constructing the likelihood function ==

/*****************************************************************************************
| Build the likelihood function and print it out, which causes HyPhy to actually compute
| the log-likelihood for the tree.
*/
LikelihoodFunction likelihood = (filteredData, constrainedTree);
fprintf(stdout, likelihood);

If you add these lines to the end of your growing bats.bf file and run it, you should see that the log-likelihood is (to 6 decimal places) equal to -6472.478318.

The ASSUME_REVERSIBLE_MODELS = -1 that we placed at the beginning of the batch file is needed to prevent HyPhy from assuming it can reroot the tree at any node it wants (which leads to trouble because the nucleotide composition changes across the tree and thus the rooting matters). The ACCEPT_ROOTED_TREES = TRUE at the top of the file prevents HyPhy from automatically converting our tree description (which specifies a rooted tree) into an unrooted tree.
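As a reminder, the three settings referred to in this section were placed near the top of bats.bf earlier in this lab; gathered together, they look like this:

ASSUME_REVERSIBLE_MODELS = -1;
ACCEPT_ROOTED_TREES = TRUE;
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1;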

== Ready to simulate! ==

We are now ready to perform the simulations. As soon as you create a LikelihoodFunction object in HyPhy that is capable of computing the log-likelihood, that object can be used to simulate data because it has a tree with branches that have an assigned model of evolution and all parameters of the substitution models have been either specified (as we did here) or estimated.

Here is the simulation loop:

/*****************************************************************************************
| Perform the simulations. 
*/
for (simCounter = 0; simCounter < 10; simCounter = simCounter+1) {
    // Simulate a data set of the same size as the original set
    DataSet simulatedData = SimulateDataSet(likelihood);

    // The filter is necessary, but trivial in this case because all sites are used
    DataSetFilter filteredSimData = CreateFilter(simulatedData,1);

    // Save simulated data to a file
    outFile = "simdata/sim"+simCounter+".nex";
    fprintf(outFile, filteredSimData);
}

I have used both styles of comments here: the main comment before the loop uses the /* ... */ style, and the double-slash style is used for comments within the loop. The LikelihoodFunction object likelihood is passed to the SimulateDataSet function to generate the data. It is always necessary to create a DataSetFilter in HyPhy, even if no filtering occurs. If you printed filteredSimData to stdout using fprintf, the entire data file would be spewed to the screen, which is not very helpful. Here we instead give fprintf a file name rather than stdout, which causes the simulated data to be saved to that file. The file name is constructed using simCounter so that every simulated file gets a different name. Note that all simulated data sets will be saved in the simdata directory that you created early on in this tutorial.

== Using PAUP* to estimate parsimony trees for each simulated data set ==

HyPhy is great for estimating parameters on a tree using quite complex models; however, PAUP* excels at searching for the best tree, so we will leave HyPhy now and turn to PAUP* to finish our work.

Create the following file in the same directory as bats.bf. Call the new file parsimony.nex.

#nexus

begin paup;
    log file=simdata/pauplog.txt start replace;
    set criterion=parsimony warnreset=no warntsave=no;
end;

begin python;
    for i in range(10):
        paup.cmd("exe simdata/sim%d.nex;" % i)
        paup.cmd("hsearch;")
        paup.cmd("contree all;")
        paup.cmd("constraints flyingdna (monophyly) = ((PTEROPUS, 'TONATIA_BIDENS', 'TONATIA_SILVICOLA'));")
        paup.cmd("filter constraints=flyingdna;")
end;

begin paup;
    log stop;
    quit;
end;

This file contains 2 paup blocks with a python block sandwiched in between. That's right: PAUP* can execute Python commands, and this comes in handy when you want to do the same thing over and over again, such as processing a bunch of simulated data files in exactly the same way.

The first paup block starts a log (which will be created in the simdata directory) and sets the optimality criterion to parsimony. It also tells PAUP* to not warn us when data files are reset or when we try to quit when there are still trees that haven't been saved.

The final paup block just closes the log file and then quits.

The python block is where all the interesting things happen. As with python in general, be sure you have the indentation of lines correct inside the python block, otherwise you will get an error message from Python. You can see that the block is one big loop over simulation replicates. Commands that you want PAUP* to execute have to be created as strings (inside double quotes) and then passed to PAUP* via a paup.cmd function call.

The first paup.cmd executes a file named sim%d.nex inside the simdata directory, where %d is replaced by the value of i. Thus, the loop will, in turn, execute the files simdata/sim0.nex, simdata/sim1.nex, ..., simdata/sim9.nex.

The second paup.cmd performs a heuristic search with all default settings. You could make this more explicit/elaborate if you wished, but the default settings work well in this case.

The third paup.cmd creates (and shows) a strict consensus tree of all most parsimonious trees found during the search. There are often several best trees found during a parsimony search, and this shows us what is common to all these trees.

We could leave it at that, but the last two lines make it easier to tally how many simulation replicates resulted in a parsimony tree in which all bats form a monophyletic group. We first create a monophyly constraint named flyingdna and then filter the trees resulting from the parsimony search using the flyingdna constraint. Trees that satisfy the constraint are kept while all trees in which bats are not monophyletic are discarded. If any trees remain after filtering, we will count 1; if no trees remain after filtering, we will count 0. The total count divided by the number of simulation replicates will give us the y-value for the plot that recreates Figure 4 from the Vandenbussche et al. paper.

Run this file in PAUP* to generate the pauplog.txt file, then look through that file to see how many replicates yielded bat monophyly.

== Final exercise ==

Once you confirm that your scripts are working, run your bats.bf using HyPhy followed by running parsimony.nex in PAUP* a total of 6 times, each with a different value of highATfreqs that reflects one of these AT percentages: 50, 60, 70, 80, 90, 100. You may also wish to bump up the number of simulation replicates to at least 20 or 50 in both bats.bf and parsimony.nex so that you get more accurate y-axis values.
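For example, assuming you keep the AT total split evenly between A and T (and the GC total split evenly between G and C), as in the 90% vector used above, the six settings would look like this (HyPhy's nucleotide order is A, C, G, T); only one of these lines should be present in bats.bf for any given run:

highATfreqs = {{.25}{.25}{.25}{.25}};  /* 50% AT */
highATfreqs = {{.30}{.20}{.20}{.30}};  /* 60% AT */
highATfreqs = {{.35}{.15}{.15}{.35}};  /* 70% AT */
highATfreqs = {{.40}{.10}{.10}{.40}};  /* 80% AT */
highATfreqs = {{.45}{.05}{.05}{.45}};  /* 90% AT */
highATfreqs = {{.50}{.00}{.00}{.50}};  /* 100% AT */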

Note that you can use a command like the one below to pull out, from the file pauplog.txt, only the lines that report the number of trees retained:

cat simdata/pauplog.txt | grep "Number of trees retained by filter"

The cat command simply dumps a file to the screen. Instead of sending the output to the console, the pipe symbol (|) causes the output to be piped into the grep command, which filters out everything except lines containing the supplied string. This makes it easy to perform your counts.

How does your plot compare to the one published in Vandenbussche et al. (1998)?

== Literature Cited ==

NJ Wickett, Y Fan, P Lewis, and B Goffinet. 2008. Distribution and evolution of pseudogenes, gene losses, and a gene rearrangement in the plastid genome of the nonphotosynthetic liverwort, Aneura mirabilis (Metzgeriales, Jungermanniopsida). Journal of Molecular Evolution 67:111-122. DOI:10.1007/s00239-008-9133-1

RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular Phylogenetics and Evolution 10(3): 408-416. DOI:10.1006/mpev.1998.0531