Difference between revisions of "Phylogenetics: HyPhy Lab"

From EEBedia
Jump to: navigation, search
(Obtaining the sequences)
m (Create a tree representing the null hypothesis)
 
(53 intermediate revisions by 2 users not shown)
Line 4: Line 4:
 
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span>
 
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span>
 
|-
 
|-
|The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework.
+
|
 
|}
 
|}
 +
== Goals ==
  
== Obtaining the sequences ==
+
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.
  
A Nexus data file containing sequences and a tree is located here: [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.nex wickett.nex]. This dataset was assembled by former UConn EEB graduate student [http://hydrodictyon.eeb.uconn.edu/people/wickett/ Norm Wickett] and contains several sequences of bryophytes, including two from a parasitic bryophyte (the liverwort <em>Aneura mirabilis</em>) that is non-green and does not photosynthesize. Today's lab will recreate the type of analysis Norm carried out in [http://dx.doi.org/10.1007/s00239-008-9133-1 his 2008 paper in Journal of Molecular Evolution (67:111-122)].
+
The lab is divided into two parts, both of which recreate analyses done in past papers.  
  
The sequences are of a gene important for photosynthesis. The basic idea behind today's lab is to see if we can detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this).
+
The '''first part''' is designed to show you '''how easy it is to use HyPhy''' when what you want to do is available as a canned routine. In this example, we recreate part of the paper by [http://dx.doi.org/10.1007/s00239-008-9133-1 Wickett et al. 2008 ] in which the goal was to demonstrate that selection is relaxed in a gene no longer being used. You may remember that  [http://www.chicagobotanic.org/research/staff/wickett Norm Wickett] gave the keynote address at this year's EEB Graduate Student Symposium. This study was done when he was still here at UConn.
  
== Using a codon model ==
+
The '''second part''' shows that HyPhy allows you to test hypotheses that are nearly impossible to do with any other software, although going outside HyPhy's collection of canned analyses means you will have to deal with some nitty-gritty details. In this case we recreate an analysis done by [https://doi.org/10.1006/mpev.1998.0531 Vandenbussche et al. (1998)] that required software to be written de novo just for that paper. My hope is that you will remember this second analysis when it hits you some day that it would be really nice to test this beautiful hypothesis you've come up with, but can't think of any software designed to test that particular hypothesis and writing your own software is not within the realm of possibility. With HyPhy you may just be able to do it!
  
Although HyPhy can utilize the same standard models found in PAUP*, it also lets you to do some interesting and useful things that PAUP* cannot, such as (1) use codon and secondary structure models, and (2) allow the model of evolution to change across a tree.
+
== Login to Xanadu ==
  
== Downloading and installing HyPhy ==
+
Login to Xanadu and request a machine as usual:
 +
srun --pty -p mcbstudent --qos=mcbstudent bash
  
HyPhy is available for Mac and Windows from the [http://www.hyphy.org/ HyPhy home page].
+
== Loading modules needed ==
  
== Loading data into HyPhy ==
+
Load the module needed for this exercise:
 +
module load hyphy/2.5.4
 +
module load paup/4.0a-166
 +
module load python/2.7.8
  
Start HyPhy and dismiss the "Welcome to HyPhy" dialog box (if it appears) by pressing the Ok button. Choose ''File > Open > Open Data File'', then navigate to and select the <tt>wickett.nex</tt> data file that you saved previously. You should now see the sequences appear in a window entitled "DataSet wickett". I will refer to this as the '''Data window''' from this point on.
+
== Part 1 ==
  
== Creating a partition ==
+
Start by creating a new directory called <tt>norm</tt>. Download the dataset (wickett.nex) and tree (wickett.newick) we will use for this analysis using curl:
  
HyPhy thinks of your data as being composed of one or more '''partitions'''. Partitioning data means assigning characters (sites) into mutually-exclusive groups. For example, suppose your data set comprises two genes: you might want to assign a separate model for each gene, so in this case you would create two partitions (one for each gene).  
+
cd ~/norm
 +
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.nex
 +
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.tre
  
=== The word partition is used in two ways ===
+
This dataset contains several sequences of bryophytes, including two from a parasitic bryophyte (the liverwort ''Aneura mirabilis'') that is non-green and does not photosynthesize. The sequences are of a gene important for photosynthesis. The question is, can we detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this)?
The word partition is ambiguous: it formerly meant "wall" or "divider" but, with the advent of computer hard drives, it has also come to mean the space ''between'' the walls or dividers. When someone says they ''partitioned their data'', they mean that they erected dividers, for example between the rbcL and 18S genes. When someone says they ''applied a GTR+I+G model to the rbcL partition'', they have now switched to using the word partition to mean the sites on the rbcL side of the divider.
+
  
=== No partitioning implies one partition! ===
+
=== PhyloTree ===
Even if you choose to '''not''' partition (old meaning) your data in HyPhy, you must go through the motions of creating a single partition (new meaning) because HyPhy only allows you to apply a model to a partition. To create a single partition containing all of your sites, choose ''Edit > Select All'' from the Data window menu, then choose ''Data > Selection->Partition'' to assign all the selected sites to a new partition. You should see a line appear below your sequences with a partition name "wickett_part".
+
  
=== Assign a data type to your partition ===
+
The question we are asking is commonly asked, and is thus available as a canned analysis ([http://www.hyphy.org/methods/selection-methods/#relax RELAX]) in HyPhy.
Now that you have a partition, you can create a model for it. Under the column name ''Partition Type'', choose ''codon'' (just press the Ok button in the dialog box that appears). You have now chosen to view your data as codons (i.e. three nucleotides at a time) rather than as single nucleotides. The third possible choice for Partition Type is ''Di-nucl.'', which you would use if you were planning to use a secondary structure (i.e. stem) model, which treats each sequential pair of nucleotides as a state.
+
  
=== Assign a tree topology to your partition ===
+
Before running HyPhy, take a look at the wickett.tre file:
Under Tree Topology, you have several options. Because a tree topology was defined in the <tt>wickett.nex</tt> data file, this tree topology shows up in the drop-down list as <tt>wickett_tree</tt>. Choose <tt>wickett_tree</tt> as the tree topology for your partition.
+
cat wickett.tre
  
=== Assign a substitution model to your partition ===
+
Note that the newick tree description has been modified in that three edges in the tree have been decorated with <tt>{Foreground}</tt>. This tells HyPhy which edges of the tree are suspected of being under relaxed selection. HyPhy has an online tool that makes it easy to decorate your tree accurately. Try downloading the [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.tre wickett.tre] file to your own computer and then uploading it to the [http://phylotree.hyphy.org PhyloTree web app]. Use the <tt>Input text</tt> option from the Newick dropdown menu to choose the wickett.tre file you downloaded. Note how 3 edges are labeled with the <tt>Foreground</tt> tag. Try selecting a different set of edges and selecting Export from the Newick dropdown (don't worry, this will not affect the original file). Note how PhyloTree provides you with the correctly decorated tree description that you can simply copy and paste into your tree file. You need not use the tag <tt>Foreground</tt>; you can use whatever name you like as long as you supply that name for the <tt>--test</tt> option when invoking HyPhy.
The only substitution models that show up in the drop-down list are codon models because earlier you chose to treat your data as codon sequences rather than nucleotide sequences. The substitution model you should use is ''MG94xHKY85_3x4'' (second from the bottom). This model is like the Muse and Gaut (1994) codon model, which is the only codon model I discussed in lecture. You will remember (I'm sure) that the MG94 model allows substitutions to be either synonymous or non-synonymous, but does not make a distinction between transitions and transversions. The HKY85 model distinguishes between transitions and transversions (remember kappa?), but does not distinguish between synonymous and non-synonymous substitutions. Thus, MG94xHKY85 is a hybrid model that allows all four possibilities: synonymous transitions, synonymous transversions, nonsynonymous transitions and nonsynonymous transversions. The name is nevertheless a bit puzzling because (as you will find out in a few minutes) it actually behaves more like the GTR model than the HKY model in that it allows all 6 possible types of substitutions (A<->C, A<->G, A<->T, C<->G, C<->T and G<->T) to have their own rates.
+
  
The 3x4 part on the end of the name means that the 61 codon frequencies are obtained by multiplying together the four nucleotide frequencies that are estimated separately for the three codon positions. Thus, the frequency for the AGT codon is obtained by multiplying together these three quantities:
+
=== RELAX ===
* the frequency of A nucleotides at first positions
+
* the frequency of G nucleotides at second positions
+
* the frequency of T nucleotides at third positions
+
(Note: HyPhy corrects these for the fact that the three stop codons are not included.)
+
This involves estimating the '''4''' nucleotides frequencies at each of the '''3''' codon positions, hence the '''3x4''' in the name.
+
  
=== Local vs. global ===
+
Run HyPhy as follows:
You have only a couple more decisions to make before calculating the likelihood. You must choose Local or Global from the Parameters drop-down list. '''Local''' means that HyPhy will estimate some substitution model parameters for every branch in the tree. '''Global''' means that all substitution model parameters will apply to the entire tree. In all the models discussed thus far in the course, we were effectively using the global option except for the branch lengths themselves, which are always local parameters (it doesn't usually make any sense to think of every branch having the same length).
+
hyphy relax --alignment wickett.nex --tree wickett.tre --models Minimal --test Foreground
 +
To save time, we will just do the Minimal analysis. Note that <tt>Foreground</tt> is the tag supplied via the <tt>--test</tt> option. If you had used a different name to tag the interesting edges, you should supply that name here rather than <tt>Foreground</tt>.
  
Tell HyPhy to use the Local option (this should already be set correctly).
+
What does the HyPhy RELAX analysis conclude? Look for the section of output near the bottom that begins
 +
## Test for relaxation (or intensification) of selection [RELAX]
  
=== Equilibrium frequencies===
+
While the RELAX analysis is not the same test Norm et al. carried out, the conclusion is the same. This gene formerly important for photosynthesis is experiencing much less stabilizing selection along the edges leading to or within the parasite clade compared to the reference edges in the remainder of the tree.
You should also leave the equilibrium frequencies set to "Partition". This sets the equilibrium base frequencies to the empirical values (i.e. the frequency of A is the number of As observed in the entire partition divided by the total number of nucleotides in the partition). Other options include:
+
* Dataset, which would not be different than "Partition" in this case where there is only one partition defined,
+
* Equal, which sets all base frequencies equal to 0.25, and
+
* Estimate, which estimates the base frequencies
+
  
== Computing the likelihood under a local codon model ==
+
== Part 2 ==
  
You are now ready to compute the maximum likelihood estimates of the parameters in your model. Choose ''Likelihood > Build Function'' to build a likelihood function, then ''Likelihood > Optimize'' to optimize the likelihood function (i.e. search for the highest point on the likelihood surface, thus obtaining maximum likelihood estimates of all parameters).
+
In this part, we will recreate the very interesting parametric bootstrapping analysis performed in the paper by [https://doi.org/10.1006/mpev.1998.0531 Vandenbussche et al. (1998)]. In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.
  
=== Saving the results ===
+
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.
When HyPhy has finished optimizing (this will take several seconds to several minutes, depending on the speed of the computer you are using), it will pop up a "Likelihood parameters for wickett" window (hereafter I will just refer to this as the '''Parameters window''') showing you values for all the quantities it estimated.  
+
  
Click on the '''HYPHY Console window''' to bring it to the foreground, then, using the scroll bar to move up if needed, answer the following questions:
+
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine.  
<div style="background-color:#eeeeff">What is the maximum log-likelihood under this model?</div>
+
<div style="background-color:#eeeeff">How many shared (i.e. global) parameters does HyPhy say it estimated?</div>
+
<div style="background-color:#eeeeff">What are these global parameters?</div>
+
<div style="background-color:#eeeeff">How many local parameters does HyPhy say it estimated?</div>
+
<div style="background-color:#eeeeff">What are these local parameters?'' (Hint: for n taxa, there are 2n-3 branches)</div>
+
  
Switch back to the Parameters window now and look at the very bottom of the window to answer these questions:
+
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.
<div style="background-color:#eeeeff">What is the total number of parameters estimated?</div>
+
<div style="background-color:#eeeeff">What is the value of AIC reported by HyPhy?</div>
+
<div style="background-color:#eeeeff">Calculate the AIC yourself using this formula: AIC = -2*lnL + 2*nparams</div>
+
  
Before moving on, save a snapshot of the likelihood function with the current parameter values by choosing "Save LF state" from the drop-down list box at the top of the Parameters window. Choose the name "unconstrained" when asked. After saving the state of the likelihood function, choose "Select as alternative" from the same drop-down list. This will allow us to easily perform likelihood ratio tests using another, simpler model as the null model.
+
== Creating the HyPhy batch file ==
  
=== Viewing the tree and obtaining information about branches ===
+
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your  <tt>bats</tt> directory.  
The first item in the Parameters window should be "wickett_tree". Double-click this line to bring up a Tree window showing the tree. You may need to expand the Tree window to see the entire tree. This shows the tree with branch lengths scaled to be proportional to the expected number of substitutions (the normal way to scale branch lengths).  
+
  
The next step is to compare the unconstrained model (in which there are the same number of omega parameters as there are branches) with simpler models involving fewer omega parameters. For example, one model you will use in a few minutes allows the three branches in the parasite clade to evolve under one omega, while all other branches evolve under an omega value that is potentially different. For future reference, you should determine now what name HyPhy is using for the branch leading to the two parasite taxa.
+
Download the data we will use for this analysis using curl:
  
Click on the branch leading to the two parasites. It should turn into a dotted line. Now double-click this branch and you should get a dialog box popping up with every bit of information known about this branch:
+
cd ~/bats
<div style="background-color: #eeeeff">What is the branch id for this branch that leads to the two parasite sequences?</div>
+
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex
You can now close the "Branch Info" dialog box.
+
  
== Computing the likelihood under the most-constrained model ==
+
This is what your directory structure should look like at this point:
Under the current (unconstrained) model, two parameters were estimated for each branch: the synonymous substitution rate and the nonsynonymous substitution rate. Now let's constrain each branch so that the ratio (omega) between the nonsynonymous rate and the synonymous rate is identical for all branches.  
+
bats/
 +
    simdata/
 +
    bats.bf
 +
    irbp.nex
  
To do this, first notice that each branch is represented by two parameters in the Parameter window. For example, the branch leading to Parasite_A is associated with these two parameters:
+
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:
wickett_tree.PARASITE_A.nonSynRate
+
wickett_tree.PARASITE_A.synRate
+
The goal is to constrain these two parameters so that the nonsynonymous rate is always omega times the synonymous rate, where omega is a new parameter shared by all branches.
+
  
Select the two parameters listed above for the branch leading to PARASITE_A. (You can do this by single-clicking both parameters while simultaneously holding down the Shift key.) Once you have both parameters selected, click on the third button from the left at the top of the Parameters window. This is the button decorated with the symbol for proportionality. Clicking this button will produce a long list of possiblities: here is the one you should choose:
+
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; 
  wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate
+
ACCEPT_ROOTED_TREES = TRUE; 
Once you select this option, HyPhy will ask for a name: type
+
ASSUME_REVERSIBLE_MODELS = -1;
  omega
+
as the name of the new ratio.
+
/*****************************************************************************************
 +
| Read in the data and store the result in the variable nucleotideSequences.
 +
*/
 +
DataSet nucleotideSequences = ReadDataFile("irbp.nex");
 +
 +
/*****************************************************************************************
 +
| Filter the data, specifying which sites are to be used. The first 1 means treat each
 +
| site separately (3 would cause the data to be interpreted as codons). The quoted
 +
| string (last argument) specifies which sites (where first site = 0) to use (we are
 +
| excluding sites with a lot of missing data and or alignment issues). This leaves us with
 +
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine
 +
  | exactly which sites were excluded in the original study.
 +
*/
 +
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");
 +
 +
/*****************************************************************************************
 +
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means
 +
| unit=1, atom=1, position-specific=1. These settings create one global set of
 +
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies,
 +
| which is not what we need because we will not be using a codon model).
 +
  */
 +
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);
 +
fprintf(stdout, observedFreqs);
  
Now select the two parameters for a different pair of branches, say
+
=== Run bats.bf ===
wickett_tree.PARASITE_B.nonSynRate
+
wickett_tree.PARASITE_B.synRate
+
Click the proportionality constraint button again, but this time choose
+
wickett_tree.PARASITE_B.nonSynRate:=omega*wickett_tree.PARASITE_B.synRate
+
Note that you can choose to use a constraint for other branches once you have defined it for one branch.
+
  
Continue to apply this constraint to all 19 remaining branches. When you are finished, choose ''Likelihood > Optimize'' from the menu at the top of the Parameters window.
+
Run your <tt>bats.bf</tt> file as follows:
 +
hyphy bats.bf
  
== Performing a model comparison ==
+
You should see the empirical nucleotide frequencies displayed as follows:
 +
{
 +
{0.1936054742803209}
 +
{0.3104845052697813}
 +
{0.3106418121755545}
 +
{0.1852682082743432}
 +
}
  
After HyPhy is finished optimizing the likelihood function, answer the following questions using the numbers at the bottom of the Parameters window:
+
I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.
<div style="background-color: #eeeeff">What is the estimated value of the omega parameter?</div>
+
<div style="background-color: #eeeeff">Does this value of omega imply stabilizing selection, neutral evolution or positive selection?</div>
+
<div style="background-color: #eeeeff">What is the maximized log-likelihood of this (most-constrained) model?</div>
+
<div style="background-color: #eeeeff">How many parameters are being estimated now?</div>
+
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy?</div>
+
<div style="background-color: #eeeeff">Does this most-constrained model fit the data better than the unconstrained model?</div>
+
<div style="background-color: #eeeeff">What is the difference between the log-likelihood of this (most-constrained) model and the log-likelihood of the previous (unconstrained) model?</div>
+
<div style="background-color: #eeeeff">What is the likelihood ratio test statistic for this comparison?</div>
+
<div style="background-color: #eeeeff">How many degrees of freedom does this likelihood ratio test have?</div>
+
<div style="background-color: #eeeeff">Is the likelihood ratio test significant? (click [http://faculty.vassar.edu/lowry/tabs.html#csq here] for an online chi-square calculator)</div>
+
<div style="background-color: #eeeeff">Is a model in which one value of omega applies to every branch satisfactory, or is there enough variation in omega across the tree that it is necessary for each branch to have its own specific omega parameter in order to fit the data well?</div>
+
<div style="background-color: #eeeeff">Does AIC concur with the likelihood ratio test? (Hint: models with smaller values of AIC are preferred over models with larger AIC values.)</div>
+
  
Although you should do the calculation yourself first, you can now have HyPhy perform the likelihood ratio test for you to check your calculations. In the drop-down list box at the top of the Parameters window, choose "Save LF state" and name it "most-constrained". Now, using the same list box, choose "Select as null". Now perform the test by choosing LRT from the same drop-down list box. The results should appear in the HYPHY Console window.
+
The <tt>fprintf</tt> command also mimics the C programming language. It allows you to print objects to <tt>stdout</tt> (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable <tt>observedFreqs</tt>.
  
== Computing the likelihood under a partially-constrained model ==
+
=== Define the HKY85 rate matrix ===
  
Let's try one more model that is intermediate between the unconstrained and most-constrained models you just analyzed. This model will allow for omega to be different in the non-green, parasitic clade compared to the remaining green, non-parasite part of the tree.
+
Add the following to the bottom of your <tt>bats.bf</tt> file:
 +
/*****************************************************************************************
 +
| Define the KHY substitution matrix. '*' is used for the diagonal elements that can be
 +
|.computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is
 +
| declared to be global, meaning it is shared by all edges.
 +
*/
 +
global kappa = 4.224618;
 +
HKY85RateMatrix =
 +
{{    *      ,    betat    ,    betat*kappa  ,    betat    }
 +
{    betat    ,    *      ,      betat    ,  betat*kappa }
 +
{ betat*kappa ,    betat    ,        *      ,    betat    }
 +
{    betat    , betat*kappa ,      betat    ,      *    }};
 +
fprintf(stdout, HKY85RateMatrix);
  
For one of the three branches in the parasite clade (say, the branch leading to PARASITE_A), select the two parameters associated with the branch and click the rightmost button at the top of the Parameters window (this button releases the constraint previously placed on these two parameters). With the two parameters still selected, click the proportionality constraint button again (third from left) and choose the option
+
Run <tt>bats.bf</tt> in HyPhy. Did the <tt>HKY85RateMatrix</tt> variable have the value you expected? Why or why not? (Hint: the variable <tt>betat</tt> was initialized to zero.) Set <tt>betat = 1;</tt> before the <tt>fprintf</tt> statement and run the batch file again to see the effect. Does the output make sense now?
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate
+
and specify
+
omega2
+
as the name of the New Ratio. Now apply this new ratio to the other two branches in the clade by first releasing the existing constraint and then applying the omega2 constraint.
+
  
Once you are finished, choose ''Likelihood > Optimize'' again to search for the maximum likelihood point. Now choose "Save LF state", naming this one "partially-constrained". Answer the following questions using the values shown in the Parameter window:
+
=== Combine frequencies with rate matrix to create an HKY85 model ===
<div style="background-color: #eeeeff">What is the maximized log-likelihood under this model?</div>
+
<div style="background-color: #eeeeff">How many parameters were estimated?</div>
+
<div style="background-color: #eeeeff">What is the value of omega now?</div>
+
<div style="background-color: #eeeeff">What is the value of omega2?</div>
+
<div style="background-color: #eeeeff">Which is higher: omega or omega2? Does this make sense in light of what you know about the organisms involved and the function of this gene?</div>
+
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy for this model?</div>
+
<div style="background-color: #eeeeff">Based on AIC, which of the three models tested thus far would you prefer?</div>
+
  
You can now perform a likelihood ratio test. Using the drop-down list box at the top of the Parameters window, specify the most-constrained model to be the null model and the partially-constrained model to be the alternative. Choose LRT from the drop-down list to perform the test.
+
/*****************************************************************************************
 +
| Define the HKY85 model, by combining the substitution matrix with the vector of
 +
| empirical frequencies.  
 +
*/
 +
Model hky1 = (HKY85RateMatrix, observedFreqs);
  
Perform one more likelihood ratio test, this time using the partially-constrained model as the null and the unconstrained model as the alternative.
+
Now we have a model variable (<tt>hky1</tt>) that can be applied to each edge of a tree.
<div style="background-color: #eeeeff">Do AIC and LRT agree on which model of the three models is best? Why or why not?</div>
+
 
 +
=== Create a tree representing the null hypothesis ===
 +
 
 +
The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.
 +
 
 +
/*****************************************************************************************
 +
| Define the tree variable, using the tree description read from the data file.
 +
| By default, the last defined model (hky1) is assigned to all edges of the tree.
 +
*/
 +
Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";
 +
 
 +
=== Creating the second model ===
 +
 
 +
Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the <tt>hky1</tt> model has been applied to every node in the tree. You can verify this using the following HBL code:
 +
 
 +
GetInformation(modelMap, constrainedTree);
 +
fprintf(stdout, modelMap);
 +
 
 +
Add these two lines to the bottom of your <tt>bats.bf</tt> file and run it. You should see this near the bottom of your output:
 +
{
 +
  "Bos":"hky1",
 +
  "Cynocephalus":"hky1",
 +
  "Didelphis":"hky1",
 +
  "Felis":"hky1",
 +
  "Galago":"hky1",
 +
  "Homo":"hky1",
 +
  "Mus":"hky1",
 +
  "Node1":"hky1",
 +
  "Node10":"hky1",
 +
  "Node11":"hky1",
 +
  "Node14":"hky1",
 +
  "Node19":"hky1",
 +
  "Node2":"hky1",
 +
  "Node3":"hky1",
 +
  "Node4":"hky1",
 +
  "Node5":"hky1",
 +
  "Node7":"hky1",
 +
  "Oryctolagus":"hky1",
 +
  "Pteropus":"hky1",
 +
  "Tarsius":"hky1",
 +
  "Tonatia_bidens":"hky1",
 +
  "Tonatia_silvicola":"hky1",
 +
  "Tupaia":"hky1"
 +
}
 +
 
 +
First, create a second model named <tt>hky2</tt> and apply it to four specific edges in the tree:
 +
 
 +
/*****************************************************************************************
 +
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges.
 +
*/
 +
highATfreqs = {{.45}{.05}{.05}{.45}};
 +
Model hky2 = (HKY85RateMatrix, highATfreqs);
 +
SetParameter(constrainedTree.microbats, MODEL, hky2);
 +
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);
 +
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);
 +
SetParameter(constrainedTree.Pteropus, MODEL, hky2);
 +
 
 +
If you move your <tt>GetInformation</tt> call and its accompanying <tt>fprintf</tt> after these 6 lines, then you should see this in the output:  
 +
{
 +
  "Bos":"hky1",
 +
  "Cynocephalus":"hky1",
 +
  "Didelphis":"hky1",
 +
  "Felis":"hky1",
 +
  "Galago":"hky1",
 +
  "Homo":"hky1",
 +
  "Mus":"hky1",
 +
  "Node1":"hky1",
 +
  "Node10":"hky1",
 +
  "Node11":"hky1",
 +
  "Node14":"hky1",
 +
  "Node2":"hky1",
 +
  "Node3":"hky1",
 +
  "Node4":"hky1",
 +
  "Node5":"hky1",
 +
  "Node7":"hky1",
 +
  "Oryctolagus":"hky1",
 +
  "Pteropus":"hky2",
 +
  "Tarsius":"hky1",
 +
  "Tonatia_bidens":"hky2",
 +
  "Tonatia_silvicola":"hky2",
 +
  "Tupaia":"hky1",
 +
  "microbats":"hky2"
 +
}
 +
 
 +
=== Don't get too comfortable just yet ===
 +
 
 +
It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:
 +
 
 +
fprintf(stdout, "\n\nCurrent edge lengths:\n");
 +
edgeNames = BranchName(constrainedTree,-1);
 +
edgeLengths = BranchLength(constrainedTree,-1);
 +
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {
 +
fprintf(stdout, Format(edgeLengths[k],10,5), "  ", edgeNames[k], "\n");
 +
}
 +
 
 +
The first line simply skips a couple of lines (<tt>\n\n</tt>) and prints a header announcing that current edge lengths will follow.
 +
 
 +
The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The <tt>-1</tt> in these functions is somewhat obscure, but means "give me all of them". (If you used <tt>5</tt> rather than <tt>-1</tt>, it would give you the name of the edge having index 5.)
 +
 
 +
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints out the length of edge <tt>k</tt> (the <tt>Format</tt> command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement causes a carriage return so that the edge lengths and names do not all end up on the same line of output.
 +
 
 +
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?
 +
 
 +
=== Fixing up edge lengths ===
 +
 
 +
The problem is that applying a new model to the 4 edge lengths caused the edge length information to be erased! We need to now set the <tt>betat</tt> parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The <tt>AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1</tt> that we placed at the top of the file saves us from having to go through this procedure for all the <tt>hky1</tt> edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.
 +
 
 +
Recall that, under the JC69 model, the edge length equals <tt>v = 3*beta*t</tt>. The <tt>betat</tt> parameter that appears in our <tt>HKY85RateMatrix</tt> is the <tt>beta*t</tt> part, so to set the <tt>betat</tt> parameter we need to divide the desired edge length by 3: that is, <tt>betat = v/3</tt>. The scaling factor 3 becomes more complex for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to <tt>betat</tt> values.
 +
 
 +
betat = 1.0;
 +
scalingFactor = 0.0;
 +
for (n1 = 0; n1 < 4; n1 = n1+1) {
 +
    for (n2 = 0; n2 < 4; n2 = n2+1) {
 +
        if (n2!=n1) {
 +
            scalingFactor = scalingFactor + highATfreqs[n1]*highATfreqs[n2]*HKY85RateMatrix[n1][n2];
 +
        }
 +
    }
 +
}
 +
 
 +
The two nested loops visit every off-diagonal element of the <tt>HKY85RateMatrix</tt>, multiply that element by the base frequency of its row and the base frequency of its column. This product of 3 terms is then added to the growing sum <tt>scalingFactor</tt>. Review your lecture notes if you don't remember why <tt>scalingFactor</tt> is computed in this way. Note that it is important to set <tt>betat = 1</tt> before the loop in order to ensure that the scaling factor works correctly.
 +
 
 +
Now we simply need to set the <tt>betat</tt> values for the four edges of interest:
 +
constrainedTree.microbats.betat          := 0.097851/scalingFactor;
 +
constrainedTree.Tonatia_bidens.betat  := 0.008252/scalingFactor;
 +
constrainedTree.Tonatia_silvicola.betat := 0.000000/scalingFactor;
 +
constrainedTree.Pteropus.betat            := 0.104663/scalingFactor;
 +
 
 +
Place the loop that shows edge lengths after these 4 lines in order to check and make sure the edge lengths have been set correctly.
 +
 
 +
=== Constructing the likelihood function ===
 +
 
 +
/*****************************************************************************************
 +
| Build the likelihood function and print it out, which causes HyPhy to actually compute
 +
| the log-likelihood for the tree.
 +
*/
 +
LikelihoodFunction likelihood = (filteredData, constrainedTree);
 +
fprintf(stdout, likelihood);
 +
 
 +
If you add these lines to the end of your growing <tt>bats.bf</tt> file and run it, you should see that the log-likelihood is (to 6 decimal places) equal to <tt>-6472.478318</tt>.
 +
 
 +
The  <tt>ASSUME_REVERSIBLE_MODELS = -1</tt> that we placed at the beginning of the batch file is needed to prevent HyPhy from assuming it can reroot the tree at any node it wants (which leads to trouble because the nucleotide composition changes across the tree and thus the rooting matters). The <tt>ACCEPT_ROOTED_TREES = TRUE</tt> at the top of the file prevents HyPhy from automatically converting our tree description (which specifies a rooted tree) into an unrooted tree.
 +
 
 +
=== Ready to simulate! ===
 +
 
 +
We are now ready to perform the simulations. As soon as you create a <tt>LikelihoodFunction</tt> object in HyPhy that is capable of computing the log-likelihood, that object can be used to simulate data because it has a tree with branches that have an assigned model of evolution and all parameters of the substitution models have been either specified (as we did here) or estimated.
 +
 
 +
Here is the simulation loop:
 +
/*****************************************************************************************
 +
| Perform the simulations.
 +
*/
 +
for (simCounter = 0; simCounter < 10; simCounter = simCounter+1) {
 +
    // Simulate a data set of the same size as the original set
 +
    DataSet simulatedData = SimulateDataSet(likelihood);
 +
 +
    // The filter is necessary, but trivial in this case because alls sites are used
 +
    DataSetFilter filteredSimData = CreateFilter(simulatedData,1);
 +
 +
    // Save simulated data to a file
 +
    outFile = "simdata/sim"+simCounter+".nex";
 +
    fprintf(outFile, filteredSimData);
 +
}
 +
 
 +
I have used both styles of comments here: the main comment before the loop is done using the <tt>/* ... */</tt> style, and the double-slash style is used for comments within the loop. The <tt>LikelihoodFunction</tt> object likelihood is passed to the <tt>SimulateDataSet</tt> function to generate the data. It is always necessary to create a <tt>DataSetFilter</tt> in HyPhy, even if no filtering occurs. If one prints <tt>filteredSimData</tt> using <tt>fprintf</tt> to <tt>stdout</tt>, then the entire data file would be spewed to the screen. That's not very helpful. Here we are giving <tt>fprintf</tt> a file name rather than <tt>stdout</tt>, which causes the simulated data to be saved to that file. The file name is constructed using <tt>simCounter</tt> so that every simulated file has a name that is different. Note that all simulated data will be saved in the <tt>simdata</tt> directory that you created early on in this tutorial.
 +
 
 +
== Using PAUP* to estimate parsimony trees for each simulated data set ==
 +
 
 +
HyPhy is great for estimating parameters on a tree using quite complex models; however, PAUP* excels at searching for the best tree, so we will leave HyPhy now and turn to PAUP* to finish our work.
 +
 
 +
Create the following file in the same directory as <tt>bats.bf</tt>. Call the new file <tt>parsimony.nex</tt>.
 +
#nexus
 +
 +
begin paup;
 +
    log file=simdata/pauplog.txt start replace;
 +
    set criterion=parsimony warnreset=no warntsave=no;
 +
end;
 +
 +
begin python;
 +
    for i in range(10):
 +
        paup.cmd("exe simdata/sim%d.nex;" % i)
 +
        paup.cmd("hsearch;")
 +
        paup.cmd("contree all;")
 +
        paup.cmd("constraints flyingdna (monophyly) = ((PTEROPUS, 'TONATIA_BIDENS', 'TONATIA_SILVICOLA'));")
 +
        paup.cmd("filter constraints=flyingdna;")
 +
end;
 +
 +
begin paup;
 +
    log stop;
 +
    quit;
 +
end;
 +
 
 +
This file contains 2 paup blocks with a python block sandwiched in between. That's right: PAUP* can execute python commands, and this comes in handy when you want to do the same thing over and over again, such as process a bunch of simulated data files in the exact same way.
 +
 
 +
The first paup block starts a log (which will be created in the <tt>simdata</tt> directory) and sets the optimality criterion to parsimony. It also tells PAUP* to not warn us when data files are reset or when we try to quit when there are still trees that haven't been saved.
 +
 
 +
The final paup bock just closes the log file and then quits.
 +
 
 +
The python block is where all the interesting things happen. As with python in general, be sure you have the indentation of lines correct inside the python block, otherwise you will get an error message from Python. You can see that the block is one big loop over simulation replicates. Commands that you want PAUP* to execute have to be created as strings (inside double quotes) and then passed to PAUP* via a <tt>paup.cmd</tt> function call.
 +
 
 +
The first <tt>paup.cmd</tt> executes a file named <tt>sim%d.nex</tt> inside the simdata directory, where <tt>%d</tt> is replaced by the value of <tt>i</tt>. Thus, the loop will, in turn, execute the files simdata/sim0.nex, simdata/sim1.nex, ..., simdata/sim10.nex.
 +
 
 +
The second <tt>paup.cmd</tt> performs a heuristic search with all default settings. You could make this more explicit/elaborate if you wished, but the default settings work well in this case.
 +
 
 +
The third <tt>paup.cmd</tt> creates (and shows) a strict consensus tree of all most parsimonious trees found during the search. There are often several best trees found during a parsimony search, and this shows us what is common to all these trees.
 +
 
 +
We could leave it at that, but the last two lines make it easier to tally how many simulation replicates resulted in a parsimony tree in which all bats form a monophyletic group. We first create a monophyly constraint named <tt>flyingdna</tt> and then filter the trees resulting from the parsimony search using the <tt>flyingdna</tt> constraint. Trees that satisfy the constraint are kept while all trees in which bats are not monophyletic are discarded. If any trees remain after filtering, we will count 1; if no trees remain after filtering, we will count 0. The total count divided by the number of simulation replicates will give us the y-value for the plot that recreates Figure 4 from the Vandenbussche et al. paper.
 +
 
 +
Run this file in PAUP* to generate the <tt>pauplog.txt</tt> file, then look through that file to see how many replicates yielded bat monophyly.
 +
 
 +
== Final exercise ==
 +
 
 +
Once you confirm that your scripts are working, run your <tt>bats.bf</tt> using HyPhy followed by running <tt>parsimony.nex</tt> in PAUP* a total of 6 times, each with a different value of <tt>highATfreqs</tt> that reflects one of these AT percentages: 50, 60, 70, 80, 90, 100. You may also wish to bump up the number of simulation replicates to at least 20 or 50 in both <tt>bats.bf</tt> and <tt>parsimony.nex</tt> so that you get more accurate y-axis values.
 +
 
 +
Note that you can use a command like that below to pull out only the lines that report the number of trees retained from the file <tt>pauplog.txt</tt>:
 +
cat simdata/pauplog.txt | grep "Number of trees retained by filter"
 +
The <tt>cat</tt> command simply dumps a file to the screen. Instead of sending the output to the console, the (<tt>|</tt>) causes the output to instead be piped into the <tt>grep</tt> command, which filters out everything except lines that contain the supplied string. This makes it easy to peform your counts.
 +
 
 +
How does your plot compare to the one published in Vandenbussche et al. (1998)?
 +
 
 +
== Literature Cited ==
 +
 
 +
NJ Wickett, Y Fan, P Lewis, and B Goffinet. 2008. Distribution and evolution of pseudogenes, gene losses, and a gene rearrangement in the plastid genome of the nonphotosynthetic liverwort, Aneura mirabilis (Metzgeriales, Jungermanniopsida). Journal of Molecular Evolution 67:111-122. [http://dx.doi.org/10.1007/s00239-008-9133-1 DOI:10.1007/s00239-008-9133-1]
 +
 
 +
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional
 +
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular
 +
Phylogenetics and Evolution 13(3): 408-416. [https://doi.org/10.1006/mpev.1998.0531 DOI:10.1006/mpev.1998.0531]
  
 
[[Category: Phylogenetics]]
 
[[Category: Phylogenetics]]

Latest revision as of 12:02, 26 February 2020

Adiantum.png EEB 349: Phylogenetics

Goals

The goal of this lab exercise is to show you how to use the HyPhy program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.

The lab is divided into two parts, both of which recreate analyses done in past papers.

The first part is designed to show you how easy it is to use HyPhy when what you want to do is available as a canned routine. In this example, we recreate part of the paper by Wickett et al. 2008 in which the goal was to demonstrate that selection is relaxed in a gene no longer being used. You may remember that Norm Wickett gave the keynote address at this year's EEB Graduate Student Symposium. This study was done when he was still here at UConn.

The second part shows that HyPhy allows you to test hypotheses that are nearly impossible to do with any other software, although going outside HyPhy's collection of canned analyses means you will have to deal with some nitty-gritty details. In this case we recreate an analysis done by Vandenbussche et al. (1998) that required software to be written de novo just for that paper. My hope is that you will remember this second analysis when it hits you some day that it would be really nice to test this beautiful hypothesis you've come up with, but can't think of any software designed to test that particular hypothesis and writing your own software is not within the realm of possibility. With HyPhy you may just be able to do it!

Login to Xanadu

Login to Xanadu and request a machine as usual:

srun --pty -p mcbstudent --qos=mcbstudent bash

Loading modules needed

Load the module needed for this exercise:

module load hyphy/2.5.4
module load paup/4.0a-166
module load python/2.7.8

Part 1

Start by creating a new directory called norm. Download the dataset (wickett.nex) and tree (wickett.newick) we will use for this analysis using curl:

cd ~/norm
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.nex
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.tre

This dataset contains several sequences of bryophytes, including two from a parasitic bryophyte (the liverwort Aneura mirabilis) that is non-green and does not photosynthesize. The sequences are of a gene important for photosynthesis. The question is, can we detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this)?

PhyloTree

The question we are asking is commonly asked, and is thus available as a canned analysis (RELAX) in HyPhy.

Before running HyPhy, take a look at the wickett.tre file:

cat wickett.tre

Note that the newick tree description has been modified in that three edges in the tree have been decorated with {Foreground}. This tells HyPhy which edges of the tree are suspected of being under relaxed selection. HyPhy has an online tool that makes it easy to decorate your tree accurately. Try downloading the wickett.tre file to your own computer and then uploading it to the PhyloTree web app. Use the Input text option from the Newick dropdown menu to choose the wickett.tre file you downloaded. Note how 3 edges are labeled with the Foreground tag. Try selecting a different set of edges and selecting Export from the Newick dropdown (don't worry, this will not affect the original file). Note how PhyloTree provides you with the correctly decorated tree description that you can simply copy and paste into your tree file. You need not use the tag Foreground; you can use whatever name you like as long as you supply that name for the --test option when invoking HyPhy.

RELAX

Run HyPhy as follows:

hyphy relax --alignment wickett.nex --tree wickett.tre --models Minimal --test Foreground

To save time, we will just do the Minimal analysis. Note that Foreground is the tag supplied via the --test option. If you had used a different name to tag the interesting edges, you should supply that name here rather than Foreground.

What does the HyPhy RELAX analysis conclude? Look for the section of output near the bottom that begins

## Test for relaxation (or intensification) of selection [RELAX]

While the RELAX analysis is not the same test Norm et al. carried out, the conclusion is the same. This gene formerly important for photosynthesis is experiencing much less stabilizing selection along the edges leading to or within the parasite clade compared to the reference edges in the remainder of the tree.

Part 2

In this part, we will recreate the very interesting parametric bootstrapping analysis performed in the paper by Vandenbussche et al. (1998). In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.

The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.

This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine.

Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.

Creating the HyPhy batch file

HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named bats.bf in a directory called bats (you can of course use any name you like, but these are the names I will be using). Also create a directory simdata inside your bats directory.

Download the data we will use for this analysis using curl:

cd ~/bats
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex

This is what your directory structure should look like at this point:

bats/
    simdata/
    bats.bf
    irbp.nex

Open bats.bf in your favorite text editor (e.g. nano) and enter the following text:

AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1;  
ACCEPT_ROOTED_TREES = TRUE;  
ASSUME_REVERSIBLE_MODELS = -1;

/*****************************************************************************************
| Read in the data and store the result in the variable nucleotideSequences.
*/
DataSet nucleotideSequences = ReadDataFile("irbp.nex");

/*****************************************************************************************
| Filter the data, specifying which sites are to be used. The first 1 means treat each 
| site separately (3 would cause the data to be interpreted as codons). The quoted 
| string (last argument) specifies which sites (where first site = 0) to use (we are 
| excluding sites with a lot of missing data and or alignment issues). This leaves us with
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine
| exactly which sites were excluded in the original study.
*/
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");

/*****************************************************************************************
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means
| unit=1, atom=1, position-specific=1. These settings create one global set of 
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, 
| which is not what we need because we will not be using a codon model).
*/
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);
fprintf(stdout, observedFreqs);

Run bats.bf

Run your bats.bf file as follows:

hyphy bats.bf

You should see the empirical nucleotide frequencies displayed as follows:

{
{0.1936054742803209}
{0.3104845052697813}
{0.3106418121755545}
{0.1852682082743432}
}

I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (/* comment */) or begin the line with two slashes (// comment). The three lines at the beginning will be explained when each becomes relevant.

The fprintf command also mimics the C programming language. It allows you to print objects to stdout (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable observedFreqs.

Define the HKY85 rate matrix

Add the following to the bottom of your bats.bf file:

/*****************************************************************************************
| Define the KHY substitution matrix. '*' is used for the diagonal elements that can be
|.computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is 
| declared to be global, meaning it is shared by all edges.
*/
global kappa = 4.224618;
HKY85RateMatrix = 
		{{     *      ,    betat    ,    betat*kappa  ,     betat    }
		 {    betat    ,     *      ,       betat     ,  betat*kappa }
		 { betat*kappa ,    betat    ,         *      ,     betat    }
		 {    betat    , betat*kappa ,       betat     ,       *     }};
fprintf(stdout, HKY85RateMatrix);

Run bats.bf in HyPhy. Did the HKY85RateMatrix variable have the value you expected? Why or why not? (Hint: the variable betat was initialized to zero.) Set betat = 1; before the fprintf statement and run the batch file again to see the effect. Does the output make sense now?

Combine frequencies with rate matrix to create an HKY85 model

/*****************************************************************************************
| Define the HKY85 model, by combining the substitution matrix with the vector of 
| empirical frequencies. 
*/
Model hky1 = (HKY85RateMatrix, observedFreqs);

Now we have a model variable (hky1) that can be applied to each edge of a tree.

Create a tree representing the null hypothesis

The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.

/*****************************************************************************************
| Define the tree variable, using the tree description read from the data file.
| By default, the last defined model (hky1) is assigned to all edges of the tree. 
*/
Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";

Creating the second model

Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the hky1 model has been applied to every node in the tree. You can verify this using the following HBL code:

GetInformation(modelMap, constrainedTree);
fprintf(stdout, modelMap);

Add these two lines to the bottom of your bats.bf file and run it. You should see this near the bottom of your output:

{
 "Bos":"hky1",
 "Cynocephalus":"hky1",
 "Didelphis":"hky1",
 "Felis":"hky1",
 "Galago":"hky1",
 "Homo":"hky1",
 "Mus":"hky1",
 "Node1":"hky1",
 "Node10":"hky1",
 "Node11":"hky1",
 "Node14":"hky1",
 "Node19":"hky1",
 "Node2":"hky1",
 "Node3":"hky1",
 "Node4":"hky1",
 "Node5":"hky1",
 "Node7":"hky1",
 "Oryctolagus":"hky1",
 "Pteropus":"hky1",
 "Tarsius":"hky1",
 "Tonatia_bidens":"hky1",
 "Tonatia_silvicola":"hky1",
 "Tupaia":"hky1"
}

First, create a second model named hky2 and apply it to four specific edges in the tree:

/*****************************************************************************************
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. 
*/
highATfreqs = {{.45}{.05}{.05}{.45}};
Model hky2 = (HKY85RateMatrix, highATfreqs);
SetParameter(constrainedTree.microbats, MODEL, hky2);
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);
SetParameter(constrainedTree.Pteropus, MODEL, hky2);

If you move your GetInformation call and its accompanying fprintf after these 6 lines, then you should see this in the output:

{
 "Bos":"hky1",
 "Cynocephalus":"hky1",
 "Didelphis":"hky1",
 "Felis":"hky1",
 "Galago":"hky1",
 "Homo":"hky1",
 "Mus":"hky1",
 "Node1":"hky1",
 "Node10":"hky1",
 "Node11":"hky1",
 "Node14":"hky1",
 "Node2":"hky1",
 "Node3":"hky1",
 "Node4":"hky1",
 "Node5":"hky1",
 "Node7":"hky1",
 "Oryctolagus":"hky1",
 "Pteropus":"hky2",
 "Tarsius":"hky1",
 "Tonatia_bidens":"hky2",
 "Tonatia_silvicola":"hky2",
 "Tupaia":"hky1",
 "microbats":"hky2"
}

Don't get too comfortable just yet

It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:

fprintf(stdout, "\n\nCurrent edge lengths:\n");
edgeNames 	= BranchName(constrainedTree,-1); 
edgeLengths	= BranchLength(constrainedTree,-1);
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {
	fprintf(stdout, Format(edgeLengths[k],10,5), "  ", edgeNames[k], "\n");
}

The first line simply skips a couple of lines (\n\n) and prints a header announcing that current edge lengths will follow.

The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The -1 in these functions is somewhat obscure, but means "give me all of them". (If you used 5 rather than -1, it would give you the name of the edge having index 5.)

The last 3 lines are a loop in which the variable k ranges from 0 to the number of edges minus 1. For each value of k, the fprintf statement prints out the length of edge k (the Format command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge k. The newline character ("\n") at the end of the fprintf statement causes a carriage return so that the edge lengths and names do not all end up on the same line of output.

After running bats.bf, what possibly important detail do you notice about the lengths of the edges to which we attached the hky2 model?

Fixing up edge lengths

The problem is that applying a new model to the 4 edge lengths caused the edge length information to be erased! We need to now set the betat parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1 that we placed at the top of the file saves us from having to go through this procedure for all the hky1 edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.

Recall that, under the JC69 model, the edge length equals v = 3*beta*t. The betat parameter that appears in our HKY85RateMatrix is the beta*t part, so to set the betat parameter we need to divide the desired edge length by 3: that is, betat = v/3. The scaling factor 3 becomes more complex for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to betat values.

betat = 1.0;
scalingFactor = 0.0;
for (n1 = 0; n1 < 4; n1 = n1+1) {
    for (n2 = 0; n2 < 4; n2 = n2+1) {
        if (n2!=n1) {
            scalingFactor = scalingFactor + highATfreqs[n1]*highATfreqs[n2]*HKY85RateMatrix[n1][n2];
        }
    }
}

The two nested loops visit every off-diagonal element of the HKY85RateMatrix, multiply that element by the base frequency of its row and the base frequency of its column. This product of 3 terms is then added to the growing sum scalingFactor. Review your lecture notes if you don't remember why scalingFactor is computed in this way. Note that it is important to set betat = 1 before the loop in order to ensure that the scaling factor works correctly.

Now we simply need to set the betat values for the four edges of interest:

constrainedTree.microbats.betat           := 0.097851/scalingFactor;
constrainedTree.Tonatia_bidens.betat   := 0.008252/scalingFactor;
constrainedTree.Tonatia_silvicola.betat := 0.000000/scalingFactor;
constrainedTree.Pteropus.betat             := 0.104663/scalingFactor;

Place the loop that shows edge lengths after these 4 lines in order to check and make sure the edge lengths have been set correctly.

Constructing the likelihood function

/*****************************************************************************************
| Build the likelihood function and print it out, which causes HyPhy to actually compute
| the log-likelihood for the tree.
*/
LikelihoodFunction likelihood = (filteredData, constrainedTree);
fprintf(stdout, likelihood);

If you add these lines to the end of your growing bats.bf file and run it, you should see that the log-likelihood is (to 6 decimal places) equal to -6472.478318.

The ASSUME_REVERSIBLE_MODELS = -1 that we placed at the beginning of the batch file is needed to prevent HyPhy from assuming it can reroot the tree at any node it wants (which leads to trouble because the nucleotide composition changes across the tree and thus the rooting matters). The ACCEPT_ROOTED_TREES = TRUE at the top of the file prevents HyPhy from automatically converting our tree description (which specifies a rooted tree) into an unrooted tree.

Ready to simulate!

We are now ready to perform the simulations. As soon as you create a LikelihoodFunction object in HyPhy that is capable of computing the log-likelihood, that object can be used to simulate data because it has a tree with branches that have an assigned model of evolution and all parameters of the substitution models have been either specified (as we did here) or estimated.

Here is the simulation loop:

/*****************************************************************************************
| Perform the simulations. 
*/
for (simCounter = 0; simCounter < 10; simCounter = simCounter+1) {
    // Simulate a data set of the same size as the original set
    DataSet simulatedData = SimulateDataSet(likelihood);

    // The filter is necessary, but trivial in this case because alls sites are used 
    DataSetFilter filteredSimData = CreateFilter(simulatedData,1);

    // Save simulated data to a file
    outFile = "simdata/sim"+simCounter+".nex";
    fprintf(outFile, filteredSimData);
}

I have used both styles of comments here: the main comment before the loop is done using the /* ... */ style, and the double-slash style is used for comments within the loop. The LikelihoodFunction object likelihood is passed to the SimulateDataSet function to generate the data. It is always necessary to create a DataSetFilter in HyPhy, even if no filtering occurs. If one prints filteredSimData using fprintf to stdout, then the entire data file would be spewed to the screen. That's not very helpful. Here we are giving fprintf a file name rather than stdout, which causes the simulated data to be saved to that file. The file name is constructed using simCounter so that every simulated file has a name that is different. Note that all simulated data will be saved in the simdata directory that you created early on in this tutorial.

Using PAUP* to estimate parsimony trees for each simulated data set

HyPhy is great for estimating parameters on a tree using quite complex models; however, PAUP* excels at searching for the best tree, so we will leave HyPhy now and turn to PAUP* to finish our work.

Create the following file in the same directory as bats.bf. Call the new file parsimony.nex.

#nexus

begin paup;
    log file=simdata/pauplog.txt start replace;
    set criterion=parsimony warnreset=no warntsave=no;
end;

begin python;
    for i in range(10):
        paup.cmd("exe simdata/sim%d.nex;" % i)
        paup.cmd("hsearch;")
        paup.cmd("contree all;")
        paup.cmd("constraints flyingdna (monophyly) = ((PTEROPUS, 'TONATIA_BIDENS', 'TONATIA_SILVICOLA'));")
        paup.cmd("filter constraints=flyingdna;")
end;

begin paup;
    log stop;
    quit;
end;

This file contains 2 paup blocks with a python block sandwiched in between. That's right: PAUP* can execute python commands, and this comes in handy when you want to do the same thing over and over again, such as process a bunch of simulated data files in the exact same way.

The first paup block starts a log (which will be created in the simdata directory) and sets the optimality criterion to parsimony. It also tells PAUP* to not warn us when data files are reset or when we try to quit when there are still trees that haven't been saved.

The final paup bock just closes the log file and then quits.

The python block is where all the interesting things happen. As with python in general, be sure you have the indentation of lines correct inside the python block, otherwise you will get an error message from Python. You can see that the block is one big loop over simulation replicates. Commands that you want PAUP* to execute have to be created as strings (inside double quotes) and then passed to PAUP* via a paup.cmd function call.

The first paup.cmd executes a file named sim%d.nex inside the simdata directory, where %d is replaced by the value of i. Thus, the loop will, in turn, execute the files simdata/sim0.nex, simdata/sim1.nex, ..., simdata/sim10.nex.

The second paup.cmd performs a heuristic search with all default settings. You could make this more explicit/elaborate if you wished, but the default settings work well in this case.

The third paup.cmd creates (and shows) a strict consensus tree of all most parsimonious trees found during the search. There are often several best trees found during a parsimony search, and this shows us what is common to all these trees.

We could leave it at that, but the last two lines make it easier to tally how many simulation replicates resulted in a parsimony tree in which all bats form a monophyletic group. We first create a monophyly constraint named flyingdna and then filter the trees resulting from the parsimony search using the flyingdna constraint. Trees that satisfy the constraint are kept while all trees in which bats are not monophyletic are discarded. If any trees remain after filtering, we will count 1; if no trees remain after filtering, we will count 0. The total count divided by the number of simulation replicates will give us the y-value for the plot that recreates Figure 4 from the Vandenbussche et al. paper.

Run this file in PAUP* to generate the pauplog.txt file, then look through that file to see how many replicates yielded bat monophyly.

Final exercise

Once you confirm that your scripts are working, run your bats.bf using HyPhy followed by running parsimony.nex in PAUP* a total of 6 times, each with a different value of highATfreqs that reflects one of these AT percentages: 50, 60, 70, 80, 90, 100. You may also wish to bump up the number of simulation replicates to at least 20 or 50 in both bats.bf and parsimony.nex so that you get more accurate y-axis values.

Note that you can use a command like that below to pull out only the lines that report the number of trees retained from the file pauplog.txt:

cat simdata/pauplog.txt | grep "Number of trees retained by filter"

The cat command simply dumps a file to the screen. Instead of sending the output to the console, the (|) causes the output to instead be piped into the grep command, which filters out everything except lines that contain the supplied string. This makes it easy to peform your counts.

How does your plot compare to the one published in Vandenbussche et al. (1998)?

Literature Cited

NJ Wickett, Y Fan, P Lewis, and B Goffinet. 2008. Distribution and evolution of pseudogenes, gene losses, and a gene rearrangement in the plastid genome of the nonphotosynthetic liverwort, Aneura mirabilis (Metzgeriales, Jungermanniopsida). Journal of Molecular Evolution 67:111-122. DOI:10.1007/s00239-008-9133-1

RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular Phylogenetics and Evolution 13(3): 408-416. DOI:10.1006/mpev.1998.0531