Difference between revisions of "Phylogenetics: HyPhy Lab"

From EEBedia
Jump to: navigation, search
(Loading modules needed)
Line 8: Line 8:
 
== Goal ==
 
== Goal ==
  
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allow the model of evolution to change across a tree.
+
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.
 
+
== Obtaining the sequences ==
+
 
+
A Nexus data file containing sequences and a tree is located here: [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.nex wickett.nex]. This dataset was assembled by former UConn EEB graduate student [http://www.chicagobotanic.org/research/staff/wickett Norm Wickett] and contains several sequences of bryophytes, including two from a parasitic bryophyte (the liverwort <em>Aneura mirabilis</em>) that is non-green and does not photosynthesize. Today's lab will recreate the type of analysis Norm carried out in [http://dx.doi.org/10.1007/s00239-008-9133-1 his 2008 paper in Journal of Molecular Evolution (67:111-122)].
+
 
+
The sequences are of a gene important for photosynthesis. The basic idea behind today's lab is to see if we can detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this).
+
  
 
== Login to Xanadu ==
 
== Login to Xanadu ==
Line 20: Line 14:
 
Login to Xanadu and request a machine as usual:
 
Login to Xanadu and request a machine as usual:
 
  srun --pty -p mcbstudent --qos=mcbstudent bash
 
  srun --pty -p mcbstudent --qos=mcbstudent bash
 
<!--
 
== HyPhy ==
 
 
I have requested that the latest version of [HyPhy http://www.hyphy.org] be installed on the Xanadu cluster, but that hasn't happened yet. I have placed the executable file in the scratch folder. You should create a <tt>bin</tt> directory in your home directory (if you haven't already done so) and copy the hyphy "binary" file there:
 
cd
 
mkdir bin
 
cp /scratch/phylogenetics/hyphy ~/bin
 
-->
 
  
 
== Loading modules needed ==
 
== Loading modules needed ==
Line 36: Line 21:
 
  module load paup/4.0a-166
 
  module load paup/4.0a-166
  
== Loading data into HyPhy ==
+
== The goal of this lab ==
 
+
Start HyPhy and dismiss the "Welcome to HyPhy" dialog box (if it appears) by pressing the Ok button. Choose ''File > Open > Open Data File'', then navigate to and select the <tt>wickett.nex</tt> data file that you saved previously. You should now see the sequences appear in a window entitled "DataSet wickett". I will refer to this as the '''Data window''' from this point on.
+
 
+
== Creating a partition ==
+
 
+
HyPhy thinks of your data as being composed of one or more '''partitions'''. Partitioning data means assigning characters (sites) into mutually-exclusive groups. For example, suppose your data set comprises two genes: you might want to assign a separate model for each gene, so in this case you would create two partitions (one for each gene).
+
 
+
=== The word partition is used in two ways ===
+
The word partition is ambiguous: it formerly meant "wall" or "divider" but, with the advent of computer hard drives, it has also come to mean the space ''between'' the walls or dividers. When someone says they ''partitioned their data'', they mean that they erected dividers, for example between the rbcL and 18S genes. When someone says they ''applied a GTR+I+G model to the rbcL partition'', they have now switched to using the word partition to mean the sites on the rbcL side of the divider.
+
 
+
=== No partitioning implies one partition! ===
+
Even if you choose to '''not''' partition (old meaning) your data in HyPhy, you must go through the motions of creating a single partition (new meaning) because HyPhy only allows you to apply a model to a partition. To create a single partition containing all of your sites, choose ''Edit > Select All'' from the Data window menu, then choose ''Data > Selection->Partition'' to assign all the selected sites to a new partition. You should see a line appear below your sequences with a partition name "wickett_part".
+
 
+
=== Assign a data type to your partition ===
+
Now that you have a partition, you can create a model for it. Under the column name ''Partition Type'', choose ''codon'' (just press the Ok button in the dialog box that appears). You have now chosen to view your data as codons (i.e. three nucleotides at a time) rather than as single nucleotides. The third possible choice for Partition Type is ''Di-nucl.'', which you would use if you were planning to use a secondary structure (i.e. stem) model, which treats each sequential pair of nucleotides as a state.
+
 
+
=== Assign a tree topology to your partition ===
+
Under Tree Topology, you have several options. Because a tree topology was defined in the <tt>wickett.nex</tt> data file, this tree topology shows up in the drop-down list as <tt>wickett_tree</tt>. Choose <tt>wickett_tree</tt> as the tree topology for your partition.
+
 
+
=== Assign a substitution model to your partition ===
+
The only substitution models that show up in the drop-down list are codon models because earlier you chose to treat your data as codon sequences rather than nucleotide sequences. The substitution model you should use is ''MG94xHKY85_3x4''. This model is like the Muse and Gaut (1994) codon model, which is the only codon model I discussed in lecture. You will remember (I'm sure) that the MG94 model allows substitutions to be either synonymous or non-synonymous, but does not make a distinction between transitions and transversions. The HKY85 model distinguishes between transitions and transversions (remember kappa?), but does not distinguish between synonymous and non-synonymous substitutions. Thus, MG94xHKY85 is a hybrid model that allows all four possibilities: synonymous transitions, synonymous transversions, nonsynonymous transitions and nonsynonymous transversions. The name is nevertheless a bit puzzling because (as you will find out in a few minutes) it actually behaves more like the GTR model than the HKY model in that it allows all 6 possible types of substitutions (A<->C, A<->G, A<->T, C<->G, C<->T and G<->T) to have their own rates.
+
 
+
The 3x4 part on the end of the name means that the 61 codon frequencies are obtained by multiplying together the four nucleotide frequencies that are estimated separately for the three codon positions. Thus, the frequency for the AGT codon is obtained by multiplying together these three quantities:
+
* the frequency of A nucleotides at first positions
+
* the frequency of G nucleotides at second positions
+
* the frequency of T nucleotides at third positions
+
(Note: HyPhy corrects these for the fact that the three stop codons are not included.)
+
This involves estimating the '''4''' nucleotides frequencies at each of the '''3''' codon positions, hence the '''3x4''' in the name.
+
 
+
=== Local vs. global ===
+
You have only a couple more decisions to make before calculating the likelihood. You must choose Local or Global from the Parameters drop-down list. '''Local''' means that HyPhy will estimate some substitution model parameters for every branch in the tree. '''Global''' means that all substitution model parameters will apply to the entire tree. In all the models discussed thus far in the course, we were effectively using the global option except for the branch lengths themselves, which are always local parameters (it doesn't usually make any sense to think of every branch having the same length).
+
 
+
Tell HyPhy to use the Local option (this should already be set correctly).
+
 
+
=== Equilibrium frequencies===
+
You should also leave the equilibrium frequencies set to "Partition". This sets the equilibrium base frequencies to the empirical values (i.e. the frequency of A is the number of As observed in the entire partition divided by the total number of nucleotides in the partition). Other options include:
+
* Dataset, which would not be different than "Partition" in this case where there is only one partition defined,
+
* Equal, which sets all base frequencies equal to 0.25, and
+
* Estimate, which estimates the base frequencies
+
 
+
== Computing the likelihood under a local codon model ==
+
 
+
You are now ready to compute the maximum likelihood estimates of the parameters in your model. Choose ''Likelihood > Build Function'' to build a likelihood function, then ''Likelihood > Optimize'' to optimize the likelihood function (i.e. search for the highest point on the likelihood surface, thus obtaining maximum likelihood estimates of all parameters).
+
 
+
=== Saving the results ===
+
When HyPhy has finished optimizing (this will take several seconds to several minutes, depending on the speed of the computer you are using), it will pop up a "Likelihood parameters for wickett" window (hereafter I will just refer to this as the '''Parameters window''') showing you values for all the quantities it estimated.
+
 
+
Click on the '''HYPHY Console window''' to bring it to the foreground, then, using the scroll bar to move up if needed, answer the following questions:
+
<div style="background-color:#eeeeff">What is the maximum log-likelihood under this unconstrained model? {{title|-4203.47237161049|answer}}</div>
+
<div style="background-color:#eeeeff">How many shared (i.e. global) parameters does HyPhy say it estimated? {{title|1|answer}}</div>
+
<div style="background-color:#eeeeff">What are these global parameters? {{title|tree topology|answer}}</div>
+
<div style="background-color:#eeeeff">How many local parameters does HyPhy say it estimated? {{title|42|answer}}</div>
+
<div style="background-color:#eeeeff">What are these local parameters?'' (Hint: for n taxa, there are 2n-3 branches) {{title|synonymous and nonsynonymous rate for each of the 21 edges for 12 taxa|answer}}</div>
+
 
+
Switch back to the Parameters window now and look at the very bottom of the window to answer these questions:
+
<div style="background-color:#eeeeff">What is the total number of parameters estimated? {{title|43|answer}}</div>
+
<div style="background-color:#eeeeff">What is the value of AIC reported by HyPhy? {{title|8492.944743220985|answer}}</div>
+
<div style="background-color:#eeeeff">Calculate the AIC yourself using this formula: AIC = -2*lnL + 2*nparams {{title|8492.944743221|answer}}</div>
+
 
+
Before moving on, save a snapshot of the likelihood function with the current parameter values by choosing "Save LF state" from the drop-down list box at the top of the Parameters window. Choose the name "unconstrained" when asked. After saving the state of the likelihood function, choose "Select as alternative" from the same drop-down list. This will allow us to easily perform likelihood ratio tests using another, simpler model as the null model.
+
 
+
=== Viewing the tree and obtaining information about branches ===
+
The first item in the Parameters window should be "wickett_tree". Double-click this line to bring up a Tree window showing the tree. You may need to expand the Tree window to see the entire tree. This shows the tree with branch lengths scaled to be proportional to the expected number of substitutions (the normal way to scale branch lengths).
+
 
+
The next step is to compare the unconstrained model (in which there are the same number of omega parameters as there are branches) with simpler models involving fewer omega parameters. For example, one model you will use in a few minutes allows the three branches in the parasite clade to evolve under one omega, while all other branches evolve under an omega value that is potentially different. For future reference, you should determine now what name HyPhy is using for the branch leading to the two parasite taxa.
+
 
+
Click on the branch leading to the two parasites. It should turn into a dotted line. Now double-click this branch and you should get a dialog box popping up with every bit of information known about this branch:
+
<div style="background-color: #eeeeff">What is the branch id for this branch that leads to the two parasite sequences? {{title|Node10|answer}}</div>
+
You can now close the "Branch Info" dialog box.
+
 
+
== Computing the likelihood under the most-constrained model ==
+
Under the current (unconstrained) model, two parameters were estimated for each branch: the synonymous substitution rate and the nonsynonymous substitution rate. Now let's constrain each branch so that the ratio (omega) between the nonsynonymous rate and the synonymous rate is identical for all branches.
+
 
+
To do this, first notice that each branch is represented by two parameters in the Parameter window. For example, the branch leading to Parasite_A is associated with these two parameters:
+
wickett_tree.PARASITE_A.nonSynRate
+
wickett_tree.PARASITE_A.synRate
+
The goal is to constrain these two parameters so that the nonsynonymous rate is always omega times the synonymous rate, where omega is a new parameter shared by all branches.
+
 
+
Select the two parameters listed above for the branch leading to PARASITE_A. (You can do this by single-clicking both parameters while simultaneously holding down the Shift key.) Once you have both parameters selected, click on the third button from the left at the top of the Parameters window. This is the button decorated with the symbol for proportionality. Clicking this button will produce a long list of possiblities: here is the one you should choose:
+
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate
+
Once you select this option, HyPhy will ask for a name: type
+
omega
+
as the name of the new ratio.
+
 
+
Now select the two parameters for a different pair of branches, say
+
wickett_tree.PARASITE_B.nonSynRate
+
wickett_tree.PARASITE_B.synRate
+
Click the proportionality constraint button again, but this time choose
+
wickett_tree.PARASITE_B.nonSynRate:=omega*wickett_tree.PARASITE_B.synRate
+
Note that you can choose to use a constraint for other branches once you have defined it for one branch.
+
 
+
Continue to apply this constraint to all 19 remaining branches. When you are finished, choose ''Likelihood > Optimize'' from the menu at the top of the Parameters window.
+
 
+
== Performing a model comparison ==
+
 
+
After HyPhy is finished optimizing the likelihood function, answer the following questions using the numbers at the bottom of the Parameters window:
+
<div style="background-color: #eeeeff">What is the estimated value of the omega parameter? {{title|0.0247457593714435|answer}}</div>
+
<div style="background-color: #eeeeff">Does this value of omega imply stabilizing selection, neutral evolution or positive selection? {{title|stabilizing selection|answer}}</div>
+
<div style="background-color: #eeeeff">What is the maximized log-likelihood of this (most-constrained) model? {{title|-4224.870964230792|answer}}</div>
+
<div style="background-color: #eeeeff">How many parameters are being estimated now? {{title|23|answer}}</div>
+
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy? {{title|8495.741928461584|answer}}</div>
+
<div style="background-color: #eeeeff">Does this most-constrained model fit the data better than the unconstrained model? {{title|no|answer}}</div>
+
<div style="background-color: #eeeeff">What is the difference between the log-likelihood of this (most-constrained) model and the log-likelihood of the previous (unconstrained) model? {{title|21.3986|answer}}</div>
+
<div style="background-color: #eeeeff">What is the likelihood ratio test statistic for this comparison? {{title|42.7972|answer}}</div>
+
<div style="background-color: #eeeeff">How many degrees of freedom does this likelihood ratio test have? {{title|20|answer}}</div>
+
<div style="background-color: #eeeeff">Is the likelihood ratio test significant? (click [http://faculty.vassar.edu/lowry/tabs.html#csq here] for an online chi-square calculator) {{title|yes at significance level 0.00217427|answer}}</div>
+
<div style="background-color: #eeeeff">Is a model in which one value of omega applies to every branch satisfactory, or is there enough variation in omega across the tree that it is necessary for each branch to have its own specific omega parameter in order to fit the data well? {{title|each branch needs its own omega|answer}}</div>
+
<div style="background-color: #eeeeff">Does AIC concur with the likelihood ratio test? (Hint: models with smaller values of AIC are preferred over models with larger AIC values.) {{title|yes, the 8492 AIC for the unconstrained model is less than the 8495 AIC for the most constrained model|answer}}</div>
+
 
+
Although you should do the calculation yourself first, you can now have HyPhy perform the likelihood ratio test for you to check your calculations. In the drop-down list box at the top of the Parameters window, choose "Save LF state" and name it "most-constrained". Now, using the same list box, choose "Select as null". Now perform the test by choosing LRT from the same drop-down list box. The results should appear in the HYPHY Console window.
+
 
+
== Computing the likelihood under a partially-constrained model ==
+
  
Let's try one more model that is intermediate between the unconstrained and most-constrained models you just analyzed. This model will allow for omega to be different in the non-green, parasitic clade compared to the remaining green, non-parasite part of the tree.
+
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:
  
For one of the three branches in the parasite clade (say, the branch leading to PARASITE_A), select the two parameters associated with the branch and click the rightmost button at the top of the Parameters window (this button releases the constraint previously placed on these two parameters). With the two parameters still selected, click the proportionality constraint button again (third from left) and choose the option
+
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate
+
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular
and specify
+
Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]
omega2
+
as the name of the New Ratio. Now apply this new ratio to the other two branches in the clade by first releasing the existing constraint and then applying the omega2 constraint.
+
  
Once you are finished, choose ''Likelihood > Optimize'' again to search for the maximum likelihood point. Now choose "Save LF state", naming this one "partially-constrained". Answer the following questions using the values shown in the Parameter window:
+
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.
<div style="background-color: #eeeeff">What is the maximized log-likelihood under this (partially-constrained) model? {{title|-4221.520105501849|answer}}</div>
+
<div style="background-color: #eeeeff">How many parameters were estimated? {{title|24|answer}}</div>
+
<div style="background-color: #eeeeff">What is the value of omega now? {{title|0.02294237571140109|answer}}</div>
+
<div style="background-color: #eeeeff">What is the value of omega2? {{title|0.1183733251322421|answer}}</div>
+
<div style="background-color: #eeeeff">Which is higher: omega or omega2?  {{title|omega2|answer}} Does this make sense in light of what you know about the organisms involved and the function of this gene? {{title|yes, selection is expected to be closer to neutral on this gene|answer}}</div>
+
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy for this model? {{title|8491.040211003698|answer}}</div>
+
<div style="background-color: #eeeeff">Based on AIC, which of the three models tested thus far would you prefer? {{title|xxxx|answer}}</div>
+
  
You can now perform a likelihood ratio test. Using the drop-down list box at the top of the Parameters window, specify the most-constrained model to be the null model and the partially-constrained model to be the alternative. Choose LRT from the drop-down list to perform the test.
+
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.
<div style="background-color: #eeeeff">Does the partially-constrained model fit the data significantly better than the most-constrained model?  {{title|yes, at significance level 0.00963201|answer}}</div>
+
  
Perform one more likelihood ratio test, this time using the partially-constrained model as the null and the unconstrained model as the alternative.
+
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine.  
<div style="background-color: #eeeeff">Does the unconstrained model fit the data significantly better than the partially-constrained model?  {{title|yes, at significance level 0.0102742|answer}}</div>
+
<div style="background-color: #eeeeff">Do AIC and LRT agree on which model of the three models is best?  {{title|no, AIC favors the partially-constrained model, while LRT favors the unconstrained model|answer}} Why or why not?  {{title|To win using AIC, a model must increase the lnL by 1 for each additional parameter. The unconstrained model increases lnL by 18 but requires 19 parameters more than the partially-constrained model. Note that the LRT favors the unconstrained model but only at the 0.01 significance level|answer}}</div>
+
  
<!--
+
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.
Summary of results:
+
unconstrained
+
params: 43
+
lnL:    -4203.47237 (*** best ***)
+
AIC:   8492.94474 (middle)
+
  
partially-constrained
+
== Creating the HyPhy batch file ==
params: 24
+
lnL:    -4221.52011 (middle)
+
AIC:    8491.04021 (*** best ***)
+
  
most-constained
+
HyPhy is its own scripting language. Like Python, a script written in HyPhy's batch language (HBL) can be run from the command line to carry out phylogenetic analyses. Start by creating a file named bats
params: 23
+
lnL:    -4224.87096 (worst)
+
AIC:    8495.74193 (worst)
+
-->
+
  
 
[[Category: Phylogenetics]]
 
[[Category: Phylogenetics]]

Revision as of 14:48, 24 February 2020

Adiantum.png EEB 349: Phylogenetics

Goal

The goal of this lab exercise is to show you how to use the HyPhy program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.

Login to Xanadu

Login to Xanadu and request a machine as usual:

srun --pty -p mcbstudent --qos=mcbstudent bash

Loading modules needed

Load the module needed for this exercise:

module load hyphy/2.3.11
module load paup/4.0a-166

The goal of this lab

In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:

RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]

In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.

The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.

This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine.

Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.

Creating the HyPhy batch file

HyPhy is its own scripting language. Like Python, a script written in HyPhy's batch language (HBL) can be run from the command line to carry out phylogenetic analyses. Start by creating a file named bats