Phylogenetics: HyPhy Lab

From EEBedia
Revision as of 20:41, 24 February 2007 by PaulLewis (Talk | contribs) (Assign a substitution model to your partition)

Jump to: navigation, search
Under construction.png This article is still under construction.
Expect it to change frequently until this notice is removed.
Adiantum.png EEB 349: Phylogenetics
The goal of this lab exercise is to show you how to use the HyPhy program for data exploration and hypothesis testing within a maximum likelihood framework.

Part A: Obtaining the sequences

A Nexus data file containing sequences and a tree is located here: wickett.nex. This dataset was assembled by our own Norm Wickett and contains several sequences of bryophytes, including two from a parasitic bryophyte that is non-green and does not photosynthesize. Because Norm's sequences have not yet been published, the names of the taxa have been changed, but the parasitic ones are clearly labeled.

The sequences are of a gene important for photosynthesis. The basic idea behind today's lab is to see if we can detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this).

Part B: Using a codon model

Although HyPhy can utilize the same standard models found in PAUP*, it also lets you to do some interesting and useful things that PAUP* cannot, such as (1) use codon and secondary structure models, and (2) allow the model of evolution to change across a tree.

Loading data into HyPhy

Start HyPhy and dismiss the "Welcome to HyPhy" dialog box (if it appears) by pressing the Ok button. Choose File > Open > Open Data File, then navigate to and select the wickett.nex data file that you saved previously. You should now see the sequences appear in a window entitled "DataSet wickett". I will refer to this as the Data window from this point on.

Creating a partition

HyPhy thinks of your data as being composed of one or more partitions. Partitioning data means assigning characters (sites) into mutually-exclusive groups. For example, suppose your data set comprises two genes: you might want to assign a separate model for each gene, so in this case you would create two partitions (one for each gene).

The word partition is used in two ways

The word partition is ambiguous: it formerly meant "wall" or "divider" but, with the advent of computer hard drives, it has also come to mean the space between the walls or dividers. When someone says they partitioned their data, they mean that they erected dividers, for example between the rbcL and 18S genes. When someone says they applied a GTR+I+G model to the rbcL partition, they have now switched to using the word partition to mean the sites on the rbcL side of the divider.

No partitioning implies one partition!

Even if you choose to not partition (old meaning) your data in HyPhy, you must go through the motions of creating a single partition (new meaning) because HyPhy only allows you to apply a model to a partition. To create a single partition containing all of your sites, choose Edit > Select All from the Data window menu, then choose Data > Selection->Partition to assign all the selected sites to a new partition. You should see a line appear below your sequences with a partition name "wickett_part".

Assign a data type to your partition

Now that you have a partition, you can create a model for it. Under the column name Partition Type, choose codon (just press the Ok button in the dialog box that appears). You have now chosen to view your data as codons (i.e. three nucleotides at a time) rather than as single nucleotides. The third possible choice for Partition Type is Di-nucl., which you would use if you were planning to use a secondary structure (i.e. stem) model, which treats each sequential pair of nucleotides as a state.

Assign a tree topology to your partition

Under Tree Topology, you have several options. Because a tree topology was defined in the wickett.nex data file, this tree topology shows up in the drop-down list as wickett_tree. Choose wickett_tree as the tree topology for your partition.

Assign a substitution model to your partition

The only substitution models that show up in the drop-down list are codon models because earlier you chose to treat your data as codon sequences rather than nucleotide sequences. The substitution model you should use is MG94xHKY85_3x4 (second from the bottom). This model is like the Muse and Gaut (1994) codon model, which is the only codon model I discussed in lecture. You will remember (I'm sure) that the MG94 model allows substitutions to be either synonymous or non-synonymous, but does not make a distinction between transitions and transversions. The HKY85 model distinguishes between transitions and transversions (remember kappa?), but does not distinguish between synon. and non-synon. substitutions. Thus, MG94xHKY85 is a hybrid model that allows all four possibilities: synonymous transitions, synonymous transversions, nonsynonymous transitions and nonsynonymous transversions. The name is nevertheless a bit puzzling because (as you will find out in a few minutes) it actually behaves more like the GTR model than the HKY model in that it allows all 6 possible types of substitutions (A<->C, A<->G, A<->T, C<->G, C<->T and G<->T) to have their own rates.

The 3x4 part on the end of the name means that the 61 codon frequencies are obtained by multiplying together the four nucleotide frequencies that are estimated separately for the three codon positions. Thus, the frequency for the AGT codon is obtained by multiplying together these three quantities:

  • the frequency of A nucleotides at first positions
  • the frequency of G nucleotides at second positions
  • the frequency of T nucleotides at third positions

(Note: HyPhy corrects these for the fact that the three stop codons are not included.) This involves estimating the 4 nucleotides frequencies at each of the 3 codon positions, hence the 3x4 in the name.

Local vs. global

You have one last decision to make before calculating the likelihood. You must choose Local or Global from the Parameters drop-down list. Local means that HyPhy will estimate some substitution model parameters for every branch in the tree. Global means that all substitution model parameters will apply to the entire tree. In all the models discussed thus far in the course, we were effectively using the global option except for the branch lengths themselves, which are always local parameters (it doesn't usually make any sense to think of every branch having the same length).

For now, tell HyPhy to use the Global option here. We will later change this to Local for comparison.

Computing the likelihood under a global codon model

You are now ready to compute the maximum likelihood estimates of the parameters in your model. Choose Likelihood > Build to build a likelihood function, then Likelihood > Optimize to optimize the likelihood function (i.e. search for the highest point, thus obtaining maximum likelihood estimates of all parameters).

Saving the results

When HyPhy has finished optimizing (this will take several minutes), it will pop up a "Likelihood parameters for wickett" window (hereafter I will just refer to this as the Parameters window) showing you values for all the quantities it estimated.

Click on the HYPHY Console window to bring it to the foreground, then, using the scroll bar to move up if needed, answer the following questions on a piece of paper for later reference:

  • What is the maximum log-likelihood under this model?
  • What is the value of omega, the ratio of the nonsynonymous rate to the synonymous rate? (Hint: this parameter is labeled "wickett_part_Shared_R" - the "R" presumably stands for "Ratio")
  • How many local parameters does HyPhy say it estimated?
  • How many shared (i.e. global) parameters does HyPhy say it estimated?
  • Based on the estimate of omega, are these sequences evolving under the influence of stabilizing selection, positive selection or neutral evolution?

Viewing the tree

The first item in the Parameters window should be "wickett_tree". Double-click this line to bring up a Tree window showing the tree. You may need to expand the Tree window to see the entire tree. This shows the tree with branch lengths scaled to be proportional to the expected number of substitutions (the normal way to scale branch lengths). There is not much else you can do with the tree window at this point, so close it for now.

Computing the likelihood under a local codon model

Now let's switch to using a local model; i.e. a model in which each branch has its own value of omega, the ratio of nonsynonymous rate to synonymous rate. To accomplish this, return to the Data window and change (under the Parameters column) "Global" to "Local". HyPhy will issue a stern warning about this action invalidating the likelihood function, but that's ok because you already saved the previous likelihood function under the name "global". Just press the Ok button to proceed.

Rebuild the likelihood function (by choosing Likelihood > Build from the menu at the top of the Data window), then optimize the likelihood function using Likelihood > Optimize. The optimization will take longer this time because there are many more parameters to estimate.

While you are waiting, you can answer these questions using the output in the HYPHY Console window:

  • How many parameters does HyPhy now say are shared?
  • Which shared parameters are missing this time that were present before when you performed the global analysis?
  • How many parameters are now local?

Saving the results

Once the parameter window pops up again, indicating that HyPhy is finished with its optimization, answer the following questions (some of the answers you will need to get from the HYPHY console window):

  • What is the maximized log-likelihood for the local model?
  • Does this model fit the data better than the global model?
  • What is the difference between the log-likelihood of this model and the log-likelihood of the previous model?
  • What is the likelihood ratio test statistic for this comparison?
  • How many degrees of freedom does this likelihood ratio test have?
  • Does this model fit the data significantly better than the global model?

Part C: Testing evolutionary phylogenetic hypotheses