Phylogenetics: HyPhy Lab (EEBedia; Paul Lewis; 2020-02-24)
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the modules needed for this exercise:<br />
module load hyphy/2.5.4<br />
module load paup/4.0a-166<br />
module load python/2.7.8<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [https://doi.org/10.1006/mpev.1998.0531 DOI:10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees because both have a high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony mistakes such convergence for historical relatedness.<br />
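To get a feel for why composition bias inflates chance convergence, note that if two unrelated lineages each independently settle on a state drawn from the equilibrium frequencies, the probability that they match is the sum of the squared frequencies. A minimal Python sketch (the biased frequencies are the piA=0.4, piC=0.1, piG=0.1, piT=0.4 values discussed below):<br />

```python
# Probability that two lineages independently end up in the same state by
# chance: the sum of squared equilibrium frequencies.

def match_prob(freqs):
    """Probability that two independent draws from freqs agree."""
    return sum(f * f for f in freqs)

uniform = [0.25, 0.25, 0.25, 0.25]     # no compositional bias
at_biased = [0.40, 0.10, 0.10, 0.40]   # strong AT bias

print(match_prob(uniform))    # -> 0.25
print(match_prob(at_biased))  # ~0.34
```

So with this level of AT bias, two unrelated lineages agree at about 34% of sites by chance alone, versus 25% with uniform frequencies, and the gap widens as the composition approaches two effective states.<br />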
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 and nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening? Our goal will be to recreate the parsimony part of Figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
    simdata/<br />
    bats.bf<br />
    irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and/or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(stdout, observedFreqs);<br />
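As a quick cross-check of the 978-site figure claimed in the comment, the two zero-based inclusive ranges in the filter string can be tallied by hand, or with a few lines of Python:<br />

```python
# Count the sites kept by the HyPhy filter specification "106-183,190-1089",
# where each range is zero-based and inclusive.

def sites_kept(spec):
    total = 0
    for part in spec.split(","):
        first, last = (int(x) for x in part.split("-"))
        total += last - first + 1
    return total

print(sites_kept("106-183,190-1089"))  # -> 978
```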
<br />
=== Run bats.bf ===<br />
<br />
Run your <tt>bats.bf</tt> file as follows:<br />
hyphy bats.bf<br />
<br />
You should see the empirical nucleotide frequencies displayed as follows:<br />
{<br />
{0.1936054742803209}<br />
{0.3104845052697813}<br />
{0.3106418121755545}<br />
{0.1852682082743432}<br />
}<br />
<br />
I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.<br />
<br />
The <tt>fprintf</tt> command also mimics the C programming language. It allows you to print objects to <tt>stdout</tt> (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable <tt>observedFreqs</tt>.<br />
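The frequency harvest itself is easy to mimic outside HyPhy if you ever want an independent check. Here is a minimal sketch (the two sequences are made up for illustration, not taken from irbp.nex):<br />

```python
# Tally empirical nucleotide frequencies over all sequences, analogous to
# HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1).

def harvest_frequencies(seqs):
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    for seq in seqs:
        for base in seq.upper():
            if base in counts:        # ignore gaps and ambiguity codes
                counts[base] += 1
    total = sum(counts.values())
    return [counts[b] / float(total) for b in "ACGT"]

freqs = harvest_frequencies(["ACGTAC", "AC-TTG"])
print(freqs)  # frequencies in A, C, G, T order; they sum to 1
```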
<br />
=== Define the HKY85 rate matrix ===<br />
<br />
Add the following to the bottom of your <tt>bats.bf</tt> file:<br />
/*****************************************************************************************<br />
| Define the HKY85 substitution matrix. '*' is used for the diagonal elements that can be<br />
| computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is <br />
| declared to be global, meaning it is shared by all edges.<br />
*/<br />
global kappa = 4.224618;<br />
HKY85RateMatrix = <br />
{{ * , betat , betat*kappa , betat }<br />
{ betat , * , betat , betat*kappa }<br />
{ betat*kappa , betat , * , betat }<br />
{ betat , betat*kappa , betat , * }};<br />
fprintf(stdout, HKY85RateMatrix);<br />
<br />
Run <tt>bats.bf</tt> in HyPhy. Did the <tt>HKY85RateMatrix</tt> variable have the value you expected? Why or why not? (Hint: the variable <tt>betat</tt> was initialized to zero.) Set <tt>betat = 1;</tt> before the <tt>fprintf</tt> statement and run the batch file again to see the effect. Does the output make sense now?<br />
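If it helps to see the structure of the matrix concretely, here is a Python sketch of the same template (state order A, C, G, T; transitions A&lt;-&gt;G and C&lt;-&gt;T get the kappa multiplier). With <tt>betat</tt> equal to zero, every off-diagonal entry is zero, which is why the first run looks odd:<br />

```python
# HKY85 rate template: transversions get betat, transitions get betat*kappa.
# State order is A, C, G, T, matching the HyPhy matrix above.

def hky_template(betat, kappa):
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}  # A<->G, C<->T
    m = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                m[i][j] = betat * kappa if (i, j) in transitions else betat
    return m

print(hky_template(0.0, 4.224618))  # all zeros, like the uninitialized betat case
print(hky_template(1.0, 4.224618))  # kappa appears only in the transition cells
```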
<br />
=== Combine frequencies with rate matrix to create an HKY85 model ===<br />
<br />
/*****************************************************************************************<br />
| Define the HKY85 model, by combining the substitution matrix with the vector of <br />
| empirical frequencies. <br />
*/<br />
Model hky1 = (HKY85RateMatrix, observedFreqs);<br />
<br />
Now we have a model variable (<tt>hky1</tt>) that can be applied to each edge of a tree. <br />
<br />
=== Create a tree representing the null hypothesis ===<br />
<br />
The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.<br />
<br />
/*****************************************************************************************<br />
| Define the tree variable, using the tree description read from the data file.<br />
| By default, the last defined model (hky1) is assigned to all edges of the tree. <br />
*/<br />
Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";<br />
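To sanity-check a long tree string like this, one low-tech option is to pull out the labeled nodes and their edge lengths with a regular expression. A sketch on a toy newick string (on the real <tt>constrainedTree</tt> string this should return the 13 species plus the internal label <tt>microbats</tt>):<br />

```python
import re

# Extract "label:length" pairs from a newick tree description.
# Toy example; paste the full constrainedTree string in practice.
newick = "((A:0.1,B:0.2)ab:0.05,C:0.3)"

pairs = re.findall(r"([A-Za-z_]+):([0-9.]+)", newick)
print(pairs)  # -> [('A', '0.1'), ('B', '0.2'), ('ab', '0.05'), ('C', '0.3')]
```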
<br />
=== Creating the second model ===<br />
<br />
Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied only to certain edges in the tree. Right now, the <tt>hky1</tt> model is attached to every edge in the tree. You can verify this using the following HBL code:<br />
<br />
GetInformation(modelMap, constrainedTree);<br />
fprintf(stdout, modelMap);<br />
<br />
Add these two lines to the bottom of your <tt>bats.bf</tt> file and run it. You should see this near the bottom of your output:<br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node19":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky1",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky1",<br />
"Tonatia_silvicola":"hky1",<br />
"Tupaia":"hky1"<br />
}<br />
<br />
First, create a second model named <tt>hky2</tt> and apply it to four specific edges in the tree:<br />
<br />
/*****************************************************************************************<br />
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. <br />
*/<br />
highATfreqs = {{.45}{.05}{.05}{.45}};<br />
Model hky2 = (HKY85RateMatrix, highATfreqs);<br />
SetParameter(constrainedTree.microbats, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);<br />
SetParameter(constrainedTree.Pteropus, MODEL, hky2);<br />
<br />
If you move your <tt>GetInformation</tt> call and its accompanying <tt>fprintf</tt> after these 6 lines, then you should see this in the output: <br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky2",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky2",<br />
"Tonatia_silvicola":"hky2",<br />
"Tupaia":"hky1",<br />
"microbats":"hky2"<br />
}<br />
<br />
=== Don't get too comfortable just yet ===<br />
<br />
It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:<br />
<br />
fprintf(stdout, "\n\nCurrent edge lengths:\n");<br />
edgeNames = BranchName(constrainedTree,-1); <br />
edgeLengths = BranchLength(constrainedTree,-1);<br />
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {<br />
fprintf(stdout, Format(edgeLengths[k],10,5), " ", edgeNames[k], "\n");<br />
}<br />
<br />
The first line simply skips a couple of lines (<tt>\n\n</tt>) and prints a header announcing that current edge lengths will follow.<br />
<br />
The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The <tt>-1</tt> in these functions is somewhat obscure, but means "give me all of them". (If you used <tt>5</tt> rather than <tt>-1</tt>, it would give you the name of the edge having index 5.)<br />
<br />
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints out the length of edge <tt>k</tt> (the <tt>Format</tt> command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement causes a carriage return so that the edge lengths and names do not all end up on the same line of output.<br />
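The <tt>Format(edgeLengths[k], 10, 5)</tt> call behaves like C-style fixed-width printing; in Python terms (the names and lengths below are the first two taxa from the tree above):<br />

```python
edge_names = ["Homo", "Tarsius"]       # first two taxa from constrainedTree
edge_lengths = [0.077544, 0.084863]

for length, name in zip(edge_lengths, edge_names):
    # "%10.5f": 10 characters wide, 5 decimal places, right-justified,
    # mirroring HyPhy's Format(edgeLengths[k], 10, 5)
    print("%10.5f  %s" % (length, name))
```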
<br />
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?<br />
<br />
=== Fixing up edge lengths ===<br />
<br />
The problem is that applying a new model to those 4 edges erased their edge length information! We now need to set the <tt>betat</tt> parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The <tt>AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1</tt> that we placed at the top of the file saves us from having to go through this procedure for all the <tt>hky1</tt> edges, but now that we've replaced the models on 4 edges, we need to do a bit of work to get their edge lengths back.<br />
<br />
Recall that, under the JC69 model, the edge length equals <tt>v = 3*beta*t</tt>. The <tt>betat</tt> parameter that appears in our <tt>HKY85RateMatrix</tt> is the <tt>beta*t</tt> part, so to set the <tt>betat</tt> parameter we need to divide the desired edge length by 3: that is, <tt>betat = v/3</tt>. The scaling factor 3 becomes more complex for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to <tt>betat</tt> values.<br />
<br />
betat = 1.0;<br />
scalingFactor = 0.0;<br />
for (n1 = 0; n1 < 4; n1 = n1+1) {<br />
for (n2 = 0; n2 < 4; n2 = n2+1) {<br />
if (n2!=n1) {<br />
scalingFactor = scalingFactor + highATfreqs[n1]*highATfreqs[n2]*HKY85RateMatrix[n1][n2];<br />
}<br />
}<br />
}<br />
<br />
The two nested loops visit every off-diagonal element of the <tt>HKY85RateMatrix</tt>, multiplying each element by the base frequency of its row and the base frequency of its column. This product of 3 terms is then added to the growing sum <tt>scalingFactor</tt>. Review your lecture notes if you don't remember why <tt>scalingFactor</tt> is computed in this way. Note that it is important to set <tt>betat = 1</tt> before the loop to ensure that the scaling factor is computed correctly.<br />
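For reference, the same computation can be reproduced in a few lines of Python using the kappa and <tt>highATfreqs</tt> values defined earlier (the printed value is simply what those inputs imply, useful as a check on your HyPhy output):<br />

```python
# Scaling factor = expected substitution rate at equilibrium with betat = 1:
# sum over i != j of pi_i * pi_j * r_ij.

kappa = 4.224618
freqs = [0.45, 0.05, 0.05, 0.45]                 # piA, piC, piG, piT
transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}   # A<->G, C<->T (order A,C,G,T)

scaling = 0.0
for i in range(4):
    for j in range(4):
        if i != j:
            rate = kappa if (i, j) in transitions else 1.0
            scaling += freqs[i] * freqs[j] * rate

print(scaling)  # ~0.8802; each betat is then the desired edge length / scaling
```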
<br />
Now we simply need to set the <tt>betat</tt> values for the four edges of interest:<br />
constrainedTree.microbats.betat := 0.097851/scalingFactor;<br />
constrainedTree.Tonatia_bidens.betat := 0.008252/scalingFactor;<br />
constrainedTree.Tonatia_silvicola.betat := 0.000000/scalingFactor;<br />
constrainedTree.Pteropus.betat := 0.104663/scalingFactor;<br />
<br />
Place the loop that shows edge lengths after these 4 lines in order to check and make sure the edge lengths have been set correctly.<br />
<br />
=== Constructing the likelihood function ===<br />
<br />
/*****************************************************************************************<br />
| Build the likelihood function and print it out, which causes HyPhy to actually compute<br />
| the log-likelihood for the tree.<br />
*/<br />
LikelihoodFunction likelihood = (filteredData, constrainedTree);<br />
fprintf(stdout, likelihood);<br />
<br />
If you add the 2 lines above to your growing bats.bf file and run it, you should see that the log-likelihood is (to 6 decimal places) equal to <tt>-6472.478318</tt>.<br />
<br />
The <tt>ASSUME_REVERSIBLE_MODELS = -1</tt> that we placed at the beginning of the batch file is needed to prevent HyPhy from assuming it can reroot the tree at any node it wants (which leads to trouble because the nucleotide composition changes across the tree and thus the rooting matters). The <tt>ACCEPT_ROOTED_TREES = TRUE</tt> at the top of the file prevents HyPhy from automatically converting our tree description (which specifies a rooted tree) into an unrooted tree.<br />
<br />
=== Ready to simulate! ===<br />
<br />
We are now ready to perform the simulations. As soon as you create a <tt>LikelihoodFunction</tt> object in HyPhy that is capable of computing the log-likelihood, that object can be used to simulate data because it has a tree with branches that have an assigned model of evolution and all parameters of the substitution models have been either specified (as we did here) or estimated.<br />
<br />
Here is the simulation loop:<br />
/*****************************************************************************************<br />
| Perform the simulations. <br />
*/<br />
for (simCounter = 0; simCounter < 10; simCounter = simCounter+1) {<br />
// Simulate a data set of the same size as the original set<br />
DataSet simulatedData = SimulateDataSet(likelihood);<br />
<br />
// The filter is necessary, but trivial in this case because all sites are used<br />
DataSetFilter filteredSimData = CreateFilter(simulatedData,1);<br />
<br />
// Save simulated data to a file<br />
outFile = "simdata/sim"+simCounter+".nex";<br />
fprintf(outFile, filteredSimData);<br />
}<br />
<br />
I have used both styles of comments here: the main comment before the loop uses the <tt>/* ... */</tt> style, and the double-slash style is used for comments within the loop. The <tt>LikelihoodFunction</tt> object <tt>likelihood</tt> is passed to the <tt>SimulateDataSet</tt> function to generate the data. It is always necessary to create a <tt>DataSetFilter</tt> in HyPhy, even if no filtering occurs. If you printed <tt>filteredSimData</tt> to <tt>stdout</tt> using <tt>fprintf</tt>, the entire data set would be spewed to the screen, which is not very helpful. Here we instead give <tt>fprintf</tt> a file name rather than <tt>stdout</tt>, which causes the simulated data to be saved to that file. The file name is constructed using <tt>simCounter</tt> so that every simulated file has a different name. Note that all simulated data will be saved in the <tt>simdata</tt> directory that you created earlier in this tutorial.<br />
<br />
== Using PAUP* to estimate parsimony trees for each simulated data set ==<br />
<br />
HyPhy is great for estimating parameters on a tree using quite complex models; however, PAUP* excels at searching for the best tree, so we will leave HyPhy now and turn to PAUP* to finish our work.<br />
<br />
Create the following file in the same directory as <tt>bats.bf</tt>. Call the new file <tt>parsimony.nex</tt>.<br />
#nexus<br />
<br />
begin paup;<br />
log file=simdata/pauplog.txt start replace;<br />
set criterion=parsimony warnreset=no warntsave=no;<br />
end;<br />
<br />
begin python;<br />
for i in range(10):<br />
    paup.cmd("exe simdata/sim%d.nex;" % i)<br />
    paup.cmd("hsearch;")<br />
    paup.cmd("contree all;")<br />
    paup.cmd("constraints flyingdna (monophyly) = ((PTEROPUS, 'TONATIA_BIDENS', 'TONATIA_SILVICOLA'));")<br />
    paup.cmd("filter constraints=flyingdna;")<br />
end;<br />
<br />
begin paup;<br />
log stop;<br />
quit;<br />
end;<br />
<br />
This file contains 2 paup blocks with a python block sandwiched in between. That's right: PAUP* can execute python commands, and this comes in handy when you want to do the same thing over and over again, such as process a bunch of simulated data files in the exact same way.<br />
<br />
The first paup block starts a log (which will be created in the <tt>simdata</tt> directory) and sets the optimality criterion to parsimony. It also tells PAUP* to not warn us when data files are reset or when we try to quit when there are still trees that haven't been saved.<br />
<br />
The final paup block just closes the log file and then quits.<br />
<br />
The python block is where all the interesting things happen. As with python in general, be sure you have the indentation of lines correct inside the python block, otherwise you will get an error message from Python. You can see that the block is one big loop over simulation replicates. Commands that you want PAUP* to execute have to be created as strings (inside double quotes) and then passed to PAUP* via a <tt>paup.cmd</tt> function call. <br />
<br />
The first <tt>paup.cmd</tt> executes a file named <tt>sim%d.nex</tt> inside the simdata directory, where <tt>%d</tt> is replaced by the value of <tt>i</tt>. Thus, the loop will, in turn, execute the files simdata/sim0.nex, simdata/sim1.nex, ..., simdata/sim9.nex.<br />
<br />
The second <tt>paup.cmd</tt> performs a heuristic search with all default settings. You could make this more explicit/elaborate if you wished, but the default settings work well in this case.<br />
<br />
The third <tt>paup.cmd</tt> creates (and shows) a strict consensus tree of all most parsimonious trees found during the search. There are often several best trees found during a parsimony search, and this shows us what is common to all these trees.<br />
<br />
We could leave it at that, but the last two lines make it easier to tally how many simulation replicates resulted in a parsimony tree in which all bats form a monophyletic group. We first create a monophyly constraint named <tt>flyingdna</tt> and then filter the trees resulting from the parsimony search using the <tt>flyingdna</tt> constraint. Trees that satisfy the constraint are kept while all trees in which bats are not monophyletic are discarded. If any trees remain after filtering, we will count 1; if no trees remain after filtering, we will count 0. The total count divided by the number of simulation replicates will give us the y-value for the plot that recreates Figure 4 from the Vandenbussche et al. paper.<br />
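The counting itself can be scripted. The sketch below assumes the log reports lines of the form "Number of trees retained by filter: N" (that phrase appears in the grep example later in this lab; check your own pauplog.txt for the exact wording):<br />

```python
import re

# Tally the proportion of replicates in which at least one most-parsimonious
# tree survived the flyingdna monophyly filter. The log_text here is a
# made-up stand-in for the contents of simdata/pauplog.txt.
log_text = """
Number of trees retained by filter: 0
Number of trees retained by filter: 2
Number of trees retained by filter: 1
"""

counts = [int(n) for n in
          re.findall(r"Number of trees retained by filter:\s*(\d+)", log_text)]
n_monophyletic = sum(1 for n in counts if n > 0)
print(n_monophyletic, "of", len(counts), "replicates showed bats monophyletic")
```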
<br />
Run this file in PAUP* to generate the <tt>pauplog.txt</tt> file, then look through that file to see how many replicates yielded bat monophyly.<br />
<br />
== Final exercise ==<br />
<br />
Once you confirm that your scripts are working, run your <tt>bats.bf</tt> using HyPhy followed by running <tt>parsimony.nex</tt> in PAUP* a total of 6 times, each with a different value of <tt>highATfreqs</tt> that reflects one of these AT percentages: 50, 60, 70, 80, 90, 100. You may also wish to bump up the number of simulation replicates to at least 20 or 50 in both <tt>bats.bf</tt> and <tt>parsimony.nex</tt> so that you get more accurate y-axis values. <br />
<br />
Note that you can use a command like that below to pull out only the lines that report the number of trees retained from the file <tt>pauplog.txt</tt>:<br />
cat simdata/pauplog.txt | grep "Number of trees retained by filter"<br />
The <tt>cat</tt> command simply dumps a file to the screen. The pipe symbol (<tt>|</tt>) sends that output to the <tt>grep</tt> command instead of the console; <tt>grep</tt> filters out everything except lines that contain the supplied string. This makes it easy to perform your counts.<br />
<br />
How does your plot compare to the one published in Vandenbussche et al. (1998)?<br />
<br />
[[Category: Phylogenetics]]</div>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.5.4<br />
module load paup/4.0a-166<br />
module load python/2.7.8<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(stdout, observedFreqs);<br />
<br />
=== Run bats.bf ===<br />
<br />
Run your <tt>bats.bf</tt> file as follows:<br />
hyphy bats.bf<br />
<br />
You should see the empirical nucleotide frequencies displayed as follows:<br />
{<br />
{0.1936054742803209}<br />
{0.3104845052697813}<br />
{0.3106418121755545}<br />
{0.1852682082743432}<br />
}<br />
<br />
I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.<br />
<br />
The <tt>fprintf</tt> command also mimics the C programming language. It allows you to print objects to <tt>stdout</tt> (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable <tt>observedFreqs</tt>.<br />
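For intuition, here is roughly what <tt>HarvestFrequencies</tt> computes, sketched in Python with made-up toy sequences (HyPhy tallies the real filtered sites of <tt>irbp.nex</tt> and also handles ambiguity codes and gaps, which this sketch ignores):<br />

```python
from collections import Counter

# Toy alignment (made up for illustration only)
sequences = ["ACGTACGT", "ACGTTTGT", "AAGTACGA"]

# Tally all bases across all sequences and normalize to frequencies
counts = Counter("".join(sequences))
total = sum(counts[b] for b in "ACGT")
observed_freqs = [counts[b] / total for b in "ACGT"]   # order: A, C, G, T
print(observed_freqs)
```

The real <tt>observedFreqs</tt> vector printed by HyPhy is this same kind of tally, computed over the 978 filtered sites.<br />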
<br />
=== Define the HKY85 rate matrix ===<br />
<br />
Add the following to the bottom of your <tt>bats.bf</tt> file:<br />
/*****************************************************************************************<br />
| Define the HKY85 substitution matrix. '*' is used for the diagonal elements that can be<br />
| computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is <br />
| declared to be global, meaning it is shared by all edges.<br />
*/<br />
global kappa = 4.224618;<br />
HKY85RateMatrix = <br />
{{ * , betat , betat*kappa , betat }<br />
{ betat , * , betat , betat*kappa }<br />
{ betat*kappa , betat , * , betat }<br />
{ betat , betat*kappa , betat , * }};<br />
fprintf(stdout, HKY85RateMatrix);<br />
<br />
Run <tt>bats.bf</tt> in HyPhy. Did the <tt>HKY85RateMatrix</tt> variable have the value you expected? Why or why not? (Hint: the variable <tt>betat</tt> was initialized to zero.) Set <tt>betat = 1;</tt> before the <tt>fprintf</tt> statement and run the batch file again to see the effect. Does the output make sense now?<br />
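To see why the initial value of <tt>betat</tt> matters, the off-diagonal structure of the matrix can be mirrored in Python (a sketch; the function name <tt>hky85_rates</tt> is ours, not HyPhy's):<br />

```python
kappa = 4.224618   # transition/transversion rate ratio from the batch file

def hky85_rates(betat, kappa):
    """Off-diagonal HKY85 rates in A, C, G, T order. Transitions
    (A<->G, C<->T) get betat*kappa; transversions get betat. The
    diagonal is left at 0 here; HyPhy's '*' fills it in for you."""
    b, k = betat, betat * kappa
    return [[0, b, k, b],
            [b, 0, b, k],
            [k, b, 0, b],
            [b, k, b, 0]]

# With betat = 0 every entry is zero, which is why the first printout
# of HKY85RateMatrix looks surprising; betat = 1 shows the real pattern
for row in hky85_rates(1.0, kappa):
    print(row)
```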
<br />
=== Combine frequencies with rate matrix to create an HKY85 model ===<br />
<br />
/*****************************************************************************************<br />
| Define the HKY85 model, by combining the substitution matrix with the vector of <br />
| empirical frequencies. <br />
*/<br />
Model hky1 = (HKY85RateMatrix, observedFreqs);<br />
<br />
Now we have a model variable (<tt>hky1</tt>) that can be applied to each edge of a tree. <br />
<br />
=== Create a tree representing the null hypothesis ===<br />
<br />
The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.<br />
<br />
/*****************************************************************************************<br />
| Define the tree variable, using the tree description read from the data file.<br />
| By default, the last defined model (hky1) is assigned to all edges of the tree. <br />
*/<br />
Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";<br />
<br />
=== Creating the second model ===<br />
<br />
Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the <tt>hky1</tt> model has been applied to every edge of the tree. You can verify this using the following HBL code:<br />
<br />
GetInformation(modelMap, constrainedTree);<br />
fprintf(stdout, modelMap);<br />
<br />
Add these two lines to the bottom of your <tt>bats.bf</tt> file and run it. You should see this near the bottom of your output:<br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node19":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky1",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky1",<br />
"Tonatia_silvicola":"hky1",<br />
"Tupaia":"hky1"<br />
}<br />
<br />
First, create a second model named <tt>hky2</tt> and apply it to four specific edges in the tree:<br />
<br />
/*****************************************************************************************<br />
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. <br />
*/<br />
highATfreqs = {{.45}{.05}{.05}{.45}};<br />
Model hky2 = (HKY85RateMatrix, highATfreqs);<br />
SetParameter(constrainedTree.microbats, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);<br />
SetParameter(constrainedTree.Pteropus, MODEL, hky2);<br />
<br />
If you move your <tt>GetInformation</tt> call and its accompanying <tt>fprintf</tt> after these 6 lines, then you should see this in the output: <br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky2",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky2",<br />
"Tonatia_silvicola":"hky2",<br />
"Tupaia":"hky1",<br />
"microbats":"hky2"<br />
}<br />
<br />
=== Don't get too comfortable just yet ===<br />
<br />
It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:<br />
<br />
fprintf(stdout, "\n\nCurrent edge lengths:\n");<br />
edgeNames = BranchName(constrainedTree,-1); <br />
edgeLengths = BranchLength(constrainedTree,-1);<br />
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {<br />
    fprintf(stdout, Format(edgeLengths[k],10,5), " ", edgeNames[k], "\n");<br />
}<br />
<br />
The first line simply skips a couple of lines (<tt>\n\n</tt>) and prints a header announcing that current edge lengths will follow.<br />
<br />
The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The <tt>-1</tt> in these functions is somewhat obscure, but means "give me all of them". (If you used <tt>5</tt> rather than <tt>-1</tt>, it would give you the name of the edge having index 5.)<br />
<br />
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints out the length of edge <tt>k</tt> (the <tt>Format</tt> command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement starts a new line so that the edge lengths and names do not all end up on the same line of output.<br />
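The <tt>Format(x, 10, 5)</tt> call corresponds to fixed-width, fixed-precision formatting; the loop behaves roughly like this Python sketch (the edge names and lengths shown are illustrative):<br />

```python
# Hypothetical edge names and lengths; in bats.bf these come from
# BranchName(constrainedTree,-1) and BranchLength(constrainedTree,-1)
edge_names = ["Homo", "Tarsius", "Galago"]
edge_lengths = [0.077544, 0.084863, 0.075292]

for name, length in zip(edge_names, edge_lengths):
    # f"{length:10.5f}" mirrors HyPhy's Format(length, 10, 5):
    # total field width 10, with 5 digits after the decimal point
    print(f"{length:10.5f}  {name}")
```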
<br />
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?<br />
<br />
=== Fixing up edge lengths ===<br />
<br />
The problem is that applying a new model to those 4 edges caused their edge length information to be erased! We need to now set the <tt>betat</tt> parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The <tt>AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1</tt> that we placed at the top of the file saves us from having to go through this procedure for all the <tt>hky1</tt> edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.<br />
<br />
Recall that, under the JC69 model, the edge length equals <tt>v = 3*beta*t</tt>. The <tt>betat</tt> parameter that appears in our <tt>HKY85RateMatrix</tt> is the <tt>beta*t</tt> part, so to set the <tt>betat</tt> parameter we need to divide the desired edge length by 3: that is, <tt>betat = v/3</tt>. The scaling factor (3 for JC69) is more complicated for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to <tt>betat</tt> values.<br />
<br />
betat = 1.0;<br />
scalingFactor = 0.0;<br />
for (n1 = 0; n1 < 4; n1 = n1+1) {<br />
    for (n2 = 0; n2 < 4; n2 = n2+1) {<br />
        if (n2!=n1) {<br />
            scalingFactor = scalingFactor + highATfreqs[n1]*highATfreqs[n2]*HKY85RateMatrix[n1][n2];<br />
        }<br />
    }<br />
}<br />
<br />
The two nested loops visit every off-diagonal element of the <tt>HKY85RateMatrix</tt>, multiplying each element by the base frequency of its row and the base frequency of its column. This product of 3 terms is then added to the growing sum <tt>scalingFactor</tt>. Review your lecture notes if you don't remember why <tt>scalingFactor</tt> is computed in this way. Note that it is important to set <tt>betat = 1</tt> before the loop in order to ensure that the scaling factor works correctly.<br />
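As a sanity check on the arithmetic, the same scaling factor can be computed in Python using the AT-rich frequencies and the kappa value from this tutorial (a sketch to verify the computation, not HyPhy's internal code):<br />

```python
kappa = 4.224618                     # same kappa as in bats.bf
freqs = [0.45, 0.05, 0.05, 0.45]     # highATfreqs, in A, C, G, T order

# Transitions (A<->G, C<->T) have matrix entry betat*kappa; transversions
# have betat. With betat = 1, the entry is simply kappa or 1.
transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}

# Sum over all i != j of pi_i * pi_j * matrix[i][j], exactly as the
# nested HBL loops do
scaling_factor = sum(freqs[i] * freqs[j] * (kappa if (i, j) in transitions else 1.0)
                     for i in range(4) for j in range(4) if i != j)

print(scaling_factor)               # divisor used to convert edge lengths
print(0.104663 / scaling_factor)    # betat for the Pteropus edge
```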
<br />
Now we simply need to set the <tt>betat</tt> values for the four edges of interest:<br />
constrainedTree.microbats.betat := 0.097851/scalingFactor;<br />
constrainedTree.Tonatia_bidens.betat := 0.008252/scalingFactor;<br />
constrainedTree.Tonatia_silvicola.betat := 0.000000/scalingFactor;<br />
constrainedTree.Pteropus.betat := 0.104663/scalingFactor;<br />
<br />
Place the loop that shows edge lengths after these 4 lines in order to check and make sure the edge lengths have been set correctly.<br />
<br />
=== Constructing the likelihood function ===<br />
<br />
/*****************************************************************************************<br />
| Build the likelihood function and print it out, which causes HyPhy to actually compute<br />
| the log-likelihood for the tree.<br />
*/<br />
LikelihoodFunction likelihood = (filteredData, constrainedTree);<br />
fprintf(stdout, likelihood);<br />
<br />
If you add the 2 lines above to your growing <tt>bats.bf</tt> file and run it, you should see that the log-likelihood is (to 6 decimal places) equal to <tt>-6472.478318</tt>.<br />
<br />
The <tt>ASSUME_REVERSIBLE_MODELS = -1</tt> that we placed at the beginning of the batch file is needed to prevent HyPhy from assuming it can reroot the tree at any node it wants (which leads to trouble because the nucleotide composition changes across the tree and thus the rooting matters). The <tt>ACCEPT_ROOTED_TREES = TRUE</tt> at the top of the file prevents HyPhy from automatically converting our tree description (which specifies a rooted tree) into an unrooted tree.<br />
<br />
=== Ready to simulate! ===<br />
<br />
We are now ready to perform the simulations. As soon as you create a <tt>LikelihoodFunction</tt> object in HyPhy that is capable of computing the log-likelihood, that object can be used to simulate data because it has a tree with branches that have an assigned model of evolution and all parameters of the substitution models have been either specified (as we did here) or estimated.<br />
<br />
Here is the simulation loop:<br />
/*****************************************************************************************<br />
| Perform the simulations. <br />
*/<br />
for (simCounter = 0; simCounter < 10; simCounter = simCounter+1) {<br />
    // Simulate a data set of the same size as the original set<br />
    DataSet simulatedData = SimulateDataSet(likelihood);<br />
<br />
    // The filter is necessary, but trivial in this case because all sites are used<br />
    DataSetFilter filteredSimData = CreateFilter(simulatedData,1);<br />
<br />
    // Save simulated data to a file<br />
    outFile = "simdata/sim"+simCounter+".nex";<br />
    fprintf(outFile, filteredSimData);<br />
}<br />
<br />
I have used both styles of comments here: the main comment before the loop uses the <tt>/* ... */</tt> style, and the double-slash style is used for comments within the loop. The <tt>LikelihoodFunction</tt> object <tt>likelihood</tt> is passed to the <tt>SimulateDataSet</tt> function to generate the data. It is always necessary to create a <tt>DataSetFilter</tt> in HyPhy, even if no filtering occurs. If one were to print <tt>filteredSimData</tt> using <tt>fprintf</tt> to <tt>stdout</tt>, the entire data set would be spewed to the screen, which is not very helpful. Here we instead give <tt>fprintf</tt> a file name rather than <tt>stdout</tt>, which causes the simulated data to be saved to that file. The file name is constructed using <tt>simCounter</tt> so that every simulated file has a different name. Note that all simulated data will be saved in the <tt>simdata</tt> directory that you created early on in this tutorial.<br />
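The simulate-and-save pattern of the loop is the same one you would write in Python; in this sketch, <tt>simulate_dataset</tt> is only a placeholder standing in for HyPhy's <tt>SimulateDataSet</tt>:<br />

```python
import os

def simulate_dataset(rep):
    """Placeholder for HyPhy's SimulateDataSet: returns a dummy NEXUS
    string so the save loop itself can be demonstrated."""
    return "#NEXUS\n[ simulated replicate %d ]\n" % rep

os.makedirs("simdata", exist_ok=True)
for sim_counter in range(10):
    # Unique name per replicate, like "simdata/sim" + simCounter + ".nex"
    out_file = "simdata/sim%d.nex" % sim_counter
    with open(out_file, "w") as f:
        f.write(simulate_dataset(sim_counter))
```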
<br />
== Using PAUP* to estimate parsimony trees for each simulated data set ==<br />
<br />
HyPhy is great for estimating parameters on a tree using quite complex models; however, PAUP* excels at searching for the best tree, so we will leave HyPhy now and turn to PAUP* to finish our work.<br />
<br />
Create the following file in the same directory as <tt>bats.bf</tt>. Call the new file <tt>parsimony.nex</tt>.<br />
#nexus<br />
<br />
begin paup;<br />
log file=simdata/pauplog.txt start replace;<br />
set criterion=parsimony warnreset=no warntsave=no;<br />
end;<br />
<br />
begin python;<br />
for i in range(10):<br />
    paup.cmd("exe simdata/sim%d.nex;" % i)<br />
    paup.cmd("hsearch;")<br />
    paup.cmd("contree all;")<br />
    paup.cmd("constraints flyingdna (monophyly) = ((PTEROPUS, 'TONATIA_BIDENS', 'TONATIA_SILVICOLA'));")<br />
    paup.cmd("filter constraints=flyingdna;")<br />
end;<br />
<br />
begin paup;<br />
log stop;<br />
quit;<br />
end;<br />
<br />
This file contains 2 paup blocks with a python block sandwiched in between. That's right: PAUP* can execute python commands, and this comes in handy when you want to do the same thing over and over again, such as process a bunch of simulated data files in the exact same way.<br />
<br />
The first paup block starts a log (which will be created in the <tt>simdata</tt> directory) and sets the optimality criterion to parsimony. It also tells PAUP* to not warn us when data files are reset or when we try to quit when there are still trees that haven't been saved.<br />
<br />
The final paup block just closes the log file and then quits.<br />
<br />
The python block is where all the interesting things happen. As with python in general, be sure you have the indentation of lines correct inside the python block, otherwise you will get an error message from Python. You can see that the block is one big loop over simulation replicates. Commands that you want PAUP* to execute have to be created as strings (inside double quotes) and then passed to PAUP* via a <tt>paup.cmd</tt> function call. <br />
<br />
The first <tt>paup.cmd</tt> executes a file named <tt>sim%d.nex</tt> inside the simdata directory, where <tt>%d</tt> is replaced by the value of <tt>i</tt>. Thus, the loop will, in turn, execute the files simdata/sim0.nex, simdata/sim1.nex, ..., simdata/sim9.nex.<br />
<br />
The second <tt>paup.cmd</tt> performs a heuristic search with all default settings. You could make this more explicit/elaborate if you wished, but the default settings work well in this case.<br />
<br />
The third <tt>paup.cmd</tt> creates (and shows) a strict consensus tree of all most parsimonious trees found during the search. There are often several best trees found during a parsimony search, and this shows us what is common to all these trees.<br />
<br />
We could leave it at that, but the last two lines make it easier to tally how many simulation replicates resulted in a parsimony tree in which all bats form a monophyletic group. We first create a monophyly constraint named <tt>flyingdna</tt> and then filter the trees resulting from the parsimony search using the <tt>flyingdna</tt> constraint. Trees that satisfy the constraint are kept while all trees in which bats are not monophyletic are discarded. If any trees remain after filtering, we will count 1; if no trees remain after filtering, we will count 0. The total count divided by the number of simulation replicates will give us the y-value for the plot that recreates Figure 4 from the Vandenbussche et al. paper.<br />
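If you prefer to script the tally, here is a Python sketch that counts replicates with at least one retained tree. It assumes the log reports lines like "Number of trees retained by filter: N"; the exact wording in your <tt>pauplog.txt</tt> may differ, so adjust the pattern accordingly:<br />

```python
import re

# Hypothetical excerpt; in practice, read the lines from simdata/pauplog.txt
log_lines = [
    "Number of trees retained by filter: 0",
    "Number of trees retained by filter: 2",
    "Number of trees retained by filter: 0",
]

# Pull out the retained-tree count from each matching line
retained = [int(m.group(1)) for line in log_lines
            if (m := re.search(r"retained by filter:\s*(\d+)", line))]

# A replicate counts as "bats monophyletic" if any trees survived the filter
monophyly_fraction = sum(1 for n in retained if n > 0) / len(retained)
print(monophyly_fraction)
```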
<br />
Run this file in PAUP* to generate the <tt>pauplog.txt</tt> file, then look through that file to see how many replicates yielded bat monophyly.<br />
<br />
== Final exercise ==<br />
<br />
Once you confirm that your scripts are working, run your <tt>bats.bf</tt> in HyPhy followed by <tt>parsimony.nex</tt> in PAUP* a total of 6 times, each time with a different value of <tt>highATfreqs</tt> reflecting one of these AT percentages: 50, 60, 70, 80, 90, 100. You may also wish to bump up the number of simulation replicates to at least 20 or 50 in both <tt>bats.bf</tt> and <tt>parsimony.nex</tt> so that you get more accurate y-axis values. <br />
<br />
Note that you can use a command like that below to pull out only the lines that report the number of trees retained from the file <tt>pauplog.txt</tt>:<br />
cat simdata/pauplog.txt | grep "Number of trees retained by filter"<br />
The <tt>cat</tt> command simply dumps a file to the screen. Instead of sending the output to the console, the pipe symbol (<tt>|</tt>) causes the output to be piped into the <tt>grep</tt> command, which filters out everything except lines that contain the supplied string. This makes it easy to perform your counts.<br />
<br />
How does your plot compare to the one published in Vandenbussche et al. (1998)?<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41285Phylogenetics: HyPhy Lab2020-02-24T20:40:53Z<p>Paul Lewis: /* Loading modules needed */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.5.4<br />
module load paup/4.0a-166<br />
module load python/2.7.8<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(stdout, observedFreqs);<br />
<br />
=== Run bats.bf ===<br />
<br />
Run your <tt>bats.bf</tt> file as follows:<br />
hyphy bats.bf<br />
<br />
You should see the empirical nucleotide frequencies displayed as follows:<br />
{<br />
{0.1936054742803209}<br />
{0.3104845052697813}<br />
{0.3106418121755545}<br />
{0.1852682082743432}<br />
}<br />
<br />
I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.<br />
<br />
The <tt>fprintf</tt> command also mimics the C programming language. It allows you to print objects to <tt>stdout</tt> (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable <tt>observedFreqs</tt>.<br />
<br />
=== Define the HKY85 rate matrix ===<br />
<br />
Add the following to the bottom of your <tt>bats.bf</tt> file:<br />
/*****************************************************************************************<br />
| Define the KHY substitution matrix. '*' is used for the diagonal elements that can be<br />
|.computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is <br />
| declared to be global, meaning it is shared by all edges.<br />
*/<br />
global kappa = 4.224618;<br />
HKY85RateMatrix = <br />
{{ * , betat , betat*kappa , betat }<br />
{ betat , * , betat , betat*kappa }<br />
{ betat*kappa , betat , * , betat }<br />
{ betat , betat*kappa , betat , * }};<br />
fprintf(stdout, HKY85RateMatrix);<br />
<br />
Run <tt>bats.bf</tt> in HyPhy. Did the <tt>HKY85RateMatrix</tt> variable have the value you expected? Why or why not? (Hint: the variable <tt>betat</tt> was initialized to zero.) Set <tt>betat = 1;</tt> before the <tt>fprintf</tt> statement and run the batch file again to see the effect. Does the output make sense now?<br />
<br />
=== Combine frequencies with rate matrix to create an HKY85 model ===<br />
<br />
/*****************************************************************************************<br />
| Define the HKY85 model, by combining the substitution matrix with the vector of <br />
| empirical frequencies. <br />
*/<br />
Model hky1 = (HKY85RateMatrix, observedFreqs);<br />
<br />
Now we have a model variable (<tt>hky1</tt>) that can be applied to each edge of a tree. <br />
<br />
=== Create a tree representing the null hypothesis ===<br />
<br />
The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.<br />
<br />
/*****************************************************************************************<br />
| Define the tree variable, using the tree description read from the data file.<br />
| By default, the last defined model (hky1) is assigned to all edges of the tree. <br />
*/<br />
Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";;<br />
<br />
=== Creating the second model ===<br />
<br />
Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the <tt>hky1</tt> model has been applied to every node in the tree. You can verify this using the following HBL code:<br />
<br />
GetInformation(modelMap, constrainedTree);<br />
fprintf(stdout, modelMap);<br />
<br />
Add these two lines to the bottom of your <tt>bats.bf</tt> file and run it. You should see this near the bottom of your output:<br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node19":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky1",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky1",<br />
"Tonatia_silvicola":"hky1",<br />
"Tupaia":"hky1"<br />
}<br />
<br />
First, create a second model named <tt>hky2</tt> and apply it to four specific edges in the tree:<br />
<br />
/*****************************************************************************************<br />
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. <br />
*/<br />
highATfreqs = {{.45}{.05}{.05}{.45}};<br />
Model hky2 = (HKY85RateMatrix, highATfreqs);<br />
SetParameter(constrainedTree.microbats, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);<br />
SetParameter(constrainedTree.Pteropus, MODEL, hky2);<br />
<br />
If you move your <tt>GetInformation</tt> call and its accompanying <tt>fprintf</tt> after these 6 lines, then you should see this in the output: <br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky2",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky2",<br />
"Tonatia_silvicola":"hky2",<br />
"Tupaia":"hky1",<br />
"microbats":"hky2"<br />
}<br />
<br />
=== Don't get too comfortable just yet ===<br />
<br />
It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:<br />
<br />
fprintf(stdout, "\n\nCurrent edge lengths:\n");<br />
edgeNames = BranchName(constrainedTree,-1); <br />
edgeLengths = BranchLength(constrainedTree,-1);<br />
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {<br />
fprintf(stdout, Format(edgeLengths[k],10,5), " ", edgeNames[k], "\n");<br />
}<br />
<br />
The first line simply skips a couple of lines (<tt>\n\n</tt>) and prints a header announcing that current edge lengths will follow.<br />
<br />
The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The <tt>-1</tt> in these functions is somewhat obscure, but means "give me all of them". (If you used <tt>5</tt> rather than <tt>-1</tt>, it would give you the name of the edge having index 5.)<br />
<br />
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints out the length of edge <tt>k</tt> (the <tt>Format</tt> command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement causes a carriage return so that the edge lengths and names do not all end up on the same line of output.<br />
<br />
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?<br />
<br />
=== Fixing up edge lengths ===<br />
<br />
The problem is that applying a new model to those 4 edges caused their edge length information to be erased! We now need to set the <tt>betat</tt> parameter for each of those edges to be compatible with the edge lengths that were there originally. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The <tt>AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1</tt> setting that we placed at the top of the file saves us from having to go through this procedure for all the <tt>hky1</tt> edges, but now that we've replaced the models on 4 edges, we need to do a bit of work to get those edge lengths back.<br />
<br />
Recall that, under the JC69 model, the edge length equals <tt>v = 3*beta*t</tt>. The <tt>betat</tt> parameter that appears in our <tt>HKY85RateMatrix</tt> is the <tt>beta*t</tt> part, so to set the <tt>betat</tt> parameter we need to divide the desired edge length by 3: that is, <tt>betat = v/3</tt>. The scaling factor (3 in the JC69 case) is more complicated for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor needed to convert edge lengths to <tt>betat</tt> values.<br />
<br />
betat = 1.0;<br />
scalingFactor = 0.0;<br />
for (n1 = 0; n1 < 4; n1 = n1+1) {<br />
    for (n2 = 0; n2 < 4; n2 = n2+1) {<br />
        if (n2!=n1) {<br />
            scalingFactor = scalingFactor + highATfreqs[n1]*highATfreqs[n2]*HKY85RateMatrix[n1][n2];<br />
        }<br />
    }<br />
}<br />
<br />
The two nested loops visit every off-diagonal element of the <tt>HKY85RateMatrix</tt> and multiply each element by the base frequency of its row and the base frequency of its column. Each such product of 3 terms is then added to the growing sum <tt>scalingFactor</tt>. Review your lecture notes if you don't remember why <tt>scalingFactor</tt> is computed in this way. Note that it is important to set <tt>betat = 1</tt> before the loop in order to ensure that the scaling factor works correctly.<br />
<br />
Now we simply need to set the <tt>betat</tt> values for the four edges of interest:<br />
constrainedTree.microbats.betat := 0.097851/scalingFactor;<br />
constrainedTree.Tonatia_bidens.betat := 0.008252/scalingFactor;<br />
constrainedTree.Tonatia_silvicola.betat := 0.000000/scalingFactor;<br />
constrainedTree.Pteropus.betat := 0.104663/scalingFactor;<br />
<br />
Place the loop that shows edge lengths after these 4 lines in order to check and make sure the edge lengths have been set correctly.<br />
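If you would like to sanity-check this arithmetic outside HyPhy, here is a short Python sketch of the same computation (the function name is mine; the frequencies, kappa value, and edge length are the ones used in this lab):<br />

```python
# Mirror of the HBL scaling-factor loop. With betat = 1, the rate
# matrix entries are the relative rates: kappa for transitions
# (A<->G, C<->T), 1 for transversions (states A,C,G,T = 0,1,2,3).
def scaling_factor(freqs, kappa):
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}
    total = 0.0
    for n1 in range(4):
        for n2 in range(4):
            if n2 != n1:
                rate = kappa if (n1, n2) in transitions else 1.0
                total += freqs[n1] * freqs[n2] * rate
    return total

high_at_freqs = [0.45, 0.05, 0.05, 0.45]   # the hky2 frequencies
sf = scaling_factor(high_at_freqs, 4.224618)
betat_microbats = 0.097851 / sf            # betat giving edge length 0.097851
```

Running this reproduces the value that the HBL loop stores in <tt>scalingFactor</tt>, and dividing an edge length by it gives the corresponding <tt>betat</tt> value.<br />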
<br />
=== Constructing the likelihood function ===<br />
<br />
/*****************************************************************************************<br />
| Build the likelihood function and print it out, which causes HyPhy to actually compute<br />
| the log-likelihood for the tree.<br />
*/<br />
LikelihoodFunction likelihood = (filteredData, constrainedTree);<br />
fprintf(stdout, likelihood);<br />
<br />
If you add the 2 lines above to your growing bats.bf file and run it, you should see that the log-likelihood is (to 6 decimal places) equal to <tt>-6472.478318</tt>.<br />
<br />
The <tt>ASSUME_REVERSIBLE_MODELS = -1</tt> that we placed at the beginning of the batch file is needed to prevent HyPhy from assuming it can reroot the tree at any node it wants (which leads to trouble because the nucleotide composition changes across the tree and thus the rooting matters). The <tt>ACCEPT_ROOTED_TREES = TRUE</tt> at the top of the file prevents HyPhy from automatically converting our tree description (which specifies a rooted tree) into an unrooted tree.<br />
<br />
=== Ready to simulate! ===<br />
<br />
We are now ready to perform the simulations. As soon as you create a <tt>LikelihoodFunction</tt> object in HyPhy that is capable of computing the log-likelihood, that object can be used to simulate data because it has a tree with branches that have an assigned model of evolution and all parameters of the substitution models have been either specified (as we did here) or estimated.<br />
<br />
Here is the simulation loop:<br />
/*****************************************************************************************<br />
| Perform the simulations. <br />
*/<br />
for (simCounter = 0; simCounter < 10; simCounter = simCounter+1) {<br />
    // Simulate a data set of the same size as the original set<br />
    DataSet simulatedData = SimulateDataSet(likelihood);<br />
<br />
    // The filter is necessary, but trivial in this case because all sites are used<br />
    DataSetFilter filteredSimData = CreateFilter(simulatedData,1);<br />
<br />
    // Save simulated data to a file<br />
    outFile = "simdata/sim"+simCounter+".nex";<br />
    fprintf(outFile, filteredSimData);<br />
}<br />
<br />
I have used both styles of comments here: the main comment before the loop uses the <tt>/* ... */</tt> style, and the double-slash style is used for comments within the loop. The <tt>LikelihoodFunction</tt> object <tt>likelihood</tt> is passed to the <tt>SimulateDataSet</tt> function to generate the data. It is always necessary to create a <tt>DataSetFilter</tt> in HyPhy, even if no filtering occurs. If you printed <tt>filteredSimData</tt> to <tt>stdout</tt> using <tt>fprintf</tt>, the entire data set would be spewed to the screen, which is not very helpful. Here we instead give <tt>fprintf</tt> a file name rather than <tt>stdout</tt>, which causes the simulated data to be saved to that file. The file name is constructed using <tt>simCounter</tt> so that every simulated file has a unique name. Note that all simulated data will be saved in the <tt>simdata</tt> directory that you created early on in this tutorial.<br />
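To demystify what <tt>SimulateDataSet</tt> is doing, here is a toy Python sketch (my own illustration, not HyPhy code) that evolves independent sites along a single branch under the AT-rich HKY85 parameterization used above:<br />

```python
import random

# Toy illustration of simulating sites on ONE branch under HKY85
# (states A,C,G,T = 0..3; the rate to state j is freqs[j], times
# kappa for transitions A<->G and C<->T). HyPhy does the equivalent
# for every site on every edge of the tree.
TRANSITIONS = {(0, 2), (2, 0), (1, 3), (3, 1)}

def evolve_site(state, brlen, freqs, kappa, rng):
    t = 0.0
    while True:
        rates = [0.0 if j == state else
                 freqs[j] * (kappa if (state, j) in TRANSITIONS else 1.0)
                 for j in range(4)]
        total = sum(rates)
        t += rng.expovariate(total)      # exponential waiting time
        if t > brlen:
            return state                 # branch ends before the next event
        u = rng.random() * total         # pick the target state by its rate
        for j, r in enumerate(rates):
            u -= r
            if u <= 0.0:
                state = j
                break

rng = random.Random(2020)
freqs = [0.45, 0.05, 0.05, 0.45]         # the hky2 frequencies
start = rng.choices(range(4), weights=freqs, k=20000)
end = [evolve_site(s, 0.5, freqs, 4.224618, rng) for s in start]
at_fraction = sum(1 for s in end if s in (0, 3)) / len(end)
```

Because the sites start at the stationary frequencies, the simulated sequence remains roughly 90% A+T; it is this compositional pull, applied only to the bat lineages, that drives the convergence the Flying DNA test is probing.<br />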
<br />
== Using PAUP* to estimate parsimony trees for each simulated data set ==<br />
<br />
HyPhy is great for estimating parameters on a tree using quite complex models; however, PAUP* excels at searching for the best tree, so we will leave HyPhy now and turn to PAUP* to finish our work.<br />
<br />
Create the following file in the same directory as <tt>bats.bf</tt>. Call the new file <tt>parsimony.nex</tt>.<br />
#nexus<br />
<br />
begin paup;<br />
log file=simdata/pauplog.txt start replace;<br />
set criterion=parsimony warnreset=no warntsave=no;<br />
end;<br />
<br />
begin python;<br />
for i in range(10):<br />
    paup.cmd("exe simdata/sim%d.nex;" % i)<br />
    paup.cmd("hsearch;")<br />
    paup.cmd("contree all;")<br />
    paup.cmd("constraints flyingdna (monophyly) = ((PTEROPUS, 'TONATIA_BIDENS', 'TONATIA_SILVICOLA'));")<br />
    paup.cmd("filter constraints=flyingdna;")<br />
end;<br />
<br />
begin paup;<br />
log stop;<br />
quit;<br />
end;<br />
<br />
This file contains 2 paup blocks with a python block sandwiched in between. That's right: PAUP* can execute python commands, and this comes in handy when you want to do the same thing over and over again, such as process a bunch of simulated data files in the exact same way.<br />
<br />
The first paup block starts a log (which will be created in the <tt>simdata</tt> directory) and sets the optimality criterion to parsimony. It also tells PAUP* to not warn us when data files are reset or when we try to quit when there are still trees that haven't been saved.<br />
<br />
The final paup block just closes the log file and then quits.<br />
<br />
The python block is where all the interesting things happen. As with python in general, be sure you have the indentation of lines correct inside the python block, otherwise you will get an error message from Python. You can see that the block is one big loop over simulation replicates. Commands that you want PAUP* to execute have to be created as strings (inside double quotes) and then passed to PAUP* via a <tt>paup.cmd</tt> function call. <br />
<br />
The first <tt>paup.cmd</tt> executes a file named <tt>sim%d.nex</tt> inside the simdata directory, where <tt>%d</tt> is replaced by the value of <tt>i</tt>. Thus, the loop will, in turn, execute the files simdata/sim0.nex, simdata/sim1.nex, ..., simdata/sim9.nex.<br />
<br />
The second <tt>paup.cmd</tt> performs a heuristic search with all default settings. You could make this more explicit/elaborate if you wished, but the default settings work well in this case.<br />
<br />
The third <tt>paup.cmd</tt> creates (and shows) a strict consensus tree of all most parsimonious trees found during the search. There are often several best trees found during a parsimony search, and this shows us what is common to all these trees.<br />
<br />
We could leave it at that, but the last two lines make it easier to tally how many simulation replicates resulted in a parsimony tree in which all bats form a monophyletic group. We first create a monophyly constraint named <tt>flyingdna</tt> and then filter the trees resulting from the parsimony search using the <tt>flyingdna</tt> constraint. Trees that satisfy the constraint are kept while all trees in which bats are not monophyletic are discarded. If any trees remain after filtering, we will count 1; if no trees remain after filtering, we will count 0. The total count divided by the number of simulation replicates will give us the y-value for the plot that recreates Figure 4 from the Vandenbussche et al. paper.<br />
<br />
Run this file in PAUP* to generate the <tt>pauplog.txt</tt> file, then look through that file to see how many replicates yielded bat monophyly.<br />
<br />
== Final exercise ==<br />
<br />
Once you confirm that your scripts are working, run your <tt>bats.bf</tt> using HyPhy followed by running <tt>parsimony.nex</tt> in PAUP* a total of 6 times, each with a different value of <tt>highATfreqs</tt> that reflects one of these AT percentages: 50, 60, 70, 80, 90, 100. You may also wish to bump up the number of simulation replicates to at least 20 or 50 in both <tt>bats.bf</tt> and <tt>parsimony.nex</tt> so that you get more accurate y-axis values. <br />
<br />
Note that you can use a command like that below to pull out only the lines that report the number of trees retained from the file <tt>pauplog.txt</tt>:<br />
cat simdata/pauplog.txt | grep "Number of trees retained by filter"<br />
The <tt>cat</tt> command simply dumps a file to the screen. Instead of sending the output to the console, the pipe symbol (<tt>|</tt>) causes the output to be piped into the <tt>grep</tt> command, which filters out everything except lines that contain the supplied string. This makes it easy to perform your counts.<br />
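If you would rather tally the counts programmatically than by eye, here is a small Python helper (my own; it assumes the retained-tree count is the last token on each matching line of the log):<br />

```python
def monophyly_fraction(log_path):
    """Fraction of replicates in which any trees survive the bat-monophyly filter."""
    kept = 0
    total = 0
    with open(log_path) as log:
        for line in log:
            if "Number of trees retained by filter" in line:
                total += 1
                # assumes the count is the final token on the line
                if int(line.strip().split()[-1]) > 0:
                    kept += 1
    return kept / total if total else 0.0
```

Calling <tt>monophyly_fraction("simdata/pauplog.txt")</tt> then gives the y-axis value for the plot directly.<br />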
<br />
How does your plot compare to the one published in Vandenbussche et al. (1998)?<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41284Phylogenetics: HyPhy Lab2020-02-24T20:31:29Z<p>Paul Lewis: /* Final exercise */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(stdout, observedFreqs);<br />
<br />
=== Run bats.bf ===<br />
<br />
Run your <tt>bats.bf</tt> file as follows:<br />
hyphy bats.bf<br />
<br />
You should see the empirical nucleotide frequencies displayed as follows:<br />
{<br />
{0.1936054742803209}<br />
{0.3104845052697813}<br />
{0.3106418121755545}<br />
{0.1852682082743432}<br />
}<br />
<br />
I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.<br />
<br />
The <tt>fprintf</tt> command also mimics the C programming language. It allows you to print objects to <tt>stdout</tt> (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable <tt>observedFreqs</tt>.<br />
<br />
=== Define the HKY85 rate matrix ===<br />
<br />
Add the following to the bottom of your <tt>bats.bf</tt> file:<br />
/*****************************************************************************************<br />
| Define the KHY substitution matrix. '*' is used for the diagonal elements that can be<br />
|.computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is <br />
| declared to be global, meaning it is shared by all edges.<br />
*/<br />
global kappa = 4.224618;<br />
HKY85RateMatrix = <br />
{{ * , betat , betat*kappa , betat }<br />
{ betat , * , betat , betat*kappa }<br />
{ betat*kappa , betat , * , betat }<br />
{ betat , betat*kappa , betat , * }};<br />
fprintf(stdout, HKY85RateMatrix);<br />
<br />
Run <tt>bats.bf</tt> in HyPhy. Did the <tt>HKY85RateMatrix</tt> variable have the value you expected? Why or why not? (Hint: the variable <tt>betat</tt> was initialized to zero.) Set <tt>betat = 1;</tt> before the <tt>fprintf</tt> statement and run the batch file again to see the effect. Does the output make sense now?<br />
<br />
=== Combine frequencies with rate matrix to create an HKY85 model ===<br />
<br />
/*****************************************************************************************<br />
| Define the HKY85 model, by combining the substitution matrix with the vector of <br />
| empirical frequencies. <br />
*/<br />
Model hky1 = (HKY85RateMatrix, observedFreqs);<br />
<br />
Now we have a model variable (<tt>hky1</tt>) that can be applied to each edge of a tree. <br />
<br />
=== Create a tree representing the null hypothesis ===<br />
<br />
The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.<br />
<br />
/*****************************************************************************************<br />
| Define the tree variable, using the tree description read from the data file.<br />
| By default, the last defined model (hky1) is assigned to all edges of the tree. <br />
*/<br />
Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";;<br />
<br />
=== Creating the second model ===<br />
<br />
Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the <tt>hky1</tt> model has been applied to every node in the tree. You can verify this using the following HBL code:<br />
<br />
GetInformation(modelMap, constrainedTree);<br />
fprintf(stdout, modelMap);<br />
<br />
Add these two lines to the bottom of your <tt>bats.bf</tt> file and run it. You should see this near the bottom of your output:<br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node19":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky1",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky1",<br />
"Tonatia_silvicola":"hky1",<br />
"Tupaia":"hky1"<br />
}<br />
<br />
First, create a second model named <tt>hky2</tt> and apply it to four specific edges in the tree:<br />
<br />
/*****************************************************************************************<br />
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. <br />
*/<br />
highATfreqs = {{.45}{.05}{.05}{.45}};<br />
Model hky2 = (HKY85RateMatrix, highATfreqs);<br />
SetParameter(constrainedTree.microbats, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);<br />
SetParameter(constrainedTree.Pteropus, MODEL, hky2);<br />
<br />
If you move your <tt>GetInformation</tt> call and its accompanying <tt>fprintf</tt> after these 6 lines, then you should see this in the output: <br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky2",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky2",<br />
"Tonatia_silvicola":"hky2",<br />
"Tupaia":"hky1",<br />
"microbats":"hky2"<br />
}<br />
<br />
=== Don't get too comfortable just yet ===<br />
<br />
It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:<br />
<br />
fprintf(stdout, "\n\nCurrent edge lengths:\n");<br />
edgeNames = BranchName(constrainedTree,-1); <br />
edgeLengths = BranchLength(constrainedTree,-1);<br />
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {<br />
fprintf(stdout, Format(edgeLengths[k],10,5), " ", edgeNames[k], "\n");<br />
}<br />
<br />
The first line simply skips a couple of lines (<tt>\n\n</tt>) and prints a header announcing that current edge lengths will follow.<br />
<br />
The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The <tt>-1</tt> in these functions is somewhat obscure, but means "give me all of them". (If you used <tt>5</tt> rather than <tt>-1</tt>, it would give you the name of the edge having index 5.)<br />
<br />
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints out the length of edge <tt>k</tt> (the <tt>Format</tt> command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement causes a carriage return so that the edge lengths and names do not all end up on the same line of output.<br />
<br />
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?<br />
<br />
=== Fixing up edge lengths ===<br />
<br />
The problem is that applying a new model to the 4 edge lengths caused the edge length information to be erased! We need to now set the <tt>betat</tt> parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The <tt>AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1</tt> that we placed at the top of the file saves us from having to go through this procedure for all the <tt>hky1</tt> edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.<br />
<br />
Recall that, under the JC69 model, the edge length equals <tt>v = 3*beta*t</tt>. The <tt>betat</tt> parameter that appears in our <tt>HKY85RateMatrix</tt> is the <tt>beta*t</tt> part, so to set the <tt>betat</tt> parameter we need to divide the desired edge length by 3: that is, <tt>betat = v/3</tt>. The scaling factor 3 becomes more complex for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to <tt>betat</tt> values.<br />
<br />
betat = 1.0;<br />
scalingFactor = 0.0;<br />
for (n1 = 0; n1 < 4; n1 = n1+1) {<br />
for (n2 = 0; n2 < 4; n2 = n2+1) {<br />
if (n2!=n1) {<br />
scalingFactor = scalingFactor + highATfreqs[n1]*highATfreqs[n2]*HKY85RateMatrix[n1][n2];<br />
}<br />
}<br />
}<br />
<br />
The two nested loops visit every off-diagonal element of the <tt>HKY85RateMatrix</tt>, multiply that element by the base frequency of its row and the base frequency of its column. This product of 3 terms is then added to the growing sum <tt>scalingFactor</tt>. Review your lecture notes if you don't remember why <tt>scalingFactor</tt> is computed in this way. Note that it is important to set <tt>betat = 1</tt> before the loop in order to ensure that the scaling factor works correctly.<br />
<br />
Now we simply need to set the <tt>betat</tt> values for the four edges of interest:<br />
constrainedTree.microbats.betat := 0.097851/scalingFactor;<br />
constrainedTree.Tonatia_bidens.betat := 0.008252/scalingFactor;<br />
constrainedTree.Tonatia_silvicola.betat := 0.000000/scalingFactor;<br />
constrainedTree.Pteropus.betat := 0.104663/scalingFactor;<br />
<br />
Place the loop that shows edge lengths after these 4 lines in order to check and make sure the edge lengths have been set correctly.<br />
<br />
=== Constructing the likelihood function ===<br />
<br />
/*****************************************************************************************<br />
| Build the likelihood function and print it out, which causes HyPhy to actually compute<br />
| the log-likelihood for the tree.<br />
*/<br />
LikelihoodFunction likelihood = (filteredData, constrainedTree);<br />
fprintf(stdout, likelihood);<br />
<br />
If you add the 2 lines above to your growing bats.bf file and run it, you should see that the log-likelihood is (to 6 decimal places) equal to <tt>-6472.478318</tt>.<br />
<br />
The <tt>ASSUME_REVERSIBLE_MODELS = -1</tt> that we placed at the beginning of the batch file is needed to prevent HyPhy from assuming it can reroot the tree at any node it wants (which leads to trouble because the nucleotide composition changes across the tree and thus the rooting matters). The <tt>ACCEPT_ROOTED_TREES = TRUE</tt> at the top of the file prevents HyPhy from automatically converting our tree description (which specifies a rooted tree) into an unrooted tree.<br />
<br />
=== Ready to simulate! ===<br />
<br />
We are now ready to perform the simulations. As soon as you create a <tt>LikelihoodFunction</tt> object in HyPhy that is capable of computing the log-likelihood, that object can be used to simulate data because it has a tree with branches that have an assigned model of evolution and all parameters of the substitution models have been either specified (as we did here) or estimated.<br />
<br />
Here is the simulation loop:<br />
/*****************************************************************************************<br />
| Perform the simulations. <br />
*/<br />
for (simCounter = 0; simCounter < 10; simCounter = simCounter+1) {<br />
// Simulate a data set of the same size as the original set<br />
DataSet simulatedData = SimulateDataSet(likelihood);<br />
<br />
// The filter is necessary, but trivial in this case because alls sites are used <br />
DataSetFilter filteredSimData = CreateFilter(simulatedData,1);<br />
<br />
// Save simulated data to a file<br />
outFile = "simdata/sim"+simCounter+".nex";<br />
fprintf(outFile, filteredSimData);<br />
}<br />
<br />
I have used both styles of comments here: the main comment before the loop is done using the <tt>/* ... */</tt> style, and the double-slash style is used for comments within the loop. The <tt>LikelihoodFunction</tt> object likelihood is passed to the <tt>SimulateDataSet</tt> function to generate the data. It is always necessary to create a <tt>DataSetFilter</tt> in HyPhy, even if no filtering occurs. If one prints <tt>filteredSimData</tt> using <tt>fprintf</tt> to <tt>stdout</tt>, then the entire data file would be spewed to the screen. That's not very helpful. Here we are giving <tt>fprintf</tt> a file name rather than <tt>stdout</tt>, which causes the simulated data to be saved to that file. The file name is constructed using <tt>simCounter</tt> so that every simulated file has a name that is different. Note that all simulated data will be saved in the <tt>simdata</tt> directory that you created early on in this tutorial.<br />
<br />
== Using PAUP* to estimate parsimony trees for each simulated data set ==<br />
<br />
HyPhy is great for estimating parameters on a tree using quite complex models; however, PAUP* excels at searching for the best tree, so we will leave HyPhy now and turn to PAUP* to finish our work.<br />
<br />
Create the following file in the same directory as <tt>bats.bf</tt>. Call the new file <tt>parsimony.nex</tt>.<br />
#nexus<br />
<br />
begin paup;<br />
log file=simdata/pauplog.txt start replace;<br />
set criterion=parsimony warnreset=no warntsave=no;<br />
end;<br />
<br />
begin python;<br />
for i in range(10):<br />
paup.cmd("exe simdata/sim%d.nex;" % i)<br />
paup.cmd("hsearch;")<br />
paup.cmd("contree all;")<br />
paup.cmd("constraints flyingdna (monophyly) = ((PTEROPUS, 'TONATIA_BIDENS', 'TONATIA_SILVICOLA'));")<br />
paup.cmd("filter constraints=flyingdna;")<br />
end;<br />
<br />
begin paup;<br />
log stop;<br />
quit;<br />
end;<br />
<br />
This file contains 2 paup blocks with a python block sandwiched in between. That's right: PAUP* can execute python commands, and this comes in handy when you want to do the same thing over and over again, such as process a bunch of simulated data files in the exact same way.<br />
<br />
The first paup block starts a log (which will be created in the <tt>simdata</tt> directory) and sets the optimality criterion to parsimony. It also tells PAUP* to not warn us when data files are reset or when we try to quit when there are still trees that haven't been saved.<br />
<br />
The final paup block just closes the log file and then quits.<br />
<br />
The python block is where all the interesting things happen. As always with Python, be sure the lines inside the python block are indented correctly; otherwise Python will report an error. You can see that the block is one big loop over simulation replicates. Commands that you want PAUP* to execute must be created as strings (inside double quotes) and then passed to PAUP* via a <tt>paup.cmd</tt> function call.<br />
<br />
The first <tt>paup.cmd</tt> executes a file named <tt>sim%d.nex</tt> inside the simdata directory, where <tt>%d</tt> is replaced by the value of <tt>i</tt>. Thus, the loop will execute, in turn, the files simdata/sim0.nex, simdata/sim1.nex, ..., simdata/sim9.nex (recall that <tt>range(10)</tt> yields 0 through 9).<br />
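If you want to preview the file names that this <tt>%d</tt> formatting generates, you can check it in a plain Python interpreter (this snippet is just a sanity check, not part of the lab scripts; the variable name <tt>names</tt> is my own):<br />

```python
# Build the same file names the python block asks PAUP* to execute.
names = ["simdata/sim%d.nex" % i for i in range(10)]
print(names[0], names[-1])  # range(10) runs 0 through 9, so the last file is sim9.nex
```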
<br />
The second <tt>paup.cmd</tt> performs a heuristic search with all default settings. You could make this more explicit/elaborate if you wished, but the default settings work well in this case.<br />
<br />
The third <tt>paup.cmd</tt> creates (and shows) a strict consensus tree of all most parsimonious trees found during the search. There are often several best trees found during a parsimony search, and this shows us what is common to all these trees.<br />
<br />
We could leave it at that, but the last two lines make it easier to tally how many simulation replicates resulted in a parsimony tree in which all bats form a monophyletic group. We first create a monophyly constraint named <tt>flyingdna</tt> and then filter the trees resulting from the parsimony search using the <tt>flyingdna</tt> constraint. Trees that satisfy the constraint are kept while all trees in which bats are not monophyletic are discarded. If any trees remain after filtering, we will count 1; if no trees remain after filtering, we will count 0. The total count divided by the number of simulation replicates will give us the y-value for the plot that recreates Figure 4 from the Vandenbussche et al. paper.<br />
<br />
Run this file in PAUP* to generate the <tt>pauplog.txt</tt> file, then look through that file to see how many replicates yielded bat monophyly.<br />
<br />
== Final exercise ==<br />
<br />
Once you confirm that your scripts are working, run your <tt>bats.bf</tt> using HyPhy followed by running <tt>parsimony.nex</tt> in PAUP* a total of 6 times, each with a different value of <tt>highATfreqs</tt> that reflects one of these AT percentages: 50, 60, 70, 80, 90, 100. You may also wish to bump up the number of simulation replicates to at least 20 or 50 in both <tt>bats.bf</tt> and <tt>parsimony.nex</tt> so that you get more accurate y-axis values. <br />
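Each AT percentage translates into a <tt>highATfreqs</tt> vector by splitting the AT fraction evenly between A and T and the remainder evenly between C and G. A quick Python sketch (just for checking your vectors; the function name <tt>at_freqs</tt> is my own) computes them:<br />

```python
def at_freqs(at_percent):
    # Split the AT percentage evenly between A and T, and the
    # remaining GC percentage evenly between C and G.
    pi_at = at_percent / 200.0
    pi_gc = (100 - at_percent) / 200.0
    return (pi_at, pi_gc, pi_gc, pi_at)  # order: A, C, G, T

for pct in (50, 60, 70, 80, 90, 100):
    print(pct, at_freqs(pct))
```

For example, 90% AT corresponds to <tt>highATfreqs = {{.45}{.05}{.05}{.45}}</tt>, and 100% AT gives <tt>{{.5}{0}{0}{.5}}</tt>.<br />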
<br />
Note that you can use a command like that below to pull out only the lines that report the number of trees retained from the file <tt>pauplog.txt</tt>:<br />
cat simdata/pauplog.txt | grep "Number of trees retained by filter"<br />
The <tt>cat</tt> command simply dumps a file to the screen. The pipe symbol (<tt>|</tt>) redirects that output into the <tt>grep</tt> command, which filters out everything except lines that contain the supplied string. This makes it easy to perform your counts.<br />
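If you would rather compute the proportion directly, a short Python script can tally those log lines for you. This is only a sketch: it assumes each relevant line in <tt>pauplog.txt</tt> contains the phrase above followed by an integer count, and the function name <tt>monophyly_proportion</tt> is my own.<br />

```python
import re

def monophyly_proportion(log_text):
    # Pull the integer that follows each "Number of trees retained by filter"
    # phrase, then report the fraction of replicates retaining at least one tree.
    counts = [int(m.group(1)) for m in
              re.finditer(r"Number of trees retained by filter\D*(\d+)", log_text)]
    if not counts:
        return 0.0
    return sum(1 for c in counts if c > 0) / float(len(counts))

# Fabricated four-replicate log excerpt (not real PAUP* output):
demo = ("Number of trees retained by filter = 2\n"
        "Number of trees retained by filter = 0\n"
        "Number of trees retained by filter = 1\n"
        "Number of trees retained by filter = 0\n")
print(monophyly_proportion(demo))  # 2 of 4 replicates kept a tree, so prints 0.5
```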
<br />
How does your plot compare to the one published in Vandenbussche et al. (1998)?<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41282Phylogenetics: HyPhy Lab2020-02-24T19:56:52Z<p>Paul Lewis: /* Ready to simulate! */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
 | excluding sites with a lot of missing data and/or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
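If you want to double-check the arithmetic behind the 978-site figure, here is a quick sanity check in plain Python (the inclusive, 0-based interpretation of the ranges is my assumption, but it matches the count given in the comment):<br />

```python
# Sanity check (plain Python, not HBL) on the filter string "106-183,190-1089".
# Assuming each range is inclusive and site indices are 0-based, the two
# ranges together should retain exactly 978 sites.
ranges = [(106, 183), (190, 1089)]
n_sites = sum(hi - lo + 1 for lo, hi in ranges)  # 78 + 900
```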
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(stdout, observedFreqs);<br />
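As a plain-Python analogy (not HyPhy code) for what <tt>HarvestFrequencies</tt> does with unit=1 and atom=1, imagine pooling every site from every sequence and normalizing the A, C, G, T counts:<br />

```python
# Toy analogy for HarvestFrequencies with unit=1, atom=1: pool all sites
# from all sequences and normalize the A, C, G, T counts.
from collections import Counter

def empirical_freqs(sequences):
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

freqs = empirical_freqs(["ACGTAC", "AATTGC"])  # 12 pooled sites
```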
<br />
=== Run bats.bf ===<br />
<br />
Run your <tt>bats.bf</tt> file as follows:<br />
hyphy bats.bf<br />
<br />
You should see the empirical nucleotide frequencies displayed as follows:<br />
{<br />
{0.1936054742803209}<br />
{0.3104845052697813}<br />
{0.3106418121755545}<br />
{0.1852682082743432}<br />
}<br />
<br />
I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.<br />
<br />
The <tt>fprintf</tt> command also mimics the C programming language. It allows you to print objects to <tt>stdout</tt> (standard output). This is a useful tool for sanity checks, such as verifying that the frequencies were indeed harvested and stored in the variable <tt>observedFreqs</tt>.<br />
<br />
=== Define the HKY85 rate matrix ===<br />
<br />
Add the following to the bottom of your <tt>bats.bf</tt> file:<br />
/*****************************************************************************************<br />
 | Define the HKY85 substitution matrix. '*' is used for the diagonal elements that can be<br />
 | computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is <br />
| declared to be global, meaning it is shared by all edges.<br />
*/<br />
global kappa = 4.224618;<br />
HKY85RateMatrix = <br />
{{ * , betat , betat*kappa , betat }<br />
{ betat , * , betat , betat*kappa }<br />
{ betat*kappa , betat , * , betat }<br />
{ betat , betat*kappa , betat , * }};<br />
fprintf(stdout, HKY85RateMatrix);<br />
<br />
Run <tt>bats.bf</tt> in HyPhy. Did the <tt>HKY85RateMatrix</tt> variable have the value you expected? Why or why not? (Hint: the variable <tt>betat</tt> was initialized to zero.) Set <tt>betat = 1;</tt> before the <tt>fprintf</tt> statement and run the batch file again to see the effect. Does the output make sense now?<br />
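To see what is going on, here is the same matrix sketched in plain Python (an illustration, not HBL; state order is A, C, G, T, and the diagonal is left at zero because HyPhy fills in the <tt>*</tt> entries itself):<br />

```python
# Plain-Python sketch of HKY85RateMatrix (not HBL). With betat = 0 every
# off-diagonal entry is zero, which is why the first fprintf shows an
# all-zero matrix; betat = 1 makes the kappa structure visible.
kappa = 4.224618

def hky85_matrix(betat):
    b, k = betat, betat * kappa           # transversion and transition entries
    return [[0, b, k, b],                 # A row: A<->G is the transition
            [b, 0, b, k],                 # C row: C<->T is the transition
            [k, b, 0, b],                 # G row
            [b, k, b, 0]]                 # T row

before = hky85_matrix(0.0)   # every entry 0, since betat defaults to zero
after  = hky85_matrix(1.0)   # transversions 1, transitions kappa
```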
<br />
=== Combine frequencies with rate matrix to create an HKY85 model ===<br />
<br />
/*****************************************************************************************<br />
| Define the HKY85 model, by combining the substitution matrix with the vector of <br />
| empirical frequencies. <br />
*/<br />
Model hky1 = (HKY85RateMatrix, observedFreqs);<br />
<br />
Now we have a model variable (<tt>hky1</tt>) that can be applied to each edge of a tree. <br />
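My understanding of what the <tt>Model</tt> statement does (a sketch of the standard parameterization for frequency-weighted models, not a verbatim account of HyPhy internals) is that each off-diagonal rate becomes the matrix entry times the equilibrium frequency of the target state, with diagonal entries chosen so every row sums to zero:<br />

```python
# Assumed parameterization (standard for HKY85-style models, sketched in
# plain Python): Q[i][j] = M[i][j] * pi[j] for i != j, with diagonal
# entries chosen so that every row of Q sums to zero.
def build_q(m, pi):
    n = len(pi)
    q = [[m[i][j] * pi[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i][j] for j in range(n) if j != i)
    return q

kappa = 4.224618
b = 1.0
m = [[0, b, b * kappa, b],
     [b, 0, b, b * kappa],
     [b * kappa, b, 0, b],
     [b, b * kappa, b, 0]]
pi = [0.1936, 0.3105, 0.3106, 0.1853]    # rounded empirical frequencies
q = build_q(m, pi)
```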
<br />
=== Create a tree representing the null hypothesis ===<br />
<br />
The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.<br />
<br />
/*****************************************************************************************<br />
| Define the tree variable, using the tree description read from the data file.<br />
| By default, the last defined model (hky1) is assigned to all edges of the tree. <br />
*/<br />
 Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";<br />
<br />
=== Creating the second model ===<br />
<br />
Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the <tt>hky1</tt> model has been applied to every edge in the tree. You can verify this using the following HBL code:<br />
<br />
GetInformation(modelMap, constrainedTree);<br />
fprintf(stdout, modelMap);<br />
<br />
Add these two lines to the bottom of your <tt>bats.bf</tt> file and run it. You should see this near the bottom of your output:<br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node19":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky1",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky1",<br />
"Tonatia_silvicola":"hky1",<br />
"Tupaia":"hky1"<br />
}<br />
<br />
First, create a second model named <tt>hky2</tt> and apply it to four specific edges in the tree:<br />
<br />
/*****************************************************************************************<br />
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. <br />
*/<br />
highATfreqs = {{.45}{.05}{.05}{.45}};<br />
Model hky2 = (HKY85RateMatrix, highATfreqs);<br />
SetParameter(constrainedTree.microbats, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);<br />
SetParameter(constrainedTree.Pteropus, MODEL, hky2);<br />
<br />
If you move your <tt>GetInformation</tt> call and its accompanying <tt>fprintf</tt> after these 6 lines, then you should see this in the output: <br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky2",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky2",<br />
"Tonatia_silvicola":"hky2",<br />
"Tupaia":"hky1",<br />
"microbats":"hky2"<br />
}<br />
<br />
=== Don't get too comfortable just yet ===<br />
<br />
It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:<br />
<br />
fprintf(stdout, "\n\nCurrent edge lengths:\n");<br />
edgeNames = BranchName(constrainedTree,-1); <br />
edgeLengths = BranchLength(constrainedTree,-1);<br />
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {<br />
fprintf(stdout, Format(edgeLengths[k],10,5), " ", edgeNames[k], "\n");<br />
}<br />
<br />
The first line simply skips a couple of lines (<tt>\n\n</tt>) and prints a header announcing that current edge lengths will follow.<br />
<br />
The second and third lines ask HyPhy for the names and lengths of all branches. The <tt>-1</tt> in these functions is somewhat obscure, but means "give me all of them". (If you used <tt>5</tt> rather than <tt>-1</tt>, you would get just the name or length of the edge having index 5.)<br />
<br />
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints out the length of edge <tt>k</tt> (the <tt>Format</tt> command causes it to use exactly 10 characters with 5 decimal places) followed by a couple of spaces and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement starts a new line so that the edge lengths and names do not all end up on the same line of output.<br />
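If you are more familiar with other languages, <tt>Format(x, 10, 5)</tt> corresponds to fixed-width, fixed-precision numeric formatting; in Python terms (an analogy only, not HBL):<br />

```python
# Analogy only: HBL's Format(edgeLengths[k], 10, 5) pads the number to a
# total width of 10 characters with 5 digits after the decimal point.
value = 0.097851
formatted = f"{value:10.5f}"   # width 10, 5 decimal places
```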
<br />
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?<br />
<br />
=== Fixing up edge lengths ===<br />
<br />
The problem is that applying a new model to those 4 edges caused their edge length information to be erased! We need to now set the <tt>betat</tt> parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The <tt>AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1</tt> that we placed at the top of the file saves us from having to go through this procedure for all the <tt>hky1</tt> edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.<br />
<br />
Recall that, under the JC69 model, the edge length equals <tt>v = 3*beta*t</tt>. The <tt>betat</tt> parameter that appears in our <tt>HKY85RateMatrix</tt> is the <tt>beta*t</tt> part, so to set the <tt>betat</tt> parameter we need to divide the desired edge length by 3: that is, <tt>betat = v/3</tt>. The scaling factor 3 becomes more complex for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to <tt>betat</tt> values.<br />
<br />
betat = 1.0;<br />
scalingFactor = 0.0;<br />
for (n1 = 0; n1 < 4; n1 = n1+1) {<br />
for (n2 = 0; n2 < 4; n2 = n2+1) {<br />
if (n2!=n1) {<br />
scalingFactor = scalingFactor + highATfreqs[n1]*highATfreqs[n2]*HKY85RateMatrix[n1][n2];<br />
}<br />
}<br />
}<br />
<br />
The two nested loops visit every off-diagonal element of the <tt>HKY85RateMatrix</tt> and multiply that element by the base frequency of its row and the base frequency of its column. This product of 3 terms is then added to the growing sum <tt>scalingFactor</tt>. Review your lecture notes if you don't remember why <tt>scalingFactor</tt> is computed in this way. Note that it is important to set <tt>betat = 1</tt> before the loop in order to ensure that the scaling factor works correctly.<br />
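Here is a plain-Python mirror of the loop that you can use to check the number: with kappa=4.224618 and the AT-rich frequencies, the transversion pairs contribute 0.5 and the transition pairs 0.09*kappa:<br />

```python
# Plain-Python mirror of the HBL scaling-factor loop, with betat = 1.
kappa = 4.224618
pi = [0.45, 0.05, 0.05, 0.45]            # A, C, G, T (highATfreqs)
m = [[0, 1, kappa, 1],
     [1, 0, 1, kappa],
     [kappa, 1, 0, 1],
     [1, kappa, 1, 0]]
scaling_factor = sum(pi[i] * pi[j] * m[i][j]
                     for i in range(4) for j in range(4) if i != j)
# Transversion pairs contribute 0.5 in total and transition pairs 0.09 * kappa,
# so scaling_factor = 0.5 + 0.09 * kappa here. Converting a desired edge
# length v to betat is then v / scaling_factor, e.g. for the value used on
# the microbats stem edge in this lab:
betat_microbats = 0.097851 / scaling_factor
```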
<br />
Now we simply need to set the <tt>betat</tt> values for the four edges of interest:<br />
constrainedTree.microbats.betat := 0.097851/scalingFactor;<br />
constrainedTree.Tonatia_bidens.betat := 0.008252/scalingFactor;<br />
constrainedTree.Tonatia_silvicola.betat := 0.000000/scalingFactor;<br />
constrainedTree.Pteropus.betat := 0.104663/scalingFactor;<br />
<br />
Place the loop that shows edge lengths after these 4 lines in order to check and make sure the edge lengths have been set correctly.<br />
<br />
=== Constructing the likelihood function ===<br />
<br />
/*****************************************************************************************<br />
| Build the likelihood function and print it out, which causes HyPhy to actually compute<br />
| the log-likelihood for the tree.<br />
*/<br />
LikelihoodFunction likelihood = (filteredData, constrainedTree);<br />
fprintf(stdout, likelihood);<br />
<br />
If you add the 2 lines above to your growing <tt>bats.bf</tt> file and run it, you should see that the log-likelihood is (to 6 decimal places) equal to <tt>-6472.478318</tt>.<br />
<br />
The <tt>ASSUME_REVERSIBLE_MODELS = -1</tt> that we placed at the beginning of the batch file is needed to prevent HyPhy from assuming it can reroot the tree at any node it wants (which leads to trouble because the nucleotide composition changes across the tree and thus the rooting matters). The <tt>ACCEPT_ROOTED_TREES = TRUE</tt> at the top of the file prevents HyPhy from automatically converting our tree description (which specifies a rooted tree) into an unrooted tree.<br />
<br />
=== Ready to simulate! ===<br />
<br />
We are now ready to perform the simulations. As soon as you create a <tt>LikelihoodFunction</tt> object in HyPhy that is capable of computing the log-likelihood, that object can be used to simulate data because it has a tree with branches that have an assigned model of evolution and all parameters of the substitution models have been either specified (as we did here) or estimated.<br />
<br />
Here is the simulation loop:<br />
/*****************************************************************************************<br />
| Perform the simulations. <br />
*/<br />
for (simCounter = 0; simCounter < 10; simCounter = simCounter+1) {<br />
// Simulate a data set of the same size as the original set<br />
DataSet simulatedData = SimulateDataSet(likelihood);<br />
<br />
 // The filter is necessary, but trivial in this case because all sites are used<br />
DataSetFilter filteredSimData = CreateFilter(simulatedData,1);<br />
<br />
// Save simulated data to a file<br />
outFile = "simdata/sim"+simCounter+".nex";<br />
fprintf(outFile, filteredSimData);<br />
}<br />
<br />
I have used both styles of comments here: the main comment before the loop uses the <tt>/* ... */</tt> style, and the double-slash style is used for comments within the loop. The <tt>LikelihoodFunction</tt> object <tt>likelihood</tt> is passed to the <tt>SimulateDataSet</tt> function to generate the data. It is always necessary to create a <tt>DataSetFilter</tt> in HyPhy, even if no filtering occurs. If you printed <tt>filteredSimData</tt> to <tt>stdout</tt> using <tt>fprintf</tt>, the entire data set would be spewed to the screen, which is not very helpful. Here we instead give <tt>fprintf</tt> a file name rather than <tt>stdout</tt>, which causes the simulated data to be saved to that file. The file name is constructed using <tt>simCounter</tt> so that every simulated file gets a unique name. Note that all simulated data will be saved in the <tt>simdata</tt> directory that you created at the beginning of this tutorial.<br />
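As a small illustration (plain Python, not HBL), the naming scheme produces exactly these ten files:<br />

```python
# Mirror of the HBL file-naming scheme: the loop writes ten files,
# simdata/sim0.nex through simdata/sim9.nex.
filenames = ["simdata/sim%d.nex" % i for i in range(10)]
```

Each of these files can then be analyzed with PAUP* for the parsimony part of the exercise.<br />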
<br />
[[Category: Phylogenetics]]</div>Paul Lewis
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(stdout, observedFreqs);<br />
<br />
=== Run bats.bf ===<br />
<br />
Run your <tt>bats.bf</tt> file as follows:<br />
hyphy bats.bf<br />
<br />
You should see the empirical nucleotide frequencies displayed as follows:<br />
{<br />
{0.1936054742803209}<br />
{0.3104845052697813}<br />
{0.3106418121755545}<br />
{0.1852682082743432}<br />
}<br />
<br />
I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.<br />
<br />
The <tt>fprintf</tt> command also mimics the C programming language. It allows you to print objects to <tt>stdout</tt> (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable <tt>observedFreqs</tt>.<br />
<br />
=== Define the HKY85 rate matrix ===<br />
<br />
Add the following to the bottom of your <tt>bats.bf</tt> file:<br />
/*****************************************************************************************<br />
| Define the KHY substitution matrix. '*' is used for the diagonal elements that can be<br />
|.computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is <br />
| declared to be global, meaning it is shared by all edges.<br />
*/<br />
global kappa = 4.224618;<br />
HKY85RateMatrix = <br />
{{ * , betat , betat*kappa , betat }<br />
{ betat , * , betat , betat*kappa }<br />
{ betat*kappa , betat , * , betat }<br />
{ betat , betat*kappa , betat , * }};<br />
fprintf(stdout, HKY85RateMatrix);<br />
<br />
Run <tt>bats.bf</tt> in HyPhy. Did the <tt>HKY85RateMatrix</tt> variable have the value you expected? Why or why not? (Hint: the variable <tt>betat</tt> was initialized to zero.) Set <tt>betat = 1;</tt> before the <tt>fprintf</tt> statement and run the batch file again to see the effect. Does the output make sense now?<br />
<br />
=== Combine frequencies with rate matrix to create an HKY85 model ===<br />
<br />
/*****************************************************************************************<br />
| Define the HKY85 model, by combining the substitution matrix with the vector of <br />
| empirical frequencies. <br />
*/<br />
Model hky1 = (HKY85RateMatrix, observedFreqs);<br />
<br />
Now we have a model variable (<tt>hky1</tt>) that can be applied to each edge of a tree. <br />
<br />
=== Create a tree representing the null hypothesis ===<br />
<br />
The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.<br />
<br />
/*****************************************************************************************<br />
| Define the tree variable, using the tree description read from the data file.<br />
| By default, the last defined model (hky1) is assigned to all edges of the tree. <br />
*/<br />
Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";;<br />
<br />
=== Creating the second model ===<br />
<br />
Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the <tt>hky1</tt> model has been applied to every node in the tree. You can verify this using the following HBL code:<br />
<br />
GetInformation(modelMap, constrainedTree);<br />
fprintf(stdout, modelMap);<br />
<br />
Add these two lines to the bottom of your <tt>bats.bf</tt> file and run it. You should see this near the bottom of your output:<br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node19":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky1",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky1",<br />
"Tonatia_silvicola":"hky1",<br />
"Tupaia":"hky1"<br />
}<br />
<br />
First, create a second model named <tt>hky2</tt> and apply it to four specific edges in the tree:<br />
<br />
/*****************************************************************************************<br />
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. <br />
*/<br />
highATfreqs = {{.45}{.05}{.05}{.45}};<br />
Model hky2 = (HKY85RateMatrix, highATfreqs);<br />
SetParameter(constrainedTree.microbats, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);<br />
SetParameter(constrainedTree.Pteropus, MODEL, hky2);<br />
<br />
If you move your <tt>GetInformation</tt> call and its accompanying <tt>fprintf</tt> after these 6 lines, then you should see this in the output: <br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky2",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky2",<br />
"Tonatia_silvicola":"hky2",<br />
"Tupaia":"hky1",<br />
"microbats":"hky2"<br />
}<br />
<br />
=== Don't get too comfortable just yet ===<br />
<br />
It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:<br />
<br />
fprintf(stdout, "\n\nCurrent edge lengths:\n");<br />
edgeNames = BranchName(constrainedTree,-1); <br />
edgeLengths = BranchLength(constrainedTree,-1);<br />
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {<br />
fprintf(stdout, Format(edgeLengths[k],10,5), " ", edgeNames[k], "\n");<br />
}<br />
<br />
The first line simply skips a couple of lines (<tt>\n\n</tt>) and prints a header announcing that current edge lengths will follow.<br />
<br />
The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The <tt>-1</tt> in these functions is somewhat obscure, but means "give me all of them". (If you used <tt>5</tt> rather than <tt>-1</tt>, it would give you the name of the edge having index 5.)<br />
<br />
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints out the length of edge <tt>k</tt> (the <tt>Format</tt> command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement causes a carriage return so that the edge lengths and names do not all end up on the same line of output.<br />
<br />
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?<br />
<br />
=== Fixing up edge lengths ===<br />
<br />
The problem is that applying a new model to the 4 edge lengths caused the edge length information to be erased! We need to now set the <tt>betat</tt> parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The <tt>AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1</tt> that we placed at the top of the file saves us from having to go through this procedure for all the <tt>hky1</tt> edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.<br />
<br />
Recall that, under the JC69 model, the edge length equals <tt>v = 3*beta*t</tt>. The <tt>betat</tt> parameter that appears in our <tt>HKY85RateMatrix</tt> is the <tt>beta*t</tt> part, so to set the <tt>betat</tt> parameter we need to divide the desired edge length by 3: that is, <tt>betat = v/3</tt>. The scaling factor 3 becomes more complex for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to <tt>betat</tt> values.<br />
<br />
betat = 1.0;<br />
scalingFactor = 0.0;<br />
for (n1 = 0; n1 < 4; n1 = n1+1) {<br />
for (n2 = 0; n2 < 4; n2 = n2+1) {<br />
if (n2!=n1) {<br />
scalingFactor = scalingFactor + highATfreqs[n1]*highATfreqs[n2]*HKY85RateMatrix[n1][n2];<br />
}<br />
}<br />
}<br />
<br />
The two nested loops visit every off-diagonal element of the <tt>HKY85RateMatrix</tt>, multiply that element by the base frequency of its row and the base frequency of its column. This product of 3 terms is then added to the growing sum <tt>scalingFactor</tt>. Review your lecture notes if you don't remember why <tt>scalingFactor</tt> is computed in this way. Note that it is important to set <tt>betat = 1</tt> before the loop in order to ensure that the scaling factor works correctly.<br />
<br />
Now we simply need to set the <tt>betat</tt> values for the four edges of interest (note the <tt>:=</tt> operator, which defines each value as a formula involving <tt>scalingFactor</tt> rather than a one-time numeric assignment):<br />
constrainedTree.microbats.betat := 0.097851/scalingFactor;<br />
constrainedTree.Tonatia_bidens.betat := 0.008252/scalingFactor;<br />
constrainedTree.Tonatia_silvicola.betat := 0.000000/scalingFactor;<br />
constrainedTree.Pteropus.betat := 0.104663/scalingFactor;<br />
<br />
Place the loop that prints edge lengths (from the previous section) after these 4 lines to verify that the edge lengths have been set correctly.<br />
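As a cross-check on this arithmetic, here is the same computation sketched outside HyPhy. This Python snippet is purely illustrative (it is not HBL); the frequencies, kappa, and the four edge lengths are the values used above, and the <tt>rate</tt> helper is a stand-in for the off-diagonal entries of <tt>HKY85RateMatrix</tt> with <tt>betat = 1</tt>.<br />

```python
# Illustrative sketch (not HyPhy code) of the edge-length -> betat conversion.
# Frequencies, kappa, and edge lengths are the values used in this tutorial.
kappa = 4.224618
freqs = [0.45, 0.05, 0.05, 0.45]   # piA, piC, piG, piT (AT-rich hky2 model)

def rate(i, j):
    # Relative rates as in HKY85RateMatrix with betat = 1:
    # transitions (A<->G, C<->T) get kappa, transversions get 1.
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}
    return kappa if (i, j) in transitions else 1.0

# Same double loop as the HBL code: sum pi_i * pi_j * rate over off-diagonal cells.
scaling_factor = sum(freqs[i] * freqs[j] * rate(i, j)
                     for i in range(4) for j in range(4) if i != j)

# Convert each desired edge length v to a betat value, then multiply back
# by the scaling factor to confirm the original length is recovered.
edge_lengths = {"microbats": 0.097851, "Tonatia_bidens": 0.008252,
                "Tonatia_silvicola": 0.0, "Pteropus": 0.104663}
betat = {name: v / scaling_factor for name, v in edge_lengths.items()}
recovered = {name: b * scaling_factor for name, b in betat.items()}
```

Dividing each desired edge length by the scaling factor and then multiplying back recovers the original length, which is exactly what the edge-length printing loop lets you confirm inside HyPhy.<br />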
<br />
=== Constructing the likelihood function ===<br />
<br />
/*****************************************************************************************<br />
| Build the likelihood function and print it out, which causes HyPhy to actually compute<br />
| the log-likelihood for the tree.<br />
*/<br />
LikelihoodFunction likelihood = (filteredData, constrainedTree);<br />
fprintf(stdout, likelihood);<br />
<br />
If you add the 2 lines above to your growing bats.bf file and run it, you should see that the log-likelihood is (to 6 decimal places) equal to <tt>-6472.478318</tt>.<br />
<br />
The <tt>ASSUME_REVERSIBLE_MODELS = -1</tt> that we placed at the beginning of the batch file is needed to prevent HyPhy from assuming it can reroot the tree at any node it wants (which leads to trouble because the nucleotide composition changes across the tree and thus the rooting matters). The <tt>ACCEPT_ROOTED_TREES = TRUE</tt> at the top of the file prevents HyPhy from automatically converting our tree description (which specifies a rooted tree) into an unrooted tree.<br />
<br />
=== Ready to simulate! ===<br />
<br />
We are now ready to perform the simulations. As soon as you create a <tt>LikelihoodFunction</tt> object in HyPhy capable of computing the log-likelihood, that object can be used to simulate data: it combines a tree whose branches each have an assigned model of evolution with substitution model parameters that have all been either specified (as we did here) or estimated.<br />
<br />
Here is the simulation loop:<br />
/*****************************************************************************************<br />
| Perform the simulations. <br />
*/<br />
for (simCounter = 0; simCounter < 10; simCounter = simCounter+1) {<br />
    // Simulate a data set of the same size as the original set<br />
    DataSet simulatedData = SimulateDataSet(likelihood);<br />
<br />
    // The filter is necessary, but trivial in this case because all sites are used<br />
    DataSetFilter filteredSimData = CreateFilter(simulatedData,1);<br />
<br />
    // Save simulated data to a file<br />
    outFile = "simdata/sim"+simCounter+".nex";<br />
    fprintf(outFile, filteredSimData);<br />
}<br />
<br />
I have used both styles of comments here: the main comment before the loop uses the <tt>/* ... */</tt> style, and the double-slash style is used for comments within the loop. The <tt>LikelihoodFunction</tt> object <tt>likelihood</tt> is passed to the <tt>SimulateDataSet</tt> function to generate the data. It is always necessary to create a <tt>DataSetFilter</tt> in HyPhy, even if no filtering occurs. If we printed <tt>filteredSimData</tt> to <tt>stdout</tt> using <tt>fprintf</tt>, the entire data set would be spewed to the screen, which is not very helpful. Here we instead give <tt>fprintf</tt> a file name rather than <tt>stdout</tt>, which causes the simulated data to be saved to that file. The file name is constructed using <tt>simCounter</tt> so that each simulated file has a unique name. Note that all simulated data sets will be saved in the <tt>simdata</tt> directory that you created early on in this tutorial.<br />
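Eventually each simulated data set must be analyzed with PAUP* to complete the parametric bootstrap. As a sketch of how that step might be automated, the hypothetical Python helper below writes a single PAUP* command file that runs a parsimony search on all 10 simulated files. The file name <tt>analyze_sims.nex</tt> is my own invention, and although <tt>execute</tt>, <tt>set criterion=parsimony</tt>, and <tt>hsearch</tt> are standard PAUP* commands, verify them against your PAUP* version before relying on this.<br />

```python
# Hypothetical helper: write a PAUP* command file that runs a parsimony
# search on each simulated data set produced by bats.bf. The PAUP* commands
# used here (execute, set, hsearch) should be checked against your version.
n_sims = 10  # matches the simCounter loop in bats.bf

lines = ["#NEXUS", "begin paup;", "  set criterion=parsimony;"]
for i in range(n_sims):
    lines.append("  execute simdata/sim%d.nex;" % i)   # load simulated data
    lines.append("  hsearch;")                          # heuristic parsimony search
lines.append("end;")

with open("analyze_sims.nex", "w") as f:
    f.write("\n".join(lines) + "\n")
```

You could then run this command file in PAUP* (e.g. <tt>paup analyze_sims.nex</tt>, depending on how PAUP* is invoked on your system).<br />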
<br />
[[Category: Phylogenetics]]</div>
<br />
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints out the length of edge <tt>k</tt> (the <tt>Format</tt> command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement causes a carriage return so that the edge lengths and names do not all end up on the same line of output.<br />
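The <tt>Format(x,10,5)</tt> call is ordinary fixed-width formatting. If it helps to see the expected layout, here is the same idea sketched in Python (purely illustrative; the two edge names and lengths are values taken from the tree above):<br />

```python
# Mimic HyPhy's Format(x, 10, 5): right-justify x in 10 characters
# with exactly 5 digits after the decimal point.
edges = {"microbats": 0.096022, "Pteropus": 0.102675}
for name, length in edges.items():
    # print() inserts a space around the empty string, giving the
    # "couple of spaces" between length and name described above.
    print("{:10.5f}".format(length), "", name)
```

Because every length occupies exactly 10 columns, the edge names line up in a neat column in the output.<br />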
<br />
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?<br />
<br />
=== Fixing up edge lengths ===<br />
<br />
The problem is that applying a new model to those 4 edges caused the edge length information to be erased! We now need to set the <tt>betat</tt> parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The <tt>AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1</tt> that we placed at the top of the file saves us from having to go through this procedure for all the <tt>hky1</tt> edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.<br />
<br />
Recall that, under the JC69 model, the edge length equals <tt>v = 3*beta*t</tt>. The <tt>betat</tt> parameter that appears in our <tt>HKY85RateMatrix</tt> is the <tt>beta*t</tt> part, so to set the <tt>betat</tt> parameter we need to divide the desired edge length by 3: that is, <tt>betat = v/3</tt>. The scaling factor (3 for JC69) becomes more complex for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to <tt>betat</tt> values. It is not necessary to create a function for this, but I will do it that way in order to illustrate how functions are defined in HBL:<br />
<br />
function computeScalingFactor(rateMatrix, baseFreqs) {<br />
sf = 0;<br />
for (n1 = 0; n1 < 4; n1 = n1+1) {<br />
for (n2 = 0; n2 < 4; n2 = n2+1) {<br />
if (n2!=n1) {<br />
sf = sf + baseFreqs[n1]*baseFreqs[n2]*rateMatrix[n1][n2];<br />
}<br />
}<br />
}<br />
return sf;<br />
}<br />
scalingFactor = computeScalingFactor(HKY85RateMatrix, highATfreqs);<br />
<br />
The last line calls the function, passing the ''arguments'' <tt>HKY85RateMatrix</tt> and <tt>highATfreqs</tt> into the function. The function's ''parameters'' (<tt>rateMatrix</tt> and <tt>baseFreqs</tt>) are arbitrary names that are used within the function body. Thus, within the function, <tt>rateMatrix</tt> stands for the actual rate matrix <tt>HKY85RateMatrix</tt> that was passed in. (Defining a function would be more useful if we had to compute scaling factors for many different models.) The function body uses two nested loops to visit every off-diagonal element of the rate matrix, multiply it by the base frequency of its row and the base frequency of its column, and add this product of three terms to the growing sum <tt>sf</tt>. The function returns the value <tt>sf</tt>, which is used to set the value of the variable <tt>scalingFactor</tt>. Review your lecture notes if you don't remember why the scaling factor is computed in this way.<br />
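To see the numbers this function should produce, here is the same computation sketched in Python (illustrative only; it assumes <tt>betat = 1</tt>, as set earlier in the lab, so the evaluated matrix entries are just 1 or kappa):<br />

```python
kappa = 4.224618
high_at_freqs = [0.45, 0.05, 0.05, 0.45]   # A, C, G, T

def compute_scaling_factor(freqs, kappa):
    # Python translation of the HBL computeScalingFactor: sum of
    # freqs[i] * freqs[j] * rate[i][j] over all off-diagonal cells.
    # Transitions (A<->G and C<->T) carry the kappa multiplier.
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}
    sf = 0.0
    for i in range(4):
        for j in range(4):
            if i != j:
                rate = kappa if (i, j) in transitions else 1.0
                sf += freqs[i] * freqs[j] * rate
    return sf

sf = compute_scaling_factor(high_at_freqs, kappa)
print(sf)               # about 0.88
print(0.097851 / sf)    # betat implied by the microbats edge length
```

With the AT-rich frequencies the scaling factor comes out to about 0.88, so, for example, the microbats edge length 0.097851 corresponds to a <tt>betat</tt> of about 0.111.<br />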
<br />
Now we simply need to set the <tt>betat</tt> values for the four edges of interest:<br />
constrainedTree.microbats.betat := 0.097851/scalingFactor;<br />
constrainedTree.Tonatia_bidens.betat := 0.008252/scalingFactor;<br />
constrainedTree.Tonatia_silvicola.betat := 0.0001/scalingFactor;<br />
constrainedTree.Pteropus.betat := 0.104663/scalingFactor;<br />
<br />
Place the loop that shows edge lengths after these 4 lines in order to check and make sure the edge lengths have been set correctly.<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41278Phylogenetics: HyPhy Lab2020-02-24T19:11:38Z<p>Paul Lewis: /* Define the HKY85 rate matrix */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [https://doi.org/10.1006/mpev.1998.0531 DOI:10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees because both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony mistakes such convergence for historical relatedness.<br />
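One way to make the "fewer effective states" intuition concrete: the chance that two lineages independently end up with the same base at a site is roughly the sum of squared base frequencies, which rises sharply under AT bias. A back-of-the-envelope Python check (not part of the analysis; the AT-rich frequencies are the illustrative values used later in this lab):<br />

```python
def match_prob(freqs):
    # Probability that two bases drawn independently from these
    # frequencies are identical -- a crude proxy for how easily
    # unrelated lineages can converge on the same state.
    return sum(f * f for f in freqs)

uniform = [0.25, 0.25, 0.25, 0.25]     # no compositional bias
at_rich = [0.45, 0.05, 0.05, 0.45]     # strong AT bias

print(match_prob(uniform))   # 0.25
print(match_prob(at_rich))   # about 0.41
```

Chance identity at a site jumps from 25% to about 41% under this level of AT bias, which is the kind of convergence Pettigrew argued parsimony was misreading as relatedness.<br />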
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening? Our goal will be to recreate the parsimony part of Figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and/or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
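The filter string "106-183,190-1089" uses 0-based, inclusive site ranges, so you can verify the 978-site count mentioned in the comment with a little arithmetic:<br />

```python
# Site ranges from the CreateFilter call (0-based, inclusive)
ranges = [(106, 183), (190, 1089)]
n_sites = sum(last - first + 1 for first, last in ranges)
print(n_sites)  # 78 + 900 = 978
```
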
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(stdout, observedFreqs);<br />
<br />
=== Run bats.bf ===<br />
<br />
Run your <tt>bats.bf</tt> file as follows:<br />
hyphy bats.bf<br />
<br />
You should see the empirical nucleotide frequencies displayed as follows:<br />
{<br />
{0.1936054742803209}<br />
{0.3104845052697813}<br />
{0.3106418121755545}<br />
{0.1852682082743432}<br />
}<br />
<br />
I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.<br />
<br />
The <tt>fprintf</tt> command also mimics the C programming language. It allows you to print objects to <tt>stdout</tt> (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable <tt>observedFreqs</tt>.<br />
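Under the hood, <tt>HarvestFrequencies</tt> with unit=atom=1 is essentially tallying base counts across the filtered sites and normalizing. A simplified Python sketch of that idea (the toy sequences are invented for illustration, and real data also involve gaps and ambiguity codes that HyPhy handles for you):<br />

```python
def harvest_frequencies(sequences):
    # Tally A, C, G, T over all sequences and normalize to get one
    # global set of empirical frequencies (the unit=1, atom=1 case).
    counts = {base: 0 for base in "ACGT"}
    for seq in sequences:
        for base in seq:
            if base in counts:      # ignore gaps/ambiguities here
                counts[base] += 1
    total = sum(counts.values())
    return [counts[base] / total for base in "ACGT"]

freqs = harvest_frequencies(["ACGTAC", "ATGTAT"])
print(freqs)   # A and T are 1/3 each, C and G are 1/6 each
```
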
<br />
=== Define the HKY85 rate matrix ===<br />
<br />
Add the following to the bottom of your <tt>bats.bf</tt> file:<br />
/*****************************************************************************************<br />
| Define the HKY85 substitution matrix. '*' is used for the diagonal elements that can be<br />
| computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is <br />
| declared to be global, meaning it is shared by all edges.<br />
*/<br />
global kappa = 4.224618;<br />
HKY85RateMatrix = <br />
{{ * , betat , betat*kappa , betat }<br />
{ betat , * , betat , betat*kappa }<br />
{ betat*kappa , betat , * , betat }<br />
{ betat , betat*kappa , betat , * }};<br />
fprintf(stdout, HKY85RateMatrix);<br />
<br />
Run <tt>bats.bf</tt> in HyPhy. Did the <tt>HKY85RateMatrix</tt> variable have the value you expected? Why or why not? (Hint: the variable <tt>betat</tt> was initialized to zero.) Set <tt>betat = 1;</tt> before the <tt>fprintf</tt> statement and run the batch file again to see the effect. Does the output make sense now?<br />
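If it is unclear why the printed matrix looked degenerate at first, here is the same matrix built numerically in Python (illustrative only; order A, C, G, T, with the kappa value from the batch file). With <tt>betat = 0</tt> every off-diagonal entry vanishes; with <tt>betat = 1</tt> the transition/transversion structure appears:<br />

```python
kappa = 4.224618
betat = 1.0   # try 0.0 to reproduce the all-zero off-diagonals

# Order A, C, G, T; the transitions A<->G and C<->T get the kappa factor.
# The 0.0 diagonal is a placeholder: HyPhy's '*' fills it automatically.
transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}
hky85 = [[betat * (kappa if (i, j) in transitions else 1.0) if i != j else 0.0
          for j in range(4)] for i in range(4)]

for row in hky85:
    print(row)
```
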
<br />
=== Combine frequencies with rate matrix to create an HKY85 model ===<br />
<br />
/*****************************************************************************************<br />
| Define the HKY85 model, by combining the substitution matrix with the vector of <br />
| empirical frequencies. <br />
*/<br />
Model hky1 = (HKY85RateMatrix, observedFreqs);<br />
<br />
Now we have a model variable (<tt>hky1</tt>) that can be applied to each edge of a tree. <br />
<br />
=== Create a tree representing the null hypothesis ===<br />
<br />
The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.<br />
<br />
/*****************************************************************************************<br />
| Define the tree variable, using the tree description read from the data file.<br />
| By default, the last defined model (hky1) is assigned to all edges of the tree. <br />
*/<br />
Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";<br />
<br />
=== Creating the second model ===<br />
<br />
Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the <tt>hky1</tt> model has been applied to every node in the tree. You can verify this using the following HBL code:<br />
<br />
GetInformation(modelMap, constrainedTree);<br />
fprintf(stdout, modelMap);<br />
<br />
Add these two lines to the bottom of your <tt>bats.bf</tt> file and run it. You should see this near the bottom of your output:<br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node19":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky1",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky1",<br />
"Tonatia_silvicola":"hky1",<br />
"Tupaia":"hky1"<br />
}<br />
<br />
First, create a second model named <tt>hky2</tt> and apply it to four specific edges in the tree:<br />
<br />
/*****************************************************************************************<br />
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. <br />
*/<br />
highATfreqs = {{.45}{.05}{.05}{.45}};<br />
Model hky2 = (HKY85RateMatrix, highATfreqs);<br />
SetParameter(constrainedTree.microbats, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);<br />
SetParameter(constrainedTree.Pteropus, MODEL, hky2);<br />
<br />
If you move your <tt>GetInformation</tt> call and its accompanying <tt>fprintf</tt> after these 6 lines, then you should see this in the output: <br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky2",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky2",<br />
"Tonatia_silvicola":"hky2",<br />
"Tupaia":"hky1",<br />
"microbats":"hky2"<br />
}<br />
<br />
=== Don't get too comfortable just yet ===<br />
<br />
It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:<br />
<br />
fprintf(stdout, "\n\nCurrent edge lengths:\n");<br />
edgeNames = BranchName(constrainedTree,-1); <br />
edgeLengths = BranchLength(constrainedTree,-1);<br />
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {<br />
fprintf(stdout, Format(edgeLengths[k],10,5), " ", edgeNames[k], "\n");<br />
}<br />
<br />
The first line simply skips a couple of lines (<tt>\n\n</tt>) and prints a header announcing that current edge lengths will follow.<br />
<br />
The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The <tt>-1</tt> in these functions is somewhat obscure, but means "give me all of them". (If you used <tt>5</tt> rather than <tt>-1</tt>, it would give you the name of the edge having index 5.)<br />
<br />
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints out the length of edge <tt>k</tt> (the <tt>Format</tt> command causes it to use exactly 10 spaces and 5 decimal places) followed by a couple of spaces and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement causes a carriage return so that the edge lengths and names do not all end up on the same line of output.<br />
<br />
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?<br />
<br />
=== Fixing up edge lengths ===<br />
<br />
The problem is that applying a new model to those 4 edges caused the edge length information to be erased! We now need to set the <tt>betat</tt> parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The <tt>AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1</tt> that we placed at the top of the file saves us from having to go through this procedure for all the <tt>hky1</tt> edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.<br />
<br />
Recall that, under the JC69 model, the edge length equals <tt>v = 3*beta*t</tt>. The <tt>betat</tt> parameter that appears in our <tt>HKY85RateMatrix</tt> is the <tt>beta*t</tt> part, so to set the <tt>betat</tt> parameter we need to divide the desired edge length by 3: <tt>betat = v/3</tt>. The scaling factor 3 becomes more complex for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to <tt>betat</tt> values.<br />
<br />
function computeScalingFactor(rateMatrix, baseFreqs) {<br />
sf = 0;<br />
for (n1 = 0; n1 < 4; n1 = n1+1) {<br />
for (n2 = 0; n2 < 4; n2 = n2+1) {<br />
if (n2!=n1) {<br />
sf = sf + baseFreqs[n1]*baseFreqs[n2]*rateMatrix[n1][n2];<br />
}<br />
}<br />
}<br />
return sf;<br />
}<br />
scalingFactor = computeScalingFactor(HKY85RateMatrix, highATfreqs);<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41277Phylogenetics: HyPhy Lab2020-02-24T19:11:06Z<p>Paul Lewis: /* Define the HKY85 rate matrix */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [https://doi.org/10.1006/mpev.1998.0531 DOI:10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees because both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony mistakes such convergence for historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening? Our goal will be to recreate the parsimony part of Figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and/or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(stdout, observedFreqs);<br />
<br />
=== Run bats.bf ===<br />
<br />
Run your <tt>bats.bf</tt> file as follows:<br />
hyphy bats.bf<br />
<br />
You should see the empirical nucleotide frequencies displayed as follows:<br />
{<br />
{0.1936054742803209}<br />
{0.3104845052697813}<br />
{0.3106418121755545}<br />
{0.1852682082743432}<br />
}<br />
<br />
I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.<br />
<br />
The <tt>fprintf</tt> command also mimics the C programming language. It allows you to print objects to <tt>stdout</tt> (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable <tt>observedFreqs</tt>.<br />
<br />
=== Define the HKY85 rate matrix ===<br />
<br />
Add the following to the bottom of your <tt>bats.bf</tt> file:<br />
/*****************************************************************************************<br />
| Define the HKY85 substitution matrix. '*' is used for the diagonal elements that can be<br />
| computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is <br />
| declared to be global, meaning it is shared by all edges.<br />
*/<br />
global kappa = 4.224618;<br />
HKY85RateMatrix = <br />
{{ * , betat , betat*kappa , betat }<br />
{ betat , * , betat , betat*kappa }<br />
{ betat*kappa , betat , * , betat }<br />
{ betat , betat*kappa , betat , * }};<br />
fprintf(stdout, HKY85RateMatrix);<br />
<br />
Run <tt>bats.bf</tt> in HyPhy. Did the <tt>HKY85RateMatrix</tt> variable have the value you expected? Why or why not? (Hint: the variable <tt>betat</tt> was initialized to zero.) Set <tt>betat = 1;</tt> before the <tt>fprintf</tt> statement and run the batch file again to see the effect. Does the output make sense now?<br />
<br />
=== Combine frequencies with rate matrix to create an HKY85 model ===<br />
<br />
/*****************************************************************************************<br />
| Define the HKY85 model, by combining the substitution matrix with the vector of <br />
| empirical frequencies. <br />
*/<br />
Model hky1 = (HKY85RateMatrix, observedFreqs);<br />
<br />
Now we have a model variable (<tt>hky1</tt>) that can be applied to each edge of a tree. <br />
<br />
=== Create a tree representing the null hypothesis ===<br />
<br />
The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.<br />
<br />
/*****************************************************************************************<br />
| Define the tree variable, using the tree description read from the data file.<br />
| By default, the last defined model (hky1) is assigned to all edges of the tree. <br />
*/<br />
Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";<br />
<br />
=== Creating the second model ===<br />
<br />
Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the <tt>hky1</tt> model has been applied to every edge in the tree. You can verify this using the following HBL code:<br />
<br />
GetInformation(modelMap, constrainedTree);<br />
fprintf(stdout, modelMap);<br />
<br />
Add these two lines to the bottom of your <tt>bats.bf</tt> file and run it. You should see this near the bottom of your output:<br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node19":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky1",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky1",<br />
"Tonatia_silvicola":"hky1",<br />
"Tupaia":"hky1"<br />
}<br />
<br />
First, create a second model named <tt>hky2</tt> and apply it to four specific edges in the tree:<br />
<br />
/*****************************************************************************************<br />
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. <br />
*/<br />
highATfreqs = {{.45}{.05}{.05}{.45}};<br />
Model hky2 = (HKY85RateMatrix, highATfreqs);<br />
SetParameter(constrainedTree.microbats, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);<br />
SetParameter(constrainedTree.Pteropus, MODEL, hky2);<br />
<br />
If you move your <tt>GetInformation</tt> call and its accompanying <tt>fprintf</tt> after these 6 lines, then you should see this in the output: <br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
"Node2":"hky1",<br />
"Node3":"hky1",<br />
"Node4":"hky1",<br />
"Node5":"hky1",<br />
"Node7":"hky1",<br />
"Oryctolagus":"hky1",<br />
"Pteropus":"hky2",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky2",<br />
"Tonatia_silvicola":"hky2",<br />
"Tupaia":"hky1",<br />
"microbats":"hky2"<br />
}<br />
<br />
=== Don't get too comfortable just yet ===<br />
<br />
It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:<br />
<br />
fprintf(stdout, "\n\nCurrent edge lengths:\n");<br />
edgeNames = BranchName(constrainedTree,-1); <br />
edgeLengths = BranchLength(constrainedTree,-1);<br />
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {<br />
fprintf(stdout, Format(edgeLengths[k],10,5), " ", edgeNames[k], "\n");<br />
}<br />
<br />
The first line simply skips a couple of lines (<tt>\n\n</tt>) and prints a header announcing that current edge lengths will follow.<br />
<br />
The second and third lines ask HyPhy for the names of all branches and the lengths of all branches. The <tt>-1</tt> in these functions is somewhat obscure, but means "give me all of them". (If you used <tt>5</tt> rather than <tt>-1</tt>, it would give you the name of the edge having index 5.)<br />
<br />
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints out the length of edge <tt>k</tt> (the <tt>Format</tt> command prints it in a field exactly 10 characters wide with 5 decimal places) followed by a space and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement starts a new line so that the edge lengths and names do not all end up on the same line of output.<br />
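If the <tt>Format</tt> call looks unfamiliar, it behaves like fixed-width numeric formatting in many other languages; for comparison (Python is shown purely as an analogy, not HBL):<br />

```python
# Python analogue of HyPhy's Format(x, 10, 5): right-justify x in a
# 10-character field with 5 digits after the decimal point.
x = 0.077544
s = f"{x:10.5f}"
print(repr(s))  # '   0.07754' -- exactly 10 characters wide
```
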
<br />
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?<br />
<br />
=== Fixing up edge lengths ===<br />
<br />
The problem is that applying a new model to those 4 edges caused their edge length information to be erased! We now need to set the <tt>betat</tt> parameter for each of those edges to be compatible with the edge lengths that were originally there. This process is a bit more tedious than I would like, so you'll have to bear with me through this next section. The <tt>AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1</tt> that we placed at the top of the file saves us from having to go through this procedure for all the <tt>hky1</tt> edges, but now that we've erased the models from 4 edges, we need to do a bit of work to get the edge lengths back.<br />
<br />
Recall that, under the JC69 model, the edge length equals <tt>v = 3*beta*t</tt>. The <tt>betat</tt> parameter that appears in our <tt>HKY85RateMatrix</tt> is the <tt>beta*t</tt> part, so to set the <tt>betat</tt> parameter we need to divide the desired edge length by 3: <tt>betat = v/3</tt>. The scaling factor (3 in the JC69 case) is more complicated to compute for the HKY85 model, but the principle is the same. Our first goal is thus to compute the scaling factor we need to convert edge lengths to <tt>betat</tt> values.<br />
<br />
function computeScalingFactor(rateMatrix, baseFreqs) {<br />
sf = 0;<br />
for (n1 = 0; n1 < 4; n1 = n1+1) {<br />
for (n2 = 0; n2 < 4; n2 = n2+1) {<br />
if (n2!=n1) {<br />
sf = sf + baseFreqs[n1]*baseFreqs[n2]*rateMatrix[n1][n2];<br />
}<br />
}<br />
}<br />
return sf;<br />
}<br />
scalingFactor = computeScalingFactor(HKY85RateMatrix, highATfreqs);<br />
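To check your understanding of what <tt>computeScalingFactor</tt> computes, here is the same double loop written in Python (an illustration only; the values below assume <tt>betat = 1</tt>, so the off-diagonal rate entries are 1 for transversions and kappa for transitions):<br />

```python
# The scaling factor is the expected substitution rate per unit of
# betat: the sum over ordered pairs i != j of pi_i * pi_j * r_ij,
# where r_ij is kappa for transitions and 1 for transversions.
kappa = 4.224618
freqs = [0.45, 0.05, 0.05, 0.45]                 # A, C, G, T (AT-rich)
transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}   # A<->G, C<->T

sf = 0.0
for i in range(4):
    for j in range(4):
        if i != j:
            r = kappa if (i, j) in transitions else 1.0
            sf += freqs[i] * freqs[j] * r

print(sf)  # ~0.8802
```

For the AT-rich frequencies, the transversion pairs contribute 0.50 and the transition pairs 0.09*kappa, so a desired edge length <tt>v</tt> converts to <tt>betat = v/sf</tt>, just as <tt>betat = v/3</tt> did for JC69.<br />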
<br />
[[Category: Phylogenetics]]</div>
"Pteropus":"hky2",<br />
"Tarsius":"hky1",<br />
"Tonatia_bidens":"hky2",<br />
"Tonatia_silvicola":"hky2",<br />
"Tupaia":"hky1",<br />
"microbats":"hky2"<br />
}<br />
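Notice that the first map lists <tt>Node19</tt> while the second lists <tt>microbats</tt> instead — presumably the same internal edge, now reported by its label once we referred to it by name. The before/after maps can be compared mechanically; in Python terms (a sketch using plain dictionaries, not HBL, and glossing over the Node19/microbats renaming by using <tt>microbats</tt> in both):<br />

```python
# Model map before SetParameter: every edge assigned hky1.
before = dict.fromkeys([
    "Bos", "Cynocephalus", "Didelphis", "Felis", "Galago", "Homo", "Mus",
    "Node1", "Node10", "Node11", "Node14", "Node2", "Node3", "Node4", "Node5",
    "Node7", "Oryctolagus", "Pteropus", "Tarsius", "Tonatia_bidens",
    "Tonatia_silvicola", "Tupaia", "microbats"], "hky1")
# After the four SetParameter calls, four edges carry hky2.
after = dict(before, Pteropus="hky2", Tonatia_bidens="hky2",
             Tonatia_silvicola="hky2", microbats="hky2")
changed = sorted(edge for edge in after if after[edge] != before[edge])
print(changed)
```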
<br />
=== Don't get too comfortable just yet ===<br />
<br />
It is great to sit back and admire your work thus far, but there is a small problem looming just below the surface. Let's print out all the edge lengths in the tree:<br />
<br />
fprintf(stdout, "\n\nCurrent edge lengths:\n");<br />
edgeNames = BranchName(constrainedTree,-1); <br />
edgeLengths = BranchLength(constrainedTree,-1);<br />
for (k = 0; k < Columns(edgeNames) - 1; k = k + 1) {<br />
fprintf(stdout, Format(edgeLengths[k],10,5), " ", edgeNames[k], "\n");<br />
}<br />
<br />
The first line simply skips a couple of lines (<tt>\n\n</tt>) and prints a header announcing that current edge lengths will follow.<br />
<br />
The second and third lines ask HyPhy for the names and lengths of all branches. The <tt>-1</tt> in these functions is somewhat obscure, but means "give me all of them". (If you used <tt>5</tt> rather than <tt>-1</tt>, you would get the name, or length, of just the edge having index 5.)<br />
<br />
The last 3 lines are a loop in which the variable <tt>k</tt> ranges from 0 to the number of edges minus 1. For each value of <tt>k</tt>, the <tt>fprintf</tt> statement prints the length of edge <tt>k</tt> (the <tt>Format</tt> command uses a field exactly 10 characters wide with 5 decimal places) followed by a couple of spaces and then the name of edge <tt>k</tt>. The newline character (<tt>"\n"</tt>) at the end of the <tt>fprintf</tt> statement starts a new line so that the edge lengths and names do not all end up on the same line of output.<br />
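<tt>Format(x, 10, 5)</tt> corresponds to a fixed-width, fixed-precision numeric format. The loop could be mimicked in Python like this (the edge names and lengths below are made up for illustration):<br />

```python
edge_names = ["Homo", "Tarsius", "Galago", "Node4"]        # illustrative only
edge_lengths = [0.077544, 0.084863, 0.075292, 0.009462]    # illustrative only
for name, length in zip(edge_names, edge_lengths):
    # Format(edgeLengths[k], 10, 5): field 10 characters wide, 5 decimal places
    print(f"{length:10.5f}  {name}")
```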
<br />
After running <tt>bats.bf</tt>, what possibly important detail do you notice about the lengths of the edges to which we attached the <tt>hky2</tt> model?<br />
<br />
[[Category: Phylogenetics]]</div>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the modules needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
 Phylogenetics and Evolution 13(3): 408-416. [https://doi.org/10.1006/mpev.1998.0531 DOI:10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
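<br />
To see why strong AT bias makes chance convergence more likely, note that the probability that two unrelated lineages independently show the same base at a randomized site is the sum of the squared base frequencies. A quick stdlib-only Python sketch (the 0.4/0.1 frequencies are illustrative values in the spirit of Pettigrew's argument, not estimates from the data):<br />

```python
def chance_match(freqs):
    # Probability two independent draws from the same base distribution agree.
    assert abs(sum(freqs) - 1.0) < 1e-9
    return sum(f * f for f in freqs)

uniform = [0.25, 0.25, 0.25, 0.25]    # no bias: four effective states
at_biased = [0.40, 0.10, 0.10, 0.40]  # strong AT bias (A, C, G, T)

print(chance_match(uniform))    # -> 0.25
print(chance_match(at_biased))  # ~0.34: chance agreement is more likely
```

With only two common states, parsimony sees more shared-but-convergent sites, which is exactly the artifact the simulation is designed to probe.<br />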
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening? Our goal will be to recreate the parsimony part of Figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and/or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
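<br />
As a sanity check on the filter string: HyPhy's site ranges here are 0-based and inclusive at both ends, so "106-183,190-1089" should keep 78 + 900 = 978 sites. A quick Python check of that arithmetic:<br />

```python
# HyPhy DataSetFilter site ranges are 0-based and inclusive at both ends.
filter_spec = "106-183,190-1089"

total = 0
for part in filter_spec.split(","):
    lo, hi = (int(x) for x in part.split("-"))
    total += hi - lo + 1

print(total)  # -> 978, the site count quoted in the batch-file comment
```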
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(stdout, observedFreqs);<br />
<br />
=== Run bats.bf ===<br />
<br />
Run your <tt>bats.bf</tt> file as follows:<br />
hyphy bats.bf<br />
<br />
You should see the empirical nucleotide frequencies displayed as follows:<br />
{<br />
{0.1936054742803209}<br />
{0.3104845052697813}<br />
{0.3106418121755545}<br />
{0.1852682082743432}<br />
}<br />
<br />
I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.<br />
<br />
The <tt>fprintf</tt> command also mimics the C programming language. It allows you to print objects to <tt>stdout</tt> (standard output). This is a useful tool for performing sanity checks, such as checking to ensure that the frequencies were indeed harvested and stored in the variable <tt>observedFreqs</tt>.<br />
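<br />
What <tt>HarvestFrequencies</tt> does with unit=1 and atom=1 is essentially a single-nucleotide tally over the filtered alignment. A stdlib Python sketch of the same idea on a toy alignment (the sequences are invented for illustration; real harvesting also has to handle gaps and ambiguity codes):<br />

```python
from collections import Counter

# Toy stand-in for the filtered IRBP alignment (sequences invented).
alignment = ["AAGT", "ACTT", "ATGT"]

counts = Counter()
for seq in alignment:
    counts.update(seq)

total = sum(counts[b] for b in "ACGT")
freqs = [counts[b] / total for b in "ACGT"]  # one global vector, A C G T order
print(freqs)
```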
<br />
=== Define the HKY85 rate matrix ===<br />
<br />
Add the following to the bottom of your <tt>bats.bf</tt> file:<br />
/*****************************************************************************************<br />
 | Define the HKY85 substitution matrix. '*' is used for the diagonal elements that can be<br />
 | computed automatically by HyPhy. The transition-transversion rate ratio (kappa) is <br />
| declared to be global, meaning it is shared by all edges.<br />
*/<br />
global kappa = 4.224618;<br />
HKY85RateMatrix = <br />
{{ * , beta , beta*kappa , beta }<br />
{ beta , * , beta , beta*kappa }<br />
{ beta*kappa , beta , * , beta }<br />
{ beta , beta*kappa , beta , * }};<br />
fprintf(stdout, HKY85RateMatrix);<br />
<br />
Run <tt>bats.bf</tt> in HyPhy. Did the <tt>HKY85RateMatrix</tt> variable have the value you expected? Why or why not? (Hint: the variable beta was initialized to zero.) Set <tt>beta = 1;</tt> before the <tt>fprintf</tt> statement and run the batch file again to see the effect. Does the output make sense now?<br />
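<br />
You can check the answer to this question outside HyPhy: with beta = 0 every off-diagonal rate is zero, so the whole matrix collapses. A stdlib Python sketch of the same matrix construction (the diagonal is filled with the negative row sum, which is what HyPhy's '*' denotes):<br />

```python
def hky85_matrix(kappa, beta):
    # Off-diagonal HKY85 rates in A, C, G, T order; transitions get kappa.
    m = [[0.0, beta, beta * kappa, beta],
         [beta, 0.0, beta, beta * kappa],
         [beta * kappa, beta, 0.0, beta],
         [beta, beta * kappa, beta, 0.0]]
    # Fill the '*' diagonal so each row sums to zero.
    for i in range(4):
        m[i][i] = -sum(m[i])
    return m

print(hky85_matrix(4.224618, 0.0))  # every rate is zero while beta == 0
print(hky85_matrix(4.224618, 1.0))  # transitions 4.224618x the transversions
```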
<br />
=== Combine frequencies with rate matrix to create an HKY85 model ===<br />
<br />
/*****************************************************************************************<br />
| Define the HKY85 model, by combining the substitution matrix with the vector of <br />
| empirical frequencies. <br />
*/<br />
Model hky1 = (HKY85RateMatrix, observedFreqs);<br />
<br />
Now we have a model variable (<tt>hky1</tt>) that can be applied to each edge of a tree. <br />
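<br />
Under the hood, pairing the rate matrix with a frequency vector yields the actual instantaneous-rate matrix: each off-diagonal rate is weighted by the equilibrium frequency of the target base. A hedged Python sketch of that combination for a reversible model (the pi values are the empirical frequencies printed earlier, rounded):<br />

```python
def combine(rate_matrix, freqs):
    # Q[i][j] = rate[i][j] * pi[j] for i != j; diagonal makes rows sum to zero.
    n = len(freqs)
    q = [[rate_matrix[i][j] * freqs[j] if i != j else 0.0
          for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    return q

kappa, beta = 4.224618, 1.0
rates = [[0.0, beta, beta * kappa, beta],
         [beta, 0.0, beta, beta * kappa],
         [beta * kappa, beta, 0.0, beta],
         [beta, beta * kappa, beta, 0.0]]
pi = [0.1936, 0.3105, 0.3106, 0.1853]  # rounded empirical A, C, G, T frequencies

q = combine(rates, pi)
print(q[0][2])  # A -> G rate: beta * kappa * piG
```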
<br />
=== Create a tree representing the null hypothesis ===<br />
<br />
The next step is to create the model tree that we will use for the simulations. The tree topology is from Figure 2a in the Vandenbussche et al. paper. I have estimated the edge lengths and transition/transversion rate ratio (kappa) using PAUP*.<br />
<br />
/*****************************************************************************************<br />
| Define the tree variable, using the tree description read from the data file.<br />
| By default, the last defined model (hky1) is assigned to all edges of the tree. <br />
*/<br />
 Tree constrainedTree = "((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,Felis:0.044428):0.043248,Didelphis:0.247617)";<br />
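<br />
Because the Newick string is long and easy to mistype, a quick sanity check is worthwhile: count the leaf taxa and confirm that only the microbat ancestor carries a label. A stdlib Python sketch (the regex assumes leaf labels always follow '(' or ',' and precede ':', which holds for this string):<br />

```python
import re

newick = ("((((((Homo:0.077544,(Tarsius:0.084863,Galago:0.075292):0.009462):0.026367,"
          "((Cynocephalus:0.067955,Tupaia:0.093035):0.016468,"
          "(Oryctolagus:0.093866,Mus:0.143079):0.013506):0.017052,"
          "Pteropus:0.102675):0.008768,Bos:0.099273):0.007976,"
          "(Tonatia_bidens:0.008137,Tonatia_silvicola:0)microbats:0.096022):0.013987,"
          "Felis:0.044428):0.043248,Didelphis:0.247617)")

# Leaf labels follow '(' or ','; internal labels like 'microbats' follow ')'.
leaves = re.findall(r"[(,]([A-Za-z_]+):", newick)
print(len(leaves))  # -> 13 leaf taxa
```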
<br />
=== Creating the second model ===<br />
<br />
Hopefully everything we've done so far in HyPhy makes sense. Now comes the tricky part! We need to create a second HKY85 model that has elevated A and T frequencies, and that model needs to be applied to only certain edges in the tree. Right now, the <tt>hky1</tt> model has been applied to every node in the tree. You can verify this using the following HBL code:<br />
<br />
GetInformation(modelMap, constrainedTree);<br />
fprintf(stdout, modelMap);<br />
<br />
Add these two lines to the bottom of your <tt>bats.bf</tt> file and run it. You should see this near the bottom of your output:<br />
{<br />
"Bos":"hky1",<br />
"Cynocephalus":"hky1",<br />
"Didelphis":"hky1",<br />
"Felis":"hky1",<br />
"Galago":"hky1",<br />
"Homo":"hky1",<br />
"Mus":"hky1",<br />
"Node1":"hky1",<br />
"Node10":"hky1",<br />
"Node11":"hky1",<br />
"Node14":"hky1",<br />
 "Node2":"hky1",<br />
 "Node3":"hky1",<br />
 "Node4":"hky1",<br />
 "Node5":"hky1",<br />
 "Node7":"hky1",<br />
 "Oryctolagus":"hky1",<br />
 "Pteropus":"hky1",<br />
 "Tarsius":"hky1",<br />
 "Tonatia_bidens":"hky1",<br />
 "Tonatia_silvicola":"hky1",<br />
 "Tupaia":"hky1",<br />
 "microbats":"hky1"<br />
 }<br />
<br />
First, create a second model named <tt>hky2</tt> and apply it to four specific edges in the tree:<br />
<br />
/*****************************************************************************************<br />
| Define a second AT-rich HKY85 model named hky2 and apply it to selected edges. <br />
*/<br />
highATfreqs = {{.45}{.05}{.05}{.45}};<br />
Model hky2 = (HKY85RateMatrix, highATfreqs);<br />
SetParameter(constrainedTree.microbats, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_bidens, MODEL, hky2);<br />
SetParameter(constrainedTree.Tonatia_silvicola, MODEL, hky2);<br />
SetParameter(constrainedTree.Pteropus, MODEL, hky2);<br />
<br />
If you move your <tt>GetInformation</tt> call and its accompanying <tt>fprintf</tt> after these 6 lines, then you should see this in the output: <br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41269Phylogenetics: HyPhy Lab2020-02-24T17:38:25Z<p>Paul Lewis: /* Creating the HyPhy batch file */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(stdout, observedFreqs);<br />
<br />
### Run bats.bf<br />
<br />
Run your <tt>bats.bf</tt> file as follows:<br />
hyphy bats.bf<br />
<br />
You should see the empirical nucleotide frequencies displayed as follows:<br />
{<br />
{0.1936054742803209}<br />
{0.3104845052697813}<br />
{0.3106418121755545}<br />
{0.1852682082743432}<br />
}<br />
<br />
I have provided copious comments in the batch file to explain what each command is doing. Comments in HyPhy batch files follow the C programming convention: either surround the comment with slash-asterisk delimiters (<tt>/* comment */</tt>) or begin the line with two slashes (<tt>// comment</tt>). The three lines at the beginning will be explained when each becomes relevant.<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41268Phylogenetics: HyPhy Lab2020-02-24T17:32:14Z<p>Paul Lewis: /* Creating the HyPhy batch file */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(stdout, observedFreqs);<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41267Phylogenetics: HyPhy Lab2020-02-24T17:25:53Z<p>Paul Lewis: /* Creating the HyPhy batch file */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
fprintf(observedFreqs);<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41266Phylogenetics: HyPhy Lab2020-02-24T17:18:21Z<p>Paul Lewis: /* Creating the HyPhy batch file */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language known as the HyPhy Batch Language (HBL). HyPhy can be run from the command line to carry out phylogenetic analyses that are scripted in HBL, much like running Python to interpret a Python script. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41255Phylogenetics: HyPhy Lab2020-02-24T15:26:28Z<p>Paul Lewis: /* Creating the HyPhy batch file */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [DOI:10.1006/mpev.1998.0531 https://doi.org/10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening. Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy is its own scripting language. Like Python, a script written in HyPhy's batch language (HBL) can be run from the command line to carry out phylogenetic analyses. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. <br />
<br />
Download the data we will use for this analysis using curl:<br />
<br />
cd ~/bats<br />
curl -LO http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/irbp.nex<br />
<br />
This is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
irbp.nex<br />
<br />
Open <tt>bats.bf</tt> in your favorite text editor (e.g. nano) and enter the following text:<br />
<br />
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1; <br />
ACCEPT_ROOTED_TREES = TRUE; <br />
ASSUME_REVERSIBLE_MODELS = -1;<br />
<br />
/*****************************************************************************************<br />
| Read in the data and store the result in the variable nucleotideSequences.<br />
*/<br />
<br />
DataSet nucleotideSequences = ReadDataFile("irbp.nex");<br />
<br />
/*****************************************************************************************<br />
| Filter the data, specifying which sites are to be used. The first 1 means treat each <br />
| site separately (3 would cause the data to be interpreted as codons). The quoted <br />
| string (last argument) specifies which sites (where first site = 0) to use (we are <br />
| excluding sites with a lot of missing data and or alignment issues). This leaves us with<br />
| 978 sites rather than the 935 used by Vandenbussche, but it is impossible to determine<br />
| exactly which sites were excluded in the original study.<br />
*/<br />
<br />
DataSetFilter filteredData = CreateFilter(nucleotideSequences,1,"106-183,190-1089");<br />
<br />
/*****************************************************************************************<br />
| Store empirical nucleotide frequencies in the variable observedFreqs. The 1,1,1 means<br />
| unit=1, atom=1, position-specific=1. These settings create one global set of <br />
| frequencies (setting, for example, unit=3, atom=3, would tally 64 codon frequencies, <br />
| which is not what we need because we will not be using a codon model).<br />
*/<br />
<br />
HarvestFrequencies(observedFreqs, filteredData, 1, 1, 1);<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41254Phylogenetics: HyPhy Lab2020-02-24T14:50:53Z<p>Paul Lewis: /* Creating the HyPhy batch file */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the modules needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [https://doi.org/10.1006/mpev.1998.0531 DOI:10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
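To see why compositional bias inflates convergence, consider the probability that two unrelated lineages independently show the same nucleotide at a site. A back-of-the-envelope Python sketch (the biased frequencies are the illustrative values used later in this lab):<br />

```python
# Probability that two independent lineages show the same state at a site,
# assuming each draws its state from the equilibrium frequencies.
def p_same(freqs):
    return sum(f * f for f in freqs)

equal = [0.25, 0.25, 0.25, 0.25]   # no compositional bias
at_biased = [0.4, 0.1, 0.1, 0.4]   # piA, piC, piG, piT (strong AT bias)

print(p_same(equal))      # 0.25
print(p_same(at_biased))  # about 0.34: chance matches are noticeably more likely
```
Parsimony can misread such chance matches as shared history, which is exactly the artifact Pettigrew invoked.<br />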
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening? Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language. As with Python, a script written in HyPhy's batch language (HBL) can be run from the command line to carry out phylogenetic analyses. Start by creating a new file named <tt>bats.bf</tt> in a directory called <tt>bats</tt> (you can of course use any name you like, but these are the names I will be using). Also create a directory <tt>simdata</tt> inside your <tt>bats</tt> directory. So this is what your directory structure should look like at this point:<br />
bats/<br />
simdata/<br />
bats.bf<br />
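<br />
This layout can be created from the command line in one step (run from wherever you want the <tt>bats</tt> directory to live):<br />

```shell
mkdir -p bats/simdata   # creates both bats/ and bats/simdata/
touch bats/bats.bf      # empty batch file for now
```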
<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41253Phylogenetics: HyPhy Lab2020-02-24T14:48:04Z<p>Paul Lewis: </p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== Loading modules needed ==<br />
<br />
Load the modules needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== The goal of this lab ==<br />
<br />
In this lab, we will recreate the very interesting parametric bootstrapping analysis performed in this paper:<br />
<br />
RA Vandenbussche, RJ Baker, JP Huelsenbeck, and DM Hillis. 1998. Base compositional <br />
bias and phylogenetic analyses: a test of the "Flying DNA" hypothesis. Molecular <br />
Phylogenetics and Evolution 13(3): 408-416. [https://doi.org/10.1006/mpev.1998.0531 DOI:10.1006/mpev.1998.0531]<br />
<br />
In short, this paper demonstrated that the "Flying DNA" hypothesis proposed earlier by Pettigrew (1994. Flying DNA. Curr. Biol. 4: 277–280) was not viable. The Flying DNA hypothesis proposed that microbats and megabats are actually unrelated, but appear as a monophyletic group in phylogenetic trees due to the fact that both have high AT bias in the genes used to reconstruct phylogeny. The idea is that this strong nucleotide composition bias makes convergence much more probable, as there are effectively only two states (A and T) rather than four (A, C, G, T), and parsimony is mistaking such convergence as historical relatedness.<br />
<br />
The Vandenbussche et al. paper simulated data under the null hypothesis (micro- and mega-bats are unrelated) but added various amounts of AT bias when simulating the bat lineages. If Pettigrew was correct, trees reconstructed from such data should show bats monophyletic, even though they were not together in the true tree used for the simulation.<br />
<br />
This type of simulation required ad hoc software in 1998 because most software that can carry out simulations on phylogenetic trees assumes that the model is the same across the tree. Fortunately, these days we have HyPhy, which offers a way to simulate (and analyze) under pretty much any model you can imagine. <br />
<br />
Vandenbussche et al. used the K80 (kappa=4) model across most of the tree, but the lineage leading to Pteropus (the lone megabat in the analysis) and the lineages within the microbat clade (Tonatia bidens, Tonatia silvicola, and their stem lineage) used HKY85 with kappa=4 but nucleotide frequencies that are AT-biased (e.g. piA=0.4, piC=0.1, piG=0.1, piT=0.4). The question is: how much AT-bias does one need to put into the simulation in order to see the convergence that Pettigrew claimed was happening? Our goal will be to recreate the parsimony part of figure 4 from the Vandenbussche paper. We will use HyPhy to simulate the data, and PAUP* to do the parsimony analyses.<br />
<br />
== Creating the HyPhy batch file ==<br />
<br />
HyPhy has its own scripting language. As with Python, a script written in HyPhy's batch language (HBL) can be run from the command line to carry out phylogenetic analyses. Start by creating a file named <tt>bats.bf</tt>.<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41252Phylogenetics: HyPhy Lab2020-02-24T14:24:46Z<p>Paul Lewis: /* Loading modules needed */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Obtaining the sequences ==<br />
<br />
A Nexus data file containing sequences and a tree is located here: [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.nex wickett.nex]. This dataset was assembled by former UConn EEB graduate student [http://www.chicagobotanic.org/research/staff/wickett Norm Wickett] and contains several sequences of bryophytes, including two from a parasitic bryophyte (the liverwort <em>Aneura mirabilis</em>) that is non-green and does not photosynthesize. Today's lab will recreate the type of analysis Norm carried out in [http://dx.doi.org/10.1007/s00239-008-9133-1 his 2008 paper in Journal of Molecular Evolution (67:111-122)].<br />
<br />
The sequences are of a gene important for photosynthesis. The basic idea behind today's lab is to see if we can detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this).<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
<!--<br />
== HyPhy ==<br />
<br />
I have requested that the latest version of [http://www.hyphy.org HyPhy] be installed on the Xanadu cluster, but that hasn't happened yet. I have placed the executable file in the scratch folder. You should create a <tt>bin</tt> directory in your home directory (if you haven't already done so) and copy the hyphy "binary" file there:<br />
cd<br />
mkdir bin<br />
cp /scratch/phylogenetics/hyphy ~/bin<br />
--><br />
<br />
== Loading modules needed ==<br />
<br />
Load the modules needed for this exercise:<br />
module load hyphy/2.3.11<br />
module load paup/4.0a-166<br />
<br />
== Loading data into HyPhy ==<br />
<br />
Start HyPhy and dismiss the "Welcome to HyPhy" dialog box (if it appears) by pressing the Ok button. Choose ''File > Open > Open Data File'', then navigate to and select the <tt>wickett.nex</tt> data file that you saved previously. You should now see the sequences appear in a window entitled "DataSet wickett". I will refer to this as the '''Data window''' from this point on.<br />
<br />
== Creating a partition ==<br />
<br />
HyPhy thinks of your data as being composed of one or more '''partitions'''. Partitioning data means assigning characters (sites) into mutually-exclusive groups. For example, suppose your data set comprises two genes: you might want to assign a separate model for each gene, so in this case you would create two partitions (one for each gene). <br />
<br />
=== The word partition is used in two ways ===<br />
The word partition is ambiguous: it formerly meant "wall" or "divider" but, with the advent of computer hard drives, it has also come to mean the space ''between'' the walls or dividers. When someone says they ''partitioned their data'', they mean that they erected dividers, for example between the rbcL and 18S genes. When someone says they ''applied a GTR+I+G model to the rbcL partition'', they have now switched to using the word partition to mean the sites on the rbcL side of the divider.<br />
<br />
=== No partitioning implies one partition! ===<br />
Even if you choose to '''not''' partition (old meaning) your data in HyPhy, you must go through the motions of creating a single partition (new meaning) because HyPhy only allows you to apply a model to a partition. To create a single partition containing all of your sites, choose ''Edit > Select All'' from the Data window menu, then choose ''Data > Selection->Partition'' to assign all the selected sites to a new partition. You should see a line appear below your sequences with a partition name "wickett_part".<br />
<br />
=== Assign a data type to your partition ===<br />
Now that you have a partition, you can create a model for it. Under the column name ''Partition Type'', choose ''codon'' (just press the Ok button in the dialog box that appears). You have now chosen to view your data as codons (i.e. three nucleotides at a time) rather than as single nucleotides. The third possible choice for Partition Type is ''Di-nucl.'', which you would use if you were planning to use a secondary structure (i.e. stem) model, which treats each sequential pair of nucleotides as a state.<br />
<br />
=== Assign a tree topology to your partition ===<br />
Under Tree Topology, you have several options. Because a tree topology was defined in the <tt>wickett.nex</tt> data file, this tree topology shows up in the drop-down list as <tt>wickett_tree</tt>. Choose <tt>wickett_tree</tt> as the tree topology for your partition.<br />
<br />
=== Assign a substitution model to your partition ===<br />
The only substitution models that show up in the drop-down list are codon models because earlier you chose to treat your data as codon sequences rather than nucleotide sequences. The substitution model you should use is ''MG94xHKY85_3x4''. This model is like the Muse and Gaut (1994) codon model, which is the only codon model I discussed in lecture. You will remember (I'm sure) that the MG94 model allows substitutions to be either synonymous or non-synonymous, but does not make a distinction between transitions and transversions. The HKY85 model distinguishes between transitions and transversions (remember kappa?), but does not distinguish between synonymous and non-synonymous substitutions. Thus, MG94xHKY85 is a hybrid model that allows all four possibilities: synonymous transitions, synonymous transversions, nonsynonymous transitions and nonsynonymous transversions. The name is nevertheless a bit puzzling because (as you will find out in a few minutes) it actually behaves more like the GTR model than the HKY model in that it allows all 6 possible types of substitutions (A<->C, A<->G, A<->T, C<->G, C<->T and G<->T) to have their own rates.<br />
<br />
The 3x4 part on the end of the name means that the 61 codon frequencies are obtained by multiplying together the four nucleotide frequencies that are estimated separately for the three codon positions. Thus, the frequency for the AGT codon is obtained by multiplying together these three quantities:<br />
* the frequency of A nucleotides at first positions<br />
* the frequency of G nucleotides at second positions<br />
* the frequency of T nucleotides at third positions<br />
(Note: HyPhy corrects these for the fact that the three stop codons are not included.)<br />
This involves estimating the '''4''' nucleotide frequencies at each of the '''3''' codon positions, hence the '''3x4''' in the name.<br />
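<br />
The 3x4 construction can be sketched in Python. The position-specific frequencies below are made-up numbers purely for illustration (HyPhy estimates them from the alignment); note the renormalization that compensates for the three excluded stop codons:<br />

```python
from itertools import product

# Hypothetical nucleotide frequencies at codon positions 1, 2 and 3
pos_freqs = [
    {'A': 0.3, 'C': 0.2, 'G': 0.3, 'T': 0.2},      # first positions
    {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25},  # second positions
    {'A': 0.2, 'C': 0.2, 'G': 0.2, 'T': 0.4},      # third positions
]
stops = {'TAA', 'TAG', 'TGA'}

# Raw product for every triplet, then drop stops and renormalize to sum to 1
raw = {''.join(c): pos_freqs[0][c[0]] * pos_freqs[1][c[1]] * pos_freqs[2][c[2]]
       for c in product('ACGT', repeat=3)}
total = sum(f for codon, f in raw.items() if codon not in stops)
codon_freqs = {codon: f / total for codon, f in raw.items() if codon not in stops}

print(len(codon_freqs))    # 61 sense codons
print(codon_freqs['AGT'])  # 0.3 * 0.25 * 0.4, rescaled for the missing stops
```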
<br />
=== Local vs. global ===<br />
You have only a couple more decisions to make before calculating the likelihood. You must choose Local or Global from the Parameters drop-down list. '''Local''' means that HyPhy will estimate some substitution model parameters for every branch in the tree. '''Global''' means that all substitution model parameters will apply to the entire tree. In all the models discussed thus far in the course, we were effectively using the global option except for the branch lengths themselves, which are always local parameters (it doesn't usually make any sense to think of every branch having the same length).<br />
<br />
Tell HyPhy to use the Local option (this should already be set correctly).<br />
<br />
=== Equilibrium frequencies===<br />
You should also leave the equilibrium frequencies set to "Partition". This sets the equilibrium base frequencies to the empirical values (i.e. the frequency of A is the number of As observed in the entire partition divided by the total number of nucleotides in the partition). Other options include:<br />
* Dataset, which is no different from "Partition" in this case because only one partition is defined, <br />
* Equal, which sets all base frequencies equal to 0.25, and<br />
* Estimate, which estimates the base frequencies<br />
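<br />
For example, with "Partition" selected, the frequency of each base is just its observed proportion in the partition. A toy Python sketch of that tally (the sequences here are made up):<br />

```python
from collections import Counter

# Two short made-up sequences, concatenated as if they were one partition
partition = "ATGGCTAAATGA" + "ATGGCAAAGTGA"

counts = Counter(partition)
total = sum(counts[base] for base in "ACGT")
freqs = {base: counts[base] / total for base in "ACGT"}

print(freqs)  # each value is count / total; the four values sum to 1
```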
<br />
== Computing the likelihood under a local codon model ==<br />
<br />
You are now ready to compute the maximum likelihood estimates of the parameters in your model. Choose ''Likelihood > Build Function'' to build a likelihood function, then ''Likelihood > Optimize'' to optimize the likelihood function (i.e. search for the highest point on the likelihood surface, thus obtaining maximum likelihood estimates of all parameters).<br />
<br />
=== Saving the results ===<br />
When HyPhy has finished optimizing (this will take several seconds to several minutes, depending on the speed of the computer you are using), it will pop up a "Likelihood parameters for wickett" window (hereafter I will just refer to this as the '''Parameters window''') showing you values for all the quantities it estimated. <br />
<br />
Click on the '''HYPHY Console window''' to bring it to the foreground, then, using the scroll bar to move up if needed, answer the following questions:<br />
<div style="background-color:#eeeeff">What is the maximum log-likelihood under this unconstrained model? {{title|-4203.47237161049|answer}}</div><br />
<div style="background-color:#eeeeff">How many shared (i.e. global) parameters does HyPhy say it estimated? {{title|1|answer}}</div><br />
<div style="background-color:#eeeeff">What are these global parameters? {{title|tree topology|answer}}</div><br />
<div style="background-color:#eeeeff">How many local parameters does HyPhy say it estimated? {{title|42|answer}}</div><br />
<div style="background-color:#eeeeff">What are these local parameters? (Hint: for n taxa, there are 2n-3 branches) {{title|synonymous and nonsynonymous rate for each of the 21 edges for 12 taxa|answer}}</div><br />
<br />
Switch back to the Parameters window now and look at the very bottom of the window to answer these questions:<br />
<div style="background-color:#eeeeff">What is the total number of parameters estimated? {{title|43|answer}}</div><br />
<div style="background-color:#eeeeff">What is the value of AIC reported by HyPhy? {{title|8492.944743220985|answer}}</div><br />
<div style="background-color:#eeeeff">Calculate the AIC yourself using this formula: AIC = -2*lnL + 2*nparams {{title|8492.944743221|answer}}</div><br />
<br />
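The hand calculation in the last question uses only two numbers from the Parameters window; in Python (values copied from the questions above):<br />

```python
# AIC for the unconstrained model, from the values HyPhy reported
lnL = -4203.47237161049   # maximum log-likelihood
k = 43                    # total parameters estimated
aic = -2.0 * lnL + 2.0 * k
print(round(aic, 4))      # 8492.9447, matching HyPhy's reported AIC
```
<br />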
Before moving on, save a snapshot of the likelihood function with the current parameter values by choosing "Save LF state" from the drop-down list box at the top of the Parameters window. Choose the name "unconstrained" when asked. After saving the state of the likelihood function, choose "Select as alternative" from the same drop-down list. This will allow us to easily perform likelihood ratio tests using another, simpler model as the null model.<br />
<br />
=== Viewing the tree and obtaining information about branches ===<br />
The first item in the Parameters window should be "wickett_tree". Double-click this line to bring up a Tree window showing the tree. You may need to expand the Tree window to see the entire tree. This shows the tree with branch lengths scaled to be proportional to the expected number of substitutions (the normal way to scale branch lengths). <br />
<br />
The next step is to compare the unconstrained model (in which there are the same number of omega parameters as there are branches) with simpler models involving fewer omega parameters. For example, one model you will use in a few minutes allows the three branches in the parasite clade to evolve under one omega, while all other branches evolve under an omega value that is potentially different. For future reference, you should determine now what name HyPhy is using for the branch leading to the two parasite taxa.<br />
<br />
Click on the branch leading to the two parasites. It should turn into a dotted line. Now double-click this branch and you should get a dialog box popping up with every bit of information known about this branch:<br />
<div style="background-color: #eeeeff">What is the branch id for this branch that leads to the two parasite sequences? {{title|Node10|answer}}</div><br />
You can now close the "Branch Info" dialog box.<br />
<br />
== Computing the likelihood under the most-constrained model ==<br />
Under the current (unconstrained) model, two parameters were estimated for each branch: the synonymous substitution rate and the nonsynonymous substitution rate. Now let's constrain each branch so that the ratio (omega) between the nonsynonymous rate and the synonymous rate is identical for all branches. <br />
<br />
To do this, first notice that each branch is represented by two parameters in the Parameter window. For example, the branch leading to Parasite_A is associated with these two parameters:<br />
wickett_tree.PARASITE_A.nonSynRate<br />
wickett_tree.PARASITE_A.synRate<br />
The goal is to constrain these two parameters so that the nonsynonymous rate is always omega times the synonymous rate, where omega is a new parameter shared by all branches.<br />
<br />
Select the two parameters listed above for the branch leading to PARASITE_A. (You can do this by single-clicking both parameters while simultaneously holding down the Shift key.) Once you have both parameters selected, click on the third button from the left at the top of the Parameters window. This is the button decorated with the symbol for proportionality. Clicking this button will produce a long list of possibilities: here is the one you should choose:<br />
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate<br />
Once you select this option, HyPhy will ask for a name: type<br />
omega<br />
as the name of the new ratio.<br />
<br />
Now select the two parameters for a different pair of branches, say <br />
wickett_tree.PARASITE_B.nonSynRate<br />
wickett_tree.PARASITE_B.synRate<br />
Click the proportionality constraint button again, but this time choose<br />
wickett_tree.PARASITE_B.nonSynRate:=omega*wickett_tree.PARASITE_B.synRate<br />
Note that once you have defined a constraint for one branch, you can reuse it for other branches.<br />
<br />
Continue to apply this constraint to all 19 remaining branches. When you are finished, choose ''Likelihood > Optimize'' from the menu at the top of the Parameters window.<br />
<br />
== Performing a model comparison ==<br />
<br />
After HyPhy is finished optimizing the likelihood function, answer the following questions using the numbers at the bottom of the Parameters window:<br />
<div style="background-color: #eeeeff">What is the estimated value of the omega parameter? {{title|0.0247457593714435|answer}}</div><br />
<div style="background-color: #eeeeff">Does this value of omega imply stabilizing selection, neutral evolution or positive selection? {{title|stabilizing selection|answer}}</div><br />
<div style="background-color: #eeeeff">What is the maximized log-likelihood of this (most-constrained) model? {{title|-4224.870964230792|answer}}</div><br />
<div style="background-color: #eeeeff">How many parameters are being estimated now? {{title|23|answer}}</div><br />
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy? {{title|8495.741928461584|answer}}</div><br />
<div style="background-color: #eeeeff">Does this most-constrained model fit the data better than the unconstrained model? {{title|no|answer}}</div><br />
<div style="background-color: #eeeeff">What is the difference between the log-likelihood of this (most-constrained) model and the log-likelihood of the previous (unconstrained) model? {{title|21.3986|answer}}</div><br />
<div style="background-color: #eeeeff">What is the likelihood ratio test statistic for this comparison? {{title|42.7972|answer}}</div> <br />
<div style="background-color: #eeeeff">How many degrees of freedom does this likelihood ratio test have? {{title|20|answer}}</div><br />
<div style="background-color: #eeeeff">Is the likelihood ratio test significant? (click [http://faculty.vassar.edu/lowry/tabs.html#csq here] for an online chi-square calculator) {{title|yes at significance level 0.00217427|answer}}</div> <br />
<div style="background-color: #eeeeff">Is a model in which one value of omega applies to every branch satisfactory, or is there enough variation in omega across the tree that it is necessary for each branch to have its own specific omega parameter in order to fit the data well? {{title|each branch needs its own omega|answer}}</div><br />
<div style="background-color: #eeeeff">Does AIC concur with the likelihood ratio test? (Hint: models with smaller values of AIC are preferred over models with larger AIC values.) {{title|yes, the 8492 AIC for the unconstrained model is less than the 8495 AIC for the most constrained model|answer}}</div><br />
<br />
Although you should do the calculation yourself first, you can now have HyPhy perform the likelihood ratio test for you to check your calculations. In the drop-down list box at the top of the Parameters window, choose "Save LF state" and name it "most-constrained". Now, using the same list box, choose "Select as null". Now perform the test by choosing LRT from the same drop-down list box. The results should appear in the HYPHY Console window.<br />
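<br />
For reference, the hand calculation amounts to the following (log-likelihoods and parameter counts copied from the two optimizations above):<br />

```python
# Likelihood ratio test: most-constrained (null) vs. unconstrained (alternative)
lnL_null = -4224.870964230792   # most-constrained model, 23 parameters
lnL_alt = -4203.47237161049     # unconstrained model, 43 parameters

lrt = 2.0 * (lnL_alt - lnL_null)   # test statistic
df = 43 - 23                       # difference in parameter count

print(round(lrt, 4), df)  # 42.7972 20
```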
<br />
== Computing the likelihood under a partially-constrained model ==<br />
<br />
Let's try one more model that is intermediate between the unconstrained and most-constrained models you just analyzed. This model will allow for omega to be different in the non-green, parasitic clade compared to the remaining green, non-parasite part of the tree.<br />
<br />
For one of the three branches in the parasite clade (say, the branch leading to PARASITE_A), select the two parameters associated with the branch and click the rightmost button at the top of the Parameters window (this button releases the constraint previously placed on these two parameters). With the two parameters still selected, click the proportionality constraint button again (third from left) and choose the option<br />
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate<br />
and specify<br />
omega2<br />
as the name of the New Ratio. Now apply this new ratio to the other two branches in the clade by first releasing the existing constraint and then applying the omega2 constraint.<br />
<br />
Once you are finished, choose ''Likelihood > Optimize'' again to search for the maximum likelihood point. Now choose "Save LF state", naming this one "partially-constrained". Answer the following questions using the values shown in the Parameter window:<br />
<div style="background-color: #eeeeff">What is the maximized log-likelihood under this (partially-constrained) model? {{title|-4221.520105501849|answer}}</div><br />
<div style="background-color: #eeeeff">How many parameters were estimated? {{title|24|answer}}</div><br />
<div style="background-color: #eeeeff">What is the value of omega now? {{title|0.02294237571140109|answer}}</div><br />
<div style="background-color: #eeeeff">What is the value of omega2? {{title|0.1183733251322421|answer}}</div><br />
<div style="background-color: #eeeeff">Which is higher: omega or omega2? {{title|omega2|answer}} Does this make sense in light of what you know about the organisms involved and the function of this gene? {{title|yes, selection is expected to be closer to neutral on this gene|answer}}</div><br />
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy for this model? {{title|8491.040211003698|answer}}</div><br />
<div style="background-color: #eeeeff">Based on AIC, which of the three models tested thus far would you prefer? {{title|the partially-constrained model, which has the smallest AIC|answer}}</div><br />
<br />
You can now perform a likelihood ratio test. Using the drop-down list box at the top of the Parameters window, specify the most-constrained model to be the null model and the partially-constrained model to be the alternative. Choose LRT from the drop-down list to perform the test.<br />
<div style="background-color: #eeeeff">Does the partially-constrained model fit the data significantly better than the most-constrained model? {{title|yes, at significance level 0.00963201|answer}}</div><br />
<br />
Perform one more likelihood ratio test, this time using the partially-constrained model as the null and the unconstrained model as the alternative.<br />
<div style="background-color: #eeeeff">Does the unconstrained model fit the data significantly better than the partially-constrained model? {{title|yes, at significance level 0.0102742|answer}}</div><br />
<div style="background-color: #eeeeff">Do AIC and LRT agree on which model of the three models is best? {{title|no, AIC favors the partially-constrained model, while LRT favors the unconstrained model|answer}} Why or why not? {{title|To win using AIC, a model must increase the lnL by 1 for each additional parameter. The unconstrained model increases lnL by 18 but requires 19 parameters more than the partially-constrained model. Note that the LRT favors the unconstrained model but only at the 0.01 significance level|answer}}</div><br />
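<br />
The disagreement is easy to see by tabulating both criteria (a Python sketch; log-likelihoods and parameter counts copied from the three optimizations above):<br />

```python
# (maximized lnL, number of estimated parameters) for each model
models = {
    'unconstrained':         (-4203.47237161049, 43),
    'partially-constrained': (-4221.520105501849, 24),
    'most-constrained':      (-4224.870964230792, 23),
}

aic = {name: -2.0 * lnL + 2.0 * k for name, (lnL, k) in models.items()}
print(min(aic, key=aic.get))  # partially-constrained wins on AIC

# LRT of partially-constrained (null) vs. unconstrained (alternative)
lrt = 2.0 * (models['unconstrained'][0] - models['partially-constrained'][0])
df = 43 - 24
print(round(lrt, 4), df)  # 36.0955 19: significant at the 0.05 level,
                          # so the LRT prefers the unconstrained model
```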
<br />
<!--<br />
Summary of results:<br />
unconstrained<br />
params: 43<br />
lnL: -4203.47237 (*** best ***)<br />
AIC: 8492.94474 (middle)<br />
<br />
partially-constrained<br />
params: 24<br />
lnL: -4221.52011 (middle)<br />
AIC: 8491.04021 (*** best ***)<br />
<br />
most-constrained<br />
params: 23<br />
lnL: -4224.87096 (worst)<br />
AIC: 8495.74193 (worst)<br />
--><br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41251Phylogenetics: HyPhy Lab2020-02-24T14:19:41Z<p>Paul Lewis: /* Loading modules needed */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Obtaining the sequences ==<br />
<br />
A Nexus data file containing sequences and a tree is located here: [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.nex wickett.nex]. This dataset was assembled by former UConn EEB graduate student [http://www.chicagobotanic.org/research/staff/wickett Norm Wickett] and contains several sequences of bryophytes, including two from a parasitic bryophyte (the liverwort <em>Aneura mirabilis</em>) that is non-green and does not photosynthesize. Today's lab will recreate the type of analysis Norm carried out in [http://dx.doi.org/10.1007/s00239-008-9133-1 his 2008 paper in Journal of Molecular Evolution (67:111-122)].<br />
<br />
The sequences are of a gene important for photosynthesis. The basic idea behind today's lab is to see if we can detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this).<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
<!--<br />
== HyPhy ==<br />
<br />
I have requested that the latest version of [HyPhy http://www.hyphy.org] be installed on the Xanadu cluster, but that hasn't happened yet. I have placed the executable file in the scratch folder. You should create a <tt>bin</tt> directory in your home directory (if you haven't already done so) and copy the hyphy "binary" file there:<br />
cd<br />
mkdir bin<br />
cp /scratch/phylogenetics/hyphy ~/bin<br />
--><br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load hyphy/2.3.11<br />
<br />
== Loading data into HyPhy ==<br />
<br />
Start HyPhy and dismiss the "Welcome to HyPhy" dialog box (if it appears) by pressing the Ok button. Choose ''File > Open > Open Data File'', then navigate to and select the <tt>wickett.nex</tt> data file that you saved previously. You should now see the sequences appear in a window entitled "DataSet wickett". I will refer to this as the '''Data window''' from this point on.<br />
<br />
== Creating a partition ==<br />
<br />
HyPhy thinks of your data as being composed of one or more '''partitions'''. Partitioning data means assigning characters (sites) into mutually-exclusive groups. For example, suppose your data set comprises two genes: you might want to assign a separate model for each gene, so in this case you would create two partitions (one for each gene). <br />
<br />
=== The word partition is used in two ways ===<br />
The word partition is ambiguous: it formerly meant "wall" or "divider" but, with the advent of computer hard drives, it has also come to mean the space ''between'' the walls or dividers. When someone says they ''partitioned their data'', they mean that they erected dividers, for example between the rbcL and 18S genes. When someone says they ''applied a GTR+I+G model to the rbcL partition'', they have now switched to using the word partition to mean the sites on the rbcL side of the divider.<br />
<br />
=== No partitioning implies one partition! ===<br />
Even if you choose to '''not''' partition (old meaning) your data in HyPhy, you must go through the motions of creating a single partition (new meaning) because HyPhy only allows you to apply a model to a partition. To create a single partition containing all of your sites, choose ''Edit > Select All'' from the Data window menu, then choose ''Data > Selection->Partition'' to assign all the selected sites to a new partition. You should see a line appear below your sequences with a partition name "wickett_part".<br />
<br />
=== Assign a data type to your partition ===<br />
Now that you have a partition, you can create a model for it. Under the column name ''Partition Type'', choose ''codon'' (just press the Ok button in the dialog box that appears). You have now chosen to view your data as codons (i.e. three nucleotides at a time) rather than as single nucleotides. The third possible choice for Partition Type is ''Di-nucl.'', which you would use if you were planning to use a secondary structure (i.e. stem) model, which treats each sequential pair of nucleotides as a state.<br />
<br />
=== Assign a tree topology to your partition ===<br />
Under Tree Topology, you have several options. Because a tree topology was defined in the <tt>wickett.nex</tt> data file, this tree topology shows up in the drop-down list as <tt>wickett_tree</tt>. Choose <tt>wickett_tree</tt> as the tree topology for your partition.<br />
<br />
=== Assign a substitution model to your partition ===<br />
The only substitution models that show up in the drop-down list are codon models because earlier you chose to treat your data as codon sequences rather than nucleotide sequences. The substitution model you should use is ''MG94xHKY85_3x4''. This model is like the Muse and Gaut (1994) codon model, which is the only codon model I discussed in lecture. You will remember (I'm sure) that the MG94 model allows substitutions to be either synonymous or non-synonymous, but does not make a distinction between transitions and transversions. The HKY85 model distinguishes between transitions and transversions (remember kappa?), but does not distinguish between synonymous and non-synonymous substitutions. Thus, MG94xHKY85 is a hybrid model that allows all four possibilities: synonymous transitions, synonymous transversions, nonsynonymous transitions and nonsynonymous transversions. The name is nevertheless a bit puzzling because (as you will find out in a few minutes) it actually behaves more like the GTR model than the HKY model in that it allows all 6 possible types of substitutions (A<->C, A<->G, A<->T, C<->G, C<->T and G<->T) to have their own rates.<br />
<br />
The 3x4 part on the end of the name means that the 61 codon frequencies are obtained by multiplying together the four nucleotide frequencies that are estimated separately for the three codon positions. Thus, the frequency for the AGT codon is obtained by multiplying together these three quantities:<br />
* the frequency of A nucleotides at first positions<br />
* the frequency of G nucleotides at second positions<br />
* the frequency of T nucleotides at third positions<br />
(Note: HyPhy corrects these for the fact that the three stop codons are not included.)<br />
This involves estimating the '''4''' nucleotide frequencies at each of the '''3''' codon positions, hence the '''3x4''' in the name.<br />
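To make the 3x4 construction concrete, here is a small Python sketch (my own illustration, not HyPhy's code); the position-specific frequencies are made-up example values, and renormalizing over the 61 sense codons is one reasonable way to implement the stop-codon correction mentioned above.<br />

```python
# Illustration of the "3x4" codon frequency construction (my own sketch,
# not HyPhy's code). The position-specific nucleotide frequencies below
# are made-up example values, not estimates from wickett.nex.
from itertools import product

freq = {  # nucleotide frequencies at codon positions 1, 2 and 3
    1: {'A': 0.30, 'C': 0.20, 'G': 0.25, 'T': 0.25},
    2: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25},
    3: {'A': 0.20, 'C': 0.30, 'G': 0.20, 'T': 0.30},
}
STOP_CODONS = {'TAA', 'TAG', 'TGA'}

# Multiply the three position-specific frequencies for every codon...
raw = {''.join(c): freq[1][c[0]] * freq[2][c[1]] * freq[3][c[2]]
       for c in product('ACGT', repeat=3)}

# ...then renormalize over the 61 sense codons, one reasonable way to
# correct for the three excluded stop codons.
sense_total = sum(f for codon, f in raw.items() if codon not in STOP_CODONS)
codon_freq = {codon: f / sense_total
              for codon, f in raw.items() if codon not in STOP_CODONS}

print(len(codon_freq))    # 61 sense codons
print(codon_freq['AGT'])  # freq(A@1) * freq(G@2) * freq(T@3), renormalized
```

In real use the position-specific frequencies would come from counting nucleotides at each codon position of the alignment, not from hand-picked values.<br />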
<br />
=== Local vs. global ===<br />
You have only a couple more decisions to make before calculating the likelihood. You must choose Local or Global from the Parameters drop-down list. '''Local''' means that HyPhy will estimate some substitution model parameters for every branch in the tree. '''Global''' means that all substitution model parameters will apply to the entire tree. In all the models discussed thus far in the course, we were effectively using the global option except for the branch lengths themselves, which are always local parameters (it doesn't usually make any sense to think of every branch having the same length).<br />
<br />
Tell HyPhy to use the Local option (this should already be set correctly).<br />
<br />
=== Equilibrium frequencies===<br />
You should also leave the equilibrium frequencies set to "Partition". This sets the equilibrium base frequencies to the empirical values (i.e. the frequency of A is the number of As observed in the entire partition divided by the total number of nucleotides in the partition). Other options include:<br />
* Dataset, which would be no different from "Partition" in this case, since only one partition is defined,<br />
* Equal, which sets all base frequencies equal to 0.25, and<br />
* Estimate, which estimates the base frequencies<br />
<br />
== Computing the likelihood under a local codon model ==<br />
<br />
You are now ready to compute the maximum likelihood estimates of the parameters in your model. Choose ''Likelihood > Build Function'' to build a likelihood function, then ''Likelihood > Optimize'' to optimize the likelihood function (i.e. search for the highest point on the likelihood surface, thus obtaining maximum likelihood estimates of all parameters).<br />
<br />
=== Saving the results ===<br />
When HyPhy has finished optimizing (this will take several seconds to several minutes, depending on the speed of the computer you are using), it will pop up a "Likelihood parameters for wickett" window (hereafter I will just refer to this as the '''Parameters window''') showing you values for all the quantities it estimated. <br />
<br />
Click on the '''HYPHY Console window''' to bring it to the foreground, then, using the scroll bar to move up if needed, answer the following questions:<br />
<div style="background-color:#eeeeff">What is the maximum log-likelihood under this unconstrained model? {{title|-4203.47237161049|answer}}</div><br />
<div style="background-color:#eeeeff">How many shared (i.e. global) parameters does HyPhy say it estimated? {{title|1|answer}}</div><br />
<div style="background-color:#eeeeff">What are these global parameters? {{title|tree topology|answer}}</div><br />
<div style="background-color:#eeeeff">How many local parameters does HyPhy say it estimated? {{title|42|answer}}</div><br />
<div style="background-color:#eeeeff">What are these local parameters? (Hint: for n taxa, there are 2n-3 branches) {{title|a synonymous and a nonsynonymous rate for each of the 21 edges of the 12-taxon tree|answer}}</div><br />
<br />
Switch back to the Parameters window now and look at the very bottom of the window to answer these questions:<br />
<div style="background-color:#eeeeff">What is the total number of parameters estimated? {{title|43|answer}}</div><br />
<div style="background-color:#eeeeff">What is the value of AIC reported by HyPhy? {{title|8492.944743220985|answer}}</div><br />
<div style="background-color:#eeeeff">Calculate the AIC yourself using this formula: AIC = -2*lnL + 2*nparams {{title|8492.944743221|answer}}</div><br />
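You can verify the arithmetic with a couple of lines of Python, plugging in the log-likelihood and parameter count reported above:<br />

```python
# AIC = -2 * lnL + 2 * (number of estimated parameters)
ln_l = -4203.47237161049  # maximized log-likelihood, unconstrained model
n_params = 43             # 42 local parameters + 1 shared parameter

aic = -2.0 * ln_l + 2.0 * n_params
print(aic)  # ~8492.9447, matching the value HyPhy reports
```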
<br />
Before moving on, save a snapshot of the likelihood function with the current parameter values by choosing "Save LF state" from the drop-down list box at the top of the Parameters window. Choose the name "unconstrained" when asked. After saving the state of the likelihood function, choose "Select as alternative" from the same drop-down list. This will allow us to easily perform likelihood ratio tests using another, simpler model as the null model.<br />
<br />
=== Viewing the tree and obtaining information about branches ===<br />
The first item in the Parameters window should be "wickett_tree". Double-click this line to bring up a Tree window showing the tree. You may need to expand the Tree window to see the entire tree. This shows the tree with branch lengths scaled to be proportional to the expected number of substitutions (the normal way to scale branch lengths). <br />
<br />
The next step is to compare the unconstrained model (in which there are the same number of omega parameters as there are branches) with simpler models involving fewer omega parameters. For example, one model you will use in a few minutes allows the three branches in the parasite clade to evolve under one omega, while all other branches evolve under an omega value that is potentially different. For future reference, you should determine now what name HyPhy is using for the branch leading to the two parasite taxa.<br />
<br />
Click on the branch leading to the two parasites. It should turn into a dotted line. Now double-click this branch and you should get a dialog box popping up with every bit of information known about this branch:<br />
<div style="background-color: #eeeeff">What is the branch id for this branch that leads to the two parasite sequences? {{title|Node10|answer}}</div><br />
You can now close the "Branch Info" dialog box.<br />
<br />
== Computing the likelihood under the most-constrained model ==<br />
Under the current (unconstrained) model, two parameters were estimated for each branch: the synonymous substitution rate and the nonsynonymous substitution rate. Now let's constrain each branch so that the ratio (omega) between the nonsynonymous rate and the synonymous rate is identical for all branches. <br />
<br />
To do this, first notice that each branch is represented by two parameters in the Parameters window. For example, the branch leading to PARASITE_A is associated with these two parameters:<br />
wickett_tree.PARASITE_A.nonSynRate<br />
wickett_tree.PARASITE_A.synRate<br />
The goal is to constrain these two parameters so that the nonsynonymous rate is always omega times the synonymous rate, where omega is a new parameter shared by all branches.<br />
<br />
Select the two parameters listed above for the branch leading to PARASITE_A. (You can do this by single-clicking both parameters while simultaneously holding down the Shift key.) Once you have both parameters selected, click on the third button from the left at the top of the Parameters window. This is the button decorated with the symbol for proportionality. Clicking this button will produce a long list of possibilities; here is the one you should choose:<br />
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate<br />
Once you select this option, HyPhy will ask for a name: type<br />
omega<br />
as the name of the new ratio.<br />
<br />
Now select the two parameters for a different pair of branches, say <br />
wickett_tree.PARASITE_B.nonSynRate<br />
wickett_tree.PARASITE_B.synRate<br />
Click the proportionality constraint button again, but this time choose<br />
wickett_tree.PARASITE_B.nonSynRate:=omega*wickett_tree.PARASITE_B.synRate<br />
Note that once you have defined a named ratio for one branch, you can reuse that constraint for the other branches.<br />
<br />
Continue to apply this constraint to all 19 remaining branches. When you are finished, choose ''Likelihood > Optimize'' from the menu at the top of the Parameters window.<br />
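Before optimizing, it is worth checking the parameter bookkeeping. This sketch (my own accounting, using the branch-count formula from the hint earlier) shows where the totals for the two models come from:<br />

```python
# An unrooted tree for n taxa has 2n - 3 branches.
n_taxa = 12
n_branches = 2 * n_taxa - 3  # 21 branches

# Unconstrained model: a synRate and a nonSynRate for every branch,
# plus the 1 shared parameter HyPhy reports.
unconstrained_params = 2 * n_branches + 1  # 43

# Most-constrained model: one synRate per branch plus a single shared
# omega, plus the same 1 shared parameter.
most_constrained_params = n_branches + 1 + 1  # 23

print(unconstrained_params, most_constrained_params)  # 43 23
```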
<br />
== Performing a model comparison ==<br />
<br />
After HyPhy is finished optimizing the likelihood function, answer the following questions using the numbers at the bottom of the Parameters window:<br />
<div style="background-color: #eeeeff">What is the estimated value of the omega parameter? {{title|0.0247457593714435|answer}}</div><br />
<div style="background-color: #eeeeff">Does this value of omega imply stabilizing selection, neutral evolution or positive selection? {{title|stabilizing selection|answer}}</div><br />
<div style="background-color: #eeeeff">What is the maximized log-likelihood of this (most-constrained) model? {{title|-4224.870964230792|answer}}</div><br />
<div style="background-color: #eeeeff">How many parameters are being estimated now? {{title|23|answer}}</div><br />
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy? {{title|8495.741928461584|answer}}</div><br />
<div style="background-color: #eeeeff">Does this most-constrained model fit the data better than the unconstrained model? {{title|no|answer}}</div><br />
<div style="background-color: #eeeeff">What is the difference between the log-likelihood of this (most-constrained) model and the log-likelihood of the previous (unconstrained) model? {{title|21.3986|answer}}</div><br />
<div style="background-color: #eeeeff">What is the likelihood ratio test statistic for this comparison? {{title|42.7972|answer}}</div> <br />
<div style="background-color: #eeeeff">How many degrees of freedom does this likelihood ratio test have? {{title|20|answer}}</div><br />
<div style="background-color: #eeeeff">Is the likelihood ratio test significant? (click [http://faculty.vassar.edu/lowry/tabs.html#csq here] for an online chi-square calculator) {{title|yes at significance level 0.00217427|answer}}</div> <br />
<div style="background-color: #eeeeff">Is a model in which one value of omega applies to every branch satisfactory, or is there enough variation in omega across the tree that it is necessary for each branch to have its own specific omega parameter in order to fit the data well? {{title|each branch needs its own omega|answer}}</div><br />
<div style="background-color: #eeeeff">Does AIC concur with the likelihood ratio test? (Hint: models with smaller values of AIC are preferred over models with larger AIC values.) {{title|yes, the 8492 AIC for the unconstrained model is less than the 8495 AIC for the most constrained model|answer}}</div><br />
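After doing the calculation by hand, you can double-check it with a short Python sketch (assuming SciPy is available; the log-likelihoods are the values reported in the Parameters window):<br />

```python
from scipy.stats import chi2

lnL_alt = -4203.47237161049    # unconstrained model (43 parameters)
lnL_null = -4224.870964230792  # most-constrained model (23 parameters)

lrt = 2.0 * (lnL_alt - lnL_null)  # likelihood ratio test statistic
df = 43 - 23                      # difference in parameter counts
p_value = chi2.sf(lrt, df)        # upper-tail chi-square probability

print(round(lrt, 4))  # 42.7972
print(p_value)        # ~0.0022, significant at the 0.01 level
```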
<br />
Although you should do the calculation yourself first, you can now have HyPhy perform the likelihood ratio test for you to check your calculations. In the drop-down list box at the top of the Parameters window, choose "Save LF state" and name it "most-constrained". Now, using the same list box, choose "Select as null". Now perform the test by choosing LRT from the same drop-down list box. The results should appear in the HYPHY Console window.<br />
<br />
== Computing the likelihood under a partially-constrained model ==<br />
<br />
Let's try one more model that is intermediate between the unconstrained and most-constrained models you just analyzed. This model will allow for omega to be different in the non-green, parasitic clade compared to the remaining green, non-parasite part of the tree.<br />
<br />
For one of the three branches in the parasite clade (say, the branch leading to PARASITE_A), select the two parameters associated with the branch and click the rightmost button at the top of the Parameters window (this button releases the constraint previously placed on these two parameters). With the two parameters still selected, click the proportionality constraint button again (third from left) and choose the option<br />
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate<br />
and specify<br />
omega2<br />
as the name of the New Ratio. Now apply this new ratio to the other two branches in the clade by first releasing the existing constraint and then applying the omega2 constraint.<br />
<br />
Once you are finished, choose ''Likelihood > Optimize'' again to search for the maximum likelihood point. Now choose "Save LF state", naming this one "partially-constrained". Answer the following questions using the values shown in the Parameter window:<br />
<div style="background-color: #eeeeff">What is the maximized log-likelihood under this (partially-constrained) model? {{title|-4221.520105501849|answer}}</div><br />
<div style="background-color: #eeeeff">How many parameters were estimated? {{title|24|answer}}</div><br />
<div style="background-color: #eeeeff">What is the value of omega now? {{title|0.02294237571140109|answer}}</div><br />
<div style="background-color: #eeeeff">What is the value of omega2? {{title|0.1183733251322421|answer}}</div><br />
<div style="background-color: #eeeeff">Which is higher: omega or omega2? {{title|omega2|answer}} Does this make sense in light of what you know about the organisms involved and the function of this gene? {{title|yes, selection is expected to be closer to neutral on this gene|answer}}</div><br />
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy for this model? {{title|8491.040211003698|answer}}</div><br />
<div style="background-color: #eeeeff">Based on AIC, which of the three models tested thus far would you prefer? {{title|the partially-constrained model, since it has the smallest AIC (8491.04)|answer}}</div><br />
<br />
You can now perform a likelihood ratio test. Using the drop-down list box at the top of the Parameters window, specify the most-constrained model to be the null model and the partially-constrained model to be the alternative. Choose LRT from the drop-down list to perform the test.<br />
<div style="background-color: #eeeeff">Does the partially-constrained model fit the data significantly better than the most-constrained model? {{title|yes, at significance level 0.00963201|answer}}</div><br />
<br />
Perform one more likelihood ratio test, this time using the partially-constrained model as the null and the unconstrained model as the alternative.<br />
<div style="background-color: #eeeeff">Does the unconstrained model fit the data significantly better than the partially-constrained model? {{title|yes, at significance level 0.0102742|answer}}</div><br />
<div style="background-color: #eeeeff">Do AIC and LRT agree on which model of the three models is best? {{title|no, AIC favors the partially-constrained model, while LRT favors the unconstrained model|answer}} Why or why not? {{title|To win using AIC, a model must increase the lnL by 1 for each additional parameter. The unconstrained model increases lnL by 18 but requires 19 parameters more than the partially-constrained model. Note that the LRT favors the unconstrained model but only at the 0.01 significance level|answer}}</div><br />
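Finally, the three-way AIC comparison can be recapped in a few lines of Python using the values obtained above:<br />

```python
# Log-likelihoods and parameter counts from the three optimizations.
models = {
    'unconstrained':         {'lnL': -4203.47237161049, 'k': 43},
    'partially-constrained': {'lnL': -4221.520105501849, 'k': 24},
    'most-constrained':      {'lnL': -4224.870964230792, 'k': 23},
}
for m in models.values():
    m['AIC'] = -2.0 * m['lnL'] + 2.0 * m['k']

best = min(models, key=lambda name: models[name]['AIC'])
print(best)  # partially-constrained: smallest AIC despite a middling lnL
```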
<br />
<!--<br />
Summary of results:<br />
unconstrained<br />
params: 43<br />
lnL: -4203.47237 (*** best ***)<br />
AIC: 8492.94474 (middle)<br />
<br />
partially-constrained<br />
params: 24<br />
lnL: -4221.52011 (middle)<br />
AIC: 8491.04021 (*** best ***)<br />
<br />
most-constrained<br />
params: 23<br />
lnL: -4224.87096 (worst)<br />
AIC: 8495.74193 (worst)<br />
--><br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41250Phylogenetics: HyPhy Lab2020-02-24T14:19:22Z<p>Paul Lewis: /* HyPhy */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allow the model of evolution to change across a tree.<br />
<br />
== Obtaining the sequences ==<br />
<br />
A Nexus data file containing sequences and a tree is located here: [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.nex wickett.nex]. This dataset was assembled by former UConn EEB graduate student [http://www.chicagobotanic.org/research/staff/wickett Norm Wickett] and contains several sequences of bryophytes, including two from a parasitic bryophyte (the liverwort <em>Aneura mirabilis</em>) that is non-green and does not photosynthesize. Today's lab will recreate the type of analysis Norm carried out in [http://dx.doi.org/10.1007/s00239-008-9133-1 his 2008 paper in Journal of Molecular Evolution (67:111-122)].<br />
<br />
The sequences are of a gene important for photosynthesis. The basic idea behind today's lab is to see if we can detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this).<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
<!--<br />
== HyPhy ==<br />
<br />
I have requested that the latest version of [HyPhy http://www.hyphy.org] be installed on the Xanadu cluster, but that hasn't happened yet. I have placed the executable file in the scratch folder. You should create a <tt>bin</tt> directory in your home directory (if you haven't already done so) and copy the hyphy "binary" file there:<br />
cd<br />
mkdir bin<br />
cp /scratch/phylogenetics/hyphy ~/bin<br />
--><br />
<br />
== Loading modules needed ==<br />
<br />
Load the modules needed for this exercise:<br />
module load gcc/6.4.0<br />
<br />
== Loading data into HyPhy ==<br />
<br />
Start HyPhy and dismiss the "Welcome to HyPhy" dialog box (if it appears) by pressing the Ok button. Choose ''File > Open > Open Data File'', then navigate to and select the <tt>wickett.nex</tt> data file that you saved previously. You should now see the sequences appear in a window entitled "DataSet wickett". I will refer to this as the '''Data window''' from this point on.<br />
<br />
== Creating a partition ==<br />
<br />
HyPhy thinks of your data as being composed of one or more '''partitions'''. Partitioning data means assigning characters (sites) into mutually-exclusive groups. For example, suppose your data set comprises two genes: you might want to assign a separate model for each gene, so in this case you would create two partitions (one for each gene). <br />
<br />
=== The word partition is used in two ways ===<br />
The word partition is ambiguous: it formerly meant "wall" or "divider" but, with the advent of computer hard drives, it has also come to mean the space ''between'' the walls or dividers. When someone says they ''partitioned their data'', they mean that they erected dividers, for example between the rbcL and 18S genes. When someone says they ''applied a GTR+I+G model to the rbcL partition'', they have now switched to using the word partition to mean the sites on the rbcL side of the divider.<br />
<br />
=== No partitioning implies one partition! ===<br />
Even if you choose to '''not''' partition (old meaning) your data in HyPhy, you must go through the motions of creating a single partition (new meaning) because HyPhy only allows you to apply a model to a partition. To create a single partition containing all of your sites, choose ''Edit > Select All'' from the Data window menu, then choose ''Data > Selection->Partition'' to assign all the selected sites to a new partition. You should see a line appear below your sequences with a partition name "wickett_part".<br />
<br />
=== Assign a data type to your partition ===<br />
Now that you have a partition, you can create a model for it. Under the column name ''Partition Type'', choose ''codon'' (just press the Ok button in the dialog box that appears). You have now chosen to view your data as codons (i.e. three nucleotides at a time) rather than as single nucleotides. The third possible choice for Partition Type is ''Di-nucl.'', which you would use if you were planning to use a secondary structure (i.e. stem) model, which treats each sequential pair of nucleotides as a state.<br />
<br />
=== Assign a tree topology to your partition ===<br />
Under Tree Topology, you have several options. Because a tree topology was defined in the <tt>wickett.nex</tt> data file, this tree topology shows up in the drop-down list as <tt>wickett_tree</tt>. Choose <tt>wickett_tree</tt> as the tree topology for your partition.<br />
<br />
=== Assign a substitution model to your partition ===<br />
The only substitution models that show up in the drop-down list are codon models because earlier you chose to treat your data as codon sequences rather than nucleotide sequences. The substitution model you should use is ''MG94xHKY85_3x4''. This model is like the Muse and Gaut (1994) codon model, which is the only codon model I discussed in lecture. You will remember (I'm sure) that the MG94 model allows substitutions to be either synonymous or non-synonymous, but does not make a distinction between transitions and transversions. The HKY85 model distinguishes between transitions and transversions (remember kappa?), but does not distinguish between synonymous and non-synonymous substitutions. Thus, MG94xHKY85 is a hybrid model that allows all four possibilities: synonymous transitions, synonymous transversions, nonsynonymous transitions and nonsynonymous transversions. The name is nevertheless a bit puzzling because (as you will find out in a few minutes) it actually behaves more like the GTR model than the HKY model in that it allows all 6 possible types of substitutions (A<->C, A<->G, A<->T, C<->G, C<->T and G<->T) to have their own rates.<br />
<br />
The 3x4 part on the end of the name means that the 61 codon frequencies are obtained by multiplying together the four nucleotide frequencies that are estimated separately for the three codon positions. Thus, the frequency for the AGT codon is obtained by multiplying together these three quantities:<br />
* the frequency of A nucleotides at first positions<br />
* the frequency of G nucleotides at second positions<br />
* the frequency of T nucleotides at third positions<br />
(Note: HyPhy corrects these for the fact that the three stop codons are not included.)<br />
This involves estimating the '''4''' nucleotides frequencies at each of the '''3''' codon positions, hence the '''3x4''' in the name.<br />
<br />
=== Local vs. global ===<br />
You have only a couple more decisions to make before calculating the likelihood. You must choose Local or Global from the Parameters drop-down list. '''Local''' means that HyPhy will estimate some substitution model parameters for every branch in the tree. '''Global''' means that all substitution model parameters will apply to the entire tree. In all the models discussed thus far in the course, we were effectively using the global option except for the branch lengths themselves, which are always local parameters (it doesn't usually make any sense to think of every branch having the same length).<br />
<br />
Tell HyPhy to use the Local option (this should already be set correctly).<br />
<br />
=== Equilibrium frequencies===<br />
You should also leave the equilibrium frequencies set to "Partition". This sets the equilibrium base frequencies to the empirical values (i.e. the frequency of A is the number of As observed in the entire partition divided by the total number of nucleotides in the partition). Other options include:<br />
* Dataset, which would not be different than "Partition" in this case where there is only one partition defined, <br />
* Equal, which sets all base frequencies equal to 0.25, and<br />
* Estimate, which estimates the base frequencies<br />
<br />
== Computing the likelihood under a local codon model ==<br />
<br />
You are now ready to compute the maximum likelihood estimates of the parameters in your model. Choose ''Likelihood > Build Function'' to build a likelihood function, then ''Likelihood > Optimize'' to optimize the likelihood function (i.e. search for the highest point on the likelihood surface, thus obtaining maximum likelihood estimates of all parameters).<br />
<br />
=== Saving the results ===<br />
When HyPhy has finished optimizing (this will take several seconds to several minutes, depending on the speed of the computer you are using), it will pop up a "Likelihood parameters for wickett" window (hereafter I will just refer to this as the '''Parameters window''') showing you values for all the quantities it estimated. <br />
<br />
Click on the '''HYPHY Console window''' to bring it to the foreground, then, using the scroll bar to move up if needed, answer the following questions:<br />
<div style="background-color:#eeeeff">What is the maximum log-likelihood under this unconstrained model? {{title|-4203.47237161049|answer}}</div><br />
<div style="background-color:#eeeeff">How many shared (i.e. global) parameters does HyPhy say it estimated? {{title|1|answer}}</div><br />
<div style="background-color:#eeeeff">What are these global parameters? {{title|tree topology|answer}}</div><br />
<div style="background-color:#eeeeff">How many local parameters does HyPhy say it estimated? {{title|42|answer}}</div><br />
<div style="background-color:#eeeeff">What are these local parameters?'' (Hint: for n taxa, there are 2n-3 branches) {{title|synonymous and nonsynonymous rate for each of the 21 edges for 12 taxa|answer}}</div><br />
<br />
Switch back to the Parameters window now and look at the very bottom of the window to answer these questions:<br />
<div style="background-color:#eeeeff">What is the total number of parameters estimated? {{title|43|answer}}</div><br />
<div style="background-color:#eeeeff">What is the value of AIC reported by HyPhy? {{title|8492.944743220985|answer}}</div><br />
<div style="background-color:#eeeeff">Calculate the AIC yourself using this formula: AIC = -2*lnL + 2*nparams {{title|8492.944743221|answer}}</div><br />
<br />
Before moving on, save a snapshot of the likelihood function with the current parameter values by choosing "Save LF state" from the drop-down list box at the top of the Parameters window. Choose the name "unconstrained" when asked. After saving the state of the likelihood function, choose "Select as alternative" from the same drop-down list. This will allow us to easily perform likelihood ratio tests using another, simpler model as the null model.<br />
<br />
=== Viewing the tree and obtaining information about branches ===<br />
The first item in the Parameters window should be "wickett_tree". Double-click this line to bring up a Tree window showing the tree. You may need to expand the Tree window to see the entire tree. This shows the tree with branch lengths scaled to be proportional to the expected number of substitutions (the normal way to scale branch lengths). <br />
<br />
The next step is to compare the unconstrained model (in which there are the same number of omega parameters as there are branches) with simpler models involving fewer omega parameters. For example, one model you will use in a few minutes allows the three branches in the parasite clade to evolve under one omega, while all other branches evolve under an omega value that is potentially different. For future reference, you should determine now what name HyPhy is using for the branch leading to the two parasite taxa.<br />
<br />
Click on the branch leading to the two parasites. It should turn into a dotted line. Now double-click this branch and you should get a dialog box popping up with every bit of information known about this branch:<br />
<div style="background-color: #eeeeff">What is the branch id for this branch that leads to the two parasite sequences? {{title|Node10|answer}}</div><br />
You can now close the "Branch Info" dialog box.<br />
<br />
== Computing the likelihood under the most-constrained model ==<br />
Under the current (unconstrained) model, two parameters were estimated for each branch: the synonymous substitution rate and the nonsynonymous substitution rate. Now let's constrain each branch so that the ratio (omega) between the nonsynonymous rate and the synonymous rate is identical for all branches. <br />
<br />
To do this, first notice that each branch is represented by two parameters in the Parameters window. For example, the branch leading to PARASITE_A is associated with these two parameters:<br />
wickett_tree.PARASITE_A.nonSynRate<br />
wickett_tree.PARASITE_A.synRate<br />
The goal is to constrain these two parameters so that the nonsynonymous rate is always omega times the synonymous rate, where omega is a new parameter shared by all branches.<br />
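To see why this constraint shrinks the parameter count, here is a short bookkeeping sketch (plain Python arithmetic for the 12-taxon tree used in this lab; the counts match the totals HyPhy reports before and after constraining):<br />

```python
# Parameter bookkeeping for the wickett tree (12 taxa, unrooted, so 2n - 3 branches).
n_taxa = 12
n_branches = 2 * n_taxa - 3            # 21 branches

# Unconstrained model: synRate and nonSynRate estimated separately on each branch.
unconstrained_local = 2 * n_branches   # 42 local parameters

# Most-constrained model: one synRate per branch plus a single shared omega,
# because nonSynRate := omega * synRate on every branch.
constrained_local = n_branches + 1     # 22 local parameters

# HyPhy adds one shared (global) parameter to each total, giving 43 and 23.
print(unconstrained_local + 1, constrained_local + 1)  # 43 23
```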
<br />
Select the two parameters listed above for the branch leading to PARASITE_A. (You can do this by single-clicking both parameters while holding down the Shift key.) Once you have both parameters selected, click on the third button from the left at the top of the Parameters window; this is the button decorated with the symbol for proportionality. Clicking this button will produce a long list of possibilities; here is the one you should choose:<br />
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate<br />
Once you select this option, HyPhy will ask for a name: type<br />
omega<br />
as the name of the new ratio.<br />
<br />
Now select the two parameters for a different branch, say<br />
wickett_tree.PARASITE_B.nonSynRate<br />
wickett_tree.PARASITE_B.synRate<br />
Click the proportionality constraint button again, but this time choose<br />
wickett_tree.PARASITE_B.nonSynRate:=omega*wickett_tree.PARASITE_B.synRate<br />
Note that once you have defined the ratio for one branch, you can reuse it when constraining the other branches.<br />
<br />
Continue to apply this constraint to all 19 remaining branches. When you are finished, choose ''Likelihood > Optimize'' from the menu at the top of the Parameters window.<br />
<br />
== Performing a model comparison ==<br />
<br />
After HyPhy is finished optimizing the likelihood function, answer the following questions using the numbers at the bottom of the Parameters window:<br />
<div style="background-color: #eeeeff">What is the estimated value of the omega parameter? {{title|0.0247457593714435|answer}}</div><br />
<div style="background-color: #eeeeff">Does this value of omega imply stabilizing selection, neutral evolution or positive selection? {{title|stabilizing selection|answer}}</div><br />
<div style="background-color: #eeeeff">What is the maximized log-likelihood of this (most-constrained) model? {{title|-4224.870964230792|answer}}</div><br />
<div style="background-color: #eeeeff">How many parameters are being estimated now? {{title|23|answer}}</div><br />
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy? {{title|8495.741928461584|answer}}</div><br />
<div style="background-color: #eeeeff">Does this most-constrained model fit the data better than the unconstrained model? {{title|no|answer}}</div><br />
<div style="background-color: #eeeeff">What is the difference between the log-likelihood of this (most-constrained) model and the log-likelihood of the previous (unconstrained) model? {{title|21.3986|answer}}</div><br />
<div style="background-color: #eeeeff">What is the likelihood ratio test statistic for this comparison? {{title|42.7972|answer}}</div> <br />
<div style="background-color: #eeeeff">How many degrees of freedom does this likelihood ratio test have? {{title|20|answer}}</div><br />
<div style="background-color: #eeeeff">Is the likelihood ratio test significant? (click [http://faculty.vassar.edu/lowry/tabs.html#csq here] for an online chi-square calculator) {{title|yes at significance level 0.00217427|answer}}</div> <br />
<div style="background-color: #eeeeff">Is a model in which one value of omega applies to every branch satisfactory, or is there enough variation in omega across the tree that it is necessary for each branch to have its own specific omega parameter in order to fit the data well? {{title|each branch needs its own omega|answer}}</div><br />
<div style="background-color: #eeeeff">Does AIC concur with the likelihood ratio test? (Hint: models with smaller values of AIC are preferred over models with larger AIC values.) {{title|yes, the 8492 AIC for the unconstrained model is less than the 8495 AIC for the most constrained model|answer}}</div><br />
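You can verify the likelihood ratio test arithmetic without an online calculator. A hedged sketch in standard-library Python: because the degrees of freedom here are even, the chi-square upper tail has the closed form exp(-x/2) multiplied by the partial sum of (x/2)^i/i! for i below df/2 (the log-likelihoods are the values HyPhy reported for the two models):<br />

```python
import math

# Maximized log-likelihoods reported by HyPhy
lnL_unconstrained = -4203.47237161049   # 43 parameters
lnL_constrained = -4224.870964230792    # 23 parameters

lrt = 2.0 * (lnL_unconstrained - lnL_constrained)
df = 43 - 23

# Chi-square upper tail for even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
half = lrt / 2.0
p = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(df // 2))

print(round(lrt, 4), df, p)  # ~ 42.7972 20 0.00217
```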
<br />
Although you should do the calculation yourself first, you can have HyPhy perform the likelihood ratio test for you to check your calculations. In the drop-down list box at the top of the Parameters window, choose "Save LF state" and name it "most-constrained". Next, using the same list box, choose "Select as null". Finally, perform the test by choosing "LRT" from the same drop-down list box. The results should appear in the HYPHY Console window.<br />
<br />
== Computing the likelihood under a partially-constrained model ==<br />
<br />
Let's try one more model that is intermediate between the unconstrained and most-constrained models you just analyzed. This model will allow for omega to be different in the non-green, parasitic clade compared to the remaining green, non-parasite part of the tree.<br />
<br />
For one of the three branches in the parasite clade (say, the branch leading to PARASITE_A), select the two parameters associated with the branch and click the rightmost button at the top of the Parameters window (this button releases the constraint previously placed on these two parameters). With the two parameters still selected, click the proportionality constraint button again (third from left) and choose the option<br />
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate<br />
and specify<br />
omega2<br />
as the name of the New Ratio. Now apply this new ratio to the other two branches in the clade by first releasing the existing constraint and then applying the omega2 constraint.<br />
<br />
Once you are finished, choose ''Likelihood > Optimize'' again to search for the maximum likelihood point. Now choose "Save LF state", naming this one "partially-constrained". Answer the following questions using the values shown in the Parameters window:<br />
<div style="background-color: #eeeeff">What is the maximized log-likelihood under this (partially-constrained) model? {{title|-4221.520105501849|answer}}</div><br />
<div style="background-color: #eeeeff">How many parameters were estimated? {{title|24|answer}}</div><br />
<div style="background-color: #eeeeff">What is the value of omega now? {{title|0.02294237571140109|answer}}</div><br />
<div style="background-color: #eeeeff">What is the value of omega2? {{title|0.1183733251322421|answer}}</div><br />
<div style="background-color: #eeeeff">Which is higher: omega or omega2? {{title|omega2|answer}} Does this make sense in light of what you know about the organisms involved and the function of this gene? {{title|yes, selection is expected to be closer to neutral on this gene|answer}}</div><br />
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy for this model? {{title|8491.040211003698|answer}}</div><br />
<div style="background-color: #eeeeff">Based on AIC, which of the three models tested thus far would you prefer? {{title|the partially-constrained model, which has the smallest AIC (8491.04 vs. 8492.94 and 8495.74)|answer}}</div><br />
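The three-way AIC comparison can be checked with a few lines of Python (again just illustrative arithmetic, using the log-likelihoods and parameter counts HyPhy reported in this lab):<br />

```python
# AIC (= -2*lnL + 2*k) for the three models fit in this lab, using the
# log-likelihoods and parameter counts reported by HyPhy.
models = {
    "unconstrained":         (-4203.47237161049, 43),
    "partially-constrained": (-4221.520105501849, 24),
    "most-constrained":      (-4224.870964230792, 23),
}

aic = {name: -2.0 * lnL + 2.0 * k for name, (lnL, k) in models.items()}
best = min(aic, key=aic.get)

for name in sorted(aic, key=aic.get):
    print(f"{name}: AIC = {aic[name]:.4f}")
print("preferred by AIC:", best)  # partially-constrained
```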
<br />
You can now perform a likelihood ratio test. Using the drop-down list box at the top of the Parameters window, specify the most-constrained model to be the null model and the partially-constrained model to be the alternative. Choose LRT from the drop-down list to perform the test.<br />
<div style="background-color: #eeeeff">Does the partially-constrained model fit the data significantly better than the most-constrained model? {{title|yes, at significance level 0.00963201|answer}}</div><br />
<br />
Perform one more likelihood ratio test, this time using the partially-constrained model as the null and the unconstrained model as the alternative.<br />
<div style="background-color: #eeeeff">Does the unconstrained model fit the data significantly better than the partially-constrained model? {{title|yes, at significance level 0.0102742|answer}}</div><br />
<div style="background-color: #eeeeff">Do AIC and LRT agree on which model of the three models is best? {{title|no, AIC favors the partially-constrained model, while LRT favors the unconstrained model|answer}} Why or why not? {{title|To win using AIC, a model must increase the lnL by 1 for each additional parameter. The unconstrained model increases lnL by 18 but requires 19 parameters more than the partially-constrained model. Note that the LRT favors the unconstrained model but only at the 0.01 significance level|answer}}</div><br />
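The two tests above involve degrees of freedom (19 and 1) for which the even-df shortcut does not apply, but the chi-square tail can still be computed with only the Python standard library via the regularized upper incomplete gamma function. A sketch using the modified Lentz continued fraction (a standard numerical recipe, valid here because x/2 exceeds df/2 + 1 in every test in this lab):<br />

```python
import math

def chi2_sf(x, df, itmax=200, eps=1e-12):
    """Upper tail P(X > x) for a chi-square variable: the regularized upper
    incomplete gamma function Q(df/2, x/2), evaluated with the modified Lentz
    continued fraction (accurate when x/2 > df/2 + 1)."""
    a, half = df / 2.0, x / 2.0
    tiny = 1e-300
    b = half + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, itmax + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return math.exp(-half + a * math.log(half) - math.lgamma(a)) * h

# Partially-constrained (null, 24 params) vs unconstrained (alternative, 43 params)
lrt = 2.0 * (-4203.47237161049 - (-4221.520105501849))
print(round(lrt, 4), chi2_sf(lrt, 43 - 24))   # ~ 36.0955, p ~ 0.0103

# Most-constrained (null, 23 params) vs partially-constrained (alternative, 24 params)
lrt2 = 2.0 * (-4221.520105501849 - (-4224.870964230792))
print(round(lrt2, 4), chi2_sf(lrt2, 24 - 23))  # ~ 6.7017, p ~ 0.0096
```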
<br />
<!--<br />
Summary of results:<br />
unconstrained<br />
params: 43<br />
lnL: -4203.47237 (*** best ***)<br />
AIC: 8492.94474 (middle)<br />
<br />
partially-constrained<br />
params: 24<br />
lnL: -4221.52011 (middle)<br />
AIC: 8491.04021 (*** best ***)<br />
<br />
most-constrained<br />
params: 23<br />
lnL: -4224.87096 (worst)<br />
AIC: 8495.74193 (worst)<br />
--><br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41249Phylogenetics: HyPhy Lab2020-02-24T14:18:59Z<p>Paul Lewis: /* Login to Xanadu */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you to do some interesting and useful things that these programs cannot, such as allow the model of evolution to change across a tree.<br />
<br />
== Obtaining the sequences ==<br />
<br />
A Nexus data file containing sequences and a tree is located here: [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.nex wickett.nex]. This dataset was assembled by former UConn EEB graduate student [http://www.chicagobotanic.org/research/staff/wickett Norm Wickett] and contains several sequences of bryophytes, including two from a parasitic bryophyte (the liverwort <em>Aneura mirabilis</em>) that is non-green and does not photosynthesize. Today's lab will recreate the type of analysis Norm carried out in [http://dx.doi.org/10.1007/s00239-008-9133-1 his 2008 paper in Journal of Molecular Evolution (67:111-122)].<br />
<br />
The sequences are of a gene important for photosynthesis. The basic idea behind today's lab is to see if we can detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this).<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine as usual:<br />
srun --pty -p mcbstudent --qos=mcbstudent bash<br />
<br />
== HyPhy ==<br />
<br />
I have requested that the latest version of [HyPhy http://www.hyphy.org] be installed on the Xanadu cluster, but that hasn't happened yet. I have placed the executable file in the scratch folder. You should create a <tt>bin</tt> directory in your home directory (if you haven't already done so) and copy the hyphy "binary" file there:<br />
cd<br />
mkdir bin<br />
cp /scratch/phylogenetics/hyphy ~/bin<br />
<br />
== Loading modules needed ==<br />
<br />
Load the modules needed for this exercise:<br />
module load gcc/6.4.0<br />
<br />
== Loading data into HyPhy ==<br />
<br />
Start HyPhy and dismiss the "Welcome to HyPhy" dialog box (if it appears) by pressing the Ok button. Choose ''File > Open > Open Data File'', then navigate to and select the <tt>wickett.nex</tt> data file that you saved previously. You should now see the sequences appear in a window entitled "DataSet wickett". I will refer to this as the '''Data window''' from this point on.<br />
<br />
== Creating a partition ==<br />
<br />
HyPhy thinks of your data as being composed of one or more '''partitions'''. Partitioning data means assigning characters (sites) into mutually-exclusive groups. For example, suppose your data set comprises two genes: you might want to assign a separate model for each gene, so in this case you would create two partitions (one for each gene). <br />
<br />
=== The word partition is used in two ways ===<br />
The word partition is ambiguous: it formerly meant "wall" or "divider" but, with the advent of computer hard drives, it has also come to mean the space ''between'' the walls or dividers. When someone says they ''partitioned their data'', they mean that they erected dividers, for example between the rbcL and 18S genes. When someone says they ''applied a GTR+I+G model to the rbcL partition'', they have now switched to using the word partition to mean the sites on the rbcL side of the divider.<br />
<br />
=== No partitioning implies one partition! ===<br />
Even if you choose to '''not''' partition (old meaning) your data in HyPhy, you must go through the motions of creating a single partition (new meaning) because HyPhy only allows you to apply a model to a partition. To create a single partition containing all of your sites, choose ''Edit > Select All'' from the Data window menu, then choose ''Data > Selection->Partition'' to assign all the selected sites to a new partition. You should see a line appear below your sequences with a partition name "wickett_part".<br />
<br />
=== Assign a data type to your partition ===<br />
Now that you have a partition, you can create a model for it. Under the column name ''Partition Type'', choose ''codon'' (just press the Ok button in the dialog box that appears). You have now chosen to view your data as codons (i.e. three nucleotides at a time) rather than as single nucleotides. The third possible choice for Partition Type is ''Di-nucl.'', which you would use if you were planning to use a secondary structure (i.e. stem) model, which treats each sequential pair of nucleotides as a state.<br />
<br />
=== Assign a tree topology to your partition ===<br />
Under Tree Topology, you have several options. Because a tree topology was defined in the <tt>wickett.nex</tt> data file, this tree topology shows up in the drop-down list as <tt>wickett_tree</tt>. Choose <tt>wickett_tree</tt> as the tree topology for your partition.<br />
<br />
=== Assign a substitution model to your partition ===<br />
The only substitution models that show up in the drop-down list are codon models because earlier you chose to treat your data as codon sequences rather than nucleotide sequences. The substitution model you should use is ''MG94xHKY85_3x4''. This model is like the Muse and Gaut (1994) codon model, which is the only codon model I discussed in lecture. You will remember (I'm sure) that the MG94 model allows substitutions to be either synonymous or non-synonymous, but does not make a distinction between transitions and transversions. The HKY85 model distinguishes between transitions and transversions (remember kappa?), but does not distinguish between synonymous and non-synonymous substitutions. Thus, MG94xHKY85 is a hybrid model that allows all four possibilities: synonymous transitions, synonymous transversions, nonsynonymous transitions and nonsynonymous transversions. The name is nevertheless a bit puzzling because (as you will find out in a few minutes) it actually behaves more like the GTR model than the HKY model in that it allows all 6 possible types of substitutions (A<->C, A<->G, A<->T, C<->G, C<->T and G<->T) to have their own rates.<br />
<br />
The 3x4 part on the end of the name means that the 61 codon frequencies are obtained by multiplying together the four nucleotide frequencies that are estimated separately for the three codon positions. Thus, the frequency for the AGT codon is obtained by multiplying together these three quantities:<br />
* the frequency of A nucleotides at first positions<br />
* the frequency of G nucleotides at second positions<br />
* the frequency of T nucleotides at third positions<br />
(Note: HyPhy corrects these for the fact that the three stop codons are not included.)<br />
This involves estimating the '''4''' nucleotides frequencies at each of the '''3''' codon positions, hence the '''3x4''' in the name.<br />
<br />
=== Local vs. global ===<br />
You have only a couple more decisions to make before calculating the likelihood. You must choose Local or Global from the Parameters drop-down list. '''Local''' means that HyPhy will estimate some substitution model parameters for every branch in the tree. '''Global''' means that all substitution model parameters will apply to the entire tree. In all the models discussed thus far in the course, we were effectively using the global option except for the branch lengths themselves, which are always local parameters (it doesn't usually make any sense to think of every branch having the same length).<br />
<br />
Tell HyPhy to use the Local option (this should already be set correctly).<br />
<br />
=== Equilibrium frequencies===<br />
You should also leave the equilibrium frequencies set to "Partition". This sets the equilibrium base frequencies to the empirical values (i.e. the frequency of A is the number of As observed in the entire partition divided by the total number of nucleotides in the partition). Other options include:<br />
* Dataset, which would not be different than "Partition" in this case where there is only one partition defined, <br />
* Equal, which sets all base frequencies equal to 0.25, and<br />
* Estimate, which estimates the base frequencies<br />
<br />
== Computing the likelihood under a local codon model ==<br />
<br />
You are now ready to compute the maximum likelihood estimates of the parameters in your model. Choose ''Likelihood > Build Function'' to build a likelihood function, then ''Likelihood > Optimize'' to optimize the likelihood function (i.e. search for the highest point on the likelihood surface, thus obtaining maximum likelihood estimates of all parameters).<br />
<br />
=== Saving the results ===<br />
When HyPhy has finished optimizing (this will take several seconds to several minutes, depending on the speed of the computer you are using), it will pop up a "Likelihood parameters for wickett" window (hereafter I will just refer to this as the '''Parameters window''') showing you values for all the quantities it estimated. <br />
<br />
Click on the '''HYPHY Console window''' to bring it to the foreground, then, using the scroll bar to move up if needed, answer the following questions:<br />
<div style="background-color:#eeeeff">What is the maximum log-likelihood under this unconstrained model? {{title|-4203.47237161049|answer}}</div><br />
<div style="background-color:#eeeeff">How many shared (i.e. global) parameters does HyPhy say it estimated? {{title|1|answer}}</div><br />
<div style="background-color:#eeeeff">What are these global parameters? {{title|tree topology|answer}}</div><br />
<div style="background-color:#eeeeff">How many local parameters does HyPhy say it estimated? {{title|42|answer}}</div><br />
<div style="background-color:#eeeeff">What are these local parameters?'' (Hint: for n taxa, there are 2n-3 branches) {{title|synonymous and nonsynonymous rate for each of the 21 edges for 12 taxa|answer}}</div><br />
<br />
Switch back to the Parameters window now and look at the very bottom of the window to answer these questions:<br />
<div style="background-color:#eeeeff">What is the total number of parameters estimated? {{title|43|answer}}</div><br />
<div style="background-color:#eeeeff">What is the value of AIC reported by HyPhy? {{title|8492.944743220985|answer}}</div><br />
<div style="background-color:#eeeeff">Calculate the AIC yourself using this formula: AIC = -2*lnL + 2*nparams {{title|8492.944743221|answer}}</div><br />
<br />
Before moving on, save a snapshot of the likelihood function with the current parameter values by choosing "Save LF state" from the drop-down list box at the top of the Parameters window. Choose the name "unconstrained" when asked. After saving the state of the likelihood function, choose "Select as alternative" from the same drop-down list. This will allow us to easily perform likelihood ratio tests using another, simpler model as the null model.<br />
<br />
=== Viewing the tree and obtaining information about branches ===<br />
The first item in the Parameters window should be "wickett_tree". Double-click this line to bring up a Tree window showing the tree. You may need to expand the Tree window to see the entire tree. This shows the tree with branch lengths scaled to be proportional to the expected number of substitutions (the normal way to scale branch lengths). <br />
<br />
The next step is to compare the unconstrained model (in which there are the same number of omega parameters as there are branches) with simpler models involving fewer omega parameters. For example, one model you will use in a few minutes allows the three branches in the parasite clade to evolve under one omega, while all other branches evolve under an omega value that is potentially different. For future reference, you should determine now what name HyPhy is using for the branch leading to the two parasite taxa.<br />
<br />
Click on the branch leading to the two parasites. It should turn into a dotted line. Now double-click this branch and you should get a dialog box popping up with every bit of information known about this branch:<br />
<div style="background-color: #eeeeff">What is the branch id for this branch that leads to the two parasite sequences? {{title|Node10|answer}}</div><br />
You can now close the "Branch Info" dialog box.<br />
<br />
== Computing the likelihood under the most-constrained model ==<br />
Under the current (unconstrained) model, two parameters were estimated for each branch: the synonymous substitution rate and the nonsynonymous substitution rate. Now let's constrain each branch so that the ratio (omega) between the nonsynonymous rate and the synonymous rate is identical for all branches. <br />
<br />
To do this, first notice that each branch is represented by two parameters in the Parameter window. For example, the branch leading to Parasite_A is associated with these two parameters:<br />
wickett_tree.PARASITE_A.nonSynRate<br />
wickett_tree.PARASITE_A.synRate<br />
The goal is to constrain these two parameters so that the nonsynonymous rate is always omega times the synonymous rate, where omega is a new parameter shared by all branches.<br />
<br />
Select the two parameters listed above for the branch leading to PARASITE_A. (You can do this by single-clicking both parameters while simultaneously holding down the Shift key.) Once you have both parameters selected, click on the third button from the left at the top of the Parameters window. This is the button decorated with the symbol for proportionality. Clicking this button will produce a long list of possiblities: here is the one you should choose:<br />
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate<br />
Once you select this option, HyPhy will ask for a name: type<br />
omega<br />
as the name of the new ratio.<br />
<br />
Now select the two parameters for a different pair of branches, say <br />
wickett_tree.PARASITE_B.nonSynRate<br />
wickett_tree.PARASITE_B.synRate<br />
Click the proportionality constraint button again, but this time choose<br />
wickett_tree.PARASITE_B.nonSynRate:=omega*wickett_tree.PARASITE_B.synRate<br />
Note that you can choose to use a constraint for other branches once you have defined it for one branch.<br />
<br />
Continue to apply this constraint to all 19 remaining branches. When you are finished, choose ''Likelihood > Optimize'' from the menu at the top of the Parameters window.<br />
<br />
== Performing a model comparison ==<br />
<br />
After HyPhy is finished optimizing the likelihood function, answer the following questions using the numbers at the bottom of the Parameters window:<br />
<div style="background-color: #eeeeff">What is the estimated value of the omega parameter? {{title|0.0247457593714435|answer}}</div><br />
<div style="background-color: #eeeeff">Does this value of omega imply stabilizing selection, neutral evolution or positive selection? {{title|stabilizing selection|answer}}</div><br />
<div style="background-color: #eeeeff">What is the maximized log-likelihood of this (most-constrained) model? {{title|-4224.870964230792|answer}}</div><br />
<div style="background-color: #eeeeff">How many parameters are being estimated now? {{title|23|answer}}</div><br />
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy? {{title|8495.741928461584|answer}}</div><br />
<div style="background-color: #eeeeff">Does this most-constrained model fit the data better than the unconstrained model? {{title|no|answer}}</div><br />
<div style="background-color: #eeeeff">What is the difference between the log-likelihood of this (most-constrained) model and the log-likelihood of the previous (unconstrained) model? {{title|21.3986|answer}}</div><br />
<div style="background-color: #eeeeff">What is the likelihood ratio test statistic for this comparison? {{title|42.7972|answer}}</div> <br />
<div style="background-color: #eeeeff">How many degrees of freedom does this likelihood ratio test have? {{title|20|answer}}</div><br />
<div style="background-color: #eeeeff">Is the likelihood ratio test significant? (click [http://faculty.vassar.edu/lowry/tabs.html#csq here] for an online chi-square calculator) {{title|yes at significance level 0.00217427|answer}}</div> <br />
<div style="background-color: #eeeeff">Is a model in which one value of omega applies to every branch satisfactory, or is there enough variation in omega across the tree that it is necessary for each branch to have its own specific omega parameter in order to fit the data well? {{title|each branch needs its own omega|answer}}</div><br />
<div style="background-color: #eeeeff">Does AIC concur with the likelihood ratio test? (Hint: models with smaller values of AIC are preferred over models with larger AIC values.) {{title|yes, the 8492 AIC for the unconstrained model is less than the 8495 AIC for the most constrained model|answer}}</div><br />
<br />
Although you should do the calculation yourself first, you can now have HyPhy perform the likelihood ratio test for you to check your calculations. In the drop-down list box at the top of the Parameters window, choose "Save LF state" and name it "most-constrained". Now, using the same list box, choose "Select as null". Now perform the test by choosing LRT from the same drop-down list box. The results should appear in the HYPHY Console window.<br />
<br />
== Computing the likelihood under a partially-constrained model ==<br />
<br />
Let's try one more model that is intermediate between the unconstrained and most-constrained models you just analyzed. This model will allow for omega to be different in the non-green, parasitic clade compared to the remaining green, non-parasite part of the tree.<br />
<br />
For one of the three branches in the parasite clade (say, the branch leading to PARASITE_A), select the two parameters associated with the branch and click the rightmost button at the top of the Parameters window (this button releases the constraint previously placed on these two parameters). With the two parameters still selected, click the proportionality constraint button again (third from left) and choose the option<br />
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate<br />
and specify<br />
omega2<br />
as the name of the New Ratio. Now apply this new ratio to the other two branches in the clade by first releasing the existing constraint and then applying the omega2 constraint.<br />
<br />
Once you are finished, choose ''Likelihood > Optimize'' again to search for the maximum likelihood point. Now choose "Save LF state", naming this one "partially-constrained". Answer the following questions using the values shown in the Parameter window:<br />
<div style="background-color: #eeeeff">What is the maximized log-likelihood under this (partially-constrained) model? {{title|-4221.520105501849|answer}}</div><br />
<div style="background-color: #eeeeff">How many parameters were estimated? {{title|24|answer}}</div><br />
<div style="background-color: #eeeeff">What is the value of omega now? {{title|0.02294237571140109|answer}}</div><br />
<div style="background-color: #eeeeff">What is the value of omega2? {{title|0.1183733251322421|answer}}</div><br />
<div style="background-color: #eeeeff">Which is higher: omega or omega2? {{title|omega2|answer}} Does this make sense in light of what you know about the organisms involved and the function of this gene? {{title|yes, selection is expected to be closer to neutral on this gene|answer}}</div><br />
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy for this model? {{title|8491.040211003698|answer}}</div><br />
<div style="background-color: #eeeeff">Based on AIC, which of the three models tested thus far would you prefer? {{title|xxxx|answer}}</div><br />
<br />
You can now perform a likelihood ratio test. Using the drop-down list box at the top of the Parameters window, specify the most-constrained model to be the null model and the partially-constrained model to be the alternative. Choose LRT from the drop-down list to perform the test.<br />
<div style="background-color: #eeeeff">Does the partially-constrained model fit the data significantly better than the most-constrained model? {{title|yes, at significance level 0.00963201|answer}}</div><br />
<br />
Perform one more likelihood ratio test, this time using the partially-constrained model as the null and the unconstrained model as the alternative.<br />
<div style="background-color: #eeeeff">Does the unconstrained model fit the data significantly better than the partially-constrained model? {{title|yes, at significance level 0.0102742|answer}}</div><br />
<div style="background-color: #eeeeff">Do AIC and LRT agree on which model of the three models is best? {{title|no, AIC favors the partially-constrained model, while LRT favors the unconstrained model|answer}} Why or why not? {{title|To win using AIC, a model must increase the lnL by 1 for each additional parameter. The unconstrained model increases lnL by 18 but requires 19 parameters more than the partially-constrained model. Note that the LRT favors the unconstrained model but only at the 0.01 significance level|answer}}</div><br />
<br />
<!--<br />
Summary of results:<br />
unconstrained<br />
params: 43<br />
lnL: -4203.47237 (*** best ***)<br />
AIC: 8492.94474 (middle)<br />
<br />
partially-constrained<br />
params: 24<br />
lnL: -4221.52011 (middle)<br />
AIC: 8491.04021 (*** best ***)<br />
<br />
most-constrained<br />
params: 23<br />
lnL: -4224.87096 (worst)<br />
AIC: 8495.74193 (worst)<br />
--><br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_HyPhy_Lab&diff=41246Phylogenetics: HyPhy Lab2020-02-24T03:03:07Z<p>Paul Lewis: </p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
== Goal ==<br />
<br />
The goal of this lab exercise is to show you how to use the [http://www.hyphy.org/ HyPhy] program for data exploration and hypothesis testing within a maximum likelihood framework. Although much can be done with PAUP* and IQ-TREE, HyPhy lets you do some interesting and useful things that these programs cannot, such as allowing the model of evolution to change across a tree.<br />
<br />
== Obtaining the sequences ==<br />
<br />
A Nexus data file containing sequences and a tree is located here: [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/wickett.nex wickett.nex]. This dataset was assembled by former UConn EEB graduate student [http://www.chicagobotanic.org/research/staff/wickett Norm Wickett] and contains several sequences of bryophytes, including two from a parasitic bryophyte (the liverwort <em>Aneura mirabilis</em>) that is non-green and does not photosynthesize. Today's lab will recreate the type of analysis Norm carried out in [http://dx.doi.org/10.1007/s00239-008-9133-1 his 2008 paper in Journal of Molecular Evolution (67:111-122)].<br />
<br />
The sequences are of a gene important for photosynthesis. The basic idea behind today's lab is to see if we can detect shifts in the evolution of these sequences at the point where these organisms became non-photosynthetic (thus presumably no longer needing genes like this).<br />
<br />
== Login to Xanadu ==<br />
<br />
Login to Xanadu and request a machine that has 4G memory:<br />
srun --pty -p mcbstudent --qos=mcbstudent --mem=4G bash<br />
<br />
== HyPhy ==<br />
<br />
I have requested that the latest version of [http://www.hyphy.org HyPhy] be installed on the Xanadu cluster, but that hasn't happened yet. I have placed the executable file in the scratch folder. You should create a <tt>bin</tt> directory in your home directory (if you haven't already done so) and copy the hyphy "binary" file there:<br />
cd<br />
mkdir bin<br />
cp /scratch/phylogenetics/hyphy ~/bin<br />
<br />
== Loading modules needed ==<br />
<br />
Load the module needed for this exercise:<br />
module load gcc/6.4.0<br />
<br />
== Loading data into HyPhy ==<br />
<br />
Start HyPhy and dismiss the "Welcome to HyPhy" dialog box (if it appears) by pressing the Ok button. Choose ''File > Open > Open Data File'', then navigate to and select the <tt>wickett.nex</tt> data file that you saved previously. You should now see the sequences appear in a window entitled "DataSet wickett". I will refer to this as the '''Data window''' from this point on.<br />
<br />
== Creating a partition ==<br />
<br />
HyPhy thinks of your data as being composed of one or more '''partitions'''. Partitioning data means assigning characters (sites) into mutually-exclusive groups. For example, suppose your data set comprises two genes: you might want to assign a separate model for each gene, so in this case you would create two partitions (one for each gene). <br />
<br />
=== The word partition is used in two ways ===<br />
The word partition is ambiguous: it formerly meant "wall" or "divider" but, with the advent of computer hard drives, it has also come to mean the space ''between'' the walls or dividers. When someone says they ''partitioned their data'', they mean that they erected dividers, for example between the rbcL and 18S genes. When someone says they ''applied a GTR+I+G model to the rbcL partition'', they have now switched to using the word partition to mean the sites on the rbcL side of the divider.<br />
<br />
=== No partitioning implies one partition! ===<br />
Even if you choose to '''not''' partition (old meaning) your data in HyPhy, you must go through the motions of creating a single partition (new meaning) because HyPhy only allows you to apply a model to a partition. To create a single partition containing all of your sites, choose ''Edit > Select All'' from the Data window menu, then choose ''Data > Selection->Partition'' to assign all the selected sites to a new partition. You should see a line appear below your sequences with a partition name "wickett_part".<br />
<br />
=== Assign a data type to your partition ===<br />
Now that you have a partition, you can create a model for it. Under the column name ''Partition Type'', choose ''codon'' (just press the Ok button in the dialog box that appears). You have now chosen to view your data as codons (i.e. three nucleotides at a time) rather than as single nucleotides. The third possible choice for Partition Type is ''Di-nucl.'', which you would use if you were planning to use a secondary structure (i.e. stem) model, which treats each sequential pair of nucleotides as a state.<br />
<br />
=== Assign a tree topology to your partition ===<br />
Under Tree Topology, you have several options. Because a tree topology was defined in the <tt>wickett.nex</tt> data file, this tree topology shows up in the drop-down list as <tt>wickett_tree</tt>. Choose <tt>wickett_tree</tt> as the tree topology for your partition.<br />
<br />
=== Assign a substitution model to your partition ===<br />
The only substitution models that show up in the drop-down list are codon models because earlier you chose to treat your data as codon sequences rather than nucleotide sequences. The substitution model you should use is ''MG94xHKY85_3x4''. This model is like the Muse and Gaut (1994) codon model, which is the only codon model I discussed in lecture. You will remember (I'm sure) that the MG94 model allows substitutions to be either synonymous or non-synonymous, but does not make a distinction between transitions and transversions. The HKY85 model distinguishes between transitions and transversions (remember kappa?), but does not distinguish between synonymous and non-synonymous substitutions. Thus, MG94xHKY85 is a hybrid model that allows all four possibilities: synonymous transitions, synonymous transversions, nonsynonymous transitions and nonsynonymous transversions. The name is nevertheless a bit puzzling because (as you will find out in a few minutes) it actually behaves more like the GTR model than the HKY model in that it allows all 6 possible types of substitutions (A<->C, A<->G, A<->T, C<->G, C<->T and G<->T) to have their own rates.<br />
<br />
The 3x4 part on the end of the name means that the 61 codon frequencies are obtained by multiplying together the four nucleotide frequencies that are estimated separately for the three codon positions. Thus, the frequency for the AGT codon is obtained by multiplying together these three quantities:<br />
* the frequency of A nucleotides at first positions<br />
* the frequency of G nucleotides at second positions<br />
* the frequency of T nucleotides at third positions<br />
(Note: HyPhy corrects these for the fact that the three stop codons are not included.)<br />
This involves estimating the '''4''' nucleotide frequencies at each of the '''3''' codon positions, hence the '''3x4''' in the name.<br />
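The 3x4 computation is easy to sketch in Python. The position-specific frequencies below are made-up illustration values, not estimates from <tt>wickett.nex</tt>:<br />
<br />
```python
from itertools import product

# Position-specific nucleotide frequencies (hypothetical values for
# illustration; HyPhy estimates these empirically from the alignment).
freqs = [
    {"A": 0.30, "C": 0.20, "G": 0.25, "T": 0.25},  # codon position 1
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # codon position 2
    {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # codon position 3
]
stop_codons = {"TAA", "TAG", "TGA"}

# Unnormalized product for every codon, then drop the three stop codons and
# renormalize so the 61 sense-codon frequencies sum to 1 (the correction
# mentioned above).
raw = {
    "".join(c): freqs[0][c[0]] * freqs[1][c[1]] * freqs[2][c[2]]
    for c in product("ACGT", repeat=3)
}
sense_total = sum(f for codon, f in raw.items() if codon not in stop_codons)
codon_freqs = {
    codon: f / sense_total for codon, f in raw.items() if codon not in stop_codons
}

print(len(codon_freqs))  # 61 sense codons
print(sum(codon_freqs.values()))
```
<br />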
<br />
=== Local vs. global ===<br />
You have only a couple more decisions to make before calculating the likelihood. You must choose Local or Global from the Parameters drop-down list. '''Local''' means that HyPhy will estimate some substitution model parameters for every branch in the tree. '''Global''' means that all substitution model parameters will apply to the entire tree. In all the models discussed thus far in the course, we were effectively using the global option except for the branch lengths themselves, which are always local parameters (it doesn't usually make any sense to think of every branch having the same length).<br />
<br />
Tell HyPhy to use the Local option (this should already be set correctly).<br />
<br />
=== Equilibrium frequencies===<br />
You should also leave the equilibrium frequencies set to "Partition". This sets the equilibrium base frequencies to the empirical values (i.e. the frequency of A is the number of As observed in the entire partition divided by the total number of nucleotides in the partition). Other options include:<br />
* Dataset, which would be no different from "Partition" in this case, since only one partition is defined, <br />
* Equal, which sets all base frequencies equal to 0.25, and<br />
* Estimate, which estimates the base frequencies<br />
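For concreteness, here is how empirical ("Partition") frequencies are computed, sketched in Python with made-up sequences:<br />
<br />
```python
from collections import Counter

# Toy alignment (hypothetical sequences, no gaps). The empirical frequency
# of each base is simply its observed count divided by the total number of
# nucleotides in the partition.
sequences = ["ACGTACGT", "ACGAACGT", "TCGTACGA"]
counts = Counter("".join(sequences))
total = sum(counts.values())
empirical = {base: counts[base] / total for base in "ACGT"}
print(empirical)
```
<br />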
<br />
== Computing the likelihood under a local codon model ==<br />
<br />
You are now ready to compute the maximum likelihood estimates of the parameters in your model. Choose ''Likelihood > Build Function'' to build a likelihood function, then ''Likelihood > Optimize'' to optimize the likelihood function (i.e. search for the highest point on the likelihood surface, thus obtaining maximum likelihood estimates of all parameters).<br />
<br />
=== Saving the results ===<br />
When HyPhy has finished optimizing (this will take several seconds to several minutes, depending on the speed of the computer you are using), it will pop up a "Likelihood parameters for wickett" window (hereafter I will just refer to this as the '''Parameters window''') showing you values for all the quantities it estimated. <br />
<br />
Click on the '''HYPHY Console window''' to bring it to the foreground, then, using the scroll bar to move up if needed, answer the following questions:<br />
<div style="background-color:#eeeeff">What is the maximum log-likelihood under this unconstrained model? {{title|-4203.47237161049|answer}}</div><br />
<div style="background-color:#eeeeff">How many shared (i.e. global) parameters does HyPhy say it estimated? {{title|1|answer}}</div><br />
<div style="background-color:#eeeeff">What are these global parameters? {{title|tree topology|answer}}</div><br />
<div style="background-color:#eeeeff">How many local parameters does HyPhy say it estimated? {{title|42|answer}}</div><br />
<div style="background-color:#eeeeff">What are these local parameters? (Hint: for n taxa, there are 2n-3 branches) {{title|the synonymous and nonsynonymous rates for each of the 21 edges of this 12-taxon tree|answer}}</div><br />
<br />
Switch back to the Parameters window now and look at the very bottom of the window to answer these questions:<br />
<div style="background-color:#eeeeff">What is the total number of parameters estimated? {{title|43|answer}}</div><br />
<div style="background-color:#eeeeff">What is the value of AIC reported by HyPhy? {{title|8492.944743220985|answer}}</div><br />
<div style="background-color:#eeeeff">Calculate the AIC yourself using this formula: AIC = -2*lnL + 2*nparams {{title|8492.944743221|answer}}</div><br />
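You can check that AIC formula in Python (a sketch, not part of the lab; the lnL and parameter count are the values HyPhy reports above):<br />
<br />
```python
# AIC = -2 * lnL + 2 * (number of free parameters), using the values from
# the unconstrained MG94xHKY85 analysis above.
lnL = -4203.47237161049
nparams = 43
aic = -2.0 * lnL + 2.0 * nparams
print(aic)  # 8492.944743..., matching HyPhy's reported AIC
```
<br />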
<br />
Before moving on, save a snapshot of the likelihood function with the current parameter values by choosing "Save LF state" from the drop-down list box at the top of the Parameters window. Choose the name "unconstrained" when asked. After saving the state of the likelihood function, choose "Select as alternative" from the same drop-down list. This will allow us to easily perform likelihood ratio tests using another, simpler model as the null model.<br />
<br />
=== Viewing the tree and obtaining information about branches ===<br />
The first item in the Parameters window should be "wickett_tree". Double-click this line to bring up a Tree window showing the tree. You may need to expand the Tree window to see the entire tree. This shows the tree with branch lengths scaled to be proportional to the expected number of substitutions (the normal way to scale branch lengths). <br />
<br />
The next step is to compare the unconstrained model (in which there are the same number of omega parameters as there are branches) with simpler models involving fewer omega parameters. For example, one model you will use in a few minutes allows the three branches in the parasite clade to evolve under one omega, while all other branches evolve under an omega value that is potentially different. For future reference, you should determine now what name HyPhy is using for the branch leading to the two parasite taxa.<br />
<br />
Click on the branch leading to the two parasites. It should turn into a dotted line. Now double-click this branch and you should get a dialog box popping up with every bit of information known about this branch:<br />
<div style="background-color: #eeeeff">What is the branch id for this branch that leads to the two parasite sequences? {{title|Node10|answer}}</div><br />
You can now close the "Branch Info" dialog box.<br />
<br />
== Computing the likelihood under the most-constrained model ==<br />
Under the current (unconstrained) model, two parameters were estimated for each branch: the synonymous substitution rate and the nonsynonymous substitution rate. Now let's constrain each branch so that the ratio (omega) between the nonsynonymous rate and the synonymous rate is identical for all branches. <br />
<br />
To do this, first notice that each branch is represented by two parameters in the Parameter window. For example, the branch leading to Parasite_A is associated with these two parameters:<br />
wickett_tree.PARASITE_A.nonSynRate<br />
wickett_tree.PARASITE_A.synRate<br />
The goal is to constrain these two parameters so that the nonsynonymous rate is always omega times the synonymous rate, where omega is a new parameter shared by all branches.<br />
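A quick bit of parameter bookkeeping shows what this constraint buys (a sketch of the counting, not HyPhy output; the "+ 1" in each total is the single shared parameter HyPhy reports):<br />
<br />
```python
# 12 taxa, so an unrooted tree has 2n - 3 = 21 branches.
n_taxa = 12
n_branches = 2 * n_taxa - 3

# Unconstrained: synRate and nonSynRate for every branch, plus 1 shared parameter.
unconstrained = 2 * n_branches + 1

# Fully constrained: one synRate per branch, one shared omega, plus 1 shared parameter.
most_constrained = n_branches + 1 + 1

print(unconstrained, most_constrained)  # 43 and 23
```
<br />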
<br />
Select the two parameters listed above for the branch leading to PARASITE_A. (You can do this by single-clicking both parameters while simultaneously holding down the Shift key.) Once you have both parameters selected, click on the third button from the left at the top of the Parameters window. This is the button decorated with the symbol for proportionality. Clicking this button will produce a long list of possibilities: here is the one you should choose:<br />
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate<br />
Once you select this option, HyPhy will ask for a name: type<br />
omega<br />
as the name of the new ratio.<br />
<br />
Now select the two parameters for a different pair of branches, say <br />
wickett_tree.PARASITE_B.nonSynRate<br />
wickett_tree.PARASITE_B.synRate<br />
Click the proportionality constraint button again, but this time choose<br />
wickett_tree.PARASITE_B.nonSynRate:=omega*wickett_tree.PARASITE_B.synRate<br />
Note that you can choose to use a constraint for other branches once you have defined it for one branch.<br />
<br />
Continue to apply this constraint to all 19 remaining branches. When you are finished, choose ''Likelihood > Optimize'' from the menu at the top of the Parameters window.<br />
<br />
== Performing a model comparison ==<br />
<br />
After HyPhy is finished optimizing the likelihood function, answer the following questions using the numbers at the bottom of the Parameters window:<br />
<div style="background-color: #eeeeff">What is the estimated value of the omega parameter? {{title|0.0247457593714435|answer}}</div><br />
<div style="background-color: #eeeeff">Does this value of omega imply stabilizing selection, neutral evolution or positive selection? {{title|stabilizing selection|answer}}</div><br />
<div style="background-color: #eeeeff">What is the maximized log-likelihood of this (most-constrained) model? {{title|-4224.870964230792|answer}}</div><br />
<div style="background-color: #eeeeff">How many parameters are being estimated now? {{title|23|answer}}</div><br />
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy? {{title|8495.741928461584|answer}}</div><br />
<div style="background-color: #eeeeff">Does this most-constrained model fit the data better than the unconstrained model? {{title|no|answer}}</div><br />
<div style="background-color: #eeeeff">What is the difference between the log-likelihood of this (most-constrained) model and the log-likelihood of the previous (unconstrained) model? {{title|21.3986|answer}}</div><br />
<div style="background-color: #eeeeff">What is the likelihood ratio test statistic for this comparison? {{title|42.7972|answer}}</div> <br />
<div style="background-color: #eeeeff">How many degrees of freedom does this likelihood ratio test have? {{title|20|answer}}</div><br />
<div style="background-color: #eeeeff">Is the likelihood ratio test significant? (click [http://faculty.vassar.edu/lowry/tabs.html#csq here] for an online chi-square calculator) {{title|yes at significance level 0.00217427|answer}}</div> <br />
<div style="background-color: #eeeeff">Is a model in which one value of omega applies to every branch satisfactory, or is there enough variation in omega across the tree that it is necessary for each branch to have its own specific omega parameter in order to fit the data well? {{title|each branch needs its own omega|answer}}</div><br />
<div style="background-color: #eeeeff">Does AIC concur with the likelihood ratio test? (Hint: models with smaller values of AIC are preferred over models with larger AIC values.) {{title|yes, the 8492 AIC for the unconstrained model is less than the 8495 AIC for the most constrained model|answer}}</div><br />
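If you would rather compute the chi-square p-value yourself instead of using the online calculator, the chi-square survival function has a simple closed form when the degrees of freedom are even. A Python sketch using the log-likelihoods reported above:<br />
<br />
```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid only for even df."""
    assert df % 2 == 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

lnL_unconstrained = -4203.47237161049
lnL_most_constrained = -4224.870964230792
lrt = 2.0 * (lnL_unconstrained - lnL_most_constrained)  # 42.7972...
df = 43 - 23                                            # difference in parameter counts
p = chi2_sf_even_df(lrt, df)
print(lrt, df, p)  # p is about 0.0022, so the test is significant at the 0.05 level
```
<br />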
<br />
Although you should do the calculation yourself first, you can now have HyPhy perform the likelihood ratio test for you to check your calculations. In the drop-down list box at the top of the Parameters window, choose "Save LF state" and name it "most-constrained". Now, using the same list box, choose "Select as null". Now perform the test by choosing LRT from the same drop-down list box. The results should appear in the HYPHY Console window.<br />
<br />
== Computing the likelihood under a partially-constrained model ==<br />
<br />
Let's try one more model that is intermediate between the unconstrained and most-constrained models you just analyzed. This model will allow for omega to be different in the non-green, parasitic clade compared to the remaining green, non-parasite part of the tree.<br />
<br />
For one of the three branches in the parasite clade (say, the branch leading to PARASITE_A), select the two parameters associated with the branch and click the rightmost button at the top of the Parameters window (this button releases the constraint previously placed on these two parameters). With the two parameters still selected, click the proportionality constraint button again (third from left) and choose the option<br />
wickett_tree.PARASITE_A.nonSynRate:={New Ratio}*wickett_tree.PARASITE_A.synRate<br />
and specify<br />
omega2<br />
as the name of the New Ratio. Now apply this new ratio to the other two branches in the clade by first releasing the existing constraint and then applying the omega2 constraint.<br />
<br />
Once you are finished, choose ''Likelihood > Optimize'' again to search for the maximum likelihood point. Now choose "Save LF state", naming this one "partially-constrained". Answer the following questions using the values shown in the Parameter window:<br />
<div style="background-color: #eeeeff">What is the maximized log-likelihood under this (partially-constrained) model? {{title|-4221.520105501849|answer}}</div><br />
<div style="background-color: #eeeeff">How many parameters were estimated? {{title|24|answer}}</div><br />
<div style="background-color: #eeeeff">What is the value of omega now? {{title|0.02294237571140109|answer}}</div><br />
<div style="background-color: #eeeeff">What is the value of omega2? {{title|0.1183733251322421|answer}}</div><br />
<div style="background-color: #eeeeff">Which is higher: omega or omega2? {{title|omega2|answer}} Does this make sense in light of what you know about the organisms involved and the function of this gene? {{title|yes, selection is expected to be closer to neutral on this gene|answer}}</div><br />
<div style="background-color: #eeeeff">What is the AIC value reported by HyPhy for this model? {{title|8491.040211003698|answer}}</div><br />
<div style="background-color: #eeeeff">Based on AIC, which of the three models tested thus far would you prefer? {{title|the partially-constrained model, which has the smallest AIC|answer}}</div><br />
<br />
You can now perform a likelihood ratio test. Using the drop-down list box at the top of the Parameters window, specify the most-constrained model to be the null model and the partially-constrained model to be the alternative. Choose LRT from the drop-down list to perform the test.<br />
<div style="background-color: #eeeeff">Does the partially-constrained model fit the data significantly better than the most-constrained model? {{title|yes, at significance level 0.00963201|answer}}</div><br />
<br />
Perform one more likelihood ratio test, this time using the partially-constrained model as the null and the unconstrained model as the alternative.<br />
<div style="background-color: #eeeeff">Does the unconstrained model fit the data significantly better than the partially-constrained model? {{title|yes, at significance level 0.0102742|answer}}</div><br />
<div style="background-color: #eeeeff">Do AIC and LRT agree on which model of the three models is best? {{title|no, AIC favors the partially-constrained model, while LRT favors the unconstrained model|answer}} Why or why not? {{title|To win using AIC, a model must increase the lnL by 1 for each additional parameter. The unconstrained model increases lnL by 18 but requires 19 parameters more than the partially-constrained model. Note that the LRT favors the unconstrained model but only at the 0.01 significance level|answer}}</div><br />
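To see the AIC ranking at a glance, here is a small sketch tabulating the three models using the log-likelihoods and parameter counts obtained above:<br />
<br />
```python
# (lnL, number of free parameters) for each of the three models fit above.
models = {
    "unconstrained":         (-4203.47237, 43),
    "partially-constrained": (-4221.52011, 24),
    "most-constrained":      (-4224.87096, 23),
}
aic = {name: -2.0 * lnL + 2.0 * k for name, (lnL, k) in models.items()}
best = min(aic, key=aic.get)  # smaller AIC is better
print(best)  # partially-constrained wins despite not having the highest lnL
```
<br />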
<br />
<!--<br />
Summary of results:<br />
unconstrained<br />
params: 43<br />
lnL: -4203.47237 (*** best ***)<br />
AIC: 8492.94474 (middle)<br />
<br />
partially-constrained<br />
params: 24<br />
lnL: -4221.52011 (middle)<br />
AIC: 8491.04021 (*** best ***)<br />
<br />
most-constrained<br />
params: 23<br />
lnL: -4224.87096 (worst)<br />
AIC: 8495.74193 (worst)<br />
--><br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41219Phylogenetics: Simulating sequence data2020-02-17T15:38:58Z<p>Paul Lewis: /* Strimmer and Rambaut (2002) Study */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. Simulating sequence data is a relatively new feature of PAUP*. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill it into a few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is to simulate DNA sequence data on a known phylogeny and see how well the model/algorithm performs. If the model/algorithm allows recovery of the known or "true" phylogeny, then we can rest assured that it is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
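One number worth keeping in mind: the sum of the branch lengths is the expected number of substitutions per site over the whole tree (0.5 for this true tree). A quick sketch that pulls the branch lengths out of the Newick string:<br />
<br />
```python
import re

# True tree from the trees block above; each ":x" is a branch length in
# expected substitutions per site.
newick = "((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);"
branch_lengths = [float(x) for x in re.findall(r":([0-9.]+)", newick)]
print(len(branch_lengths), sum(branch_lengths))
```
<br />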
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) an exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>. <br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
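The P and correct quantities can be illustrated with a tiny Python sketch (the splits here are hypothetical, each written as the set of taxa on one side of a branch):<br />
<br />
```python
# Hypothetical true-tree and inferred-tree splits for illustration.
true_splits = {frozenset("AB"), frozenset("CD")}
inferred_splits = {frozenset("AB"), frozenset("AC")}

# P: fraction of true-tree splits recovered in the inferred tree.
p = len(true_splits & inferred_splits) / len(true_splits)

# correct: equals P if the inferred tree contains no wrong splits, else 0.
correct = p if inferred_splits <= true_splits else 0.0

print(p, correct)  # 0.5 0.0 -- one true split recovered, but a wrong split is present
```
<br />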
<br />
The final paup block sets nowarntsave, which means PAUP will not warn you if you quit without saving stored trees, then quits.<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>srun --partition=mcbstudent --qos=mcbstudent --pty bash</tt> session, and load the current module of PAUP*. (Use <tt>module avail</tt> to see available modules, then use <tt>module load xxxx</tt> to start module <tt>xxxx</tt>.) Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to quit in that final paup block). You can also view the <tt>results.txt</tt> file directly in your terminal by typing <tt>column -t results.txt | less</tt> (The <tt>-t</tt> makes the columns align, and the pipe to <tt>less</tt> causes the output to pause after each page of output is shown. Type <tt>q</tt> to get out once you've reached the bottom.).<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both of optimality criteria produce the same tree''? <br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data to a file. You will also probably want to specify only 1 sequence length (e.g. <tt>nchar=1000</tt>) and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to interleave the sequences when it exports. Do this as a parameter of the <tt>export</tt> function. You can use PAUP*'s <tt>cstatus</tt> command to help answer the questions about proportion of constant sites.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10). What is the proportion of constant sites? How many substitutions are simulated, on average, per site over the entire tree?''<br />
* ''Make all branches in the true tree short (e.g. 0.001). What is the proportion of constant sites? How many substitutions are simulated, on average, per site over the entire tree?''<br />
* ''Make all branches in the true tree 10 but add significant rate heterogeneity (gamma shape 0.01). What about the proportion of constant sites now? How many substitutions are simulated, on average, per site over the entire tree? To which of the previous 2 simulated data sets is this data set most similar? Can you explain why?''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 500 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree match the tree SR show in their Figure 1, and you will need to make the simulation model match the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing them with<br />
set maxtrees=105;<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest autest RELL bootreps=1000;<br />
which generates all 105 possible trees and tests them all using both the SH and AU tests. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command, and you can delete the <tt>resultsfile</tt> statement.<br />
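<br />
Assembled, the <tt>dnasim</tt> block might look something like this (a sketch: the <tt>lset</tt> line below is a placeholder you must replace with SR's actual simulation model, and <tt>nreps</tt> and the seed are arbitrary choices):<br />
 begin dnasim;<br />
  simdata nchar=(500 5000);<br />
  lset nst=1 basefreq=equal rates=equal pinvar=0; [replace with SR's model]<br />
  truetree source=memory treenum=1 showtruetree=brlens;<br />
  beginsim nreps=1 seed=12345 monitor=yes;<br />
   set maxtrees=105;<br />
   generatetrees all model=equiprobable;<br />
   lscores all / shtest autest RELL bootreps=1000;<br />
  endsim;<br />
 end;<br />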
<br />
(Note: look at the column labeled <tt>SH</tt>, not the column labeled <tt>wtd-SH</tt>.)<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 500 sites? 5000 sites?''<br />
* ''Does the AU test produce a different result?''<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewis
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill them into few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is by simulating DNA sequence data based on a known phylogeny, and seeing how the model/algorithm performs. If the model/algorithm allows for the recovery of the known or "true" phylogeny then we can rest assured that our model/algorithm is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>. <br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
The final paup block sets nowarntsave, which means PAUP will not warn you if you quit without saving stored trees, then quits.<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>srun --partition=mcbstudent --qos=mcbstudent --pty bash</tt> session, and load the current module of PAUP*. (Use <tt>module avail</tt> to see available modules, then use <tt>module load xxxx</tt> to start module <tt>xxxx</tt>.) Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to quit in that final paup block). You can also view the <tt>results.txt</tt> file directly in your terminal by typing <tt>column -t results.txt | less</tt> (The <tt>-t</tt> makes the columns align, and the pipe to <tt>less</tt> causes the output to pause after each page of output is shown. Type <tt>q</tt> to get out once you've reached the bottom.).<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both of optimality criteria produce the same tree''? <br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data to a file. You will also probably want to specify only 1 sequence length (e.g. <tt>nchar=1000</tt>) and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to interleave the sequences when it exports. Do this as a parameter of the <tt>export</tt> function. You can use PAUP*'s <tt>cstatus</tt> command to help answer the questions about proportion of constant sites.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10). What is the proportion of constant sites? How many substitutions are simulated, on average, for each site over the entire tree?''<br />
* ''Make all branches in the true tree short (e.g. 0.001). What is the proportion of constant sites? How many substitutions are simulated, on average, for each site over the entire tree?''<br />
* ''Make all branches in the true tree 10 but add significant rate heterogeneity (gamma shape 0.01). What about the proportion of constant sites now? How many substitutions are simulated, on average, for each site over the entire tree? To which of the previous 2 simulated data sets is this data set most similar? Can you explain why?''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
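For intuition about what <tt>shtest RELL</tt> is doing, here is a sketch of the SH procedure in Python (toy per-site log-likelihoods standing in for real ones; this mirrors the algorithm of Shimodaira and Hasegawa 1999, not PAUP*'s implementation):

```python
import random

def sh_test_rell(site_lnls, nboot=1000, seed=1):
    """SH test via RELL resampling. site_lnls[i][s] is the
    log-likelihood of site s under tree i."""
    rng = random.Random(seed)
    ntrees, nsites = len(site_lnls), len(site_lnls[0])
    lnl = [sum(row) for row in site_lnls]
    delta = [max(lnl) - l for l in lnl]      # observed test statistics
    # RELL: resample per-site lnLs rather than re-optimizing each replicate
    boot = [[0.0] * nboot for _ in range(ntrees)]
    for b in range(nboot):
        sites = [rng.randrange(nsites) for _ in range(nsites)]
        for i in range(ntrees):
            boot[i][b] = sum(site_lnls[i][s] for s in sites)
    # center each tree's replicates (the null: all trees equally good)
    for i in range(ntrees):
        mean_i = sum(boot[i]) / nboot
        boot[i] = [x - mean_i for x in boot[i]]
    # p-value: how often the centered bootstrap statistic reaches delta
    return [sum(max(boot[j][b] for j in range(ntrees)) - boot[i][b] >= delta[i]
                for b in range(nboot)) / nboot
            for i in range(ntrees)]

rng = random.Random(7)
site_lnls = [[-1.0 + 0.2 * rng.random() for _ in range(500)] for _ in range(3)]
site_lnls[0] = [x + 0.02 for x in site_lnls[0]]  # tree 0 fits slightly better
print(sh_test_rell(site_lnls, nboot=500))
# the observed ML tree always receives p = 1.0 under SH
```

The centering step enforces the least-favorable null hypothesis that all candidate trees are equally good, which is why the SH test becomes conservative when many trees are compared at once.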
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41215Phylogenetics: Simulating sequence data2020-02-17T14:26:20Z<p>Paul Lewis: /* Saving Simulated Data */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*. Simulation is useful for testing null hypotheses of interest (parametric bootstrapping), testing the robustness of models to violations of their assumptions, and testing the correctness of software and algorithms. Simulation is a relatively new capability of PAUP*; the old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as ever).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill it into a few components that the maker of the model or algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important test is to simulate DNA sequence data on a known phylogeny and see how well the method performs. If the method recovers the known ("true") phylogeny, we can be reasonably confident that it captures the important features of the process it attempts to model.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) an exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>. <br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
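In Python terms, the three tallied quantities might be computed like this (a sketch assuming splits are stored as frozensets of the taxa on one side of each internal edge; this is not PAUP*'s actual code):

```python
# Rough Python analogue of what tally records for one simulation replicate
def tally_metrics(true_splits, inferred_trees):
    """inferred_trees: one set of splits per tied-best tree.
    Returns analogues of TALLYLABEL_Ntrees, _P, and _correct."""
    n = len(inferred_trees)
    p_sum = correct_sum = 0.0
    for splits in inferred_trees:
        p = len(true_splits & splits) / len(true_splits)
        p_sum += p
        # _correct equals _P only when the tree contains no wrong splits
        correct_sum += p if splits <= true_splits else 0.0
    return {"Ntrees": n, "P": p_sum / n, "correct": correct_sum / n}

true_splits = {frozenset("AB"), frozenset("CD")}
print(tally_metrics(true_splits, [{frozenset("AB"), frozenset("CD")}]))
# one inferred tree identical to the truth: P = 1.0, correct = 1.0
print(tally_metrics(true_splits, [{frozenset("AC"), frozenset("BD")}]))
# a long-branch-attraction error: P = 0.0, correct = 0.0
```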
<br />
The final <tt>paup</tt> block sets <tt>nowarntsave</tt> (so PAUP* will not warn you about quitting without saving stored trees) and then quits.<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>srun --partition=mcbstudent --qos=mcbstudent --pty bash</tt> session, and load the current module of PAUP*. (Use <tt>module avail</tt> to see available modules, then use <tt>module load xxxx</tt> to start module <tt>xxxx</tt>.) Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to quit in that final <tt>paup</tt> block). You can view the <tt>results.txt</tt> file directly in your terminal by typing <tt>column -t results.txt | less</tt> (the <tt>-t</tt> flag aligns the columns, and the pipe to <tt>less</tt> pauses after each page of output; type <tt>q</tt> to quit once you've reached the bottom).<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
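For example, one possible Felsenstein-zone tree (a sketch; any arrangement in which the two long terminal branches are not sisters and are separated by a short internal branch will do) makes the trees block look like this:

```
begin trees;
  tree 1 = [&R] ((A:1.0,B:0.1):0.05,(C:1.0,D:0.1):0.05);
end;
```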
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both optimality criteria produce the same tree?''<br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data to a file. You will also probably want to specify only 1 sequence length (e.g. <tt>nchar=1000</tt>) and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to interleave the sequences when it exports. Do this as a parameter of the <tt>export</tt> function. You can use PAUP*'s <tt>cstatus</tt> command to help answer the questions about proportion of constant sites.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10). What is the proportion of constant sites?''<br />
* ''Make all branches in the true tree short (e.g. 0.001). What is the proportion of constant sites?''<br />
* ''Make all branches in the true tree 10 but add significant rate heterogeneity (gamma shape 0.01). What about the proportion of constant sites now? To which of the previous 2 simulated data sets is this data set most similar? Can you explain why?''<br />
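To see why these questions come out the way they do, here is a minimal Jukes-Cantor simulator for the lab's rooted four-taxon tree (a Python sketch of the process PAUP* implements, with every branch given the same length and an optional per-site gamma rate):

```python
import math, random

NUC = "ACGT"

def evolve(state, t, rng):
    """Evolve one site along a branch of length t under Jukes-Cantor."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    if rng.random() < p_same:
        return state
    return rng.choice([b for b in NUC if b != state])

def prop_constant(brlen, nsites=10000, gamma_shape=None, seed=12345):
    """Proportion of constant sites simulated on the rooted tree
    ((A,B),(C,D)) with every branch set to brlen."""
    rng = random.Random(seed)
    constant = 0
    for _ in range(nsites):
        # per-site rate: mean 1 under the gamma, exactly 1 otherwise
        r = rng.gammavariate(gamma_shape, 1.0 / gamma_shape) if gamma_shape else 1.0
        root = rng.choice(NUC)
        left, right = evolve(root, brlen * r, rng), evolve(root, brlen * r, rng)
        tips = [evolve(left, brlen * r, rng), evolve(left, brlen * r, rng),
                evolve(right, brlen * r, rng), evolve(right, brlen * r, rng)]
        constant += len(set(tips)) == 1
    return constant / nsites

print(prop_constant(10.0))    # long branches: roughly (1/4)^3 of sites constant by chance
print(prop_constant(0.001))   # short branches: nearly all sites constant
print(prop_constant(10.0, gamma_shape=0.01))  # mostly constant again: most rates are ~0
```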
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41213Phylogenetics: Simulating sequence data2020-02-17T02:46:58Z<p>Paul Lewis: /* Saving Simulated Data */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*. Simulation is useful for testing null hypotheses of interest (parametric bootstrapping), testing the robustness of models to violations of their assumptions, and testing the correctness of software and algorithms. Simulation is a relatively new capability of PAUP*; the old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as ever).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill it into a few components that the maker of the model or algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important test is to simulate DNA sequence data on a known phylogeny and see how well the method performs. If the method recovers the known ("true") phylogeny, we can be reasonably confident that it captures the important features of the process it attempts to model.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) an exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>. <br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
The final <tt>paup</tt> block sets <tt>nowarntsave</tt> (so PAUP* will not warn you about quitting without saving stored trees) and then quits.<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>srun --partition=mcbstudent --qos=mcbstudent --pty bash</tt> session, and load the current module of PAUP*. (Use <tt>module avail</tt> to see available modules, then use <tt>module load xxxx</tt> to start module <tt>xxxx</tt>.) Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to quit in that final <tt>paup</tt> block). You can view the <tt>results.txt</tt> file directly in your terminal by typing <tt>column -t results.txt | less</tt> (the <tt>-t</tt> flag aligns the columns, and the pipe to <tt>less</tt> pauses after each page of output; type <tt>q</tt> to quit once you've reached the bottom).<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both optimality criteria produce the same tree?''<br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequences when it displays them. Do this as a parameter of the <tt>export</tt> command. If you increase the number of taxa in your tree, the effects of the following changes will be more pronounced.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10). What is the proportion of constant sites?''<br />
* ''Make all branches in the true tree short (e.g. 0.001). What is the proportion of constant sites?''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01). What about the proportion of constant sites now? Is this what you expected?''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41212Phylogenetics: Simulating sequence data2020-02-17T02:42:14Z<p>Paul Lewis: /* Saving Simulated Data */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*. Simulation is useful for testing null hypotheses of interest (parametric bootstrapping), testing the robustness of models to violations of their assumptions, and testing the correctness of software and algorithms. Simulation is a relatively new capability of PAUP*; the old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as ever).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill it into a few components that the maker of the model or algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important test is to simulate DNA sequence data on a known phylogeny and see how well the method performs. If the method recovers the known ("true") phylogeny, we can be reasonably confident that it captures the important features of the process it attempts to model.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) an exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>. <br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
The final <tt>paup</tt> block sets <tt>nowarntsave</tt> (so PAUP* will not warn you about quitting without saving stored trees) and then quits.<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>srun --partition=mcbstudent --qos=mcbstudent --pty bash</tt> session, and load the current module of PAUP*. (Use <tt>module avail</tt> to see available modules, then use <tt>module load xxxx</tt> to start module <tt>xxxx</tt>.) Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to quit in that final <tt>paup</tt> block). You can view the <tt>results.txt</tt> file directly in your terminal by typing <tt>column -t results.txt | less</tt> (the <tt>-t</tt> flag aligns the columns, and the pipe to <tt>less</tt> pauses after each page of output; type <tt>q</tt> to quit once you've reached the bottom).<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both optimality criteria produce the same tree?''<br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequences when it displays them. Do this as a parameter of the <tt>export</tt> command. If you increase the number of taxa in your tree, the effects of the following changes will be more pronounced.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41211Phylogenetics: Simulating sequence data2020-02-17T02:41:59Z<p>Paul Lewis: /* Saving Simulated Data */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*. Simulation is useful for testing null hypotheses of interest (parametric bootstrapping), testing the robustness of models to violations of their assumptions, and testing the correctness of software and algorithms. Simulation is a relatively new capability of PAUP*; the old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as ever).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill it into a few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing a model/algorithm is to simulate DNA sequence data on a known phylogeny and see how the model/algorithm performs. If the model/algorithm recovers the known or "true" phylogeny, then we can rest assured that it is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) an exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>. <br />
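The <tt>beginsim...endsim</tt> machinery is PAUP*-specific, but the simulation step itself is simple. Here is a toy Python sketch (an illustration, not PAUP* code) that generates one data set on the template tree under JC69 by drawing a root state from equal base frequencies and evolving it along each branch:<br />

```python
import math
import random

BASES = "ACGT"

def jc_evolve(base, t, rng):
    """Evolve one site along a branch of length t (expected substitutions/site)
    under JC69: the probability of ending in the same state is 1/4 + 3/4*exp(-4t/3)."""
    if rng.random() < 0.25 + 0.75 * math.exp(-4.0 * t / 3.0):
        return base
    return rng.choice([b for b in BASES if b != base])

def simulate(nchar, rng):
    """Simulate nchar sites on the template tree ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05)."""
    seqs = {taxon: [] for taxon in "ABCD"}
    for _ in range(nchar):
        root = rng.choice(BASES)            # equal base frequencies at the root
        left = jc_evolve(root, 0.05, rng)   # ancestor of (A,B)
        right = jc_evolve(root, 0.05, rng)  # ancestor of (C,D)
        seqs["A"].append(jc_evolve(left, 0.1, rng))
        seqs["B"].append(jc_evolve(left, 0.1, rng))
        seqs["C"].append(jc_evolve(right, 0.1, rng))
        seqs["D"].append(jc_evolve(right, 0.1, rng))
    return {taxon: "".join(sites) for taxon, sites in seqs.items()}

data = simulate(10000, random.Random(12345))
```

Between sisters A and B the connecting path is 0.1 + 0.1 = 0.2 expected substitutions/site, so about 17.6% of the simulated sites should differ between them.<br />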
<br />
For both parsimony and ML, <tt>tally</tt> calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
The final <tt>paup</tt> block sets <tt>nowarntsave</tt>, which prevents PAUP* from warning you about quitting without saving stored trees, and then quits.<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>srun --partition=mcbstudent --qos=mcbstudent --pty bash</tt> session, and load the current module of PAUP*. (Use <tt>module avail</tt> to see available modules, then use <tt>module load xxxx</tt> to start module <tt>xxxx</tt>.) Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to quit in that final <tt>paup</tt> block). You can view the <tt>results.txt</tt> file directly in your terminal by typing <tt>column -t results.txt | less</tt> (the <tt>-t</tt> aligns the columns, and the pipe to <tt>less</tt> pauses the output after each page; type <tt>q</tt> to exit once you've reached the bottom).<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
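The key to the Felsenstein zone is ''which'' two edges you lengthen: the long branches must lead to non-sister taxa (for example A and C in the template tree). The following Python sketch (an illustration using the standard JC69 formulas, not part of the lab's required steps) computes exact site-pattern probabilities on such a tree and shows that the misleading pattern class (identical states in A and C) outweighs the class supporting the true tree:<br />

```python
import math
from itertools import product

STATES = range(4)  # A, C, G, T

def jc_p(i, j, t):
    """JC69 transition probability from state i to state j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def pattern_probs(tA, tB, tC, tD, t_internal):
    """Exact site-pattern probabilities on the rooted tree ((A,B),(C,D)),
    summing over the unobserved root and two internal-node states."""
    probs = {}
    for a, b, c, d in product(STATES, repeat=4):
        p = 0.0
        for root, n1, n2 in product(STATES, repeat=3):
            p += (0.25 * jc_p(root, n1, t_internal) * jc_p(root, n2, t_internal)
                  * jc_p(n1, a, tA) * jc_p(n1, b, tB)
                  * jc_p(n2, c, tC) * jc_p(n2, d, tD))
        probs[(a, b, c, d)] = p
    return probs

# Felsenstein-zone lengths: long branches (1.0) lead to non-sister taxa A and C
probs = pattern_probs(1.0, 0.1, 1.0, 0.1, 0.05)
true_support = sum(p for (a, b, c, d), p in probs.items() if a == b != c == d)   # xxyy
wrong_support = sum(p for (a, b, c, d), p in probs.items() if a == c != b == d)  # xyxy
```

Parsimony treats each informative site pattern as a vote, so when the xyxy class is more probable than the xxyy class, adding data only makes parsimony more confidently wrong.<br />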
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both optimality criteria produce the same tree?'' <br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
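To see how extreme a gamma shape of 0.01 is, you can draw site rates from the corresponding mean-1 gamma distribution yourself (a Python illustration; PAUP* handles this internally, possibly via a discretized gamma):<br />

```python
import random

def gamma_rates(shape, n, rng):
    """Draw n site-specific rates from a mean-1 gamma distribution
    (shape=alpha, scale=1/alpha), as used for gamma rate heterogeneity."""
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n)]

rates = gamma_rates(0.01, 100000, random.Random(1))
mean_rate = sum(rates) / len(rates)            # stays near 1 by construction
median_rate = sorted(rates)[len(rates) // 2]   # essentially zero for shape 0.01
```

The mean rate is 1 by construction, but the median is essentially zero: nearly all sites are effectively invariant while a tiny fraction evolve extremely fast, which is why a rate-homogeneous analysis of such data can be badly misled.<br />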
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the <tt>help</tt> command to figure out how to export data. You will probably also want to specify only 1 sequence length and 1 simulation replicate. The output is a bit easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequences when it displays them; do this as a parameter of the <tt>export</tt> command. If you add taxa to your tree (and taxa block), the effects of the following changes will be more pronounced.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
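A handy sanity check for all three experiments is the JC69 expectation for the proportion of sites differing between two taxa, which depends only on the total path length between them (standard JC69 algebra, sketched in Python):<br />

```python
import math

def jc_pdist(t):
    """Expected fraction of sites differing between two sequences separated
    by a path of total length t (expected substitutions/site) under JC69."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

# All branches 10: any pairwise path is at least 20, and jc_pdist(20) is about 0.75,
# i.e. the sequences are saturated (they look random with respect to each other).
# All branches 0.001: e.g. a path of 0.002 between sisters gives jc_pdist(0.002)
# of about 0.002, i.e. the sequences are nearly identical.
```
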
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
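To build intuition for what <tt>shtest RELL</tt> does, here is a toy Python sketch of the SH test with RELL (resampling estimated log-likelihoods), following the published description in Shimodaira and Hasegawa (1999); it uses made-up sitewise log-likelihoods and is not PAUP*'s implementation:<br />

```python
import random

def sh_test(sitewise, nboot=1000, seed=1):
    """Toy SH test via RELL. sitewise[i][k] = log-likelihood of site k under
    tree i. Returns one p-value per tree; a large p means 'not rejected'."""
    rng = random.Random(seed)
    ntrees, nsites = len(sitewise), len(sitewise[0])
    totals = [sum(row) for row in sitewise]
    tstat = [max(totals) - L for L in totals]      # observed L_max - L_i

    # RELL: resample site log-likelihoods rather than re-optimizing each replicate
    boot = [[0.0] * nboot for _ in range(ntrees)]
    for b in range(nboot):
        sites = [rng.randrange(nsites) for _ in range(nsites)]
        for i in range(ntrees):
            boot[i][b] = sum(sitewise[i][k] for k in sites)

    # center each tree's bootstrap distribution at its own mean
    centered = []
    for i in range(ntrees):
        mu = sum(boot[i]) / nboot
        centered.append([x - mu for x in boot[i]])

    pvals = []
    for i in range(ntrees):
        count = 0
        for b in range(nboot):
            s_max = max(centered[j][b] for j in range(ntrees))
            if s_max - centered[i][b] > tstat[i]:
                count += 1
        pvals.append(count / nboot)
    return pvals

rng = random.Random(0)
# toy data: tree 0 is clearly best, trees 1 and 2 are distinctly worse
sitewise = [[-1.5 + (0.05 if i == 0 else 0.0) + 0.1 * rng.random()
             for _ in range(100)] for i in range(3)]
pvals = sh_test(sitewise)
```

Trees whose log-likelihood falls far below the best tree get small p-values (rejected), while the best tree and near-ties get large ones; SR's point was that with short sequences these p-values are too conservative, leaving too many trees unrejected.<br />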
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewis
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill them into few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is by simulating DNA sequence data based on a known phylogeny, and seeing how the model/algorithm performs. If the model/algorithm allows for the recovery of the known or "true" phylogeny then we can rest assured that our model/algorithm is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>. <br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
The final paup block sets nowarntsave, which means PAUP will not warn you if you quit without saving stored trees, then quits.<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>srun --partition=mcbstudent --qos=mcbstudent --pty bash</tt> session, and load the current module of PAUP*. (Use <tt>module avail</tt> to see available modules, then use <tt>module load xxxx</tt> to start module <tt>xxxx</tt>.) Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to quit in that final paup block). You can also view the <tt>results.txt</tt> file directly in your terminal by typing <tt>column -t results.txt | less</tt> (The <tt>-t</tt> makes the columns align, and the pipe to <tt>less</tt> causes the output to pause after each page of output is shown. Type <tt>q</tt> to get out once you've reached the bottom.).<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both of optimality criteria produce the same tree''? <br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequence when it displays it. Do this as a parameter of the <tt>export</tt> function. If you add taxa to your tree and taxa block the effects of the the following changes will be more pronounced.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molelcular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41209Phylogenetics: Simulating sequence data2020-02-17T02:23:53Z<p>Paul Lewis: /* Execute the NEXUS File */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill them into few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is by simulating DNA sequence data based on a known phylogeny, and seeing how the model/algorithm performs. If the model/algorithm allows for the recovery of the known or "true" phylogeny then we can rest assured that our model/algorithm is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>, which is best viewed by pasting its contents into an Excel worksheet. <br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
The final paup block sets nowarntsave, which means PAUP will not warn you if you quit without saving stored trees, then quits.<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>srun --partition=mcbstudent --qos=mcbstudent --pty bash</tt> session, and load the current module of PAUP*. (Use <tt>module avail</tt> to see available modules, then use <tt>module load xxxx</tt> to start module <tt>xxxx</tt>.) Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to quit in that final paup block). You can also view the <tt>results.txt</tt> file directly in your terminal by typing <tt>column -t results.txt | less</tt> (The <tt>-t</tt> makes the columns align, and the pipe to <tt>less</tt> causes the output to pause after each page of output is shown. Type <tt>q</tt> to get out once you've reached the bottom.).<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both of optimality criteria produce the same tree''? <br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequence when it displays it. Do this as a parameter of the <tt>export</tt> function. If you add taxa to your tree and taxa block the effects of the the following changes will be more pronounced.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molelcular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41208Phylogenetics: Simulating sequence data2020-02-17T02:20:33Z<p>Paul Lewis: /* Simulation Template */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill them into few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is by simulating DNA sequence data based on a known phylogeny, and seeing how the model/algorithm performs. If the model/algorithm allows for the recovery of the known or "true" phylogeny then we can rest assured that our model/algorithm is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) an exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>, which is best viewed by pasting its contents into an Excel worksheet. You can also view the results directly in your terminal by typing <tt>column -t results.txt | less</tt>. <br />
<br />
For both parsimony and ML, <tt>tally</tt> calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
The final <tt>paup</tt> block sets <tt>nowarntsave</tt>, which prevents PAUP* from warning you if you quit without saving stored trees, and then quits.<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>srun --partition=mcbstudent --qos=mcbstudent --pty bash</tt> session, and load the current module of PAUP*. (Use <tt>module avail</tt> to see available modules, then use <tt>module load xxxx</tt> to start module <tt>xxxx</tt>.) Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to quit in that final paup block). Open the <tt>results.txt</tt> file in a spreadsheet viewer to see the results.<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
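<br />
One Felsenstein-zone choice (there are others) is to lengthen two terminal edges that are ''not'' sisters in the true tree, for example:<br />
 tree 1 = [&R] ((A:1.0,B:0.1):0.05,(C:1.0,D:0.1):0.05);<br />
Because A and C now sit on the two long edges even though they are not sisters, parsimony should tend to (incorrectly) group them together.<br />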
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both optimality criteria produce the same tree?''<br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
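<br />
After these two edits, the affected lines of the <tt>dnasim</tt> block should look like this:<br />
 simdata nchar=(10 100 1000 10000);<br />
 beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=meansonly);<br />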
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
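<br />
One way to do this is to edit the <tt>lset</tt> line in the <tt>dnasim</tt> block; <tt>rates=gamma shape=0.01</tt> follows standard PAUP* <tt>lset</tt> syntax, but confirm the option names with <tt>help lset</tt> if in doubt:<br />
 lset nst=1 basefreq=equal rates=gamma shape=0.01 pinvar=0;<br />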
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the <tt>help</tt> command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. The output is a bit easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequences when it displays them; do this as a parameter of the <tt>export</tt> command. If you add taxa to your tree and taxa block, the effects of the following changes will be more pronounced.<br />
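<br />
As a starting point, here is a sketch of a modified <tt>beginsim...endsim</tt> loop (with, say, <tt>simdata nchar=(1000);</tt>). The file name <tt>simdata.nex</tt> is arbitrary, and the <tt>export</tt> option names (particularly <tt>interleaved</tt>) are given from memory, so check them against <tt>help export</tt>:<br />
 beginsim nreps=1 seed=12345 monitor=yes;<br />
  export file=simdata.nex format=nexus replace interleaved=yes;<br />
 endsim;<br />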
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
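<br />
The AU test is available through the same <tt>lscores</tt> machinery as the SH test; the option name below is an educated guess, so verify the exact spelling with <tt>help lscores</tt>:<br />
 lscores all / autest RELL;<br />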
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill them into few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is by simulating DNA sequence data based on a known phylogeny, and seeing how the model/algorithm performs. If the model/algorithm allows for the recovery of the known or "true" phylogeny then we can rest assured that our model/algorithm is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>, which is best viewed by pasting its contents into an Excel worksheet. You can also view the results directly in your terminal by typing <tt>column -t results.txt | less</tt><br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
The final paup block sets nowarntsave, which means PAUP will not warn you if you quit without saving stored trees, then quits.<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>srun --partition=mcbstudent --qos=mcbstudent --pty bash</tt> session, and load the current module of PAUP*. (Use <tt>module avail</tt> to see available modules, then use <tt>module load xxxx</tt> to start module <tt>xxxx</tt>.) Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to quit in that final paup block). Open the <tt>results.txt</tt> file in a spreadsheet viewer to see the results.<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both of optimality criteria produce the same tree''? <br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequence when it displays it. Do this as a parameter of the <tt>export</tt> function. If you add taxa to your tree and taxa block the effects of the the following changes will be more pronounced.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molelcular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41206Phylogenetics: Simulating sequence data2020-02-17T02:13:55Z<p>Paul Lewis: /* Simulation Template */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill them into few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is by simulating DNA sequence data based on a known phylogeny, and seeing how the model/algorithm performs. If the model/algorithm allows for the recovery of the known or "true" phylogeny then we can rest assured that our model/algorithm is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>, which is best viewed by pasting its contents into an Excel worksheet. You can also view the results directly in your terminal by typing <tt>column -t results.txt | less</tt><br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
The final paup block sets nowarntsave, which means PAUP will not warn you if you quit without saving stored trees, then quits.<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>qlogin</tt> session, and load the current module of PAUP*. If you forget how to load modules, notice the message that was printed right after you logged on to the cluster. Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to in that final paup block). Open the <tt>results.txt</tt> file in a spreadsheet viewer to see the results.<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both of optimality criteria produce the same tree''? <br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequence when it displays it. Do this as a parameter of the <tt>export</tt> function. If you add taxa to your tree and taxa block the effects of the the following changes will be more pronounced.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molelcular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41205Phylogenetics: Simulating sequence data2020-02-17T02:09:33Z<p>Paul Lewis: /* Simulation Template */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill them into few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is by simulating DNA sequence data based on a known phylogeny, and seeing how the model/algorithm performs. If the model/algorithm allows for the recovery of the known or "true" phylogeny then we can rest assured that our model/algorithm is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
set nowarntsave;<br />
quit;<br />
end;<br />
The initial <tt>paup</tt> block tells PAUP* to store branch lengths of any tree it encounters and to not warn us if there are trees in memory when we try to quit.<br />
* ''What is the reason for including <tt>storebrlens</tt>?''<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>, which is best viewed by pasting its contents into an Excel worksheet. You can also view the results directly in your terminal by typing <tt>column -t results.txt | less</tt><br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>qlogin</tt> session, and load the current module of PAUP*. If you forget how to load modules, notice the message that was printed right after you logged on to the cluster. Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to in that final paup block). Open the <tt>results.txt</tt> file in a spreadsheet viewer to see the results.<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
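One way to do this (an arbitrary choice; lengthening any two nonsister terminal branches works) is to stretch the branches leading to A and C, producing two long branches separated by short ones:<br />
 tree 1 = [&R] ((A:1.0,B:0.1):0.05,(C:1.0,D:0.1):0.05);<br />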
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both optimality criteria produce the same tree?''<br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
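After these edits, the relevant lines of the <tt>dnasim</tt> block should look like this:<br />
 simdata nchar=(10 100 1000 10000);<br />
 beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=meansonly);<br />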
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' <br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
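One way to do this is to modify the <tt>lset</tt> line in the <tt>dnasim</tt> block (which specifies the simulating model) to use a gamma distribution of rates across sites:<br />
 lset nst=1 basefreq=equal rates=gamma shape=0.01 pinvar=0;<br />
Make sure the model used for the ML analyses still assumes rate homogeneity, so that the analysis model is indeed violated.<br />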
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the <tt>help</tt> command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. The output is easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequences when it displays them; do this as a parameter of the <tt>export</tt> command. If you add taxa to your tree and taxa block, the effects of the following changes will be more pronounced.<br />
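For example, the <tt>dnasim</tt> block might be modified along these lines (the export option names here are a sketch; confirm them with <tt>help export</tt>, since your version of PAUP* may differ):<br />
 simdata nchar=1000;<br />
 lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
 truetree source=memory treenum=1 showtruetree=brlens;<br />
 beginsim nreps=1 seed=12345 monitor=yes;<br />
 export file=simulated.nex format=nexus interleaved=yes replace;<br />
 endsim;<br />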
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative, and that this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41202Phylogenetics: Simulating sequence data2020-02-17T01:56:01Z<p>Paul Lewis: /* Execute the NEXUS File */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful for testing null hypotheses of interest (parametric bootstrapping), testing the robustness of models to violations of their assumptions, and testing the correctness of software and algorithms. Simulating sequence data is a relatively new capability of PAUP*. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill it into a few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing a model/algorithm is to simulate DNA sequence data on a known phylogeny and see how the model/algorithm performs. If the model/algorithm allows the recovery of the known or "true" phylogeny, then we can rest assured that it is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin paup;<br />
set storebrlens nowarntsave;<br />
end;<br />
<br />
begin taxa;<br />
dimensions ntax=4;<br />
taxlabels A B C D;<br />
end;<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
quit;<br />
end;<br />
The initial <tt>paup</tt> block tells PAUP* to store branch lengths of any tree it encounters and to not warn us if there are trees in memory when we try to quit.<br />
* ''What is the reason for including <tt>storebrlens</tt>?''<br />
The <tt>taxa</tt> block contains the number of taxa and the names of the taxa.<br />
* ''Why is this taxa block necessary?''<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 simulated data sets, two analyses will be performed: (1) an exhaustive search using parsimony and (2) an exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>, which is best viewed by pasting its contents into an Excel worksheet. You can also view the results directly in your terminal by typing <tt>column -t results.txt | less</tt>.<br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>qlogin</tt> session, and load the current module of PAUP*. If you forget how to load modules, notice the message that was printed right after you logged on to the cluster. Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to in that final paup block). Open the <tt>results.txt</tt> file in a spreadsheet viewer to see the results.<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?''<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new <tt>results.txt</tt> file in a spreadsheet viewer.<br />
<br />
* ''Did both optimality criteria produce the same tree?'' {{title|Nope|answer}}<br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' {{title|ML because it infers the true tree more often as the amount of data increases|answer}}<br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
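For example, the <tt>lset</tt> line in the <tt>dnasim</tt> block could become:<br />
 lset nst=1 basefreq=equal rates=gamma shape=0.01 pinvar=0;<br />
Keep in mind that this <tt>lset</tt> controls the simulating model; the ML analyses should still use a rate-homogeneous model, since that mismatch is exactly the violation being tested.<br />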
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. The output is easier to read if you tell PAUP* to <tt>interleave</tt> the sequences when it displays them; do this as a parameter of the <tt>export</tt> command. If you add taxa to your tree and taxa block, the effects of the following changes will be more pronounced.<br />
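As a starting point, something along these lines inside the <tt>beginsim...endsim</tt> loop is worth trying (the option names here are a guess to be confirmed with <tt>help export</tt>, including whether the interleaving option is spelled <tt>interleave</tt> or <tt>interleaved</tt>):<br />
 export file=simdata.nex format=nexus replace interleave=yes;<br />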
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
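As an aside on where these counts come from: the number of rooted binary trees on n labeled tips is the double factorial (2n−3)!! = 1·3·5···(2n−3), and the number of unrooted trees on n tips equals the rooted count for n−1 tips. That accounts for both the 15 and the 105 mentioned here; a quick Python check:<br />

```python
# Count binary tree topologies via the double factorial (2n-3)!!.

def num_rooted_trees(n):
    """Number of rooted binary trees on n labeled tips."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= k
    return count

def num_unrooted_trees(n):
    """Number of unrooted binary trees on n labeled tips (n >= 3)."""
    return num_rooted_trees(n - 1)

print(num_rooted_trees(4))    # 15  -- the rooted trees for taxa A, B, C, D
print(num_rooted_trees(5))    # 105
print(num_unrooted_trees(6))  # 105
```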
<br />
* ''How many of the 105 trees were not rejected (i.e., not significantly worse than the best tree) by the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill them into few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is by simulating DNA sequence data based on a known phylogeny, and seeing how the model/algorithm performs. If the model/algorithm allows for the recovery of the known or "true" phylogeny then we can rest assured that our model/algorithm is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin paup;<br />
set storebrlens nowarntsave;<br />
end;<br />
<br />
begin taxa;<br />
dimensions ntax=4;<br />
taxlabels A B C D;<br />
end;<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
quit;<br />
end;<br />
The initial <tt>paup</tt> block tells PAUP* to store branch lengths of any tree it encounters and to not warn us if there are trees in memory when we try to quit.<br />
* ''What is the reason for including <tt>storebrlens</tt>?''<br />
The <tt>taxa</tt> block contains the number of taxa and the names of the taxa.<br />
* ''Why is this taxa block necessary?''<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>, which is best viewed by pasting its contents into an Excel worksheet. You can also view the results directly in your terminal by typing <tt>column -t results.txt | less</tt><br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>qlogin</tt> session, and load the current module of PAUP*. If you forget how to load modules, notice the message that was printed right after you logged on to the cluster. Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to in that final paup block). Open the <tt>results.txt</tt> file in a spreadsheet viewer to see the results.<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?'' {{title|Too easy!|answer}}<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both of optimality criteria produce the same tree''? {{title|Nope|answer}}<br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' {{title|ML because it infers the true tree more often as the amount of data increases|answer}}<br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequence when it displays it. Do this as a parameter of the <tt>export</tt> function. If you add taxa to your tree and taxa block the effects of the the following changes will be more pronounced.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.<br />
<br />
==Literature Cited==<br />
H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molelcular Biology and Evolution 16:1114–1116.<br />
<br />
H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41200Phylogenetics: Simulating sequence data2020-02-17T01:29:17Z<p>Paul Lewis: /* Strimmer and Rambaut (2002) Study */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill them into few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is by simulating DNA sequence data based on a known phylogeny, and seeing how the model/algorithm performs. If the model/algorithm allows for the recovery of the known or "true" phylogeny then we can rest assured that our model/algorithm is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin paup;<br />
set storebrlens nowarntsave;<br />
end;<br />
<br />
begin taxa;<br />
dimensions ntax=4;<br />
taxlabels A B C D;<br />
end;<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
quit;<br />
end;<br />
The initial <tt>paup</tt> block tells PAUP* to store branch lengths of any tree it encounters and to not warn us if there are trees in memory when we try to quit.<br />
* ''What is the reason for including <tt>storebrlens</tt>?''<br />
The <tt>taxa</tt> block contains the number of taxa and the names of the taxa.<br />
* ''Why is this taxa block necessary?''<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>, which is best viewed by pasting its contents into an Excel worksheet. You can also view the results directly in your terminal by typing <tt>column -t results.txt | less</tt><br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>qlogin</tt> session, and load the current module of PAUP*. If you forget how to load modules, notice the message that was printed right after you logged on to the cluster. Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to in that final paup block). Open the <tt>results.txt</tt> file in a spreadsheet viewer to see the results.<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?'' {{title|Too easy!|answer}}<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both of optimality criteria produce the same tree''? {{title|Nope|answer}}<br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' {{title|ML because it infers the true tree more often as the amount of data increases|answer}}<br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will also probably want to specify only 1 sequence length and 1 simulation replicate. It's a bit easier to comprehend if you tell PAUP* to <tt>interleave</tt> the sequence when it displays it. Do this as a parameter of the <tt>export</tt> function. If you add taxa to your tree and taxa block the effects of the the following changes will be more pronounced.<br />
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree equal the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing these with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
Change to the AU test and see if that produces a different result.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Simulating_sequence_data&diff=41199Phylogenetics: Simulating sequence data2020-02-17T01:19:48Z<p>Paul Lewis: /* Strimmer and Rambaut (2002) Study */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan and Paul Lewis<br />
<br />
== Goals ==<br />
<br />
The goal of this lab is to gain experience simulating DNA sequence data using PAUP*, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. PAUP* is a relatively new way to simulate sequence data. The old workhorse for DNA simulations in phylogenetics is Andrew Rambaut's program seq-gen, which is [http://tree.bio.ed.ac.uk/software/seqgen/ still available] (and still as useful as it always was).<br />
<br />
== Introduction ==<br />
<br />
The development of models and algorithms of any kind requires testing to see how they perform. All models and algorithms make assumptions: they take the infinite complexity of nature and distill them into few components that the maker of the model/algorithm assumes are important. With models of DNA evolution and phylogenetic inference algorithms, one important way of testing the capability of a model/algorithm is by simulating DNA sequence data based on a known phylogeny, and seeing how the model/algorithm performs. If the model/algorithm allows for the recovery of the known or "true" phylogeny then we can rest assured that our model/algorithm is relatively accurate in its distillation of the complexity of the processes it attempts to capture.<br />
<br />
== Getting Started ==<br />
<br />
We will be using cutting-edge features in PAUP* -- so cutting edge that you will not be able to find any information about these features anywhere online or by using the <tt>help</tt> command in PAUP*! So don't get confused when you try to look up some of the components of the NEXUS file you will be using. There are some familiar blocks and commands in the NEXUS file though. Feel free to look at past labs or use the <tt>help</tt> command to refresh your memory. <br />
<br />
====Simulation Template====<br />
<br />
Create an empty text file and add the following lines to it and save it as a .nex file:<br />
#nexus<br />
<br />
begin paup;<br />
set storebrlens nowarntsave;<br />
end;<br />
<br />
begin taxa;<br />
dimensions ntax=4;<br />
taxlabels A B C D;<br />
end;<br />
<br />
begin trees;<br />
tree 1 = [&R] ((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);<br />
end;<br />
<br />
begin dnasim;<br />
simdata nchar=10000;<br />
lset nst=1 basefreq=equal rates=equal pinvar=0;<br />
truetree source=memory treenum=1 showtruetree=brlens;<br />
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);<br />
[parsimony]<br />
set criterion=parsimony;<br />
alltrees;<br />
tally parsimony;<br />
[likelihood under JC]<br />
set criterion=likelihood;<br />
alltrees;<br />
tally 'ML-JC';<br />
endsim; <br />
end;<br />
<br />
begin paup;<br />
quit;<br />
end;<br />
The initial <tt>paup</tt> block tells PAUP* to store branch lengths of any tree it encounters and to not warn us if there are trees in memory when we try to quit.<br />
* ''What is the reason for including <tt>storebrlens</tt>?''<br />
The <tt>taxa</tt> block contains the number of taxa and the names of the taxa.<br />
* ''Why is this taxa block necessary?''<br />
The <tt>trees</tt> block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted, but the obscure <tt>[&R]</tt> says that this tree is rooted.<br />
<br />
The <tt>beginsim...endsim</tt> loop in the <tt>dnasim</tt> block tells PAUP* to simulate 100 nucleotide data sets using the Jukes-Cantor model with no rate heterogeneity. For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. The <tt>tally</tt> commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file <tt>results.txt</tt>, which is best viewed by pasting its contents into an Excel worksheet. You can also view the results directly in your terminal by typing <tt>column -t results.txt | less</tt><br />
<br />
For both parsimony and ML, tally calculates the following quantities (where TALLYLABEL is either "parsimony" or "ML-JC"):<br />
* TALLYLABEL_Ntrees, the number of trees tied for being best (ideally 1)<br />
* TALLYLABEL_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)<br />
* TALLYLABEL_correct, same as TALLYLABEL_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)<br />
<br />
====Execute the NEXUS File====<br />
<br />
Log on to the cluster, start a <tt>qlogin</tt> session, and load the current module of PAUP*. If you forget how to load modules, notice the message that was printed right after you logged on to the cluster. Start PAUP*, and execute your NEXUS file. <br />
<br />
Note that PAUP* quits after performing the simulations (because we told it to in that final paup block). Open the <tt>results.txt</tt> file in a spreadsheet viewer to see the results.<br />
<br />
* ''Did both optimality criteria get the tree correct most of the time?'' {{title|Too easy!|answer}}<br />
<br />
==Enter the Felsenstein Zone==<br />
<br />
As you've learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named <tt>paupsimFZ.nex</tt>. Edit the new file and change two edge lengths to 1.0 in order to create a true tree in the Felsenstein zone.<br />
<br />
Execute <tt>paupsimFZ.nex</tt>, then open the new results.txt file in a spreadsheet viewer.<br />
<br />
* ''Did both of optimality criteria produce the same tree''? {{title|Nope|answer}}<br />
<br />
Change the <tt>simdata nchar=10000;</tt> line to <tt>simdata nchar=(10 100 1000 10000);</tt> and change <tt>output=allreps</tt> to <tt>output=meansonly</tt>. Now PAUP* will simulate data sets of four different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.<br />
<br />
* ''Which (parsimony or ML) appears to be statistically consistent? Why?'' {{title|ML because it infers the true tree more often as the amount of data increases|answer}}<br />
<br />
Add substantial rate heterogeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate homogeneity).<br />
<br />
* ''Is ML statistically consistent when the model is violated in this way? Why?''<br />
<br />
==Saving Simulated Data==<br />
<br />
Can you figure out how to change your NEXUS file so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use the "help" command to figure out how to export data. You will probably also want to specify only 1 sequence length and 1 simulation replicate. The output is easier to read if you tell PAUP* to <tt>interleave</tt> the sequences when it displays them; do this as a parameter of the <tt>export</tt> command. If you add taxa to your tree and taxa block, the effects of the following changes will be more pronounced.<br />
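<br />
A sketch of the export step (the option names here are from memory, so confirm them with <tt>help export</tt> before relying on them; the file name is just a placeholder):<br />
 export file=simulated.nex format=nexus interleaved=yes;<br />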
<br />
* ''Make all branches in the true tree long (e.g. 10) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree short (e.g. 0.001) and see if the simulated data is what you expect''<br />
* ''Make all branches in the true tree 0.1 but add significant rate heterogeneity (gamma shape 0.01) and see if the simulated data is what you expect''<br />
<br />
==Strimmer and Rambaut (2002) Study==<br />
<br />
Download this paper, which I'll refer to as SR from now on:<br />
<br />
Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142. [https://doi.org/10.1098/rspb.2001.1862 https://doi.org/10.1098/rspb.2001.1862]<br />
<br />
SR simulated data on the tree shown in Figure 1 of their paper and expected the SH test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn't it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equally good by the SH test. They concluded that the SH test has a bias that makes it overly conservative, and that this bias dissipates as sequence length increases. This result prompted Shimodaira to create the AU (Approximately Unbiased) test.<br />
<br />
Can you recreate SR's results for 1000 and 5000 sites (see their Table 3)? <br />
<br />
To do this, you will need to make your true tree match the tree SR show in their Figure 1 (and modify the taxa block accordingly), and you may need to match your simulation model to the one they used (see the bottom right part of SR p. 140). You can delete the <tt>set criterion=...</tt>, <tt>alltrees</tt>, and <tt>tally</tt> commands inside the <tt>beginsim...endsim</tt> loop, replacing them with<br />
generatetrees all model=equiprobable;<br />
lscores all / shtest RELL;<br />
which generates all 105 possible trees and tests them all using the SH test. To see the output, you'll need to say <tt>monitor=yes</tt> in your <tt>beginsim</tt> command.<br />
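<br />
Putting these pieces together, the modified loop might look something like this sketch (the <tt>beginsim</tt> options other than <tt>monitor</tt> are placeholders; keep whatever settings you used before):<br />
 beginsim nreps=100 monitor=yes;<br />
   generatetrees all model=equiprobable;<br />
   lscores all / shtest RELL;<br />
 endsim;<br />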
<br />
* ''How many of the 105 trees were not significant using the SH test for 1000 sites? 5000 sites?''<br />
<br />
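If your PAUP* build supports the AU test, the change may be as small as swapping the test option on the <tt>lscores</tt> command. The option name below is an assumption, so check <tt>help lscores</tt> for the exact spelling in your version:<br />
 lscores all / autest RELL;<br />
<br />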
Change to the AU test and see if that produces a different result.</div>Paul Lewis