Phylogenetics: Distances Lab

From EEBedia
Jump to: navigation, search
Adiantum.png EEB 349: Phylogenetics
The goal of this lab exercise is to show you how to conduct various distance based analyses in PAUP* and SplitsTree

Part A: Using PAUP* to check your answers for homework #3

Open/recreate the data file

Start by checking to see if you still have the angio35.nex file you used last week. If you did not save this file, go back to last week's lab and recreate it.

Create a PAUP* command file

Use File > New in PAUP* to create a new file, create a PAUP* block with the following commands and save the file as run.nex:


begin paup;
  execute angio35.nex;
  dset dist=abs;
  delete 5-.;
  exclude missambig;  

The file run.nex tells PAUP* to:

  • use the absolute number of nucleotide differences between taxa as the distance measure (dset dist=abs);
  • delete all taxa except the first four (delete 5-.);
  • exclude all sites that have missing or ambiguous data (exclude missambig); and
  • show the distance matrix (showdist)

Execute this file in PAUP*, and examine the output.

  • Is the distance matrix output the same as the one you were given for homework #3?
  • Is the number of characters included the same as the total number of nucleotides given in homework #3?

Calculate JC distances

The next step in homework #3 was to compute the Jukes-Cantor distances. To do this in PAUP*, just change the distance measure from abs to jc and re-execute run.nex.

  • Do the JC distances that PAUP* reports match the ones you calculated by hand?
  • You have seen two distance measures that PAUP* can calculate, but how could you get a list of all the distance measures it can compute (hint: take a look back at the description of the "? option" in the PAUP* tips section of the first lab)?

Estimate edge lengths using weighted least squares

Next, perform a search using weighted least squares (weighted usually implies that the power is 2 in the denominator of the sum-of-squares formula). Add the following line to your run.nex file, just below the execute angio35.nex; line:

set criterion=distance autoclose;

This command tells PAUP* to use the distance optimality criterion specified by the objective option of the dset command during the search. If you were to leave this out, PAUP* would use the default optimality criterion (parsimony). The autoclose option tells PAUP* that it is ok to automatically close the search window when finished (saves having to press the "Close" button). Now issue the command dset ? in PAUP* to get a listing of the current values of all distance settings.

  • What is the current setting for the power option?

If your answer is not 2, then add this line to your paup block, below the execute command:

dset power=2;
  • What is the current setting for the objective option?

If your answer is not lsfit, then add this line to your paup block, also below the execute command:

dset objective=lsfit;

Finally, add the following two lines to the end of your paup block to perform the search and show the resulting best tree, including the estimated branch lengths:

describe 1 / brlens;

When you've saved the new version of run.nex, use Ctrl-r to re-execute it.

  • Are the branch lengths reported by PAUP* the same as those in "tree 2" given to you in the homework #3 assignment? (Note: just pay attention to the taxon numbers in parentheses. The names of the taxa are different: e.g. Ephedrasinica (1) = taxon1 in the homework assignment.)
  • Look for the line that begins "Weighted sum-of-squares =": does the value reported by PAUP* agree with your hand calculations?
  • In its output, PAUP* also gives you the score that would have been used were we using the minimum evolution criterion. Can you find it? Ask for help if you need to; PAUP* does not make this obvious. Write the value down for comparison with the value you will obtain in the next section.

Searching under the minimum evolution criterion

Before moving on to the next exercise, repeat the above search using the minimum evolution (or ME) criterion. To do this, you will need to add the objective=ME option to your dset command (be sure to remove the previous dset objective=lsfit if you had to put it in) and re-execute run.nex.

  • Is the result what you expected based on your answer to the last question in the previous section? (Note: it is ok if the results differ slightly in the 5th. decimal place.)


Starting (almost) from scratch, you have, by hand and also using PAUP*, computed evolutionary distances using the Jukes-Cantor model and evaluated a tree using the weighted least squares criterion!

Part B: Analysis of algae.nex

Create the data file algae.nex

Click here to get the data, then use Ctrl-a to copy all of it from your browser window and save it to a file named algae.nex in a folder on your local hard drive. These data were originally used in a 1994 study by Lockhart et al.[1] and comprise eight 16S ribosomal RNA sequences:

Anacystis a cyanobacterium (has chlorophyll a but not b or c)
Olithodiscus a chloroplast from a chromophyte alga (chlorophylls a and c)
Euglena a chloroplast from a photosynthetic euglenophyte protist (chlorophylls a and b)
Chlorella a chloroplast from a a chlorophyte green alga (chlorophylls a and b)
Chlamydomonas a chloroplast from a chlorophyte green alga (chlorophylls a and b)
Marchantia a chloroplast from a thallose liverwort (chlorophylls a and b)
Oryza a chloroplast from a monocot (chlorophylls a and b)
Nicotiana a chloroplast from a dicot (chlorophylls a and b)

All of these organisms except Anacystis and Olithodiscus have chlorophylls a and b. Based on independent evidence, it is probable that all chlorophyll a/b-containing chloroplasts have a common endosymbiotic origin, so we would expect trees constructed from these data to show a branch separating Anacystis and Olithodiscus from everything else. The cyanobacterium Anacystis uses phycobilin accessory pigments rather than chlorophylls for photosynthesis, and the chromophyte alga Olithodiscus has chlorophylls a and c (but not b).

Create a PAUP* command file.

Choose File > New from the main menu in PAUP* to create a blank text file. Copy the following text and paste it into the new text file window, then save this file as runalgae.nex, placing it in the same directory as algae.nex:

begin paup;
  log file=lab3.txt start replace;
  set torder=right autoclose;
  execute algae.nex;
  outgroup Anacystis_nidulans;
  set criterion=distance;
  ** JC Model, ME Criterion **
  dset distance=jc objective=me;
  describe 1;
  [other commands here]
  log stop;

During the course of this lab, you can simply add commands to this paup block rather than creating a paup block in the data file itself. The set torder=right command simply causes trees to be displayed so that the outgroup is at the top and the tree appears to flow to the right (this is often called ladderizing right). Try changing this to set torder=left to see what ladderizing left looks like. The set autoclose command just makes the analysis run more quickly because PAUP* closes the progress dialog box automatically after a search is done rather than requiring you to press the "Close" button.

Perform an exhaustive search under the minimum evolution and least squares criteria

Use both the JC and logdet models combined with both Minimum Evolution and Least Squares (using the default power of 2) optimality criteria. Use dset dist=? to find out how to specify the model, and use dset objective=? to find out how to specify which of the two distance-based optimality criteria to use. Set up all 4 analyses (2 models times 2 optimality criteria) in the paup block (the first one has already been done for you), then run them all at once by executing the file. Note the printable comment (it starts with an exclamation point). Making comments like this in your paup block will allow you to easily find where the results from each model start in the output.

  • Which of the 4 analyses produced an estimated tree placing the chlorophyll a/b clade together?

The main point here is that the model used can make a difference in whether you get the correct answer. The logdet model has not yet been discussed: a bit more background on models is needed before an explanation of the logdet model will make sense. For now, just know that the logdet model is one of the few models in common use that is not tricked by nucleotide composition that changes across the tree. Models like JC that assume nucleotide composition is constant tend to group taxa with similar nucleotide composition (e.g. GC-rich sequences) even if they are not closest relatives. Logdet is not so easily fooled in such circumstances, but can be tricked by other features of sequence evolution (such as among-site rate heterogeneity).

Perform a bootstrap analysis

Bootstrapping (introduced to phylogenetics by Joe Felsenstein in 1985[2]) is one of the most common methods for assessing which parts (i.e. edges) of an estimated tree are best supported and which are poorly supported by the data. In bootstrapping, many new datasets are created by sampling (with replacement) characters from the original dataset. A tree is estimated from each bootstrap replicate dataset, and at the end a consensus tree is produced that has in it all splits showing up in a majority of the bootstrap trees. The idea is that the original dataset can be thought of as one sample of characters from a vast pool of characters, and producing other data sets by resampling comes as close as possible to collecting data from other genes similar to the one for which you did collect data. The majority-rule consensus tree does not include splits that were present in fewer than half of the trees estimated from bootstrap replicate data sets, but the bipartition table that is produced by PAUP* provides the relative frequency of these less common splits.

Perform a bootstrap analysis (500 replicates, heuristic search using least squares) under the F84 model. The commands for doing this are shown below. I suggest you copy these into a new paup block in your runalgae.nex file. Be sure to disable your previous paup block by renaming it; if you change the name of a paup block slightly, for example, changing begin paup; to begin _paup, PAUP* will skip it when executing the file.
dset distance=f84 objective=lsfit power=2;
bootstrap nrep=500 search=heuristic bseed=5555;
  • Examine the bipartition table PAUP* produces and locate the bootstrap proportion for the bipartition separating the chlorophyll a/b organisms from Anacystis and Olithodiscus
  • What is the bootstrap proportion for the bipartition separating Olithodiscus and Euglena from all the other taxa?

Sometimes a split will just barely fail to make the cut (e.g. it appeared in 49% of the bootstrap trees), and so it is a good idea to check the bipartition table to make sure you don't fail to notice these. In this case, the F84 model behaves much like JC, putting the chlorophyll a/c containing Olithodiscus inside the chlorophyll a/b clade. You should have found that the bootstrap support for the split separating the chlorophyll a/b clade from the remaining two taxa is 31.4% (i.e. not very close to the 50% mark required for inclusion in the majority-rule consensus tree).

Compare NJ with a comparable heuristic search

In this exercise, you will conduct a Neighbor-joining (NJ) analysis using HKY distances and compare this with an heuristic search using the minimum evolution criterion. The goal of this section is to demonstrate that it is possible for heuristic searching to find a better tree than NJ, even using the same optimality criterion.

Please quit PAUP* and start it again. The reason for this will be made clear later, but mainly the purpose is to return all settings to their default values.

Put the commands below in a paup block in a new file. Note that we are again using the angio35.nex file:

execute angio35.nex;
dset distance=hky objective=me;
dscores 1;
hsearch start=1;
dscores 1;
  • What is the minimum evolution score for the NJ tree? (scroll down from the beginning of the PAUP* output looking for the phrase "ME-score" right after point where the NJ tree is displayed)
  • What is the minimum evolution score for the tree found by heuristic search starting with the NJ tree?
  • What is wrong with this picture? Why is the minimum evolution score of the heuristic search worse than that of the starting tree? (Hint: take a look at the "Heuristic search settings" section of the output.)

Once you have figured out what is going on (ask us for help if you are stumped), fix your paup block and re-execute the file.

In your reanalysis, you should find that the heuristic search starting with the NJ tree found a better tree using the same optimality criterion (minimum evolution) being used by NJ. Neighbor-joining is very popular, but you should be wary of NJ trees involving large numbers of taxa. This analysis involved 35 taxa; for problem of this size or larger, it is almost certain that NJ will not find the best tree.

Using SplitsTree

We will now abandon PAUP* and explore a program called SplitsTree for a moment. SplitsTree has been around for several years, but the version we are using (SplitsTree 4) is still in the beta test stage, so just be aware that you will notice a few glitches. If you need to use SplitsTree for a paper, and version 4 does not do the job for you, you can still download older versions (e.g. 3.2) that are much more stable at the SplitsTree home page.

SplitsTree is the primary program used to perform split decomposition, which I discussed in this lecture. If there is no shortcut for SplitsTree on your desktop, you can make one by going to Start > My Computer > C: drive > Program Files > SplitsTree and dragging (with the right mouse button) the program icon to your desktop. When you release the right mouse button, "Create shortcut here" should be one of the options presented to you.

Start SplitsTree by double-clicking its icon. After it loads, use File > Open... to navigate to the Examples folder and open bees.nex. If you are having trouble finding this file, use the Look In list box in the File Open dialog to go all the way up to My Computer, then come down through Local Disk (C:), then Program Files, then SplitsTree, and finally look in the Examples subdirectory. Note the webbiness, which indicates that some characters prefer one tree, and other characters prefer a different tree.

In some cases, changing the model can help make the distances more additive, and SplitsTree can be used to visually explore the effects of applying different models. Try changing to the LogDet model by choosing Distances > Logdet from the main menu. Press the "Apply" button when it appears, then take a look at the tree again.

  • Did correcting the Hamming distances (raw dissimilarity) using the Logdet model help to make the distances more additive?
Map of NZ showing major cities. Click to enlarge
Most phylogeny programs will not make it clear when a tree is not the best way to represent the data. SplitsTree excels at this. Open the example file colors.nex. The distance matrix provided in this case comprises pairwise distances between different colors, which you would not expect to be expecially treelike. The example file south.nex contains the pairwise distances between cities on the south island of New Zealand (plus Wellington, which is the southernmost city on the north island). A quick look at the map of New Zealand on the right will convince you that the "tree" shown by SplitsTree is actually a fairly good map of the south island of New Zealand!

SplitsTree makes use of the NEXUS format, so you can create a distance matrix in PAUP* and then read it in to SplitsTree. SplitsTree uses several types of blocks that PAUP* does not recognize, however, including Splits, st_graph and st_assumptions. Some of these blocks are added to your data file when you execute it, which is why when you try to close a window SplitsTree asks you whether you want to save the modified data file.

Click on the "Data" tab, then expand Taxa and Splits. The Splits section is where you will find the weights for each split shown in the tree, in case you want to report these in a publication. Previous versions of SplitsTree would show these weights on the tree itself if you asked, but thus far I have been unsuccessful at getting the Java version to do this. The Java version does do a nice job of saving the tree in various graphics file formats, including jpg, eps, svg, gif or png. Use the File > Export Image... menu command to create a file of one of these types.

Checking your hand-calculated split decomposition

The first step to checking the hand-calculated splitstree you did for homework #3 is to create a NEXUS file containing a distances block. This file can be read into the program SplitsTree and the split decomposition calculated for comparison. Below is what a NEXUS file containing a distances block looks like. (The taxa block is also required by the SplitsTree program.) Be sure to replace the distances in this file with the JC distances you calculated by hand! The diagonal elements should be left as 0.0, however.


begin taxa;
  dimensions ntax=4;
  taxlabels taxon1 taxon2 taxon3 taxon4;

begin distances;
  dimensions ntax=4;
  format diagonal labels triangle=lower;
    taxon1 0.0
    taxon2 0.12 0.0
    taxon3 0.13 0.23 0.0
    taxon4 0.14 0.24 0.34 0.0

After replacing the pairwise distances with the JC distances you calculated for homework #3, save the file as homework3.nex and open it in SplitsTree. You will see what looks like a splitstree appear after the homework3.nex file is executed, but this is actually a NeighborNet tree rather than a split decomposition tree. In this simple case, they are identical, but you should choose Networks > SplitDecomposition to switch to the split decomposition tree (if only so that you will know how to do this in the future). Hit Apply when the dialog box appears, and then you can close the Networks dialog box.

Now you should enlarge the central box in your tree (more appropriately called a network in SplitsTree) by clicking repeatedly on the + magnifying glass button at the top of the window (or, alternatively, choosing View > Zoom In repeatedly from the main menu). Once the box occupies a substantial proportion of the screen, center in in the window and choose View > Nodes and Edges... from the main menu. Once the Nodes and Edges... dialog box appears, click on the Edges tab and make sure that the Weights checkbox is checked. Now close the Nodes and Edges... checkbox, and view the tree again. Try clicking on one side of the box to see which numerical value goes with that split. You should find that the values are the same as those you computed for homework #3.

References cited

  1. Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Molecular Biology and Evolution 11:605-612.
  2. Felsenstein, J. 1985. Confidence intervals on phylogenies: an approach using the bootstrap. Evolution 39:783-791.