Phylogenetics: Distances Lab

From EEBedia
Revision as of 18:49, 4 February 2007 by PaulLewis (Talk | contribs)

Jump to: navigation, search
Under construction.png This article is still under construction.
Expect it to change frequently until this notice is removed.
Adiantum.png EEB 349: Phylogenetics
The goal of this lab exercise is to show you how to conduct various distance based analyses in PAUP* and SplitsTree

Part A: Using PAUP* to check your answers for homework #3

Open/recreate the data file

Start by checking to see if you still have the angio35.nex file you used last week. If you did not save this file, go back to last week's lab and recreate it.

Create a PAUP* command file

Use File > New in PAUP* to create a new file, create a PAUP* block with the following commands and save the file as run.nex:

#nexus

begin paup;
  execute angio35.nex;
  dset dist=abs;
  delete 5-.;
  exclude missambig;  
  showdist;
end;

The file run.nex tells PAUP* to:

  • use the absolute number of nucleotide differences between taxa as the distance measure (dset dist=abs);
  • delete all taxa except the first four (delete 5-.);
  • exclude all sites that have missing or ambiguous data (exclude missambig); and
  • show the distance matrix (showdist)

Execute this file in PAUP*, and examine the output.

  • Is the distance matrix output the same as the one you were given for homework #3?
  • Is the number of characters included the same as the total number of nucleotides given in homework #3?

Calculate JC distances

The next step in homework #3 was to compute the Jukes-Cantor distances. To do this in PAUP*, just change the distance measure from abs to jc and re-execute run.nex.

  • Do the JC distances that PAUP* reports match the ones you calculated by hand?
  • You have seen two distance measures that PAUP* can calculate, but how could you get a list of all the distance measures it can compute (hint: take a look back at the description of the "? option" in the PAUP* tips section of the first lab)?

Estimate edge lengths using weighted least squares

Next, perform a search using weighted least squares (weighted usually implies that the power is 2 in the denominator of the sum-of-squares formula). Issue the command dset ? to get a listing of the current values of all distance settings.

  • What is the current setting for the power option?
  • What is the current setting for the objective</tt) option?

Add the following line to your run.nex file, just below the <tt>execute angio35.nex; line:

set criterion=distance autoclose;

This command tells PAUP* to use the distance optimality criterion specified by the objective option of the dset command during the search. If you were to leave this out, PAUP* would use the default optimality criterion (parsimony). The autoclose option tells PAUP* that it is ok to automatically close the search window when finished (saves having to press the "Close" button). Also, add the following two lines to the end of your paup block to perform the search and show the resulting best tree, including the estimated branch lengths:

alltrees;
describe 1 / brlens;

When you've saved the new version of run.nex, use Ctrl-r to re-execute it.

  • Are the branch lengths reported by PAUP* the same as those in "tree 2" given to you in the homework #3 assignment? (Note: just pay attention to the taxon numbers in parentheses. The names of the taxa are different: e.g. Ephedrasinica (1) = taxon1 in the homework assignment.)
  • Look for the line that begins "Weighted sum-of-squares =": does the value reported by PAUP* agree with your hand calculations?
  • In its output, PAUP* also gives you the score that would have been used were we using the minimum evolution criterion. Can you find it? Ask for help if you need to; PAUP* does not make this obvious. Write the value down for comparison with the value you will obtain in the next section.

Searching under the minimum evolution criterion

Before moving on to the next exercise, repeat the above search using the minimum evolution (or ME) criterion. To do this, you will need to add the objective=ME option to your dset command and re-execute run.nex.

  • Is the result what you expected based on your answer to the last question in the previous section? (Note: it is ok if the results differ slightly in the 5th. decimal place.)

Congratulations!

Starting (almost) from scratch, you have, by hand and also using PAUP*, computed evolutionary distances using the Jukes-Cantor model and evaluated a tree using the weighted least squares criterion!

Part B: Analysis of algae.nex

Create the data file algae.nex

Click here to get the data, then use Ctrl-a to copy all of it from your browser window and save it to a file named algae.nex in a folder on your local hard drive. These data were originally used in a 1994 study by Lockhart et al.[1] and comprise eight 16S ribosomal RNA sequences:

Anacystis a cyanobacterium (has chlorophyll a but not b or c)
Olithodiscus a chloroplast from a chromophyte alga (chlorophylls a and c)
Euglena a chloroplast from a photosynthetic euglenophyte protist
Chlorella a chloroplast from a a chlorophyte green alga
Chlamydomonas a chloroplast from a chlorophyte green alga
Marchantia a chloroplast from a thallose liverwort (non-vascular bryophyte land plant)
Oryza a chloroplast from a monocot (the flowering plant rice)
Nicotiana a chloroplast from a dicot (the flowering plant tobacco)

All of these organisms except Anacystis and Olithodiscus have chlorophylls a and b. It is probable (based on independent evidence) that all chlorophyll a/b-containing chloroplasts have a common endosymbiotic origin, so we would expect trees constructed from these data to show a branch separating Anacystis and Olithodiscus from everything else. The cyanobacterium Anacystis uses phycobilin accessory pigments rather than chlorophylls for photosynthesis, and the chromophyte alga Olithodiscus has chlorophylls a and c (but not b).

Create a PAUP* command file.

Instead of executing the file algae.nex directly in PAUP*, start PAUP* (cancel the dialog box that appears) and choose File > New from the main menu to create a blank text file. Copy the following text and paste it into the new text file window, then save this file as runalgae.nex, placing it in the same directory as algae.nex:
#nexus

begin paup;
  log file=lab3.txt start replace;
  set torder=right autoclose;
  execute algae.nex;
  outgroup Anacystis_nidulans;
  set criterion=distance;
  
  [!
  ******************************
  ** JC69 Model, ME Criterion **
  ******************************
  ]
  dset distance=jc objective=me;
  alltrees;
  describe 1;
  
  [other commands here]
  log stop;
end;

During the course of this lab, you can simply add commands to this paup block rather than creating a paup block in the data file itself. The set torder=right command simply causes trees to be displayed so that the outgroup is at the top and the tree appears to flow to the right (this is often called ladderizing right). Try changing this to set torder=left to see what ladderizing left looks like. The set autoclose command just makes the analysis run more quickly because PAUP* closes the progress dialog box automatically after a search is done rather than requiring you to press the "Close" button.

Perform an exhaustive search under the minimum evolution and least squares criteria

Use both the JC and logdet models combined with both Minimum Evolution and Least Squares (using the default power of 2) optimality criteria. Use dset dist=? to find out how to specify the model, and use dset objective=? to find out how to specify which of the two distance-based optimality criteria to use. Set up all 4 analyses (2 models times 2 optimality criteria) in the paup block (the first one has already been done for you), then run them all at once by executing the file. Note the printable comment (it starts with an exclamation point). Making comments like this in your paup block will allow you to easily find where the results from each model start in the output.

  • Which of the 4 analyses produced an estimated tree placing the chlorophyll a/b clade together?

The main point here is that the model used can make a difference in whether you get the correct answer. We will explore this dataset further using maximum likelihood in a later lab.

Perform a bootstrap analysis

Bootstrapping (introduced to phylogenetics by Joe Felsenstein in 1985[2]) is one of the most common methods used to assess which parts (i.e. edges) of an estimated tree are best supported and which are poorly supported by the data. In bootstrapping, many new datasets are created by sampling (with replacement) characters from the original dataset. A tree is estimated from each bootstrap replicate dataset, and at the end a consensus tree is produced that has in it all splits showing up in a majority of the bootstrap trees. The idea is that your original dataset can be thought of as one sampling of characters from a vast pool of characters, and producing other data sets by resampling comes as close as possible to collecting data from other genes similar to the one for which you did collect data. The majority-rule consensus tree does not include splits that were present in fewer than half of the trees estimated from bootstrap replicate data sets, but the bipartition table that is produced by PAUP* provides the relative frequency of these less common splits.

Perform a bootstrap analysis (500 replicates, heuristic search using least squares) under the F84 model. The commands for doing this are shown below. I suggest you copy these into a new paup block in your runalgae.nex file. Be sure to disable your previous paup block by renaming it; if you change the name of the other paup block slightly, for example, changing begin paup; to begin _paup, PAUP* will skip it when executing the file.
dset distance=f84 objective=lsfit power=2;
bootstrap nrep=500 search=heuristic bseed=5555;
  • Examine the bipartition table PAUP* produces and locate the bootstrap proportion for the bipartition separating the chlorophyll a/b organisms from Anacystis and Olithodiscus
  • What is the bootstrap proportion for the bipartition separating Olithodiscus and Euglena from all the other taxa?

Sometimes a split will just barely fail to make the cut (e.g. it appeared in 49% of the bootstrap trees), and so it is a good idea to check the bipartition table to make sure you don't fail to notice these.

Compare NJ with a comparable heuristic search

Conduct a Neighbor-joining (NJ) analysis using HKY distances and compare this with an exhaustive search using the minimum evolution criterion. The necessary commands are below. Put these commands in yet another new paup block in your runalgae.nex file. Again, be sure to disable all your other paup blocks so that they are not executed too:
dset distance=hky objective=me;
nj;
dscores 1;
alltrees;
dscores 1;
  • What is the minimum evolution score for the NJ tree?
  • What is the minimum evolution score for the globally optimal tree?
  • Did NJ find the globally optimal tree?

The point here is simply to make sure you are aware of the command for creating a neighbor-joining tree. Neighbor-joining is very popular, but you should be wary of NJ trees involving large numbers of taxa. For large problems, a full search starting from a NJ tree often finds trees with better minimum evolution scores than the starting (NJ) tree.

Calculate a pairwise distance

Use the following commands to create and display a matrix of pairwise dissimilarity values (the so-called "p-distances"):
dset distance=p;
showdist;
Choose one taxon pair and compute, by hand, the JC distance (d) using the formula below, where p is the p-distance (from PAUP*'s output) and log means natural logarithm:
    d = -0.75 log[1-(4/3)p]
  • What value do you get for d?
  • What value does PAUP* give for the JC distance? Use dset distance=jc; showdist; to find out.

The point of this last exercise is simply to show you how PAUP* computes evolutionary distances, at least for the simplest model. You will see where formulas like this come from in upcoming lectures.

Using SplitsTree

We will now abandon PAUP* and explore a program called SplitsTree for a moment. SplitsTree has been around for several years, but the version we are using (SplitsTree 4) is still in the beta test stage, so just be aware that you will notice a few glitches. If you need to use SplitsTree for a paper, and version 4 does not do the job for you, you can still download older versions (e.g. 3.2) that are much more stable at the SplitsTree home page.

SplitsTree is the primary program used to perform split decomposition, which I discussed in this lecture. If there is no shortcut for SplitsTree on your desktop, you can make one by going to Start | My Computer | C: drive | Program Files | SplitsTree and dragging (with the right mouse button) the program icon to your desktop. When you release the right mouse button, "Create shortcut here" should be one of the options presented to you.

Start SplitsTree by double-clicking its icon. After it loads, use File > Open... to navigate to the Examples folder and open bees.nex. Note the webbiness, which indicates that some characters prefer one tree, and other characters prefer a different tree.

In some cases, changing the model can help make the distances more additive, and SplitsTree can be used to visually explore the effects of applying different models. Try changing to the LogDet model by choosing Options > Logdet from the main menu. Press the "Apply" button when it appears, then take a look at the tree again.

  • Did correcting the Hamming distances (raw dissimilarity) using the Logdet model help to make the distances more additive?
Map of NZ showing major cities. Click to enlarge
Most phylogeny programs will not make it clear when a tree is not the best way to represent the data. SplitsTree excels at this. Open the example file colors.nex. The distance matrix provided in this case comprises pairwise distances between different colors, which you would not expect to be expecially treelike. The example file south.nex contains the pairwise distances between cities on the south island of New Zealand (plus Wellington, which is the southernmost city on the north island). A quick look at the map of New Zealand on the right will convince you that the "tree" shown by SplitsTree is actually a fairly good map of the south island of New Zealand!

SplitsTree makes use of the NEXUS format, so you can create a distance matrix in PAUP* and then read it in to SplitsTree. SplitsTree uses several types of blocks that PAUP* does not recognize, however, including Splits, st_graph and st_assumptions. Some of these blocks are added to your data file when you execute it, which is why when you try to close a window SplitsTree asks you whether you want to save the modified data file.

Click on the "Data" tab, then expand Taxa and Splits. The Splits section is where you will find the weights for each split shown in the tree, in case you want to report these in a publication. Previous versions of SplitsTree would show these weights on the tree itself if you asked, but thus far I have been unsuccessful at getting the Java version to do this. The Java version does do a nice job of saving the tree in various graphics file formats, including jpg, eps, svg, gif or png. Use the File > Export Image... menu command to create a file of one of these types.

Checking your hand-calculated split decomposition

To get a better feel for what SplitsTree is doing, let's do a split decomposition by hand and check the calculations using SplitsTree. First, create a new NEXUS data file containing the following distance matrix:
#nexus

begin taxa;
	dimensions ntax=4;
	taxlabels A B C D;
end;

begin distances;
	dimensions ntax=4;
	format labels nodiagonal triangle=lower;
	matrix
		A  
		B  135
		C  118 158
		D  138 134 131
		;
end;

Compute the split decomposition by using the formulas provided in lecture. This data matrix was contrived to be simple enough that you do not need a calculator, but you are welcome to use one. Once you have computed the weight of the two splits included, open the file you just created in SplitsTree. If you click on the "Data" tab after the file has loaded, then expand Splits, you should be able to find the two weights you calculated by hand. There are four other splits listed in addition to the two you computed by hand. These additional splits correspond to the terminal branches leading to the four taxa. I did not show you how to calculate the weights for this kind of split (splits that separate just one taxon from all the rest), but I will be happy to show you how to do it if you are curious.

References cited

  1. Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Molecular Biology and Evolution 11:605-612.
  2. Felsenstein, J. 1985. Confidence intervals on phylogenies: an approach using the bootstrap. Evolution 39:783-791.