Difference between revisions of "Phylogenetics: Searching Lab"

From EEBedia
Jump to: navigation, search
(Conduct a second search with SPR swapping)
(Exhaustive parsimony search)
(37 intermediate revisions by 2 users not shown)
Line 2: Line 2:
 
|-
 
|-
 
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]
 
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span>
+
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span>
 
|-
 
|-
 
|This lab explores different search strategies under the parsimony criterion
 
|This lab explores different search strategies under the parsimony criterion
Line 8: Line 8:
  
 
=== Getting started ===
 
=== Getting started ===
Log into your account on the cluster (<tt>ssh username@bbcxsrv1.biotech.uconn.edu</tt>). If you do not have the [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/angio35.nex <tt>angio35.nex</tt>] file, you can recreate it as follows:
+
Log into your account on the cluster (<tt>ssh username@bbcsrv3.biotech.uconn.edu</tt>). Type the following:
 +
qlogin
 +
This asks the scheduler to find a node (computer) in the cluster that is currently not busy. It should transfer your session from the head node to a different node. The reason we are using qlogin today is that some of the analyses we are going to run take more than a second or two to run. If all of you ran these jobs on the head node, all of us would notice a significant slowdown in response time (as would anyone else trying to use the cluster).
 +
 
 +
Once you see the prompt, type
 +
module load paup/current
 +
This will make the most recent installed version available to you. Without this line, typing <tt>paup</tt> will probably start a version of the program that is (perhaps) years old!
 +
 
 +
=== Download a copy of the data file ===
 +
If you do not have the [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/angio35.nex <tt>angio35.nex</tt>] file, you can recreate it as follows:
 
  curl <nowiki>http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/angio35.nex</nowiki> > angio35.nex
 
  curl <nowiki>http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/angio35.nex</nowiki> > angio35.nex
 
The word curl stands for "Copy URL" and the program curl fetches the contents of the specified web site (which in this case is not really a web site, it is just a data file). Without the trailing "<tt>> angio35.nex</tt>", you would see the contents of the <tt>angio35.nex</tt> file simply spill across your screen. That "<tt>></tt>" symbol followed by a file name causes the output of curl to be captured and redirected to a file named <tt>angio35.nex</tt>.
 
The word curl stands for "Copy URL" and the program curl fetches the contents of the specified web site (which in this case is not really a web site, it is just a data file). Without the trailing "<tt>> angio35.nex</tt>", you would see the contents of the <tt>angio35.nex</tt> file simply spill across your screen. That "<tt>></tt>" symbol followed by a file name causes the output of curl to be captured and redirected to a file named <tt>angio35.nex</tt>.
  
 
=== Create a command file ===
 
=== Create a command file ===
Now start pico (do not specify a file name because you will be creating a new file) and enter the following text, saving the file as <tt>run.nex</tt>:
+
Now start the nano editor (do not specify a file name because you will be creating a new file) and enter the following text, saving the file as <tt>run.nex</tt>:
 
  #nexus
 
  #nexus
 
   
 
   
Line 20: Line 29:
 
   execute angio35.nex;
 
   execute angio35.nex;
 
  end;
 
  end;
There are at least a couple of advantages to creating little NEXUS files like <tt>run.nex</tt>. For now, the only advantage is that executing <tt>run.nex</tt> automatically starts a log file so that you will have a record of what you did. Later, when you get in the habit of putting commands in paup blocks, you will appreciate the separation of the data from the commands that initiate analyses. I have many times opened a data file, forgetting about the embedded paup block that starts a long search, overwrites my previous log file, and otherwise creates havoc!
+
There are at least a couple of advantages to creating little NEXUS files like <tt>run.nex</tt>. For now, the only advantage is that executing <tt>run.nex</tt> automatically starts a log file so that you will have a record of what you did. Later, when you get in the habit of putting commands in paup blocks, you will appreciate the fact that your data are separated from the commands that initiate analyses. I have many times opened a data file, forgetting about the embedded paup block that starts a long search and immediately overwrites my previous output files!
  
Note that because we used the <tt>replace</tt> keyword in the <tt>log</tt> command, the file <tt>output.txt</tt> will be overwritten without warning if it exists. Yes, this is called ''living dangerously'' but if you plan to run your analysis on the cluster, it is important to keep PAUP from asking you questions after you submit your run via <tt>qsub</tt>.
+
Note that because we used the <tt>replace</tt> keyword in the <tt>log</tt> command, the file <tt>output.txt</tt> will be overwritten without warning if it exists. Yes, this is called ''living dangerously'' but saves some frustration if a run must be restarted.
 +
 
 +
<div style="background-color: #ddddff">Add PAUP commands to this file as you work through the lab today. For each section, add new commands and comment out ones you do not want executed (using [square brackets]). Then, kill PAUP (using the <tt>quit</tt> command) and restart it using your new <tt>run.nex</tt> file</div>
  
 
=== Delete all taxa except the first five ===
 
=== Delete all taxa except the first five ===
Line 39: Line 50:
 
   execute angio35.nex;
 
   execute angio35.nex;
 
   delete 6-.;
 
   delete 6-.;
   alltrees;
+
   alltrees fd=barchart;
 
  end;
 
  end;
 
Start paup, specifying <tt>run.nex</tt> as the file to execute:
 
Start paup, specifying <tt>run.nex</tt> as the file to execute:
 
  paup run.nex
 
  paup run.nex
This analysis should go fast because you now have only 5 taxa.  
+
This analysis should go fast because you now have only 5 taxa. The <tt>fd=barchart</tt> setting tells PAUP to output a bar chart showing the distribution parsimony scores.  
  
<div style="background-color: #eeeeff; padding: 5px;">How many separate tree topologies did PAUP* examine? What is the parsimony treelength of the best tree? The worst tree? How many steps separate the best tree from the next best?</div>
+
<div style="background-color: #ddddff; padding: 5px;">How many separate tree topologies did PAUP* examine? What is the parsimony treelength of the best tree? The worst tree? How many steps separate the best tree from the next best? (consult the bar chart to determine the answer)</div>
  
 
=== Determine NNI rearrangements ===  
 
=== Determine NNI rearrangements ===  
Because we performed an exhaustive enumeration, we now know which tree is the globally most parsimonious tree. We are thus guaranteed to never find a better tree were we to start an heuristic search with this tree. Let's do an experiment: perform an NNI heuristic search, starting with the best tree, and have PAUP* save all the trees it encounters in this search. In the end, PAUP* will have in memory 5 trees: the starting tree and the 4 trees corresponding to all possible NNI rearrangements of that starting tree.
+
Because we performed an exhaustive enumeration, we now know which tree is the globally most parsimonious tree. We are thus guaranteed to never find a better tree were we to start an heuristic search with this tree. Let's do an experiment: perform an NNI heuristic search, starting with the best tree, and have PAUP* save all the trees it encounters in this search. In the end, PAUP* will have 5 trees in memory: the starting tree and the 4 trees corresponding to all possible NNI rearrangements of that starting tree.
  
Before you start the NNI search, use the <tt>showtree</tt> command to show you the tree obtained from the exhaustive enumeration. Draw this tree on a piece of paper and then draw the 4 possible NNI rearrangements (refer to the description of NNI in [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/lectures/2_Searching.pdf my lecture slides] if you've forgotten).
+
Before you start the NNI search, use the <tt>showtree</tt> command to show you the tree obtained from the exhaustive enumeration. Draw this tree on a piece of paper and then draw the 4 possible NNI rearrangements (refer to the description of NNI in your lecture notes if you've forgotten).
  
 
=== Perform an NNI search ===
 
=== Perform an NNI search ===
  
Issue this command at the <tt>paup></tt> prompt:
+
Add an <tt>hsearch</tt> and <tt>describe</tt> command to your <tt>run.nex</tt> file, which should afterwards look like this:
  hsearch start=1 swap=nni nbest=15;
+
  #nexus
 +
 +
begin paup;
 +
  log file=output.txt start replace;
 +
  execute angio35.nex;
 +
  delete 6-.;
 +
  alltrees;
 +
  hsearch start=1 swap=nni nbest=15;
 +
  describe all;
 +
end;
 +
The <tt>hsearch</tt> command is broken down as follows:
 
* <tt>start=1</tt> starts the search from the tree currently in memory (i.e., the best tree resulting from your exhaustive search using the parsimony criterion)
 
* <tt>start=1</tt> starts the search from the tree currently in memory (i.e., the best tree resulting from your exhaustive search using the parsimony criterion)
 
* <tt>swap=nni</tt> causes the Nearest-Neighbor Interchange (NNI) method to be used for branch swapping
 
* <tt>swap=nni</tt> causes the Nearest-Neighbor Interchange (NNI) method to be used for branch swapping
 
* <tt>nbest=15</tt> saves the 15 best trees found during the search. Thus, were PAUP to examine every possible tree, we would end up saving all of them in memory. The reason this command is needed is that PAUP ordinarily does not save trees that are worse than the best one it has seen thus far. Here, we are interested in seeing the trees that are examined during the course of the search, even if they are not as good as the starting tree.
 
* <tt>nbest=15</tt> saves the 15 best trees found during the search. Thus, were PAUP to examine every possible tree, we would end up saving all of them in memory. The reason this command is needed is that PAUP ordinarily does not save trees that are worse than the best one it has seen thus far. Here, we are interested in seeing the trees that are examined during the course of the search, even if they are not as good as the starting tree.
  
=== Show all 5 trees in memory ===
+
The <tt>describe all</tt> command plots the 5 trees currently in memory. The reason we are using the <tt>describe</tt> command rather than the <tt>showtrees</tt> command is because we want PAUP to show us the numbers it has assigned to the internal nodes, something that <tt>showtrees</tt> doesn't do.
Use the <tt>describe all</tt> command to plot the 5 trees currently in memory. The reason we are using the <tt>describe</tt> command rather than the <tt>showtrees</tt> command is because we want PAUP to show us the numbers it has assigned to the internal nodes, something that <tt>showtrees</tt> doesn't do.
+
  
<div style="background-color: #eeeeff; padding: 5px;">Which tree was the original tree? Which trees correspond to NNI rearrangments of which internal edges on the original tree?</div>
+
<div style="background-color: #ddddff; padding: 5px;">Which tree was the original tree? Which trees correspond to NNI rearrangments of which internal edges on the original tree?</div>
  
 
=== Find the most parsimonious tree for all 35 taxa ===  
 
=== Find the most parsimonious tree for all 35 taxa ===  
Restore all taxa using the <tt>restore all</tt> command (this will wipe out the 5 trees you currently have stored in memory, but that is ok), then conduct a heuristic search having the following characteristics:
+
Modify your <tt>run.nex</tt> file to conduct a heuristic search on ''all 35 taxa'' having the following characteristics:
* The starting trees are each generated by the stepwise addition method, using random addition of sequences (you will employ the <tt>addseq</tt> and <tt>search</tt> keywords for this)
+
* The starting trees are each generated by the stepwise addition method, using random addition of sequences (you will employ the <tt>addseq</tt> and <tt>start</tt> keywords for this)
 
* Swap using NNI branch swapping (you will employ the <tt>swap</tt> keyword for this)
 
* Swap using NNI branch swapping (you will employ the <tt>swap</tt> keyword for this)
 
* Reset the <tt>nbest</tt> option to <tt>all</tt> because we want to be saving just the best trees, not suboptimal trees (yes, this option is a little confusing).
 
* Reset the <tt>nbest</tt> option to <tt>all</tt> because we want to be saving just the best trees, not suboptimal trees (yes, this option is a little confusing).
Line 78: Line 98:
 
  hsearch start=stepwise addseq=random swap=nni nbest=all rseed=5555 nreps=500;
 
  hsearch start=stepwise addseq=random swap=nni nbest=all rseed=5555 nreps=500;
 
-->
 
-->
<div style="background-color: #eeeeff; padding: 5px;">How many tree islands were found? How long did the search take? How many rearrangements were tried?</div>
+
Remember you can comment out portions of your Nexus file if you don't want to lose them: e.g.,
 +
#nexus
 +
 +
begin paup;
 +
  log file=output.txt start replace;
 +
  execute angio35.nex;
 +
  [delete 6-.;]
 +
  [alltrees;]
 +
  hsearch [put your new options here];
 +
  [describe all;]
 +
end;
 +
<div style="background-color: #ddddff; padding: 5px;">How many tree islands were found? How long did the search take? How many rearrangements were tried?</div>
  
 
=== Conduct a second search with SPR swapping ===  
 
=== Conduct a second search with SPR swapping ===  
 
Construct another heuristic search using SPR branch swapping. Be sure to reset the random number seed to 5555.
 
Construct another heuristic search using SPR branch swapping. Be sure to reset the random number seed to 5555.
<div style="background-color: #eeeeff; padding: 5px;">How many tree islands were found? What are the scores of the trees in each island? How long did the search take this time? How many rearrangements were tried?</div>
+
<div style="background-color: #ddddff; padding: 5px;">How many tree islands were found? What are the scores of the trees in each island? How long did the search take this time? How many rearrangements were tried?</div>
  
 
=== Now conduct a third search with TBR swapping ===
 
=== Now conduct a third search with TBR swapping ===
<div style="background-color: #eeeeff; padding: 5px;">How many tree islands were found? What are the scores of the trees in each island? How long did the search take this time?How many rearrangements were tried? How many trees are currently in memory (use the treeinfo command)? Has PAUP* saved trees from all islands discovered during this search?</div>
+
<div style="background-color: #ddddff; padding: 5px;">How many tree islands were found? What are the scores of the trees in each island? How long did the search take this time?How many rearrangements were tried? How many trees are currently in memory (use the <tt>treeinfo</tt> command)? Has PAUP saved trees from all islands discovered during this search? (Hint: compare "Number of trees retained" to the sum of the "Size" column in the Tree-island profile.) Explain why PAUP saved the number of trees it did.</div>
  
(Hint: compare "Number of trees retained" to the sum of the "Size" column in the Tree-island profile.) Explain why PAUP* saved the number of trees it did.<br/><br/>Wondering about that warning "Multiple hits on islands of unsaved trees may in fact represent different islands"? When PAUP* encounters a new island, it will find all trees composing that particular island in the process of branch swapping. Thus, if (in a new search) it encounters any trees already stored in memory, it knows that it has hit an island that it found previously. Note that it would be pointless to continue on this tack, because it will only find all the trees on that island again. For trees retained in memory, PAUP* can keep track of which island they belong to (remember that it is possible for trees with the same parsimony score to be in different tree islands!). But for trees that are not retained in memory, PAUP* only knows that it has encountered an island of trees having score X; it has no way of finding out how many islands are actually represented amongst the trees having score X.
+
Wondering about this warning?
<div style="background-color: #eeeeff; padding: 5px;">Of the three types of branch swapping (NNI, SPR, TBR), which is the most thorough? Least thorough? How many TBR rearrangements can PAUP* examine on the computer you are using in one second (on average)? Based on this, how long would it take to examine all possible trees if only 20 sequences were included?</div>
+
''Multiple hits on islands of unsaved trees may in fact represent different islands''
 +
When PAUP encounters a new island, it will find all trees composing that particular island in the process of branch swapping. If, in a new search, it encounters any trees already stored in memory, it knows that it has hit an island that it found previously. Note that it would be pointless to continue on this tack, because it will only find all the trees on that island again. For trees retained in memory, PAUP can keep track of which island they belong to (remember that it is possible for trees with the same parsimony score to be in different tree islands!). But for trees that are not retained in memory, PAUP only knows that it has encountered an island of trees having score X; it has no way of finding out how many islands are actually represented amongst the trees having score X.
 +
<div style="background-color: #ddddff; padding: 5px;">Of the three types of branch swapping (NNI, SPR, TBR), which is the most thorough? Least thorough? How many TBR rearrangements can PAUP examine on the computer you are using in one second (on average)? Based on this, how long would it take to examine all possible trees if only 20 sequences were included?</div>
 +
 
 +
=== Compare NJ with a comparable heuristic search ===
 +
In this exercise, you will conduct a Neighbor-joining (NJ) analysis using HKY distances and compare this with an heuristic search using the minimum evolution criterion. The goal of this section is to demonstrate that it is possible for heuristic searching to find a better tree than NJ, even using the same optimality criterion.
 +
 
 +
Please quit PAUP* (if it is still running) and start it again. The purpose of restarting is to return all settings to their default values. You can also simply type the command <tt>factory</tt>, which resets all of PAUP's settings to their factory defaults.
 +
 
 +
Create a new file using nano containing the following lines. Note that we are again using the angio35.nex file:
 +
#nexus
 +
 +
begin paup;
 +
  execute angio35.nex;
 +
  dset distance=hky objective=me;
 +
  nj;
 +
  dscores 1; [!*** NJ score above ***]
 +
  hsearch start=1;
 +
  dscores 1; [!*** Heuristic search score above ***]
 +
end;
 +
(The text surrounded by square brackets is a comment, and the initial exclamation point ! tells PAUP that you would like this comment to appear in the output.) Run this file in paup by typing the following at the linux prompt:
 +
paup <filename>
 +
(Of course, replace <filename> with the actual name of the file you just created.)
 +
<div style="background-color: #ddddff; padding: 5px;">What is the minimum evolution score for the NJ tree? (scroll down from the beginning of the PAUP* output looking for the phrase "ME-score" right above the point where the comment <tt>*** NJ score above ***</tt> was printed)</div>
 +
<div style="background-color: #ddddff; padding: 5px;">What is the minimum evolution score for the tree found by heuristic search starting with the NJ tree? (Look just above the comment <tt>*** Heuristic search score above ***</tt>)</div>
 +
<div style="background-color: #ddddff; padding: 5px;">What is wrong with this picture? Why is the minimum evolution score of the heuristic search worse than that of the starting tree? (Hint: take a look at the "Heuristic search settings" section of the output.)</div>
 +
 
 +
Once you have figured out what is going on (ask us for help if you are stumped), fix your paup block and re-execute the file. You may need to get PAUP to help you with the criterion setting; type the following to get PAUP to spit out the current settings, then look for criterion near the top of the list:
 +
set ?
 +
 
 +
In your reanalysis, you should find that the heuristic search starting with the NJ tree found a better tree using the same optimality criterion (minimum evolution) being used by NJ. Neighbor-joining is very popular, but you should be wary of NJ trees involving large numbers of taxa. This analysis involved 35 taxa; for problem of this size or larger, it is almost certain that NJ will not find the best tree.
 +
 
 +
<!--
 +
=== Using SplitsTree ===
 +
We will now abandon PAUP* and explore a program called SplitsTree. You will use SplitsTree on your own computer, not the cluster. Go to the [http://splitstree.org/ SplitsTree web site] and download and install a version appropriate for your computer.
 +
 
 +
Start SplitsTree by double-clicking its icon. After it loads, use ''Open...'' from the ''File'' menu to navigate to the Examples folder and open <tt>bees.nex</tt>. Note the webbiness, which indicates that some characters prefer one tree, and other characters prefer a different tree. You can make the tree bigger using your scroll wheel (or the buttons at the top of the main window), and if you select a node, the program allows you to swivel the edges of the graph.
 +
 
 +
[[Image:New_Zealand_map.PNG|right|thumb|Map of NZ showing major cities. Click to enlarge]]Most phylogeny programs will not make it clear when a tree is not the best way to represent the data. SplitsTree excels at this. Open the example file <tt>colors.nex</tt>. The distance matrix provided in this case comprises pairwise distances between different colors, which you would not expect to be especially treelike.
 +
 
 +
The example file <tt>south.nex</tt> contains the pairwise distances between cities on the south island of New Zealand (plus Wellington, which is the southernmost city on the north island). A quick look at the map of New Zealand on the right will convince you that the "tree" shown by SplitsTree is actually a fairly good map of the south island of New Zealand!
 +
 
 +
SplitsTree makes use of the NEXUS format, so you can create a distance matrix in PAUP* and then read it in to SplitsTree. SplitsTree uses several types of blocks that PAUP* does not recognize, however, including <tt>Splits</tt>, <tt>st_graph</tt> and <tt>st_assumptions</tt>. Some of these blocks are added to your data file when you execute it, which is why when you try to close a window SplitsTree asks you whether you want to save the modified data file.
 +
 
 +
Return to the <tt>bees.nex</tt> example. To show the split weights on the tree, choose ''Select Edges'' from the ''Edit'' menu, then choose ''Format Nodes and Edges...'' from the ''View'' menu. In the dialog box that appears, check the ''Weights'' checkbox beside ''Edge labels''.
 +
 
 +
The SplitsTree program allows you to save the tree in various graphics file formats, including jpg, eps, svg, gif or png. Use the ''Export Image...'' menu command from the ''File'' menu to create a file of one of these types.
 +
-->
  
 
[[Category: Phylogenetics]]
 
[[Category: Phylogenetics]]

Revision as of 19:18, 23 January 2018

Adiantum.png EEB 5349: Phylogenetics
This lab explores different search strategies under the parsimony criterion

Getting started

Log into your account on the cluster (ssh username@bbcsrv3.biotech.uconn.edu). Type the following:

qlogin

This asks the scheduler to find a node (computer) in the cluster that is currently not busy. It should transfer your session from the head node to a different node. The reason we are using qlogin today is that some of the analyses we are going to run take more than a second or two to run. If all of you ran these jobs on the head node, all of us would notice a significant slowdown in response time (as would anyone else trying to use the cluster).

Once you see the prompt, type

module load paup/current

This will make the most recent installed version available to you. Without this line, typing paup will probably start a version of the program that is (perhaps) years old!

Download a copy of the data file

If you do not have the angio35.nex file, you can recreate it as follows:

curl http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/angio35.nex > angio35.nex

The word curl stands for "Copy URL" and the program curl fetches the contents of the specified web site (which in this case is not really a web site, it is just a data file). Without the trailing "> angio35.nex", you would see the contents of the angio35.nex file simply spill across your screen. That ">" symbol followed by a file name causes the output of curl to be captured and redirected to a file named angio35.nex.

Create a command file

Now start the nano editor (do not specify a file name because you will be creating a new file) and enter the following text, saving the file as run.nex:

#nexus

begin paup;
  log file=output.txt start replace;
  execute angio35.nex;
end;

There are at least a couple of advantages to creating little NEXUS files like run.nex. For now, the only advantage is that executing run.nex automatically starts a log file so that you will have a record of what you did. Later, when you get in the habit of putting commands in paup blocks, you will appreciate the fact that your data are separated from the commands that initiate analyses. I have many times opened a data file, forgetting about the embedded paup block that starts a long search and immediately overwrites my previous output files!

Note that because we used the replace keyword in the log command, the file output.txt will be overwritten without warning if it exists. Yes, this is called living dangerously but saves some frustration if a run must be restarted.

Add PAUP commands to this file as you work through the lab today. For each section, add new commands and comment out ones you do not want executed (using [square brackets]). Then, kill PAUP (using the quit command) and restart it using your new run.nex file

Delete all taxa except the first five

Using this command

delete 6-.

will cause PAUP* to ignore all taxa except Ephedrasinica, Gnetum_gnemJS, WelwitschiaJS, Ginkgo_biloba, and Pinus_ellCH.

Exhaustive parsimony search

Use the alltrees command to conduct an exhaustive search under the parsimony criterion (parsimony is the default optimality criterion).

You should add the delete and alltrees file to the paup block in your run.nex file, which should now look like this:

#nexus

begin paup;
  log file=output.txt start replace;
  execute angio35.nex;
  delete 6-.;
  alltrees fd=barchart;
end;

Start paup, specifying run.nex as the file to execute:

paup run.nex

This analysis should go fast because you now have only 5 taxa. The fd=barchart setting tells PAUP to output a bar chart showing the distribution parsimony scores.

How many separate tree topologies did PAUP* examine? What is the parsimony treelength of the best tree? The worst tree? How many steps separate the best tree from the next best? (consult the bar chart to determine the answer)

Determine NNI rearrangements

Because we performed an exhaustive enumeration, we now know which tree is the globally most parsimonious tree. We are thus guaranteed to never find a better tree were we to start an heuristic search with this tree. Let's do an experiment: perform an NNI heuristic search, starting with the best tree, and have PAUP* save all the trees it encounters in this search. In the end, PAUP* will have 5 trees in memory: the starting tree and the 4 trees corresponding to all possible NNI rearrangements of that starting tree.

Before you start the NNI search, use the showtree command to show you the tree obtained from the exhaustive enumeration. Draw this tree on a piece of paper and then draw the 4 possible NNI rearrangements (refer to the description of NNI in your lecture notes if you've forgotten).

Perform an NNI search

Add an hsearch and describe command to your run.nex file, which should afterwards look like this:

#nexus

begin paup;
  log file=output.txt start replace;
  execute angio35.nex;
  delete 6-.;
  alltrees;
  hsearch start=1 swap=nni nbest=15;
  describe all;
end;

The hsearch command is broken down as follows:

  • start=1 starts the search from the tree currently in memory (i.e., the best tree resulting from your exhaustive search using the parsimony criterion)
  • swap=nni causes the Nearest-Neighbor Interchange (NNI) method to be used for branch swapping
  • nbest=15 saves the 15 best trees found during the search. Thus, were PAUP to examine every possible tree, we would end up saving all of them in memory. The reason this command is needed is that PAUP ordinarily does not save trees that are worse than the best one it has seen thus far. Here, we are interested in seeing the trees that are examined during the course of the search, even if they are not as good as the starting tree.

The describe all command plots the 5 trees currently in memory. The reason we are using the describe command rather than the showtrees command is because we want PAUP to show us the numbers it has assigned to the internal nodes, something that showtrees doesn't do.

Which tree was the original tree? Which trees correspond to NNI rearrangments of which internal edges on the original tree?

Find the most parsimonious tree for all 35 taxa

Modify your run.nex file to conduct a heuristic search on all 35 taxa having the following characteristics:

  • The starting trees are each generated by the stepwise addition method, using random addition of sequences (you will employ the addseq and start keywords for this)
  • Swap using NNI branch swapping (you will employ the swap keyword for this)
  • Reset the nbest option to all because we want to be saving just the best trees, not suboptimal trees (yes, this option is a little confusing).
  • Set the random number seed to 5555 using the rseed option (this determines the sequence of pseudorandom numbers used for the random additions; ordinarily you would not need to set the random number seed, but we will do this here to ensure that we all get the same results)
  • Do 500 replicate searches; each replicate represents an independent search starting from a different random-addition tree (you will use the nreps keyword for this).

Use the following command to get PAUP to list the options for hsearch:

hsearch ?

Remember you can comment out portions of your Nexus file if you don't want to lose them: e.g.,

#nexus

begin paup;
  log file=output.txt start replace;
  execute angio35.nex;
  [delete 6-.;]
  [alltrees;]
  hsearch [put your new options here];
  [describe all;]
end;
How many tree islands were found? How long did the search take? How many rearrangements were tried?

Conduct a second search with SPR swapping

Construct another heuristic search using SPR branch swapping. Be sure to reset the random number seed to 5555.

How many tree islands were found? What are the scores of the trees in each island? How long did the search take this time? How many rearrangements were tried?

Now conduct a third search with TBR swapping

How many tree islands were found? What are the scores of the trees in each island? How long did the search take this time?How many rearrangements were tried? How many trees are currently in memory (use the treeinfo command)? Has PAUP saved trees from all islands discovered during this search? (Hint: compare "Number of trees retained" to the sum of the "Size" column in the Tree-island profile.) Explain why PAUP saved the number of trees it did.

Wondering about this warning? Multiple hits on islands of unsaved trees may in fact represent different islands When PAUP encounters a new island, it will find all trees composing that particular island in the process of branch swapping. If, in a new search, it encounters any trees already stored in memory, it knows that it has hit an island that it found previously. Note that it would be pointless to continue on this tack, because it will only find all the trees on that island again. For trees retained in memory, PAUP can keep track of which island they belong to (remember that it is possible for trees with the same parsimony score to be in different tree islands!). But for trees that are not retained in memory, PAUP only knows that it has encountered an island of trees having score X; it has no way of finding out how many islands are actually represented amongst the trees having score X.

Of the three types of branch swapping (NNI, SPR, TBR), which is the most thorough? Least thorough? How many TBR rearrangements can PAUP examine on the computer you are using in one second (on average)? Based on this, how long would it take to examine all possible trees if only 20 sequences were included?

Compare NJ with a comparable heuristic search

In this exercise, you will conduct a Neighbor-joining (NJ) analysis using HKY distances and compare this with an heuristic search using the minimum evolution criterion. The goal of this section is to demonstrate that it is possible for heuristic searching to find a better tree than NJ, even using the same optimality criterion.

Please quit PAUP* (if it is still running) and start it again. The purpose of restarting is to return all settings to their default values. You can also simply type the command factory, which resets all of PAUP's settings to their factory defaults.

Create a new file using nano containing the following lines. Note that we are again using the angio35.nex file:

#nexus

begin paup; 
  execute angio35.nex;
  dset distance=hky objective=me;
  nj;
  dscores 1; [!*** NJ score above ***]
  hsearch start=1;
  dscores 1; [!*** Heuristic search score above ***]
end;

(The text surrounded by square brackets is a comment, and the initial exclamation point ! tells PAUP that you would like this comment to appear in the output.) Run this file in paup by typing the following at the linux prompt:

paup <filename>

(Of course, replace <filename> with the actual name of the file you just created.)

What is the minimum evolution score for the NJ tree? (scroll down from the beginning of the PAUP* output looking for the phrase "ME-score" right above the point where the comment *** NJ score above *** was printed)
What is the minimum evolution score for the tree found by heuristic search starting with the NJ tree? (Look just above the comment *** Heuristic search score above ***)
What is wrong with this picture? Why is the minimum evolution score of the heuristic search worse than that of the starting tree? (Hint: take a look at the "Heuristic search settings" section of the output.)

Once you have figured out what is going on (ask us for help if you are stumped), fix your paup block and re-execute the file. You may need to get PAUP to help you with the criterion setting; type the following to get PAUP to spit out the current settings, then look for criterion near the top of the list:

set ?

In your reanalysis, you should find that the heuristic search starting with the NJ tree found a better tree using the same optimality criterion (minimum evolution) being used by NJ. Neighbor-joining is very popular, but you should be wary of NJ trees involving large numbers of taxa. This analysis involved 35 taxa; for problem of this size or larger, it is almost certain that NJ will not find the best tree.