Difference between revisions of "Phylogenetics: r8s Lab"

From EEBedia
Jump to: navigation, search
(Viewing the results)
(Using cross-validation to determine optimum smoothing)
Line 81: Line 81:
  
 
=== Using cross-validation to determine optimum smoothing ===
 
=== Using cross-validation to determine optimum smoothing ===
Let's run one more example to illustrate PL with cross-validation:
+
Let's run one more example to illustrate the Penalized Likelihood (PL) approach with cross-validation. Download the file <tt>SAMPLE_CROSSVAL</tt> using curl:
  
  alleyn [/home/tmp1/dist/sample] 14% r8s -b -f SAMPLE_CROSSVAL > crossval.txt
+
  curl -o SAMPLE_CROSSVAL http://hydrodictyon.eeb.uconn.edu/eeb349/SAMPLE_CROSSVAL
  
Before looking at the output, look at the SAMPLE_CROSSVAL file itself (use pico if you want). The part of this file that is most different from SAMPLE_SIMPLE is near the end:
+
then modify your <tt>gor8s</tt> file to look like this (note that both the first and third lines were changed):
 +
 
 +
#$ -o r8slab/crossval.txt - j y
 +
cd $HOME/r8slab
 +
r8s -b -f SAMPLE_CROSSVAL
 +
 
 +
Before looking at the output, look at the <tt>SAMPLE_CROSSVAL</tt> file itself. The part of this file that is most different from <tt>SAMPLE_SIMPLE</tt> is near the end:
  
 
  divtime method=pl algorithm=tn cvStart=0 cvInc=0.5 cvNum=8 crossv=yes;
 
  divtime method=pl algorithm=tn cvStart=0 cvInc=0.5 cvNum=8 crossv=yes;
  
Which of the four methods mentioned in lecture is being used now?
+
The <tt>divtime</tt> command will do cross-validation for 8 values of the weighting parameter (which I referred to as lambda in lecture). This parameter determines how much weight to give to the penalty function compared to the likelihood function.
  
The divtime command will do cross-validation for 8 values of the weighting parameter (determines how much weight to give to the penalty function compared to the likelihood function).
+
* ''Which 8 values of the weighting parameter will be tried?''
  
Which 8 values of the weighting parameter will be tried?
+
Now look at the output saved in the file <tt>crossval.txt</tt>
  
Now look at the output saved in the file crossval.txt, using either the more command or the pico editor.
+
* ''Can you determine which value of the weighting parameter is best?''
  
Can you determine which value of the weighting parameter is best?
+
=== PL analysis using the optimum penalty weighting ===
PL analysis using the optimum penalty weighting
+
Now put together a new analysis using the optimum penalty weight derived from the cross validation analysis. Let's simply modify the SAMPLE_CROSSVAL file:
+
  
alleyn [/home/tmp1/dist/sample] 15% cp SAMPLE_CROSSVAL myfile.nex
+
Now put together a new analysis using the optimum penalty weight derived from the cross validation analysis. Start by making a copy of the <tt>SAMPLE_CROSSVAL</tt> file, naming the copy <tt>myfile.nex</tt>:
alleyn [/home/tmp1/dist/sample] 16% pico myfile.nex
+
  
Once you have myfile.nex open in pico, comment out the last two commands (set and divtime) in the r8s block, and add your own divtime command to do the analysis one more time using the optimal value for the penalty weight:
+
cp SAMPLE_CROSSVAL myfile.nex
 +
pico myfile.nex
 +
 
 +
Once you have <tt>myfile.nex</tt> open in <tt>pico</tt>, comment out the last two commands (<tt>set</tt> and <tt>divtime</tt>) in the <tt>r8s</tt> block, and add your own <tt>divtime</tt> command to do the analysis one more time using the optimal value for the penalty weight:
  
 
  #NEXUS
 
  #NEXUS
Line 119: Line 125:
 
  end;
 
  end;
  
Also add a command (after divtime and before end) to get a tree description with branch lengths expressed as times (look at SAMPLE_SIMPLE for an example of how to do this).  
+
Also add a command (after <tt>divtime</tt> and before <tt>end</tt>) to get a tree description with branch lengths expressed as times (look at <tt>SAMPLE_SIMPLE</tt> for an example of how to do this).  
Using TreeView to visualize/print/copy trees
+
 
We will now copy the tree description to a file on your local computer and open the file in Rod Page's TreeView program to view it. TreeView is a free program available (from here) for MacIntosh, Windows and Linux operating systems. The first step is to create an empty file into which we will paste the tree description. To do this, right-click on some empty space in your directory. From the menu that pops up, choose New > Text Document. Rename this newly created file r8s.tre. Double-click the file to open it.
+
Remember to modify your gor8s script to account for the fact that you want to run myfile.nex this time rather than <tt>SAMPLE_CROSSVAL</tt>:
 +
 
 +
#$ -o r8slab/myfile.txt - j y
 +
cd $HOME/r8slab
 +
r8s -b -f myfile.nex
 +
 
 +
Use <tt>qsub</tt> to start the analysis.
 +
 
 +
=== Using TreeView to visualize/print/copy trees ===
  
In the output of r8s after you run the above file, there should be a tree description at the very end. Use the cat command to show this tree description to you in the PuTTY console window (cat just displays the entire file, not stopping like more).
+
The next step will be to copy the tree description to a file on your local computer and open the file in Rod Page's TreeView program to view it. [http://taxonomy.zoology.gla.ac.uk/rod/treeview.html TreeView] is a free program available for MacIntosh, Windows and Linux operating systems. The first step is to create an empty file into which we will paste the tree description. To do this, right-click on some empty space in your directory (on your local PC machine). From the menu that pops up, choose ''New > Text Document''. Rename this newly created file <tt>r8s.tre</tt>. Open the file for editing in PAUP* (or any another program that can edit plain text files: for example, the Notepad program that comes with Windows will work).
  
Select the tree description (start selecting at the first left parenthesis and include the semicolon at the end, but leave out the "tree PAUP_1 = " part). Once it is selected, press either Ctrl-C or Ctrl-Ins to copy it to the clipboard.
+
Back on the cluster, in the output of <tt>r8s</tt> after you run the above file, there should be a tree description at the very end. Use the <tt>cat</tt> command to show this tree description to you in the PuTTY console window (<tt>cat</tt> just displays the entire file):
 +
cat myfile.txt
 +
Select the tree description (start selecting at the first left parenthesis and include the semicolon at the end, but leave out the "<tt>tree PAUP_1 =</tt> " part). Once it is selected, press either Ctrl-C or Ctrl-Ins to copy it to the clipboard.
  
Now click the title bar of the window containing your new, empty text file. Once that window has the input focus, Ctrl-V to paste in the tree description. Close the window and save the file.
+
Now paste the tree description into your <tt>r8s.tre</tt> document and save the file.
  
 
Because we provided a name for this file that ends in ".tre", TreeView will recognize it as a tree file and TreeView will start automatically when the file is double-clicked. Do that now.
 
Because we provided a name for this file that ends in ".tre", TreeView will recognize it as a tree file and TreeView will start automatically when the file is double-clicked. Do that now.
  
When TreeView opens, you should see the tree in cladogram view. To see divergence times appropriately scaled, choose Tree > Phylogram from TreeView's main menu.
+
When TreeView opens, you should see the tree in cladogram view. To see divergence times appropriately scaled, choose ''Tree > Phylogram'' from TreeView's main menu.
  
Note that you can use Ctrl-C to copy a tree from TreeView in a form that can be pasted into a graphics program such as Powerpoint, Illustrator, or Word. You can also print directly from TreeView, but the results are less than spectacular. I usually paste it into Illustrator and clean it up a bit before printing or saving it as a PDF file.
+
Note that you can use Ctrl-C to copy a tree from TreeView in a form that can be pasted into a graphics program such as Microsoft Powerpoint, Adobe Illustrator, or Microsoft Word. You can also print directly from TreeView, but the results are less than spectacular. I usually paste it into Illustrator and clean it up a bit before printing or saving it as a PDF file.
  
 
[[Category:EEB courses]]
 
[[Category:EEB courses]]
 
[[Category:Phylogenetics]]
 
[[Category:Phylogenetics]]

Revision as of 15:19, 20 April 2007

Under construction.png This article is still under construction.
Expect it to change frequently until this notice is removed.
Adiantum.png EEB 349: Phylogenetics
The goals of this final lab are to learn how to: (1) set up a NEXUS file for the r8s program; (2) interpret the most important parts of the output of r8s; (4) view a tree with divergence times produced by r8s in TreeView.

Using r8s to estimate divergence times

There is no version of r8s for the Windows operating system, so we will use the version installed on the Bioinformatics cluster instead (the machines composing the cluster are all MacIntoshes running OS X).

Start the program PuTTY and connect to bbcxsrv1.biotech.uconn.edu

If you do not find PuTTY on your desktop, you can download the program again from the PuTTY website. Using PuTTY, connect to bbcxsrv1.biotech.uconn.edu.

Inspecting SAMPLE_SIMPLE

r8s is installed on the cluster, so we don't need the program itself, but we do need the sample data file that comes with r8s for the lab today. I've placed the SAMPLE_SIMPLE file that comes with r8s on a web server, and you can download it directly to the cluster using the curl program. Note that you could download the program using a web browser, then transfer the file to the cluster using PSFTP (or some other SFTP program), but using curl is easier and faster:

First, create a directory for this lab in your home directory:

mkdir $HOME/r8slab
cd $HOME/r8slab

Now, download the SAMPLE_SIMPLE data file there:

curl -o SAMPLE_SIMPLE http://hydrodictyon.eeb.uconn.edu/eeb349/SAMPLE_SIMPLE

Use the pico editor to see the first few lines of the SAMPLE_SIMPLE file:

pico SAMPLE_SIMPLE

Note that r8s uses the NEXUS format for its input files. Remember that to exit pico, you need to press the key combination Ctrl-x.

See if you can answer these questions just by looking at the SAMPLE_SIMPLE file:

  • Which NEXUS blocks are present? (Hint: you should find 2 NEXUS blocks in this file)
  • Why does r8s need to know the number of sites? (as in the "blformat nsites =952 lengths=persite" command)

Adding a node of interest using the MRCA command

Tree in SAMPLE_SIMPLE file

In the SAMPLE_SIMPLE file, these two lines define names for two interior nodes of the tree:

mrca LP marchantia pisum;
mrca ANGIO amborella pisum;

I have labeled these nodes on the tree (right). You might need to click on the figure to expand it before you will be able to read the text. Can you see how these simple mrca (Most Recent Common Ancestor) statements uniquely define interior nodes?

Create an mrca statement to create a name for the node I have labeled SEED (there are many different ways you could do this, all of which would uniquely specify the labeled node). Add your statement in the SAMPLE_SIMPLE file beneath the other two using the pico editor.

  • Which line in the data file serves to calibrate the (relaxed) clock?

Running r8s on SAMPLE_SIMPLE

Create a file (using pico) named gor8s that will serve as your qsub script for running r8s:

#$ -o simple.txt -j y
cd $HOME/r8slab
r8s -b -f SAMPLE_SIMPLE

The #$ -o simple.txt -j y on the first line is an instruction that programs composing the Sun Grid Engine software pay attention to. Normally, in unix scripts, if a pound (hash) symbol (i.e. #) is the first thing on a line, the entire line is treated as a comment and ignored. Programs such as qsub (part of the Sun Grid Engine) also ignore lines beginning with just a hash symbol, but they do pay attention to lines that directly follow the hash symbol with a dollar sign. In this case, the #$ -o simple.txt -j y on the first line is interpreted as a command by qsub. The command instructs qsub to save the output (-o) of whatever program is being run to a file named simple.txt. The -j y tells it to join (-j) any error output to the same file (simple.txt) rather than creating a separate file just to report errors. The y just means "yes", as in "yes, do join error to the other output".

The second line, cd $HOME/r8slab, says to change directories (cd) into the r8slab subdirectory of your home directory. $HOME is an "environmental" variable that is always guaranteed to expand to the full path to your home directory. You can type the command env if you want to see the values of all environmental variables currently defined. The value of any of these can be used in script by prepending the variable name with the dollar sign. Note that if your gor8s script is in your home directory, you could just say cd r8slab rather than cd $HOME/r8slab. Being explicit, however, about the location of the r8slab directory means that you could move your gor8s script anywhere and it would still work the same way.

The -f option tells r8s that the data file name is coming next, and the -b option causes r8s to run in batch mode (which simply means it will execute the specified data file and then quit, rather than producing prompts and waiting for answers).

Ok, run your r8s analysis using the qsub command:

 qsub gor8s

This run will take almost no time at all. In fact, if you type qstat immediately after your qsub command, you may not see your run because it will have already finished!

You should look for the output (simple.txt) in your home directory. It is possible to tell qsub to save simple.txt in your r8slab subdirectory by making the first line of your gor8s script look like this:

#$ -o r8slab/simple.txt -j y

Note that you cannot, however, use $HOME (or other environmental variables) in this line because qsub doesn't know how to deal with that. Just remember that the default behavior of qsub is to save your output file to your home directory.

Viewing the results

Open simple.txt in pico and take a look at the output that r8s generated. You can ignore the following warning, which appears on the first couple of lines

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

The reason for this warning is explained here, but it is easier to simply ignore the warning than to prevent it!

There is a lot of output here, so skip down to the interesting part, which is a section labeled Estimated ages and substitution rates for tree PAUP_9. This shows a big table with a line showing the age for every node in the tree.

  • What method did r8s use here? (in lecture I mentioned four methods that were implemented in this program; in this case it is using the least sophisticated of these four)
  • What is the age of the node labeled LP? Is this age surprising, or could you have predicted this before running r8s?
  • What is the age of the seed plants? (look for the node you named using your mrca command)

Using cross-validation to determine optimum smoothing

Let's run one more example to illustrate the Penalized Likelihood (PL) approach with cross-validation. Download the file SAMPLE_CROSSVAL using curl:

curl -o SAMPLE_CROSSVAL http://hydrodictyon.eeb.uconn.edu/eeb349/SAMPLE_CROSSVAL

then modify your gor8s file to look like this (note that both the first and third lines were changed):

#$ -o r8slab/crossval.txt - j y
cd $HOME/r8slab
r8s -b -f SAMPLE_CROSSVAL

Before looking at the output, look at the SAMPLE_CROSSVAL file itself. The part of this file that is most different from SAMPLE_SIMPLE is near the end:

divtime method=pl algorithm=tn cvStart=0 cvInc=0.5 cvNum=8 crossv=yes;

The divtime command will do cross-validation for 8 values of the weighting parameter (which I referred to as lambda in lecture). This parameter determines how much weight to give to the penalty function compared to the likelihood function.

  • Which 8 values of the weighting parameter will be tried?

Now look at the output saved in the file crossval.txt

  • Can you determine which value of the weighting parameter is best?

PL analysis using the optimum penalty weighting

Now put together a new analysis using the optimum penalty weight derived from the cross validation analysis. Start by making a copy of the SAMPLE_CROSSVAL file, naming the copy myfile.nex:

cp SAMPLE_CROSSVAL myfile.nex
pico myfile.nex

Once you have myfile.nex open in pico, comment out the last two commands (set and divtime) in the r8s block, and add your own divtime command to do the analysis one more time using the optimal value for the penalty weight:

#NEXUS
.
.
.
collapse;
fixage taxon=lp age=420;
set ftol=1e-7 maxiter=2000;
[set verbose=0;] [suppresses huge amount of output in CV analyses]
[divtime method=pl algorithm=tn cvStart=0 cvInc=0.5 cvNum=8 crossv=yes;]
set smoothing=2.5;
divtime method=pl algorithm=tn;
end;

Also add a command (after divtime and before end) to get a tree description with branch lengths expressed as times (look at SAMPLE_SIMPLE for an example of how to do this).

Remember to modify your gor8s script to account for the fact that you want to run myfile.nex this time rather than SAMPLE_CROSSVAL:

#$ -o r8slab/myfile.txt - j y
cd $HOME/r8slab
r8s -b -f myfile.nex

Use qsub to start the analysis.

Using TreeView to visualize/print/copy trees

The next step will be to copy the tree description to a file on your local computer and open the file in Rod Page's TreeView program to view it. TreeView is a free program available for MacIntosh, Windows and Linux operating systems. The first step is to create an empty file into which we will paste the tree description. To do this, right-click on some empty space in your directory (on your local PC machine). From the menu that pops up, choose New > Text Document. Rename this newly created file r8s.tre. Open the file for editing in PAUP* (or any another program that can edit plain text files: for example, the Notepad program that comes with Windows will work).

Back on the cluster, in the output of r8s after you run the above file, there should be a tree description at the very end. Use the cat command to show this tree description to you in the PuTTY console window (cat just displays the entire file):

cat myfile.txt

Select the tree description (start selecting at the first left parenthesis and include the semicolon at the end, but leave out the "tree PAUP_1 = " part). Once it is selected, press either Ctrl-C or Ctrl-Ins to copy it to the clipboard.

Now paste the tree description into your r8s.tre document and save the file.

Because we provided a name for this file that ends in ".tre", TreeView will recognize it as a tree file and TreeView will start automatically when the file is double-clicked. Do that now.

When TreeView opens, you should see the tree in cladogram view. To see divergence times appropriately scaled, choose Tree > Phylogram from TreeView's main menu.

Note that you can use Ctrl-C to copy a tree from TreeView in a form that can be pasted into a graphics program such as Microsoft Powerpoint, Adobe Illustrator, or Microsoft Word. You can also print directly from TreeView, but the results are less than spectacular. I usually paste it into Illustrator and clean it up a bit before printing or saving it as a PDF file.