Difference between revisions of "Phylogenetics: Bioinformatics Cluster"

From EEBedia
(q)
(If you use Windows...)
 
(95 intermediate revisions by 3 users not shown)
Line 2: Line 2:
 
|-
 
|-
 
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
 
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span>
+
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span>
 
|-
 
|-
|The goal of this lab exercise is to show you how to use the [http://www.biotech.uconn.edu/bf/ Bioinformatics Facility] computer cluster to run PAUP* and GARLI.
+
|The goal of this lab exercise is to show you how to log into the [http://www.biotech.uconn.edu/bf/ Bioinformatics Facility] computer cluster and perform a basic PAUP* analysis.
 
|}
 
|}
  
== Part A: Using the UConn Bioinformatics Facility cluster ==
+
= Using the UConn Bioinformatics Facility cluster =
The Bioinformatics Facility is part of the UConn Biotechnology Center, which is located behind the Up-N-Atom Cafe in the lower level of the Biology/Physics building. [mailto:JEFFREY.LARY@uconn.edu Jeff Lary] maintains a 17-node Apple Xserve G5 Cluster that can be used by UConn graduate students and faculty to conduct bioinformatics-related research (sequence analysis, biological database searches, phylogenetics, molecular evolution). You have each been given accounts on the cluster, and today you will learn how to start analyses remotely (i.e. from this computer lab), check on their status, and download the results when your analysis is finished.
+
The UConn [http://bioinformatics.uconn.edu Computational Biology Core] is part of the [http://cgi.uconn.edu/ Center for Genome Innovation (CGI)]. [http://compgenomics.lab.uconn.edu/ We will use the BBC computing cluster located in the main UITS data center for most of the data crunching we will do in this course. By now you should have an account on the cluster, and today you will learn how to start analyses remotely (i.e. from your laptop), check on their status, and download the results when your analysis is finished.
  
=== Obtaining the necessary communications software ===
+
== Obtaining the necessary communications software ==
You will be using a couple of simple (and free) programs to communicate with the head node of the cluster. Visit the [http://www.putty.nl/download.html PuTTY web site], scroll down to the section labeled "Binaries" and save [http://the.earth.li/~sgtatham/putty/latest/x86/putty.exe putty.exe] and [http://the.earth.li/~sgtatham/putty/latest/x86/psftp.exe psftp.exe] on your desktop.  
+
You will be using a couple of simple (and free) programs to communicate with the head node of the cluster.  
  
==== PuTTY ====
+
=== If you use MacOS 10.x... ===
The program PuTTY  will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell) that encrypts everything sent over the internet. You will use PuTTY to send commands to the cluster and see the output generated. In the old days, a protocol known as Telnet was used for this purpose, but it is no longer used because it did not encrypt anything, making it easy for someone with access to the network to see your username and password in plain text.
+
==== SSH ====
 +
The program '''ssh''' will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). You will use ssh to send commands to the cluster and see the output generated.  
  
==== PSFTP ====
+
Start by opening the Terminal application, which you can find in the Applications/Utilities folder on your hard drive. Using the Terminal program, you can connect to the cluster with the following command:
The other program you will use is called PSFTP. It allows you to transfer files back and forth using SFTP (Secure File Transfer Protocol). It replaces the old protocol (FTP) that, like Telnet, sent usernames and passwords unencrypted across the network.
+
ssh username@bbcsrv3.biotech.uconn.edu
 +
where username should be replaced by your username on the cluster.
  
==== Programs vs. protocols ====
+
You may wish to install [http://www.iterm2.com/ iTerm2], which is a terminal program that makes some things easier than Terminal, but the built-in Terminal will work just fine.
SSH and SFTP are protocols, not programs. PuTTY and PSFTP are programs that implement the SSH and SFTP protocols, respectively. In a little while from now, you may be thinking "I liked FTP much better than SFTP!" because you probably used user-friendly, graphical FTP programs in the past. If you think this, then you are confusing protocols with programs that implement them. There are much fancier, easy-to-use programs for using SSH and SFTP than PuTTY and PSFTP, but these will serve us well today. One nice thing is that these programs are so small that you can just download them whenever and wherever you happen to need them. If you find yourself wanting a fancier SFTP client, check out [http://filezilla.sourceforge.net/ FileZilla] (on Windows) or [http://rsug.itd.umich.edu/software/fugu/ Fugu] (for Macs).
+
  
=== Logging in for the first time ===
+
If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create an alias for this command. Please ask your lab instructor how to do this if you are interested. 
On the whiteboard you will find your login id (user name) and password.  
+
  
Double-click the PuTTY icon on your desktop to start the program. In the ''Host Name (or IP address)'' box, type <tt>bbcxsrv1.biotech.uconn.edu</tt>. Now type <tt>Bioinformatics cluster</tt> into the ''Saved Sessions'' box and press the Save button. This will save having to type the computer's name each time you want to connect. Now click the ''Open'' button to start a session. The first time you connect, you will get a ''PuTTY Security Alert''. Just press the ''Yes'' button to close this dialog.
+
<!-- as follows (in Terminal, but '''do this before you connect to the cluster''' and '''replace "username" with your own login user id'''):
 +
cat - >> .bash_profile
 +
alias cluster="ssh username@bbcsrv3.biotech.uconn.edu"
 +
Ctrl-d
 +
The '''cat''' command copies what you type (the - means take input from the user) and appends (the >> part) to the file .bash_profile. The Ctrl-d part tells the cat command you are finished typing. After you are done, type the following to cause your new .bash_profile file to be reloaded (it is normally loaded only once when you start Terminal):
 +
source .bash_profile
 +
-->
 +
 
 +
==== SCP/SFTP ====
 +
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). I will show you how to transfer files using both methods, but for now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, but I find that the command line clients let you get your work done faster once you get used to them.
 +
 
 +
=== If you use Windows... ===
 +
==== SSH ====
 +
The program '''PuTTY''' will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). You will use PuTTY to send commands to the cluster and see the output generated.
 +
 
 +
Visit the [https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html PuTTY web site], scroll down to the section labeled "Binaries" and save both putty.exe and psftp.exe on your desktop (or use the MSI to download all the PuTTY utilities as a bundle).
 +
 
 +
Double-click the PuTTY icon on your desktop to start the program. In the ''Host Name (or IP address)'' box, type <tt>bbcsrv3.biotech.uconn.edu</tt>. Now type <tt>Bioinformatics cluster</tt> into the ''Saved Sessions'' box and press the Save button. This will save having to type the computer's name each time you want to connect. Now click the ''Open'' button to start a session. The first time you connect, you will get a ''PuTTY Security Alert''. Just press the ''Yes'' button to close this dialog.
  
 
Now you should see the following prompt:
 
Now you should see the following prompt:
Line 32: Line 49:
 
  Password:
 
  Password:
 
Type in your password and press Enter. If all goes well, you should see something like this:
 
Type in your password and press Enter. If all goes well, you should see something like this:
  Welcome to Darwin!
+
  Rocks 6.1 (Emerald Boa)
  [bbcxsrv1:~] plewis%
+
  Profile built 08:40 02-Jan-2013
 +
 +
Kickstarted 04:02 02-Jan-2013
 +
.
 +
.
 +
.
 +
[plewis@bbcsrv3 ~]$
 
except that your username should appear instead of mine (plewis).
 
except that your username should appear instead of mine (plewis).
  
The first thing you should do is change your password. Type
+
==== SCP/SFTP ====
passwd
+
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). For now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, which makes moving files back and forth easy.
and press the Enter key, then follow the directions to change your password. If you have trouble thinking up passwords that are acceptable, check out the [http://www.multicians.org/thvv/gpw.html Java Password Generator] web site. It generates passwords that are not really words but sound like they are, so they are easier to remember than completely random passwords.
+
  
 
=== Learning enough UNIX to get around ===
 
=== Learning enough UNIX to get around ===
 
I'm presuming that you do not know a lot of UNIX commands, but even if you are already a UNIX guru, please complete this section anyway because otherwise you will fail to create some files you will need later.
 
I'm presuming that you do not know a lot of UNIX commands, but even if you are already a UNIX guru, please complete this section anyway because otherwise you will fail to create some files you will need later.
 
The cluster comprises Macintosh G5 computers running MacOSX, but MacOSX is essentially a [[Wikipedia:Unix|UNIX]] operating system with a very nice user interface. But today you will not be using the nice user interface! Instead, you will be communicating using the UNIX command console, so the first step is to learn a few important UNIX commands. You can use these commands on any Macintosh running MacOSX by opening the Terminal application, which you can find in the Applications/Utilities folder on your hard drive.
 
[[Image:new.png]]If using the Mac Terminal program, you can connect to the cluster with the following command:
 
ssh username@bbcxsrv1.biotech.uconn.edu
 
where username should be replaced by your username on the cluster.
 
  
 
==== ls command: finding out what is in the present working directory ====
 
==== ls command: finding out what is in the present working directory ====
Line 70: Line 87:
 
You can always go back to your home directory (no matter how lost you get!) by typing just cd by itself
 
You can always go back to your home directory (no matter how lost you get!) by typing just cd by itself
 
  cd
 
  cd
[[Image:new.png]]If you want to go up one directory level (say from pauprun back up to your home directory), you can specify the parent directory using two dots:
+
If you want to go up one directory level (say from pauprun back up to your home directory), you can specify the parent directory using two dots:
 
  cd ..
 
  cd ..
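If you would like to practise these navigation commands without touching your real files, here is a minimal sketch that works in a throwaway directory. The <tt>scratch</tt> variable and the temporary directory are assumptions made for illustration; only the <tt>pauprun</tt> name comes from this lab.

```shell
# Sketch: practise cd, pwd and .. in a scratch directory.
scratch=$(mktemp -d)          # hypothetical stand-in for your home directory
mkdir "$scratch/pauprun"      # same directory name as used in this lab

cd "$scratch/pauprun"         # go into pauprun
in_pauprun=$(basename "$(pwd)")
echo "now in: $in_pauprun"

cd ..                         # two dots = the parent directory
after_dotdot=$(pwd)
echo "after cd ..: $after_dotdot"
```

After the final command you should be back in the directory that contains <tt>pauprun</tt>.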
  
==== Creating the gopaup file using the cat command ====
+
==== Creating run.nex using the nano editor ====
The cat command was designed for concatenating files, but I most often use it for viewing and creating files. To create a new file named <tt>gopaup</tt>, type the following (be sure to leave a space between each item, just as you see it below) and then press the Enter key
+
cat - > gopaup
+
Note that you no longer see the unix prompt, and the system appears to be hung. This is ok! The text you typed is admittedly somewhat cryptic:
+
* The hyphen (-) after the word cat means "use text typed from the console"
+
* The greater-than symbol (>) means "redirect the output to a file"
+
* The <tt>gopaup</tt> part is the name of the file to which the output will be redirected
+
The cat command is now waiting for you to type something. Type the following and, when finished, [[Image:new.png]]press the Enter key to make sure that the file ends on a blank line. Press the Ctrl-d key combination to tell cat that you are done:
+
#$ -o junk.txt -j y
+
cd $HOME/pauprun
+
paup -n run.nex
+
If you make mistakes while typing, don't fear! You can fix them later using the pico editor. Use the cat command again to view the contents of the file you just created:
+
cat gopaup
+
 
+
Basically, cat just spews out the contents of whatever you give it to work with. If you give it a file name, it spews out the contents of the file. If you give it a hyphen, it reads text you type until you press Ctrl-d, then it spews that text out again. In your case, when you used the hyphen, you also told it to redirect its output to the file gopaup, so that's why you did not see what it spewed.
+
 
+
==== Creating run.nex using the pico editor ====
+
  
Another way to create a new file, or edit one that already exists, is to use the pico editor. Most people like using pico better than cat for creating new files: the only advantage cat has over pico is that it is guaranteed to be present on every UNIX computer, whereas pico is only present on some. You will now use pico to create a run.nex file containing a paup block. You will later execute this file in PAUP* to perform an analysis.
+
One way to create a new file, or edit one that already exists, is to use the nano editor. You will now use nano to create a run.nex file containing a paup block. You will later execute this file in PAUP* to perform an analysis.
  
 
First use the pwd command to see where you are, then use cd to go into the pauprun directory if you are not already there. Type
 
First use the pwd command to see where you are, then use cd to go into the pauprun directory if you are not already there. Type
  pico run.nex
+
  nano run.nex
This will open the pico editor, and it should say <nowiki>[ New File ]</nowiki> at the bottom of the window to indicate that the run.nex file does not already exist. Note the menu of the commands along the bottom two rows. Each of these commands is invoked using the Ctrl key with the letter indicated. Thus, ^X Exit indicates that you can use the Ctrl key in combination with the letter X to exit pico.
+
This will open the nano editor, and it should say <nowiki>[ New File ]</nowiki> at the bottom of the window to indicate that the run.nex file does not already exist. Note the menu of the commands along the bottom two rows. Each of these commands is invoked using the Ctrl key with the letter indicated. Thus, ^X Exit indicates that you can use the Ctrl key in combination with the letter X to exit nano.
  
 
For now, type the following into the editor:
 
For now, type the following into the editor:
Line 106: Line 107:
 
   lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
 
   lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
 
   hsearch swap=none start=stepwise addseq=random nrep=1;
 
   hsearch swap=none start=stepwise addseq=random nrep=1;
 +
  lscores 1;
 
   lset basefreq=previous tratio=previous shape=previous;
 
   lset basefreq=previous tratio=previous shape=previous;
 
   hsearch swap=tbr start=1;
 
   hsearch swap=tbr start=1;
Line 112: Line 114:
 
   quit;
 
   quit;
 
  end;
 
  end;
Once you have entered everything, use ^X to exit. Pico will ask if you want to save the modified buffer, at which point you should press the Y key to answer yes. Pico will now ask you whether you want to use the file name run.nex; this time just press Enter to accept. Pico should now exit and you can use cat to look at the contents of the file you just created:
+
Once you have entered everything, use ^X to exit. Nano will ask if you want to save the modified '''buffer''' (a ''buffer'' is a predefined amount of computer memory used to store the text you type; the text stored in the buffer will be lost once you exit the program unless you save it to a file on the hard drive), at which point you should press the Y key to answer yes. Nano will now ask you whether you want to use the file name run.nex; this time just press Enter to accept. Nano should now exit and you can use cat to look at the contents of the file you just created:
 
  cat run.nex
 
  cat run.nex
  
=== Using PSFTP to upload the algae.nex data file ===
+
==== Create the gopaup file ====
 +
Now use nano to create a second file named <tt>gopaup</tt> in your home directory (the parent directory of the <tt>pauprun</tt> directory). To do this, type <tt>nano gopaup</tt>. This file should contain this text:
 +
#$ -S /bin/bash
 +
#$ -o junk.txt -j y
 +
cd $HOME/pauprun
 +
module load paup/current
 +
paup -n run.nex
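If you would rather not open an editor, the same five lines can probably be written in one step with a shell "here document". This is only a sketch: the <tt>workdir</tt> variable is a hypothetical scratch location used so that the example cannot overwrite anything, and on the cluster you would write to your home directory instead.

```shell
# Sketch: write gopaup without opening an editor.
# workdir is a hypothetical scratch location; on the cluster you would use $HOME.
workdir=$(mktemp -d)

# The quoted 'EOF' stops the shell from expanding $HOME while writing the file,
# so the literal text $HOME/pauprun ends up in gopaup.
cat > "$workdir/gopaup" <<'EOF'
#$ -S /bin/bash
#$ -o junk.txt -j y
cd $HOME/pauprun
module load paup/current
paup -n run.nex
EOF

cat "$workdir/gopaup"    # view the result
```

Everything between the two <tt>EOF</tt> markers is copied into the file exactly as typed.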
 +
 
 +
=== Using Cyberduck to upload the algae.nex data file ===
 +
[[File:Cyberduck bookmark.png|right]]
 +
Download the file <tt>algae.nex</tt> from [http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/algae.nex here] and save it on your hard drive.
 +
 
 +
Open Cyberduck, choose Bookmark > New Bookmark from the main menu, then fill out the resulting dialog box as shown on the right (except substitute your own user name, of course). Be sure to change the protocol to SFTP (not the default FTP). Click the X button to close and you should see your bookmark appear at the bottom of the main window. Double click the bookmark to open a connection. You will then be warned that the host key is unknown - choose Allow (and go ahead and check the Always button so you do not need to do this every time).
  
Locate the file <tt>algae.nex</tt> that we used in the previous lab. If you have deleted it, you will need to [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/Spring2005/data/algae.nex download] and save it again.
+
Once you are in, you will see a listing of the files in your home directory (if you have any). To copy the <tt>algae.nex</tt> file to the cluster, you need only drag-and-drop it onto the Cyberduck window.
  
Make sure that algae.nex is in the same place as the PSFTP program, then start PSFTP by double-clicking it.
+
===  (Mac/Linux users only) Using scp to upload the algae.nex data file ===
  
PSFTP should say something like this:
+
Mac users have two options. While you will probably want to do your file transfers with Cyberduck as described above, it is also possible to transfer files using the command line scp client (scp is also a good option for Linux users). Read on if you are interested in this option, but feel free to skip this section if you are happy using Cyberduck.
psftp: no hostname specified; use "open host.name" to connect
+
To open a connection to the cluster, type
+
open bbcxsrv1.biotech.uconn.edu
+
then supply your username and password when prompted.
+
  
To upload algae.nex to the cluster, type
+
Open the Terminal  application and navigate to where you saved the file. If you saved it on the desktop, you can go there by typing <tt>cd Desktop</tt>.
put algae.nex
+
  
If you do not see any error messages, then you can assume that the transfer worked. Type
+
Type the following to upload algae.nex to the cluster:
  quit
+
  scp algae.nex username@bbcsrv3.biotech.uconn.edu:
to exit the PSFTP program.
+
where <tt>username</tt> should be replaced by your own user name on the cluster. (Don't forget the colon on the very end of the line!)
  
 
=== A few more UNIX commands ===
 
=== A few more UNIX commands ===
  
You have now transferred a large file (algae.nex) to the cluster, but it is not in the right place. The algae.nex file should be in your home directory, whereas the run.nex file is in the pauprun directory. The run.nex file contains this line
+
You have now transferred a large file (algae.nex) to the cluster, but it is not in the right place. The algae.nex file is currently in your home directory, whereas the run.nex file is in the pauprun directory. The run.nex file contains a line containing the command <tt>execute algae.nex</tt>, which means that algae.nex should also be located in the pauprun directory. Use the following commands to ensure that (1) you are in your home directory, and (2) algae.nex is also in your home directory:
execute algae.nex
+
which means that algae.nex should also be located in the pauprun directory. Use the following commands to ensure that (1) you are in your home directory, and (2) algae.nex is also in your home directory:
+
 
   cd $HOME
 
   cd $HOME
 
   ls algae.*
 
   ls algae.*
Note the use of a wildcard character (*) in the ls command. This will show you only files that begin with the letters <tt>algae</tt> followed by a period and any number of other characters.
+
Note the use of a wildcard character (*) in the ls command. This will show you only files that begin with the letters <tt>algae</tt> followed by a period and any number of other characters. The <tt>$HOME</tt> is a predefined shell variable that will be replaced with your home directory. It is not necessary in this case - typing <tt>cd</tt> all by itself would take you to your home directory - but the <tt>$HOME</tt> variable is good to know about (especially for use in scripts).
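To see what the wildcard does and does not match, you can try it on a few empty files. This is a sketch only: apart from <tt>algae.nex</tt>, <tt>algae.ml.tre</tt> and <tt>run.nex</tt>, the file names are made up, and the <tt>demo</tt> directory is a hypothetical scratch location.

```shell
# Sketch: what the pattern algae.* matches.
demo=$(mktemp -d)             # hypothetical scratch directory
touch "$demo/algae.nex" "$demo/algae.ml.tre" "$demo/algaeX" "$demo/run.nex"

cd "$demo"
matches=$(ls algae.*)         # only names that start with "algae."
echo "$matches"

# $HOME expands to the full path of your home directory.
echo "home directory: $HOME"
```

The file <tt>algaeX</tt> is not listed because it has no period after <tt>algae</tt>, and <tt>run.nex</tt> is not listed because it does not start with <tt>algae</tt>.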
  
 
==== mv command: moving or renaming a file ====
 
==== mv command: moving or renaming a file ====
Line 149: Line 156:
  
 
==== cp command: copying a file ====
 
==== cp command: copying a file ====
The cp command copies files. It leaves the original file in place and makes a copy elsehwere. You could have used this command to get a copy of algae.nex into the directory pauprun:
+
The cp command copies files. It leaves the original file in place and makes a copy elsewhere. You could have used this command to get a copy of algae.nex into the directory pauprun:
 
  cp algae.nex pauprun
 
  cp algae.nex pauprun
 
This would have left the original in your home directory, and made a duplicate of this file in the directory pauprun.
 
This would have left the original in your home directory, and made a duplicate of this file in the directory pauprun.
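The difference between copying and moving is easy to see in a scratch directory. The following is a minimal sketch using an empty stand-in file, not the real data file; the <tt>demo</tt> directory is an assumption made for illustration.

```shell
# Sketch: cp leaves the original in place, mv does not.
demo=$(mktemp -d)             # hypothetical scratch directory
mkdir "$demo/pauprun"
touch "$demo/algae.nex"       # empty stand-in for the real file
cd "$demo"

cp algae.nex pauprun          # copy: the file now exists in both places
n_after_cp=$(ls algae.nex pauprun/algae.nex | wc -l)
echo "files after cp: $n_after_cp"

rm pauprun/algae.nex          # undo the copy
mv algae.nex pauprun          # move: only the pauprun copy remains
ls pauprun
```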
Line 165: Line 172:
 
=== Starting a PAUP* analysis ===
 
=== Starting a PAUP* analysis ===
  
If you've been following the directions in sequence, you now have two files (algae.nex and run.nex) in your <tt>$HOME/pauprun</tt> directory on the cluster, whereas the gopaup file should be in <tt>$HOME</tt>. Use the cd command to make sure you are in your home directory, then the cat command to look at the contents of the gopaup file you created earlier. You should see this:
+
If you've been following the directions in sequence, you now have two files (algae.nex and run.nex) in your <tt>$HOME/pauprun</tt> directory on the cluster, whereas the gopaup file should be in <tt>$HOME</tt>. Use the cd command to make sure you are in your home directory, then the cat command to look at the contents of the gopaup file you created earlier. You should see this:  
 +
#$ -S /bin/bash
 
  #$ -o junk.txt -j y
 
  #$ -o junk.txt -j y
 
  cd $HOME/pauprun
 
  cd $HOME/pauprun
 +
module load paup/current
 
  paup -n run.nex
 
  paup -n run.nex
This file will be used by software called the Sun Grid Engine (SGE for short) to start your run. SGE provides a command called qsub that you will use to submit your analysis. SGE will then look for a node (i.e. machine) in the cluster that is currently not being used (or is being used to a lesser extent than other nodes) and will start your analysis on that node. This saves you the effort of looking amongst all 17 nodes in the cluster for one that is not busy.
+
This file will be used by software called the Sun Grid Engine (SGE for short) to start your run. SGE provides a command called qsub that you will use to submit your analysis. The SGE qsub command will look for a core (i.e. processor) on a node (i.e. machine) in the cluster that is currently not being used and will start your analysis on that node. This saves you the effort of looking amongst all nodes in the cluster for a core that is not busy.
  
 
Here is an explanation of each of the lines in gopaup:
 
Here is an explanation of each of the lines in gopaup:
* Lines beginning with the two characters #$ are interpreted as commands by SGE itself. In this case, the command tells SGE to send any output from the program to a file named junk.txt and the -j y part says to append any error output to this as well (the j stands for join and the y for yes)
+
* Lines beginning with the two characters #$ are interpreted as commands by SGE itself. In this case, the first #$ command tells SGE to interpret what follows as a bash script and the second #$ command causes SGE to send any output from the program to a file named junk.txt, and the -j y part says to append any error output to this as well (the j stands for join and the y for yes)
* The second line is simply a cd command that changes the present working directory to the pauprun directory you created earlier. This will ensure that anything saved by PAUP* ends up in this directory rather than in your home directory. Note that $HOME is like a macro that will be expanded to the full path to your home directory.
+
* The third line is simply a cd command that changes the present working directory to the pauprun directory you created earlier. This will ensure that anything saved by PAUP* ends up in this directory rather than in your home directory. Note that $HOME is like a macro that will be expanded to the full path to your home directory.
* The third and last line simply starts up PAUP* and executes the run.nex file. The -n flag tells PAUP* that no human is going to be listening or answering questions, so it should just use default answers to any questions it needs to ask during the run.
+
* The fourth line sets environment variables such that invoking paup starts the most recently installed version of the program. If you left this line out you would end up running an older version of paup.
 +
* The fifth and last line starts up PAUP* and executes the run.nex file. The -n flag tells PAUP* that no human is going to be listening or answering questions, so it should just use default answers to any questions it needs to ask during the run.
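Because gopaup changes into <tt>pauprun</tt> and then executes <tt>run.nex</tt> (which in turn executes <tt>algae.nex</tt>), it may be worth confirming that all three files are where they should be before you submit. Here is a minimal sketch of such a check. Note that <tt>check_ready</tt> is a hypothetical helper written for this example; it is not part of SGE or PAUP*.

```shell
# Sketch: confirm the three files are where gopaup expects them.
# check_ready is a hypothetical helper; its argument is the directory that
# holds gopaup and the pauprun subdirectory (normally your home directory).
check_ready() {
    home_dir=$1
    missing=0
    for f in "$home_dir/gopaup" "$home_dir/pauprun/run.nex" "$home_dir/pauprun/algae.nex"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return $missing
}

# Report either way, so the script carries on even if something is missing.
check_ready "$HOME" && echo "ready to submit" || echo "not ready yet"
```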
  
 
==== Submitting a job using qsub ====
 
==== Submitting a job using qsub ====
Line 183: Line 193:
 
You can see if your run is still going using the qstat command:
 
You can see if your run is still going using the qstat command:
 
  qstat
 
  qstat
If it is running, you will see an entry containing gopaup and the state will be r, for running. Here is what it looked like for me (I've omitted the rightmost part):
+
If it is running, you will see an entry containing gopaup and the state will be r (running), or qw (queued, waiting). Here is what it looked like for me:
  job-ID  prior  name      user        state submit/start at    queue
+
[plewis@bbcsrv3 ~]$ qstat
  -----------------------------------------------------------------------------------------------
+
  job-ID  prior  name      user        state submit/start at    queue                         slots ja-task-ID
    5540 0.55500 gopaup    plewis      r    02/18/2007 13:38:47 all.q@node003.cluster.private
+
  -----------------------------------------------------------------------------------------------------------------
    5535 0.55500 bskinkultr jockusch    r    02/16/2007 16:18:49 all.q@node006.cluster.private
+
  24747 0.50500 gopaup    plewis      r    01/18/2014 21:30:02 all.q@compute-1-2.local            1
    5525 0.55500 mb.sh      plapierre    r    02/15/2007 10:46:17 all.q@node010.cluster.private
+
This indicates that my job is now running on the node named compute-1-2.
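If the queue is busy, the qstat listing can be long, and you may want to see just the state of your own jobs. The sketch below filters qstat-style text with awk. It assumes the column layout shown in the listing above (job-ID, prior, name, user, state), which may differ on other SGE installations, and <tt>job_states</tt> is a hypothetical helper, not an SGE command. The sample text is copied from the listing above so that the example runs without a cluster.

```shell
# Sketch: pull the job-ID and state columns out of qstat-style output.
# job_states is a hypothetical helper that reads the listing on standard input.
job_states() {
    # Skip the two header lines, then print "job-ID state" for this user's jobs.
    awk -v u="$1" 'NR > 2 && $4 == u { print $1, $5 }'
}

# Sample text copied from the qstat listing above.
sample='job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  24747 0.50500 gopaup     plewis       r     01/18/2014 21:30:02 all.q@compute-1-2.local            1'

printf '%s\n' "$sample" | job_states plewis    # prints: 24747 r
```

On the cluster, something like <tt>qstat | job_states yourusername</tt> should behave the same way, provided the columns are laid out as assumed.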
    5433 0.55500 mb.sh      plapierre    r    02/08/2007 18:40:50 all.q@node012.cluster.private
+
    5539 0.55500 bskinkultr jockusch    r    02/18/2007 12:22:55 all.q@node015.cluster.private
+
My run is listed first, and is currently running on node 3 of the cluster.
+
  
 
==== Killing a job using qdel ====
 
==== Killing a job using qdel ====
  
Sometimes it is clear that an analysis is not going to do what you wanted it to. Suppose that just after you press the Enter key to start an analysis, you realize that you forgot to put in a savetrees command in your paup block (so in the end you will not be able to see the results of the search). In such situations, you really want to just kill the job, fix the problem, and then start it up again. Use the qdel command for this. Note that in the output of the qstat command above, my run had a job-ID equal to 5540. I could kill the job like this:
+
Sometimes it is clear that an analysis is not going to do what you wanted it to. Suppose that just after you press the Enter key to start an analysis, you realize that you forgot to put in a savetrees command in your paup block (so in the end you will not be able to see the results of the search). In such situations, you really want to just kill the job, fix the problem, and then start it up again. Use the qdel command for this. Note that in the output of the qstat command above, my run had a job-ID equal to 24747. I could kill the job like this:
  qdel 5540
+
  qdel 24747
 
SGE will say that it has scheduled the job for deletion, but in practice it kills it almost instantaneously in my experience. Be sure to delete any output files that have already been created before starting your run over again.
 
SGE will say that it has scheduled the job for deletion, but in practice it kills it almost instantaneously in my experience. Be sure to delete any output files that have already been created before starting your run over again.
  
Line 205: Line 212:
 
  cat algae.output.txt
 
  cat algae.output.txt
  
==== Using PSFTP to download the resulting treefile ====
+
==== Using Cyberduck to download the log file and the tree file ====
  
When PAUP* finishes, qstat will no longer list your process. At this point, you need to use PSFTP to get the log and tree files that were saved back to your local computer. Start up PSFTP and type
+
When PAUP* finishes, qstat will no longer list your process. At this point, you need to use Cyberduck to get the log and tree files that were saved back to your local computer. Assuming you left Cyberduck open and connected to the cluster, double-click on the pauprun directory and locate the files <tt>algae.ml.tre</tt> and  <tt>algae.output.txt</tt>. Select these two files and drag them out of the Cyberduck window and drop them on your desktop. After a flurry of activity, Cyberduck should report that the two files were downloaded successfully, at which point you can close the download status window.
open bbcxsrv1.biotech.uconn.edu
+
 
Note that both the log file (algae.output.txt) and the tree file (algae.ml.tre) are in the pauprun directory. PSFTP dropped you in your home directory, but you can tell PSFTP to change to the pauprun directory in the same way you tell UNIX that you want to change directories:
+
==== (Mac users only) Using scp to download the log file and the tree file ====
  cd pauprun
+
 
You can likewise type ls in PSFTP to get a listing of files. Do this now to make sure you see the two files you want to download. Now, use the get command to download the files:
+
Mac users can also use scp to get the log and tree files that were saved back to your local computer, but, again, if you are happy with Cyberduck you can skip this section. In the Terminal app on your Mac, press Cmd-t to open a new tab (which will create a new session starting in your home directory on your Mac), then type the following (being careful to separate the final dot character from everything else by a blank space):
get algae.ml.tre
+
  cd Desktop
get algae.output.txt
+
scp username@bbcsrv3.biotech.uconn.edu:pauprun/algae.output.txt .
Finally, close PSFTP using any of the following commands: quit, exit, bye.
+
scp username@bbcsrv3.biotech.uconn.edu:pauprun/algae.ml.tre .
 +
Again, be sure to replace <tt>username</tt> with your own user name on the cluster. The first command makes your current directory the Desktop folder, and the next two commands copy the files <tt>algae.output.txt</tt> and <tt>algae.ml.tre</tt> to your current directory (this is what the single dot at the end of each line stands for).
 +
 
=== Using FigTree to view tree files ===
[[File:AlgaeMLtree.png|right|400px|thumb|<tt>algae.ml.tre</tt> file viewed with FigTree]]
If you do not already have it, download and install the [http://tree.bio.ed.ac.uk/software/figtree/ FigTree] application on your laptop. FigTree is a Java application, so you will also need to install a Java Runtime Environment (JRE) if you don't already have one (just start FigTree and it will tell you if it cannot find a JRE). Once FigTree is running, choose File > Open from the menu to open the <tt>algae.ml.tre</tt> file.

==== Adjusting taxon label font ====

The first thing you will probably want to do is make the taxon labels larger or change the font. Expand the Tip Labels section on the left and play with the Font Size up/down control. You can also set font details in the preferences, which will save you a lot of time in the future.

==== Line thickness ====

You can modify the thickness of the lines used by FigTree to draw the edges of the tree by expanding the Appearance tab.

==== Ladderization ====

You can ladderize the tree (make it appear to flow one way or the other) by playing with the Order Nodes option in the Trees tab.

==== Export tree as PDF ====

There are many other options that you can discover in FigTree, but one more thing to try today is to save the tree as a PDF file. Once you have the tree looking just the way you want, choose File > Export PDF...
  
 
==== Why both junk.txt and algae.output.txt? ====
 
 
==== Delete junk.txt using the rm command ====
 
Because you do not need junk.txt, delete it using the rm command (the -f stands for force; i.e. don't ask if it is ok, just do it!):
 
  cd
  rm -f junk.txt

  cd pauprun
  rm -f algae.ml.tre
  rm -f algae.output.txt
 
It is a good idea to delete files you no longer need for two reasons:
 
* you will later wonder whether you downloaded those files to your local machine and will have to spend time making sure you actually have saved the results locally
 
=== Tips and tricks ===
 
Here are some miscellaneous tips and tricks to make your life easier when communicating with the cluster.
  
 
==== Command completion using the tab key ====
 
You can often get away with only typing the first few letters of a filename; try pressing the Tab key after the first few letters and the command interpreter will try to complete the thought. For example, cd into the pauprun directory, then type
  cat alg<TAB>
If algae.nex is the only file in the directory in which the first three letters are alg, then the command interpreter will type in the rest of the file name for you.
  
 
==== Wildcards ====
 
 
It is important to know how to escape from a man page! The way to get out is to type the letter q. You can page down using Ctrl-f, page up through a man page using Ctrl-b, go to the end using Shift-g and return to the very beginning using 1,Shift-g (that is, type a 1, release it, then type Shift-g). You can also move line by line in a man page using the down and up arrows, and page by page using the PgUp and PgDn keys.
 
  
[[Category: Phylogenetics]]
__NOTOC__

== Part B: Starting a GARLI run on the cluster ==

[http://www.bio.utexas.edu/faculty/antisense/garli/Garli.html GARLI] is a program written by [http://www.nescent.org/dir/postdoctoral_fellow.php?id=00024 Derrick Zwickl] for estimating the phylogeny using maximum likelihood, and is currently one of the best programs to use if you have a large problem (i.e. many taxa). I used GARLI to estimate the 738-taxon green plant phylogeny on the poster outside my office door in less than 5 hours. Another excellent ML program for large problems is [http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm RAxML], written by [http://icwww.epfl.ch/~stamatak/ Alexandros Stamatakis].

GARLI does not give you much choice in the way of search strategy or substitution model. It uses the GTR+I+G model (General Time Reversible substitution model, with invariable sites and discrete gamma rate heterogeneity), and uses a genetic algorithm search strategy. The genetic algorithm (or GA, for short) search strategy is like other heuristic search strategies in that it cannot guarantee that the optimal tree will be found. Thus, as with all heuristic searches, it is a good idea to run GARLI several times (using different pseudorandom number seeds) to see if there is any variation in the estimated tree.

Today you will run GARLI on the cluster for a dataset with 50 taxa. This is not a particularly large problem, but then you only have an hour or so to get this done!

=== Preparing the GARLI control file ===

Like many non-interactive programs, GARLI uses a control file to specify the settings it will use during a run. Here is the control file distributed with GARLI:
  [general]
  datafname = rana.phy
  streefname = random
  ofprefix = ranaGarli
  randseed = -1
  megsclamemory = 500
  availablememory = 512
  logevery = 10
  saveevery = 100
  refinestart = 1
  outputeachbettertopology = 1
  enforcetermconditions = 1
  genthreshfortopoterm = 20000
  scorethreshforterm = .05
  significanttopochange = 0.05
  outputphyliptree = 0
  outputmostlyuselessfiles = 0
  dontinferproportioninvariant = 0
  
  [master]
  nindivs = 4
  holdover = 1
  selectionintensity = .5
  holdoverpenalty = 0
  stopgen = 5000000
  stoptime = 5000000
  
  startoptprec = .5
  minoptprec = .01
  numberofprecreductions = 40
  topoweight = 1.0
  modweight = .05
  brlenweight = .2
  randnniweight = .2
  randsprweight = .3
  limsprweight = .5
  intervallength = 100
  intervalstostore = 5
  
  limsprrange = 6
  meanbrlenmuts = 5
  gammashapebrlen = 1000
  gammashapemodel = 1000
  
  bootstrapreps = 0
  inferinternalstateprobs = 0
Most of these settings are fine, but you will need to change a few of them before running GARLI.

==== Using curl to download the file ====

The first step is to get this file onto the cluster where you can use pico to edit it. You have already learned several ways to create a file on the cluster:
* cat - > filename ... Ctrl-d
* pico filename
Now you will learn a third way: using the curl program. Curl is useful when you know that a file exists on the internet at some URL. Curl stands for "Copy URL" - it allows you to copy a file at a particular address (URL) on the internet to your present working directory. I have placed a copy of the garli.conf file above at the following URL:
  <nowiki>http://hydrodictyon.eeb.uconn.edu/eeb349/garli.conf</nowiki>
You can view it in a web browser if you wish. Here is how to get this file copied from that web address to your home directory on the cluster. I assume you have already logged into the cluster using PuTTY:
  curl <nowiki>http://hydrodictyon.eeb.uconn.edu/eeb349/garli.conf</nowiki> > garli.conf
Curl acts a lot like cat: it basically spews the file to the console, so if you want to save it to a file, you need to redirect the output of curl to a file, which is what the <tt>> garli.conf</tt> part on the end is about.

==== Editing garli.conf with pico ====

Now fire up pico and edit this file:
  pico garli.conf
You will only need to change two lines. Change this line
  datafname = rana.phy
so that it looks like this instead
  datafname = rbcL50.nex
Then change this line
  ofprefix = ranaGarli
so that it looks like this instead
  ofprefix = 50taxa
The ofprefix is used by GARLI to begin the name of all output files. I usually use something different than the data file name here. If you eventually want to delete all of the various files that GARLI creates, you can just say
  rm -f 50taxa*
If, however, you specify <tt>ofprefix = rbcL50</tt>, then the command
  rm -f rbcL50*
would not only wipe out the files GARLI created, but would also delete the data file!

Once you have finished changing those two lines, exit pico using Ctrl-x, then create a directory named garlirun and move garli.conf into that directory.

=== Download the data file using curl ===

I have placed the data file (rbcL50.nex) at the following address:
  <nowiki>http://hydrodictyon.eeb.uconn.edu/eeb349/rbcL50.nex</nowiki>
so you can use curl to download this file to the garlirun directory as follows:
  cd $HOME/garlirun
  curl <nowiki>http://hydrodictyon.eeb.uconn.edu/eeb349/rbcL50.nex</nowiki> > rbcL50.nex
[[Image:new.png]] Three changes were made to this section:
* The last line above has been corrected (previously it did not include the <tt><nowiki>> rbcL50.nex</nowiki></tt> part on the end)
* I changed <tt>rbcl50.nex</tt> to <tt>rbcL50.nex</tt> to avoid confusing the lower case letter L (l) for the number one (1)
* I substituted a Tomato sequence for one of the two Spinach sequences, so now there is no duplication in the rbcL50.nex data file (but note that the data set has changed)

=== Preparing the gogarli SGE script ===

Now return to your home directory (using the cd command) and create a gogarli script that will be fed to qsub to start the analysis. Use either pico or cat to create the file with this text:
  #$ -o junk.txt -j y
  cd $HOME/garlirun
  Garli.94 garli.conf
This file will look very similar to the gopaup script you created in part A. The only difference is that the data and control file are in the directory garlirun (not pauprun), the name of the program is Garli.94 (this means Garli version 0.94), and GARLI expects the name of the control file (garli.conf) on the command line instead of the name of the data file (rbcL50.nex). Remember that the name of the data file was specified inside the control file.

=== Running GARLI ===

Run GARLI by issuing the qsub command:
  qsub gogarli

Check progress every few minutes using the qstat command. This run will take 15 or 20 minutes. If you get bored, you can cd into the garlirun directory and use this command to see the tail end of the log file that GARLI creates automatically:
  tail 50taxa.log00.log
The tail command is like the cat command except that it only shows you the last few lines of the file (which often is just what you need).

=== Mailing the tree to yourself ===

After GARLI has finished, you should download the tree file (50taxa.best.tre) using the PSFTP get command, but here is another handy trick: you can email the tree to yourself using this command (issue this from within the garlirun directory where the tree file is located):
  mail paul.lewis@uconn.edu < 50taxa.best.tre
This command will send mail to paul.lewis@uconn.edu, and the body of the email message will come from the file 50taxa.best.tre!

[[Category:EEB courses]]
[[Category:Phylogenetics]]

Latest revision as of 19:04, 19 January 2018

EEB 5349: Phylogenetics
The goal of this lab exercise is to show you how to log into the Bioinformatics Facility computer cluster and perform a basic PAUP* analysis.

Using the UConn Bioinformatics Facility cluster

The UConn Computational Biology Core is part of the Center for Genome Innovation (CGI). We will use the BBC computing cluster located in the main UITS data center for most of the data crunching we will do in this course. By now you should have an account on the cluster, and today you will learn how to start analyses remotely (i.e. from your laptop), check on their status, and download the results when your analysis is finished.

Obtaining the necessary communications software

You will be using a couple of simple (and free) programs to communicate with the head node of the cluster.

If you use MacOS 10.x...

SSH

The program ssh will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). You will use ssh to send commands to the cluster and see the output generated.

Start by opening the Terminal application, which you can find in the Applications/Utilities folder on your hard drive. Using the Terminal program, you can connect to the cluster with the following command:

ssh username@bbcsrv3.biotech.uconn.edu

where username should be replaced by your username on the cluster.

You may wish to install iTerm2, which is a terminal program that makes some things easier than Terminal, but the built-in Terminal will work just fine.

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create an alias for this command. Please ask your lab instructor for how to do this if you are interested.
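One possible way to set up such a shortcut is sketched below. It assumes that your shell reads a startup file such as ~/.bashrc or ~/.zshrc, and the alias name bbc is purely an illustrative choice (it is not something defined by the cluster).

```shell
# Hypothetical shortcut: the name "bbc" is an arbitrary choice.
# Replace username with your own user name on the cluster.
alias bbc='ssh username@bbcsrv3.biotech.uconn.edu'
```

If you place this line in your shell startup file and open a new Terminal window, typing bbc should then behave like typing the full ssh command.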


SCP/SFTP

An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). I will show you how to transfer files using both methods, but for now you should go ahead and install Cyberduck. Cyberduck provides a nice graphical user interface, but I find that the command line clients let you get your work done faster once you get used to them.

If you use Windows...

SSH

The program PuTTY will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). You will use PuTTY to send commands to the cluster and see the output generated.

Visit the PuTTY web site, scroll down to the section labeled "Binaries" and save both putty.exe and psftp.exe on your desktop (or use the MSI to download all the PuTTY utilities as a bundle).

Double-click the PuTTY icon on your desktop to start the program. In the Host Name (or IP address) box, type bbcsrv3.biotech.uconn.edu. Now type Bioinformatics cluster into the Saved Sessions box and press the Save button. This will save having to type the computer's name each time you want to connect. Now click the Open button to start a session. The first time you connect, you will get a PuTTY Security Alert. Just press the Yes button to close this dialog.

Now you should see the following prompt:

login as:

Type in your username and press Enter. Now you should see the password prompt:

Password:

Type in your password and press Enter. If all goes well, you should see something like this:

Rocks 6.1 (Emerald Boa)
Profile built 08:40 02-Jan-2013

Kickstarted 04:02 02-Jan-2013
.
.
.
[plewis@bbcsrv3 ~]$

except that your username should appear instead of mine (plewis).

SCP/SFTP

An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). For now you should go ahead and install Cyberduck. Cyberduck provides a nice graphical user interface, which makes moving files back and forth easy.

Learning enough UNIX to get around

I'm presuming that you do not know a lot of UNIX commands, but even if you are already a UNIX guru, please complete this section anyway because otherwise you will fail to create some files you will need later.

ls command: finding out what is in the present working directory

The ls command lists the files in the present working directory. Try typing just

ls

If you need more details about files than you see here, type

ls -la

instead. This version provides information about file permissions, ownership, size, and last modification date.

pwd command: finding out what directory you are in

Typing

pwd

shows you the full path of the present working directory. The path shown should end with your username, indicating that you are currently in your home directory.

mkdir command: creating a new directory

Typing the following command will create a new directory named pauprun in your home directory:

mkdir pauprun

Use the ls command now to make sure a directory of that name was indeed created.

cd command: leaving the nest and returning home again

The cd command lets you change the present working directory. To move into the newly-created pauprun directory, type

cd pauprun

You can always go back to your home directory (no matter how lost you get!) by typing just cd by itself

cd

If you want to go up one directory level (say from pauprun back to your home directory), you can specify the parent directory using two dots:

cd ..
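If you would like to rehearse these navigation commands without touching anything in your real home directory, the following sketch runs them inside a throwaway directory. The variable name sandbox and the use of mktemp are assumptions made for this illustration only.

```shell
# Practice in a throwaway directory so nothing in your real home directory changes.
sandbox=$(mktemp -d)      # stands in for your home directory
cd "$sandbox"
mkdir pauprun             # create the new directory
cd pauprun                # move into it
pwd                       # the path shown now ends in /pauprun
cd ..                     # go back to the parent directory
pwd                       # the path shown no longer ends in /pauprun
```

When you are done, type cd by itself to return to your real home directory.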

Creating run.nex using the nano editor

One way to create a new file, or edit one that already exists, is to use the nano editor. You will now use nano to create a run.nex file containing a paup block. You will later execute this file in PAUP* to perform an analysis.

First use the pwd command to see where you are, then use cd to go into the pauprun directory if you are not already there. Type

nano run.nex

This will open the nano editor, and it should say [ New File ] at the bottom of the window to indicate that the run.nex file does not already exist. Note the menu of the commands along the bottom two rows. Each of these commands is invoked using the Ctrl key with the letter indicated. Thus, ^X Exit indicates that you can use the Ctrl key in combination with the letter X to exit nano.

For now, type the following into the editor:

#nexus

begin paup;
  log file=algae.output.txt start replace flush;
  execute algae.nex;
  set criterion=likelihood autoclose;
  lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
  hsearch swap=none start=stepwise addseq=random nrep=1;
  lscores 1;
  lset basefreq=previous tratio=previous shape=previous;
  hsearch swap=tbr start=1;
  savetrees file=algae.ml.tre brlens;
  log stop;
  quit;
end;

Once you have entered everything, use ^X to exit. Nano will ask if you want to save the modified buffer (a buffer is a predefined amount of computer memory used to store the text you type; the text stored in the buffer will be lost once you exit the program unless you save it to a file on the hard drive), at which point you should press the Y key to answer yes. Nano will now ask you whether you want to use the file name run.nex; this time just press Enter to accept. Nano should now exit and you can use cat to look at the contents of the file you just created:

cat run.nex

Create the gopaup file

Now use nano to create a second file named gopaup in your home directory (the parent directory of the pauprun directory). To do this, type nano gopaup. This file should contain this text:

#$ -S /bin/bash
#$ -o junk.txt -j y
cd $HOME/pauprun
module load paup/current
paup -n run.nex
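If you prefer not to open an editor, a file like this can also be written in one step using a shell "here document". This is only a sketch: it writes into a scratch directory (the variable outdir is an assumption for the demo), whereas for the real exercise the file belongs in your home directory.

```shell
# Scratch directory for this demo; for the real file, use your home directory instead.
outdir=$(mktemp -d)

# The quotes around 'EOF' stop the shell from expanding $HOME right now,
# so the literal text $HOME/pauprun ends up in the file.
cat > "$outdir/gopaup" <<'EOF'
#$ -S /bin/bash
#$ -o junk.txt -j y
cd $HOME/pauprun
module load paup/current
paup -n run.nex
EOF

cat "$outdir/gopaup"      # show what was written
```

Leaving the quotes off the EOF marker would cause $HOME to be replaced by its current value before the file is written, which is usually harmless here but is worth knowing about.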

Using Cyberduck to upload the algae.nex data file

Cyberduck bookmark.png

Download the file algae.nex from here and save it on your hard drive.

Open Cyberduck, choose Bookmark > New Bookmark from the main menu, then fill out the resulting dialog box as shown on the right (except substitute your own user name, of course). Be sure to change the protocol to SFTP (not the default FTP). Click the X button to close and you should see your bookmark appear at the bottom of the main window. Double click the bookmark to open a connection. You will then be warned that the host key is unknown - choose Allow (and go ahead and check the Always button so you do not need to do this every time).

Once you are in, you will see a listing of the files in your home directory (if you have any). To copy the algae.nex file to the cluster, you need only drag-and-drop it onto the Cyberduck window.

(Mac/Linux users only) Using scp to upload the algae.nex data file

Mac users have two options. While you will probably want to do your file transfers with Cyberduck as described above, it is also possible to transfer files using the command line scp client (scp is also a good option for Linux users). Read on if you are interested in this option, but feel free to skip this section if you are happy using Cyberduck.

Open the Terminal application and navigate to where you saved the file. If you saved it on the desktop, you can go there by typing cd Desktop.

Type the following to upload algae.nex to the cluster:

scp algae.nex username@bbcsrv3.biotech.uconn.edu:

where username should be replaced by your own user name on the cluster. (Don't forget the colon on the very end of the line!)

A few more UNIX commands

You have now transferred a large file (algae.nex) to the cluster, but it is not in the right place. The algae.nex file is currently in your home directory, whereas the run.nex file is in the pauprun directory. The run.nex file contains a line containing the command execute algae.nex, which means that algae.nex should also be located in the pauprun directory. Use the following commands to ensure that (1) you are in your home directory, and (2) algae.nex is also in your home directory:

 cd $HOME
 ls algae.*

Note the use of a wildcard character (*) in the ls command. This will show you only files that begin with the letters algae followed by a period and any number of other non-whitespace characters. The $HOME is a predefined shell variable that will be replaced with your home directory. It is not necessary in this case - typing cd all by itself would take you to your home directory - but the $HOME variable is good to know about (especially for use in scripts).
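To see how the wildcard decides which names match, here is a small experiment you can run in a scratch directory. The extra file names (algaeplot.pdf, for example) are made up for this illustration.

```shell
scratch=$(mktemp -d)      # throwaway directory for the experiment
cd "$scratch"
touch algae.nex algae.ml.tre algaeplot.pdf run.nex

ls algae.*    # lists algae.ml.tre and algae.nex, but not algaeplot.pdf (no period after "algae")
ls *.nex      # lists algae.nex and run.nex
```

Note that the shell, not ls, expands the wildcard: ls simply receives the list of matching names.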

mv command: moving or renaming a file

Now use the mv command to move algae.nex to the directory pauprun:

mv algae.nex pauprun

The mv command takes two arguments. The first argument is the name of the directory or file you want to move, whereas the second argument is the destination. The destination could be either a directory (which is true in this case) or a file name. If the directory pauprun did not already exist, mv would have interpreted this as a request to rename algae.nex to the file name pauprun! So, be aware that mv can rename files as well as move them.
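The difference between moving and renaming can be tried safely in a scratch directory. In this sketch the names notes.txt and archive are invented for the demonstration.

```shell
scratch=$(mktemp -d)      # throwaway directory for the experiment
cd "$scratch"
mkdir pauprun
touch algae.nex notes.txt

mv algae.nex pauprun      # pauprun exists as a directory, so the file is moved into it
mv notes.txt archive      # nothing named archive exists, so the file is renamed to "archive"

ls -F                     # directories are shown with a trailing slash
```

After these commands, archive is an ordinary file (the former notes.txt), not a directory.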

cp command: copying a file

The cp command copies files. It leaves the original file in place and makes a copy elsewhere. You could have used this command to get a copy of algae.nex into the directory pauprun:

cp algae.nex pauprun

This would have left the original in your home directory, and made a duplicate of this file in the directory pauprun.

rm command: cleaning up

The rm command removes files. If you had used the cp command to copy algae.nex into the pauprun directory, you could remove the original file using these commands:

cd
rm algae.nex

The first cd command just ensures that the copy you are removing will be the one in your home directory (typing cd by itself acts the same as typing cd $HOME). If it bothers you that the system always asks your permission before deleting a file, you can force the issue using the -f option (but just keep in mind that this is more dangerous):

rm -f algae.nex

To delete an entire directory (don't try this now!), you can add the -r flag, which means to recursively apply the remove command to everything in every subdirectory:

rm -rf pauprun

The above command would remove everything in the pauprun directory (without asking!), and then remove the pauprun directory itself. I want to stress that this is a particularly dangerous command, so make sure you are not weary or distracted when you use it! Unlike the Windows or Mac graphical user interface, files deleted using rm are not moved first to the Recycle Bin or Trash, they are just gone. There is no undo for the rm command.
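One habit that may help is to list exactly what a directory contains before removing it. The sketch below does this in a scratch area; the directory name oldrun and its contents are invented for the example.

```shell
scratch=$(mktemp -d)                       # throwaway area for the experiment
mkdir -p "$scratch/oldrun/sub"
touch "$scratch/oldrun/a.txt" "$scratch/oldrun/sub/b.txt"

ls -R "$scratch/oldrun"                    # look first: recursive listing of everything inside
rm -rf "$scratch/oldrun"                   # then delete the directory and all of its contents
```

Using rm -ri instead of rm -rf makes rm ask for confirmation before each removal, which is slower but safer.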

Starting a PAUP* analysis

If you've been following the directions in sequence, you now have two files (algae.nex and run.nex) in your $HOME/pauprun directory on the cluster, whereas the gopaup file should be in $HOME. Use the cd command to make sure you are in your home directory, then the cat command to look at the contents of the gopaup file you created earlier. You should see this:

#$ -S /bin/bash
#$ -o junk.txt -j y
cd $HOME/pauprun
module load paup/current
paup -n run.nex

This file will be used by software called the Sun Grid Engine (SGE for short) to start your run. SGE provides a command called qsub that you will use to submit your analysis. The SGE qsub command will look for a core (i.e. processor) on a node (i.e. machine) in the cluster that is currently not being used and will start your analysis on that node. This saves you the effort of looking amongst all nodes in the cluster for a core that is not busy.

Here is an explanation of each of the lines in gopaup:

  • Lines beginning with the two characters #$ are interpreted as commands by SGE itself. In this case, the first #$ command tells SGE to interpret what follows as a bash script and the second #$ command causes SGE to send any output from the program to a file named junk.txt, and the -j y part says to append any error output to this as well (the j stands for join and the y for yes)
  • The third line is simply a cd command that changes the present working directory to the pauprun directory you created earlier. This will ensure that anything saved by PAUP* ends up in this directory rather than in your home directory. Note that $HOME is like a macro that will be expanded to the full path to your home directory.
  • The fourth line sets environment variables so that invoking paup starts the most recently installed version of the program. If you left this line out, you would end up running an older version of paup.
  • The fifth and last line starts up PAUP* and executes the run.nex file. The -n flag tells PAUP* that no human is going to be listening or answering questions, so it should just use default answers to any questions it needs to ask during the run.

Submitting a job using qsub

Now you are ready to start the analysis. Make sure you are in your home directory, then type

qsub gopaup

Checking status using qstat

You can see if your run is still going using the qstat command:

qstat

If it is running, you will see an entry containing gopaup and the state will be r (running), or qw (queued, waiting). Here is what it looked like for me:

[plewis@bbcsrv3 ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  24747 0.50500 gopaup     plewis       r     01/18/2014 21:30:02 all.q@compute-1-2.local            1

This indicates that my job is now running on the node named compute-1-2.
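If you only want the state of one job, the qstat output can be filtered with awk. The sketch below is an illustration, not part of SGE: the helper names job_state and sample_qstat are invented, and it assumes the column layout matches the sample output shown above (job name in the third column, state in the fifth). It runs on the saved sample so that it can be tried anywhere.

```shell
# Print the state column for the job whose name is given as $1.
# Reads qstat-style text on standard input and skips the two header lines.
job_state() {
  awk -v name="$1" 'NR > 2 && $3 == name { print $5 }'
}

# Saved copy of the qstat output shown above, used in place of the real command.
sample_qstat() {
cat <<'EOF'
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  24747 0.50500 gopaup     plewis       r     01/18/2014 21:30:02 all.q@compute-1-2.local            1
EOF
}

sample_qstat | job_state gopaup    # prints: r
```

On the cluster you would pipe the real command instead, as in qstat | job_state gopaup; no output at all would mean that no job of that name is listed.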

Killing a job using qdel

Sometimes it is clear that an analysis is not going to do what you wanted it to. Suppose that just after you press the Enter key to start an analysis, you realize that you forgot to put in a savetrees command in your paup block (so in the end you will not be able to see the results of the search). In such situations, you really want to just kill the job, fix the problem, and then start it up again. Use the qdel command for this. Note that in the output of the qstat command above, my run had a job-ID equal to 24747. I could kill the job like this:

qdel 24747

SGE will say that it has scheduled the job for deletion, but in practice it kills it almost instantaneously in my experience. Be sure to delete any output files that have already been created before starting your run over again.

While PAUP* is running

While PAUP* is running, you can use cat to look at the log file:

cd pauprun
cat algae.output.txt
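For a long log file, cat scrolls the entire file past you, whereas the tail command shows only the last few lines. The sketch below demonstrates this on a stand-in file of 100 numbered lines (the variable log and the fake contents are assumptions for the demo).

```shell
log=$(mktemp)             # stand-in for algae.output.txt
seq 1 100 > "$log"        # 100 numbered lines of fake output

tail -n 3 "$log"          # prints 98, 99 and 100, one per line
```

On the cluster, something like tail -n 20 algae.output.txt would show the most recent 20 lines of the PAUP* log, and tail -f algae.output.txt keeps printing new lines as they are written until you press Ctrl-c.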

Using Cyberduck to download the log file and the tree file

When PAUP* finishes, qstat will no longer list your process. At this point, you need to use Cyberduck to get the log and tree files that were saved back to your local computer. Assuming you left Cyberduck open and connected to the cluster, double-click on the pauprun directory and locate the files algae.ml.tre and algae.output.txt. Select these two files and drag them out of the Cyberduck window and drop them on your desktop. After a flurry of activity, Cyberduck should report that the two files were downloaded successfully, at which point you can close the download status window.

(Mac users only) Using scp to download the log file and the tree file

Mac users can also use scp to get the log and tree files that were saved back to your local computer, but, again, if you are happy with Cyberduck you can skip this section. In the Terminal app on your Mac, press Cmd-t to open a new tab (which will create a new session starting in your home directory on your Mac), then type the following (being careful to separate the final dot character from everything else by a blank space):

cd Desktop
scp username@bbcsrv3.biotech.uconn.edu:pauprun/algae.output.txt .
scp username@bbcsrv3.biotech.uconn.edu:pauprun/algae.ml.tre .

Again, be sure to replace username with your own user name on the cluster. The first command makes your current directory the Desktop folder, and the next two commands copy the files algae.output.txt and algae.ml.tre to your current directory (this is what the single dot at the end of each line stands for).

Using FigTree to view tree files

algae.ml.tre file viewed with FigTree

If you do not already have it, download and install the FigTree application on your laptop. FigTree is a Java application, so you will also need to install a Java Runtime Environment (JRE) if you don't already have one (just start FigTree and it will tell you if it cannot find a JRE). Once FigTree is running, choose File > Open from the menu to open the algae.ml.tre file.

Adjusting taxon label font

The first thing you will probably want to do is make the taxon labels larger or change the font. Expand the Tip Labels section on the left and play with the Font Size up/down control. You can also set font details in the preferences, which will save you a lot of time in the future.

Line thickness

You can modify the thickness of the lines used by FigTree to draw the edges of the tree by expanding the Appearance tab.

Ladderization

You can ladderize the tree (make it appear to flow one way or the other) by playing with the Order Nodes option in the Trees tab.

Export tree as PDF

There are many other options that you can discover in FigTree, but one more thing to try today is to save the tree as a PDF file. Once you have the tree looking just the way you want, choose File > Export PDF...

Why both junk.txt and algae.output.txt?

In your home directory, SGE saved the output that PAUP* normally sends to the console to a file named junk.txt (we specified that it should do this in the gopaup file). I had you name this file junk.txt because you will not need this file after the run: the log command in your paup block ends up saving the same output in the file algae.output.txt. Why did we tell PAUP* to start a log file if SGE was going to save the output anyway? The main reason is that you can view the log file during the run, but you cannot view junk.txt until the run is finished. There will come a day when you have a PAUP* run that has been going for several days and want to know whether it is 10% or 90% finished. At this point you will appreciate being able to view the output file!
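
To check on a run in progress you do not need to read the whole log file; the standard UNIX tail command shows just the last few lines. The sketch below practices on a stand-in file created in a temporary directory (its contents are placeholders, not real PAUP* output, and the variable names scratch and log are illustrative). On the cluster you would instead run tail on pauprun/algae.output.txt.

```shell
# Create a stand-in log file in a scratch directory (placeholder content only)
scratch=$(mktemp -d)
log="$scratch/algae.output.txt"
i=1
while [ "$i" -le 100 ]; do
    echo "placeholder log line $i" >> "$log"
    i=$((i + 1))
done

# Show only the last 5 lines of the log
last_lines=$(tail -n 5 "$log")
echo "$last_lines"
```

Typing tail -f followed by the file name keeps the display open and prints new lines as they are appended; press Ctrl-c to stop watching.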

Delete junk.txt using the rm command

Because you do not need junk.txt, delete it using the rm command (the -f stands for force; i.e. don't ask if it is ok, just do it!):

cd
rm -f junk.txt

You also no longer need the copies of the log and tree files on the cluster because you downloaded them to your local computer (using PSFTP, Cyberduck, or scp):

cd pauprun
rm -f algae.ml.tre
rm -f algae.output.txt

It is a good idea to delete files you no longer need for two reasons:

  • if you leave them on the cluster, you will later wonder whether you downloaded those files to your local machine and will have to spend time making sure you actually have saved the results locally
  • our cluster only has so much disk space, and thus it is just not possible for everyone to keep every file they ever created
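
If you would like to see what the -f option does before using it on real files, you can try it in a scratch directory. This is a minimal sketch; the temporary directory and the variable names scratch and status are only for practice and are not part of the lab.

```shell
scratch=$(mktemp -d)          # temporary practice directory
touch "$scratch/junk.txt"     # create an empty file to delete

rm -f "$scratch/junk.txt"     # deletes the file without asking
rm -f "$scratch/junk.txt"     # file is already gone: -f stays silent, no error
status=$?                     # exit status of the second rm (0 means success)
echo "exit status: $status"
```

Because -f suppresses both the confirmation prompt and the complaint about a missing file, double-check the file name before pressing return.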

Tips and tricks

Here are some miscellaneous tips and tricks to make your life easier when communicating with the cluster.

Command completion using the tab key

You can often get away with only typing the first few letters of a filename; try pressing the Tab key after the first few letters and the command interpreter will try to complete the thought. For example, cd into the pauprun directory, then type

cat alg<TAB>

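If algae.nex is the only file in the directory whose name begins with alg, then the command interpreter will type in the rest of the file name for you. If more than one file name begins with alg, it will fill in only as much of the name as those files have in common.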

Wildcards

I've already mentioned this tip, but it bears repeating. When using most UNIX commands that accept filenames (e.g. ls, rm, mv, cp), you can place an asterisk inside the filename to stand in for any number of characters (including none). So

ls algae*

will produce output like this

algae.ml.tre    algae.nex   algae.output.txt
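
The command interpreter expands a wildcard into the list of matching file names before the command itself runs, so it is worth previewing what a pattern matches before handing it to rm. The sketch below does this in a scratch directory; the file notes.txt and the variable name scratch are made up for the illustration.

```shell
scratch=$(mktemp -d)     # temporary practice directory
cd "$scratch"
touch algae.ml.tre algae.nex algae.output.txt notes.txt

ls algae*                # lists the three algae files but not notes.txt

# Preview: echo prints the rm command with the wildcard already expanded,
# without deleting anything
echo rm -f algae.*
```

If the preview lists exactly the files you intend to delete, repeat the command without the word echo.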

Man pages

If you want to learn more options for any of the UNIX commands, you can use the man command to see the manual for that command. For example, here's how to see the manual describing the ls command:

man ls

It is important to know how to escape from a man page! The way to get out is to type the letter q. You can page down using Ctrl-f and page up using Ctrl-b, go to the end using Shift-g, and return to the very beginning using 1,Shift-g (that is, type a 1, release it, then type Shift-g). You can also move line by line using the down and up arrows, and page by page using the PgDn and PgUp keys.