Phylogenetics: Bioinformatics Cluster

From EEBedia
Revision as of 15:04, 18 February 2007 by PaulLewis (Talk | contribs) (Programs vs. protocols)

This article is still under construction.
Expect it to change frequently until this notice is removed.
EEB 349: Phylogenetics
The goal of this lab exercise is to show you how to use the Bioinformatics Facility computer cluster to run PAUP* and GARLI.

Part A: Using the UConn Bioinformatics Facility cluster

The Bioinformatics Facility is part of the UConn Biotechnology Center, which is located behind the Up-N-Atom Cafe in the lower level of the Biology/Physics building. Jeff Lary maintains a 17-node Apple Xserve G5 Cluster that can be used by UConn graduate students and faculty to conduct bioinformatics-related research (sequence analysis, biological database searches, phylogenetics, molecular evolution). Each of you has been given an account on the cluster, and today you will learn how to start analyses remotely (i.e., from this computer lab), check on their status, and download the results when an analysis is finished.

Obtaining the necessary communications software

You will be using a couple of simple (and free) programs to communicate with the head node of the cluster. Visit the PuTTY web site, scroll down to the section labeled "Binaries" and save putty.exe and psftp.exe on your desktop.

PuTTY

The program PuTTY will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell) that encrypts everything sent over the internet. You will use PuTTY to send commands to the cluster and see the output generated. In the old days, a protocol known as Telnet was used for this purpose, but it is no longer used because it did not encrypt anything, making it easy for someone with access to the network to see your username and password in plain text.

PSFTP

The other program you will use is called PSFTP. It allows you to transfer files back and forth using SFTP (SSH File Transfer Protocol, often called Secure File Transfer Protocol). It replaces the old protocol (FTP) that, like Telnet, sent usernames and passwords unencrypted across the network.

Programs vs. protocols

SSH and SFTP are protocols, not programs. PuTTY and PSFTP are programs that implement the SSH and SFTP protocols, respectively. Before long, you may be thinking "I liked FTP much better than SFTP!" because you probably used user-friendly, graphical FTP programs in the past. If you think this, then you are confusing protocols with the programs that implement them. There are much fancier, easier-to-use programs for SSH and SFTP than PuTTY and PSFTP, but these will serve us well today. One nice thing is that these programs are so small that you can just download them whenever and wherever you happen to need them. If you find yourself wanting a fancier SFTP client, check out FileZilla (on Windows) or Fugu (for Macs).

Logging in for the first time

On the whiteboard you will find your login id (user name) and password.

Double-click the PuTTY icon on your desktop to start the program. In the Host Name (or IP address) box, type bbcxsrv1.biotech.uconn.edu. Now type Bioinformatics cluster into the Saved Sessions box and press the Save button. This will save having to type the computer's name each time you want to connect. Now click the Open button to start a session.

The first time you connect, you will get a PuTTY Security Alert. Just press the Yes button to close this dialog.

Now you should see the following prompt:

login as:

Type in your username and press Enter. Now you should see the password prompt:

Password:

Type in your password and press Enter. If all goes well, you should see something like this:

Welcome to Darwin!
[bbcxsrv1:~] plewis%

except that your username should appear instead of mine (plewis).

The first thing you should do is change your password. Type

passwd

and press the Enter key, then follow the directions to change your password. If you have trouble thinking up passwords that are acceptable, check out the Java Password Generator web site. It generates passwords that are not really words but sound like they are, so they are easier to remember than completely random passwords.

Learning enough UNIX to get around

I'm presuming that you do not know a lot of UNIX commands. If you are already a UNIX guru, you can skip to the next section. UNIX is the operating system upon which Mac OS X is built. You are actually communicating with a Macintosh G5 computer running Mac OS X, but you will be using the command console rather than menus today.

ls command: finding out what is in the present working directory

The ls command lists the files in the present working directory. Try typing just

ls

If you need more details about files than you see here, type

ls -la

instead. This version provides information about file permissions, ownership, size, and last modification date.

pwd command: finding out what directory you are in

Typing

pwd

shows you the full path of the present working directory. The path shown should end with your username, indicating that you are currently in your home directory.

mkdir command: creating a new directory

Typing the following command will create a new directory named pauprun in your home directory:

mkdir pauprun

Use the ls command now to make sure a directory of that name was indeed created.

cd command: leaving the nest and returning home again

The cd command lets you change the present working directory. To move into the newly-created pauprun directory, type

cd pauprun

You can always go back to your home directory (no matter how lost you get!) by typing cd by itself:

cd

Use cd now to return to your home directory.
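The four commands introduced so far can be rehearsed as one short session. This is only a sketch: it points HOME at a throwaway directory (an assumption made so that your real home directory is left untouched), whereas on the cluster you would simply type the commands at the prompt.

```shell
# Stand-in home directory so nothing real is modified (assumption for this sketch).
HOME=$(mktemp -d)
export HOME

cd              # cd with no argument always returns to $HOME
pwd             # show the full path of the present working directory
mkdir pauprun   # create a new directory named pauprun
ls -la          # long listing: permissions, ownership, size, modification date
cd pauprun      # move into the new directory
pwd             # the path now ends in /pauprun
cd              # and back home again
```

The final cd leaves you in the (stand-in) home directory, which is where the rest of the exercise expects you to be.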

cat command: creating and viewing files

The cat command was designed for concatenating files, but I most often use it for viewing and creating files. To create a new file named gopaup, type the following (be sure to leave a space between each item, just as you see it below) and then press the Enter key:

cat - > gopaup

Note that you no longer see the UNIX prompt, and the system appears to be hung. This is OK! The text you typed is admittedly somewhat cryptic:

  • The hyphen (-) after the word cat means "use text typed from the console"
  • The greater-than symbol (>) means "redirect the output to a file"
  • The gopaup part is the name of the file to which the output will be redirected

The cat command is now waiting for you to type something. Type the following and, when finished, press the Ctrl-d key combination to tell cat that you are done:

#$ -o junk.txt -j y
cd $HOME/pauprun
paup -n run.nex

You can now use the cat command again to view the contents of the file you just created:

cat gopaup

Basically, cat just spews out the contents of whatever you give it to work with. If you give it a file name, it spews out the contents of the file. If you give it a hyphen, it reads text you type until you press Ctrl-d, then it spews that text out again. In your case, when you used the hyphen, you also told it to redirect its output to the file gopaup, so that's why you did not see what it spewed.
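If typing into a prompt that looks hung feels uncomfortable, a here-document offers another way to feed text to cat. The sketch below writes the same three lines to gopaup; the scratch directory is an assumption made so the example does not overwrite a file you have already created. Quoting the EOF marker keeps the shell from expanding $HOME while the file is being written, so the text lands in the file exactly as shown.

```shell
# Work in a throwaway directory (assumption for this sketch only).
scratch=$(mktemp -d)
cd "$scratch"

# Everything between the two EOF markers becomes cat's input.
# The quotes around 'EOF' stop the shell from expanding $HOME here.
cat > gopaup <<'EOF'
#$ -o junk.txt -j y
cd $HOME/pauprun
paup -n run.nex
EOF

# View the result, exactly as in the text above.
cat gopaup
```

On the cluster you would run only the cat lines, from your home directory.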

rm command: getting rid of files you no longer want

mv command: moving or renaming a file

cp command: copying a file
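Until these three sections are written, here is a minimal sketch of rm, mv and cp that is safe to try. The file and directory names (notes.txt, backup.txt, old.txt, archive) are made up for illustration, and everything happens in a throwaway directory.

```shell
# Throwaway directory and placeholder file (hypothetical names).
scratch=$(mktemp -d)
cd "$scratch"
printf 'placeholder\n' > notes.txt

cp notes.txt backup.txt   # cp: make a copy under a new name
mv backup.txt old.txt     # mv: rename a file
mkdir archive
mv old.txt archive/       # mv: move a file into a directory
rm notes.txt              # rm: delete a file (there is no trash can to recover it from)

ls -R                     # list what is left, including subdirectories
```

Because rm deletes immediately and permanently, it is worth running ls first to be sure you are naming the file you mean.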

The pico editor

Using PSFTP to upload files

Locate the file algae.nex that we used in the previous lab. If you have deleted it, you will need to download and save it again. Now create a run.nex file using the following text:

#nexus

begin paup;
  execute algae.nex;
  set criterion=likelihood autoclose;
  lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
  hsearch swap=none start=stepwise addseq=random nrep=1;
  lset basefreq=previous tratio=previous shape=previous;
  hsearch swap=tbr start=1;
  savetrees file=algae.ml.tre brlens;
end;

Make sure that both algae.nex and run.nex are in the same place as the PSFTP program, then start PSFTP by double-clicking it.

PSFTP should say something like this:

psftp: no hostname specified; use "open host.name" to connect

To open a connection to the cluster, type

open bbcxsrv1.biotech.uconn.edu

then supply your username and password when prompted.

To upload algae.nex to the cluster, type

put algae.nex

Now upload run.nex to the cluster by typing

put run.nex

If you do not see any error messages, then you can assume that the transfers worked. Type

quit

to exit the PSFTP program.
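If you upload the same files repeatedly, PSFTP can read its commands from a batch file given with the -b option instead of having them typed one at a time. The sketch below only writes such a file and prints the command that would use it; it makes no network connection. The batch file name upload.scr and the placeholder "username" are assumptions, and the shell syntax assumes a UNIX-like machine (on the lab's Windows computers you would create the file in a text editor and type the psftp command in a Command Prompt window).

```shell
# Throwaway directory and hypothetical batch file name.
scratch=$(mktemp -d)
cd "$scratch"
batch=upload.scr

# The batch file holds the same commands you would type at the psftp> prompt.
cat > "$batch" <<'EOF'
put algae.nex
put run.nex
quit
EOF
cat "$batch"

# The command that would run the batch file (printed, not executed).
# Replace "username" with your own login id; you are still asked for your password.
echo "psftp -b $batch username@bbcxsrv1.biotech.uconn.edu"
```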

Starting a PAUP* analysis

Sun Grid Engine

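If you've been following the directions in sequence, you now have two files (algae.nex and run.nex) in your home directory on the cluster. Because the gopaup script changes to the pauprun directory before starting PAUP*, both files must be in that directory for PAUP* to find them; if they are still in your home directory, move them by typing mv algae.nex run.nex pauprun. Now use the cat command to look at the contents of the gopaup file you created earlier. You should see this: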

#$ -o junk.txt -j y
cd $HOME/pauprun
paup -n run.nex

This file will be used by software called the Sun Grid Engine (SGE for short) to start your run. SGE provides a command called qsub that you will use to submit your analysis. SGE will then look for a node (i.e. machine) in the cluster that is currently not being used (or is being used to a lesser extent than other nodes) and will start your analysis on that node. This saves you the effort of looking amongst all 17 nodes in the cluster for one that is not busy.

Here is an explanation of each of the lines in gopaup:

  • Lines beginning with the two characters #$ are interpreted as options by SGE itself (the shell treats them as ordinary comments). In this case, -o junk.txt tells SGE to send any output from the program to a file named junk.txt, and -j y says to merge any error output into that same file (the j stands for join and the y for yes).
  • The second line is simply a cd command that changes the present working directory to the pauprun directory you created earlier. This will ensure that anything saved by PAUP* ends up in this directory rather than in your home directory. Note that $HOME is like a macro that will be expanded to the full path to your home directory.
  • The third and last line simply starts up PAUP* and executes the run.nex file. The -n flag tells PAUP* that no human is going to be listening or answering questions, so it should just use default answers to any questions it needs to ask during the run.
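Since gopaup changes into pauprun before starting PAUP*, the run can only succeed if the input files are in that directory. A small check before submitting can catch a misplaced file. The function name check_inputs is hypothetical, and the demonstration uses a throwaway directory rather than your real pauprun directory.

```shell
# check_inputs DIR FILE...
# Succeeds only if every FILE exists inside DIR; reports each missing file.
check_inputs() {
    dir=$1
    shift
    missing=0
    for f in "$@"; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Demonstration in a throwaway directory standing in for $HOME/pauprun.
demo=$(mktemp -d)
mkdir "$demo/pauprun"
touch "$demo/pauprun/run.nex"
check_inputs "$demo/pauprun" run.nex algae.nex || echo "not ready to submit"
touch "$demo/pauprun/algae.nex"
check_inputs "$demo/pauprun" run.nex algae.nex && echo "ready to submit"
```

On the cluster, the equivalent check would be check_inputs "$HOME/pauprun" run.nex algae.nex, run just before the job is submitted.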

Now you are ready to start the analysis. Make sure you are in your home directory, then type

qsub gopaup

You can see if your run is still going using the qstat command:

qstat

If it is running, you will see an entry containing gopaup and the status will be r, for running.

While PAUP* is running, you can use cat to look at the output, which SGE has been saving in the junk.txt file as instructed:

cat junk.txt
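As a run progresses, junk.txt grows, and cat prints all of it every time. The standard tail command prints only the last few lines, which is usually what you want when checking progress. The sketch below builds a stand-in junk.txt of numbered placeholder lines (not real PAUP* output) in a throwaway directory and shows its last five lines.

```shell
# Throwaway directory with a stand-in output file (placeholder text only).
scratch=$(mktemp -d)
cd "$scratch"
i=1
while [ "$i" -le 100 ]; do
    echo "placeholder output line $i"
    i=$((i + 1))
done > junk.txt

# Show just the end of the file instead of all 100 lines.
tail -n 5 junk.txt
```

On the cluster you would type only the tail command, from the directory that contains junk.txt.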

Part B: Starting a PAUP* run on the cluster

Part C: Starting a GARLI run on the cluster