Phylogenetics: Bioinformatics Cluster

From EEBedia
Revision as of 15:59, 18 February 2007 by PaulLewis (Talk | contribs) (Starting a PAUP* analysis)

This article is still under construction.
Expect it to change frequently until this notice is removed.
EEB 349: Phylogenetics
The goal of this lab exercise is to show you how to use the Bioinformatics Facility computer cluster to run PAUP* and GARLI.

Part A: Using the UConn Bioinformatics Facility cluster

The Bioinformatics Facility is part of the UConn Biotechnology Center, which is located behind the Up-N-Atom Cafe in the lower level of the Biology/Physics building. Jeff Lary maintains a 17-node Apple Xserve G5 Cluster that can be used by UConn graduate students and faculty to conduct bioinformatics-related research (sequence analysis, biological database searches, phylogenetics, molecular evolution). Each of you has been given an account on the cluster, and today you will learn how to start analyses remotely (i.e. from this computer lab), check on their status, and download the results when an analysis is finished.

Obtaining the necessary communications software

You will be using a couple of simple (and free) programs to communicate with the head node of the cluster. Visit the PuTTY web site, scroll down to the section labeled "Binaries" and save putty.exe and psftp.exe on your desktop.

PuTTY

The program PuTTY will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell) that encrypts everything sent over the internet. You will use PuTTY to send commands to the cluster and see the output generated. In the old days, a protocol known as Telnet was used for this purpose, but it is no longer used because it did not encrypt anything, making it easy for someone with access to the network to see your username and password in plain text.

PSFTP

The other program you will use is called PSFTP. It allows you to transfer files back and forth using SFTP (Secure File Transfer Protocol). It replaces the old protocol (FTP) that, like Telnet, sent usernames and passwords unencrypted across the network.

Programs vs. protocols

SSH and SFTP are protocols, not programs. PuTTY and PSFTP are programs that implement the SSH and SFTP protocols, respectively. Before long, you may be thinking "I liked FTP much better than SFTP!" because you probably used user-friendly, graphical FTP programs in the past. If you think this, then you are confusing protocols with the programs that implement them. There are much fancier, easier-to-use programs for SSH and SFTP than PuTTY and PSFTP, but these will serve us well today. One nice thing is that these programs are so small that you can just download them whenever and wherever you happen to need them. If you find yourself wanting a fancier SFTP client, check out FileZilla (on Windows) or Fugu (for Macs).

Logging in for the first time

On the whiteboard you will find your login id (user name) and password.

Double-click the PuTTY icon on your desktop to start the program. In the Host Name (or IP address) box, type bbcxsrv1.biotech.uconn.edu. Now type Bioinformatics cluster into the Saved Sessions box and press the Save button. This will save having to type the computer's name each time you want to connect. Now click the Open button to start a session. The first time you connect, you will get a PuTTY Security Alert. Just press the Yes button to close this dialog.

Now you should see the following prompt:

login as:

Type in your username and press Enter. Now you should see the password prompt:

Password:

Type in your password and press Enter. If all goes well, you should see something like this:

Welcome to Darwin!
[bbcxsrv1:~] plewis%

except that your username should appear instead of mine (plewis).

The first thing you should do is change your password. Type

passwd

and press the Enter key, then follow the directions to change your password. If you have trouble thinking up passwords that are acceptable, check out the Java Password Generator web site. It generates passwords that are not really words but sound like they are, so they are easier to remember than completely random passwords.

Learning enough UNIX to get around

I'm presuming that you do not know a lot of UNIX commands, but even if you are already a UNIX guru, please complete this section anyway because otherwise you will fail to create some files you will need later.

The cluster comprises Macintosh G5 computers running Mac OS X, which is essentially a UNIX operating system with a very nice user interface. Today, however, you will not be using the nice user interface! Instead, you will be communicating through the UNIX command console, so the first step is to learn a few important UNIX commands. You can use these commands on any Macintosh running Mac OS X by opening the Terminal application, which you can find in the Applications/Utilities folder on your hard drive.

ls command: finding out what is in the present working directory

The ls command lists the files in the present working directory. Try typing just

ls

If you need more details about files than you see here, type

ls -la

instead. This version provides information about file permissions, ownership, size, and last modification date.

pwd command: finding out what directory you are in

Typing

pwd

shows you the full path of the present working directory. The path shown should end with your username, indicating that you are currently in your home directory.

mkdir command: creating a new directory

Typing the following command will create a new directory named pauprun in your home directory:

mkdir pauprun

Use the ls command now to make sure a directory of that name was indeed created.

cd command: leaving the nest and returning home again

The cd command lets you change the present working directory. To move into the newly-created pauprun directory, type

cd pauprun

You can always go back to your home directory (no matter how lost you get!) by typing cd by itself:

cd
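The four commands above can be tried together without touching your real home directory. This is only a sketch, assuming a Bourne-style shell (sh or bash); the scratch directory created by mktemp is a hypothetical stand-in for your home directory and is not part of the lab.

```shell
# Work in a throwaway directory so nothing in your real home directory changes.
scratch=$(mktemp -d)      # hypothetical stand-in for your home directory
cd "$scratch"

pwd                       # show the full path of the present working directory
mkdir pauprun             # create the new directory
ls -la                    # pauprun should now appear in the listing
cd pauprun                # move into the new directory
here=$(pwd)               # remember where we ended up
cd "$scratch"             # go back up (on the cluster, plain "cd" takes you home)
```

On the cluster itself you would skip the first two lines and simply type the commands one at a time.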

cat command: creating and viewing files

The cat command was designed for concatenating files, but I most often use it for viewing and creating files. To create a new file named gopaup, type the following (be sure to leave a space between each item, just as you see it below) and then press the Enter key

cat - > gopaup

Note that you no longer see the UNIX prompt, and the system appears to be hung. This is ok! The text you typed is admittedly somewhat cryptic:

  • The hyphen (-) after the word cat means "use text typed from the console"
  • The greater-than symbol (>) means "redirect the output to a file"
  • The gopaup part is the name of the file to which the output will be redirected

The cat command is now waiting for you to type something. Type the following and, when finished, press the Ctrl-d key combination to tell cat that you are done:

#$ -o junk.txt -j y
cd $HOME/pauprun
paup -n run.nex

If you make mistakes while typing, don't fear! You can fix them later using the pico editor. Use the cat command again to view the contents of the file you just created:

cat gopaup

Basically, cat just spews out the contents of whatever you give it to work with. If you give it a file name, it spews out the contents of the file. If you give it a hyphen, it reads text you type until you press Ctrl-d, then it spews that text out again. In your case, when you used the hyphen, you also told it to redirect its output to the file gopaup, so that's why you did not see what it spewed.
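The same file can also be created without typing Ctrl-d by using a here-document, a standard shell feature in which cat reads everything up to a marker word. This is a minimal sketch, assuming a Bourne-style shell (sh or bash); the scratch directory is hypothetical and is there only so that the example does not overwrite your real gopaup file.

```shell
scratch=$(mktemp -d)      # hypothetical scratch directory
cd "$scratch"

# Quoting 'EOF' stops the shell from expanding $HOME right now, so the file
# contains the literal text $HOME, exactly as if you had typed it into cat.
cat > gopaup << 'EOF'
#$ -o junk.txt -j y
cd $HOME/pauprun
paup -n run.nex
EOF

cat gopaup                # view the result
```

The marker word (EOF here) is arbitrary; cat stops reading when it sees a line containing only that word.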

The pico editor

Another way to create a new file, or edit one that already exists, is to use the pico editor. Most people like using pico better than cat for creating new files: the only advantage cat has over pico is that it is guaranteed to be present on every UNIX computer, whereas pico is only present on some. You will now use pico to create a run.nex file containing a paup block. You will later execute this file in PAUP* to perform an analysis.

First use the pwd command to see where you are, then use cd to go into the pauprun directory if you are not already there. Type

pico run.nex

This will open the pico editor, and it should say [ New File ] at the bottom of the window to indicate that the run.nex file does not already exist. Note the menu of the commands along the bottom two rows. Each of these commands is invoked using the Ctrl key with the letter indicated. Thus, ^X Exit indicates that you can use the Ctrl key in combination with the letter X to exit pico.

For now, type the following into the editor:

#nexus

begin paup;
  execute algae.nex;
  set criterion=likelihood autoclose;
  lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
  hsearch swap=none start=stepwise addseq=random nrep=1;
  lset basefreq=previous tratio=previous shape=previous;
  hsearch swap=tbr start=1;
  savetrees file=algae.ml.tre brlens;
end;

Once you have entered everything, use ^X to exit. Pico will ask if you want to save the modified buffer, at which point you should press the Y key to answer yes. Pico will now ask you whether you want to use the file name run.nex; this time just press Enter to accept. Pico should now exit and you can use cat to look at the contents of the file you just created:

cat run.nex

Using PSFTP to upload files

Locate the file algae.nex that we used in the previous lab. If you have deleted it, you will need to download and save it again.

Make sure that algae.nex is in the same place as the PSFTP program, then start PSFTP by double-clicking it.

PSFTP should say something like this:

psftp: no hostname specified; use "open host.name" to connect

To open a connection to the cluster, type

open bbcxsrv1.biotech.uconn.edu

then supply your username and password when prompted.

To upload algae.nex to the cluster, type

put algae.nex

If you do not see any error messages, then you can assume that the transfer worked. Type

quit

to exit the PSFTP program.

A few more UNIX commands

You have now transferred a large file (algae.nex) to the cluster, but it is not in the right place. The algae.nex file should now be in your home directory, whereas the run.nex file is in the pauprun directory. The run.nex file contains this line

execute algae.nex

which means that algae.nex should also be located in the pauprun directory. Use the following commands to ensure that (1) you are in your home directory, and (2) algae.nex is also in your home directory:

 cd $HOME
 ls algae.*

Note the use of a wildcard character (*) in the ls command. This will show you only files whose names begin with algae followed by a period, with any characters (or none at all) after the period.
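To see what the wildcard does and does not match, it may help to try it on a few empty files. This is a sketch assuming sh or bash; apart from algae.nex and run.nex, the file names are made up for illustration.

```shell
scratch=$(mktemp -d)      # hypothetical scratch directory
cd "$scratch"

# Empty stand-in files; algae.ml.tre and algaebloom.txt are invented names.
touch algae.nex algae.ml.tre algaebloom.txt run.nex

ls algae.*                # lists algae.ml.tre and algae.nex only
count=$(ls algae.* | wc -l | tr -d ' ')
echo "$count files matched"
```

The file algaebloom.txt is skipped because the pattern requires a period immediately after algae.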

mv command: moving or renaming a file

Now use the mv command to move algae.nex to the directory pauprun:

mv algae.nex pauprun

The mv command takes two arguments. The first argument is the name of the directory or file you want to move, whereas the second argument is the destination. The destination could be either a directory (which is true in this case) or a file name. If the directory pauprun did not already exist, mv would have interpreted this as a request to rename algae.nex to the file name pauprun! So, be aware that mv can rename files as well as move them.
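The difference between moving and renaming can be seen side by side in a scratch directory. This is a sketch assuming sh or bash; notes.txt and backup are hypothetical names used only for this illustration.

```shell
scratch=$(mktemp -d)      # hypothetical scratch directory
cd "$scratch"
touch algae.nex notes.txt # empty stand-in files
mkdir pauprun

mv algae.nex pauprun      # pauprun is an existing directory: the file moves into it
mv notes.txt backup       # nothing named backup exists: the file is renamed to backup

ls -l pauprun             # shows algae.nex
ls -l                     # shows backup (a file) and pauprun (a directory)
```

Running ls on the destination before using mv is a cheap way to find out which of the two behaviors you are about to get.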

cp command: copying a file

The cp command copies files. It leaves the original file in place and makes a copy elsewhere. You could have used this command to get a copy of algae.nex into the directory pauprun:

cp algae.nex pauprun

This would have left the original in your home directory, and made a duplicate of this file in the directory pauprun.
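If you want to convince yourself that cp really leaves two identical files, the standard cmp command compares them byte by byte. This is a sketch assuming sh or bash; the one-line algae.nex created here is a stand-in, not the real data file.

```shell
scratch=$(mktemp -d)          # hypothetical scratch directory
cd "$scratch"
echo "#nexus" > algae.nex     # tiny stand-in for the real data file
mkdir pauprun

cp algae.nex pauprun          # the original stays put; a duplicate appears in pauprun

# cmp is silent and succeeds when the two files have identical contents.
cmp algae.nex pauprun/algae.nex && echo "copy is identical to the original"
```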

rm command: cleaning up

The rm command removes files. If you had used the cp command to copy algae.nex into the pauprun directory, you could remove the original file using these commands:

cd
rm algae.nex

The cd command on the first line just ensures that the copy you are removing will be the one in your home directory (typing cd by itself acts the same as typing cd $HOME). If it bothers you that the system always asks your permission before deleting a file, you can force the issue using the -f option (but just keep in mind that this is more dangerous):

rm -f algae.nex

To delete an entire directory (don't try this now!), you can add the -r flag, which means to recursively apply the remove command to everything in every subdirectory:

rm -rf pauprun

The above command would remove everything in the pauprun directory (without asking!), and then remove the pauprun directory itself. I want to stress that this is a particularly dangerous command, so make sure you are not weary or distracted when you use it! Unlike the Windows or Mac graphical user interface, files deleted using rm are not moved first to the Recycle Bin or Trash, they are just gone. There is no undo for the rm command.
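One habit that may reduce accidents is to list exactly what a recursive remove would delete before running it. This is a sketch assuming sh or bash and a throwaway directory; the subdirectory old and the files in it are hypothetical, so nothing from the lab is harmed.

```shell
scratch=$(mktemp -d)      # hypothetical scratch directory
cd "$scratch"
mkdir -p pauprun/old      # -p also creates the parent directory pauprun
touch pauprun/run.nex pauprun/old/junk.txt   # empty stand-in files

ls -R pauprun             # look first: everything listed here is about to be deleted
rm -rf pauprun            # removes the directory and all its contents, no questions asked
ls                        # prints nothing: pauprun is gone for good
```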

Starting a PAUP* analysis

Sun Grid Engine

If you've been following the directions in sequence, you now have two files (algae.nex and run.nex) in your $HOME/pauprun directory on the cluster, whereas the gopaup file should be in $HOME. Use the cd command to make sure you are in your home directory, then the cat command to look at the contents of the gopaup file you created earlier. You should see this:

#$ -o junk.txt -j y
cd $HOME/pauprun
paup -n run.nex

This file will be used by software called the Sun Grid Engine (SGE for short) to start your run. SGE provides a command called qsub that you will use to submit your analysis. SGE will then look for a node (i.e. machine) in the cluster that is currently not being used (or is being used to a lesser extent than other nodes) and will start your analysis on that node. This saves you the effort of looking amongst all 17 nodes in the cluster for one that is not busy.

Here is an explanation of each of the lines in gopaup:

  • Lines beginning with the two characters #$ are interpreted as commands by SGE itself. In this case, the command tells SGE to send any output from the program to a file named junk.txt, and the -j y part says to merge any error output into the same file (the j stands for join and the y for yes).
  • The second line is simply a cd command that changes the present working directory to the pauprun directory you created earlier. This will ensure that anything saved by PAUP* ends up in this directory rather than in your home directory. Note that $HOME is like a macro that will be expanded to the full path to your home directory.
  • The third and last line simply starts up PAUP* and executes the run.nex file. The -n flag tells PAUP* that no human is going to be listening or answering questions, so it should just use default answers to any questions it needs to ask during the run.
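The idea of joining error output to normal output can be imitated in an ordinary shell with redirection. This is only an analogy in sh or bash syntax, not a description of how SGE works internally; the function noisy is a hypothetical stand-in for a program such as PAUP* that writes to both output streams.

```shell
scratch=$(mktemp -d)      # hypothetical scratch directory
cd "$scratch"

# Hypothetical program: one line to standard output, one to standard error.
noisy() {
  echo "normal output"
  echo "error output" >&2
}

noisy > separate.txt 2> errors.txt   # not joined: the two streams land in different files
noisy > junk.txt 2>&1                # joined, in the spirit of -o junk.txt -j y

cat junk.txt                         # both lines appear in the one file
```

Joining the streams means there is a single file to check when something goes wrong during a run.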

Now you are ready to start the analysis. Make sure you are in your home directory, then type

qsub gopaup

You can see if your run is still going using the qstat command:

qstat

If it is running, you will see an entry containing gopaup and the status will be r, for running.

While PAUP* is running, you can use cat to look at the output, which SGE has been saving in the junk.txt file as instructed:

cat junk.txt
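As a run progresses, junk.txt can grow long enough that cat scrolls most of it off the screen. The standard tail command shows only the end of a file. This is a sketch assuming sh or bash; the 100-line junk.txt built here is fake output created purely for illustration.

```shell
scratch=$(mktemp -d)      # hypothetical scratch directory
cd "$scratch"

# Build a fake 100-line output file standing in for real PAUP* output.
i=1
while [ "$i" -le 100 ]; do
  echo "line $i"
  i=$((i + 1))
done > junk.txt

tail -n 5 junk.txt        # show only the last 5 lines (line 96 through line 100)
```

On the cluster you would run only the tail command, from the directory in which SGE wrote junk.txt.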

Part B: Starting a PAUP* run on the cluster

Part C: Starting a GARLI run on the cluster