Difference between revisions of "Phylogenetics: Python Primer"

From EEBedia

Revision as of 17:02, 4 February 2009

EEB 349: Phylogenetics
This lab is an introduction to the Python computing language, which will be useful for upcoming homework assignments as well as for software written as Python extensions (e.g. Phycas).

What is Python?

Python is one of two programming languages that we will use this semester (the other being R). One might make the case that programs like PAUP that are associated with a unique command language also represent programming languages; however, Python and R are different in being general purpose (i.e. not written specifically for phylogenetics).

Python is one of a number of different high-level computing languages in common use. All of the software we use is written in one of these languages. PAUP is written entirely in the C language, while SplitsTree is written in Java. Knowing a little bit of computer programming can save you immense amounts of time by allowing you to automate things. You will begin realizing these savings while doing homework for this class. While it is possible to do all the homework assignments by hand using a calculator, you will find that using Python saves you time and is more accurate (a mistake made early in a calculator calculation can be very costly in terms of time and accuracy). Python is a good language to learn first because it is relatively simple (not many words or punctuation rules to learn) and is much more forgiving than other languages. It is in a class of languages known as scripting languages because the program is interpreted as it is read; languages such as C require two additional steps (compiling and linking) before they can be run.

Installing Python


If you have a Mac, then Python is already installed on your computer because it is used in many parts of the Mac OS X operating system. To verify this, and to see what version of Python you have, start your Terminal program (in the Applications/Utilities folder), then type the following at the Unix prompt (note: the prompt is represented by a dollar sign ($), so you should type only the "python -V" part):
$ python -V
Python 2.5.1

If you have Windows, you will need to download and install Python from the http://www.python.org/download/ web site. Note that there are two current versions of Python. You should install the older one (Python 2.6.1). (We would use Python 3.0 except that some Python software that we will use later in the semester is not yet compatible with Python 3.0, so it makes sense to stick with 2.6 for now.) There are several different types of downloads: choose Python 2.6.1 Windows installer (Windows binary -- does not include source) and follow the directions once you start it up.

Once you have Python installed, you can invoke it from a console window (in Vista, use Start, All Programs, Accessories, Command Prompt) by typing python. The -V option causes Python to print out its version number and then quit:

C:\Users\Administrator>python -V
Python 2.6.1

You should download the documentation from http://docs.python.org/download.html so that it can be accessed quickly. I find the HTML form best: I unpacked the zip file and bookmarked it in my browser as follows:

file:///Users/plewis/Documents/Manuals/python-html/index.html

Python basics

This is the briefest of introductions, designed to get you just to the point where you can do the majority of your homework assignments using Python. If you get stuck trying to write a Python program, I have found the Tutorial and Global Index to be the most useful parts of the Python documentation.

Kinds of information you can store in Python variables

Try typing in the Python code presented in the example sessions below. The >>> represents the Python prompt (you will see this when you start Python), so don't type that! The output is shown below each Python statement that generates output (don't type that either!). Finally, everything after a # character represents a comment (while you can type these in, it would probably be a waste of your time).

Integers, floats and strings

Integers are whole numbers (positive, negative or zero). Floats are numbers with an implicit or explicit decimal point. Strings are series of characters (e.g. words or sentences).

>>> s = 'Have a nice day'       # assign the string 'Have a nice day' to the variable s
>>> i = 9                       # assign the integer 9 to the variable i (you get to choose the variable names)
>>> f = 9.5                     # assign the float 9.5 to the variable f
>>> x = int('5')                # create an integer out of a string
>>> x                           # a variable name on a line by itself causes its value to be printed
5
>>> x = int('5.5')              # this will lead to an error because integers are whole numbers
Traceback (most recent call last):
...
ValueError: invalid literal for int() with base 10: '5.5'
>>> y = float('5.5')            # create a float out of a string
>>> y
5.5
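
As a small extension of the session above, the sketch below shows one way to turn a string such as '5.5' into an integer: convert it to a float first, then convert the float to an integer. The variable names are made up for illustration, and the code is written as a script (so print is used) in a form that should run under either Python 2.6 or Python 3.

```python
# Hypothetical example values; none of these names come from the lab itself.
text = '5.5'

as_float = float(text)      # string -> float
as_int = int(as_float)      # float -> integer; the fractional part is dropped, not rounded
back_to_text = str(as_int)  # integer -> string

print(as_float)
print(as_int)
print(back_to_text)
print(type(as_int) is int)  # check that the result really is an integer
```

Note that int() truncates rather than rounds, so a float such as 5.9 also becomes 5.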

Lists and tuples

Lists are collections of integers, floats, strings, etc. Tuples are like lists, except that you cannot change elements of a tuple.

>>> L = [i, f, s]               # a list consisting of the integer, the float and the string we just defined
>>> t = (i, f, s)               # a tuple consisting of the integer, the float and the string we just defined
>>> L
[9, 9.5, 'Have a nice day']
>>> L[0] = 5                    # change the first value in the list to 5
>>> L[-1] = 'Ok'                # change the last value in the list to "Ok"
>>> L                           # show the list L
[5, 9.5, 'Ok']
>>> t[0] = 5                    # try changing a tuple (this will generate an ugly error message)
Traceback (most recent call last):
    ...
TypeError: 'tuple' object does not support item assignment
>>> t                           # show the tuple t (see, no change was made)
(9, 9.5, 'Have a nice day')
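
Because a tuple cannot be changed in place, the usual workaround is to build a new one. The following is a minimal sketch (the names tmp, t2 and t3 are chosen here for illustration) showing two ways of doing that, along with the len() function, which works on both lists and tuples.

```python
t = (9, 9.5, 'Have a nice day')   # same values as the tuple in the session above

# Option 1: copy the tuple into a list, change the list, then convert back
tmp = list(t)
tmp[0] = 5
t2 = tuple(tmp)

# Option 2: join a one-element tuple onto a slice of the original
t3 = (5,) + t[1:]

print(t)        # the original tuple is untouched
print(t2)
print(t3)
print(len(t))   # number of elements in the tuple
```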

Dictionaries

Finally, a dictionary is a collection of values, each of which is looked up by a "key" rather than by its position. A key can be almost any Python value that cannot be changed (e.g. integers, floats, strings, tuples).

>>> d = {'lnL': -6955.43251, 'model': 'HKY85', 'ncycles': 1000}
>>> d['model']
'HKY85'
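
The sketch below illustrates a few more things that can be done with a dictionary: replacing a value, adding a new key, using a tuple as a key, and looping over the keys. The 'taxa' entry and the ndiffs dictionary are hypothetical additions made up for this example.

```python
d = {'lnL': -6955.43251, 'model': 'HKY85', 'ncycles': 1000}

d['ncycles'] = 2000          # replace the value stored under an existing key
d['taxa'] = 4                # add a new key/value pair (hypothetical entry)

# A tuple can serve as a key, e.g. a pair of taxon numbers (hypothetical data)
ndiffs = {(1, 2): 293, (1, 3): 277}

for key in sorted(d.keys()):          # sorted() gives a predictable order
    print('%s = %s' % (key, d[key]))

print(ndiffs[(1, 2)])                 # look up a value using a tuple key
print('model' in d)                   # test whether a key is present
```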

Using what you know

You already know enough Python to start using it to do your homework for this week.

Computing p-distances from numbers of differences

The first step in your homework for this week is to compute p-distances from the supplied numbers of differences. Here is one way to do this in Python:

>>> n = 3424   # store sequence length in variable n
>>> x = 293    # store number of differences between taxon1 and taxon2 in x
>>> p = x/n    # do the division
>>> p
0

Ok, you know that's not the right answer, so what went wrong? We divided one integer by another integer to produce a third integer, but instead of storing the number 0.085572429906542055 in p, it stored 0 because that is the whole number part of 0.085572429906542055. It is always a good idea to be explicit when you might get a float for an answer:

>>> p = float(x)/float(n)
>>> p
0.085572429906542055

The following solution saves some typing and will prevent you from being surprised by later calculations involving n and/or x:

>>> n = 3424.0
>>> x = 293.0
>>> p = x/n
>>> p
0.085572429906542055

This time p is a float because you divided one float by another.
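
If you ever do want just the whole-number part of a quotient, Python has a separate floor-division operator, //. The sketch below (assuming the same n and x as above; the name whole is made up here) contrasts it with float division. It deliberately avoids applying the plain / operator to two integers, because that operation behaves differently in Python 2 and Python 3.

```python
n = 3424   # sequence length
x = 293    # number of differences

whole = x // n            # floor division: always yields the whole-number part
p = float(x) / n          # converting one operand to a float is enough

print(whole)
print('%.5f' % p)         # show p rounded to 5 decimal places
```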

Computing a JC distance from a p-distance

The next step is to convert your p-distance to a JC distance. Now we need some capabilities (e.g. the ability to take the logarithm of a number) that are not present by default when Python starts up. To solve this deficiency, use the import statement to bring in the needed functionality:

>>> import math
>>> jc = -0.75*math.log(1.0 - 4.0*p/3.0)
>>> jc
0.090860500068600705
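
The same calculation can be wrapped in a function so that it is easy to reuse. This is only a sketch: the function name jc_distance is made up here, and the check for p >= 0.75 is an added safeguard (the logarithm is undefined there), not something this lab requires.

```python
import math

def jc_distance(p):
    """Return the JC distance for the p-distance p (hypothetical helper)."""
    if p >= 0.75:
        # 1 - 4p/3 would be zero or negative, so the logarithm is undefined
        raise ValueError('p-distance must be less than 0.75')
    return -0.75*math.log(1.0 - 4.0*p/3.0)

p = 293.0/3424.0                 # the p-distance computed earlier
print('%.8f' % jc_distance(p))   # JC distance rounded to 8 decimal places
```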

Loops

An important component of any programming language is the ability to do the same sort of thing many times. In the session that follows, you will create a list containing all six pairwise differences among sequences, then use a for loop to compute the JC distance for all six. Important: use a tab to indent. Python is very sensitive about indenting. If the three indented lines are not indented by exactly the same amount, Python will spit out an error message.

>>> n = 3424.     # note that you really just need the period to make it into a float
>>> x = [293.0, 277.0, 328.0, 268.0, 353.0, 353.0]
>>> jc = []       # start with an empty list
>>> for i in range(6):
...     p = x[i]/n                           # compute p-distance using the ith value in the x list
...     d = -0.75*math.log(1.0 - 4.0*p/3.0)  # compute the JC distance and store in variable d
...     jc.append(d)                         # lengthen the list jc by appending the new value d
...                                          # just hit return here to exit the loop body
>>> jc
[0.090860500068600705, 0.085604236767267153, 0.1024886399705747, 0.082663697169458011, 0.11090623674796182, 0.11090623674796182]
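
A for loop can also walk through the list itself rather than through the index numbers produced by range(6), which avoids having to know in advance how long the list is. Here is a minimal sketch reusing the numbers above (the loop variable name ndiff is chosen here for illustration), written as a script that should run under either Python 2.6 or Python 3:

```python
import math

n = 3424.
x = [293.0, 277.0, 328.0, 268.0, 353.0, 353.0]
jc = []
for ndiff in x:                          # ndiff takes each value in x in turn
    p = ndiff/n                          # compute the p-distance
    jc.append(-0.75*math.log(1.0 - 4.0*p/3.0))

print(len(jc))                           # one distance per entry in x
for d in jc:
    print('%.5f' % d)                    # each distance rounded to 5 decimal places
```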

Python script files

Now that your Python constructs are getting a little longer, it is a good time to learn about creating files for your Python programs. A file containing Python statements is called a Python script. It is a good idea to have a good text editor before starting to create scripts, so your next goal will be to download one if you do not already have one installed.

Note: Microsoft Word is not a good text editor! It is an excellent word processor, but text editors and word processors are different beasts. Text editors always save files as plain text, while Word saves the file in its proprietary file format that Python cannot read. It is possible to save Word files as plain text, but usually this is more trouble than it is worth.

Install a text editor

Fortunately, there are free (and really good) text editors for both Windows and Macs.


If you have a Mac, download TextWrangler. If you really get into scripting, you may want to purchase the beefier version known as BBEdit, produced by the same company, but TextWrangler will suffice for this course. Once you get TextWrangler installed, start it up and create a file with the contents shown below (after the instructions for Windows users). Save the file using the name first.py in a convenient location (e.g. Documents/scripts), then navigate to that folder in your Terminal window (using the command cd $HOME/Documents/scripts).


If you have Windows, download Notepad++. Once you get Notepad++ installed, start it up and create a file with the contents shown below. Save the file using the name first.py in a convenient location (e.g. C:\scripts), then navigate to that folder in your command console (using the command cd C:\scripts).


Your first Python script

Here is what you should type into your new first.py file:

import math
n = 3424.
x = [293.0, 277.0, 328.0, 268.0, 353.0, 353.0]
jc = []
for i in range(6):
    p = x[i]/n
    d = -0.75*math.log(1.0 - 4.0*p/3.0)
    jc.append(d)
print jc

Note that I have added the word print on the last line. This is because our little trick of getting Python to tell us the value of a variable by placing the variable's name on a line by itself only works when you are using Python interactively. We are now switching to programming mode, where an entire script is given to the Python interpreter, and a print statement must be used in this context. The result should be the same.

Try running your script by typing the following at your operating system prompt (i.e. if you are already in Python, get out of it first by typing Ctrl-d if you are a Mac user, or Ctrl-z followed by Enter if you are a Windows user):

python first.py

Fancier print statements

The program first.py spits out six JC distance values, but it is usually good to format the output so that it is clearer than this. Copy the following into a new text file named second.py and try running it:

import math
n = 3424.
x = [293.0, 277.0, 328.0, 268.0, 353.0, 353.0]
for i in range(6):
    p = x[i]/n
    d = -0.75*math.log(1.0 - 4.0*p/3.0)
    print '%12d %12.5f %12.8f' % (i, p, d)

You will note a couple of differences between first.py and second.py. First, there is no jc list anymore; we just print out values as they are computed. Second, the print statement is much more complicated now! The complexity might be daunting at first, but I think you will quickly get used to it. Here is how the print statement used in second.py works. There are two parts separated by a percent symbol. The first part is a string:

'%12d %12.5f %12.8f'

while the second part is a tuple:

(i, p, d)

The string serves as a format specification. Here is a breakdown:

  • %12d says print an integer (d is code for integer here) using 12 spaces
  • %12.5f says print a float (f is code for float here) using 12 spaces total, with 5 of the 12 being devoted to the part after the decimal point
  • %12.8f says print another float using 12 spaces, this time with 8 of the 12 being after the decimal point

The tuple provides the values to insert: the integer stored in the variable i will go in the first spot, while the floats stored in p and d will go in the second and third spots. Be sure to use "d" for integers and "f" for floats in your format string, otherwise the output may not be what you expect (for example, "d" silently drops the fractional part of a float).

Run second.py and see if Python spaced everything as you expected. Note that each float will actually take up 13 spaces because there is a single space in the format string just before each of the %12.5f and %12.8f specifications. Spaces in the format string are inserted as is into the print output.
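
To see the effect of the field widths directly, the sketch below builds each formatted string first and checks its length before printing it. The header line and the variable names header and line are additions made here for illustration; %12s formats a string in 12 spaces in the same way that %12d does for an integer. The code is written so that it should run under either Python 2.6 or Python 3.

```python
import math

n = 3424.
x = [293.0, 277.0, 328.0, 268.0, 353.0, 353.0]

header = '%12s %12s %12s' % ('pair', 'p', 'JC')   # column titles, right-justified
print(header)
for i in range(6):
    p = x[i]/n
    d = -0.75*math.log(1.0 - 4.0*p/3.0)
    line = '%12d %12.5f %12.8f' % (i, p, d)       # same format string as second.py
    print(line)

print(len(line))   # 12 + 1 + 12 + 1 + 12 = 38 characters
```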