Difference between revisions of "Phylogenetics: R Primer"

From EEBedia
Jump to: navigation, search
Line 16: Line 16:
 
Install the R package by going to http://www.r-project.org/ and following the download instructions for your platform.
 
Install the R package by going to http://www.r-project.org/ and following the download instructions for your platform.
  
== Plotting the gamma probability density function ==
+
== Gamma distribution ==
  
 
As an introduction to R, let's use it to explore the gamma distribution. The gamma distribution has two parameters that govern its shape and scale. Here are some basic facts about gamma distribuitons:
 
As an introduction to R, let's use it to explore the gamma distribution. The gamma distribution has two parameters that govern its shape and scale. Here are some basic facts about gamma distribuitons:
Line 44: Line 44:
  
 
When the gamma distribution is used to model rate heterogeneity in phylogenetics, its mean is set to 1 because it is being used to model ''relative'' rates. In this application, only the shape parameter <math>\alpha</math> is ever specified, and that is because the scale parameter <math>\beta</math> is determined by the constraint that the mean is 1.0.
 
When the gamma distribution is used to model rate heterogeneity in phylogenetics, its mean is set to 1 because it is being used to model ''relative'' rates. In this application, only the shape parameter <math>\alpha</math> is ever specified, and that is because the scale parameter <math>\beta</math> is determined by the constraint that the mean is 1.0.
 +
 
<div style="background-color: #ccccff;">What is the value of <math>\beta</math> if <math>\alpha</math> is known and the mean must be 1.0?</div>
 
<div style="background-color: #ccccff;">What is the value of <math>\beta</math> if <math>\alpha</math> is known and the mean must be 1.0?</div>
 +
 +
=== Creating a Histogram ===
 +
 +
Let's begin by drawing 1000 random numbers (deviates) from a gamma distribution (shape=0.5, scale=2.0) and creating a histogram. Type the following command into the R console to generate the random deviates and store them in the variable <tt>x</tt>:
 +
x = rgamma(1000, shape=0.5, scale=2.0)
 +
It may appear that nothing has happened, but you can view the 1000 values stored in the variable <tt>x</tt> by typing <tt>x</tt> at the prompt.
 +
 +
Creating a basic histogram from these 1000 values is easy:
 +
hist(x)
 +
Now refine the histogram by telling R that you want there to be 50 bars:
 +
hist(x, breaks=50)
 +
 +
<div style="background-color: #ccccff;">Now you create a histogram with 40 bars from 10000 deviates from a Gamma(shape=10, scale=0.1) distribution</div>

Revision as of 22:37, 15 March 2009

Adiantum.png EEB 349: Phylogenetics
This lab represents an introduction to the R computing language, which will be useful for upcoming homework assignments as well as for software written as R extensions (i.e. APE)

What is R?

Rlogo.jpg
R is one of two programming languages that we will use this semester (the other being Python). One might make the case that programs like PAUP that are associated with a unique command language also represent programming languages; however, R and Python are different in being general purpose (i.e. not written specifically for phylogenetics).

While R and Python share many similarities, R excels in tasks that involve statistics and creating plots. While a spreadsheet program like Excel can create charts, it is often easier to create the plot you need in R. The purpose of this lab is to introduce you to R by showing you how to make various plots related to probability distributions. This will also enable you to learn about several important probability distributions that serve as prior distributions in Bayesian phylogenetics.

Installing R

Install the R package by going to http://www.r-project.org/ and following the download instructions for your platform.

Gamma distribution

As an introduction to R, let's use it to explore the gamma distribution. The gamma distribution has two parameters that govern its shape and scale. Here are some basic facts about gamma distribuitons:

Shape parameter \alpha
Scale parameter \beta
Mean \alpha \; \beta
Variance \alpha \; \beta^2
Density function f(x) = \frac{x^{\alpha - 1} e^{\alpha/\beta}}{\beta^{\alpha} \Gamma(\alpha)}

Note that that \Gamma(\alpha) in the density function refers to the gamma function (which is distinct from the gamma distribution). The gamma function is easy to calculate when its argument is a positive integer:

\Gamma(x) = (x-1)!

When, however, the argument is an arbitrary positive real number (which is the usual situation for \alpha), then computing it gets more complicated. Fortunately, R handles details like this for us!

When the gamma distribution is used to model rate heterogeneity in phylogenetics, its mean is set to 1 because it is being used to model relative rates. In this application, only the shape parameter \alpha is ever specified, and that is because the scale parameter \beta is determined by the constraint that the mean is 1.0.

What is the value of \beta if \alpha is known and the mean must be 1.0?

Creating a Histogram

Let's begin by drawing 1000 random numbers (deviates) from a gamma distribution (shape=0.5, scale=2.0) and creating a histogram. Type the following command into the R console to generate the random deviates and store them in the variable x:

x = rgamma(1000, shape=0.5, scale=2.0)

It may appear that nothing has happened, but you can view the 1000 values stored in the variable x by typing x at the prompt.

Creating a basic histogram from these 1000 values is easy:

hist(x)

Now refine the histogram by telling R that you want there to be 50 bars:

hist(x, breaks=50)
Now you create a histogram with 40 bars from 10000 deviates from a Gamma(shape=10, scale=0.1) distribution