Phylogenetics: Modeltest Lab
From eebedia
EEB 349: Phylogenetics | |
The goal of this lab exercise is to (a) introduce you to Modeltest, which uses PAUP* to perform likelihood ratio tests and organizes the results to make it easier to choose a substitution model, and (b) gain experience with probability distributions by plotting several gamma distributions in Excel and computing areas under portions of these distributions. |
Contents |
Part A: Modeltest
This section of the lab is intended to familiarize you with a very popular program (Modeltest) for conducting model selection for purposes of likelihood or Bayesian phylogenetics. Modeltest provides a PAUP block that will cause PAUP* to calculate and save the maximum likelihood scores of each of 56 distinct models to a file. Modeltest can then be used to conduct likelihood ratio tests, and AIC or BIC comparisons. Note: much of what Modeltest does you have already done yourself in previous lab and homework exercises. Modeltest goes beyond what we've done in class thus far, however, in providing model-averaged estimates of parameters (more on this later...)
- First, download Modeltest from the Modeltest website
- Unpack the zip file you downloaded to create a directory named Modeltest3.7 folder (you can ignore the folder named __MACOSX
- Download algae.nex and save it in the paupblock folder (which is inside Modeltest3.7 folder)
Running PAUP* to create the model.scores file
- Start PAUP* and execute the algae.nex file
- Open (edit) the file modelblockPAUPb10 (in the paupblock folder) in PAUP*. This is simply a Nexus command file that tells PAUP* to first estimate a tree and then find the maximum likelihood estimates for 56 different models.
- What method (parsimony, distance, or likelihood? search strategy?) does Modeltest use to find the test upon which all results will be based?
Running Modeltest to analyze the model.scores file
PAUP* will now begin computing likelihoods and saving them to a file named model.scores.When PAUP* has finished, you need to run Modeltest on the model.scores file to conduct all model comparisons.
- The model.scores file should be in the paupblock folder. Inside that same folder, create a file named go.bat, containing this text:
..\bin\Modeltest3.7.win.exe -b -n920 < model.scores > output.txt pause
- Double-click this batch file to run it. This will run the modeltest program (which resides in the bin directory inside Modeltest3.7 folder. The model.scores file will serve as input to the Modeltest program, and the output will be saved in output.txt. Open the Modeltest3.7.pdf file in the doc folder and find out what the -b and -n920 options mean. This file also presents a nice overview of the theory behind what Modeltest does.
Examine the output.txt file and answer these questions (ask for help if you have trouble):
- What model was selected using hierarchical likelihood ratio tests?
- Looking at the parameter estimates, how does this model differ from the HKY85+G model?
- What model was selected using the Bayesian Information Criterion (BIC)?
In the table titled MODEL SELECTION UNCERTAINTY, the 56 models are listed in order of their posterior probabilities (as estimated using BIC).
- Which models together account for 95% of the posterior probability?
At the end of the output.txt file, look for the table titled MODEL AVERAGING AND PARAMETER IMPORTANCE. This table lists parameter estimates based on all the models. For each model in which a parameter occurs, the estimate is weighted by the BIC weight for that model. If you are looking for a good, overall estimate of a model parameter, this is one of the best places to find it. These parameter estimates to some extent take account of uncertainty in the substitution model. (Personally, I would base parameter estimates on a better tree than the default used by Modeltest, but that is easy to change in the command file that PAUP* analyzes.)
The importance of a particular parameter is also based on BIC weights. For example, in my analysis, pinv (in +I models only) had importance value 0.0309, which means that the BIC weights of all 14 models containing pinv (but not gamma shape also) added up to only 0.0309.
- Looking through the PARAMETER IMPORTANCE table, what is the most important model parameter for this data set?
A word about Mrmodeltest
Because MrBayes implements only 24 of the 56 models examined by Modeltest, some MrBayes users experienced frustration when Modeltest suggested a "best" model that they could not use. The program Mrmodeltest only examines the 24 nucleotide models implemented in MrBayes, so you might want to explore that program if you anticipate analyzing your data in MrBayes rather than PAUP*.
Part B: Gamma distributions in Excel
Plotting a gamma distribution
Create a plot of a Gamma(10, 0.5) distribution in Excel by following these instructions:
- Open Microsoft Excel
- In cell A5, enter the word shape, and in the next cell to the right (B5), enter the numerical value 10.0
- In cell A6, enter the word scale, and in the next cell to the right (B6), enter the numerical value 0.5
- In cell A7, enter the word mean, and in the next cell to the right (B7), enter the cell formula =B5*B6 (the mean of a gamma distribution is the product of the shape and scale parameters)
- In cell A8, enter the word variance, and in the next cell to the right (B8), enter the cell formula =B5*B6^2 (the variance of a gamma distribution is the shape times the scale parameter squared)
- In cell A10, enter 0.0
- In cell A11, enter =A10+0.1
- Now select cell A11, then drag downward using the black square "handle" in the lower right corner. Drag down until you get to cell A220. Cell A220 should now have the value 21.
- Now move back to the top (Ctrl-Home key), entering the following in cell B10:
=gammadist(A10, $B$5, $B$6, false)
- Select cell B10, hover your mouse over the black handle until the cursor changes to a solid black cross, then double-click to automatically copy that cell's formula all the way down to B220
- Select cells B10:B220, then choose Insert > Chart... from the main menu to bring up the Chart Wizard
- Choose Chart type Line, click on the upper-left chart sub-type (plain line), then click the Next button
- Click the Series tab, then click inside the box labeled Category (X) axis labels. Move over to the worksheet again, and select all cells in the range A10:A220 (the fast way is to select just A10, then press Shift-End followed by Shift-Downarrow). When the range A10:A220 is surrounded by a dotted line, click the Finish button in the Chart Wizard
- Excel will create the chart close to the bottom of the selected cells. Grab the chart and move it to the top of the worksheet to make it easier to see the effect of modifying shape and scale parameters.
You have just plotted a Gamma probability density function having shape parameter 10 and scale parameter 0.5. You should try several other combinations of shape and scale parameters to see what these gamma distributions look like. Answer these questions before moving on:
- What values of the shape parameter cause cell B10 to display #NUM! (indicating that the curve shoots off to infinity at values approaching zero)?
- What values of the shape parameter cause the curve to peak, dropping to zero on the left and right?
Exponential distributions
An Exponential distribution is a special case of the gamma distribution that occurs when the shape parameter is exactly 1.0. Set the shape parameter to 1.0 and play with various values (e.g. 2, 3, 4, 5, etc.) of the scale parameter. Can you determine a pattern by looking at where the curve hits the y-axis when x = 0? Note that you should ignore the value in cell B10 because Excel does not calculate this value correctly for some reason; instead, just sight along the curve as it approaches x = 0.0. You should find that the value of the exponential density function at zero is a function of its mean. What is this relationship?
Gamma distributions and rate heterogeneity
We have encountered gamma distributions in the course previously. They are used to allow rate heterogeneity in substitution models. You may remember that a gamma distribution having a small shape parameter (e.g. 0.1) implies a lot of rate heterogeneity, whereas a large shape parameter (e.g. 100) implies almost no rate heterogeneity. In the rate heterogeneity application, the scale parameter is always equal to the inverse of the shape parameter (which is why no one ever mentions the gamma scale parameter in discussions of rate heterogeneity). Try plotting a gamma distribution with shape = 100 (and scale = 0.001). Now plot a gamma distribution in which shape = 0.1 (and scale = 10).
- Why is scale = 1/shape in this application? Hint: it has to do with ensuring the the mean of the relative rates is 1.0.
Finding areas under a Gamma density curve
Use column C of your worksheet to find the cumulative area under the gamma density from 0.0 up to each value in the A column:
- In cell C10, enter this formula:
=gammadist(A10, $B$5, $B$6, true)
- Replicate the formula down to C220, then create a line plot as you did for the B column
The formula for obtaining the cumulative area under the density is nearly the same as that for obtaining the density itself: the difference lies in whether the fourth argument to the gammadist fuction is true (cumulative distribution) or false (density function).
Set the shape and scale of your gamma distribution back to 10 and 0.5, respectively, then answer these questions (write the numbers you find down for later comparison):
- What is the value now in cell C220? Does this value make sense?
- As the x value increases, the area under the density accumulates. At what x value does the cumulative area first exceed 0.95?
- What is the median of this distribution? Hint: find the x value at which the cumulative area is as close as possible to 0.5.
- What interval captures the middle 95% of this distribution? To answer this, find the x values corresponding to 2.5% and 95%. It is unlikely that you will be able to find the interval precisely, but see how close you can get with the given spacing of x values.
Drawing samples from a Gamma distribution
In a separate worksheet, try implementing these instructions:
- In cell A1, type
=rand()
- Replicate this cell down to A1000. You now have a column containing 1000 uniform(0,1) deviates. A uniform(0,1) deviate is a sample chosen at random from a Uniform distribution between 0.0 and 1.0.
- In cell B1, type
=gammainv(A1, 10, 0.5)
- Replicate this cell down to B1000. You now have (in column B) 1000 random gamma(10,0.5) random deviates.
- Find the sample mean of the 1000 deviates in column B by typing the following in cell D1
=average(B1:B1000)
- Find the sample standard deviation of the 1000 deviates by tying the following into cell D2
=stdev(B1:B1000)
- Find the sample variance by squaring the sample standard deviation. That is, in cell D3, type
=D2^2
Press F9 a few time to recalculate. Watch the mean and variance fluctuate. Recall that for a gamma distribution with shape = 10 and scale = 0.5, the mean is 5 (shape*scale) and the variance is 2.5 (shape*scale^2).
- Do they fluctuate around the theoretical mean and variance?