Difference between revisions of "Phylogenetics: SMap lab"

From EEBedia
Jump to: navigation, search
(Stochastic mapping for the flower type trait)
(Performing a D test)
Line 245: Line 245:
  
 
We've told the program sMap to generate 100 posterior predictive mappings for every one of the 1000 stochastic mappings that are conditioned on the actual data: that's 100*1000, or 100,000 mappings! Note that the posterior predictive mappings are not saved anywhere that you can view them; they are stored in the file flower/sym.smap.bin, which cannot be viewed in a text editor. You can and should look at the file flower/sym.smap.pdf, however, as that shows a summary of the 1000 mappings that were conditioned on the flower type trait data.
 
We've told the program sMap to generate 100 posterior predictive mappings for every one of the 1000 stochastic mappings that are conditioned on the actual data: that's 100*1000, or 100,000 mappings! Note that the posterior predictive mappings are not saved anywhere that you can view them; they are stored in the file flower/sym.smap.bin, which cannot be viewed in a text editor. You can and should look at the file flower/sym.smap.pdf, however, as that shows a summary of the 1000 mappings that were conditioned on the flower type trait data.
 +
 +
<div style="background-color: #ccccff">
 +
* ''What is the difference between a posterior-predictive mapping and a standard mapping?''  {{title|the posterior predictive mapping is exactly the same as a standard mapping except that it conditions on newly simulated data rather than the original data|answer}}
 +
</div>
 +
 +
We've switched to doing a Bayesian MCMC analysis now rather than using ML. That means we need priors for rates and root frequencies. These are specified in the model file ''flower_models/Pontederiaceae_flower.model.Bayes.SYM.nex''. Take a look at that file now:
 +
cat flower_models/Pontederiaceae_flower.model.Bayes.SYM.nex
 +
 +
<div style="background-color: #ccccff">
 +
* ''What prior distribution was used for each of the estimated rate parameters?''  {{title|a LogNormal distribution was used for the rate from T to M and the rate from M to E, an Exponential distribution was used for the rate from T to E, and a flat Dirichlet distribution was used for root state frequencies|answer}}
 +
</div>
 +
 +
This program takes a very practical approach to incorporating uncertainty about the tree and edge lengths. It uses a sample of 1000 trees from the posterior distribution produced by a previous Bayesian analysis (carried out using a separate program BEAST). This saves having to include nucleotide substitution models and the whole Bayesian phylogeny machinery into sMap. As the sMap MCMC runs, it simply randomly chooses a tree each iteration from the 1000 trees in the file ''Pontederiaceae.treedist''. You can see these tree choices if you look at the <tt>T</tt> column in either one of the log files (''flower/sym.run1.log'' or ''flower/sym.run2.log'').
 +
 +
<div style="background-color: #ccccff">
 +
* ''Is that a statistically sound way to incorporate uncertainty in tree topology and edge lengths?''  {{title|it is valid, but does assume that the data used to generate the tree posterior distribution are different than the data used to estimate the substitution rates for different flower type transitions on those trees. There is certainly room for concern that branch lengths from molecular data are not appropriate for the morphological character under study|answer}}
 +
</div>
  
 
== Stochastic mapping for the self-incompatibility trait ==
 
== Stochastic mapping for the self-incompatibility trait ==

Revision as of 01:00, 10 April 2020

Adiantum.png EEB 349: Phylogenetics
In this lab you will learn how to use the program sMap, written by Giorgio Bianchini and Patricia Sánchez-Baracaldo. sMap uses stochastic character mapping to assess correlation among discrete characters.

Download sMap

I will assume that you will be running sMap on the Xanadu cluster, but you may wish to run it on your Windows or Mac laptop instead to avoid the tedium of transferring PDF files from Xanadu back to your laptop for viewing. If you want to just run sMap on your laptop, visit the sMap GitHub page and follow the directions there to download the version for your platform (as well as the zip file for Tutorial8).

Login to Xanadu and request a machine as usual:

srun --pty -p mcbstudent --qos=mcbstudent bash

Use curl to download the tar-gzipped file sMap-linux-x64.tar.gz (note the L in -LO is important in this case because the URL involves a redirect that should be followed to get to the actual file):

curl -LO https://github.com/arklumpus/sMap/releases/download/v1.0.5-2/sMap-linux-x64.tar.gz

Unpack the tar file:

tar zxvf sMap-linux-x64.tar.gz

You can now delete the tar.gz file if you want:

rm sMap-linux-x64.tar.gz

Making sMap easier to invoke

The executable files sMap, Blend-sMap, Merge-sMap, and Stat-sMap are buried inside the directory created when you extracted the tar file (sMap-linux-x64). You can either always specify the full path to this file when you want to run these programs, or you can use the following suggestion to make them easier to invoke.

Modify your executable path

The simplest approach to solving this problem is to inform your shell (bash) to look in the sMap-linux-x64 whenever it is searching for a program that you've asked it to run. There is an environmental variable called PATH that tells bash where to look for executables. We want to add sMap-linux-x64 while not forgetting about all the other directories currently specified in PATH. Add this line to the bottom of your .bash_profile file. (Note: files beginning with a period are hidden by default; run ls -la to see all files, including these normally-hidden ones.)

export PATH="$HOME/sMap-linux-x64:$PATH"

It is important to not insert any spaces except for the one separating export and PATH.

The :$PATH at the end is what includes everything already stored in the PATH variable. You do not need to modify your .bashrc file if you use this solution because the PATH environmental variable will be passed along when creating new shells (e.g. with srun).

Minimal Example

Before going any further, I will warn you that if you ever get a System.IO.DirectoryNotFoundException error, then the problem is almost certainly that you did not create the output directory into which you've told sMap to save output files.

When learning new software, it is useful to create the simplest possible example before trying something complex. This will allow us to understand what is saved in the output files.

Create a directory named smlab in your home directory, and a directory minimal inside that (the -p tells mkdir to create intermediate directories if needed):

cd 
mkdir -p smlab/minimal
cd smlab/minimal

Now create a file named test.tre that contains the following tree description:

((A:1,B:1):3,(C:2,D:2):2);

Create a file named data.txt containing the following imaginary data for one discrete trait that has states X and Y (the first line says there are 4 taxa and 1 character):

4    1					
A    X
B    X 
C    Y
D    Y

Now create a model file named erml.nex that specifies an ER (equal rates) model and uses maximum likelihood (ML) to estimate parameters:

#NEXUS

Begin Rates;

    Character: 0;

    Rates:
        X > Y: ML
        Y > X: Equal(X > Y)
        ;

End;

Create a directory in which the output will be saved:

cd ~/smlab/minimal
mkdir output

Now run sMap, telling it to create 2 mappings (I've found it always crashes if you try to do just 1 mapping):

sMap -s 12345 -t test.tre -d data.txt -o output/minimal -i erml.nex -n 2 

The -s 12345 sets the random number seed so that you will get the same answer as me.

Now zip up the output directory:

zip output.zip output/*

Finally, move output.zip back to your laptop (it will be easier to see the PDF files if it is on your local machine).

Tour of the output files

The minimal.modelstat.txt file lists AIC, BIC, and related numbers in case you want to compare the ER model to other models for these data.

The minimal.smap.bin file is not of direct interest because it is not a text file and can only be read by sMap itself and other programs in the sMap suite.

The minimal.paramNames.txt file just lists the parameters that were estimated. There is only one, the rate that X changes to Y (since this is an ER model, the Y to X rate is the same).

The minimal.params.txt and minimal.mean.params.txt files show the maximum likelihood estimates of the model parameters. The root frequencies were fixed at 0.5, so the only estimated parameter was the rate at which X changes to Y (or vice versa), which was estimated to be about 0.134.

  • If you doubled the lengths of all edges in the tree, can you predict what the estimated rate would be? answer

Take a look at both minimal.smap.pdf (in your favorite PDF viewer) and minimal.smap.tre (in a text editor such as Notepad++ or BBEdit).

  • Why are there two lines in minimal.smap.tre? answer
  • Comparing these two files, can you determine the exact extent of the sojourn in state X on the top half of the lineage leading to the tip node C? answer

To confirm your answer to that last question, run the interesting Nodeinfo program, specifying the minimal.smap.bin file and node 5:

NodeInfo -s minimal.smap.bin -n 5

Use the left and right arrow buttons to move the pointer along the edge. This will allow you to determine where the red colored section begins and ends (approximately). You can also use the up/down arrows to choose other nodes to view. Press the Q key to exit the NodeInfo program.

Now compare minimal.likelihood.mean.pdf with minimal.loglikelihoods.log. The latter file lists the log likelihoods for each state at each node. Note that there are 7 nodes in the tree (easy to see in the pdf file) and there are 7 X,Y pairs in the log file.

  • Why does node 0 have -Infinity for state X and 0 for state Y? answer
  • Which node numbers correspond to taxa A and B? answer
  • Which node numbers correspond to taxa C and D? answer
  • Note that the likelihood of X is about equal to the likelihood of Y at the root node. What is the fraction of X using only likelihoods? answer
  • Why are the two lines identical? answer

The files minimal.branchprobs.pdf and minimal.branchprobs.tree.pdf graphically illustrate the numbers in the minimal.branchprobs.txt file. These numbers represent the proportion of mapping attempts that succeed, so 0.8 means that 20% of the time a mapping failed to end in the state that was chosen for the end of the edge.

  • How can this information be useful? answer

The files minimal.priors.log goes with minimal.prior.mean.pdf, and the files minimal.posteriors.log goes with minimal.posterior.mean.pdf. The log files show what state was observed or selected at each node in the tree using the prior and posterior, respectively. (I must admit that I am not sure I understand why the prior is used to select states at internal nodes. This does not seem necessary for stochastic mapping.)

Finally, the file minimal.all.mean.pdf graphically summarizes the likelihoods, priors, and posteriors at each node. This is an interesting way to graphically depict all components of Bayes' rule simultaneously!

Download the data for flower type and self incompatibility in Pontederiaceae

Create a directory for this lab and navigate into it:

cd ~/smlab
mkdir pont
cd pont

Download the Pontederiaceae data for the tutorial using curl:

curl -L https://github.com/arklumpus/sMap/blob/master/Tutorials/Tutorial8/Tutorial8.zip?raw=true > Tutorial8.zip
unzip Tutorial8.zip
rm Tutorial8.zip

Two traits

In this tutorial, we will be exploring the evolutionary correlation between flower type and self-(in)compatibility in the plant family Pontederiaceae. This lab is basically a shortened version of similar tutorials in the sMap manual. If it turns out that you want to use sMap in your own research, I recommend that you take the time to go through the very thorough tutorials in the sMap manual. The manual is exemplary in its combination of clear overviews of models and methods and detailed and extensive tutorials. I have shortened the original tutorials because they require more computing effort than we have time for.

The Pontederiaceae is a family of aquatic flowering plants (monocots) that attracted the attention of Charles Darwin because of the presence of a form of heterostyly known as tristyly in some species. In tristylous species, individual plants produce one of 3 types of flowers:

  • style short and stamen filaments long and intermediate
  • style intermediate and stamen filaments short and long
  • style long and stamen filaments short and intermediate

Each of these morphs thus deposits pollen on a pollinator in locations that specifically physically preclude fertilization in flowers of the same morph, reducing the amount of self-fertilization. In some tristylous species, outcrossing is further promoted by self-incompatibility, which means that no fertilization would occur even if pollen were accidentally transferred from one flower to another on the same plant. We might ask whether self-incompatibility and tristyly are evolutionarily correlated, and, if so, does self-incompatibility tend to evolve in already-tristylous species, or vice versa?

Mating system trait

The data for the mating system trait is in the file Pontederiaceae_flower.txt. The three states are

  • M: monomorphic (flowers are all the same)
  • E: enantiostylous (flowers are either "left-handed" or "right-handed" with respect to style position)
  • T: tristylous (flowers are one of the three types described above)

We will largely ignore the E state, concentrating on the changes between the M and T flower type. Photos of the three flower morphs (short, mediium, and long styles) composing the T (tristyly) state in Pontederia subovata can be seen in Figure 1 of the paper by Puentes et al. (2013).

Self-compatibility trait

The data for the self-compatibility trait is in the file Pontederiaceae_self.txt. The two states are

  • C: self-compatible
  • I: self-incompatible

Note that data for both traits in all 24 taxa are in the file Pontederiaceae.txt.

Sequence data

Nucleotide sequence data from the rbcL and ndhF genes were used for a Bayesian MCMC analysis, resulting in 1000 sampled trees from the posterior (file Pontederiaceae.treedist) as well as a 50% majority-rule consensus tree (Pontederiaceae.tre).

ML analyses

Let's begin by simply estimating the rate parameters of an all-rates-different (ARD) model (and for simpler models such as SYM and ER) for each of the two traits.

ML analysis of the flower morphology trait

Create 3 subdirectories inside your smlab directory in which to place output files:

cd ~/smlab/pont
mkdir flower-ard
mkdir flower-sym
mkdir flower-er

Now run sMap as follows:

sMap -t Pontederiaceae.tre -d Pontederiaceae_flower.txt -o flower-ard/mlard -n 1

The -t tells sMap to use this tree file, the -d informs it of the data file to use, -o specifies the prefix to use for output files, and -n tells it to simulate 1 mapping (we're using sMap to obtain maximum likelihood estimates at this point, not really to do stochastic mapping, but it requires us to do at least one mapping). After the rather interesting display shown while optimizing the likelihood, you will see a table of maximum likelihood estimates of rates.

  • What two states never seem to transition to each other? answer
  • This is an ARD model, but do you think a SYM model would suffice? Why? answer

Make a copy of the model choice statistics (AIC, AICc, BIC) and perform the same analysis with an SYM model:

 sMap -t Pontederiaceae.tre -d Pontederiaceae_flower.txt -o flower-sym/mlsym -i flower_models/Pontederiaceae_flower.model.ML.SYM.nex -n 1

Note that, this time, we had to use the -i command line option to include a model file specifying the details of the SYM model. We did not have to do this for the ARD model because the ARD model is the default model.

  • Was your expectation correct? Is SYM the better model? answer
  • Take a look at the model file: how many rate parameters are estimated using maximum likelihood in this model? answer
  • One pathology associated with overparameterized models is that MLEs of parameters end up at extremes, such as 0 or infinity. Are any parameters equal to 0 or infinity in the ARD and SYM models? answer
  • Is going to an even simpler ER model justified? note

ML analysis of the self-incompatibility trait

Perform the same analysis on the self-incompatibility trait. Create 2 subdirectories inside your smlab directory in which to place output files:

cd ~/smlab
mkdir self-ard
mkdir self-er

Be sure to use the ML versions of the ER model file for the self-compatibility trait, and note that this model file is not in the flower_models directory (can you tell that I spent some little amount of time looking for it?).

  • Why is there no model file provided for the SYM model in the case of the self-incompatibility trait? answer
  • Is ER better than ARD for this trait? answer

Performing a D test

Let's finish by generating 1000 stochastic mappings for each trait, calculating the D statistic to measure the degree of association between the two traits, and calculate posterior predictive P-values to test whether D is significant.

Generate stochastic mappings for the flower type trait

Create a directory called flower inside your smlab directory:

cd ~/smlab/pont
mkdir flower

Now run sMap as follows:

sMap -t Pontederiaceae.treedist -T Pontederiaceae.tre -d Pontederiaceae_flower.txt -o flower/sym -n 1000 -i flower_models/Pontederiaceae_flower.model.Bayes.SYM.nex --pp 100

Here is a summary of the command line arguments:

  • -t specifies the trees to be used (these are from analyses of the nucleotide data)
  • -T specifies the tree to use for displaying results
  • -d specifies the morphological data for the flower trait only
  • -o specifies the output file prefix
  • -n specifies the number of mappings to perform
  • -i specifies the model to be used
  • --pp specifies the number of posterior predictive samples to be drawn for each of the 1000 parameter values that were estimated (one from each tree topology sampled)

We've told the program sMap to generate 100 posterior predictive mappings for every one of the 1000 stochastic mappings that are conditioned on the actual data: that's 100*1000, or 100,000 mappings! Note that the posterior predictive mappings are not saved anywhere that you can view them; they are stored in the file flower/sym.smap.bin, which cannot be viewed in a text editor. You can and should look at the file flower/sym.smap.pdf, however, as that shows a summary of the 1000 mappings that were conditioned on the flower type trait data.

  • What is the difference between a posterior-predictive mapping and a standard mapping? answer

We've switched to doing a Bayesian MCMC analysis now rather than using ML. That means we need priors for rates and root frequencies. These are specified in the model file flower_models/Pontederiaceae_flower.model.Bayes.SYM.nex. Take a look at that file now:

cat flower_models/Pontederiaceae_flower.model.Bayes.SYM.nex
  • What prior distribution was used for each of the estimated rate parameters? answer

This program takes a very practical approach to incorporating uncertainty about the tree and edge lengths. It uses a sample of 1000 trees from the posterior distribution produced by a previous Bayesian analysis (carried out using a separate program BEAST). This saves having to include nucleotide substitution models and the whole Bayesian phylogeny machinery into sMap. As the sMap MCMC runs, it simply randomly chooses a tree each iteration from the 1000 trees in the file Pontederiaceae.treedist. You can see these tree choices if you look at the T column in either one of the log files (flower/sym.run1.log or flower/sym.run2.log).

  • Is that a statistically sound way to incorporate uncertainty in tree topology and edge lengths? answer

Stochastic mapping for the self-incompatibility trait

Create a directory called self inside your smaplab directory:

cd ~/smlab/pont
mkdir self

Now run sMap as follows:

sMap -t Pontederiaceae.treedist -T Pontederiaceae.tre -d Pontederiaceae_self.txt -o self/ard -n 1000 -i Pontederiaceae_self.model.Bayes.ARD.nex --pp 100

Merge the two traits

Merge-sMap -o merged.smap.bin -n 2000 -s "self/ard.smap.bin;0" -s "flower/sym.smap.bin;0"

Perform the D test

Stat-sMap -s merged.smap.bin -t 0 1 dtest

Literature Cited

A Puentes, WW Cole, and SCH Barrett. 2013. Trimorphic incompatibility in Pontederia subovata (Pontederiaceae): an aquatic macrophyte from lowland South America. International Journal of Plant Sciences 174(1): 47-56. PDF