Phylogenetics: SMap lab
|EEB 5349: Phylogenetics|
|In this lab you will learn how to use the program sMap, written by Giorgio Bianchini and Patricia Sánchez-Baracaldo. sMap uses stochastic character mapping to assess correlation among discrete characters.|
- 1 Download sMap
- 2 Making sMap easier to invoke
- 3 Minimal Example
- 4 Download the data for flower type and self incompatibility in Pontederiaceae
- 5 Two traits
- 6 ML analyses
- 7 Performing a D test
- 8 Literature Cited
I will assume that you will be running sMap on the Xanadu cluster, but you may wish to run it on your Windows or Mac laptop instead to avoid the tedium of transferring PDF files from Xanadu back to your laptop for viewing. If you want to just run sMap on your laptop, visit the sMap GitHub page and follow the directions there to download the version for your platform (as well as the zip file for Tutorial8).
Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash
Use curl to download the tar-gzipped file sMap-linux-x64.tar.gz (note the L in -LO is important in this case because the URL involves a redirect that should be followed to get to the actual file):
Unpack the tar file:
tar zxvf sMap-linux-x64.tar.gz
You can now delete the tar.gz file if you want:
Making sMap easier to invoke
The executable files sMap, Blend-sMap, Merge-sMap, and Stat-sMap are buried inside the directory created when you extracted the tar file (sMap-linux-x64). You can either always specify the full path to this file when you want to run these programs, or you can use the following suggestion to make them easier to invoke.
Modify your executable path
The simplest approach to solving this problem is to inform your shell (bash) to look in the sMap-linux-x64 whenever it is searching for a program that you've asked it to run. There is an environmental variable called PATH that tells bash where to look for executables. We want to add sMap-linux-x64 while not forgetting about all the other directories currently specified in PATH. Add this line to the bottom of your .bash_profile file. (Note: files beginning with a period are hidden by default; run ls -la to see all files, including these normally-hidden ones.)
It is important to not insert any spaces except for the one separating export and PATH.
The :$PATH at the end is what includes everything already stored in the PATH variable. You do not need to modify your .bashrc file if you use this solution because the PATH environmental variable will be passed along when creating new shells (e.g. with srun).
Before going any further, I will warn you that if you ever get a System.IO.DirectoryNotFoundException error, then the problem is almost certainly that you did not create the output directory into which you've told sMap to save output files.
When learning new software, it is useful to create the simplest possible example before trying something complex. This will allow us to understand what is saved in the output files.
Create a directory named smlab in your home directory, and a directory minimal inside that (the -p tells mkdir to create intermediate directories if needed):
cd mkdir -p smlab/minimal cd smlab/minimal
Now create a file named test.tre that contains the following tree description:
Create a file named data.txt containing the following imaginary data for one discrete trait that has states X and Y (the first line says there are 4 taxa and 1 character):
4 1 A X B X C Y D Y
Now create a model file named erml.nex that specifies an ER (equal rates) model and uses maximum likelihood (ML) to estimate parameters:
#NEXUS Begin Rates; Character: 0; Rates: X > Y: ML Y > X: Equal(X > Y) ; End;
Create a directory in which the output will be saved:
cd ~/smlab/minimal mkdir output
Now run sMap, telling it to create 2 mappings (I've found it always crashes if you try to do just 1 mapping):
sMap -s 12345 -t test.tre -d data.txt -o output/minimal -i erml.nex -n 2
The -s 12345 sets the random number seed so that you will get the same answer as me.
Now zip up the output directory:
zip output.zip output/*
Finally, move output.zip back to your laptop (it will be easier to see the PDF files if it is on your local machine).
Tour of the output files
The minimal.modelstat.txt file lists AIC, BIC, and related numbers in case you want to compare the ER model to other models for these data.
The minimal.smap.bin file is not of direct interest because it is not a text file and can only be read by sMap itself and other programs in the sMap suite.
The minimal.paramNames.txt file just lists the parameters that were estimated. There is only one, the rate that X changes to Y (since this is an ER model, the Y to X rate is the same).
The minimal.params.txt and minimal.mean.params.txt files show the maximum likelihood estimates of the model parameters. The root frequencies were fixed at 0.5, so the only estimated parameter was the rate at which X changes to Y (or vice versa), which was estimated to be about 0.134.
- If you doubled the lengths of all edges in the tree, can you predict what the estimated rate would be?
Take a look at both minimal.smap.pdf (in your favorite PDF viewer) and minimal.smap.tre (in a text editor such as Notepad++ or BBEdit).
- Why are there two lines in minimal.smap.tre?
- Comparing these two files, can you determine the exact extent of the sojourn in state X on the top half of the lineage leading to the tip node C?
To confirm your answer to that last question, run the interesting Nodeinfo program, specifying the minimal.smap.bin file and node 5:
NodeInfo -s minimal.smap.bin -n 5
Use the left and right arrow buttons to move the pointer along the edge. This will allow you to determine where the red colored section begins and ends (approximately). You can also use the up/down arrows to choose other nodes to view. Press the Q key to exit the NodeInfo program.
Now compare minimal.likelihood.mean.pdf with minimal.loglikelihoods.log. The latter file lists the log likelihoods for each state at each node. Note that there are 7 nodes in the tree (easy to see in the pdf file) and there are 7 X,Y pairs in the log file.
- Why does node 0 have -Infinity for state X and 0 for state Y?
- Which node numbers correspond to taxa A and B?
- Which node numbers correspond to taxa C and D?
- Note that the likelihood of X is about equal to the likelihood of Y at the root node. What is the fraction of X using only likelihoods?
- Why are the two lines identical?
The files minimal.branchprobs.pdf and minimal.branchprobs.tree.pdf graphically illustrate the numbers in the minimal.branchprobs.txt file. These numbers represent the proportion of mapping attempts that succeed, so 0.8 means that 20% of the time a mapping failed to end in the state that was chosen for the end of the edge.
- How can this information be useful?
The files minimal.priors.log goes with minimal.prior.mean.pdf, and the files minimal.posteriors.log goes with minimal.posterior.mean.pdf. The log files show what state was observed or selected at each node in the tree using the prior and posterior, respectively. (I must admit that I am not sure I understand why the prior is used to select states at internal nodes. This does not seem necessary for stochastic mapping.)
Finally, the file minimal.all.mean.pdf graphically summarizes the likelihoods, priors, and posteriors at each node. This is an interesting way to graphically depict all components of Bayes' rule simultaneously!
Download the data for flower type and self incompatibility in Pontederiaceae
Create a directory for this lab and navigate into it:
cd ~/smlab mkdir pont cd pont
Download the Pontederiaceae data for the tutorial using curl:
curl -L https://github.com/arklumpus/sMap/blob/master/Tutorials/Tutorial8/Tutorial8.zip?raw=true > Tutorial8.zip unzip Tutorial8.zip rm Tutorial8.zip
In this tutorial, we will be exploring the evolutionary correlation between flower type and self-(in)compatibility in the plant family Pontederiaceae. This lab is basically a shortened version of similar tutorials in the sMap manual. If it turns out that you want to use sMap in your own research, I recommend that you take the time to go through the very thorough tutorials in the sMap manual. The manual is exemplary in its combination of clear overviews of models and methods and detailed and extensive tutorials. I have shortened the original tutorials because they require more computing effort than we have time for.
The Pontederiaceae is a family of aquatic flowering plants (monocots) that attracted the attention of Charles Darwin because of the presence of a form of heterostyly known as tristyly in some species. In tristylous species, individual plants produce one of 3 types of flowers:
- style short and stamen filaments long and intermediate
- style intermediate and stamen filaments short and long
- style long and stamen filaments short and intermediate
Each of these morphs thus deposits pollen on a pollinator in locations that specifically physically preclude fertilization in flowers of the same morph, reducing the amount of self-fertilization. In some tristylous species, outcrossing is further promoted by self-incompatibility, which means that no fertilization would occur even if pollen were accidentally transferred from one flower to another on the same plant. We might ask whether self-incompatibility and tristyly are evolutionarily correlated, and, if so, does self-incompatibility tend to evolve in already-tristylous species, or vice versa?
Mating system trait
The data for the mating system trait is in the file Pontederiaceae_flower.txt. The three states are
- M: monomorphic (flowers are all the same)
- E: enantiostylous (flowers are either "left-handed" or "right-handed" with respect to style position)
- T: tristylous (flowers are one of the three types described above)
We will largely ignore the E state, concentrating on the changes between the M and T flower type. Photos of the three flower morphs (short, mediium, and long styles) composing the T (tristyly) state in Pontederia subovata can be seen in Figure 1 of the paper by Puentes et al. (2013).
The data for the self-compatibility trait is in the file Pontederiaceae_self.txt. The two states are
- C: self-compatible
- I: self-incompatible
Note that data for both traits in all 24 taxa are in the file Pontederiaceae.txt.
Nucleotide sequence data from the rbcL and ndhF genes were used for a Bayesian MCMC analysis, resulting in 1000 sampled trees from the posterior (file Pontederiaceae.treedist) as well as a 50% majority-rule consensus tree (Pontederiaceae.tre).
Let's begin by simply estimating the rate parameters of an all-rates-different (ARD) model (and for simpler models such as SYM and ER) for each of the two traits.
ML analysis of the flower morphology trait
Create 3 subdirectories inside your smlab directory in which to place output files:
cd ~/smlab/pont mkdir flower-ard mkdir flower-sym mkdir flower-er
Now run sMap as follows:
sMap -t Pontederiaceae.tre -d Pontederiaceae_flower.txt -o flower-ard/mlard -n 1
The -t tells sMap to use this tree file, the -d informs it of the data file to use, -o specifies the prefix to use for output files, and -n tells it to simulate 1 mapping (we're using sMap to obtain maximum likelihood estimates at this point, not really to do stochastic mapping, but it requires us to do at least one mapping). After the rather interesting display shown while optimizing the likelihood, you will see a table of maximum likelihood estimates of rates.
- What two states never seem to transition to each other?
- This is an ARD model, but do you think a SYM model would suffice? Why?
Make a copy of the model choice statistics (AIC, AICc, BIC) and perform the same analysis with an SYM model:
sMap -t Pontederiaceae.tre -d Pontederiaceae_flower.txt -o flower-sym/mlsym -i flower_models/Pontederiaceae_flower.model.ML.SYM.nex -n 1
Note that, this time, we had to use the -i command line option to include a model file specifying the details of the SYM model. We did not have to do this for the ARD model because the ARD model is the default model.
- Was your expectation correct? Is SYM the better model?
- Take a look at the model file: how many rate parameters are estimated using maximum likelihood in this model?
- One pathology associated with overparameterized models is that MLEs of parameters end up at extremes, such as 0 or infinity. Are any parameters equal to 0 or infinity in the ARD and SYM models?
- Is going to an even simpler ER model justified?
ML analysis of the self-incompatibility trait
Perform the same analysis on the self-incompatibility trait. Create 2 subdirectories inside your smlab directory in which to place output files:
cd ~/smlab mkdir self-ard mkdir self-er
Be sure to use the ML versions of the ER model file for the self-compatibility trait, and note that this model file is not in the flower_models directory (can you tell that I spent some little amount of time looking for it?).
- Why is there no model file provided for the SYM model in the case of the self-incompatibility trait?
- Is ER better than ARD for this trait?
Performing a D test
Let's finish by generating 1000 stochastic mappings for each trait, calculating the D statistic to measure the degree of association between the two traits, and calculate posterior predictive P-values to test whether D is significant.
Generate stochastic mappings for the flower type trait
Create a directory called flower inside your smlab directory:
cd ~/smlab/pont mkdir flower
Now run sMap as follows:
sMap -t Pontederiaceae.treedist -T Pontederiaceae.tre -d Pontederiaceae_flower.txt -o flower/sym -n 1000 -i flower_models/Pontederiaceae_flower.model.Bayes.SYM.nex --pp 100
Here is a summary of the command line arguments:
- -t specifies the trees to be used (these are from analyses of the nucleotide data)
- -T specifies the tree to use for displaying results
- -d specifies the morphological data for the flower trait only
- -o specifies the output file prefix
- -n specifies the number of mappings to perform
- -i specifies the model to be used
- --pp specifies the number of posterior predictive samples to be drawn for each of the 1000 parameter values that were estimated (one from each tree topology sampled)
We've told the program sMap to generate 100 posterior predictive mappings for every one of the 1000 stochastic mappings that are conditioned on the actual data: that's 100*1000, or 100,000 mappings! Note that the posterior predictive mappings are not saved anywhere that you can view them; they are stored in the file flower/sym.smap.bin, which cannot be viewed in a text editor. You can and should look at the file flower/sym.smap.pdf, however, as that shows a summary of the 1000 mappings that were conditioned on the flower type trait data.
- What is the difference between a posterior-predictive mapping and a standard mapping?
We've switched to doing a Bayesian MCMC analysis now rather than using ML. That means we need priors for rates and root frequencies. These are specified in the model file flower_models/Pontederiaceae_flower.model.Bayes.SYM.nex. Take a look at that file now:
- What prior distribution was used for each of the estimated rate parameters?
This program takes a very practical approach to incorporating uncertainty about the tree and edge lengths. It uses a sample of 1000 trees from the posterior distribution produced by a previous Bayesian analysis (carried out using a separate program BEAST). This saves having to include nucleotide substitution models and the whole Bayesian phylogeny machinery into sMap. As the sMap MCMC runs, it simply randomly chooses a tree each iteration from the 1000 trees in the file Pontederiaceae.treedist. You can see these tree choices if you look at the T column in either one of the log files (flower/sym.run1.log or flower/sym.run2.log).
- Is that a statistically sound way to incorporate uncertainty in tree topology and edge lengths?
Generate stochastic mappings for the self-incompatibility trait
Now we need to do the same thing for the self-incompatibility trait.
Create a directory called self inside your smaplab directory:
cd ~/smlab/pont mkdir self
Run sMap as follows:
sMap -t Pontederiaceae.treedist -T Pontederiaceae.tre -d Pontederiaceae_self.txt -o self/ard -n 1000 -i Pontederiaceae_self.model.Bayes.ARD.nex --pp 100
Merge the two traits
We now have two directories full of results: flower and self. The only files we really need are flower/sym.smap.bin and self/ard.smap.bin. These files contain (in a compressed , cryptic, binary format) all the mappings and posterior predictive mappings needed to perform the D test. The next step is to create a single bin file that combines, or merges, these two separate bin files.
Merge-sMap -o merged.smap.bin -n 2000 -s "self/ard.smap.bin;0" -s "flower/sym.smap.bin;0"
Here is a brief description of the arguments to Merge-sMap:
- -o specifies the output binary file
- -n specifies the number of mappings to store in the output file
- -s specifies one of the input binary files
The semicolon followed by 0 in the -s arguments indicate that we are selecting the first (0th) character stored in each of those files. Note that each of these files only contains one character, so specifying anything except 0 would cause an error.
The -n 2000 says to generate and save 2000 joint mappings. Each of these joint mappings will involve choosing one mapping at random from the flower file and one mapping at random from the self file. So, you can actually save more mappings than there are in either input file!
One last thing. Note that I have surrounded the arguments following each -s in double quotes. That's important because if the bash shell sees a semicolon it treats it as the end of the current command. The quotes prevent bash from "seeing" these semicolons so they will be passed along to Merge-sMap instead.
Perform the D test
Now we're finally ready to do the test. Use the Stat-sMap program to read in the merged binary file and perform the test.
Stat-sMap -s merged.smap.bin -t 0 1 dtest
Key to arguments:
- -s is the input binary file
- -t specifies which traits in that input file to use (there are only two traits, 0 and 1)
- dtest is the name we want to use for the output file (.pdf will be added to this prefix)
Take a look at the file dtest.pdf and answer these final questions:
- What was the value of the overall D statistic?
- Was this overall D statistically significant?
- Is the flower state T (tristyly) positively or negatively associated with I (self-incompatibility)?
- Is the flower state M (monomorphic) positively or negatively associated with C (self-compatibility)?
- What fraction of the D-statistic distribution is to the right of the vertical dotted line labeled E[D|X] = 0.1880?
So sMap found that flower type and self-incompatibility are not significantly associated. Thus, we must conclude that the amount of association shown by these two characters is consistent with the degree of correlation that two independently evolving characters would show due simply to the fact that species share some of their history with other species. The study by Huelsenbeck et al. (2013) found that the posterior predictive P-values ranged from 0.81 to 0.952 depending on the priors assumed for the rates in the two characters. These P-values are defined as the cumulative probability to the left of the observed D, whereas the P-value in sMap is defined as the exact opposite (the amount of tail probability to the right of the observed D), so the Huelsenbeck et al. numbers translate to 0.048 to 0.19. Our P-value is in that range.
For points for today's lab, please send Katie your final dtest.pdf file.
JP Huelsenbeck, R Nielsen, and JP Bollback. 2003, Stochastic mapping of morphological characters. Systematic Biology 52(2):131–158. PDF
A Puentes, WW Cole, and SCH Barrett. 2013. Trimorphic incompatibility in Pontederia subovata (Pontederiaceae): an aquatic macrophyte from lowland South America. International Journal of Plant Sciences 174(1): 47-56. PDF