Difference between revisions of "Phylogenetics: SMap lab"

Revision as of 15:23, 10 April 2020

	EEB 5349: Phylogenetics
	In this lab you will learn how to use the program sMap, written by Giorgio Bianchini and Patricia Sánchez-Baracaldo. sMap uses stochastic character mapping to assess correlation among discrete characters.

Download sMap

I will assume that you will be running sMap on the Xanadu cluster, but you may wish to run it on your Windows or Mac laptop instead to avoid the tedium of transferring PDF files from Xanadu back to your laptop for viewing. If you want to just run sMap on your laptop, visit the sMap GitHub page and follow the directions there to download the version for your platform (as well as the zip file for Tutorial8).

Login to Xanadu and request a machine as usual:

srun --pty -p mcbstudent --qos=mcbstudent bash

Use curl to download the tar-gzipped file sMap-linux-x64.tar.gz (note the L in -LO is important in this case because the URL involves a redirect that should be followed to get to the actual file):

curl -LO https://github.com/arklumpus/sMap/releases/download/v1.0.5-2/sMap-linux-x64.tar.gz

Unpack the tar file:

tar zxvf sMap-linux-x64.tar.gz

You can now delete the tar.gz file if you want:

rm sMap-linux-x64.tar.gz

Making sMap easier to invoke

The executable files sMap, Blend-sMap, Merge-sMap, and Stat-sMap are buried inside the directory created when you extracted the tar file (sMap-linux-x64). You can either always specify the full path to this file when you want to run these programs, or you can use the following suggestion to make them easier to invoke.

Modify your executable path

The simplest approach to solving this problem is to inform your shell (bash) to look in the sMap-linux-x64 whenever it is searching for a program that you've asked it to run. There is an environmental variable called PATH that tells bash where to look for executables. We want to add sMap-linux-x64 while not forgetting about all the other directories currently specified in PATH. Add this line to the bottom of your .bash_profile file. (Note: files beginning with a period are hidden by default; run ls -la to see all files, including these normally-hidden ones.)

export PATH="$HOME/sMap-linux-x64:$PATH"

It is important to not insert any spaces except for the one separating export and PATH.

The :$PATH at the end is what includes everything already stored in the PATH variable. You do not need to modify your .bashrc file if you use this solution because the PATH environmental variable will be passed along when creating new shells (e.g. with srun).

Minimal Example

Before going any further, I will warn you that if you ever get a System.IO.DirectoryNotFoundException error, then the problem is almost certainly that you did not create the output directory into which you've told sMap to save output files.

When learning new software, it is useful to create the simplest possible example before trying something complex. This will allow us to understand what is saved in the output files.

Create a directory named smlab in your home directory, and a directory minimal inside that (the -p tells mkdir to create intermediate directories if needed):

cd 
mkdir -p smlab/minimal
cd smlab/minimal

Now create a file named test.tre that contains the following tree description:

((A:1,B:1):3,(C:2,D:2):2);

Create a file named data.txt containing the following imaginary data for one discrete trait that has states X and Y (the first line says there are 4 taxa and 1 character):

Now create a model file named erml.nex that specifies an ER (equal rates) model and uses maximum likelihood (ML) to estimate parameters:

#NEXUS

Begin Rates;

    Character: 0;

    Rates:
        X > Y: ML
        Y > X: Equal(X > Y)
        ;

End;

Create a directory in which the output will be saved:

cd ~/smlab/minimal
mkdir output

Now run sMap, telling it to create 2 mappings (I've found it always crashes if you try to do just 1 mapping):

sMap -s 12345 -t test.tre -d data.txt -o output/minimal -i erml.nex -n 2

The -s 12345 sets the random number seed so that you will get the same answer as me.

Now zip up the output directory:

zip output.zip output/*

Finally, move output.zip back to your laptop (it will be easier to see the PDF files if it is on your local machine).

Tour of the output files

The minimal.modelstat.txt file lists AIC, BIC, and related numbers in case you want to compare the ER model to other models for these data.

The minimal.smap.bin file is not of direct interest because it is not a text file and can only be read by sMap itself and other programs in the sMap suite.

The minimal.paramNames.txt file just lists the parameters that were estimated. There is only one, the rate that X changes to Y (since this is an ER model, the Y to X rate is the same).

The minimal.params.txt and minimal.mean.params.txt files show the maximum likelihood estimates of the model parameters. The root frequencies were fixed at 0.5, so the only estimated parameter was the rate at which X changes to Y (or vice versa), which was estimated to be about 0.134.

If you doubled the lengths of all edges in the tree, can you predict what the estimated rate would be? answer

Take a look at both minimal.smap.pdf (in your favorite PDF viewer) and minimal.smap.tre (in a text editor such as Notepad++ or BBEdit).

Why are there two lines in minimal.smap.tre? answer
Comparing these two files, can you determine the exact extent of the sojourn in state X on the top half of the lineage leading to the tip node C? answer

To confirm your answer to that last question, run the interesting Nodeinfo program, specifying the minimal.smap.bin file and node 5:

NodeInfo -s minimal.smap.bin -n 5

Use the left and right arrow buttons to move the pointer along the edge. This will allow you to determine where the red colored section begins and ends (approximately). You can also use the up/down arrows to choose other nodes to view. Press the Q key to exit the NodeInfo program.

Now compare minimal.likelihood.mean.pdf with minimal.loglikelihoods.log. The latter file lists the log likelihoods for each state at each node. Note that there are 7 nodes in the tree (easy to see in the pdf file) and there are 7 X,Y pairs in the log file.

Why does node 0 have -Infinity for state X and 0 for state Y? answer
Which node numbers correspond to taxa A and B? answer
Which node numbers correspond to taxa C and D? answer
Note that the likelihood of X is about equal to the likelihood of Y at the root node. What is the fraction of X using only likelihoods? answer
Why are the two lines identical? answer

The files minimal.branchprobs.pdf and minimal.branchprobs.tree.pdf graphically illustrate the numbers in the minimal.branchprobs.txt file. These numbers represent the proportion of mapping attempts that succeed, so 0.8 means that 20% of the time a mapping failed to end in the state that was chosen for the end of the edge.

How can this information be useful? answer

The files minimal.priors.log goes with minimal.prior.mean.pdf, and the files minimal.posteriors.log goes with minimal.posterior.mean.pdf. The log files show what state was observed or selected at each node in the tree using the prior and posterior, respectively. (I must admit that I am not sure I understand why the prior is used to select states at internal nodes. This does not seem necessary for stochastic mapping.)

Finally, the file minimal.all.mean.pdf graphically summarizes the likelihoods, priors, and posteriors at each node. This is an interesting way to graphically depict all components of Bayes' rule simultaneously!

Download the data for flower type and self incompatibility in Pontederiaceae

Create a directory for this lab and navigate into it:

cd ~/smlab
mkdir pont
cd pont

Download the Pontederiaceae data for the tutorial using curl:

curl -L https://github.com/arklumpus/sMap/blob/master/Tutorials/Tutorial8/Tutorial8.zip?raw=true > Tutorial8.zip
unzip Tutorial8.zip
rm Tutorial8.zip

Two traits

In this tutorial, we will be exploring the evolutionary correlation between flower type and self-(in)compatibility in the plant family Pontederiaceae. This lab is basically a shortened version of similar tutorials in the sMap manual. If it turns out that you want to use sMap in your own research, I recommend that you take the time to go through the very thorough tutorials in the sMap manual. The manual is exemplary in its combination of clear overviews of models and methods and detailed and extensive tutorials. I have shortened the original tutorials because they require more computing effort than we have time for.

The Pontederiaceae is a family of aquatic flowering plants (monocots) that attracted the attention of Charles Darwin because of the presence of a form of heterostyly known as tristyly in some species. In tristylous species, individual plants produce one of 3 types of flowers:

style short and stamen filaments long and intermediate
style intermediate and stamen filaments short and long
style long and stamen filaments short and intermediate

Each of these morphs thus deposits pollen on a pollinator in locations that specifically physically preclude fertilization in flowers of the same morph, reducing the amount of self-fertilization. In some tristylous species, outcrossing is further promoted by self-incompatibility, which means that no fertilization would occur even if pollen were accidentally transferred from one flower to another on the same plant. We might ask whether self-incompatibility and tristyly are evolutionarily correlated, and, if so, does self-incompatibility tend to evolve in already-tristylous species, or vice versa?

Mating system trait

The data for the mating system trait is in the file Pontederiaceae_flower.txt. The three states are

M: monomorphic (flowers are all the same)
E: enantiostylous (flowers are either "left-handed" or "right-handed" with respect to style position)
T: tristylous (flowers are one of the three types described above)

We will largely ignore the E state, concentrating on the changes between the M and T flower type. Photos of the three flower morphs (short, mediium, and long styles) composing the T (tristyly) state in Pontederia subovata can be seen in Figure 1 of the paper by Puentes et al. (2013).

Self-compatibility trait

The data for the self-compatibility trait is in the file Pontederiaceae_self.txt. The two states are

C: self-compatible
I: self-incompatible

Note that data for both traits in all 24 taxa are in the file Pontederiaceae.txt.

Sequence data

Nucleotide sequence data from the rbcL and ndhF genes were used for a Bayesian MCMC analysis, resulting in 1000 sampled trees from the posterior (file Pontederiaceae.treedist) as well as a 50% majority-rule consensus tree (Pontederiaceae.tre).

ML analyses

Let's begin by simply estimating the rate parameters of an all-rates-different (ARD) model (and for simpler models such as SYM and ER) for each of the two traits.

ML analysis of the flower morphology trait

Create 3 subdirectories inside your smlab directory in which to place output files:

cd ~/smlab/pont
mkdir flower-ard
mkdir flower-sym
mkdir flower-er

Now run sMap as follows:

sMap -t Pontederiaceae.tre -d Pontederiaceae_flower.txt -o flower-ard/mlard -n 1

The -t tells sMap to use this tree file, the -d informs it of the data file to use, -o specifies the prefix to use for output files, and -n tells it to simulate 1 mapping (we're using sMap to obtain maximum likelihood estimates at this point, not really to do stochastic mapping, but it requires us to do at least one mapping). After the rather interesting display shown while optimizing the likelihood, you will see a table of maximum likelihood estimates of rates.

What two states never seem to transition to each other? answer
This is an ARD model, but do you think a SYM model would suffice? Why? answer

Make a copy of the model choice statistics (AIC, AICc, BIC) and perform the same analysis with an SYM model:

 sMap -t Pontederiaceae.tre -d Pontederiaceae_flower.txt -o flower-sym/mlsym -i flower_models/Pontederiaceae_flower.model.ML.SYM.nex -n 1

Note that, this time, we had to use the -i command line option to include a model file specifying the details of the SYM model. We did not have to do this for the ARD model because the ARD model is the default model.

Was your expectation correct? Is SYM the better model? answer
Take a look at the model file: how many rate parameters are estimated using maximum likelihood in this model? answer
One pathology associated with overparameterized models is that MLEs of parameters end up at extremes, such as 0 or infinity. Are any parameters equal to 0 or infinity in the ARD and SYM models? answer
Is going to an even simpler ER model justified? note

ML analysis of the self-incompatibility trait

Perform the same analysis on the self-incompatibility trait. Create 2 subdirectories inside your smlab directory in which to place output files:

cd ~/smlab
mkdir self-ard
mkdir self-er

Be sure to use the ML versions of the ER model file for the self-compatibility trait, and note that this model file is not in the flower_models directory (can you tell that I spent some little amount of time looking for it?).

Why is there no model file provided for the SYM model in the case of the self-incompatibility trait? answer
Is ER better than ARD for this trait? answer

Performing a D test

Let's finish by generating 1000 stochastic mappings for each trait, calculating the D statistic to measure the degree of association between the two traits, and calculate posterior predictive P-values to test whether D is significant.

Generate stochastic mappings for the flower type trait

Create a directory called flower inside your smlab directory:

cd ~/smlab/pont
mkdir flower

Now run sMap as follows:

sMap -t Pontederiaceae.treedist -T Pontederiaceae.tre -d Pontederiaceae_flower.txt -o flower/sym -n 1000 -i flower_models/Pontederiaceae_flower.model.Bayes.SYM.nex --pp 100

Here is a summary of the command line arguments:

-t specifies the trees to be used (these are from analyses of the nucleotide data)
-T specifies the tree to use for displaying results
-d specifies the morphological data for the flower trait only
-o specifies the output file prefix
-n specifies the number of mappings to perform
-i specifies the model to be used
--pp specifies the number of posterior predictive samples to be drawn for each of the 1000 parameter values that were estimated (one from each tree topology sampled)

We've told the program sMap to generate 100 posterior predictive mappings for every one of the 1000 stochastic mappings that are conditioned on the actual data: that's 100*1000, or 100,000 mappings! Note that the posterior predictive mappings are not saved anywhere that you can view them; they are stored in the file flower/sym.smap.bin, which cannot be viewed in a text editor. You can and should look at the file flower/sym.smap.pdf, however, as that shows a summary of the 1000 mappings that were conditioned on the flower type trait data.

What is the difference between a posterior-predictive mapping and a standard mapping? answer

We've switched to doing a Bayesian MCMC analysis now rather than using ML. That means we need priors for rates and root frequencies. These are specified in the model file flower_models/Pontederiaceae_flower.model.Bayes.SYM.nex. Take a look at that file now:

cat flower_models/Pontederiaceae_flower.model.Bayes.SYM.nex

What prior distribution was used for each of the estimated rate parameters? answer

This program takes a very practical approach to incorporating uncertainty about the tree and edge lengths. It uses a sample of 1000 trees from the posterior distribution produced by a previous Bayesian analysis (carried out using a separate program BEAST). This saves having to include nucleotide substitution models and the whole Bayesian phylogeny machinery into sMap. As the sMap MCMC runs, it simply randomly chooses a tree each iteration from the 1000 trees in the file Pontederiaceae.treedist. You can see these tree choices if you look at the T column in either one of the log files (flower/sym.run1.log or flower/sym.run2.log).

Is that a statistically sound way to incorporate uncertainty in tree topology and edge lengths? answer

Generate stochastic mappings for the self-incompatibility trait

Now we need to do the same thing for the self-incompatibility trait.

Create a directory called self inside your smaplab directory:

cd ~/smlab/pont
mkdir self

Run sMap as follows:

sMap -t Pontederiaceae.treedist -T Pontederiaceae.tre -d Pontederiaceae_self.txt -o self/ard -n 1000 -i Pontederiaceae_self.model.Bayes.ARD.nex --pp 100

Merge the two traits

We now have two directories full of results: flower and self. The only files we really need are flower/sym.smap.bin and self/ard.smap.bin. These files contain (in a compressed , cryptic, binary format) all the mappings and posterior predictive mappings needed to perform the D test. The next step is to create a single bin file that combines, or merges, these two separate bin files.

Merge-sMap -o merged.smap.bin -n 2000 -s "self/ard.smap.bin;0" -s "flower/sym.smap.bin;0"

Here is a brief description of the arguments to Merge-sMap:

-o specifies the output binary file
-n specifies the number of mappings to store in the output file
-s specifies one of the input binary files

The semicolon followed by 0 in the -s arguments indicate that we are selecting the first (0th) character stored in each of those files. Note that each of these files only contains one character, so specifying anything except 0 would cause an error.

The -n 2000 says to generate and save 2000 joint mappings. Each of these joint mappings will involve choosing one mapping at random from the flower file and one mapping at random from the self file. So, you can actually save more mappings than there are in either input file!

One last thing. Note that I have surrounded the arguments following each -s in double quotes. That's important because if the bash shell sees a semicolon it treats it as the end of the current command. The quotes prevent bash from "seeing" these semicolons so they will be passed along to Merge-sMap instead.

Perform the D test

Now we're finally ready to do the test. Use the Stat-sMap program to read in the merged binary file and perform the test.

Stat-sMap -s merged.smap.bin -t 0 1 dtest

Key to arguments:

-s is the input binary file
-t specifies which traits in that input file to use (there are only two traits, 0 and 1)
dtest is the name we want to use for the output file (.pdf will be added to this prefix)

Interpreting the D test

Take a look at the file dtest.pdf and answer these final questions:

What was the value of the overall D statistic? answer
Was this overall D statistically significant? answer
Is the flower state T (tristyly) positively or negatively associated with I (self-incompatibility)? answer
Is the flower state M (monomorphic) positively or negatively associated with C (self-compatibility)? answer
What fraction of the D-statistic distribution is to the right of the vertical dotted line labeled E[D|X] = 0.1880? answer

So sMap found that flower type and self-incompatibility are not significantly associated. Thus, we must conclude that the amount of association shown by these two characters is consistent with the degree of correlation that two independently evolving characters would show due simply to the fact that species share some of their history with other species. The study by Huelsenbeck et al. (2013) found that the posterior predictive P-values ranged from 0.81 to 0.952 depending on the priors assumed for the rates in the two characters. These P-values are defined as the cumulative probability to the left of the observed D, whereas the P-value in sMap is defined as the exact opposite (the amount of tail probability to the right of the observed D), so the Huelsenbeck et al. numbers translate to 0.048 to 0.19. Our P-value is in that range.

Plotting the merged stochastic mapping results

You have pdf files showing summaries of the stochastic mapping results for the two traits separately, but what is of most interest is a plot showing how the two traits coevolve. This was done in the process of calculating the D statistic, but there is no pdf file after running Stat-sMap that shows a stochastic mapping summary of both traits together. Fortunately, sMap makes this easy to obtain. Make sure you are still in the directory containing the merged stochastic mapping results, merged.smap.bin, and type the following command:

Plot-sMap -s merged.smap.bin

This will open up a pseudographical interface that you can navigate using arrow keys and select things using the return key. Note that you can change a lot of things about the plots, including the type of plot produced, state colors, font sizes, etc. For now, just navigate to the Plot... button at the bottom and press return. This will generate a plot (merged.smap.pdf) that shows how both characters evolved on the tree (it is a summary of the 2000 mappings that were specified).

Note in this plot how Heteranthera dubia changed from E to M flowers but maintained the ancestral self compatibility. Now open and compare flower/sym.branchprobs.tree.pdf with self/ard.branchprobs.tree.pdf and look specifically at the edge leading to Heteranthera dubia.

Explain why the branch probability on the edge leading to the tip H. dubia is low for the flower trait but high for the self-compatibility trait? answer

Finishing up

To get points for today's lab, please send Katie your final dtest.pdf file and final merged.smap.pdf.

Literature Cited

JP Huelsenbeck, R Nielsen, and JP Bollback. 2003, Stochastic mapping of morphological characters. Systematic Biology 52(2):131–158. PDF

A Puentes, WW Cole, and SCH Barrett. 2013. Trimorphic incompatibility in Pontederia subovata (Pontederiaceae): an aquatic macrophyte from lowland South America. International Journal of Plant Sciences 174(1): 47-56. PDF

@@ Line 331: / Line 331: @@
 == Finishing up ==
-For points for today's lab, please send Katie your final dtest.pdf file and final merged.smap.pdf.
+To get points for today's lab, please send Katie your final dtest.pdf file and final merged.smap.pdf.
 == Literature Cited ==