Phylogenetics: SMap lab

From EEBedia
Revision as of 01:04, 8 April 2020 by Paul Lewis (Talk | contribs) (Making sMap easier to invoke)

Jump to: navigation, search
Adiantum.png EEB 349: Phylogenetics
In this lab you will learn how to use the program sMap, written by Giorgio Bianchini and Patricia Sánchez-Baracaldo. sMap uses stochastic character mapping to assess correlation among discrete characters.

Download sMap

I will assume that you will be running sMap on the Xanadu cluster, but you may wish to run it on your Windows or Mac laptop instead to avoid the tedium of transferring PDF files from Xanadu back to your laptop for viewing. If you want to just run sMap on your laptop, visit the sMap GitHub page and follow the directions there to download the version for your platform (as well as the zip file for Tutorial8).

Login to Xanadu and request a machine as usual:

srun --pty -p mcbstudent --qos=mcbstudent bash

Use curl to download the tar-gzipped file sMap-linux-x64.tar.gz (note the L in -LO is important in this case because the URL involves a redirect that should be followed to get to the actual file):

curl -LO

Unpack the tar file:

tar zxvf sMap-linux-x64.tar.gz

You can now delete the tar.gz file if you want:

rm sMap-linux-x64.tar.gz

Making sMap easier to invoke

The executable file sMap is buried inside the directory created when you extracted the tar file. You can either always specify the full path to this file when you want to run it, or you can use a couple of different methods to make invoking it easier. One of the simplest methods is to create an alias:

alias sMap="$HOME/sMap-linux-x64/sMap"

Now typing sMap at the console, whereever you are, will invoke the program (try it!). You can see any aliases you've defined by typing alias at the command prompt.

Important: Note that there is a unix command named smap that your alias would replace if it were named smap. It is highly unlikely that you will ever use the smap command, but it is probably wise to not create aliases that replace existing functionality with something entirely different. So, I've named my alias sMap to avoid confusion (and that may be why the software itself has that capital M in the name).

If you want your alias to be present whenever you login to Xanadu, you need to add it to your .bash_profile and .bashrc files. Files beginning with a dot are hidden by default, so you will not see them when you type ls in your home directory. You can see them if you tell the ls command that you want to see everything:

ls -la

Edit the .bash_profile file (which may not even exist yet) and add (on a line by itself at the very bottom of the file) the same alias command that I specified above. This will be loaded the next time you log into Xanadu. You will also need to add your alias definition to the .bashrc file if you want your alias to be available after you invoke srun to jump to an unused node. The .bash_profile file is loaded when you first login, but it is .bashrc that is used when starting new bash shells after logging in.

Minimal Example

It is useful to analyze the simplest possible example before trying something complex. This will allow us to understand what is saved in the output files.

Create a directory named smlab in your home directory, and a directory minimal inside that (the -p tells mkdir to create intermediate directories if needed):

mkdir -p smlab/minimal
cd smlab/minimal

Now create a file named test.tre that contains the following tree description:


Create a file named data.txt containing the following imaginary data for one trait (the first line says there are 4 taxa and 1 character):

4	1					
B	X 
D    Y

Now create a model file named erml.nex that specifies an ER (equal rates) model and uses maximum likelihood (ML) to estimate parameters:


Begin Rates;

    Character: 0;

        X > Y: ML
        Y > X: Equal(X > Y)


Create a directory in which the output will be saved:

cd ~/smlab/minimal
mkdir output

Now run sMap, telling it to create 2 mappings (I've found it always crashes if you try to do just 1 mapping):

sMap -s 12345 -t test.tre -d data.txt -o output/minimal -i erml.nex -n 2 

The -s 12345 sets the random number seed so that you will get the same answer as me.

Now zip up the output directory:

zip output/*

Finally, move back to your laptop (it will be easier to see the PDF files if it is on your local machine).

Tour of the output files

Take a look at minimal.smap.pdf (in your favorite PDF viewer) and minimal.smap.tre (in a text editor such as Notepad++ or BBEdit).

  • Comparing these two files, can you determine the exact extent of the sojourn in state X on the top half of the lineage leading to the tip node C? answer

Downloading the data for the tutorial

Create a directory for this lab and navigate into it:

cd # to ensure that you are in your home directory
mkdir smaplab
cd smlab

Download the Pontederiaceae data for the tutorial using curl:

curl -L >

Two traits

We will be interested in exploring the evolutionary correlation between mating system and self-(in)compatibility in the plant family Pontederiaceae. This lab is basically a shortened version of similar tutorials in the sMap manual. If it turns out that you want to use sMap in your own research, I recommend that you take the time to go through the tutorials in the sMap manual, which is exemplary in its combination of clear overviews of models and methods and detailed and extensive tutorials.

The Pontederiaceae is a family of aquatic flowering plants (monocots) that attracted the attention of Charles Darwin because of the presence of a form of heterostyly known as tristyly in some species. In tristylous species, individual plants produce one of 3 types of flowers:

  • style short and stamen filaments long and intermediate
  • style intermediate and stamen filaments short and long
  • style long and stamen filaments short and intermediate

Each of these morphs thus deposits pollen on a pollinator in locations that specifically physically preclude fertilization in flowers of the same morph, reducing the amount of self-fertilization at the expense of wasting some pollen. In some tristylous species, outcrossing is further promoted by self-incompatibility, which means that even if pollen were accidentally transferred from one flower to another on the same plant, no fertilization would occur. We might ask whether self-incompatibility and tristyly are evolutionarily correlated, and, if so, does self-incompatibility tend to evolve in already-tristylous species, or vice versa?

Mating system trait

The data for the mating system trait is in the file Pontederiaceae_flower.txt. The three states are

  • M: monomorphic (flowers are all the same)
  • E: enantiostylous (flowers are either "left-handed" or "right-handed" with respect to style position)
  • T: tristylous (flowers are one of the three types described above)

Photos of the three flower morphs in Pontederia subovata can be seen in Figure 1 of the paper by Puentes et al. (2013).

Self-compatibility trait

The data for the self-compatibility trait is in the file Pontederiaceae_self.txt. The two states are

  • C: self-compatible
  • I: self-incompatible

Note that data for both traits in all 24 taxa are in the file Pontederiaceae.txt.

Sequence data

Nucleotide sequence data from the rbcL and ndhF genes were used for a Bayesian MCMC analysis, resulting in 1000 sampled trees from the posterior (file Pontederiaceae.treedist) as well as a 50% majority-rule consensus tree (Pontederiaceae.tre).

ML analyses

Let's begin by simply estimating the rate parameters of an all-rates-different (ARD) model (and for simpler models such as SYM and ER) for each of the two traits.

ML analysis of the flower morphology trait

Create 3 subdirectories inside your smlab directory in which to place output files:

cd ~/smlab
mkdir flower-ard
mkdir flower-sym
mkdir flower-er

Now run sMap as follows:

sMap -t Pontederiaceae.tre -d Pontederiaceae_flower.txt -o flower-ard/mlard -n 1

The -t tells sMap to use this tree file, the -d informs it of the data file to use, -o specifies the prefix to use for output files, and -n tells it to simulate 1 mapping (we're using sMap to obtain maximum likelihood estimates at this point, not really to do stochastic mapping, but it requires us to do at least one mapping). After the rather interesting display shown while optimizing the likelihood, you will see a table of maximum likelihood estimates of rates.

  • What two states never seem to transition to each other? answer
  • This is an ARD model, but do you think a SYM model would suffice? Why? answer

Make a copy of the model choice statistics (AIC, AICc, BIC) and perform the same analysis with an SYM model:

 sMap -t Pontederiaceae.tre -d Pontederiaceae_flower.txt -o flower-sym/mlsym -i flower_models/Pontederiaceae_flower.model.ML.SYM.nex -n 1
  • Was your expectation correct? Is SYM the better model? answer
  • One pathology associated with overparameterized models is that MLEs of parameters end up at extremes, such as 0 or infinity. Are any parameters equal to 0 or infinity in the ARD and SYM models? answer
  • Is going to an even simpler ER model justified? note

ML analysis of the self-incompatibility trait

Perform the same analysis on the self-incompatibility trait. Create 2 subdirectories inside your smlab directory in which to place output files:

cd ~/smlab
mkdir self-ard
mkdir self-er
  • Why is there no model file provided for the SYM model in the case of the self-incompatibility trait? answer
  • Is ER better than ARD for this trait? answer

Stochastic mapping for the self-incompatibility trait

Create a directory called self inside your smaplab directory:

cd ~/smaplab
mkdir self

Now run sMap as follows:

sMap -t Pontederiaceae.treedist -T Pontederiaceae.tre -d Pontederiaceae_self.txt -o self/ard -n 1000 -i Pontederiaceae_self.model.Bayes.ARD.nex --pp 100

Here is a summary of the command line arguments:

  • -t specifies the trees to be used (these are from analyses of the nucleotide data)
  • -T specifies the tree to use for displaying results
  • -d specifies the morphological data
  • -o specifies the output file prefix
  • -n specifies the number of mappings to perform
  • -i specifies the model to be used
  • --pp specifies the number of posterior predictive samples to be drawn for each of the 1000 parameter values that were estimated (one from each tree topology)

Literature Cited

A Puentes, WW Cole, and SCH Barrett. 2013. Trimorphic Incompatibility in Pontederia subovata (Pontederiaceae): An Aquatic Macrophyte from Lowland South America. International Journal of Plant Sciences 174(1): 47-56. PDF