EEB 5349: Phylogenetics
Goal: to introduce you to the maximum likelihood software IQ-Tree, and to show you how to visualize and annotate trees with the R package ggtree.
Both IQ-Tree (2015) and ggtree (2017) are relative newcomers to their respective arenas, so let's road-test them. If that's not cutting-edge enough for you, we will be using them with the data from a 2018 study that was accepted on February 9th. It's not even officially published yet! The authors of the study used a Bayesian phylogeny as the basis for their study, but did not show results for a maximum likelihood phylogeny. Typically, studies present the results of both analyses to examine how sensitive their results are to the phylogenetic inference method used. To help the authors out, we will infer a maximum likelihood phylogeny for them and see whether it recovers the same topology as their Bayesian phylogeny.
The study by Condamine et al. (2018) focuses on Apollo (Parnassius) butterflies, which live high up in alpine regions of the Holarctic and have long been a favorite genus for collectors due to their beauty and relatively high endemism (collectors like hard-to-find things). Because the genus is so well known, its taxonomy is also well studied and relatively stable. In other words, we know what all the species are. This makes the group a good candidate for the type of comparative phylogenetic study that the authors performed. Being alpine specialists, Apollo butterflies are declining worldwide as they move up mountains, trying to keep pace with rising temperatures. Also of note is the use of mating plugs by male Apollos to control paternity.
The authors were interested in testing two hypotheses about the main drivers of diversification and macroevolution: the Red Queen and Court Jester hypotheses. The Red Queen hypothesis, coined by Leigh Van Valen in 1973, posits that species must constantly adapt through evolutionary time in order to survive in a world of ever-evolving species (it's an allusion to Through the Looking-Glass by Lewis Carroll). The Court Jester hypothesis, coined by Anthony Barnosky in 1999, posits instead that interactions between species and the non-living world around them are the main drivers of diversification and macroevolution.
In order to test these hypotheses, the authors first needed a time-calibrated phylogeny. They used previously published DNA data from Apollo (Parnassius) butterflies, coupled with two fossil specimens, to generate a complete (as in all 85 living species in the genus are included) time-calibrated phylogeny. Knowing the absolute, and not just relative, dates of the nodes, the authors were able to use historical climatic and geological data (Court Jester), as well as inferred ancestral species ranges (Red Queen), to try to tease apart who rules the Parnassius Court: the Red Queen or the Jester?
Get Their Data
The authors deposited the data they used in the digital repository Dryad. Go to Dryad and search for the authors' study. Each study has a digital object identifier (DOI) that uniquely identifies it and can be used to easily search for it. The DOI is typically printed near the beginning of an article as part of a URL. The DOI itself is everything after the ".org/": "https://doi.org/[everything here is the DOI]"
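As a small illustration of the "everything after .org/" rule, here is a minimal Python sketch. The function name `doi_from_url` and the example DOI are made up for illustration; the example is not the DOI of the Condamine et al. study.

```python
from urllib.parse import urlparse


def doi_from_url(url):
    """Return the DOI part of a doi.org URL (everything after the host name)."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in {"doi.org", "dx.doi.org"}:
        raise ValueError(f"not a doi.org URL: {url}")
    # The path starts with '/', which is not part of the DOI itself.
    return parsed.path.lstrip("/")


# Hypothetical DOI, used only to show the shape of the string.
example_url = "https://doi.org/10.1234/example.5678"
print(doi_from_url(example_url))
```

The string it prints is what you would paste into the Dryad search box.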
We need the data they used in their MrBayes inference (Appendix 3), and information about how they partitioned their data (Appendix 9).
Appendix 3 is simply the NEXUS file they passed to MrBayes: it has character state information along with commands to MrBayes about how to partition the data. The character state information in this file is from both DNA and morphological data. When these two sources of information are used to infer a phylogeny it's termed a "total-evidence" approach (which seems a little presumptuous to this author!)
IQ-Tree can handle combined morphological and molecular datasets, thanks in part to work by Paul Lewis. I mean, Dr. Paul Lewis. Sorry, I mean Dr. Paul O. Lewis. Although IQ-Tree can handle the combined dataset of Condamine et al. (2018), we're going to simplify things a bit and infer a phylogeny with the DNA data only. Yes, we are changing two variables (switching the inference framework from Bayesian to maximum likelihood and removing the morphological data), but our analysis will nonetheless serve to test how sensitive the topology of their phylogeny is to methodological changes.
Open the NEXUS file in a text editor. Look through the file to get a feel for it. Notice the familiar data matrix of nucleotides towards the beginning of the file, and the morphological data matrix below it. The morphological data consists of two states for a given character: 0 for absent, and 1 for present. They assessed some, but not all, of the species in their dataset for morphology. Scroll back up to the top of the document.
- Why is there no DNA data for taxa Thaites ruminiana or Doritites bosniackii?
Remove the fossil taxa from the nucleotide matrix, and delete the morphological matrix entirely. Return to the top of the document and note the lines:
DIMENSIONS NTAX=96 NCHAR=4771;
FORMAT DATATYPE = mixed(DNA:1-4535,Standard:4536-4771) interleave=yes GAP = - MISSING = ?;
Our data is no longer "mixed" after taking out the morphological data (i.e., there's no more "Standard" data). Our "DATATYPE" is now simply "DNA". Because it's only DNA, we don't need to specify which characters are DNA. But we still need to specify how many total characters (the number of nucleotide sites in this case) are in the dataset, as well as how many taxa. Make the necessary changes to these two lines of the NEXUS file and save the file as apollo.nex.
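If you would rather script the header change than edit it by hand, the following is a rough Python sketch. The function name `simplify_header` is hypothetical. The values passed in the usage lines are assumptions: 94 taxa assumes the two fossil taxa are the only ones you removed (96 - 2), and 4535 sites follows from the DNA range 1-4535 in the original FORMAT line. Check both against your own edited matrix.

```python
import re


def simplify_header(nexus_text, ntax, nchar):
    """Rewrite the DIMENSIONS and FORMAT lines for a DNA-only matrix."""
    flags = re.IGNORECASE
    text = re.sub(r"NTAX\s*=\s*\d+", f"NTAX={ntax}", nexus_text, flags=flags)
    text = re.sub(r"NCHAR\s*=\s*\d+", f"NCHAR={nchar}", text, flags=flags)
    # Replace the mixed(...) declaration with a plain DNA datatype.
    text = re.sub(r"DATATYPE\s*=\s*mixed\([^)]*\)", "DATATYPE = DNA", text, flags=flags)
    return text


header = (
    "DIMENSIONS NTAX=96 NCHAR=4771;\n"
    "FORMAT DATATYPE = mixed(DNA:1-4535,Standard:4536-4771) "
    "interleave=yes GAP = - MISSING = ?;"
)
# Assumed values: two fossil taxa removed, DNA sites only.
print(simplify_header(header, ntax=94, nchar=4535))
```

The sketch only touches the two header lines; removing the fossil rows and the morphological matrix still has to be done separately.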
Appendix 9 has the results of their PartitionFinder analysis. PartitionFinder is one of many programs that will take a look at your data and attempt to place similarly evolving sites into the same bin, or partition. One can also partition data manually using a priori knowledge about the sites (e.g. put different genes into different partitions, and different codon positions into different partitions).
Open up the "best_scheme.txt" file and note the lines:
Subset | Best Model | Subset Partitions | Subset Sites | Alignment
1 | GTR+I+G | COI_Pos1 | 1-1489\3 | ./analysis/phylofiles/ca30e3ea7b9c739c115638123d4f1dd0.phy
2 | TIM+I+G | COI_Pos2, ND1_Pos2, ND5_Pos2 | 2-1490\3, 1493-1961\3, 1964-2777\3 | ./analysis/phylofiles/c2e9c416b67aadc24570829688eefbda.phy
3 | TIM+G | COI_Pos3 | 3-1491\3 | ./analysis/phylofiles/5d40dcf0c7fe8811737c39985b3a08af.phy
4 | GTR+I+G | 16S, ND1_Pos1, ND5_Pos1 | 1492-1960\3, 1963-2776\3, 2778-3314 | ./analysis/phylofiles/118cc2b9a278fd1acb34d328d95d92f6.phy
5 | TIM+G | ND1_Pos3, ND5_Pos3 | 1494-1962\3, 1965-2775\3 | ./analysis/phylofiles/3c2d866b9dc71dad74d7571f2ca99381.phy
6 | TIM+I+G | EF1_Pos1 | 3315-4533\3 | ./analysis/phylofiles/9980e10b91c1a16683fe75f404068d6f.phy
7 | TrNef+I+G | EF1_Pos2 | 3316-4534\3 | ./analysis/phylofiles/aea6ba92a6c9f49b27ab397b8cb58e28.phy
8 | GTR+I+G | EF1_Pos3 | 3317-4535\3 | ./analysis/phylofiles/759f0f0ff5902ef4d5a8d520f1a0c4a6.phy
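The "Subset Sites" column uses the NEXUS-style range notation in which "\3" means "every third site", which is how codon positions are picked out. To make that concrete, here is a short Python sketch that expands one such range into site numbers. The helper name `expand_sites` is made up for illustration and is not part of PartitionFinder or IQ-Tree.

```python
def expand_sites(spec):
    """Expand a range such as '1-1489\\3' into a list of 1-based site numbers."""
    span, _, step = spec.partition("\\")
    start, _, end = span.partition("-")
    stride = int(step) if step else 1  # no '\N' suffix means every site
    return list(range(int(start), int(end or start) + 1, stride))


coi_pos1 = expand_sites(r"1-1489\3")   # first codon positions of COI
rrna_16s = expand_sites("2778-3314")   # contiguous block, no stride

print(coi_pos1[:4], "...", coi_pos1[-1])
print(len(coi_pos1), len(rrna_16s))
```

Note that 16S is not protein-coding, so its range has no "\3" stride, while each protein-coding gene is split into three interleaved sets of sites.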
PartitionFinder has determined the partitions and the best-fitting model of nucleotide substitution for each partition. Now we need to get this into a format that IQ-Tree will understand. Although NEXUS files are perfectly capable of housing data and partitioning schemes, IQ-Tree seems to want them separated, so we can't just add a sets block to the apollo.nex file you created earlier. Create a new file named apollo_partition.nex and add the partitioning scheme information to it like so:
Begin sets;
charset COI_Pos1 = 1-1489\3;
charset COI_Pos2_ND1_Pos2_ND5_Pos2 = 2-1490\3 1493-1961\3 1964-2777\3;
charset COI_Pos3 = 3-1491\3;
charset 16S_ND1_Pos1_ND5_Pos1 = 1492-1960\3 1963-2776\3 2778-3314;
And so on until you have added all of the partitions.
Next add a line to indicate which partition gets which model of nucleotide substitution:
charpartition favored = GTR+I+G:COI_Pos1, GTR+I+G:COI_Pos2_ND1_Pos2_ND5_Pos2, GTR+I+G:COI_Pos3, GTR+I+G:16S_ND1_Pos1_ND5_Pos1,...
And don't forget to End; the sets block!
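The sets block can also be generated from the rows of best_scheme.txt rather than typed by hand. The Python sketch below is one way to do it under a few assumptions: the function names are hypothetical, rows are pipe-delimited exactly as shown earlier, and charset names are built by joining the subset's partition names with underscores, as in the example above. The `model` argument forces a single model for every partition, mirroring the example charpartition line; passing `model=None` uses the model PartitionFinder reported instead, and whether IQ-Tree accepts each of those model names is something you would need to check.

```python
def parse_scheme_row(row):
    """Split one 'Subset | Best Model | Subset Partitions | Subset Sites | ...' row."""
    fields = [field.strip() for field in row.split("|")]
    best_model, partitions, sites = fields[1], fields[2], fields[3]
    name = "_".join(p.strip() for p in partitions.split(","))
    ranges = " ".join(s.strip() for s in sites.split(","))
    return name, best_model, ranges


def sets_block(rows, model=None, scheme_name="favored"):
    """Build a NEXUS sets block; `model` overrides the reported best models."""
    parsed = [parse_scheme_row(row) for row in rows]
    lines = ["Begin sets;"]
    lines += [f"charset {name} = {ranges};" for name, _, ranges in parsed]
    pairs = ", ".join(f"{model or best}:{name}" for name, best, _ in parsed)
    lines.append(f"charpartition {scheme_name} = {pairs};")
    lines.append("End;")
    return "\n".join(lines)


rows = [
    r"1 | GTR+I+G | COI_Pos1 | 1-1489\3 | ./analysis/phylofiles/ca30e3ea7b9c739c115638123d4f1dd0.phy",
    r"2 | TIM+I+G | COI_Pos2, ND1_Pos2, ND5_Pos2 | 2-1490\3, 1493-1961\3, 1964-2777\3 | ./analysis/phylofiles/c2e9c416b67aadc24570829688eefbda.phy",
]
print(sets_block(rows, model="GTR+I+G"))
```

Only the first two subsets are included here to keep the example short; in practice you would read all eight rows from best_scheme.txt and write the result to apollo_partition.nex.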
R and ggtree
We've used R before to run our chi-squared tests, but that was from the command line. R has a wonderful integrated development environment (IDE), RStudio, which is very helpful when you need to write an R script (a sequence of many commands). Download the latest version of RStudio here.