IQ-Tree

Latest revision as of 14:46, 22 February 2019

EEB 5349: Phylogenetics

by Kevin Keegan

Goals

To introduce you to the maximum likelihood software IQ-Tree.

Introduction

IQ-Tree is a relative newcomer to the maximum likelihood arena, so let's road-test it. If that's not cutting-edge enough for you, we will be using it with the data from a 2018 study that was accepted on February 9th. It's not even officially published yet! The authors of the study used a Bayesian phylogeny as the basis for their study, but did not show results for a maximum likelihood phylogeny. Typically, studies present the results of both analyses, to examine any sensitivity their data may have to the phylogenetic inference method used. In order to help the authors out, we will infer a maximum likelihood phylogeny for them, and see if it recovers the same topology as their Bayesian phylogeny.

The Study

The study by Condamine et al. (2018) focuses on Apollo (Parnassius) butterflies, which live high up in alpine regions of the Holarctic and have long been a favorite genus for collectors due to their beauty and relatively high endemism (collectors like hard-to-find things). Being so well-known, their taxonomy is also well-known and relatively stable. In other words, we know what all the species are. This makes the group a good candidate for the type of comparative phylogenetic study that the authors performed. Being alpine specialists, Apollo butterflies are declining worldwide as they move up mountains, trying to keep pace with rising temperatures. Also of note is the use of mating plugs by male Apollos to control paternity.

The authors were interested in testing two hypotheses about the main drivers of diversification and macroevolution: the Red Queen and Court Jester Hypotheses. The Red Queen Hypothesis, coined by Leigh Van Valen in 1973, posits that species must constantly adapt through evolutionary time in order to survive in a world of ever-evolving species (it's an allusion to Through the Looking-Glass by Lewis Carroll). It stresses the importance of biotic interactions as drivers of diversification. On the other hand, the Court Jester Hypothesis, coined by Anthony Barnosky in 1999, posits that it is interactions between a given lineage and the non-living world around it that are the main drivers of diversification and macroevolution, thus emphasizing the importance of abiotic factors in driving diversification.

In order to test these hypotheses the authors first needed a time-calibrated phylogeny. They used previously published DNA data from Apollo (Parnassius), coupled with two fossil specimens, to generate a complete (as in all 85 living species in the genus are included) time-calibrated phylogeny. Knowing the absolute, and not just relative, dates of the nodes, the authors were able to use historical climatic and geological data (Court Jester), as well as inferred ancestral species ranges (Red Queen), to try to tease apart who rules the Parnassian Court: the Red Queen or the Jester?

Get Their Data

The authors deposited the data they used in the digital repository Dryad. Go to Dryad and search for the authors' study. Each study has a digital object identifier (DOI) that uniquely identifies it and can be used to easily search for it. The DOIs are typically printed near the beginning of an article as part of a URL. The DOI number is everything after the ".org". "https://doi.org/[everything here is the DOI number]"
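If you'd rather not pick the DOI out of the URL by eye, the rule above (everything after the ".org") can be sketched in a few lines of Python. This is only an illustration: the function name doi_from_url is made up for this example, and the URL shown is built from the DOI of the Condamine et al. (2018) article, 10.1093/sysbio/syy009.

```python
from urllib.parse import urlparse


def doi_from_url(url):
    """Return the DOI part of a doi.org URL (everything after the host name)."""
    parsed = urlparse(url)
    # The path begins with "/", which is not part of the DOI itself.
    return parsed.path.lstrip("/")


# Hypothetical usage with the article's DOI URL
example_url = "https://doi.org/10.1093/sysbio/syy009"
print(doi_from_url(example_url))
```

The string it returns is what you would paste into the Dryad search box.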

We need the data they used in their MrBayes inference (Appendix 3), and information about how they partitioned their data (Appendix 9).

Appendix 3

Appendix 3 is simply the NEXUS file they passed to MrBayes: it has character state information along with commands to MrBayes about how to partition the data. The character state information in this file is from both DNA and morphological data. When these two sources of information are used to infer a phylogeny it's termed a "total-evidence" approach (which seems a little presumptuous to this author!)

IQ-Tree can handle combined morphological and molecular datasets. Despite IQ-Tree being able to handle the combined data set of Condamine et al. (2018), we're going to simplify things a bit and just infer a phylogeny with the DNA data. Yes, we are changing two variables (changing the inference framework from Bayesian to maximum likelihood and removing morphological data), but nonetheless our analysis will serve to test the sensitivity of the topology of their phylogeny to methodological changes.

Open the NEXUS file in a text editor. Look through the file to get a feel for it. Notice the familiar data matrix of nucleotides towards the beginning of the file, and the morphological data matrix below it. The morphological data consists of two states for a given character: 0 for absent, and 1 for present. They assessed some, but not all, of the species in their dataset for morphology. Scroll back up to the top of the document.

  • Why is there no DNA data for taxa Thaites ruminiana or Doritites bosniackii? (Answer: these are the fossil taxa used in their study, and they are too old to provide usable DNA.)

Remove the fossil taxa from the nucleotide matrix, and delete the morphological matrix entirely. Return to the top of the document and note the lines:

DIMENSIONS  NTAX=96 NCHAR=4771;
FORMAT DATATYPE = mixed(DNA:1-4535,Standard:4536-4771) interleave=yes  GAP = - MISSING = ?;

Our data is no longer "mixed" after taking out the morphological data (i.e. there's no more "Standard" data). Our "DATATYPE" is now simply "DNA". Because it's only DNA, we don't need to specify which characters are DNA. But we still need to specify how many total characters (number of nucleotide sites in this case) are in the dataset, as well as how many taxa. Make the necessary changes to these two lines of the NEXUS file and save the file as apollo.nex.
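If you want to check your hand edits, here is a minimal Python sketch of the same header change. It rests on two assumptions that you should verify against your own file: that the two fossil taxa are the only taxa you removed, and that both were counted in the original NTAX. The function name dna_only_header is invented for this example.

```python
import re


def dna_only_header(dimensions, fmt, n_removed_taxa):
    """Rewrite the DIMENSIONS and FORMAT lines for a DNA-only matrix."""
    ntax = int(re.search(r"NTAX\s*=\s*(\d+)", dimensions, re.I).group(1))
    # The last DNA site is the upper bound of the DNA range inside mixed(...).
    last_dna_site = int(re.search(r"DNA:\d+-(\d+)", fmt, re.I).group(1))
    new_dimensions = f"DIMENSIONS  NTAX={ntax - n_removed_taxa} NCHAR={last_dna_site};"
    # Swap the mixed(...) declaration for plain DNA and leave the rest untouched.
    new_format = re.sub(r"mixed\([^)]*\)", "DNA", fmt, flags=re.I)
    return new_dimensions, new_format


old_dimensions = "DIMENSIONS  NTAX=96 NCHAR=4771;"
old_format = "FORMAT DATATYPE = mixed(DNA:1-4535,Standard:4536-4771) interleave=yes  GAP = - MISSING = ?;"

# Assumption: exactly two taxa (the fossils) were removed from the matrix.
new_dimensions, new_format = dna_only_header(old_dimensions, old_format, n_removed_taxa=2)
print(new_dimensions)
print(new_format)
```

Compare the two printed lines with what you typed into apollo.nex; if they differ, count the taxa left in your matrix before trusting either version.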

Appendix 9

Appendix 9 has the results of their PartitionFinder analysis. PartitionFinder is one of many programs that will take a look at your data and attempt to place similarly evolving sites into the same bin or partition. One can also partition their data manually using a priori knowledge about their sites (e.g. put different genes into different partitions, and different codon positions into different partitions).
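To make the manual approach concrete, here is a small Python sketch that splits one protein-coding gene into its three codon positions and writes them as NEXUS charset statements. It assumes the gene's first site is a first codon position, and the gene boundaries used below (COI at sites 1-1491) are inferred from the site ranges in this dataset rather than stated by the authors. The function name is hypothetical.

```python
def codon_position_charsets(gene, start, end):
    """Return three charset lines, one per codon position, for sites start..end."""
    lines = []
    for pos in (1, 2, 3):
        first = start + pos - 1
        # Last site at or before `end` that falls on this codon position.
        last = first + ((end - first) // 3) * 3
        # "\3" in NEXUS means "every third site".
        lines.append(f"charset {gene}_Pos{pos} = {first}-{last}\\3;")
    return lines


# Assumed boundaries for COI in this alignment: sites 1 through 1491
for line in codon_position_charsets("COI", 1, 1491):
    print(line)
```

PartitionFinder starts from small blocks like these and then merges the ones that evolve similarly, which is why several of the subsets in its output combine codon positions from different genes.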

Open up the "best_scheme.txt" file and note the lines:

Subset | Best Model | Subset Partitions              | Subset Sites                   | Alignment                               
1      | GTR+I+G    | COI_Pos1                       | 1-1489\3                       | ./analysis/phylofiles/ca30e3ea7b9c739c115638123d4f1dd0.phy
2      | TIM+I+G    | COI_Pos2, ND1_Pos2, ND5_Pos2   | 2-1490\3, 1493-1961\3, 1964-2777\3 | ./analysis/phylofiles/c2e9c416b67aadc24570829688eefbda.phy
3      | TIM+G      | COI_Pos3                       | 3-1491\3                       | ./analysis/phylofiles/5d40dcf0c7fe8811737c39985b3a08af.phy
4      | GTR+I+G    | 16S, ND1_Pos1, ND5_Pos1        | 1492-1960\3, 1963-2776\3, 2778-3314 | ./analysis/phylofiles/118cc2b9a278fd1acb34d328d95d92f6.phy
5      | TIM+G      | ND1_Pos3, ND5_Pos3             | 1494-1962\3, 1965-2775\3       | ./analysis/phylofiles/3c2d866b9dc71dad74d7571f2ca99381.phy
6      | TIM+I+G    | EF1_Pos1                       | 3315-4533\3                    | ./analysis/phylofiles/9980e10b91c1a16683fe75f404068d6f.phy
7      | TrNef+I+G  | EF1_Pos2                       | 3316-4534\3                    | ./analysis/phylofiles/aea6ba92a6c9f49b27ab397b8cb58e28.phy
8      | GTR+I+G    | EF1_Pos3                       | 3317-4535\3                    | ./analysis/phylofiles/759f0f0ff5902ef4d5a8d520f1a0c4a6.phy

PartitionFinder has determined the partitions and the best-fitting model of nucleotide substitution for each partition. Now we need to get this into a format that IQ-Tree will understand. Although NEXUS files are perfectly capable of housing data and partitioning schemes, IQ-Tree seems to want them separated, so we can't just add a sets block to the apollo.nex file you created earlier. Create a new file named apollo_partition.nex and add the partitioning scheme information to it like so:

#nexus
Begin sets;
 charset COI_Pos1 = 1-1489\3;
 charset COI_Pos2_ND1_Pos2_ND5_Pos2 = 2-1490\3 1493-1961\3 1964-2777\3;
 charset COI_Pos3 = 3-1491\3;
 charset 16S_ND1_Pos1_ND5_Pos1 = 1492-1960\3 1963-2776\3 2778-3314;

And so on until you have added all of the partitions.
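Typing eight charset lines by hand invites typos, so here is a rough Python sketch that converts one row of the best_scheme.txt table into a charset statement. It assumes the pipe-separated column order shown in the table (subset, model, partition names, sites, alignment file) and joins the partition names with underscores, as in the lines above. The function name charset_line is made up for this example.

```python
def charset_line(row):
    """Turn one best_scheme.txt table row into a NEXUS charset statement."""
    fields = [field.strip() for field in row.split("|")]
    # Column order: subset number, best model, partition names, sites, alignment file
    names = [name.strip() for name in fields[2].split(",")]
    sites = [site.strip() for site in fields[3].split(",")]
    name = "_".join(names)
    ranges = " ".join(sites)
    return f"charset {name} = {ranges};"


# Two rows copied from the table; raw strings keep the "\3" intact.
rows = [
    r"2      | TIM+I+G    | COI_Pos2, ND1_Pos2, ND5_Pos2   | 2-1490\3, 1493-1961\3, 1964-2777\3 | ./analysis/phylofiles/c2e9c416b67aadc24570829688eefbda.phy",
    r"5      | TIM+G      | ND1_Pos3, ND5_Pos3             | 1494-1962\3, 1965-2775\3       | ./analysis/phylofiles/3c2d866b9dc71dad74d7571f2ca99381.phy",
]
for row in rows:
    print(charset_line(row))
```

Note the raw strings (the r prefix): without them Python would read "\3" as an escape sequence and silently mangle the site ranges.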

Next add a line to indicate which partition gets which model of nucleotide substitution:

charpartition favored = GTR+I+G:COI_Pos1, TIM+I+G:COI_Pos2_ND1_Pos2_ND5_Pos2, TIM+G:COI_Pos3, GTR+I+G:16S_ND1_Pos1_ND5_Pos1,...

And don't forget to End; the sets block!
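Before submitting anything, it can be worth checking that every site ends up in exactly one partition. The Python sketch below builds a charpartition line from the eight subsets in best_scheme.txt and verifies that, together, they cover sites 1 through 4535 once each. The names for subsets 5 through 8 simply follow the naming pattern used above and are our choice, not something the authors specify. The model names are copied verbatim from PartitionFinder; whether IQ-Tree accepts each spelling (TrNef, for instance) is something to confirm in the IQ-Tree documentation.

```python
# (charset name, model, site ranges) taken from the best_scheme.txt table
SUBSETS = [
    ("COI_Pos1", "GTR+I+G", r"1-1489\3"),
    ("COI_Pos2_ND1_Pos2_ND5_Pos2", "TIM+I+G", r"2-1490\3 1493-1961\3 1964-2777\3"),
    ("COI_Pos3", "TIM+G", r"3-1491\3"),
    ("16S_ND1_Pos1_ND5_Pos1", "GTR+I+G", r"1492-1960\3 1963-2776\3 2778-3314"),
    ("ND1_Pos3_ND5_Pos3", "TIM+G", r"1494-1962\3 1965-2775\3"),
    ("EF1_Pos1", "TIM+I+G", r"3315-4533\3"),
    ("EF1_Pos2", "TrNef+I+G", r"3316-4534\3"),
    ("EF1_Pos3", "GTR+I+G", r"3317-4535\3"),
]


def expand(ranges):
    """Expand NEXUS ranges such as '1-1489\\3 2778-3314' into a list of sites."""
    sites = []
    for token in ranges.split():
        span, _, step = token.partition("\\")
        first, _, last = span.partition("-")
        # A missing "\step" means every site in the span.
        sites.extend(range(int(first), int(last) + 1, int(step or 1)))
    return sites


def charpartition_line(subsets, label="favored"):
    """Build the charpartition statement pairing each charset with its model."""
    pairs = ", ".join(f"{model}:{name}" for name, model, _ in subsets)
    return f"charpartition {label} = {pairs};"


n_sites = 4535  # number of DNA sites in the alignment
all_sites = [site for _, _, ranges in SUBSETS for site in expand(ranges)]
covered_once = sorted(all_sites) == list(range(1, n_sites + 1))

print(charpartition_line(SUBSETS))
print("every site assigned exactly once:", covered_once)
```

If the coverage check ever reports False after you edit the ranges, look for a mistyped boundary before blaming IQ-Tree.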

Now you're ready to run an IQ-Tree analysis!

IQ-Tree on BBC

Connect to the cluster and transfer your apollo.nex and apollo_partition.nex files to a folder named "iqtree".

We'll need a qsub script to submit our jobs to the cluster, so create one like you did last lab with a few minor modifications to the beginning lines:

#$ -cwd
#$ -S /bin/bash
#$ -o iqout.txt
#$ -e iqerr.txt
#$ -m ea
#$ -M your.name@uconn.edu
#$ -N iqtree

And add the following commands to load IQ-Tree and to tell it to perform an inference with 1000 ultra-fast bootstrap replicates (NOTE: these are NOT equivalent to non-parametric bootstraps typically used in likelihood analyses to assess clade support):

iqtree -s apollo.nex -spp apollo_partition.nex -st DNA -bb 1000

See the fantastic IQ-Tree documentation (http://www.iqtree.org/doc/iqtree-doc.pdf) for more information on the parameters used.

OK, you are ready to submit your job! Use qstat to check on its status: it shouldn't take longer than 15 minutes.

Literature Cited

red queen hypothesis
court-jester hypothesis
iqtree