Koepfli and Wayne
Abstract. We compared the utility of five nuclear gene segments amplified with type I sequence-tagged site (STS) primers versus the complete mitochondrial cytochrome b (cyt b) gene in resolving phylogenetic relationships within the Mustelidae, a large and ecomorphologically diverse family of mammalian carnivores. Maximum parsimony and likelihood analyses of separate and combined data sets were used to address questions regarding the levels of homoplasy, incongruence, and information content within and among loci. All loci showed limited resolution in the separate analyses either because of a low amount of informative variation (nuclear genes) or high levels of homoplasy (cyt b). Individually or combined, the nuclear gene sequences had less homoplasy, retained more signal, and were more decisive, even though cyt b contained more potentially informative variation than all the nuclear sequences combined. We obtained a well-resolved and supported phylogeny when the nuclear sequences were combined. Maximum likelihood and Bayesian phylogenetic analyses of the total combined data (nuclear and mitochondrial DNA sequences) were able to better accommodate the high levels of homoplasy in the cyt b data than an equally weighted maximum parsimony analysis. Furthermore, partition Bremer support analyses of the total combined tree showed that the relative support of the nuclear and mitochondrial genes differed according to whether or not the homoplasy in the cyt b gene was downweighted. While the cyt b gene contributed phylogenetic signal for most major groupings, the nuclear gene sequences were more effective in reconstructing the deeper nodes of the combined tree in the equally weighted parsimony analysis, as judged by the variable-length bootstrap method. The total combined data supported the monophyly of the Lutrinae (otters), while the Melinae (badgers) and Mustelinae (weasels, martens) were both paraphyletic. The American badger, Taxidea taxus (Taxidiinae), was the most basal taxon. Because hundreds of type I STS primer sets spanning the complete genomes of the human and mouse have been published and thus represent many independently segregating loci, the potential utility of these markers to the molecular systematics of mammals and other groups is enormous.
Susko et al.
Abstract. Previous work has shown that it is often essential to account for the variation in rates at different sites in phylogenetic models in order to avoid phylogenetic artifacts such as long branch attraction. In most current models, the gamma distribution is used for the rates-across-sites distributions and is implemented as an equal-probability discrete gamma. In this paper, we introduce discrete distribution estimates with large numbers of equally spaced rate categories allowing us to investigate the appropriateness of the gamma model. With large numbers of rate categories, these discrete estimates are flexible enough to approximate the shape of almost any distribution. Likelihood ratio statistic tests and a nonparametric bootstrap confidence bound estimation procedure based on the discrete estimates are presented that can be used to test the fit of a parametric family. We apply the methodology to several different protein data sets and find that although the gamma model often provides a good parametric model for this type of data, rate estimates from an equal-probability discrete gamma model with a small number of categories will tend to underestimate the largest rates. In cases when the gamma model assumption is in doubt, rate estimates coming from the discrete rate distribution estimate with a large number of rate categories provide a robust alternative to gamma estimates. An alternative implementation of the gamma distribution is proposed that, for equal numbers of rate categories, is computationally more efficient during optimization than the standard gamma implementation and can provide more accurate estimates of site rates.
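As a rough illustration of the equal-probability discrete gamma mentioned above, the following sketch approximates the mean rate of each category by sampling. It is not the authors' implementation: the function name, the sampling shortcut (standard implementations compute the category means analytically), and the shape value 0.5 are assumptions chosen for the example.

```python
import random


def discrete_gamma_rates(alpha, n_categories=4, n_samples=200_000, seed=0):
    """Approximate the mean rate of each equal-probability gamma category.

    Hypothetical helper: draws are sampled rather than integrated, so the
    result is a Monte Carlo approximation, not an exact computation.
    """
    rng = random.Random(seed)
    # A gamma distribution with shape alpha and scale 1/alpha has mean 1.
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))
    # Split the sorted draws into equally sized (equal-probability) blocks;
    # any remainder draws at the top end are ignored.
    size = n_samples // n_categories
    rates = [sum(draws[i * size:(i + 1) * size]) / size for i in range(n_categories)]
    # Rescale so that the average rate across categories is exactly 1.
    mean_rate = sum(rates) / n_categories
    return [r / mean_rate for r in rates]


rates = discrete_gamma_rates(alpha=0.5, n_categories=4)
```

With only four categories, every site in the fastest quartile is assigned a single averaged rate, which gives some intuition for why the largest site rates can be underestimated.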
Abstract. Homoplasy among morphological characters has hindered inference of higher-level rodent phylogeny for over 100 years. Initial molecular studies, based primarily on single genes, likewise produced little resolution of the deep relationships among rodent families. Two recent molecular studies (Huchon et al., 2002; Adkins et al., 2003), using larger samples from the nuclear genome, have produced phylogenies that are generally concordant with each other, but many of the deep superfamilial nodes were still lacking substantial statistical support. Data are presented here for a total of approximately 3,600 bp from portions of three different nuclear protein-coding genes, CB1, IRBP, and RAG2, from 19 rodents and 3 outgroups. Separate analyses, with data partitioned according to both genes and codon position, produced conflicting results. Trees obtained from all partitions of CB1 and RAG2, and those obtained from the first- plus second-position sites of IRBP, were generally concordant with each other and the trees from the two recent studies, while trees obtained from the third-position sites of IRBP were not. Although the IRBP third-position sites represent only 1/9 of the total data set, combined analyses using either parsimony or likelihood resulted in trees in agreement with the IRBP third-position sites and in disagreement with the remaining 8/9 of the sites from this data set and the two recent multi-gene studies. In contrast, maximum likelihood analysis using a site-specific rates model did recover a tree that is highly congruent with the trees in the two recent studies. If the IRBP third-position sites are removed from the current data set, then combined likelihood analyses obtain a tree that is highly congruent with the two recent studies. This analysis also provides, for the first time in a study of rodent phylogeny, robust statistical support for every bipartition, with just one exception. This tree divides rodents into two major clades. The first contains Myodonta (Muroidea plus Dipodidae) and the only unresolved trichotomy, from which descend Geomyoidea, Pedetidae, and Castoridae. On the other side of the root is a clade containing (Sciuroidea plus Gliridae) and Hystricognathi. Some uncertainty remains on the placement of the root. Trees on which the Hystricognathi are the basal sister-group to Myodonta, Geomyoidea, Pedetidae, and Castoridae are also found within a Bayesian 95% credible set, as estimated by MCMCMC sampling.
Abstract. Although calyptraeid gastropods are not well understood taxonomically, in part because their simple plastic shells are the primary taxonomic character, they provide an ideal system to examine questions about evolution in the marine environment. I conducted a phylogenetic analysis of calyptraeid gastropods using DNA sequence data from mitochondrial cytochrome oxidase I (COI) and 16S, and nuclear 28S genes. The resultant phylogeny was used to examine the biogeographic patterns of speciation in this family. Parsimony and Bayesian analyses of the combined datasets for 94 calyptraeid OTUs and 24 outgroups produced well-resolved phylogenies. Both approaches result in identical sister species relationships and the few differences in deeper topology do not affect biogeographic inferences. The geographic distributions of the species included here demonstrate numerous dispersal events both between the Pacific and Atlantic oceans and across the equator. When parsimony is used to reconstruct the movement from the Pacific to Atlantic oceans on the phylogeny, there are 12 transitions between oceans, primarily from the Pacific to the Atlantic. When the latitude is coded as north versus south of the equator, the most parsimonious reconstruction gives the origin of calyptraeids in the north followed by 15 dispersal events to regions south of the equator, and no returns to the north. Many clades of the most closely related species are either sympatric or occur along a single coastline. Closely related species can, however, occur in such divergent regions as Southern California and South Africa. There is little evidence for sister species pairs or larger clades having been split by the Isthmus of Panama or the Benguela upwelling, but the East Pacific Barrier appears to separate the most basal taxa from the rest of the family.
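The idea of counting ocean-to-ocean transitions by parsimony can be sketched with the Fitch algorithm on a toy tree. The abstract does not say which parsimony procedure or software was used, so the choice of Fitch parsimony, the nested-tuple tree encoding, and the five-taxon example are all assumptions made for illustration.

```python
def fitch(tree):
    """Return (possible_states, min_changes) for a binary tree.

    The tree is a nested 2-tuple; each leaf is a state label (a string).
    This hypothetical helper implements the bottom-up pass of Fitch parsimony.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_changes = fitch(left)
    right_states, right_changes = fitch(right)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra change is needed.
        return shared, left_changes + right_changes
    # The children disagree: take the union and count one transition.
    return left_states | right_states, left_changes + right_changes + 1


# Toy example (not data from the study): leaves are coded by ocean.
toy_tree = (("Pacific", "Pacific"), ("Atlantic", ("Atlantic", "Pacific")))
root_states, transitions = fitch(toy_tree)
```

On this toy tree the minimum number of transitions is 2 and the root state is ambiguous. Inferring the direction of each transition would require an additional top-down pass, which is omitted here.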
Huelsenbeck and Lander
Abstract. Although the conditions under which the parsimony method becomes inconsistent have been studied for almost two decades, the probability that the parsimony method would encounter conditions causing inconsistency under simple models of cladogenesis is unknown. Here we examine the statistical behavior of the parsimony method under a birth-death model of cladogenesis, when the molecular clock holds. The parsimony method can become inconsistent a high proportion of the time even under this simple model of cladogenesis. When taxon sampling is poor or rates of evolution are high, the probability that parsimony will become inconsistent increases.
Suchard et al.
Abstract. Debate exists over how to incorporate information from multipartite sequence data in phylogenetic analyses. Strict combined-data approaches argue for concatenation of all partitions and estimation of one evolutionary history, maximizing explanatory power of the data. Consensus/independence approaches endorse a two-step procedure where partitions are analyzed independently and then a consensus is determined from the multiple results. Mixtures across the model space of a strict combined-data approach and a priori independent parameters are popular methods to integrate these approaches. We propose an alternative middle ground by constructing a Bayesian hierarchical phylogenetic model. Our hierarchical framework enables researchers to pool information across data partitions in order to improve estimate precision in individual partitions while simultaneously permitting estimation and testing of tendencies in across-partition quantities. Such across-partition quantities include the distribution from which individual topologies relating the sequences within a partition are drawn. We propose standard hierarchical priors on continuous evolutionary parameters across partitions, while the structure on topologies varies depending on the research problem. We illustrate our model with three examples. We first explore the evolutionary history of the guinea pig using alignments of 13 mitochondrial genes. The hierarchical model returns substantially more precise continuous parameter estimates than an independent parameter approach without losing the salient features of the data. Second, we analyze the frequency of horizontal gene transfer using 50 prokaryotic genes. We assume an unknown species-level topology and allow individual gene topologies to differ from this with a small estimable probability. Simultaneously inferring the species and individual gene topologies returns a transfer frequency of 17%. We also examine HIV sequences longitudinally sampled from HIV+ patients. We ask whether post-treatment development of CCR5 coreceptor virus represents concerted evolution from mid-disease CXCR4 virus or re-emergence of initial infecting CCR5 virus. The hierarchical model pools partitions from multiple unrelated patients by assuming that the topology for each patient is drawn from a multinomial distribution with unknown probabilities. Preliminary results suggest evolution and not re-emergence.
Erixon et al.
Abstract. Many empirical studies have revealed considerable differences between non-parametric bootstrapping and Bayesian posterior probabilities in terms of the support values for branches, despite claimed predictions about their approximate equivalence. We investigated this problem by simulating data, which were then analyzed by maximum likelihood bootstrapping and Bayesian phylogenetic analysis, using identical models and re-optimization of parameter values. We show that Bayesian posterior probabilities are significantly higher than corresponding non-parametric bootstrap frequencies for true clades, but also that erroneous conclusions will be made more often. This is strongly accentuated when the models used for analyses are under-parameterized. If data are analyzed under the correct model, non-parametric bootstrapping is conservative. Bayesian posterior probabilities are also conservative in this respect, but less so.
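For readers unfamiliar with non-parametric bootstrapping of sequence data, the resampling step can be sketched as follows. Only the column resampling is shown; estimating a tree from each replicate and tallying clade frequencies is left out. The function name and the toy alignment are hypothetical and are not taken from the study.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate).

    `alignment` maps taxon names to aligned sequences of equal length.
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw as many column indices as there are sites, with replacement.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


# Toy alignment, invented for illustration.
toy_alignment = {"taxonA": "ACGTAC", "taxonB": "ACGTTC", "taxonC": "ACCTTG"}
replicate = bootstrap_alignment(toy_alignment, random.Random(1))
```

The bootstrap frequency of a branch is then the fraction of replicates whose estimated tree contains that branch.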
Minin et al.
Abstract. Phylogenetic estimation has largely come to rely on explicitly model-based methods. This approach requires that a model be chosen, and that that choice be justified. To date, this has largely been accomplished through use of likelihood-ratio tests (LRTs) to assess the relative fit of a nested series of reversible models. While this certainly represents an important advance over arbitrary model selection, the best fit of a series of models may not always provide the most reliable phylogenetic estimates for finite real data sets, where all available models are surely incorrect. Here, we develop a novel approach to model selection, which is based on the Bayesian Information Criterion, but incorporates relative branch-length error as a performance measure in a decision theory (DT) framework. This DT method includes a penalty for overfitting, is applicable prior to running extensive analyses, and simultaneously compares all models being considered and thus does not rely on a series of pairwise comparisons of models to traverse model space. We evaluate this method by examining four real data sets and by using those data sets to define simulation conditions. In the real data sets, the DT method selects the same or simpler models than conventional LRTs. In order to lend generality to the simulations, codon-based models (with parameters estimated from the real data sets) were used to generate simulated data sets which are therefore more complex than any of the models we evaluate. On average, the DT selects models that are simpler than those chosen by conventional LRTs. Nevertheless, these simpler models provide estimates of branch lengths that are more accurate both in terms of relative error and absolute error than those derived using the more complex (yet still wrong) models chosen by conventional LRTs. This method is available in a program called DT-ModSel.
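The Bayesian Information Criterion on which the DT method builds can be sketched in a few lines. This shows only the standard BIC comparison, not the branch-length-error weighting of the DT method itself; the model names, log-likelihoods, parameter counts, and alignment length below are invented for the example.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Standard BIC: -2 ln L + k ln n (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


# Hypothetical candidates: name -> (maximized log-likelihood, free parameters).
candidates = {
    "simple": (-5000.0, 10),
    "medium": (-4995.0, 11),
    "complex": (-4990.0, 15),
}
n_sites = 1000  # assumed alignment length

# All models are scored at once, with no chain of pairwise tests.
scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)
```

In this made-up case the intermediate model wins: the most complex model has the highest likelihood, but its extra parameters cost more than the improvement in fit.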
Abstract. A phylogenetic comparative method is proposed for estimating historical effects on comparative data using the partitions that compose a cladogram, that is, its monophyletic groups. Two basic matrices, Y and X, are defined in the context of an ordinary linear model. Y contains the comparative data measured over t taxa. X consists of an initial tree matrix that contains all the xj monophyletic groups (each coded separately as a binary indicator variable) of the phylogenetic tree available for those taxa. The method seeks to define the subset of groups (a reduced tree matrix) that "best" explains the patterns in Y. This is accomplished via regression or canonical ordination (depending on the dimensionality of Y) coupled with Monte Carlo permutations. It is argued here that unrestricted permutations (i.e., under an equiprobable model) are valid for testing this specific kind of group-wise hypothesis and are thus employed. Phylogeny is either partialled out or, more properly, incorporated into the analysis in the form of component variation. Direct extensions allow for testing ecomorphological data controlled by phylogeny in a variation partitioning approach. Currently available statistical techniques make this method applicable under most univariate/multivariate models and metrics; two-way phylogenetic effects can be estimated as well. The simplest case (univariate Y), tested with simulations, yielded acceptable type I error rates. Applications presented include examples in evolutionary ethology, ecology, and ecomorphology. Results showed that the new technique detected previously overlooked variation clearly associated with phylogeny, and that many phylogenetic effects on comparative data may occur at particular groups rather than across the entire tree.
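A minimal sketch of the two ingredients described above, a binary tree matrix X and an unrestricted permutation test, might look as follows. It is a simplification under stated assumptions: the tree encoding, function names, and toy data are hypothetical, and the test statistic is a plain difference in group means for one indicator column rather than the regression or canonical ordination used in the method.

```python
import random


def clades(tree):
    """Return (leaf_set, clade_list) for a tree given as nested tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def tree_matrix(tree):
    """Code every monophyletic group (except the root) as a 0/1 column."""
    all_leaves, found = clades(tree)
    taxa = sorted(all_leaves)
    groups = [g for g in found if g != all_leaves]  # the root column is constant
    X = [[1 if t in g else 0 for g in groups] for t in taxa]
    return taxa, groups, X


def permutation_p(y, column, n_perm=2000, seed=0):
    """Unrestricted permutation test of |mean inside - mean outside| for a group."""
    def stat(values):
        inside = [v for v, x in zip(values, column) if x]
        outside = [v for v, x in zip(values, column) if not x]
        return abs(sum(inside) / len(inside) - sum(outside) / len(outside))

    rng = random.Random(seed)
    observed = stat(y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # taxa labels are permuted freely
        if stat(shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (invented): the trait differs sharply between the two main clades.
toy_tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
taxa, groups, X = tree_matrix(toy_tree)
y = [5.1, 4.9, 5.0, 5.2, 1.0, 1.1, 0.9, 1.2]  # one value per taxon, in `taxa` order
abcd = groups.index(frozenset("abcd"))
p_value = permutation_p(y, [row[abcd] for row in X])
```

Selecting the reduced tree matrix would then amount to keeping the columns of X whose effects survive such tests, for example through a stepwise procedure.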
Guindon and Gascuel
Abstract. The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method, and modifies this tree so as to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We use extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum likelihood programs, and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 minutes are required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 bp from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page.
Yang and Yoder
Abstract. Divergence time and substitution rate are seriously confounded in phylogenetic analysis, making it difficult to estimate divergence times when the molecular clock (rate constancy among lineages) is violated. This problem can be alleviated to some extent by analyzing multiple gene loci simultaneously and by using multiple calibration points. While different genes may have different patterns of evolutionary rate change, they share the same divergence times. Indeed, the fact that each gene may violate the molecular clock differently leads to the advantage of simultaneous analysis of multiple loci. Multiple calibration points provide the means for characterizing the local evolutionary rates on the phylogeny. In this paper, we extend previous likelihood models of local molecular clock for estimating species divergence times to accommodate multiple calibration points and multiple genes. Heterogeneity among different genes in evolutionary rate and in substitution process is accounted for by the models. We apply the likelihood models to analyze two mitochondrial protein-coding genes, COII and cytochrome b, to estimate divergence times of Malagasy mouse lemurs and related outgroups. The likelihood method is compared with the Bayes method of Thorne et al. (1998), which uses a probabilistic model to describe the change of evolutionary rate over time and uses Markov chain Monte Carlo to derive the posterior distribution of rates and times. Our likelihood implementation has the drawbacks of failing to accommodate uncertainties in fossil calibrations and of requiring the researcher to classify branches on the tree into different rate groups. Both problems are avoided in the Bayes method. Despite the differences in the two methods, however, we found that data partitions and model assumptions had the greatest impact on date estimation. The three codon positions have very different substitution rates and evolutionary dynamics, and assumptions in the substitution model affect date estimation in both likelihood and Bayes analyses. The results demonstrate that the separate analysis is unreliable, with dates variable among codon positions and between methods, and that the combined analysis is much more reliable. When the three codon positions are analyzed simultaneously under the most realistic models using all available calibration information, the two methods produced similar results. The divergence of the mouse lemurs is dated to be around 7-10 million years ago, indicating a surprisingly early species radiation for such a morphologically uniform group of primates.
Hausdorf and Hennig
Abstract. Biotic element analysis is an alternative to the areas-of-endemism approach for recognizing the presence or absence of vicariance events in a given region. If an ancestral biota was fragmented by vicariance events, biotic elements, clusters of distribution areas, will emerge. We propose a statistical test for clustering of distribution areas based on a Monte Carlo simulation with a null model which considers the spatial autocorrelation in the data. The hypothesis is tested that the observed degree of clustering of ranges can be explained by the range size distribution, the varying number of taxa per cell, and the spatial autocorrelation of the occurrences of a taxon alone. A method for the delimitation of biotic elements which uses model-based Gaussian clustering is introduced. We demonstrate our methods and show the importance of grid size by means of a case study, an analysis of the distribution patterns of southern African species of the weevil genus Scobius. The example highlights the difficulties in delimiting areas of endemism if dispersal occurred and illustrates the advantages of the biotic element approach.