Syst. Biol. 47(1):9-17, 1998

Is it better to add taxa or characters to a difficult phylogenetic problem?

Anna Graybeal 1

Department of Zoology, University of Texas, Austin, Texas 78712, USA

Abstract.---The effects on phylogenetic accuracy of adding characters and/or taxa were explored using data generated by computer simulation. The conditions of this study were constrained but allowed for systematic investigation of certain parameters. The starting point for the study was a four-taxon tree in the "Felsenstein zone," representing a difficult phylogenetic problem with an extreme situation of long branch attraction. Taxa were added sequentially to this tree in a manner specifically designed to break up the long branches, and for each tree data matrices of different sizes were simulated. Phylogenetic trees were reconstructed from these data using the criteria of parsimony and maximum likelihood. Phylogenetic accuracy was measured in three ways: (1) proportion of trees that are completely correct, (2) proportion of correctly reconstructed branches in all trees, and (3) proportion of trees in which the original four-taxon statement is correctly reconstructed. Accuracy improved dramatically with the addition of taxa and much more slowly with the addition of characters. If taxa can be added to break up long branches, adding taxa is much preferable to adding characters.
[Long branch attraction, parsimony, phylogenetic reconstruction, simulation, taxon sampling.]

1 Present address: Department of Zoology, Field Museum of Natural History, Roosevelt Road and Lake Shore Drive, Chicago, Illinois 60605, USA. E-mail: graybeal@fmppr.fmnh.org.
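
The three accuracy measures listed in the abstract above can be illustrated with a minimal sketch. It is not the code used in the study. It assumes trees are encoded as nested tuples of taxon names, that a "branch" means an internal branch (a nontrivial bipartition of the taxa), and that the four-taxon statement is checked by pruning each tree to the original four taxa. All function names and the toy trees are hypothetical.

```python
def leaves(tree):
    """Set of taxon names under a node; trees are nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Nontrivial bipartitions (internal branches) of the unrooted tree."""
    taxa = leaves(tree)
    ref = min(taxa)  # normalise each split to the side lacking this taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = taxa - clade if ref in clade else clade
        if 2 <= len(side) <= len(taxa) - 2:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def prune(tree, keep):
    """Restrict a tree to the taxa in `keep`, collapsing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (prune(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def accuracy(true_tree, estimates, core_taxa):
    """Return the three accuracy proportions over a list of estimated trees."""
    true_splits = splits(true_tree)
    core_splits = splits(prune(true_tree, core_taxa))
    n = len(estimates)
    # (1) proportion of trees that are completely correct
    whole = sum(splits(t) == true_splits for t in estimates) / n
    # (2) proportion of correctly reconstructed internal branches in all trees
    shared = sum(len(splits(t) & true_splits) for t in estimates)
    branches = shared / (n * len(true_splits))
    # (3) proportion of trees with the original four-taxon statement correct
    core = sum(splits(prune(t, core_taxa)) == core_splits for t in estimates) / n
    return whole, branches, core


# Invented toy example: taxa E and F break up the branches leading to A and C.
TRUE_TREE = ((("A", "E"), "B"), (("C", "F"), "D"))
ESTIMATES = [
    TRUE_TREE,
    ((("A", "E"), "C"), (("B", "F"), "D")),
    ((("A", "E"), "B"), (("C", "D"), "F")),
]
whole, branches, core = accuracy(TRUE_TREE, ESTIMATES, {"A", "B", "C", "D"})
```

For these three invented estimates the measures are 1/3, 2/3, and 2/3. The third tree is wrong as a whole, but it still recovers the original four-taxon statement, which is why measure (3) can exceed measure (1).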


Syst. Biol. 47(1):18-31, 1998

Sensitivity of phylogeny estimation to taxon sampling

Steven Poe 1

Department of Zoology and Texas Memorial Museum, University of Texas, Austin, Texas 78712-1064, USA;
E-mail: stevepoe@mail.utexas.edu

Abstract.---Recent studies have shown that addition or deletion of taxa from a data matrix can change the estimate of phylogeny. I used twenty-nine data sets from the literature to examine the effect of taxon sampling on phylogeny estimation within data sets. I then used multiple regression to assess the effect of number of taxa, number of characters, homoplasy, strength of support, and tree symmetry on the sensitivity of data sets to taxonomic sampling. Sensitivity to sampling was measured by mapping characters from a matrix of culled taxa onto optimal trees for that reduced matrix and onto the pruned optimal tree for the entire matrix, then comparing the length of the reduced tree to the length of the pruned complete tree. Within-data-set patterns can be described by a second-order equation relating fraction of taxa sampled to sensitivity to sampling. Multiple regression analyses found number of taxa to be a significant predictor of sensitivity to sampling; retention index, number of informative characters, total support index, and tree symmetry were nonsignificant predictors. I derived a predictive regression equation relating fraction of taxa sampled and number of taxa potentially sampled to sensitivity to taxonomic sampling and calculated values for this equation within the bounds of the variables examined. The length difference between the complete tree and a subsampled tree was generally small (average difference of 0-2.9 steps), indicating that subsampling taxa is probably not an important problem for most phylogenetic analyses using up to 20 taxa.
[Taxonomic sampling; phylogeny estimation; multiple regression; modeling]

1 Address until November 1, 1998: Division of Amphibians and Reptiles, National Museum of Natural History, Smithsonian Institution, Washington, DC 20560.
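
The sensitivity measure described in the abstract above (length of the pruned complete tree minus length of the optimal tree for the reduced matrix) can be sketched as follows. This is an illustration under stated assumptions, not the author's procedure. Tree length is computed with Fitch parsimony for unordered characters, which the abstract does not specify. The complete tree is taken as given, and the optimal reduced tree is found by scoring a hand-listed set of candidate topologies. The matrix, trees, and function names are hypothetical.

```python
def prune(tree, keep):
    """Restrict a nested-tuple tree to the taxa in `keep`."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (prune(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def fitch_length(tree, matrix):
    """Fitch parsimony length of a binary tree; matrix maps taxon -> state string."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for i in range(n_chars):
        steps = 0

        def state_set(node):
            nonlocal steps
            if isinstance(node, str):
                return {matrix[node][i]}
            left, right = (state_set(child) for child in node)
            common = left & right
            if common:
                return common
            steps += 1  # disjoint state sets imply one change
            return left | right

        state_set(tree)
        total += steps
    return total


def sampling_sensitivity(full_tree, matrix, kept, candidate_trees):
    """Extra steps of the pruned complete tree over the best reduced-matrix tree."""
    pruned_length = fitch_length(prune(full_tree, kept), matrix)
    best_length = min(fitch_length(t, matrix) for t in candidate_trees)
    return pruned_length - best_length


# Invented five-taxon matrix with five binary characters.
MATRIX = {"A": "11111", "B": "11100", "C": "00100", "D": "00011", "E": "00000"}
FULL_TREE = ((("A", "B"), "C"), ("D", "E"))  # assumed optimal for MATRIX
KEPT = {"A", "C", "D", "E"}                  # taxon B is culled
QUARTETS = [                                 # all three topologies for KEPT
    (("A", "C"), ("D", "E")),
    (("A", "D"), ("C", "E")),
    (("A", "E"), ("C", "D")),
]
difference = sampling_sensitivity(FULL_TREE, MATRIX, KEPT, QUARTETS)
```

In this toy case the pruned complete tree needs 7 steps on the reduced matrix while the best quartet needs 6, so the difference is 1 step. Enumerating candidates by hand is feasible only for very small taxon sets; a real analysis would use a tree search.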


Syst. Biol. 47(1):32-42, 1998

Inferring complex phylogenies using parsimony: an empirical approach using three large DNA data sets for angiosperms

Douglas E. Soltis 1, Pamela S. Soltis 1, Mark E. Mort 1, Mark W. Chase 2, Vincent Savolainen 2, Sara B. Hoot 3, and Cynthia M. Morton 4

1 Department of Botany, Washington State University, Pullman, Washington 99164-4238, USA;
E-mail: dsoltis@mail.wsu.edu (D.E.S.), psoltis@wsu.edu (P.S.S.), markmort@mail.wsu.edu (M.E.M.)

2 Molecular Systematics Section, Jodrell Laboratory, Royal Botanic Gardens, Kew, Richmond, TW9 3DS, United Kingdom;
E-mail: mc03kg@lion.rbgkew.org.uk

3 Department of Biological Sciences, University of Wisconsin, Milwaukee, Wisconsin 53201, USA;
E-mail: hoot@csd.uwm.edu

4 Department of Botany, University of Reading, Reading RG6 2AS, United Kingdom;
E-mail: c.morton@reading.ac.uk

Abstract.---To explore the feasibility of parsimony analysis for large data sets, we conducted heuristic parsimony searches and bootstrap analyses on separate and combined DNA data sets for 190 angiosperms and three outgroups. Separate data sets of 18S rDNA (1,855 bp), rbcL (1,428 bp), and atpB (1,450 bp) sequences were combined into a single matrix 4,733 bp in length. Analyses of the combined data set show great improvements in computer run times compared to those of the separate data sets and of the data sets combined in pairs. Six searches of the 18S rDNA + rbcL + atpB data set were conducted; in all cases TBR branch swapping was completed, generally within a few days. In contrast, TBR branch swapping was not completed for any of the three separate data sets, or for the pairwise combined data sets. These results illustrate that it is possible to conduct a thorough search of tree space with large data sets, given sufficient signal. In this case, and probably most others, sufficient signal for a large number of taxa can only be obtained by combining data sets. The combined data sets also have higher internal support for clades than the separate data sets, and more clades receive bootstrap support of >50% in the combined analysis than in analyses of the separate data sets. These data suggest that one solution to the computational and analytical dilemmas posed by large data sets is the addition of nucleotides, as well as taxa.
[Large data sets, parsimony, phylogeny.]


Syst. Biol. 47(1):43-60, 1998

Large-scale phylogenies and measuring the performance of phylogenetic estimators

Junhyong Kim

Department of Biology, Yale University, New Haven, Connecticut 06511, USA;
E-mail: junhyong_kim@quickmail.yale.edu

Abstract.---Performance measures of phylogenetic estimation methods such as accuracy, consistency, and power are an attempt at summarizing an ensemble of a given estimator's behavior. These summaries characterize an ensemble behavior with a single number, leading to a variety of definitions. In particular, the relationship between different performance measures such as accuracy and consistency or accuracy and error depends on the exact definition of these measures. In addition, it is relatively common to use large-sample behavior to infer similar behavior for small samples. In fact, large-sample results such as the claimed asymptotic efficiency of the maximum likelihood estimator are often uninformative for small samples. Conversely, small-sample behavior using simulations is sometimes used to imply large-sample behavior such as consistency. However, such extrapolation is often difficult. How the performance of a phylogenetic estimator scales with the addition of taxa must be qualified with respect to whether the whole tree is being estimated or a fixed subset of taxa is being estimated. It must also be qualified with respect to how tree models are sampled. Over the ensemble of all possible trees of a given size, the performance of the estimators for the whole tree estimate suffers when the tree size becomes larger. However, under certain models of cladogenesis, the estimate can improve with the addition of taxa. In fact, at all numbers of taxa there are subsets of tree models that are easier to estimate than others. This suggests that with judicious addition or subtraction of taxa we can move from tree models that are more difficult to estimate at one number of taxa to those that are easier to estimate at another number of taxa.
[Accuracy; consistency; efficiency; large-scale phylogeny; performance]


Syst. Biol. 47(1):61-76, 1998

Amphioxus mitochondrial DNA, chordate phylogeny, and the limits of inference based on comparisons of sequences

Gavin J. P. Naylor 1 and Wesley M. Brown 2

1 Department of Zoology and Genetics, Iowa State University, Ames, Iowa 50011, USA;
E-mail: gnaylor@iastate.edu

2 Department of Biology, University of Michigan, Ann Arbor, Michigan, USA

Abstract.---Analyses of both the nucleotide and amino acid sequences derived from all 13 mitochondrial protein-encoding genes (12,234 bp) of 19 metazoan species, including that of the lancelet Branchiostoma floridae ("amphioxus"), fail to yield the widely accepted phylogeny for chordates and, within chordates, for vertebrates. Given the breadth and the compelling nature of the data supporting that phylogeny, relationships supported by the mitochondrial sequence comparisons are almost certainly incorrect, despite their being supported by equally weighted parsimony, distance and maximum likelihood analyses. The incorrect groupings probably result in part from convergent base-compositional similarities among some of the taxa, similarities that are strong enough to overwhelm the historical signal. Comparisons among very distantly related taxa are likely to be particularly susceptible to such artifacts, because the historical signal is already greatly attenuated. Empirical results underscore the need for approaches to phylogenetic inference that go beyond simple site-by-site comparison of aligned sequences. This study and others indicate that, once a sequence sample of reasonable size has been obtained, accurate phylogenetic estimation may be better served by incorporating knowledge of molecular structures and processes into inference models and by seeking additional higher order characters embedded in those sequences, than by gathering ever larger sequence samples from the same organisms in the hope that the historical signal will eventually prevail.
[Amphioxus; chordate phylogeny; homoplasy; mtDNA; molecular systematics; phylogenetic inference]


Syst. Biol. 47(1):77-89, 1998

A fast method for approximating maximum likelihoods of phylogenetic trees from nucleotide sequences

James S. Rogers 1 and David L. Swofford 2

1 Department of Biological Sciences, University of New Orleans, New Orleans, Louisiana 70148, USA;
E-mail: jsrbs@uno.edu

2 Laboratory of Molecular Systematics, MRC-534, Smithsonian Institution, Washington, DC 20560, USA;
E-mail: swofford@onyx.si.edu

Abstract.---We have developed a rapid parsimony method for reconstructing ancestral nucleotide states that allows calculation of initial branch lengths that are good approximations to optimal maximum likelihood estimates under several commonly used substitution models. Use of these approximate branch lengths (rather than fixed arbitrary values) as starting points significantly reduces the time required for iteration to a solution that maximizes the likelihood of a tree. These branch lengths are close enough to the optimal values that they can be used without further iteration to calculate approximate maximum likelihood scores that are very close to the "exact" scores found by iteration. Several strategies are described for using these approximate scores to substantially reduce times needed for maximum likelihood tree searches.
[Approximations; maximum likelihood; parsimony; tree searching.]


Syst. Biol. 47(1):90-124, 1998

Morphology, molecules, and the phylogenetics of cetaceans

Sharon L. Messenger 1, 3 and Jimmy A. McGuire 1, 2

1 Department of Zoology, University of Texas, Austin, Texas 78712-1064, USA;

2 Texas Memorial Museum, University of Texas, Austin, Texas 78705, USA;
E-mail: jmcguire@mail.utexas.edu

Abstract.---Recent phylogenetic analyses of cetacean relationships based on DNA sequence data have challenged the traditional view that baleen whales (Mysticeti) and toothed whales (Odontoceti) are each monophyletic, arguing instead that baleen whales are the sister group of the odontocete family Physeteridae (sperm whales). We reexamined this issue in light of a morphological data set composed of 207 characters and molecular data sets of published 12S, 16S, and cytochrome b mitochondrial DNA sequences. We reach four primary conclusions: (1) Our morphological data set strongly supports the traditional view of odontocete monophyly; (2) the unrooted molecular and morphological trees are very similar, and most of the conflict results from alternative rooting positions; (3) the rooting position of the molecular tree is sensitive to choice of artiodactyl outgroup taxa and the treatment of two small but ambiguously aligned regions of the 12S and 16S sequences, whereas the morphological root is strongly supported; and (4) combined analyses of the morphological and molecular data provide a well-supported phylogenetic estimate consistent with that based on the morphological data alone (and the traditional view of toothed-whale monophyly) but with increased bootstrap support at nearly every node of the tree.
[Cetacea, DNA sequences, likelihood-ratio test, molecular clock, morphology, Mysticeti, Odontoceti, partition homogeneity test, phylogeny, Templeton test.]

3 Present Address: Centers for Disease Control and Prevention, Division of Viral and Rickettsial Diseases, MS G-33, 1600 Clifton Road, Atlanta, Georgia 30333, USA;
E-mail: sum4@cdc.gov


Syst. Biol. 47(1):125-133, 1998

On the best evolutionary rate for phylogenetic analysis

Ziheng Yang 1

Department of Integrative Biology, University of California, Berkeley, California 94720-3140, USA

Abstract.---The effect of the evolutionary rate of a gene on the accuracy of phylogeny reconstruction was examined by computer simulation. The evolutionary rate is measured by the tree length, that is, the expected total number of nucleotide substitutions per site on the phylogeny. DNA sequence data were simulated using both fixed trees with specified branch lengths and random trees with branch lengths generated from a model of cladogenesis. The parsimony and likelihood methods were used for phylogeny reconstruction, and the proportion of correctly recovered branch partitions by each method was estimated. Phylogenetic methods including parsimony appear quite tolerant of multiple substitutions at the same site. The optimum levels of sequence divergence were even higher than upper limits previously suggested for saturation of substitutions, indicating that the problem of saturation may have been exaggerated. Instead, the lack of information at low levels of divergence should be seriously considered in evaluation of a gene's phylogenetic utility, especially when the gene sequence is short. The performance of parsimony, relative to that of likelihood, does not necessarily decrease with the increase of the evolutionary rate.
[Branch lengths; homoplasy; likelihood; parsimony; phylogeny; optimum evolutionary rate; saturation; simulation.]

1 Present address: Department of Biology, University College London, Galton Lab, 4 Stephenson Way, London NW1 2HE, England.
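
The definition of evolutionary rate in the abstract above (tree length as the expected total number of substitutions per site) amounts to summing branch lengths. The sketch below shows that sum and, as an aside on saturation, the expected proportion of differing sites between two sequences separated by distance d. The second function assumes the Jukes-Cantor (JC69) model, which the abstract does not name; it is used here only because its formula is simple. Function names and branch lengths are hypothetical.

```python
import math


def tree_length(branch_lengths):
    """Expected substitutions per site summed over all branches of the tree."""
    return sum(branch_lengths)


def jc69_expected_p(d):
    """Expected proportion of differing sites at distance d under JC69 (assumed model)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# Invented unrooted four-taxon tree: four terminal branches, one internal branch.
length = tree_length([0.3, 0.1, 0.3, 0.1, 0.05])

# Observed differences level off towards 0.75 as d grows, which is the
# saturation effect the abstract refers to.
curve = [jc69_expected_p(d) for d in (0.0, 0.1, 0.5, 1.0, 2.0, 5.0)]
```

Under this assumed model the observed difference rises almost linearly at small d and approaches 0.75 at large d, so very low divergence gives few differences to work with and very high divergence gives differences that no longer track d.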