Abstract.---The effects on phylogenetic accuracy of adding characters and/or
taxa were explored using data generated by computer simulation. The conditions of this
study were constrained but allowed for systematic investigation of certain parameters.
The starting point for the study was a four-taxon tree in the "Felsenstein zone,"
representing a difficult phylogenetic problem with an extreme situation of long branch
attraction. Taxa were added sequentially to this tree in a manner specifically designed
to break up the long branches, and for each tree data matrices of different sizes were
simulated. Phylogenetic trees were reconstructed from these data using the criteria of
parsimony and maximum likelihood. Phylogenetic accuracy was measured in three ways: (1)
proportion of trees that are completely correct, (2) proportion of correctly reconstructed
branches in all trees, and (3) proportion of trees in which the original four-taxon
statement is correctly reconstructed. Accuracy improved dramatically with the addition of
taxa and much more slowly with the addition of characters. If taxa can be added to break up
long branches, it is far preferable to add taxa than to add characters.
[Long branch attraction, parsimony, phylogenetic reconstruction, simulation, taxon sampling.]
1 Present address: Department of Zoology, Field Museum of Natural History, Roosevelt
Road and Lake Shore Drive, Chicago, Illinois 60605, USA. E-mail: graybeal@fmppr.fmnh.org.
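The three accuracy measures listed in the abstract above can be sketched in a few lines. This is only an illustration, not the author's code: it assumes trees are written as nested tuples of taxon names, treats each internal branch as a bipartition of the taxa, and uses hypothetical function names (`splits`, `induced_quartet`, `accuracy_measures`) and a made-up five-taxon example.

```python
from itertools import chain


def leaves(tree):
    """Return the frozenset of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset(chain.from_iterable(leaves(child) for child in tree))


def splits(tree):
    """Return the internal branches of an unrooted tree as bipartitions.

    Each bipartition is stored as the side that lacks a fixed reference
    taxon, so the same branch always compares equal.
    """
    all_taxa = leaves(tree)
    ref = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if ref in clade else clade
        # Keep only branches with at least two taxa on each side.
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def induced_quartet(tree, quartet):
    """Return the 2|2 grouping the tree implies for four chosen taxa."""
    quartet = frozenset(quartet)
    ref = min(quartet)
    for side in splits(tree):
        inside = side & quartet
        if len(inside) == 2:
            return quartet - inside if ref in inside else inside
    return None  # the four taxa are unresolved in this tree


def accuracy_measures(true_tree, estimated_trees, quartet):
    """Return (whole-tree, per-branch, four-taxon) accuracy proportions."""
    true_splits = splits(true_tree)
    true_quartet = induced_quartet(true_tree, quartet)
    n = len(estimated_trees)
    whole = shared = four = 0
    for tree in estimated_trees:
        est = splits(tree)
        whole += est == true_splits           # measure (1)
        shared += len(est & true_splits)      # measure (2)
        four += induced_quartet(tree, quartet) == true_quartet  # measure (3)
    return whole / n, shared / (n * len(true_splits)), four / n


# Hypothetical example: one true tree and four estimated trees.
true_tree = (("A", "B"), ("C", ("D", "E")))
estimates = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "B"), ("D", ("C", "E"))),
    (("A", "C"), ("B", ("D", "E"))),
    (("A", "D"), ("B", ("C", "E"))),
]
print(accuracy_measures(true_tree, estimates, ("A", "B", "C", "D")))
# prints (0.25, 0.5, 0.5)
```

In this toy example the four-taxon statement is recovered more often than the whole tree, which shows why the three measures need not agree.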
Abstract.---Recent studies have shown that addition or deletion of taxa from a data
matrix can change the estimate of phylogeny. I used twenty-nine data sets from the
literature to examine the effect of taxon sampling on phylogeny estimation within data
sets. I then used multiple regression to assess the effect of number of taxa, number of
characters, homoplasy, strength of support, and tree symmetry on the sensitivity of data
sets to taxonomic sampling. Sensitivity to sampling was measured by mapping characters from
a matrix of culled taxa onto optimal trees for that reduced matrix and onto the pruned optimal
tree for the entire matrix, then comparing the length of the reduced tree to the length
of the pruned complete tree. Within-data-set patterns can be described by a second-order
equation relating fraction of taxa sampled to sensitivity to sampling. Multiple regression
analyses found number of taxa to be a significant predictor of sensitivity to sampling;
retention index, number of informative characters, total support index, and tree symmetry
were nonsignificant predictors. I derived a predictive regression equation relating fraction
of taxa sampled and number of taxa potentially sampled to sensitivity to taxonomic sampling
and calculated values for this equation within the bounds of the variables examined.
The length difference between the complete tree and a subsampled tree was generally small
(average difference of 0-2.9 steps), indicating that subsampling taxa is probably not an
important problem for most phylogenetic analyses using up to 20 taxa.
[Taxonomic sampling; phylogeny estimation; multiple regression; modeling]
1 Address until November 1, 1998: Division of Amphibians and Reptiles, National Museum of
Natural History, Smithsonian Institution, Washington, DC 20560.
Abstract.---To explore the feasibility of parsimony analysis for large data
sets, we conducted heuristic parsimony searches and bootstrap analyses on
separate and combined DNA data sets for 190 angiosperms and three outgroups.
Separate data sets of 18S rDNA (1,855 bp), rbcL (1,428 bp), and atpB (1,450 bp)
sequences were combined into a single matrix 4,733 bp in length. Analyses of the
combined data set show great improvements in computer run times compared to those
of the separate data sets and of the data sets combined in pairs. Six searches
of the 18S rDNA + rbcL + atpB data set were conducted; in all cases TBR branch
swapping was completed, generally within a few days. In contrast, TBR branch
swapping was not completed for any of the three separate data sets, or for the
pairwise combined data sets. These results illustrate that it is possible to
conduct a thorough search of tree space with large data sets, given sufficient
signal. In this case, and probably most others, sufficient signal for a large
number of taxa can only be obtained by combining data sets. The combined data
sets also have higher internal support for clades than the separate data sets,
and more clades receive bootstrap support of >50% in the combined analysis than
in analyses of the separate data sets. These data suggest that one solution to
the computational and analytical dilemmas posed by large data sets is the
addition of nucleotides, as well as taxa.
[Large data sets, parsimony, phylogeny.]
Abstract.---Performance measures of phylogenetic estimation methods such as accuracy,
consistency, and power are an attempt at summarizing an ensemble of a given estimator's
behavior. These summaries characterize an ensemble behavior with a single number, leading
to a variety of definitions. In particular, the relationship between different performance
measures such as accuracy and consistency or accuracy and error depends on the exact
definition of these measures. In addition, it is relatively common to use large-sample
behavior to infer similar behavior for small samples. In fact, large-sample results such
as the claimed asymptotic efficiency of the maximum likelihood estimator are often
uninformative for small samples. Conversely, small-sample behavior using simulations is
sometimes used to imply large-sample behavior such as consistency. However, such
extrapolation is often difficult. How the performance of a phylogenetic estimator scales
with the addition of taxa must be qualified with respect to whether the whole tree is being
estimated or a fixed subset of taxa is being estimated. It must also be qualified with
respect to how tree models are sampled. Over the ensemble of all possible trees of a given
size, the performance of the estimators for the whole tree estimate suffers when the tree
size becomes larger. However, under certain models of cladogenesis, the estimate can
improve with the addition of taxa. In fact, at all numbers of taxa there are subsets
of tree models that are easier to estimate than others. This suggests that with judicious
addition or subtraction of taxa we can move from tree models that are more difficult to
estimate at one number of taxa to those that are easier to estimate at another number
of taxa.
[Accuracy; consistency; efficiency; large-scale phylogeny; performance]
Abstract.---Analyses of both the nucleotide and amino acid sequences derived
from all 13 mitochondrial protein-encoding genes (12,234 bp) of 19 metazoan
species, including that of the lancelet Branchiostoma floridae ("amphioxus"),
fail to yield the widely accepted phylogeny for chordates and, within chordates,
for vertebrates. Given the breadth and the compelling nature of the data
supporting that phylogeny, relationships supported by the mitochondrial sequence
comparisons are almost certainly incorrect, despite their being supported by
equally weighted parsimony, distance and maximum likelihood analyses. The
incorrect groupings probably result in part from convergent base-compositional
similarities among some of the taxa, similarities that are strong enough to
overwhelm the historical signal. Comparisons among very distantly related taxa
are likely to be particularly susceptible to such artifacts, because the
historical signal is already greatly attenuated. Empirical results underscore
the need for approaches to phylogenetic inference that go beyond simple
site-by-site comparison of aligned sequences. This study and others indicate
that, once a sequence sample of reasonable size has been obtained, accurate
phylogenetic estimation may be better served by incorporating knowledge of
molecular structures and processes into inference models and by seeking
additional higher order characters embedded in those sequences, than by gathering
ever larger sequence samples from the same organisms in the hope that the
historical signal will eventually prevail.
[Amphioxus; chordate phylogeny; homoplasy; mtDNA; molecular systematics;
phylogenetic inference]
Abstract.---We have developed a rapid parsimony method for reconstructing
ancestral nucleotide states that allows calculation of initial branch lengths
that are good approximations to optimal maximum likelihood estimates under
several commonly used substitution models. Use of these approximate branch
lengths (rather than fixed arbitrary values) as starting points significantly
reduces the time required for iteration to a solution that maximizes the
likelihood of a tree. These branch lengths are close enough to the optimal
values that they can be used without further iteration to calculate approximate
maximum likelihood scores that are very close to the "exact" scores found by
iteration. Several strategies are described for using these approximate scores
to substantially reduce times needed for maximum likelihood tree searches.
[Approximations; maximum likelihood; parsimony; tree searching.]
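As background for the parsimony step mentioned in the abstract above, the following sketch shows the standard Fitch down-pass, which assigns candidate ancestral state sets and counts the minimum number of changes at one site. It is not the authors' rapid method and does not compute branch lengths; it assumes a rooted binary tree written as nested tuples, and the names `fitch_site` and `parsimony_length` are hypothetical.

```python
def fitch_site(tree, states):
    """Fitch down-pass for one site on a rooted binary tree.

    `tree` is a taxon name or a pair (left, right); `states` maps each
    taxon name to its nucleotide at this site. Returns the candidate
    state set at the root of `tree` and the minimum number of changes.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no change needed here.
        return common, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the Fitch change counts over all sites of an alignment.

    `alignment` maps taxon names to equal-length sequences.
    """
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        site = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total


# Hypothetical four-taxon example with two sites.
tree = (("A", "B"), ("C", "D"))
alignment = {"A": "AA", "B": "AG", "C": "GA", "D": "GG"}
print(parsimony_length(tree, alignment))  # prints 3
```

The first site fits the tree with one change and the second needs two, giving a total length of three steps.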
Abstract.---Recent phylogenetic analyses of cetacean relationships based on
DNA sequence data have challenged the traditional view that baleen whales
(Mysticeti) and toothed whales (Odontoceti) are each monophyletic, arguing
instead that baleen whales are the sister group of the odontocete family
Physeteridae (sperm whales). We reexamined this issue in light of a
morphological data set composed of 207 characters and molecular data sets of
published 12S, 16S, and cytochrome b mitochondrial DNA sequences. We reach four
primary conclusions: (1) Our morphological data set strongly supports the
traditional view of odontocete monophyly; (2) the unrooted molecular and
morphological trees are very similar, and most of the conflict results from
alternative rooting positions; (3) the rooting position of the molecular tree is
sensitive to choice of artiodactyl outgroup taxa and the treatment of two small
but ambiguously aligned regions of the 12S and 16S sequences, whereas the
morphological root is strongly supported; and (4) combined analyses of the
morphological and molecular data provide a well-supported phylogenetic estimate
consistent with that based on the morphological data alone (and the traditional
view of toothed-whale monophyly) but with increased bootstrap support at nearly
every node of the tree.
[Cetacea, DNA sequences, likelihood-ratio test, molecular clock, morphology,
Mysticeti, Odontoceti, partition homogeneity test, phylogeny, Templeton test.]
3 Present Address: Centers for Disease Control and Prevention, Division of Viral and Rickettsial Diseases, MS G-33, 1600 Clifton Road, Atlanta, Georgia 30333, USA;
E-mail: sum4@cdc.gov
Abstract.---The effect of the evolutionary rate of a gene on the accuracy of
phylogeny reconstruction was examined by computer simulation. The evolutionary
rate is measured by the tree length, that is, the expected total number of
nucleotide substitutions per site on the phylogeny. DNA sequence data were
simulated using both fixed trees with specified branch lengths and random trees
with branch lengths generated from a model of cladogenesis. The parsimony and
likelihood methods were used for phylogeny reconstruction, and the proportion of
correctly recovered branch partitions by each method was estimated. Phylogenetic
methods including parsimony appear quite tolerant of multiple substitutions at
the same site. The optimum levels of sequence divergence were even higher than
upper limits previously suggested for saturation of substitutions, indicating
that the problem of saturation may have been exaggerated. Instead, the lack of
information at low levels of divergence should be seriously considered in
evaluation of a gene's phylogenetic utility, especially when the gene sequence
is short. The performance of parsimony, relative to that of likelihood, does not
necessarily decrease with the increase of the evolutionary rate.
[Branch lengths; homoplasy; likelihood; parsimony; phylogeny; optimum
evolutionary rate; saturation; simulation.]
1 Present address: Department of Biology, University College London, Galton Lab,
4 Stephenson Way, London NW1 2HE, England.
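The notions of tree length and saturation used in the abstract above can be illustrated with a short calculation. The sketch assumes the Jukes-Cantor (JC69) substitution model, which the abstract does not name, and the function names are hypothetical. Under JC69, two sequences separated by d expected substitutions per site are expected to differ at a proportion p = 3/4 (1 - exp(-4d/3)) of sites.

```python
import math


def tree_length(branch_lengths):
    """Tree length: the sum of all branch lengths, in expected
    substitutions per site."""
    return sum(branch_lengths)


def jc69_p_distance(d):
    """Expected proportion of differing sites under JC69 (an assumed
    model) for a path of d expected substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# The observed difference grows almost linearly for small d and then
# levels off below 0.75, which is what "saturation" refers to.
for d in (0.05, 0.5, 1.0, 2.0, 5.0):
    print(f"d = {d:4.2f}  expected p = {jc69_p_distance(d):.3f}")
```

At small d nearly every substitution is visible as a difference, but few sites vary at all; at large d most additional substitutions fall on sites that have already changed.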