The asymptotic behavior of bootstrap support values in molecular phylogenetics

2020 ◽  
Author(s):  
Jun Huang ◽  
Yuting Liu ◽  
Tianqi Zhu ◽  
Ziheng Yang

The phylogenetic bootstrap is the most commonly used method for assessing statistical confidence in estimated phylogenies by non-Bayesian methods such as maximum parsimony and maximum likelihood (ML). It is observed that bootstrap support tends to be high in large genomic datasets whether or not the inferred trees and clades are correct. Here we study the asymptotic behavior of bootstrap support for the ML tree in large datasets when the competing phylogenetic trees are equally right or equally wrong. We consider phylogenetic reconstruction as a problem of statistical model selection when the compared models are nonnested and misspecified. The bootstrap is found to have qualitatively different dynamics from Bayesian inference, and does not exhibit the polarized behavior of posterior model probabilities, consistent with the empirical observation that the bootstrap is more conservative than Bayesian probabilities. Nevertheless, bootstrap support similarly shows fluctuations among large datasets, with no convergence to a point value, when the compared models are equally right or equally wrong. Thus, in large datasets, strong support for wrong trees or models is likely to occur. Our analysis provides a partial explanation for the high bootstrap support values for incorrect clades observed in empirical data analysis.
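As a rough illustration of what a bootstrap support value measures, the sketch below resamples alignment columns with replacement and counts how often a tree is preferred. It is not the authors' analysis: the `prefers_tree` callback is a hypothetical stand-in for whatever tree inference (for example an ML search) would be run on each pseudo-replicate, and the function names are assumptions made for this example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by sampling alignment columns with replacement.

    `alignment` maps taxon name -> aligned sequence (all of equal length).
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def bootstrap_support(alignment, prefers_tree, n_replicates=1000, seed=0):
    """Fraction of pseudo-replicates for which `prefers_tree` returns True.

    `prefers_tree` is a hypothetical callback: it takes a resampled
    alignment and reports whether the tree (or clade) of interest is
    recovered from it.
    """
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_replicates)
        if prefers_tree(bootstrap_alignment(alignment, rng))
    )
    return hits / n_replicates
```

Because every replicate is drawn from the same observed sites, the support value reflects sampling variation in the data at hand and says nothing directly about whether the underlying model is correct, which is the situation the abstract examines.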

Development ◽  
1994 ◽  
Vol 1994 (Supplement) ◽  
pp. 15-25
Author(s):  
Hervé Philippe ◽  
Anne Chenuil ◽  
André Adoutte

Most of the major invertebrate phyla appear in the fossil record during a relatively short time interval, not exceeding 20 million years (Myr), 540-520 Myr ago. This rapid diversification is known as the 'Cambrian explosion'. In the present paper, we ask whether molecular phylogenetic reconstruction provides confirmation for such an evolutionary burst. The expectation is that the molecular phylogenetic trees should take the form of a large unresolved multifurcation of the various animal lineages. Complete 18S rRNA sequences of 69 extant representatives of 15 animal phyla were obtained from data banks. After eliminating a major source of artefact leading to lack of resolution in phylogenetic trees (mutational saturation of sequences), we indeed observe that the major lines of triploblast coelomates (arthropods, molluscs, echinoderms, chordates...) are very poorly resolved, i.e. the nodes defining the various clades are not supported by high bootstrap values. Using a previously developed procedure consisting of calculating bootstrap proportions of each node of the tree as a function of increasing amount of nucleotides (Lecointre, G., Philippe, H., Le, H. L. V. and Le Guyader, H. (1994) Mol. Phyl. Evol., in press), we obtain a more informative indication of the robustness of each node. In addition, this procedure allows us to estimate the number of additional nucleotides that would be required to resolve confidently the currently uncertain nodes; this number turns out to be extremely high and experimentally unfeasible. We then take this approach one step further: using parameters derived from the above analysis, assuming a molecular clock and using palaeontological dates for calibration, we establish a relationship between the number of sites contained in a given data set and the time interval that this data set can confidently resolve (with 95% bootstrap support). Under these assumptions, the presently available 18S rRNA database cannot confidently resolve cladogenetic events separated by less than about 40 Myr. Thus, at the present time, the potential resolution by the palaeontological approach is higher than that by the molecular one.


2020 ◽  
Vol 6 (2) ◽  
Author(s):  
Kaat Ramaekers ◽  
Annabel Rector ◽  
Lize Cuypers ◽  
Philippe Lemey ◽  
Els Keyaerts ◽  
...  

Since the first human respiratory syncytial virus (HRSV) genotype classification in 1998, inconsistent conclusions have been drawn regarding the criteria that define HRSV genotypes and their nomenclature, challenging data comparisons between research groups. In this study, we aim to unify the field of HRSV genotype classification by reviewing the different methods that have been used in the past to define HRSV genotypes and by proposing a new classification procedure, based on well-established phylogenetic methods. All available complete HRSV genomes (>12,000 bp) were downloaded from GenBank and divided into the two subgroups: HRSV-A and HRSV-B. From whole-genome alignments, the regions that correspond to the open reading frame of the glycoprotein G and the second hypervariable region (HVR2) of the ectodomain were extracted. In the resulting partial alignments, the phylogenetic signal within each fragment was assessed. Maximum likelihood phylogenetic trees were reconstructed using the complete genome alignments. Patristic distances were calculated between all pairs of tips in the phylogenetic tree and summarized as a density plot in order to determine a cutoff value at the lowest point following the major distance peak. Our data show that neither the HVR2 fragment nor the G gene contains sufficient phylogenetic signal to perform reliable phylogenetic reconstruction. Therefore, whole-genome alignments were used to determine HRSV genotypes. We define a genotype using the following criteria: a bootstrap support of ≥70 per cent for the respective clade and a maximum patristic distance between all members of the clade of ≤0.018 substitutions per site for HRSV-A or ≤0.026 substitutions per site for HRSV-B. By applying this definition, we distinguish twenty-three genotypes within subtype HRSV-A and six genotypes within subtype HRSV-B. Applying the genotype criteria to subsampled data sets confirmed the robustness of the method.
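The two genotype criteria stated above (bootstrap support and maximum pairwise patristic distance) can be written as a short check. This is a minimal sketch, not the authors' pipeline: only the thresholds come from the abstract, while the function name, the `distance` callback and the way clade members are passed in are assumptions made for illustration.

```python
from itertools import combinations

# Thresholds quoted in the abstract (substitutions per site, per cent).
MAX_PATRISTIC = {"HRSV-A": 0.018, "HRSV-B": 0.026}
MIN_BOOTSTRAP = 70.0


def qualifies_as_genotype(clade_tips, bootstrap, distance, subgroup):
    """Return True if a clade meets both genotype criteria.

    clade_tips: names of the tips in the clade.
    bootstrap:  bootstrap support of the clade, in per cent.
    distance:   hypothetical callback giving the patristic distance
                between two tip names (assumed precomputed from the tree).
    subgroup:   "HRSV-A" or "HRSV-B".
    """
    if bootstrap < MIN_BOOTSTRAP:
        return False
    limit = MAX_PATRISTIC[subgroup]
    # Every pair of clade members must lie within the distance limit.
    return all(distance(a, b) <= limit for a, b in combinations(clade_tips, 2))
```

In practice the distances would be read from the maximum likelihood tree; here they are supplied by the caller so the check stays independent of any particular tree library.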


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5321 ◽  
Author(s):  
Xuhua Xia

Missing data are frequently encountered in molecular phylogenetics, but there has been no accurate distance imputation method available for distance-based phylogenetic reconstruction. The general framework for distance imputation is to explore tree space and distance values to find an optimal combination of output tree and imputed distances. Here I develop a least-squares method coupled with multivariate optimization to impute multiple missing distances in a distance matrix or from a set of aligned sequences with missing genes so that some sequences share no homologous sites (whose distances therefore need to be imputed). I show that phylogenetic trees can be inferred from distance matrices with about 10% of distances missing, and the accuracy of the resulting phylogenetic tree is almost as good as the tree from full information. The new method has the advantage over a recently published one in that it does not assume a molecular clock and is more accurate (comparable to maximum likelihood method based on simulated sequences). I have implemented the function in DAMBE software, which is freely available at http://dambe.bio.uottawa.ca.


Insects ◽  
2021 ◽  
Vol 12 (8) ◽  
pp. 668
Author(s):  
Tinghao Yu ◽  
Yalin Zhang

More studies are using mitochondrial genomes of insects to explore the sequence variability, evolutionary traits, monophyly of groups and phylogenetic relationships. Controversies remain on the classification of the Mileewinae and the phylogenetic relationships between Mileewinae and other subfamilies remain ambiguous. In this study, we present two newly completed mitogenomes of Mileewinae (Mileewa rufivena Cai and Kuoh 1997 and Ujna puerana Yang and Meng 2010) and conduct comparative mitogenomic analyses based on several different factors. These species have quite similar features, including their nucleotide content, codon usage of protein genes and the secondary structure of tRNA. Gene arrangement is identical and conserved, the same as the putative ancestral pattern of insects. All protein-coding genes of U. puerana began with the start codon ATN, while 5 Mileewa species had the abnormal initiation codon TTG in ND5 and ATP8. Moreover, M. rufivena had an intergenic spacer of 17 bp that could not be found in other mileewine species. Phylogenetic analysis based on three datasets (PCG123, PCG12 and AA) with two methods (maximum likelihood and Bayesian inference) recovered the Mileewinae as a monophyletic group with strong support values. All results in our study indicate that Mileewinae has a closer phylogenetic relationship to Typhlocybinae compared to Cicadellinae. Additionally, six species within Mileewini revealed the relationship (U. puerana + (M. ponta + (M. rufivena + M. alara) + (M. albovittata + M. margheritae))) in most of our phylogenetic trees. These results contribute to the study of the taxonomic status and phylogenetic relationships of Mileewinae.


2018 ◽  
Vol 115 (8) ◽  
pp. 1854-1859 ◽  
Author(s):  
Ziheng Yang ◽  
Tianqi Zhu

The Bayesian method is noted to produce spuriously high posterior probabilities for phylogenetic trees in analysis of large datasets, but the precise reasons for this overconfidence are unknown. In general, the performance of Bayesian selection of misspecified models is poorly understood, even though this is of great scientific interest since models are never true in real data analysis. Here we characterize the asymptotic behavior of Bayesian model selection and show that when the competing models are equally wrong, Bayesian model selection exhibits surprising and polarized behaviors in large datasets, supporting one model with full force while rejecting the others. If one model is slightly less wrong than the other, the less wrong model will eventually win when the amount of data increases, but the method may become overconfident before it becomes reliable. We suggest that this extreme behavior may be a major factor for the spuriously high posterior probabilities for evolutionary trees. The philosophical implications of our results to the application of Bayesian model selection to evaluate opposing scientific hypotheses are yet to be explored, as are the behaviors of non-Bayesian methods in similar situations.
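The polarized behavior described above can be seen in a toy setting. The sketch below is an illustration chosen for this listing, not the models analyzed in the paper: data are drawn from a standard normal, and two equally wrong models, N(-mu, 1) and N(+mu, 1), are compared with equal prior probabilities. All names and parameter values are assumptions.

```python
import math
import random


def log_likelihood(xs, mean):
    """Log-likelihood of xs under N(mean, 1), dropping the constant term
    (it cancels when two such models are compared)."""
    return -0.5 * sum((x - mean) ** 2 for x in xs)


def posterior_model_1(xs, mu):
    """Posterior probability of model 1, N(-mu, 1), against model 2,
    N(+mu, 1), under equal prior probabilities."""
    diff = log_likelihood(xs, -mu) - log_likelihood(xs, mu)
    # Numerically stable logistic function of the log-likelihood difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def simulate(n, mu=0.5, n_datasets=5, seed=1):
    """Posterior of model 1 for several datasets of size n drawn from N(0, 1)."""
    rng = random.Random(seed)
    return [
        posterior_model_1([rng.gauss(0.0, 1.0) for _ in range(n)], mu)
        for _ in range(n_datasets)
    ]
```

In this toy case the log-likelihood difference equals -2·mu·Σx, whose variance grows linearly with the sample size, so calling `simulate` with increasing `n` tends to give posteriors that sit near 0 or 1 rather than near 1/2, even though neither model is closer to the truth.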


2019 ◽  
Vol 35 (19) ◽  
pp. 3608-3616
Author(s):  
Ashley A Superson ◽  
Doug Phelan ◽  
Allyson Dekovich ◽  
Fabia U Battistuzzi

Motivation: The promise of higher phylogenetic stability through increased dataset sizes within tree of life (TOL) reconstructions has not been fulfilled. Among the many possible causes are changes in species composition (taxon sampling) that could influence phylogenetic accuracy of the methods by altering the relative weight of the evolutionary histories of each individual species. This effect would be stronger in clades that are represented by few lineages, which is common in many prokaryote phyla. Indeed, phyla with fewer taxa showed the most discordance among recent TOL studies. We implemented an approach to systematically test how the identity of taxa among a larger dataset and the number of taxa included affected the accuracy of phylogenetic reconstruction. Results: Utilizing an empirical dataset within Terrabacteria we found that even within scenarios consisting of the same number of taxa, the species used strongly affected phylogenetic stability. Furthermore, we found that trees with fewer species were more dissimilar to the tree produced from the full dataset. These results hold even when the tree is composed of many phyla and only one of them is being altered. Thus, the effect of taxon sampling in one group does not seem to be buffered by the presence of many other clades, making this issue relevant even to very large datasets. Our results suggest that a systematic evaluation of phylogenetic stability through taxon resampling is advisable even for very large datasets. Availability and implementation: https://github.com/BlabOaklandU/PATS.git. Supplementary information: Supplementary data are available at Bioinformatics online.


Molecules ◽  
2019 ◽  
Vol 24 (2) ◽  
pp. 261 ◽  
Author(s):  
Yongfu Li ◽  
Steven Paul Sylvester ◽  
Meng Li ◽  
Cheng Zhang ◽  
Xuan Li ◽  
...  

Magnolia zenii is a critically endangered species known from only 18 trees that survive on Baohua Mountain in Jiangsu province, China. Little information is available regarding its molecular biology, with no genomic study performed on M. zenii until now. We determined the complete plastid genome of M. zenii and identified microsatellites. Whole sequence alignment and phylogenetic analysis using BI and ML methods were also conducted. The plastome of M. zenii was 160,048 bp long with 39.2% GC content and included a pair of inverted repeats (IRs) of 26,596 bp that separated a large single-copy (LSC) region of 88,098 bp and a small single-copy (SSC) region of 18,757 bp. One hundred thirty genes were identified, of which 79 were protein-coding genes, 37 were transfer RNAs, and eight were ribosomal RNAs. Thirty-seven simple sequence repeats (SSRs) were also identified. Comparative analyses of genome structure and sequence data of closely related species revealed five mutation hotspots, useful for future phylogenetic research. Magnolia zenii was placed as sister to M. biondii with strong support in all analyses. Overall, the genomic resources for M. zenii provided by this study will benefit evolutionary studies and phylogenetic reconstruction of Magnoliaceae.


Diversity ◽  
2020 ◽  
Vol 12 (8) ◽  
pp. 288
Author(s):  
Nuria Macías-Hernández ◽  
Marc Domènech ◽  
Pedro Cardoso ◽  
Brent C. Emerson ◽  
Paulo Alexandre Vieira Borges ◽  
...  

Phylogenetic relatedness is a key diversity measure for the analysis and understanding of how species and communities evolve across time and space. Understanding the nonrandom loss of species with respect to phylogeny is also essential for better-informed conservation decisions. However, several factors are known to influence phylogenetic reconstruction and, ultimately, phylogenetic diversity metrics. In this study, we empirically tested how some of these factors (topological constraint, taxon sampling, genetic markers and calibration) affect phylogenetic resolution and uncertainty. We built a densely sampled, species-level phylogenetic tree for spiders, combining Sanger sequencing of species from local communities of two biogeographical regions (Iberian Peninsula and Macaronesia) with a taxon-rich backbone matrix of GenBank sequences and a topological constraint derived from recent phylogenomic studies. The resulting tree constitutes the most complete spider phylogeny to date, both in terms of terminals and background information, and may serve as a standard reference for the analysis of phylogenetic diversity patterns at the community level. We then used this tree to investigate how partial data affect phylogenetic reconstruction, phylogenetic diversity estimates and their rankings, and, ultimately, the ecological processes inferred for each community. We found that adding a single slowly evolving marker (28S) to the DNA barcode sequences from local communities had the highest impact on tree topology, closely followed by the use of a backbone matrix. The increase in missing data resulting from combining partial sequences from local communities only had a moderate impact on the resulting trees, similar to the difference observed when using topological constraints. Our study further revealed substantial differences in both the phylogenetic structure and diversity rankings of the analyzed communities estimated from the different phylogenetic treatments, especially when using non-ultrametric trees (phylograms) instead of time-stamped trees (chronograms). Finally, we provide some recommendations on reconstructing phylogenetic trees to infer phylogenetic diversity within ecological studies.


1990 ◽  
Vol 68 (7) ◽  
pp. 1433-1440 ◽  
Author(s):  
William J. Crins

Few estimates of phylogenetic relationship below the sectional level have been proposed within the genus Carex (Cyperaceae). The reasons for this include (1) poorly circumscribed sections (paraphyletic and polyphyletic), (2) uncertain relationships among sections, and (3) difficulty in objectively assessing character state polarities. The operational difficulties posed by points 2 and 3 can be overcome through the use of character compatibility analysis, if it can be demonstrated that the section under study is monophyletic (point 1). This technique enables the investigator to generate hypotheses of relationship while minimizing the number of prior assumptions. Hypotheses of phylogenetic relationship are presented for the taxa within Carex sections Phyllostachyae, Limosae, and Ceratocystis. The topologies of these unrooted networks are assessed using external data sets (chromosome numbers, etc.) that serve as tests of the hypotheses, and may allow for a posteriori determination of character state polarities. In sections Ceratocystis and Limosae, these analyses provide strong support for the notion that chromosome evolution in Carex proceeds in a linear stepwise fashion. The results for section Phyllostachyae contradict this notion. Synthesis of all available data, coupled with phylogenetic reconstruction, will enable caricologists to provide more convincing arguments about the nature, direction, and factors influencing character state change.


Phytotaxa ◽  
2018 ◽  
Vol 360 (3) ◽  
pp. 220 ◽  
Author(s):  
ZAHRA ARABI ◽  
FARROKH GHAHREMANINEJAD ◽  
RICHARD K. RABELER ◽  
IRINA SOKOLOVA ◽  
GÜNTHER HEUBL ◽  
...  

The status of the genus Dichodon has long been debated, and its taxonomic position in tribe Alsineae has changed over time from a section or subgenus in Cerastium to a genus sister to Holosteum. This group comprises important members of wet meadows in alpine and subalpine vegetation of Europe, arctic regions, and SW-Asia plus one species known as a weed in N-America, and a further one occurring in mountains of Taiwan. In order to clarify the taxonomic questions concerning this group and its species delimitation, we constructed phylogenetic trees, selecting several species belonging to tribe Alsineae as representatives of major lineages of this tribe as well as several accessions of Dichodon. Morphological studies focused more intensively on members of Dichodon using herbarium specimens and direct field examinations. The results confirm those of recent molecular phylogenetic studies, indicating Dichodon as a monophyletic genus sister to Holosteum and not Cerastium. In addition, the obtained cladograms support five distinct groups in Dichodon corresponding to five species of this genus we recognize in Iran, the focal area of this study. Seed micromorphology provides strong support for the recognition of Dichodon as a separate genus, but it is not informative at species and subspecies ranks due to constancy of most seed characters within the genus. As part of this study, a new species, Dichodon alborzensis, is described; D. kotschyi is reported in Iran for the first time; and Cerastium schischkinii is placed in synonymy (new synonymy) under D. kotschyi.

