Automation and Evaluation of the SOWH Test with SOWHAT

2014 ◽  
Author(s):  
Samuel H. Church ◽  
Joseph F. Ryan ◽  
Casey W. Dunn

The Swofford-Olsen-Waddell-Hillis (SOWH) test evaluates statistical support for incongruent phylogenetic topologies. It is commonly applied to determine if the maximum likelihood tree in a phylogenetic analysis is significantly different from an alternative hypothesis. The SOWH test compares the observed difference in likelihood between two topologies to a null distribution of differences in likelihood generated by parametric resampling. The test is a well-established phylogenetic method for topology testing, but it is sensitive to model misspecification, it is computationally burdensome to perform, and its implementation requires the investigator to make multiple decisions that each have the potential to affect the outcome of the test. We analyzed the effects of multiple factors using seven datasets to which the SOWH test was previously applied. These factors include bootstrap sample size, likelihood software, the introduction of gaps to simulated data, the use of distinct models of evolution for data simulation and likelihood inference, and a suggested test correction wherein an unresolved "zero-constrained" tree is used to simulate sequence data. In order to facilitate these analyses and future applications of the SOWH test, we wrote SOWHAT, a program that automates the SOWH test. We find that inadequate bootstrap sampling can change the outcome of the SOWH test. The results also show that using a zero-constrained tree for data simulation can result in a wider null distribution and higher p-values, but does not change the outcome of the SOWH test for most datasets. These results will help others implement and evaluate the SOWH test and allow us to provide recommendations for future applications of the SOWH test. SOWHAT is available for download from https://github.com/josephryan/SOWHAT.
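The comparison step described in this abstract can be illustrated with a minimal sketch. It assumes the likelihood differences have already been computed elsewhere: one observed difference between the two topologies, and one difference per dataset simulated by parametric resampling. The function name `sowh_p_value` and the simple proportion used for the p-value are assumptions made for illustration; this is not the SOWHAT implementation.

```python
def sowh_p_value(observed_delta, null_deltas):
    """Fraction of simulated likelihood differences that are at least as
    large as the observed difference (hypothetical helper, not SOWHAT code).

    observed_delta: lnL(best tree) - lnL(constrained tree) on the real data.
    null_deltas:    the same difference computed on each simulated dataset.
    """
    if not null_deltas:
        raise ValueError("at least one simulated difference is required")
    # Count the simulated differences that match or exceed the observed one.
    exceed = sum(1 for delta in null_deltas if delta >= observed_delta)
    return exceed / len(null_deltas)


# Illustrative numbers only, not results from the paper.
p = sowh_p_value(5.0, [1.0, 2.0, 6.0, 7.0])
```

With few simulated datasets the p-value can only take a small number of distinct values, which is consistent with the abstract's finding that inadequate sampling can change the outcome of the test.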

2021 ◽  
Author(s):  
Tae-Kun Seo ◽  
Olivier Gascuel ◽  
Jeffrey L Thorne

Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the Effective Sequence Length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification.


Symmetry ◽  
2021 ◽  
Vol 13 (6) ◽  
pp. 936
Author(s):  
Dan Wang

In this paper, a ratio test based on bootstrap approximation is proposed to detect persistence change in heavy-tailed observations. This paper focuses on the symmetric testing problems of I(1)-to-I(0) and I(0)-to-I(1). On the basis of residual CUSUM, the test statistic is constructed in a ratio form. I derive the null distribution of the test statistic. The consistency under the alternative hypothesis is also discussed. However, the null distribution of the test statistic contains an unknown tail index. To address this challenge, I present a bootstrap approximation method for determining the rejection region of this test. Simulation studies on artificial data are conducted to assess the finite-sample performance, and they show that the proposed method is better than the kernel method in all listed cases. The analysis of real data also demonstrates the excellent performance of this method.


Mammalia ◽  
2019 ◽  
Vol 83 (2) ◽  
pp. 180-189 ◽  
Author(s):  
Adam W. Ferguson ◽  
Houssein R. Roble ◽  
Molly M. McDonough

The molecular phylogeny of extant genets (Carnivora, Viverridae, Genetta) was generated using all species with the exception of the Ethiopian genet Genetta abyssinica. Herein, we provide the first molecular phylogenetic assessment of G. abyssinica using molecular sequence data from multiple mitochondrial genes generated from a recent record of this species from the Forêt du Day (the Day Forest) in Djibouti. This record represents the first verified museum specimen of G. abyssinica collected in over 60 years and the first specimen with a specific locality for the country of Djibouti. Multiple phylogenetic analyses revealed conflicting results as to the exact relationship of G. abyssinica to other Genetta species, providing statistical support for a sister relationship to all other extant genets for only a subset of mitochondrial analyses. Despite the inclusion of this species for the first time, phylogenetic relationships among Genetta species remain unclear, with limited nodal support for many species. In addition to providing an alternative hypothesis of the phylogenetic relationships among extant genets, this recent record provides the first complete skeleton of this species to our knowledge and helps to shed light on the distribution and habitat use of this understudied African small carnivore.


2020 ◽  
Author(s):  
Thibaut Sellinger ◽  
Diala Abu Awad ◽  
Aurélien Tellier

Many methods based on the Sequentially Markovian Coalescent (SMC) have been and are being developed. These methods make use of genome sequence data to uncover population demographic history. More recently, new methods have extended the original theoretical framework, allowing the simultaneous estimation of the demographic history and other biological variables. These methods can be applied to many different species, under different model assumptions, in hopes of unlocking the population/species evolutionary history. Although convergence proofs in particular cases have been given using simulated data, a clear outline of the performance limits of these methods is lacking. We here explore the limits of this methodology and present a tool that can be used to help users quantify what information can be confidently retrieved from given datasets. In addition, we study the consequences for inference accuracy of violating the hypotheses and assumptions of SMC approaches, such as the presence of transposable elements, variable recombination and mutation rates along the sequence, and SNP call errors. We also provide a new interpretation of the SMC through the use of the estimated transition matrix and offer recommendations for the most efficient use of these methods under budget constraints, notably through the building of data sets that would be better adapted for the biological question at hand.


Author(s):  
O. Perrin ◽  
S. Christophe ◽  
F. Jacquinod ◽  
O. Payrastre

We present our contribution to the geovisualization and visual analysis of hydraulic simulation data, based on interdisciplinary research undertaken by researchers in geographic information sciences and in hydraulics. The positive feedback loop between researchers favored the proposal of visualization tools enabling visual reasoning on hydraulic simulated data so as to infer knowledge on the simulation model. We interactively explore and design 2D multi-scale styles to render hydraulic simulated data, in order to support the identification, over large simulation domains, of possible local inconsistencies related to input simulation data, simulation parameters or simulation workflow. Models have been implemented in QGIS and are reusable for other input data and territories.


2020 ◽  
Vol 117 (29) ◽  
pp. 16880-16890 ◽  
Author(s):  
Larry Wasserman ◽  
Aaditya Ramdas ◽  
Sivaraman Balakrishnan

We propose a general method for constructing confidence sets and hypothesis tests that have finite-sample guarantees without regularity conditions. We refer to such procedures as “universal.” The method is very simple and is based on a modified version of the usual likelihood-ratio statistic that we call “the split likelihood-ratio test” (split LRT) statistic. The (limiting) null distribution of the classical likelihood-ratio statistic is often intractable when used to test composite null hypotheses in irregular statistical models. Our method is especially appealing for statistical inference in these complex setups. The method we suggest works for any parametric model and also for some nonparametric models, as long as computing a maximum-likelihood estimator (MLE) is feasible under the null. Canonical examples arise in mixture modeling and shape-constrained inference, for which constructing tests and confidence sets has been notoriously difficult. We also develop various extensions of our basic methods. We show that in settings when computing the MLE is hard, for the purpose of constructing valid tests and intervals, it is sufficient to upper bound the maximum likelihood. We investigate some conditions under which our methods yield valid inferences under model misspecification. Further, the split LRT can be used with profile likelihoods to deal with nuisance parameters, and it can also be run sequentially to yield anytime-valid P values and confidence sequences. Finally, when combined with the method of sieves, it can be used to perform model selection with nested model classes.
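The split likelihood-ratio statistic described in this abstract can be sketched for a deliberately simple case. The example below assumes a Gaussian model with known unit variance and a point null hypothesis on the mean, so the null maximum-likelihood estimate is just the null value itself; a composite null would require maximizing the likelihood over the null set instead. The data are split in two, an estimate is formed on one half, and the likelihood ratio is evaluated on the other half and compared with 1/alpha. All function names are hypothetical and chosen for illustration.

```python
import math
import random


def gaussian_loglik(data, mu):
    """Log-likelihood of data under a normal model with mean mu, variance 1."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 for x in data)


def split_lrt(data, mu_null=0.0, alpha=0.05, seed=0):
    """Split LRT sketch for H0: mean == mu_null (assumed toy setting).

    Returns (reject, log_statistic).
    """
    if len(data) < 2:
        raise ValueError("need at least two observations to split")
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    d0, d1 = shuffled[:half], shuffled[half:]
    # Any estimator may be used on D1; here the sample mean.
    mu_hat_1 = sum(d1) / len(d1)
    # Likelihood ratio evaluated on D0: estimate from D1 versus the null value.
    log_t = gaussian_loglik(d0, mu_hat_1) - gaussian_loglik(d0, mu_null)
    # Reject when the statistic exceeds 1 / alpha (compared on the log scale).
    return log_t > math.log(1.0 / alpha), log_t


reject_far, _ = split_lrt([5.0] * 20)   # data far from the null mean
reject_null, _ = split_lrt([0.0] * 20)  # data exactly at the null mean
```

The threshold 1/alpha needs no asymptotic null distribution, which is the property the abstract highlights for irregular models.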


2019 ◽  
Vol 37 (5) ◽  
pp. 1495-1507 ◽  
Author(s):  
Zhengting Zou ◽  
Hongjiu Zhang ◽  
Yuanfang Guan ◽  
Jianzhi Zhang

Phylogenetic inference is of fundamental importance to evolutionary biology as well as other fields of biology, and molecular sequences have emerged as the primary data for this task. Although many phylogenetic methods have been developed to explicitly take into account substitution models of sequence evolution, such methods could fail due to model misspecification or insufficiency, especially in the face of heterogeneities in substitution processes across sites and among lineages. In this study, we propose to infer topologies of four-taxon trees using deep residual neural networks, a machine learning approach needing no explicit modeling of the subject system and having a record of success in solving complex nonlinear inference problems. We train residual networks on simulated protein sequence data with extensive amino acid substitution heterogeneities. We show that the well-trained residual network predictors can outperform existing state-of-the-art inference methods such as the maximum likelihood method on diverse simulated test data, especially under extensive substitution heterogeneities. Reassuringly, residual network predictors generally agree with existing methods in the trees inferred from real phylogenetic data with known or widely believed topologies. Furthermore, when combined with the quartet puzzling algorithm, residual network predictors can be used to reconstruct trees with more than four taxa. We conclude that deep learning represents a powerful new approach to phylogenetic reconstruction, especially when sequences evolve via heterogeneous substitution processes. We present our best trained predictor in a freely available program named Phylogenetics by Deep Learning (PhyDL, https://gitlab.com/ztzou/phydl; last accessed January 3, 2020).


2020 ◽  
Vol 110 (2) ◽  
pp. 517-525 ◽  
Author(s):  
Miguel A. Redondo ◽  
Jan Stenlid ◽  
Jonàs Oliva

Predicting whether naïve tree populations have the potential to adapt to exotic pathogens is necessary owing to the increasing rate of invasions. Adaptation may occur as a result of natural selection when heritable variation in terms of susceptibility exists in the naïve population. We searched for signs of selection on black alder (Alnus glutinosa) stands growing on riverbanks invaded by two pathogens differing in aggressiveness, namely, Phytophthora uniformis (PU) and Phytophthora × alni (PA). We compared the survival and heritability measures from 72 families originating from six invaded and uninvaded (naïve) sites by performing in vitro inoculations. The results from the inoculations were used to assess the relative contribution of host genetic variation to natural selection. We found putative signs of natural selection on alder exerted by PU but not by PA. For PU, we found a higher survival in families originating from invaded sites compared with uninvaded sites. The narrow-sense heritability of susceptibility to PU of uninvaded populations was significantly higher than to PA. Simulated data supported the role of heritable genetic variation in natural selection and discarded a high aggressiveness of PA decreasing the transmission rate as an alternative hypothesis for a slow natural selection. Our findings expand on previous attempts at using heritability as a predictor for the likelihood of natural adaptation of naïve tree populations to invasive pathogens. Measures of genetic variation can be useful for risk assessment purposes or when managing Phytophthora invasions.


2018 ◽  
Vol 30 (1) ◽  
pp. 216-236
Author(s):  
Rasmus Troelsgaard ◽  
Lars Kai Hansen

Model-based classification of sequence data using a set of hidden Markov models is a well-known technique. The involved score function, which is often based on the class-conditional likelihood, can, however, be computationally demanding, especially for long data sequences. Inspired by recent theoretical advances in spectral learning of hidden Markov models, we propose a score function based on third-order moments. In particular, we propose to use the Kullback-Leibler divergence between theoretical and empirical third-order moments for classification of sequence data with discrete observations. The proposed method provides lower computational complexity at classification time than the usual likelihood-based methods. In order to demonstrate the properties of the proposed method, we perform classification of both simulated data and empirical data from a human activity recognition study.
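The scoring idea in this abstract can be illustrated with a rough sketch under several assumptions that the abstract does not confirm: the third-order moment is taken to be the joint distribution of three consecutive discrete symbols, each class supplies its theoretical moments as a dictionary of triple probabilities, and the divergence is computed from the empirical moments to the theoretical ones. The function names and the small floor `eps` are hypothetical.

```python
import math
from collections import Counter


def empirical_triple_moments(seq):
    """Relative frequencies of consecutive symbol triples in seq."""
    if len(seq) < 3:
        raise ValueError("sequence must contain at least three symbols")
    triples = [tuple(seq[i:i + 3]) for i in range(len(seq) - 2)]
    counts = Counter(triples)
    return {triple: n / len(triples) for triple, n in counts.items()}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) summed over the support of p; eps avoids division by zero."""
    return sum(pv * math.log(pv / max(q.get(key, 0.0), eps))
               for key, pv in p.items())


def classify(seq, class_moments):
    """Pick the class whose assumed theoretical moments are closest in KL."""
    empirical = empirical_triple_moments(seq)
    return min(class_moments,
               key=lambda label: kl_divergence(empirical, class_moments[label]))


# Toy class models, invented for illustration only.
models = {"A": {(0, 0, 0): 1.0}, "B": {(1, 1, 1): 1.0}}
label = classify([0, 0, 0, 0, 0], models)
```

Counting triples takes one pass over the sequence, which reflects the lower classification-time cost the abstract claims relative to likelihood-based scoring.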


Phytotaxa ◽  
2014 ◽  
Vol 174 (4) ◽  
pp. 187 ◽  
Author(s):  
Lakshmi Attigala ◽  
Jimmy K. Triplett ◽  
Hashendra-Suvini Kathriarachchi ◽  
Lynn G Clark

Kuruna, a new temperate woody bamboo (Poaceae, Bambusoideae, Arundinarieae) genus from Sri Lanka, is recognized based on chloroplast sequence data from five markers (coding: ndhF 3’ end; non-coding: rps16-trnQ, trnC-rpoB, trnD-trnT, trnT-trnL). This genus represents the twelfth major lineage of temperate woody bamboos and is characterized by pachymorph culm bases with short necks, unicaespitose clumps, culm leaf girdles ca. 1 mm wide, usually abaxially hispid culm leaves with non-irritating hairs, persistent foliage leaf sheaths, complete branch sheathing and acute to biapiculate palea apices. Maximum Parsimony, Bayesian Inference and Maximum Likelihood analyses of a combined data set consistently strongly supported the monophyly of this Sri Lankan temperate woody bamboo clade. Although the Kishino-Hasegawa test is unable to reject the alternative hypothesis of monophyly of the Sri Lankan clade plus Bergbambos tessellata from South Africa, Kuruna and Bergbambos are distinguishable by a combination of morphological characters. A few additional cpDNA markers not previously used in phylogenetic analyses of Arundinarieae were tested to evaluate their utility in this taxonomically difficult tribe.

