Employing phylogenetic tree shape statistics to resolve the underlying host population structure

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Hassan W. Kayondo ◽  
Alfred Ssekagiri ◽  
Grace Nabakooza ◽  
Nicholas Bbosa ◽  
Deogratius Ssemwanga ◽  
...  

Abstract Background: Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools to reveal the population structure underlying an epidemic. Determining whether a population is structured or not is useful in informing the type of phylogenetic methods to be used in a given study. We employ tree statistics derived from phylogenetic trees and machine learning classification techniques to reveal an underlying population structure. Results: In this paper, we simulate phylogenetic trees from both structured and non-structured host populations. We compute eight statistics for the simulated trees: the number of cherries; the Sackin, Colless and total cophenetic indices; ladder length; maximum depth; maximum width; and width-to-depth ratio. Based on the estimated tree statistics, we classify the simulated trees as from either a non-structured or a structured population using the decision tree (DT), K-nearest neighbor (KNN) and support vector machine (SVM) classifiers. We incorporate the basic reproductive number ($$R_0$$) in our tree simulation procedure. Sensitivity analysis is done to investigate whether the classifiers are robust to different choices of model parameters and to tree size. Cross-validated results for area under the curve (AUC) for receiver operating characteristic (ROC) curves yield mean values of over 0.9 for most of the classification models. Conclusions: Our classification procedure distinguishes well between trees from structured and non-structured populations using the classifiers, the two-sample Kolmogorov-Smirnov, Cucconi and Podgor-Gastwirth tests and the box plots. SVM models were more robust to changes in model parameters and tree size compared to KNN and DT classifiers. Our classification procedure was applied to real-world data and the structured population was revealed with a high accuracy of $$92.3\%$$ using the SVM-polynomial classifier.
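As an illustration of three of the tree statistics named in this abstract, the following is a minimal Python sketch, not the authors' code. It assumes a rooted binary tree encoded as nested 2-tuples with any non-tuple value as a leaf; the function name `tree_stats` and this encoding are hypothetical choices made only for the example. It uses the standard definitions: a cherry is an internal node whose two children are both leaves, the Sackin index is the sum of leaf depths, and the Colless index is the sum over internal nodes of the absolute difference in leaf counts between the two subtrees.

```python
def tree_stats(tree):
    """Return (n_leaves, cherries, sackin, colless) for a rooted binary tree.

    Hypothetical encoding: a leaf is any non-tuple value,
    an internal node is a 2-tuple (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        # A single leaf: one tip, no cherries, zero imbalance.
        return 1, 0, 0, 0

    left, right = tree
    n_l, ch_l, sk_l, co_l = tree_stats(left)
    n_r, ch_r, sk_r, co_r = tree_stats(right)

    n = n_l + n_r
    # This node is a cherry if both children are leaves.
    cherries = ch_l + ch_r + (1 if n_l == 1 and n_r == 1 else 0)
    # Every leaf below this node sits one level deeper than in its subtree.
    sackin = sk_l + sk_r + n
    # Colless adds the leaf-count imbalance at this node.
    colless = co_l + co_r + abs(n_l - n_r)
    return n, cherries, sackin, colless


balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
print(tree_stats(balanced))     # (4, 2, 8, 0)
print(tree_stats(caterpillar))  # (4, 1, 9, 3)
```

The balanced tree has more cherries and lower Sackin and Colless values than the caterpillar tree of the same size, which is the kind of contrast a classifier could pick up when such statistics are used as features.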

2020 ◽  
Vol 69 (5) ◽  
pp. 884-896 ◽  
Author(s):  
Erik M Volz ◽  
Carsten Wiuf ◽  
Yonatan H Grad ◽  
Simon D W Frost ◽  
Ann M Dennis ◽  
...  

Abstract Population structure influences genealogical patterns; however, data pertaining to how populations are structured are often unavailable or not directly observable. Inference of population structure is highly important in molecular epidemiology where pathogen phylogenetics is increasingly used to infer transmission patterns and detect outbreaks. Discrepancies between observed and idealized genealogies, such as those generated by the coalescent process, can be quantified, and where significant differences occur, may reveal the action of natural selection, host population structure, or other demographic and epidemiological heterogeneities. We have developed a fast non-parametric statistical test for detection of cryptic population structure in time-scaled phylogenetic trees. The test is based on contrasting estimated phylogenies with the theoretically expected phylodynamic ordering of common ancestors in two clades within a coalescent framework. These statistical tests have also motivated the development of algorithms which can be used to quickly screen a phylogenetic tree for clades which are likely to share a distinct demographic or epidemiological history. Epidemiological applications include identification of outbreaks in vulnerable host populations or rapid expansion of genotypes with a fitness advantage. To demonstrate the utility of these methods for outbreak detection, we applied the new methods to large phylogenies reconstructed from thousands of HIV-1 partial pol sequences. This revealed the presence of clades which had grown rapidly in the recent past and were significantly concentrated in young men, suggesting recent and rapid transmission in that group. Furthermore, to demonstrate the utility of these methods for the study of antimicrobial resistance, we applied the new methods to a large phylogeny reconstructed from whole genome Neisseria gonorrhoeae sequences. We find that population structure detected using these methods closely overlaps with the appearance and expansion of mutations conferring antimicrobial resistance. [Antimicrobial resistance; coalescent; HIV; population structure.]


2019 ◽  
Author(s):  
Erik M. Volz ◽  
Carsten Wiuf ◽  
Yonatan H. Grad ◽  
Simon D.W. Frost ◽  
Ann M. Dennis ◽  
...  

Abstract Population structure influences genealogical patterns; however, data pertaining to how populations are structured are often unavailable or not directly observable. Inference of population structure is highly important in molecular epidemiology where pathogen phylogenetics is increasingly used to infer transmission patterns and detect outbreaks. Discrepancies between observed and idealised genealogies, such as those generated by the coalescent process, can be quantified, and where significant differences occur, may reveal the action of natural selection, host population structure, or other demographic and epidemiological heterogeneities. We have developed a fast non-parametric statistical test for detection of cryptic population structure in time-scaled phylogenetic trees. The test is based on contrasting estimated phylogenies with the theoretically expected phylodynamic ordering of common ancestors in two clades within a coalescent framework. These statistical tests have also motivated the development of algorithms which can be used to quickly screen a phylogenetic tree for clades which are likely to share a distinct demographic or epidemiological history. Epidemiological applications include identification of outbreaks in vulnerable host populations or rapid expansion of genotypes with a fitness advantage. To demonstrate the utility of these methods for outbreak detection, we applied the new methods to large phylogenies reconstructed from thousands of HIV-1 partial pol sequences. This revealed the presence of clades which had grown rapidly in the recent past and were significantly concentrated in young men, suggesting recent and rapid transmission in that group. Furthermore, to demonstrate the utility of these methods for the study of antimicrobial resistance, we applied the new methods to a large phylogeny reconstructed from whole genome Neisseria gonorrhoeae sequences. We find that population structure detected using these methods closely overlaps with the appearance and expansion of mutations conferring antimicrobial resistance.


2019 ◽  
Vol 5 (2) ◽  
Author(s):  
Nicola F Müller ◽  
Gytis Dudas ◽  
Tanja Stadler

Abstract Population dynamics can be inferred from genetic sequence data by using phylodynamic methods. These methods typically quantify the dynamics in unstructured populations or assume migration rates and effective population sizes to be constant through time in structured populations. When rates are allowed to vary through time in structured populations, the number of parameters to infer increases rapidly and the available data might not be sufficient to inform them. Additionally, it is often of interest to know what predicts these parameters rather than knowing the parameters themselves. Here, we introduce a method to infer the predictors for time-varying migration rates and effective population sizes by using a generalized linear model (GLM) approach under the marginal approximation of the structured coalescent. Using simulations, we show that our approach is able to reliably infer the model parameters and their predictors from phylogenetic trees. Furthermore, when simulating trees under the structured coalescent, we show that our new approach outperforms the discrete trait GLM model. We then apply our framework to a previously described Ebola virus dataset, where we infer the parameters and their predictors from genome sequences while accounting for phylogenetic uncertainty. We infer weekly cases to be the strongest predictor for effective population size and geographic distance the strongest predictor for migration. This approach is implemented as part of the BEAST2 package MASCOT, which allows us to jointly infer population dynamics, i.e. the parameters and predictors, within structured populations, the phylogenetic tree, and evolutionary parameters.


2019 ◽  
Author(s):  
Roman Kogay ◽  
Taylor B. Neely ◽  
Daniel P. Birnbaum ◽  
Camille R. Hankel ◽  
Migun Shakya ◽  
...  

Abstract Many of the sequenced bacterial and archaeal genomes encode regions of viral provenance. Yet, not all of these regions encode bona fide viruses. Gene transfer agents (GTAs) are thought to be former viruses that are now maintained in genomes of some bacteria and archaea and are hypothesized to enable exchange of DNA within bacterial populations. In Alphaproteobacteria, genes homologous to the ‘head-tail’ gene cluster that encodes structural components of the Rhodobacter capsulatus GTA (RcGTA) are found in many taxa, even if they are only distantly related to Rhodobacter capsulatus. Yet, in most genomes available in GenBank, RcGTA-like genes have annotations of typical viral proteins, and therefore are not easily distinguished from their viral homologs without additional analyses. Here, we report a ‘support vector machine’ classifier that quickly and accurately distinguishes RcGTA-like genes from their viral homologs by capturing the differences in the amino acid composition of the encoded proteins. Our open-source classifier is implemented in Python and can be used to scan homologs of the RcGTA genes in newly sequenced genomes. The classifier can also be trained to identify other types of GTAs, or even to detect other elements of viral ancestry. Using the classifier trained on a manually curated set of homologous viruses and GTAs, we detected RcGTA-like ‘head-tail’ gene clusters in 57.5% of the 1,423 examined alphaproteobacterial genomes. We also demonstrated that more than half of the in silico prophage predictions are instead likely to be GTAs, suggesting that in many alphaproteobacterial genomes the RcGTA-like elements remain unrecognized. Data deposition: Sequence alignments and phylogenetic trees are available in a FigShare repository at DOI 10.6084/m9.figshare.8796419. The Python source code of the described classifier and additional scripts used in the analyses are available via a GitHub repository at https://github.com/ecg-lab/GTA-Hunter-v1
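To make "amino acid composition" concrete, here is a minimal Python sketch of how a protein sequence could be turned into a fixed-length feature vector of residue frequencies. It is an illustration only and is not taken from the GTA-Hunter code, whose actual features may differ; the function name `aa_composition` and the decision to ignore non-standard residue codes are assumptions made for this example.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector is comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Return the relative frequency of each standard amino acid in seq.

    Assumption for this sketch: characters outside the 20 standard
    one-letter codes (e.g. 'X', '*', gaps) are ignored.
    """
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No usable residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


features = aa_composition("MKKLLA")
print(len(features))  # 20
```

A vector like this, computed per protein, is the kind of input a support vector machine could be trained on to separate two classes of homologous sequences.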


2013 ◽  
Vol 368 (1614) ◽  
pp. 20120198 ◽  
Author(s):  
Tanja Stadler ◽  
Sebastian Bonhoeffer

Host population structure has a major influence on epidemiological dynamics. However, in particular for sexually transmitted diseases, quantitative data on population contact structure are hard to obtain. Here, we introduce a new method that quantifies host population structure based on phylogenetic trees, which are obtained from pathogen genetic sequence data. Our method is based on a maximum-likelihood framework and uses a multi-type branching process, under which each host is assigned to a type (subpopulation). In a simulation study, we show that our method produces accurate parameter estimates for phylogenetic trees in which each tip is assigned to a type, as well as for phylogenetic trees in which the type of the tip is unknown. We apply the method to a Latvian HIV-1 dataset, quantifying the impact of the intravenous drug user epidemic on the heterosexual epidemic (known tip states), and identifying superspreader dynamics within the men-having-sex-with-men epidemic (unknown tip states).


The increased usage of the Internet and social networks has enabled people to express their views, which have attracted increasing attention lately. Sentiment Analysis (SA) techniques are used to determine the polarity of information, either positive or negative, toward a given topic, including opinions. In this research, we introduce a machine learning approach based on Support Vector Machine (SVM), Naïve Bayes (NB) and Random Forest (RF) classifiers to find and classify extreme opinions in Arabic reviews. To achieve this, a dataset of 1500 Arabic reviews was collected from Google Play Store. In addition, a two-stage classification process was applied to classify the reviews. In the first stage, we built a binary classifier to separate positive from negative reviews. In the second stage, we applied a binary classification mechanism based on a set of proposed rules that distinguishes extreme positive from positive reviews, and extreme negative from negative reviews. Four major experiments were conducted, with a total of 10 sub-experiments, to fulfill the two-stage process using different cross-validation schemes and the Term Frequency-Inverse Document Frequency feature selection method. The results indicated that SVM was the best during the first stage classification with 30% testing data, and NB was the best with 20% testing data. The results of the second stage classification indicated that SVM scored better in identifying extreme positive reviews when dealing with the positive dataset, with an overall accuracy of 68.7%, and NB showed better accuracy in identifying extreme negative reviews when dealing with the negative dataset, with an overall accuracy of 72.8%.
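For readers unfamiliar with the Term Frequency-Inverse Document Frequency weighting mentioned in this abstract, the following is a minimal Python sketch of one common variant. It is not the authors' pipeline: the function name `tf_idf`, the pre-tokenized input, and the specific formula (term frequency as count divided by document length, inverse document frequency as the natural log of the number of documents over the number of documents containing the term) are assumptions for illustration, and other variants with smoothing or normalization are widely used.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    docs: list of lists of tokens (tokenization is assumed done elsewhere).
    Returns one dict per document mapping token -> weight.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each token appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in tf.items()
        })
    return weights


reviews = [["good", "app"], ["bad", "app"]]
for w in tf_idf(reviews):
    print(w)
```

In this toy input the token "app" occurs in every review and therefore receives weight zero, while "good" and "bad" receive positive weights, which is why such a weighting can help a classifier focus on discriminative terms.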


2018 ◽  
Vol 11 (1) ◽  
pp. 64 ◽  
Author(s):  
Kyoung-jae Kim ◽  
Kichun Lee ◽  
Hyunchul Ahn

Measuring and managing the financial sustainability of borrowers is crucial to financial institutions for their risk management. As a result, building an effective corporate financial distress prediction model has been an important research topic for a long time. Recently, researchers have sought to improve the accuracy of financial distress prediction models by applying various business analytics approaches, including statistical and artificial intelligence methods. Among them, support vector machines (SVMs) are becoming popular. SVMs require only small training samples and have little possibility of overfitting if model parameters are properly tuned. Nonetheless, SVMs generally show high prediction accuracy since they can deal with complex nonlinear patterns. Despite these advantages, SVMs are often criticized because their architectural factors, such as the parameters of a kernel function and the subsets of appropriate features and instances, are determined by heuristics. In this study, we propose globally optimized SVMs, denoted by GOSVM, a novel hybrid SVM model designed to optimize feature selection, instance selection, and kernel parameters altogether. This study introduces a genetic algorithm (GA) in order to simultaneously optimize multiple heterogeneous design factors of SVMs. Our study applies the proposed model to a real-world case of financial distress prediction. Experiments show that the proposed model significantly improves the prediction accuracy of conventional SVMs.


2018 ◽  
Vol 13 ◽  
pp. 174830181879706 ◽  
Author(s):  
Song Qiang ◽  
Yang Pu

In this work, we summarized the characteristics and influencing factors of load forecasting based on its application status. The common methods of short-term load forecasting were analyzed to derive their advantages and disadvantages. Using historical load and meteorological data from a region of Taizhou, Zhejiang Province, a least squares support vector machine (LSSVM) model was used to discuss the influencing factors of forecasting. The regularity of the load change was identified to correct the “abnormal data” in the historical load data, and the relevant factors in load forecasting were normalized. Two parameters in LSSVM, the Gauss kernel function parameter and the eigen parameter C, had a significant impact on the model, yet were still chosen by empirical methods. Therefore, particle swarm optimization was used to optimize the model parameters. Taking the error on the test set as the basis of judgment, the model parameters were optimized to improve forecast accuracy. Practical examples showed that the method in this work had good convergence, forecast accuracy, and training speed.


2020 ◽  
Author(s):  
Zeqi Yao ◽  
Kehui Liu ◽  
Shanjun Deng ◽  
Xionglei He

Abstract Conventional coalescent inferences of population history make the critical assumption that the population under examination is panmictic. However, most populations are structured. This complicates the prevailing coalescent analyses and sometimes leads to inaccurate estimates. To develop a coalescent method unhampered by population structure, we perform two analyses. First, we demonstrate that the coalescent probability of two randomly sampled alleles from the immediate preceding generation (one generation back) is independent of population structure. Second, motivated by this finding, we propose a new coalescent method: i-coalescent analysis. i-coalescent analysis computes the instantaneous coalescent rate (iCR) by using a phylogenetic tree of sampled alleles. Using simulated data, we broadly demonstrate the capability of i-coalescent analysis to accurately reconstruct population size dynamics of highly structured populations, although we find this method often requires larger sample sizes for structured populations than for panmictic populations. Overall, our results indicate i-coalescent analysis to be a useful tool, especially for the inference of population histories with intractable structure such as the developmental history of cell populations in the organs of complex organisms.


Author(s):  
Ahmad Iwan Fadli ◽  
Selo Sulistyo ◽  
Sigit Wibowo

Traffic accidents are a very difficult problem to handle on a large scale in a country. Indonesia is one of the most populous developing countries, and vehicles are its main mode of transportation for daily activities. It is also the country with the largest number of car users in Southeast Asia, so driving safety needs to be considered. Using a machine learning classification method to determine whether a driver is driving safely or not can help reduce the risk of driving accidents. We created a detection system to classify whether the driver is driving safely or unsafely using trip sensor data, which include gyroscope, acceleration, and GPS readings. The classification methods used in this study are the Random Forest (RF) classification algorithm, Support Vector Machine (SVM), and Multilayer Perceptron (MLP), with data preprocessing improved through feature extraction and oversampling methods. This study shows that RF has the best performance, with 98% accuracy, 98% precision, and 97% sensitivity using the proposed preprocessing stages, compared to SVM or MLP.
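The three metrics reported in this abstract have standard definitions in terms of true and false positives and negatives. The sketch below shows how they could be computed in plain Python for a binary safe/unsafe labelling; the function name `binary_metrics`, the label encoding, and the choice to return 0.0 when a denominator is zero are assumptions for this example and are not taken from the study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, sensitivity) for binary labels.

    Assumption for this sketch: a metric whose denominator is zero
    is reported as 0.0 rather than raising an error.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the trips flagged positive, how many truly were.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Sensitivity (recall): of the truly positive trips, how many were flagged.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, sensitivity


# Toy labels, e.g. 1 = unsafe trip, 0 = safe trip (hypothetical encoding).
acc, prec, sens = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(acc, prec, sens)
```

Reporting precision and sensitivity alongside accuracy matters when the classes are imbalanced, which is also the situation oversampling is meant to address.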

