Harnessing machine learning to boost heuristic strategies for phylogenetic-tree search

Abstract: Inferring a phylogenetic tree, which describes the evolutionary relationships among a set of organisms, genes, or genomes, is a fundamental step in numerous evolutionary studies. With the aim of making tree inference feasible for problems involving more than a handful of sequences, current algorithms for phylogenetic tree reconstruction utilize various heuristic approaches. Such approaches rely on performing costly likelihood optimizations, and thus evaluate only a subset of all potential trees. Consequently, all existing methods suffer from the known tradeoff between accuracy and running time. Here, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus avoiding numerous expensive likelihood computations. Our analyses suggest that machine-learning approaches can make heuristic tree searches substantially faster without losing accuracy and thus could be incorporated for narrowing down the examined neighboring trees of each intermediate tree in any tree search methodology.

Harnessing machine learning to guide phylogenetic-tree search algorithms

Nature Communications ◽

10.1038/s41467-021-22073-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Dana Azouri ◽

Shiran Abadi ◽

Yishay Mansour ◽

Itay Mayrose ◽

Tal Pupko

Keyword(s):

Machine Learning ◽

Phylogenetic Tree ◽

Learning Algorithm ◽

Search Space ◽

Large Set ◽

Tree Search ◽

Proof Of Concept ◽

Tree Reconstruction ◽

Promising Candidate ◽

Tree Inference

Abstract: Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.
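
The idea in this abstract, ranking neighboring trees with a learned predictor and computing the exact likelihood only for the most promising ones, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `filter_neighbors`, `search_step`, `predict_gain`, `log_likelihood`, and `keep_fraction` are all hypothetical, trees are treated as opaque hashable objects, and the predictor is assumed to return a score where higher means a larger expected likelihood gain.

```python
from typing import Callable, Hashable, Iterable, List


def filter_neighbors(
    neighbors: Iterable[Hashable],
    predict_gain: Callable[[Hashable], float],
    keep_fraction: float = 0.1,
) -> List[Hashable]:
    """Keep the top `keep_fraction` of neighbors, ranked by predicted gain.

    `predict_gain` stands in for a trained model (an assumption); it is
    cheap to call, unlike an exact likelihood computation.
    """
    ranked = sorted(neighbors, key=predict_gain, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def search_step(
    current: Hashable,
    neighbors: Iterable[Hashable],
    predict_gain: Callable[[Hashable], float],
    log_likelihood: Callable[[Hashable], float],
    keep_fraction: float = 0.1,
) -> Hashable:
    """One hill-climbing step that scores only the shortlisted neighbors.

    `log_likelihood` stands in for the expensive exact computation
    (an assumption); it is called only on the shortlist and on `current`.
    """
    shortlist = filter_neighbors(neighbors, predict_gain, keep_fraction)
    if not shortlist:
        return current
    best = max(shortlist, key=log_likelihood)
    # Move only if the exact likelihood actually improves.
    return best if log_likelihood(best) > log_likelihood(current) else current
```

Calling `search_step` repeatedly until it returns the current tree gives a plain hill climb; the fraction of neighbors kept controls the tradeoff between running time and the risk of discarding the best neighbor.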

Predicting Student’s Performance Using Machine Learning Algorithm

International Journal of Advanced Research in Science, Communication and Technology ◽

10.48175/ijarsct-1209 ◽

2021 ◽

pp. 53-58

Author(s):

Sheela Rani P ◽

Dhivya S ◽

Dharshini Priya M ◽

Dharmila Chowdary A

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Prediction Model ◽

Naive Bayes ◽

Learning Algorithm ◽

Naïve Bayes ◽

Machine Learning Algorithms ◽

Support Vector ◽

Learning Approaches ◽

K Nearest Neighbors

Machine learning is an emerging research discipline that uses data to improve learning, optimizing the training process and the environment in which learning takes place. There are two types of machine learning approaches, supervised and unsupervised, which are used to extract knowledge that helps decision-makers take the correct intervention in the future. This paper introduces a prediction model for the factors that influence students' academic performance, using a variety of supervised machine learning algorithms: support vector machine, k-nearest neighbors (KNN), Naïve Bayes, and logistic regression. The results of the various algorithms are compared, and it is shown that the support vector machine and Naïve Bayes perform well, achieving improved accuracy compared to the other algorithms. The final prediction model in this paper may have fairly high prediction accuracy. The objective is not just to predict the future performance of students but also to provide the best technique for finding the most impactful features that influence students while studying.

To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Natural Language Engineering ◽

10.1017/s1351324917000298 ◽

2017 ◽

Vol 24 (1) ◽

pp. 3-37 ◽

Cited By ~ 5

Author(s):

SANDRA KÜBLER ◽

CAN LIU ◽

ZEESHAN ALI SAYYED

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Information Gain ◽

Binary Classification ◽

Small Subset ◽

Large Set ◽

Learning Approaches ◽

Selection Methods ◽

Data Set

Abstract: We investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is a common approach to use word or part-of-speech n-grams. This results in a large set of features, out of which only a small subset may be good indicators for the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 percentage points.
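
For readers unfamiliar with multi-class information gain, the following sketch scores one binary feature (for example, whether an n-gram occurs in a review) against labels with any number of classes, using the textbook definition IG = H(C) − Σ_v P(f = v) · H(C | f = v). The exact formulation used in the paper may differ; the function names and toy data here are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a label sequence; 0.0 if empty."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(
    feature_present: Sequence[bool], labels: Sequence[Hashable]
) -> float:
    """Information gain of a binary feature over multi-class labels."""
    if len(feature_present) != len(labels) or len(labels) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    with_f = [y for x, y in zip(feature_present, labels) if x]
    without_f = [y for x, y in zip(feature_present, labels) if not x]
    n = len(labels)
    # Class entropy after splitting on the feature, weighted by split size.
    conditional = (len(with_f) / n) * entropy(with_f) + (
        len(without_f) / n
    ) * entropy(without_f)
    return entropy(labels) - conditional


# Toy example: three rating classes, two reviews each.
ratings = [1, 1, 2, 2, 3, 3]
marks_class_1 = [True, True, False, False, False, False]
uninformative = [True, False, True, False, True, False]
print(information_gain(marks_class_1, ratings))
print(information_gain(uninformative, ratings))
```

Features would then be ranked by this score and only the top-scoring subset kept; a feature that isolates one class scores high, while one spread evenly across classes scores about zero.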

Precipitation Estimates from MSG SEVIRI Daytime, Nighttime, and Twilight Data with Random Forests

Journal of Applied Meteorology and Climatology ◽

10.1175/jamc-d-14-0082.1 ◽

2014 ◽

Vol 53 (11) ◽

pp. 2457-2480 ◽

Cited By ~ 30

Author(s):

Meike Kühnlein ◽

Tim Appelhans ◽

Boris Thies ◽

Thomas Nauß

Keyword(s):

Machine Learning ◽

Random Forests ◽

Learning Algorithm ◽

Radar Data ◽

Convective Precipitation ◽

General Tendency ◽

Learning Approaches ◽

Rain Area ◽

Individual Step ◽

Performance Patterns

Abstract: A new rainfall retrieval technique for determining rainfall rates in a continuous manner (day, twilight, and night) resulting in a 24-h estimation applicable to midlatitudes is presented. The approach is based on satellite-derived information on cloud-top height, cloud-top temperature, cloud phase, and cloud water path retrieved from Meteosat Second Generation (MSG) Spinning Enhanced Visible and Infrared Imager (SEVIRI) data and uses the random forests (RF) machine-learning algorithm. The technique is realized in three steps: (i) precipitating cloud areas are identified, (ii) the areas are separated into convective and advective-stratiform precipitating areas, and (iii) rainfall rates are assigned separately to the convective and advective-stratiform precipitating areas. Validation studies were carried out for each individual step as well as for the overall procedure using collocated ground-based radar data. Regarding each individual step, the models for rain area and convective precipitation detection produce good results. Both retrieval steps show a general tendency toward elevated prediction skill during summer months and daytime. The RF models for rainfall-rate assignment exhibit similar performance patterns, yet it is noteworthy how well the model is able to predict rainfall rates during nighttime and twilight. The performance of the overall procedure shows a very promising potential to estimate rainfall rates at high temporal and spatial resolutions in an automated manner. The near-real-time continuous applicability of the technique with acceptable prediction performances at 3–8-hourly intervals is particularly remarkable. This provides a very promising basis for future investigations into precipitation estimation based on machine-learning approaches and MSG SEVIRI data.

Peer Review #2 of "Pylogeny: an open-source Python framework for phylogenetic tree reconstruction and search space heuristics (v0.1)"

10.7287/peerj-cs.9v0.1/reviews/2 ◽

2015 ◽

Author(s):

J Sukumaran

Keyword(s):

Phylogenetic Tree ◽

Open Source ◽

Peer Review ◽

Search Space ◽

Tree Reconstruction ◽

Phylogenetic Tree Reconstruction

A Ligand-Based Virtual Screening Method Using Direct Quantification of Generalization Ability

Molecules ◽

10.3390/molecules24132414 ◽

2019 ◽

Vol 24 (13) ◽

pp. 2414

Author(s):

Weixing Dai ◽

Dianjing Guo

Keyword(s):

Machine Learning ◽

Virtual Screening ◽

Learning Algorithm ◽

Screening Method ◽

Chemical Characteristic ◽

High Dimensional ◽

Learning Approaches ◽

Generalization Ability ◽

Model Interpretation ◽

Screening Accuracy

Machine learning plays an important role in ligand-based virtual screening. However, conventional machine learning approaches tend to be inefficient when dealing with such problems where the data are imbalanced and features describing the chemical characteristic of ligands are high-dimensional. We here describe a machine learning algorithm LBS (local beta screening) for ligand-based virtual screening. The unique characteristic of LBS is that it quantifies the generalization ability of screening directly by a refined loss function, and thus can assess the risk of over-fitting accurately and efficiently for imbalanced and high-dimensional data in ligand-based virtual screening without the help of resampling methods such as cross validation. The robustness of LBS was demonstrated by a simulation study and tests on real datasets, in which LBS outperformed conventional algorithms in terms of screening accuracy and model interpretation. LBS was then used for screening potential activators of HIV-1 integrase multimerization in an independent compound library, and the virtual screening result was experimentally validated. Of the 25 compounds tested, six were proved to be active. The most potent compound in experimental validation showed an EC50 value of 0.71 µM.

Intelligent Web Search through Adaptive Learning from Relevance Feedback

Architectural Issues of Web-Enabled Electronic Business ◽

10.4018/978-1-59140-049-3.ch009 ◽

2011 ◽

pp. 140-154

Author(s):

Zhixiang Chen ◽

Binhai Zhu ◽

Xiannong Meng

Keyword(s):

Machine Learning ◽

Real Time ◽

Adaptive Learning ◽

Relevance Feedback ◽

Search Engines ◽

Web Search ◽

Learning Algorithm ◽

Future Research ◽

Learning Approaches ◽

Research Issues

In this chapter, machine-learning approaches to real-time intelligent Web search are discussed. The goal is to build an intelligent Web search system that can find the user’s desired information with as little relevance feedback from the user as possible. The system can achieve a significant search precision increase with a small number of iterations of user relevance feedback. A new machine-learning algorithm is designed as the core of the intelligent search component. This algorithm is applied to three different search engines with different emphases. This chapter presents the algorithm, the architectures, and the performances of these search engines. Future research issues regarding real-time intelligent Web search are also discussed.

Predicting Type 2 Diabetes Using Logistic Regression and Machine Learning Approaches

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18147346 ◽

2021 ◽

Vol 18 (14) ◽

pp. 7346

Author(s):

Ram D. Joshi ◽

Chandra K. Dhakal

Keyword(s):

Machine Learning ◽

Type 2 Diabetes ◽

Logistic Regression ◽

Learning Algorithm ◽

Classification Tree ◽

Treatment Strategies ◽

Economic Loss ◽

Learning Approaches ◽

Considerable Morbidity

Diabetes mellitus is one of the most common human diseases worldwide and may cause several health-related complications. It is responsible for considerable morbidity, mortality, and economic loss. A timely diagnosis and prediction of this disease could provide patients with an opportunity to take the appropriate preventive and treatment strategies. To improve the understanding of risk factors, we predict type 2 diabetes for Pima Indian women utilizing a logistic regression model and a decision tree, a machine learning algorithm. Our analysis finds five main predictors of type 2 diabetes: glucose, pregnancy, body mass index (BMI), diabetes pedigree function, and age. We further explore a classification tree to complement and validate our analysis. The six-fold classification tree indicates glucose, BMI, and age are important factors, while the ten-node tree implies glucose, BMI, pregnancy, diabetes pedigree function, and age as the significant predictors. Our preferred specification yields a prediction accuracy of 78.26% and a cross-validation error rate of 21.74%. We argue that our model can be applied to make a reasonable prediction of type 2 diabetes, and could potentially be used to complement existing preventive measures to curb the incidence of diabetes and reduce associated costs.
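
The two headline figures in this abstract are complementary: 78.26% + 21.74% = 100%, which matches the usual definition of error rate as one minus accuracy. A minimal sketch of both metrics follows; the function names and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def error_rate(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that are wrong, i.e. 1 - accuracy."""
    return 1.0 - accuracy(y_true, y_pred)


# Toy labels (1 = diabetic, 0 = not diabetic); three of four are correct.
truth = [1, 0, 1, 1]
predicted = [1, 0, 0, 1]
print(accuracy(truth, predicted), error_rate(truth, predicted))
```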

Machine Learning Algorithms to Classify and Quantify Multiple Behaviours in Dairy Calves Using a Sensor: Moving beyond Classification in Precision Livestock

Sensors ◽

10.3390/s21010088 ◽

2020 ◽

Vol 21 (1) ◽

pp. 88

Author(s):

Charles Carslake ◽

Jorge A. Vázquez-Diosdado ◽

Jasmeet Kaler

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Wearable Sensors ◽

Window Size ◽

Machine Learning Algorithms ◽

Dairy Calves ◽

Learning Approaches ◽

True Prevalence ◽

Play Behaviour ◽

Highly Correlated

Previous research has shown that sensors monitoring lying behaviours and feeding can detect early signs of ill health in calves. There is evidence to suggest that monitoring change in a single behaviour might not be enough for disease prediction. In calves, multiple behaviours such as locomotor play, self-grooming, feeding and activity whilst lying are likely to be informative. However, these behaviours can occur rarely in the real world, which means simply counting behaviours based on the prediction of a classifier can lead to overestimation. Here, we equipped thirteen pre-weaned dairy calves with collar-mounted sensors and monitored their behaviour with video cameras. Behavioural observations were recorded and merged with sensor signals. Features were calculated for 1–10-s windows and an AdaBoost ensemble learning algorithm was implemented to classify behaviours. Finally, we developed an adjusted count quantification algorithm to predict the prevalence of locomotor play behaviour on a test dataset with low true prevalence (0.27%). Our algorithm identified locomotor play (99.73% accuracy), self-grooming (98.18% accuracy), ruminating (94.47% accuracy), non-nutritive suckling (94.96% accuracy), nutritive suckling (96.44% accuracy), active lying (90.38% accuracy) and non-active lying (90.38% accuracy). Our results detail recommended sampling frequencies, feature selection and window size. The quantification estimates of locomotor play behaviour were highly correlated with the true prevalence (0.97; p < 0.001) with a total overestimation of 18.97%. This study is the first to implement machine learning approaches for multi-class behaviour identification as well as behaviour quantification in calves. This has potential to contribute towards new insights to evaluate the health and welfare in calves by use of wearable sensors.
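
The overestimation problem described here, where a classifier's raw positive count is inflated for rare behaviours, is what adjusted-count quantification is meant to correct. The sketch below uses the standard adjusted-count formula, p = (p_obs − FPR) / (TPR − FPR), which may differ from the algorithm the authors developed. The function name and the example rates are illustrative assumptions, not values from the study.

```python
def adjusted_count(predicted_positive_rate: float, tpr: float, fpr: float) -> float:
    """Estimate true prevalence from the classifier's raw positive rate.

    `tpr` and `fpr` are the classifier's true and false positive rates,
    assumed to have been measured on held-out labelled data.
    """
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the correction to be defined")
    estimate = (predicted_positive_rate - fpr) / (tpr - fpr)
    # Clip to a valid proportion; noise can push the raw estimate outside [0, 1].
    return min(1.0, max(0.0, estimate))


# Illustrative numbers: a rare behaviour with 1% true prevalence.
true_prevalence = 0.01
tpr, fpr = 0.90, 0.02
# Raw fraction of windows the classifier would flag as positive.
observed = true_prevalence * tpr + (1 - true_prevalence) * fpr
print(observed)                            # raw count overestimates prevalence
print(adjusted_count(observed, tpr, fpr))  # correction recovers about 0.01
```

With these example rates the raw positive rate is nearly three times the true prevalence, because false positives from the large negative class outnumber the true positives.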

Peer Review #1 of "Pylogeny: an open-source Python framework for phylogenetic tree reconstruction and search space heuristics (v0.2)"

10.7287/peerj-cs.9v0.2/reviews/1 ◽

2015 ◽

Keyword(s):

Phylogenetic Tree ◽

Open Source ◽

Peer Review ◽

Search Space ◽

Tree Reconstruction ◽

Phylogenetic Tree Reconstruction
