Ensemble learning from ensemble docking: revisiting the optimum ensemble size problem

2022 ◽  
Vol 12 (1) ◽  
Author(s):  
Sara Mohammadi ◽  
Zahra Narimani ◽  
Mitra Ashouri ◽  
Rohoullah Firouzi ◽  
Mohammad Hossein Karimi‐Jafari

Abstract Despite considerable advances obtained by applying machine learning approaches to protein–ligand affinity prediction, the incorporation of receptor flexibility has remained an important bottleneck. While ensemble docking has been widely used as a solution to this problem, the optimum choice of receptor conformations is still an open question, given the issues of computational cost and false-positive pose predictions. Here, a combination of ensemble learning and ensemble docking is suggested to rank different conformations of the target protein in light of their importance for the final accuracy of the model. Available X-ray structures of cyclin-dependent kinase 2 (CDK2) in complex with different ligands are used as an initial receptor ensemble, and its redundancy is removed through a graph-based procedure, which is shown to be more efficient and less subjective than clustering-based representative selection methods. A set of ligands with available experimental affinities is docked to this nonredundant receptor ensemble, and the energetic features of the best-scored poses are used in an ensemble learning procedure based on the random forest method. The importance of each receptor is obtained through feature selection measures, and it is shown that a few of the most important conformations are sufficient to reach 1 kcal/mol accuracy in affinity prediction, with considerable improvement in the early enrichment power of the models compared to ensemble docking strategies without learning. A clear strategy is provided in which machine learning selects the most important experimental conformers of the receptor from a large set of protein–ligand complexes while maintaining the final accuracy of affinity predictions at the highest level possible for the available data. Our results could be informative for future attempts to design receptor-specific docking–rescoring strategies.
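
To make the ensemble-learning step concrete, the following is a minimal sketch, not the authors' code, of ranking receptor conformations by random-forest feature importance; the docking-score matrix, affinities and conformer count are synthetic placeholders.

```python
# Minimal sketch: rank receptor conformations by random-forest feature
# importance.  All data below are synthetic stand-ins for per-conformer
# docking scores and experimental affinities.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_ligands, n_conformers = 200, 30
X = rng.normal(loc=-8.0, scale=1.5, size=(n_ligands, n_conformers))  # docking scores
y = X[:, :5].mean(axis=1) + rng.normal(scale=0.5, size=n_ligands)    # affinities (kcal/mol)

model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X, y)

# Feature importance acts as a proxy for how much each conformation
# contributes to the affinity prediction; keep only the top few.
order = np.argsort(model.feature_importances_)[::-1]
print("Most informative conformations:", order[:5])
```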

Molecules ◽  
2019 ◽  
Vol 24 (15) ◽  
pp. 2747 ◽  
Author(s):  
Eliane Briand ◽  
Ragnar Thomsen ◽  
Kristian Linnet ◽  
Henrik Berg Rasmussen ◽  
Søren Brunak ◽  
...  

The human carboxylesterase 1 (CES1), responsible for the biotransformation of many diverse therapeutic agents, may contribute to the occurrence of adverse drug reactions and therapeutic failure through drug interactions. The present study is designed to address the issue of potential drug interactions resulting from the inhibition of CES1. Based on an ensemble of 10 crystal structures complexed with different ligands and a set of 294 known CES1 ligands, we used docking (AutoDock Vina) and machine learning methodologies (LDA, QDA and multilayer perceptron), considering the different energy terms from the scoring function, to identify the combination that best enables the identification of CES1 inhibitors. The protocol was then applied to a library of 1114 FDA-approved drugs, and eight drugs were selected for in vitro CES1 inhibition testing. An inhibitory effect was observed for diltiazem (IC50 = 13.9 µM). Three other drugs (benztropine, iloprost and treprostinil) exhibited weak CES1 inhibitory effects, with IC50 values of 298.2 µM, 366.8 µM and 391.6 µM, respectively. In conclusion, the binding site of CES1 is relatively flexible and can adapt its conformation to different types of ligands. Combining ensemble docking and machine learning approaches improves the prediction of CES1 inhibitors compared to a docking study using only one crystal structure.
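
As a rough illustration of the docking-plus-classifier protocol, here is a hedged sketch combining per-structure energy terms with LDA, QDA and a multilayer perceptron; the feature matrix and labels are random stand-ins, not the study's data.

```python
# Sketch: classify CES1 inhibitors from docking energy terms with LDA,
# QDA and an MLP.  The 294 x 60 feature matrix (energy terms across an
# ensemble of structures) and the labels are random placeholders.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(294, 60))
y = rng.integers(0, 2, size=294)   # 1 = known inhibitor, 0 = non-inhibitor

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis()),
                  ("MLP", MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000))]:
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean cross-validated AUC = {auc:.2f}")
```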


2017 ◽  
Vol 24 (1) ◽  
pp. 3-37 ◽  
Author(s):  
SANDRA KÜBLER ◽  
CAN LIU ◽  
ZEESHAN ALI SAYYED

Abstract We investigate feature selection methods for machine learning approaches to sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, a common choice is to use word or part-of-speech n-grams as features. This results in a large set of features, of which only a small subset may be good indicators of the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least six percentage points.
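
A compact, hypothetical sketch of the pipeline idea (n-gram features, a multi-class information-gain-style filter, here mutual information, and a linear classifier) is given below; the toy reviews, ratings and parameter values are invented for illustration.

```python
# Sketch: n-gram features -> multi-class mutual-information feature
# selection -> linear SVM.  Reviews and ratings are invented examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reviews = ["loved this recipe", "too salty and bland", "decent but slow",
           "perfect every time", "would not make again", "okay weeknight dish"]
ratings = [4, 1, 3, 4, 1, 3]   # hypothetical multi-class star ratings

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),        # word uni- and bigrams
    SelectKBest(mutual_info_classif, k=10),     # keep the k most informative features
    LinearSVC())
model.fit(reviews, ratings)
print(model.predict(["bland and slow"]))

# Over-sampling the minority rating classes (e.g. with imbalanced-learn's
# RandomOverSampler) would be applied to the training set before fitting.
```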


Author(s):  
S Manikandan ◽  
K Duraivelu

Fault diagnosis of rotating equipment plays a significant role in industry, as it guarantees safety and reliability and prevents breakdowns and energy losses. Early fault identification is fundamental, since it saves both time and costs and avoids hazardous conditions. Investigations are being carried out into intelligent fault diagnosis using machine learning approaches. This article analyses the various machine learning approaches used for fault diagnosis of rotating equipment. In addition, it presents a detailed study of the different machine learning strategies applied to various types of rotating equipment in the context of fault diagnosis. In particular, the benefits and recent developments of deep neural networks applied to multiple components for fault diagnosis are examined. Finally, different algorithms are proposed to improve the quality of fault diagnosis, and promising research directions for applying machine learning approaches to rotating equipment are summarised in this article.


2021 ◽  
Author(s):  
Thiago Abdo ◽  
Fabiano Silva

The purpose of this paper is to analyze the use of different machine learning approaches and algorithms to be integrated as automated assistance in a tool that aids the creation of new annotated datasets. We evaluate how they scale in an environment without dedicated machine learning hardware. In particular, we study the impact on a dataset with few examples and on one that is still being constructed. We experiment with deep learning algorithms (BERT) and classical learning algorithms with a lower computational cost (W2V and GloVe combined with RF and SVM). Our experiments show that deep learning algorithms have a performance advantage over classical techniques. However, deep learning algorithms have a high computational cost, making them inadequate for an environment with reduced hardware resources. We conduct simulations using active and iterative machine learning techniques to assist the creation of new datasets. For these simulations, we use the classical learning algorithms because of their lower computational cost. The knowledge gathered from our experimental evaluation is intended to support the creation of a tool for building new text datasets.
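
The active-learning simulation could look roughly like the following sketch, which uses TF-IDF plus an SVM as a low-cost stand-in for the W2V/GloVe pipelines; the document pool, oracle labels and annotation budget are all simulated.

```python
# Simplified uncertainty-based active-learning loop with a low-cost
# classical model.  The pool of texts, the oracle labels and the budget
# are simulated placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

texts = [f"example document number {i}" for i in range(100)]
true_labels = np.array([i % 2 for i in range(100)])   # simulated oracle

X = TfidfVectorizer().fit_transform(texts)
labeled = list(range(20))                             # small seed set

for _ in range(5):                                    # a few annotation rounds
    clf = SVC(kernel="linear", probability=True).fit(X[labeled], true_labels[labeled])
    uncertainty = 1 - clf.predict_proba(X).max(axis=1)     # least-confident sampling
    candidates = [i for i in np.argsort(uncertainty)[::-1] if i not in labeled]
    labeled.extend(candidates[:5])                    # "annotate" 5 new examples

print(f"{len(labeled)} examples labeled after the simulation")
```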


2020 ◽  
Author(s):  
Dana Azouri ◽  
Shiran Abadi ◽  
Yishay Mansour ◽  
Itay Mayrose ◽  
Tal Pupko

Abstract Inferring a phylogenetic tree, which describes the evolutionary relationships among a set of organisms, genes, or genomes, is a fundamental step in numerous evolutionary studies. With the aim of making tree inference feasible for problems involving more than a handful of sequences, current algorithms for phylogenetic tree reconstruction utilize various heuristic approaches. Such approaches rely on performing costly likelihood optimizations, and thus evaluate only a subset of all potential trees. Consequently, all existing methods suffer from the known tradeoff between accuracy and running time. Here, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides a means to safely discard a large portion of the search space, thus avoiding numerous expensive likelihood computations. Our analyses suggest that machine-learning approaches can make heuristic tree searches substantially faster without losing accuracy, and thus could be incorporated for narrowing down the examined neighboring trees of each intermediate tree in any tree search methodology.
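
Conceptually, the learned search heuristic can be pictured as in the sketch below: a regressor trained on features of past tree rearrangements scores candidate neighbors so that only the most promising ones are evaluated exactly. The features, targets and shapes are synthetic placeholders, not the authors' feature set.

```python
# Conceptual sketch: rank candidate neighboring trees by a learned score
# instead of computing every likelihood.  Features and targets are
# synthetic stand-ins for attributes of tree rearrangement (e.g. SPR) moves.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X_train = rng.normal(size=(5000, 20))    # features of previously seen moves
y_train = rng.normal(size=5000)          # observed log-likelihood changes

ranker = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# During the search, score all neighbors of the current tree and compute
# the true likelihood only for the top-ranked few.
X_neighbors = rng.normal(size=(300, 20))
shortlist = np.argsort(ranker.predict(X_neighbors))[::-1][:10]
print("Neighbors to evaluate exactly:", shortlist)
```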


2020 ◽  
Vol 12 (20) ◽  
pp. 3292
Author(s):  
Sara Akodad ◽  
Lionel Bombrun ◽  
Junshi Xia ◽  
Yannick Berthoumieu ◽  
Christian Germain

Remote sensing image scene classification, which consists of labeling remote sensing images with a set of categories based on their content, has received remarkable attention for many applications such as land use mapping. Standard approaches are based on the multi-layer representation of first-order convolutional neural network (CNN) features. However, second-order CNNs have recently been shown to outperform traditional first-order CNNs for many computer vision tasks. Hence, the aim of this paper is to show the use of second-order statistics of CNN features for remote sensing scene classification. This takes the form of covariance matrices computed locally or globally on the output of a CNN. However, these data points do not lie in a Euclidean space but on a Riemannian manifold, so Euclidean tools are not suited to manipulating them, and other metrics should be considered, such as the log-Euclidean metric. This consists of projecting the set of covariance matrices onto a tangent space defined at a reference point. In this tangent plane, which is a vector space, conventional machine learning algorithms can be applied, such as Fisher vector encoding or an SVM classifier. Based on this log-Euclidean framework, we propose a novel transfer learning approach composed of two hybrid architectures based on covariance pooling of CNN features, the first local and the second global. They rely on features extracted from models pre-trained on the ImageNet dataset and processed with machine learning algorithms. The first hybrid architecture consists of an ensemble learning approach with log-Euclidean Fisher vector encoding of region covariance matrices computed locally on the first layers of a CNN. The second is an ensemble learning approach based on the covariance pooling of CNN features extracted globally from the deepest layers. These two ensemble learning approaches are then combined based on the strategy of the most diverse ensembles. For validation and comparison purposes, the proposed approach is tested on various challenging remote sensing datasets. Experimental results exhibit a significant gain of approximately 2% in overall accuracy for the proposed approach compared to a similar state-of-the-art method based on covariance pooling of CNN features (on the UC Merced dataset).
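
The log-Euclidean mapping at the heart of the covariance-pooling step can be sketched as follows; the feature matrix is random and stands in for the output of a pre-trained CNN layer, and the tangent space is taken at the identity for simplicity.

```python
# Minimal sketch of log-Euclidean covariance pooling of CNN feature maps.
# The feature matrix is a random stand-in for a CNN layer output, and the
# tangent space is taken at the identity for simplicity.
import numpy as np

rng = np.random.default_rng(3)
features = rng.normal(size=(14 * 14, 64))   # spatial positions x channels

# Covariance pooling: one SPD matrix of second-order statistics,
# regularised to stay positive definite.
cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(64)

# Log-Euclidean mapping: the matrix logarithm (via eigendecomposition)
# projects the SPD matrix onto a tangent (vector) space where Euclidean
# tools such as Fisher vector encoding or an SVM can be applied.
eigvals, eigvecs = np.linalg.eigh(cov)
log_cov = eigvecs @ np.diag(np.log(eigvals)) @ eigvecs.T

descriptor = log_cov[np.triu_indices(64)]   # vectorise the upper triangle
print("Descriptor length:", descriptor.shape[0])
```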


Author(s):  
Pracheta J. Raut ◽  
Prof. Avantika Mahadik

The volume of digital data the world produces today is unprecedented. Social media, e-commerce and the Internet of Things generate approximately 2.5 quintillion bytes per day, equivalent to 100 million Blu-ray discs or almost 30,000 GB per second, and this volume continues to grow. In the healthcare industry, big data has opened new ways to acquire intelligence and perform data analysis. Records collected from patients, hospitals, doctors and medical treatments are known as healthcare big data. Machine learning is used to assemble and evaluate these large amounts of healthcare data. Analytics and business intelligence (BI) are growing day by day, as they extract knowledge and support better decisions. Because the data are vast, complex and constantly growing, they are difficult to store, and traditional methods of handling big data are incapable of managing and processing them. Hence, to resolve this difficulty, machine learning tools are applied to large amounts of data using a big data analytics framework. Researchers have proposed several machine learning approaches to improve the accuracy of analytics; when each technique is applied and their results are compared, the approach that yields accurate results is ensemble learning. The final results show that ensemble learning can achieve high accuracy. In this paper we study various methods and statistical approaches for processing big data for machine learning. We further study various tools for storing big data, along with their advantages and disadvantages, in the field of the healthcare industry.


2020 ◽  
Vol 3 (1) ◽  
Author(s):  
Yijun Zhao ◽  
Tong Wang ◽  
Riley Bove ◽  
Bruce Cree ◽  
...  

Abstract The rate of disability accumulation varies across multiple sclerosis (MS) patients. Machine learning techniques may offer more powerful means to predict disease course in MS patients. In our study, 724 patients from the Comprehensive Longitudinal Investigation in MS at Brigham and Women’s Hospital (CLIMB study) and 400 patients from the EPIC dataset, University of California, San Francisco, were included in the analysis. The primary outcome was an increase in Expanded Disability Status Scale (EDSS) ≥ 1.5 (worsening) or not (non-worsening) at up to 5 years after the baseline visit. Classification models were built using the CLIMB dataset with patients’ clinical and MRI longitudinal observations in the first 2 years, and further validated using the EPIC dataset. We compared the performance of three popular machine learning algorithms (SVM, Logistic Regression, and Random Forest) and three ensemble learning approaches (XGBoost, LightGBM, and a Meta-learner L). A threshold was established to trade off performance between the two classes. Predictive features were identified and compared among the different models. Machine learning models achieved 0.79 and 0.83 AUC scores for the CLIMB and EPIC datasets, respectively, shortly after disease onset. Ensemble learning methods were more effective and robust compared to standalone algorithms. Two ensemble models, XGBoost and LightGBM, were superior to the other four models evaluated in our study. Of the variables evaluated, EDSS, Pyramidal Function, and Ambulatory Index were the top common predictors in forecasting the MS disease course. Machine learning techniques, in particular ensemble methods, offer increased accuracy for the prediction of MS disease course.
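
A hedged sketch of the model comparison and thresholding step is shown below, using scikit-learn classifiers on a synthetic, imbalanced dataset; gradient boosting stands in for XGBoost/LightGBM, and all names and values are placeholders.

```python
# Sketch: compare standalone and boosting classifiers and apply a custom
# decision threshold.  The clinical feature matrix is synthetic, and
# GradientBoostingClassifier stands in for XGBoost/LightGBM.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=724, n_features=30, weights=[0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, clf in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                  ("RandomForest", RandomForestClassifier(random_state=0)),
                  ("GradientBoosting", GradientBoostingClassifier(random_state=0))]:
    proba = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_te, proba):.2f}")

# A tuned probability threshold trades off sensitivity and specificity
# between the worsening and non-worsening classes.
threshold = 0.35
predicted_worsening = proba >= threshold
```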


2020 ◽  
Vol 3 (1) ◽  
Author(s):  
Ralph K. Akyea ◽  
Nadeem Qureshi ◽  
Joe Kai ◽  
Stephen F. Weng

Abstract Familial hypercholesterolaemia (FH) is a common inherited disorder, causing lifelong elevated low-density lipoprotein cholesterol (LDL-C). Most individuals with FH remain undiagnosed, precluding opportunities to prevent premature heart disease and death. Some machine-learning approaches improve detection of FH in electronic health records, though their clinical impact is under-explored. We assessed the performance of an array of machine-learning approaches for enhancing detection of FH, and their clinical utility, within a large primary care population. A retrospective cohort study was done using routine primary care clinical records of 4,027,775 individuals from the United Kingdom with total cholesterol measured from 1 January 1999 to 25 June 2019. The predictive accuracy of five common machine-learning algorithms (logistic regression, random forest, gradient boosting machines, neural networks and ensemble learning) was assessed for detecting FH. Predictive accuracy was assessed by the area under the receiver operating characteristic curve (AUC) and the expected vs observed calibration slope, with clinical utility assessed by expected case-review workload and likelihood ratios. There were 7928 incident diagnoses of FH. In addition to known clinical features of FH (raised total cholesterol or LDL-C and family history of premature coronary heart disease), machine-learning (ML) algorithms identified features such as raised triglycerides which reduced the likelihood of FH. Apart from logistic regression (AUC, 0.81), all four other ML approaches had similarly high predictive accuracy (AUC > 0.89). The calibration slope ranged from 0.997 for gradient boosting machines to 1.857 for logistic regression. Among those screened, the proportion of high-probability cases requiring clinical review varied from 0.73% using ensemble learning to 10.16% using deep learning, with positive predictive values of 15.5% and 2.8%, respectively. Ensemble learning exhibited a dominant positive likelihood ratio (45.5) compared to all other ML models (7.0–14.4). Machine-learning models show similarly high accuracy in detecting FH, offering opportunities to increase diagnosis. However, the clinical case-finding workload required to yield cases will differ substantially between models.
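
The evaluation quantities named above (AUC, calibration slope, positive likelihood ratio) can be computed as in this sketch; the outcomes and predicted probabilities are simulated, not derived from the study's cohort.

```python
# Sketch of the reported metrics on simulated predictions: AUC,
# calibration slope and positive likelihood ratio.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(4)
y_true = rng.integers(0, 2, size=5000)
p_hat = np.clip(0.5 * y_true + rng.normal(0.25, 0.15, size=5000), 0.01, 0.99)

print("AUC:", round(roc_auc_score(y_true, p_hat), 3))

# Calibration slope: regress outcomes on the log-odds of the predictions;
# a slope near 1 indicates well-calibrated risk estimates.
log_odds = np.log(p_hat / (1 - p_hat)).reshape(-1, 1)
slope = LogisticRegression().fit(log_odds, y_true).coef_[0, 0]
print("Calibration slope:", round(slope, 3))

# Positive likelihood ratio = sensitivity / (1 - specificity) at a threshold.
tn, fp, fn, tp = confusion_matrix(y_true, (p_hat >= 0.5).astype(int)).ravel()
print("Positive likelihood ratio:", round((tp / (tp + fn)) / (fp / (fp + tn)), 2))
```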


2018 ◽  
Vol 14 (2) ◽  
pp. 138-161
Author(s):  
Liu Jie ◽  
Yuan Kerou ◽  
Zhou Jianshe ◽  
Shi Jinsheng

This article describes how more and more knowledge appears on the Internet in ontological form. Presenting precise results to users when searching is the key issue in research on ontology retrieval. The factors considered in ontology ranking are not limited to internal character matching but also include analysis of metadata, including the entities, structures and relations in ontologies. Existing single-feature ranking algorithms focus on the structure, elements and content of a single aspect of an ontology, so their results are not satisfactory. Combining multiple single-feature models seems to achieve better results, but the objectivity and versatility of the models' weights are debatable. Machine learning effectively solves this problem, and putting the advantages of ranking learning algorithms together is the pressing issue. We therefore propose ensemble learning strategies to combine different algorithms in ontology ranking. The resulting ranking is more satisfactory compared to Swoogle and the base algorithms.
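
One simple way to picture the ensemble ranking idea is a weighted combination of single-feature scores, as in the sketch below; the ontologies, feature scores and weights are all invented, and in practice the weights would come from a learning-to-rank model.

```python
# Toy illustration: combine several single-feature ontology-ranking scores
# into one ensemble ranking.  Ontology names, scores and weights are
# invented; a learning-to-rank model would supply the weights in practice.
import numpy as np

ontologies = ["onto_a", "onto_b", "onto_c", "onto_d"]
scores = np.array([[0.9, 0.2, 0.6],    # e.g. string match, structure, popularity
                   [0.4, 0.8, 0.7],
                   [0.7, 0.5, 0.1],
                   [0.2, 0.9, 0.8]])
weights = np.array([0.5, 0.3, 0.2])

combined = scores @ weights
for name, s in sorted(zip(ontologies, combined), key=lambda t: -t[1]):
    print(f"{name}: {s:.2f}")
```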

