An interpretable classification method for predicting drug resistance in M. tuberculosis

Author(s):  
Hooman Zabeti ◽  
Nick Dexter ◽  
Amir Hosein Safari ◽  
Nafiseh Sedaghat ◽  
Maxwell Libbrecht ◽  
...  

Abstract. Motivation: The prediction of drug resistance and the identification of its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Modern methods based on testing against a catalogue of previously identified mutations often yield poor predictive performance. On the other hand, machine learning techniques have demonstrated high predictive accuracy, but many of them lack the interpretability needed to identify the specific mutations that lead to resistance. We propose a novel technique, inspired by the group testing problem and Boolean compressed sensing, which yields highly accurate predictions and interpretable results at the same time. Results: We develop a modified version of the Boolean compressed sensing problem for identifying drug resistance, and implement its formulation as an integer linear program. This allows us to characterize the predictive accuracy of the technique and select an appropriate metric to optimize. A simple adaptation of the problem also allows us to quantify the sensitivity-specificity trade-off of our model under different regimes. We test the predictive accuracy of our approach on a variety of antibiotics commonly used in treating tuberculosis and find that it has accuracy comparable to that of standard machine learning models and points to several genes with previously identified association to drug resistance. Availability: https://github.com/hoomanzabeti/[email protected]
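
One way to make the integer linear program concrete is the slack-variable form commonly used for Boolean compressed sensing. The sketch below is an illustration under assumed notation, not necessarily the authors' exact formulation: $A \in \{0,1\}^{m \times n}$ is an assumed isolate-by-variant presence matrix, $y \in \{0,1\}^{m}$ holds the resistance labels, $w$ marks the selected variants, $\xi$ collects the slack for misclassified isolates, and $\lambda > 0$ is an assumed trade-off weight.

```latex
\begin{aligned}
\min_{w,\,\xi}\quad & \sum_{j=1}^{n} w_j + \lambda \sum_{i=1}^{m} \xi_i \\
\text{s.t.}\quad & \sum_{j=1}^{n} A_{ij} w_j + \xi_i \ge 1, && i : y_i = 1, \\
& \sum_{j=1}^{n} A_{ij} w_j - \xi_i = 0, && i : y_i = 0, \\
& w_j \in \{0,1\}, \quad \xi_i \in \mathbb{Z}_{\ge 0}.
\end{aligned}
```

Under this reading an isolate is predicted resistant when it carries at least one selected variant, so the nonzero entries of $w$ can be read directly as candidate resistance-associated variants.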

2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Hooman Zabeti ◽  
Nick Dexter ◽  
Amir Hosein Safari ◽  
Nafiseh Sedaghat ◽  
Maxwell Libbrecht ◽  
...  

Abstract. Motivation: Prediction of drug resistance and identification of its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Solving this problem requires a transparent, accurate, and flexible predictive model. The methods currently used for this purpose rarely satisfy all of these criteria. On the one hand, approaches based on testing strains against a catalogue of previously identified mutations often yield poor predictive performance; on the other hand, machine learning techniques typically have higher predictive accuracy, but often lack interpretability and may learn patterns that produce accurate predictions for the wrong reasons. Current interpretable methods may either exhibit lower accuracy or lack the flexibility needed to generalize to previously unseen data. Contribution: In this paper we propose a novel technique, inspired by group testing and Boolean compressed sensing, which yields highly accurate predictions and interpretable results while remaining flexible enough to be optimized for various evaluation metrics. Results: We test the predictive accuracy of our approach on five first-line and seven second-line antibiotics used for treating tuberculosis. We find that its accuracy is higher than or comparable to that of commonly used machine learning models, and that it is able to identify variants in genes with previously reported association to drug resistance. Our method is intrinsically interpretable, and can be customized for different evaluation metrics. Our implementation is available at github.com/hoomanzabeti/INGOT_DR and can be installed via the Python Package Index (PyPI) under ingotdr. This package is also compatible with most of the tools in the scikit-learn machine learning library.
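
To illustrate what an intrinsically interpretable group-testing rule might look like, here is a minimal sketch in plain Python. It does not use the ingotdr package or reproduce its API; the function names `predict` and `fit_or_rule`, the toy data, and the penalty weight `lam` are all hypothetical, and an exhaustive search stands in for the integer programming solver.

```python
from itertools import combinations


def predict(rule, sample):
    """Return 1 (resistant) if the sample carries any variant listed in the rule."""
    return int(any(sample[j] for j in rule))


def fit_or_rule(X, y, max_size=2, lam=1.0):
    """Search all variant subsets up to max_size and keep the one minimising
    (number of selected variants) + lam * (number of misclassified samples).
    Brute force is only viable for toy inputs; it is a stand-in for an ILP solver."""
    n_variants = len(X[0])
    # The empty rule predicts every sample as susceptible.
    best_rule, best_cost = (), lam * sum(y)
    for k in range(1, max_size + 1):
        for rule in combinations(range(n_variants), k):
            errors = sum(predict(rule, row) != label for row, label in zip(X, y))
            cost = k + lam * errors
            if cost < best_cost:
                best_rule, best_cost = rule, cost
    return best_rule


# Hypothetical toy data: rows are isolates, columns are variant presence flags.
X = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
]
y = [1, 1, 0, 0, 1]  # 1 = resistant, 0 = susceptible

rule = fit_or_rule(X, y, max_size=2, lam=2.0)
print("selected variant indices:", rule)
```

The learned rule is simply a list of variant indices, which is what makes this kind of model readable: a prediction of resistance can always be traced to the specific variants that triggered it.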


2020 ◽  
Vol 28 (2) ◽  
pp. 253-265 ◽  
Author(s):  
Gabriela Bitencourt-Ferreira ◽  
Amauri Duarte da Silva ◽  
Walter Filgueira de Azevedo

Background: The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed at identifying new inhibitors for this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control. Such drugs have potential anticancer activities. Objective: Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures. Methods: We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate here computational methods to calculate the binding affinity of CDK2 through classical scoring functions and machine-learning models. Results: Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with experimental data. Conclusion: Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. Machine learning models showed superior predictive performance when compared with classical scoring functions. Analysis of the computational models obtained through machine learning could capture essential structural features responsible for binding affinity against CDK2.


Author(s):  
Ramakanta Mohanty ◽  
Vadlamani Ravi

Over the past 10 years, many researchers have proposed predicting software defects using various metrics based on measurable aspects of source code entities (e.g. methods, classes, files or modules) and the social structure of the software project. However, these metrics have not yielded very high sensitivity, specificity and accuracy. In this chapter, we propose the use of machine learning techniques to predict software defects. The effectiveness of all these techniques is demonstrated on ten datasets taken from the literature. Based on an experiment, it is observed that PNN outperformed all other techniques in terms of accuracy and sensitivity on all the software defect datasets, followed by CART and the Group Method of Data Handling. We also performed feature selection with a t-statistic-based approach, selecting feature subsets across different folds for a given technique. Using the most important variables, we invoked the classifiers again and observed that PNN outperformed the other classifiers in terms of sensitivity and accuracy. Moreover, the set of 'if-then' rules yielded by J48 and CART can be used as an expert system for the prediction of software defects.
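
The three evaluation criteria named above can be computed from the confusion counts of a binary classifier. The following is a minimal sketch; the function name `classification_metrics` and the example labels are hypothetical and are not taken from the chapter.

```python
def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity and accuracy for binary labels (1 = defective)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defects correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean modules correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean modules wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defects that were missed
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical labels for eight modules.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters for defect data, where defective modules are typically the minority class and accuracy alone can look high for a model that misses most defects.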


2017 ◽  
Author(s):  
Ari S. Benjamin ◽  
Hugo L. Fernandes ◽  
Tucker Tomlinson ◽  
Pavan Ramkumar ◽  
Chris VerSteeg ◽  
...  

Abstract. Neuroscience has long focused on finding encoding models that effectively ask “what predicts neural spiking?” and generalized linear models (GLMs) are a typical approach. It is often unknown how much of explainable neural activity is captured, or missed, when fitting a GLM. Here we compared the predictive performance of GLMs to three leading machine learning methods: feedforward neural networks, gradient boosted trees (using XGBoost), and stacked ensembles that combine the predictions of several methods. We predicted spike counts in macaque motor (M1) and somatosensory (S1) cortices from standard representations of reaching kinematics, and in rat hippocampal cells from open field location and orientation. In general, the modern methods (particularly XGBoost and the ensemble) produced more accurate spike predictions and were less sensitive to the preprocessing of features. This discrepancy in performance suggests that standard feature sets may often relate to neural activity in a nonlinear manner not captured by GLMs. Encoding models built with machine learning techniques, which can be largely automated, more accurately predict spikes and can offer meaningful benchmarks for simpler models.
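
For reference, a GLM for spike counts is often written as a Poisson model with a log link. The abstract does not state the exact link or noise model used, so the form below is an assumption; $y_t$ denotes the spike count in time bin $t$, $x_t$ the feature vector (for example, kinematic covariates), and $\beta_0, \beta$ the fitted coefficients.

```latex
y_t \mid x_t \sim \mathrm{Poisson}(\mu_t), \qquad \log \mu_t = \beta_0 + \beta^{\top} x_t
```

Because the log firing rate is linear in $x_t$, any dependence that is not linear in the chosen features falls outside what this model can represent, which is the limitation the comparison with nonlinear methods probes.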


Author(s):  
Subhendu Kumar Pani ◽  
Bikram Kesari Ratha ◽  
Ajay Kumar Mishra

DNA microarray technology permits the simultaneous monitoring and measurement of thousands of gene expression levels in a single experiment. Data mining techniques such as classification are extensively used on microarray data for medical diagnosis and gene analysis. However, the high dimensionality of the data affects the performance of classification and prediction. Consequently, a key issue in microarray data analysis is feature selection and dimensionality reduction in order to achieve better classification and predictive accuracy. There are several machine learning approaches available for feature selection. In this study, the authors use Particle Swarm Optimization (PSO) and a Genetic Algorithm (GA) for feature selection and evaluate the performance of several popular classifiers on a set of microarray datasets. Experimental results show that feature selection affects the performance.


2021 ◽  
Author(s):  
Massimiliano Greco ◽  
Giovanni Angelotti ◽  
Pier Francesco Caruso ◽  
Alberto Zanella ◽  
Niccolò Stomeo ◽  
...  

Abstract. Introduction: SARS-CoV-2 infection was first identified at the end of 2019 in China, and subsequently spread globally. COVID-19 disease frequently affects the lungs, leading to bilateral viral pneumonia and progressing in some cases to severe respiratory failure requiring ICU admission and mechanical ventilation. Risk stratification at ICU admission is fundamental for resource allocation and decision making, considering that baseline comorbidities, age, and patient conditions at admission have been associated with poorer outcomes. Supervised machine learning techniques are increasingly widespread in clinical medicine and can predict mortality and test associations, reaching high predictive performance. We assessed the performance of a machine learning approach to predict mortality in COVID-19 patients admitted to ICU using data from the Lombardy ICU Network. Methods: This is a secondary analysis of prospectively collected data from the Lombardy ICU Network. To predict survival at 7, 14 and 28 days we built two different models; model A included patient demographics, medications before admission and comorbidities, while model B also included data from the first day after ICU admission. 10-fold cross validation was repeated 2500 times, to ensure optimal hyperparameter choice. The only constraint imposed on model optimization was the choice of logistic regression as the final layer to increase clinical interpretability. Different imputation and over-sampling techniques were employed in model training. Results: 1503 patients were included, with 766 deaths (51%). Exploratory analysis and Kaplan-Meier curves demonstrated mortality association with age and gender. Models A and B reached the greatest predictive performance at 28 days (AUC 0.77 and 0.79), with lower performance at 14 days (AUC 0.72 and 0.74) and 7 days (AUC 0.68 and 0.71). Male gender, age and number of comorbidities were strongly associated with mortality in both models. Among comorbidities, chronic kidney disease and chronic obstructive pulmonary disease demonstrated association. Mode of ventilatory assistance at ICU admission and fraction of inspired oxygen were associated with mortality in model B. Conclusions: Supervised machine learning models demonstrated good performance in prediction of 28-day mortality. 7-day and 14-day predictions demonstrated lower performance. Machine learning techniques may be useful in emergency phases to reach higher predictive performance with reduced human supervision using complex data.
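
The AUC values quoted in this abstract can be read as the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor. A minimal sketch of that pairwise definition follows; the function name `roc_auc` and the toy scores are hypothetical and unrelated to the study data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties between a positive and a negative score count as half a correct pair."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


# Hypothetical outcomes (1 = died) and predicted risks for five patients.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of patients, which is fine for illustration; rank-based implementations compute the same quantity more efficiently on large cohorts.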


2019 ◽  
Vol 14 (6) ◽  
pp. 670-690 ◽  
Author(s):  
Ajeet Singh ◽  
Anurag Jain

Credit card fraud is one of the flip sides of the digital world, where transactions are made without the knowledge of the genuine user. Based on a study of various papers published between 1994 and 2018 on credit card fraud, the following objectives are achieved: the various types of credit card fraud are identified; adaptive machine learning techniques (AMLTs) for detecting these frauds automatically are studied, and their pros and cons are summarized. The various datasets used in the literature are studied and categorized into real and synthesized datasets. The performance metrics and evaluation criteria used to evaluate fraud detection systems are summarized. This study also covers an in-depth analysis and comparison of the performance (i.e. sensitivity, specificity, and accuracy) of existing machine learning techniques in the credit card fraud detection area. The findings of this study clearly show that supervised learning, card-not-present fraud, skimming fraud, and the website cloning method have been used more frequently. This study helps new researchers by discussing the limitations of existing fraud detection techniques and providing helpful directions for research in the credit card fraud detection field.


2021 ◽  
Vol 28 ◽  
Author(s):  
Martina Veit-Acosta ◽  
Walter Filgueira de Azevedo Junior

Background: CDK2 participates in the control of eukaryotic cell-cycle progression. Due to the great interest in CDK2 for drug development and the relative ease of crystallizing this enzyme, we have over 400 structural studies focused on this protein target. These structural data are the basis for the development of computational models to estimate CDK2-ligand binding affinity. Objective: This work focuses on the recent developments in the application of supervised machine learning modeling to develop scoring functions to predict the binding affinity of CDK2. Method: We employed the structures available at the Protein Data Bank and the ligand information accessed from BindingDB, Binding MOAD, and PDBbind to evaluate the predictive performance of machine learning techniques combined with physical modeling used to calculate binding affinity. We compared this hybrid methodology with classical scoring functions available in docking programs. Results: Our comparative analysis of previously published models indicated that a model created using a combination of a mass-spring system and cross-validated Elastic Net to predict the binding affinity of CDK2-inhibitor complexes outperformed classical scoring functions available in AutoDock4 and AutoDock Vina. Conclusion: All studies reviewed here suggest that targeted machine learning models are superior to classical scoring functions for calculating binding affinities. Specifically for CDK2, we see that the combination of physical modeling with supervised machine learning techniques exhibits improved predictive performance in calculating the protein-ligand binding affinity. These results find theoretical support in the application of the concept of scoring function space.
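
For readers unfamiliar with Elastic Net, its objective combines a least-squares fit with both $\ell_1$ and $\ell_2$ penalties. The parametrization below is one common convention and is shown as an assumption; the abstract does not specify which form or which features were used. Here $X$ is an assumed $n \times p$ matrix of descriptors for $n$ complexes, $y$ the measured binding affinities, $\lambda \ge 0$ the overall penalty strength, and $\alpha \in [0,1]$ the mixing weight, both typically chosen by cross-validation.

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)
```

Setting $\alpha = 1$ recovers the lasso and $\alpha = 0$ recovers ridge regression, so the cross-validated choice of $\alpha$ decides how sparse the resulting scoring function is.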


Materials ◽  
2020 ◽  
Vol 13 (23) ◽  
pp. 5570
Author(s):  
Young Min Wie ◽  
Ki Gang Lee ◽  
Kang Hyuck Lee ◽  
Taehoon Ko ◽  
Kang Hoon Lee

The purpose of this study is to experimentally design the drying, calcination, and sintering processes of artificial lightweight aggregates through the orthogonal array, to expand the data using the results, and to model the manufacturing process of lightweight aggregates through machine-learning techniques. The experimental design of the process consisted of $L_{18}(3^{6} \times 6^{1})$, which means that $3^{6} \times 6^{1}$ data can be obtained in 18 experiments using an orthogonal array design. After the experiment, the data were expanded to 486 instances and trained by several machine-learning techniques such as linear regression, random forest, and support vector regression (SVR). We evaluated the predictive performance of machine-learning models by comparing predicted and actual values. As a result, the SVR showed the best performance for predicting measured values. This model also worked well for predictions of untested cases.
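
As a worked check on the size of the design, reading it as an L18 orthogonal array with six three-level factors and one six-level factor gives the following count of full-factorial combinations, compared with the 18 runs actually performed. This arithmetic is an illustration added here and does not come from the paper.

```latex
3^{6} \times 6^{1} = 729 \times 6 = 4374 \quad \text{(full-factorial combinations)}, \qquad \frac{4374}{18} = 243
```

In other words, the orthogonal array samples the factor space with roughly one run for every 243 possible factor combinations.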

