In silico prediction of chemical neurotoxicity using machine learning

Changsheng Jiang; Piaopiao Zhao; Weihua Li; Yun Tang; Guixia Liu

doi:10.1093/toxres/tfaa016

In silico prediction of chemical neurotoxicity using machine learning

Toxicology Research ◽

10.1093/toxres/tfaa016 ◽

2020 ◽

Vol 9 (3) ◽

pp. 164-172

Author(s):

Changsheng Jiang ◽

Piaopiao Zhao ◽

Weihua Li ◽

Yun Tang ◽

Guixia Liu

Keyword(s):

Machine Learning ◽

Regression Models ◽

Cross Validation ◽

Prediction Models ◽

Drug Withdrawal ◽

Molecular Descriptors ◽

Computational Prediction ◽

Machine Learning Algorithms ◽

Training Set ◽

Data Set

Abstract Neurotoxicity is one of the main causes of drug withdrawal, and the biological experimental methods of detecting neurotoxic toxicity are time-consuming and laborious. In addition, the existing computational prediction models of neurotoxicity still have some shortcomings. In response to these shortcomings, we collected a large number of data set of neurotoxicity and used PyBioMed molecular descriptors and eight machine learning algorithms to construct regression prediction models of chemical neurotoxicity. Through the cross-validation and test set validation of the models, it was found that the extra-trees regressor model had the best predictive effect on neurotoxicity (${q}_{\mathrm{test}}^2$ = 0.784). In addition, we get the applicability domain of the models by calculating the standard deviation distance and the lever distance of the training set. We also found that some molecular descriptors are closely related to neurotoxicity by calculating the contribution of the molecular descriptors to the models. Considering the accuracy of the regression models, we recommend using the extra-trees regressor model to predict the chemical autonomic neurotoxicity.

Download Full-text

Predicting Loan Approval of Bank Direct Marketing Data Using Ensemble Machine Learning Algorithms

International Journal of Circuits, Systems and Signal Processing ◽

10.46300/9106.2020.14.117 ◽

2020 ◽

Vol 14 ◽

Keyword(s):

Machine Learning ◽

Prediction Model ◽

Prediction Models ◽

Machine Learning Algorithms ◽

Decision Makers ◽

Machine Learning Techniques ◽

Data Set ◽

Ensemble Machine Learning ◽

Marketing Data ◽

Loan Approval

The Bank Marketing data set at Kaggle is mostly used in predicting if bank clients will subscribe a long-term deposit. We believe that this data set could provide more useful information such as predicting whether a bank client could be approved for a loan. This is a critical choice that has to be made by decision makers at the bank. Building a prediction model for such high-stakes decision does not only require high model prediction accuracy, but also needs a reasonable prediction interpretation. In this research, different ensemble machine learning techniques have been deployed such as Bagging and Boosting. Our research results showed that the loan approval prediction model has an accuracy of 83.97%, which is approximately 25% better than most state-of-the-art other loan prediction models found in the literature. As well, the model interpretation efforts done in this research was able to explain a few critical cases that the bank decision makers may encounter; therefore, the high accuracy of the designed models was accompanied with a trust in prediction. We believe that the achieved model accuracy accompanied with the provided interpretation information are vitally needed for decision makers to understand how to maintain balance between security and reliability of their financial lending system, while providing fair credit opportunities to their clients.

Download Full-text

Prediction of Cardiac Arrest in the Emergency Department Based on Machine Learning and Sequential Characteristics: Model Development and Retrospective Clinical Validation Study (Preprint)

10.2196/preprints.15932 ◽

2019 ◽

Author(s):

Sungjun Hong ◽

Sungjoo Lee ◽

Jeonghoon Lee ◽

Won Chul Cha ◽

Kyunga Kim

Keyword(s):

Machine Learning ◽

Emergency Department ◽

Cardiac Arrest ◽

Prediction Model ◽

Prediction Models ◽

Characteristic Curve ◽

Machine Learning Algorithms ◽

Clinical Usefulness ◽

Class Imbalance Problem ◽

Data Set

BACKGROUND The development and application of clinical prediction models using machine learning in clinical decision support systems is attracting increasing attention. OBJECTIVE The aims of this study were to develop a prediction model for cardiac arrest in the emergency department (ED) using machine learning and sequential characteristics and to validate its clinical usefulness. METHODS This retrospective study was conducted with ED patients at a tertiary academic hospital who suffered cardiac arrest. To resolve the class imbalance problem, sampling was performed using propensity score matching. The data set was chronologically allocated to a development cohort (years 2013 to 2016) and a validation cohort (year 2017). We trained three machine learning algorithms with repeated 10-fold cross-validation. RESULTS The main performance parameters were the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). The random forest algorithm (AUROC 0.97; AUPRC 0.86) outperformed the recurrent neural network (AUROC 0.95; AUPRC 0.82) and the logistic regression algorithm (AUROC 0.92; AUPRC=0.72). The performance of the model was maintained over time, with the AUROC remaining at least 80% across the monitored time points during the 24 hours before event occurrence. CONCLUSIONS We developed a prediction model of cardiac arrest in the ED using machine learning and sequential characteristics. The model was validated for clinical usefulness by chronological visualization focused on clinical usability.

Download Full-text

A Review of Regression Models in Machine Learning

10.51682/jiscom.00202005.2021 ◽

2021 ◽

Vol 2 (2) ◽

pp. 40-47

Author(s):

Sunil Kumar ◽

Vaibhav Bhatnagar

Keyword(s):

Machine Learning ◽

Regression Analysis ◽

Regression Model ◽

Regression Models ◽

Machine Learning Algorithms ◽

Data Sets ◽

Analysis Model ◽

Data Set ◽

Data Regression ◽

Different Types

Machine learning is one of the active fields and technologies to realize artificial intelligence (AI). The complexity of machine learning algorithms creates problems to predict the best algorithm. There are many complex algorithms in machine learning (ML) to determine the appropriate method for finding regression trends, thereby establishing the correlation association in the middle of variables is very difficult, we are going to review different types of regressions used in Machine Learning. There are mainly six types of regression model Linear, Logistic, Polynomial, Ridge, Bayesian Linear and Lasso. This paper overview the above-mentioned regression model and will try to find the comparison and suitability for Machine Learning. A data analysis prerequisite to launch an association amongst the innumerable considerations in a data set, association is essential for forecast and exploration of data. Regression Analysis is such a procedure to establish association among the datasets. The effort on this paper predominantly emphases on the diverse regression analysis model, how they binning to custom in context of different data sets in machine learning. Selection the accurate model for exploration is the most challenging assignment and hence, these models considered thoroughly in this study. In machine learning by these models in the perfect way and thru accurate data set, data exploration and forecast can provide the maximum exact outcomes.

Download Full-text

Prediction of Cardiac Arrest in the Emergency Department Based on Machine Learning and Sequential Characteristics: Model Development and Retrospective Clinical Validation Study

JMIR Medical Informatics ◽

10.2196/15932 ◽

2020 ◽

Vol 8 (8) ◽

pp. e15932

Author(s):

Sungjun Hong ◽

Sungjoo Lee ◽

Jeonghoon Lee ◽

Won Chul Cha ◽

Kyunga Kim

Keyword(s):

Machine Learning ◽

Emergency Department ◽

Cardiac Arrest ◽

Prediction Model ◽

Prediction Models ◽

Characteristic Curve ◽

Machine Learning Algorithms ◽

Clinical Usefulness ◽

Class Imbalance Problem ◽

Data Set

Background The development and application of clinical prediction models using machine learning in clinical decision support systems is attracting increasing attention. Objective The aims of this study were to develop a prediction model for cardiac arrest in the emergency department (ED) using machine learning and sequential characteristics and to validate its clinical usefulness. Methods This retrospective study was conducted with ED patients at a tertiary academic hospital who suffered cardiac arrest. To resolve the class imbalance problem, sampling was performed using propensity score matching. The data set was chronologically allocated to a development cohort (years 2013 to 2016) and a validation cohort (year 2017). We trained three machine learning algorithms with repeated 10-fold cross-validation. Results The main performance parameters were the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). The random forest algorithm (AUROC 0.97; AUPRC 0.86) outperformed the recurrent neural network (AUROC 0.95; AUPRC 0.82) and the logistic regression algorithm (AUROC 0.92; AUPRC=0.72). The performance of the model was maintained over time, with the AUROC remaining at least 80% across the monitored time points during the 24 hours before event occurrence. Conclusions We developed a prediction model of cardiac arrest in the ED using machine learning and sequential characteristics. The model was validated for clinical usefulness by chronological visualization focused on clinical usability.

Download Full-text

A machine learning approach to aerosol classification for single-particle mass spectrometry

Atmospheric Measurement Techniques ◽

10.5194/amt-11-5687-2018 ◽

2018 ◽

Vol 11 (10) ◽

pp. 5687-5699 ◽

Cited By ~ 8

Author(s):

Costa D. Christopoulos ◽

Sarvesh Garimella ◽

Maria A. Zawadowicz ◽

Ottmar Möhler ◽

Daniel J. Cziczo

Keyword(s):

Machine Learning ◽

Mass Spectrometry ◽

Single Particle ◽

Classification Accuracy ◽

Machine Learning Algorithms ◽

Particle Mass ◽

Training Set ◽

Data Set ◽

Analytical Technique ◽

Single Particle Mass Spectrometry

Abstract. Compositional analysis of atmospheric and laboratory aerosols is often conducted via single-particle mass spectrometry (SPMS), an in situ and real-time analytical technique that produces mass spectra on a single-particle basis. In this study, classifiers are created using a data set of SPMS spectra to automatically differentiate particles on the basis of chemistry and size. Machine learning algorithms build a predictive model from a training set for which the aerosol type associated with each mass spectrum is known a priori. Our primary focus surrounds the growing of random forests using feature selection to reduce dimensionality and the evaluation of trained models with confusion matrices. In addition to classifying ∼20 unique, but chemically similar, aerosol types, models were also created to differentiate aerosol within four broader categories: fertile soils, mineral/metallic particles, biological particles, and all other aerosols. Differentiation was accomplished using ∼40 positive and negative spectral features. For the broad categorization, machine learning resulted in a classification accuracy of ∼93 %. Classification of aerosols by specific type resulted in a classification accuracy of ∼87 %. The “trained” model was then applied to a “blind” mixture of aerosols which was known to be a subset of the training set. Model agreement was found on the presence of secondary organic aerosol, coated and uncoated mineral dust, and fertile soil.

Download Full-text

Dynamic Analytics and Forecasting Model for Covid-19 Using Machine Learning Algorithms

Webology ◽

10.14704/web/v18si05/web18302 ◽

2021 ◽

Vol 18 (05) ◽

pp. 1212-1225

Author(s):

Siva C ◽

Maheshwari K.G ◽

Nalinipriya G ◽

Priscilla Mary J

Keyword(s):

Machine Learning ◽

Real Time ◽

Hierarchical Clustering ◽

Clustering Algorithm ◽

Prediction Models ◽

Vital Role ◽

Machine Learning Algorithms ◽

World Health ◽

Medical Emergency ◽

Data Set

In our day to day life, the availability of correctly labelled data as well as handling of categorical data are mostly acknowledged as two main challenges in dynamic analysis. Therefore, clustering techniques are applied on unlabelled data to group them in accordance with the homogeneity. There are many prediction methods that are being popularly used in handling forecasting problems in real time environment. The outbreak of coronavirus disease (COVID19)-2019 creates the need for a medical emergency of worldwide concern with a rapidly high danger of open out and strike the entire world. Recently, the ML prediction models were used in many real time applications which necessitate the identification and categorization for real time environment. In medical field Prediction models are vital role to obtain observations of spread and significances of infectious diseases. Machine learning related forecasting mechanisms have showed their importance to develop the decision making on the upcoming course of actions. The K-means algorithm and hierarchy were applied directly on the renewed dataset using R programming language to create the covid patient cluster. Confirmed Covid patients count are passed to Prophet package, then the prophet model has been created. This forecasts model predicts the future covid count, which is essential for the clinical and healthcare leaders to make the appropriate measures in advance. The results of the experiments indicate that the quality of Hierarchical clustering outperforms than the K-Means clustering algorithm in the structured structured dataset. Thus, the prediction model also used to support model predictions help for the officials to take timely actions and make decisions to contain the COVID-19 dilemma. This work concludes Hierarchical clustering algorithm is the best model for clustering the covid data set obtained from world health organization (WHO).

Download Full-text

Clinical Classifiers to Identify Ascending Aortic Dilatation in Patients With Bicuspid Versus Tricuspid Aortic Valves

10.21203/rs.3.rs-957446/v1 ◽

2021 ◽

Author(s):

Bamba Gaye ◽

Maxime Vignac ◽

Jesper R. Gådin ◽

Magalie Ladouceur ◽

Kenneth Caidahl ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Aortic Valve ◽

Regression Models ◽

Prediction Models ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Multidimensional Data ◽

Aortic Dilatation ◽

Logistic Regression Models

Abstract Objective: We aimed to develop clinical classifiers to identify prevalent ascending aortic dilatation in patients with BAV and tricuspid aortic valve (TAV). Methods: This study included BAV (n=543) and TAV (n=491) patients with aortic valve disease and/or ascending aortic dilatation but devoid of coronary artery disease undergoing cardiothoracic surgery. We applied machine learning algorithms and classic logistic regression models, using multiple variable selection methodologies to identify predictors of high risk of ascending aortic dilatation (ascending aorta with a diameter above 40 mm). Analyses included comprehensive multidimensional data (i.e., valve morphology, clinical data, family history of cardiovascular diseases, prevalent diseases, demographic, lifestyle and medication). Results: BAV patients were younger (60.4±12.4 years) than TAV patients (70.4±9.1 years), and had a higher frequency of aortic dilatation (45.3% vs. 28.9% for BAV and TAV, respectively. P<0.001). The aneurysm prediction models showed mean AUC values above 0.8 for TAV patients, with the absence of aortic stenosis being the main predictor, followed by diabetes and high sensitivity C-Reactive Protein. Using the same clinical measures in BAV patients our prediction model resulted in AUC values between 0.5-0.55, not useful for prediction of aortic dilatation. The classification results were consistent for all machine learning algorithms and classic logistic regression models. Conclusions: Cardiovascular risk profiles appear to be more predictive of aortopathy in TAV patients than in patients with BAV. This adds evidence to the fact that BAV- and TAV-associated aortopathy involve different pathways to aneurysm formation and highlights the need for specific aneurysm preventions in these patients. Further, our results highlight that machine learning approaches do not outperform classical prediction methods in addressing complex interactions and non-linear relations between variables.

Download Full-text

A Novel Machine Learning Based in silico Pathogenicity Predictor for Missense Variants in a Hematological Setting

Blood ◽

10.1182/blood-2019-128488 ◽

2019 ◽

Vol 134 (Supplement_1) ◽

pp. 2090-2090 ◽

Cited By ~ 1

Author(s):

Stephan Hutter ◽

Constance Baer ◽

Wencke Walter ◽

Wolfgang Kern ◽

Claudia Haferlach ◽

...

Keyword(s):

Machine Learning ◽

In Silico ◽

Cross Validation ◽

Training Set ◽

Employment Equity ◽

Pathogenic Potential ◽

Data Set ◽

Routine Diagnostics ◽

Germline Variants ◽

Equity Ownership

Background: Interpreting the pathogenic potential of an amino-acid changing single nucleotide variant (SNV) in a disease related gene can be challenging, especially for rare variants for which little or no information is available in clinical databases. In silico predictors, tools that predict the functional impact of an SNV algorithmically, can be useful in this scenario, and guidelines for variant interpretation recommend their inclusion in the interpretation process. Resources such as the dbNSFP database, which contains pre-calculated prediction scores for dozens of different algorithms, are readily available today. However, individual predictors rarely come to the same conclusion, and even for well-known disease causing SNVs results can be heterogeneous or even contradictory, which complicates their interpretation. Ensemble predictors such as REVEL, MetaLR/SVM or CADD combine the knowledge/information from multiple individual sources. These predictors use machine learning methods and training sets of pre-defined pathogenic and benign SNVs to integrate individual algorithms into a single, easy to interpret score. However, current training sets are based on pathogenic germline variants, which might cause these predictors to underperform when testing somatic variants. Aim: Development of HePPy (Hematological Predictor of Pathogenicity), an ensemble in silico predictor trained on somatic disease causing variants for use in a hematological setting. Methods: We followed the approach laid out by REVEL and used 10 in silico predictor scores and 4 phylogenetic conservation scores from the dbNSFP data base to train a random forest model. Our training set consisted of 371 unique missense SNVs from 61 hematologically relevant genes that were recurrently identified (in at least 10 patients) during routine diagnostics. All were consistently and unambiguously characterized by hematological experts as either a pathogenic somatic variant (n = 268) or a benign germline variant (n = 103) using a rigorous manual classification process within a data set of 69,879 cases studied between 2005 and 2018. Model accuracy was assessed by 10-fold cross-validation and further evaluated using a test data set consisting of 335 rare missense SNVs from routine diagnostics for which control germline material (buccal swabs, finger nail clippings) from the respective patients was available. Variants originating in the germline were expected to be mainly benign (n = 123), while somatic variants were considered pathogenic (n = 212). We compared the performance of this new tool to REVEL, MetaLR/SVM, CADD and the popular individual predictors SIFT and Polyphen2 by generating receiver operating characteristic (ROC) curves and calculating the area under the curve (AUC). Model implementation and analysis was performed using the R libraries "randomForest", "caret" and "pROC". Results: HePPy scores range from 0 (benign) to 1 (pathogenic) and cross-validation on the training set indicates a high accuracy of 0.968, which is also reflected by the clear separation in the distribution of obtained scores for benign and pathogenic training SNVs (see figure B). Application of the model to the test data set of rare SNVs shows that HePPy (AUC = 0.873) outperforms all other prediction tools in separating germline from somatic variants (see figure A). Surprisingly, both MetaLR (AUC = 0.717) and MetaSVM (AUC = 0.703) performed worse than the individual predictors SIFT (AUC = 0.794) and Polyphen2 (AUC = 0.821), while CADD (AUC = 0.831) and REVEL (AUC = 0.850) showed better performance. HePPy scores for somatic test variants were heavily skewed towards very high values (mean = 0.917). Germline variants had significantly lower scores (mean = 0.466), but their distribution was much more uniform than for somatic variants (see figure C). This suggests, to consider a significant proportion of the rare germline variants to have pathogenic potential. This is in line with the growing awareness of pathogenic germline variants and familial predisposition and emphasizes the importance of in silico predictions and other tools to replace the simple "tumor vs. normal" comparison. Summary: We developed HePPy, a new in silico ensemble predictor that is trained on 371 well-defined hematopathological somatic missense variants, which outperforms other currently available methods for in silico prediction in a hematological setting. Figure Disclosures Hutter: MLL Munich Leukemia Laboratory: Employment. Baer:MLL Munich Leukemia Laboratory: Employment. Walter:MLL Munich Leukemia Laboratory: Employment. Kern:MLL Munich Leukemia Laboratory: Employment, Equity Ownership. Haferlach:MLL Munich Leukemia Laboratory: Employment, Equity Ownership. Haferlach:MLL Munich Leukemia Laboratory: Employment, Equity Ownership.

Download Full-text

The Mathematical Analysis and Classification Research of an Iris Data Set Using Binary Tree and Grey Relation Grade

Journal of Physics Conference Series ◽

10.1088/1742-6596/2068/1/012004 ◽

2021 ◽

Vol 2068 (1) ◽

pp. 012004

Author(s):

Chiang Ling Feng

Keyword(s):

Machine Learning ◽

Binary Tree ◽

Learning Algorithm ◽

Machine Learning Algorithms ◽

Training Set ◽

Data Set ◽

Petal Length ◽

Sepal Length ◽

Grey Relation ◽

Iris Data

Abstract The data from an Iris flower database is studied. The Iris database is the most commonly used database for machine learning algorithms. The Iris database was developed by Ronald Aylmer Fisher in 1936. The Iris database has 150 records in three categories: Iris Sentosa, Iris Versicolor and Iris Virginic. The database has four attributes: sepal length, sepal width, petal length and petal width. For the machine learning algorithm, 150 Iris flower databases are used. Of the 150 Iris in the Iris database, 80% are used as the training set and the remaining 20% Iris as the test set. In machine learning, to perform classification and discrimination is a complicated and difficult thing. In this study, a grey relation grade is used to extract the main features of the Iris flower and a Binary Tree [1] is used to classify the Irises. The results show that for the same specific attributes, grey relation grade extracts the main attributes and can be used in combination with a binary for classification.

Download Full-text

Prediction of K562 Cells Functional Inhibitors Based on Machine Learning Approaches

Current Pharmaceutical Design ◽

10.2174/1381612825666191107092214 ◽

2020 ◽

Vol 25 (40) ◽

pp. 4296-4302 ◽

Cited By ~ 2

Author(s):

Yuan Zhang ◽

Zhenyan Han ◽

Qian Gao ◽

Xiaoyi Bai ◽

Chi Zhang ◽

...

Keyword(s):

Machine Learning ◽

Inclusion Bodies ◽

Cross Validation ◽

Independent Set ◽

K562 Cells ◽

Machine Learning Algorithms ◽

Learning Approaches ◽

Validation Test ◽

Excess Number ◽

Fold Cross Validation

Background: β thalassemia is a common monogenic genetic disease that is very harmful to human health. The disease arises is due to the deletion of or defects in β-globin, which reduces synthesis of the β-globin chain, resulting in a relatively excess number of α-chains. The formation of inclusion bodies deposited on the cell membrane causes a decrease in the ability of red blood cells to deform and a group of hereditary haemolytic diseases caused by massive destruction in the spleen. Methods: In this work, machine learning algorithms were employed to build a prediction model for inhibitors against K562 based on 117 inhibitors and 190 non-inhibitors. Results: The overall accuracy (ACC) of a 10-fold cross-validation test and an independent set test using Adaboost were 83.1% and 78.0%, respectively, surpassing Bayes Net, Random Forest, Random Tree, C4.5, SVM, KNN and Bagging. Conclusion: This study indicated that Adaboost could be applied to build a learning model in the prediction of inhibitors against K526 cells.

Download Full-text