Assessing the Calibration in Toxicological in Vitro Models with Conformal Prediction

Abstract Machine learning methods are widely used in drug discovery and toxicity prediction. While showing overall good performance in cross-validation studies, their predictive power (often) drops in cases where the query samples have drifted from the training data’s descriptor space. Thus, the assumption for applying machine learning algorithms, that training and test data stem from the same distribution, might not always be fulfilled. In this work, conformal prediction is used to assess the calibration of the models. Deviations from the expected error may indicate that training and test data originate from different distributions. Exemplified on the Tox21 datasets, composed of chronologically released Tox21Train, Tox21Test and Tox21Score subsets, we observed that while internally valid models could be trained using cross-validation on Tox21Train, predictions on the external Tox21Score data resulted in higher error rates than expected. To improve the prediction on the external sets, a strategy exchanging the calibration set with more recent data, such as Tox21Test, has successfully been introduced. We conclude that conformal prediction can be used to diagnose data drifts and other issues relating to model calibration. The proposed improvement strategy — exchanging the calibration data only — is convenient as it does not require retraining of the underlying model.
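The recalibration idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class-wise (Mondrian) p-values, the nonconformity score `1 - probability`, the helper names, and all toy probabilities below are assumptions chosen for illustration. The underlying model is treated as fixed; only the calibration scores are exchanged.

```python
from bisect import bisect_left


def sample(p1, y):
    """Hypothetical helper: one sample as (class probabilities, true label)."""
    return ({0: 1.0 - p1, 1: p1}, y)


def calibrate(samples):
    """Collect nonconformity scores (1 - probability of the true label) per class."""
    scores = {}
    for probs, y in samples:
        scores.setdefault(y, []).append(1.0 - probs[y])
    return scores


def p_value(cal_scores, score):
    """Share of calibration scores >= `score`, counting the test point itself."""
    ordered = sorted(cal_scores)
    n_ge = len(ordered) - bisect_left(ordered, score)
    return (n_ge + 1) / (len(ordered) + 1)


def prediction_set(cal, probs, significance):
    """Labels whose class-wise p-value exceeds the significance level."""
    return {label for label, p in probs.items()
            if p_value(cal.get(label, []), 1.0 - p) > significance}


def error_rate(cal, samples, significance):
    """Fraction of samples whose true label is missing from the prediction set."""
    misses = sum(y not in prediction_set(cal, probs, significance)
                 for probs, y in samples)
    return misses / len(samples)


# Toy model outputs (assumed): the model is confident on the older calibration
# data and less confident on the more recent, drifted data.
old_cal = calibrate([sample(p, 1) for p in (0.95, 0.9, 0.85, 0.8)] +
                    [sample(p, 0) for p in (0.05, 0.1, 0.15, 0.2)])
new_cal = calibrate([sample(p, 1) for p in (0.7, 0.6, 0.5, 0.4)] +
                    [sample(p, 0) for p in (0.3, 0.4, 0.5, 0.6)])
external = ([sample(p, 1) for p in (0.92, 0.65, 0.55, 0.45)] +
            [sample(p, 0) for p in (0.08, 0.35, 0.45, 0.55)])

err_old = error_rate(old_cal, external, 0.2)  # calibrated on the older data
err_new = error_rate(new_cal, external, 0.2)  # calibration set exchanged
print(err_old, err_new)
```

In this toy setting, an observed error rate well above the chosen significance level signals a mismatch between calibration and external data, and exchanging the calibration scores brings the error rate back down without touching the model.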

Assessing the calibration in toxicological in vitro models with conformal prediction

Journal of Cheminformatics ◽

10.1186/s13321-021-00511-5 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Andrea Morger ◽

Fredrik Svensson ◽

Staffan Arvidsson McShane ◽

Niharika Gauraha ◽

Ulf Norinder ◽

...

Keyword(s):

Machine Learning ◽

Test Data ◽

Cross Validation ◽

In Vitro Models ◽

Error Rates ◽

Machine Learning Algorithms ◽

Calibration Data ◽

Improvement Strategy ◽

Conformal Prediction

Personalized prediction of live birth prior to the first in vitro fertilization treatment: a machine learning method

Journal of Translational Medicine ◽

10.1186/s12967-019-2062-5 ◽

2019 ◽

Vol 17 (1) ◽

Cited By ~ 10

Author(s):

Jiahui Qiu ◽

Pingping Li ◽

Meng Dong ◽

Xing Xin ◽

Jichun Tan

Keyword(s):

Machine Learning ◽

In Vitro Fertilization ◽

Prediction Model ◽

Live Birth ◽

Cross Validation ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Validation Dataset ◽

Vitro Fertilization

Background: Infertility has become a global health issue, with the number of couples seeking in vitro fertilization (IVF) worldwide continuing to rise. Some couples remain childless after several IVF cycles. Women undergoing IVF face greater risks and financial burden. A model that predicts the chance of live birth prior to the first IVF treatment is needed in clinical practice for patient counselling and for shaping expectations. Methods: Clinical data of 7188 women who underwent their first IVF treatment at the Reproductive Medical Center of Shengjing Hospital of China Medical University during 2014–2018 were retrospectively collected. Machine learning-based models were developed on 70% of the dataset using pre-treatment variables, and prediction performance was evaluated on the remaining 30% using receiver operating characteristic (ROC) analysis and calibration plots. Nested cross-validation was used to obtain an unbiased estimate of the generalization performance of the machine learning algorithms. Results: The XGBoost model achieved an area under the ROC curve of 0.73 on the validation dataset and showed the best calibration compared with the other machine learning algorithms. Nested cross-validation resulted in an average accuracy score of 0.70 ± 0.003 for the XGBoost model. Conclusions: A prediction model based on XGBoost was developed using age, AMH, BMI, duration of infertility, previous live birth, previous miscarriage, previous abortion and type of infertility as predictors. This study might be a promising step towards providing personalized estimates of the cumulative live birth chance of the first complete IVF cycle before treatment.
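The nested cross-validation mentioned in the Methods can be sketched as follows. This is an illustration only: a toy one-dimensional k-nearest-neighbour classifier stands in for XGBoost, and the data, fold scheme, and function names are assumptions, not taken from the study. The point is the structure: the hyperparameter is chosen in an inner loop that only sees the outer training data, and the outer test fold is used once for scoring.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (labels are 0/1)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return 1 if 2 * sum(label for _, label in nearest) > len(nearest) else 0


def accuracy(train, test, k):
    """Fraction of test points classified correctly by k-NN fitted on `train`."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def folds(items, n_folds):
    """Interleaved folds: fold i takes items i, i + n_folds, ..."""
    return [items[i::n_folds] for i in range(n_folds)]


def nested_cv(data, k_grid, outer=3, inner=3):
    """Mean outer-fold accuracy, with k selected by inner cross-validation."""
    outer_folds = folds(data, outer)
    outer_scores = []
    for i, test in enumerate(outer_folds):
        train = [p for j, f in enumerate(outer_folds) if j != i for p in f]
        inner_folds = folds(train, inner)

        def inner_score(k):
            # Average validation accuracy over the inner folds for this k.
            scores = []
            for a, val in enumerate(inner_folds):
                fit = [p for b, f in enumerate(inner_folds) if b != a for p in f]
                scores.append(accuracy(fit, val, k))
            return sum(scores) / len(scores)

        best_k = max(k_grid, key=inner_score)  # tuned without the outer test fold
        outer_scores.append(accuracy(train, test, best_k))
    return sum(outer_scores) / len(outer_scores)


# Toy, well-separated data (assumed): class 0 near 0.0, class 1 near 2.0.
data = [(c * 2.0 + i / 10, c) for i in range(6) for c in (0, 1)]
score = nested_cv(data, k_grid=(1, 3))
print(score)
```

Because model selection never sees the outer test fold, the averaged outer score is a less optimistic estimate than the best inner-loop score would be.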

Prediction of K562 Cells Functional Inhibitors Based on Machine Learning Approaches

Current Pharmaceutical Design ◽

10.2174/1381612825666191107092214 ◽

2020 ◽

Vol 25 (40) ◽

pp. 4296-4302 ◽

Cited By ~ 2

Author(s):

Yuan Zhang ◽

Zhenyan Han ◽

Qian Gao ◽

Xiaoyi Bai ◽

Chi Zhang ◽

...

Keyword(s):

Machine Learning ◽

Inclusion Bodies ◽

Cross Validation ◽

Independent Set ◽

K562 Cells ◽

Machine Learning Algorithms ◽

Learning Approaches ◽

Validation Test ◽

Excess Number ◽

Fold Cross Validation

Background: β-thalassemia is a common monogenic genetic disease that is very harmful to human health. The disease arises from the deletion of or defects in β-globin, which reduces synthesis of the β-globin chain and results in a relative excess of α-chains. The formation of inclusion bodies deposited on the cell membrane decreases the ability of red blood cells to deform, giving rise to a group of hereditary haemolytic diseases caused by massive destruction in the spleen. Methods: In this work, machine learning algorithms were employed to build a prediction model for inhibitors against K562 cells based on 117 inhibitors and 190 non-inhibitors. Results: The overall accuracies (ACC) of a 10-fold cross-validation test and an independent set test using Adaboost were 83.1% and 78.0%, respectively, surpassing Bayes Net, Random Forest, Random Tree, C4.5, SVM, KNN and Bagging. Conclusion: This study indicated that Adaboost could be applied to build a learning model for the prediction of inhibitors against K562 cells.

Computational Methods for Structure-to-Function Analysis of Diet-Derived Catechins-Mediated Targeting of In Vitro Vasculogenic Mimicry

Cancer Informatics ◽

10.1177/11769351211009229 ◽

2021 ◽

Vol 20 ◽

pp. 117693512110092

Author(s):

Abicumaran Uthamacumaran ◽

Narjara Gonzalez Suarez ◽

Abdoulaye Baniré Diallo ◽

Borhane Annabi

Keyword(s):

Machine Learning ◽

Cancer Cells ◽

Structural Changes ◽

Function Analysis ◽

Vasculogenic Mimicry ◽

Machine Learning Algorithms ◽

Emergent Behavior ◽

Molecular Signature ◽

Ovarian Cancer Cells

Background: Vasculogenic mimicry (VM) is an adaptive biological phenomenon wherein cancer cells spontaneously self-organize into 3-dimensional (3D) branching network structures. This emergent behavior is considered central in promoting an invasive, metastatic, and therapy-resistant molecular signature in cancer cells. The quantitative analysis of such complex phenotypic systems could require the use of computational approaches, including machine learning algorithms originating from complexity science. Procedures: In vitro 3D VM was performed with SKOV3 and ES2 ovarian cancer cells cultured on Matrigel. Disruption of VM by diet-derived catechins was monitored at 24 hours with pictures taken with an inverted microscope. Three computational algorithms for complex feature extraction relevant for 3D VM, namely 2D wavelet analysis, fractal dimension, and percolation clustering scores, were assessed in combination with machine learning classifiers. Results: These algorithms demonstrated the structure-to-function impact of the galloyl moiety on VM for each of the gallated catechins tested, and were shown to be applicable in quantifying the drug-mediated structural changes in VM processes. Conclusions: Our study provides evidence of how appropriate 3D VM compression and feature extractors coupled with classification/regression methods could be efficient for studying in vitro drug-induced perturbation of complex processes. Such approaches could be exploited in the development and characterization of drugs targeting VM.

Research on Parallel Support Vector Machine Based on Spark Big Data Platform

Scientific Programming ◽

10.1155/2021/7998417 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Yao Huimin

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Big Data ◽

Support Vector Machines ◽

Cross Validation ◽

Machine Learning Algorithms ◽

Support Vector ◽

Lambda Architecture ◽

Vector Machines ◽

Data Platform

With the development of cloud computing and distributed cluster technology, the concept of big data has been expanded and extended in terms of capacity and value, and machine learning technology has also received unprecedented attention in recent years. Traditional machine learning algorithms cannot solve the problem of effective parallelization, so a parallelized support vector machine based on the Spark big data platform is proposed. Firstly, the big data platform is designed with the Lambda architecture, which is divided into three layers: Batch Layer, Serving Layer, and Speed Layer. Secondly, in order to improve the training efficiency of support vector machines on large-scale data, the “special points” other than support vectors are considered when merging two support vector machines, that is, the points where the nonsupport vectors in one subset violate the training results of the other subset, and a cross-validation merging algorithm is proposed. Then, a parallelized support vector machine based on cross-validation is proposed, and the parallelization process of the support vector machine is realized on the Spark platform. Finally, experiments on different datasets verify the effectiveness and stability of the proposed method. Experimental results show that the proposed parallelized support vector machine has outstanding performance in speed-up ratio, training time, and prediction accuracy.

Computational Models Using Multiple Machine Learning Algorithms for Predicting Drug Hepatotoxicity with the DILIrank Dataset

10.20944/preprints202002.0178.v1 ◽

2020 ◽

Author(s):

Robert Ancuceanu ◽

Marilena Viorica Hovanet ◽

Adriana Iuliana Anghel ◽

Florentina Furtunescu ◽

Monica Neagu ◽

...

Keyword(s):

Machine Learning ◽

Liver Injury ◽

Computational Models ◽

Liver Toxicity ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Drug Induced ◽

Reference Drug ◽

Drug Induced Liver Injury

Drug-induced liver injury (DILI) remains one of the challenges in the safety profile of both authorized drugs and candidate drugs, and predicting hepatotoxicity from the chemical structure of a substance remains a challenge worth pursuing, one that is also consistent with the current tendency to replace non-clinical tests with in vitro or in silico alternatives. In 2016 a group of researchers from the FDA published an improved annotated list of drugs with respect to their DILI risk, constituting “the largest reference drug list ranked by the risk for developing drug-induced liver injury in humans”, DILIrank. This paper is one of the few attempting to predict liver toxicity using the DILIrank dataset. Molecular descriptors were computed with the Dragon 7.0 software, and a variety of feature selection and machine learning algorithms were implemented in the R computing environment. Nested (double) cross-validation was used to externally validate the models selected. A total of 78 models with reasonable performance were selected and stacked through several approaches, including the building of multiple meta-models. The performance of the stacked models was slightly superior to other models published. The models were applied in a virtual screening exercise on over 100,000 compounds from the ZINC database and about 20% of them were predicted to be non-hepatotoxic.

PREDICTION AND ANALYSIS OF GEOMECHANICAL PROPERTIES OF JIMUSAER SHALE USING A MACHINE LEARNING APPROACH

10.30632/spwla-2021-0089 ◽

2021 ◽

Author(s):

Lianteng Song ◽

Zhonghua Liu ◽

Chaoliu Li ◽

Congqian Ning ◽

...

Keyword(s):

Machine Learning ◽

Cross Validation ◽

Gamma Ray ◽

Short Term Memory ◽

Machine Learning Algorithms ◽

Training Data ◽

Sequential Data ◽

Log Data ◽

Geomechanical Properties ◽

Single Well

Geomechanical properties are essential for safe drilling, successful completion, and exploration of both conventional and unconventional reservoirs, e.g. deep shale gas and shale oil. Typically, these properties can be calculated from sonic logs. However, in shale reservoirs, it is time-consuming and challenging to obtain reliable logging data due to borehole complexity and lack of information, which often results in log deficiency and a high recovery cost for incomplete datasets. In this work, we propose using bidirectional long short-term memory (BiLSTM), a supervised neural network algorithm that has been widely used in sequential data-based prediction, to estimate geomechanical parameters. The prediction from log data can be conducted in two different ways: 1) single-well prediction, in which the log data from a single well are divided into training data and testing data for cross-validation; and 2) cross-well prediction, in which a group of wells from the same geographical region is likewise divided into a training set and a testing set for cross-validation. The logs used in this work were collected from 11 wells in the Jimusaer Shale and include gamma ray, bulk density, and resistivity, among others. We employed five different machine learning algorithms for comparison, among which BiLSTM showed the best performance, with an R-squared of more than 90% and an RMSE of less than 10. The predicted results can be used directly to calculate geomechanical properties, whose accuracy is also improved compared with conventional methods.

Evaluating explorative prediction power of machine learning algorithms for materials discovery using k-fold forward cross-validation

Computational Materials Science ◽

10.1016/j.commatsci.2019.109203 ◽

2020 ◽

Vol 171 ◽

pp. 109203 ◽

Cited By ~ 26

Author(s):

Zheng Xiong ◽

Yuxin Cui ◽

Zhonghao Liu ◽

Yong Zhao ◽

Ming Hu ◽

...

Keyword(s):

Machine Learning ◽

Cross Validation ◽

Learning Algorithms ◽

Machine Learning Algorithms

A Comparative Study of Machine Learning Algorithms in Predicting Severe Complications after Bariatric Surgery

Journal of Clinical Medicine ◽

10.3390/jcm8050668 ◽

2019 ◽

Vol 8 (5) ◽

pp. 668 ◽

Cited By ~ 17

Author(s):

Yang Cao ◽

Xin Fang ◽

Johan Ottosson ◽

Erik Näslund ◽

Erik Stenberg

Keyword(s):

Machine Learning ◽

Bariatric Surgery ◽

Test Data ◽

Imbalanced Data ◽

Preoperative Assessment ◽

High Accuracy ◽

Machine Learning Algorithms ◽

Global Public Health ◽

Bariatric Surgical Procedure ◽

A Minor

Background: Severe obesity is a global public health threat of growing proportions. Accurate models to predict severe postoperative complications could be of value in the preoperative assessment of potential candidates for bariatric surgery. So far, traditional statistical methods have failed to produce high accuracy. We aimed to find a useful machine learning (ML) algorithm to predict the risk of severe complications after bariatric surgery. Methods: We trained and compared 29 supervised ML algorithms using information from 37,811 patients who underwent a bariatric surgical procedure between 2010 and 2014 in Sweden. The algorithms were then tested on 6250 patients operated on in 2015. We applied the synthetic minority oversampling technique to address the issue that only 3% of patients experienced severe complications. Results: Most of the ML algorithms showed high accuracy (>90%) and specificity (>90%) in both the training and test data. However, none of the algorithms achieved an acceptable sensitivity in the test data. We also tried to tune the hyperparameters of the algorithms to maximize sensitivity, but did not identify one with a sensitivity high enough to be used in clinical practice in bariatric surgery. However, a minor but perceptible improvement was found with the deep neural network (NN). Conclusion: In predicting severe postoperative complications among bariatric surgery patients, ensemble algorithms outperform base algorithms. Compared with other ML algorithms, the deep NN has the potential to improve accuracy and deserves further investigation. The oversampling technique should be considered in the context of imbalanced data, where the number of outcomes of interest is relatively small.
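The gap between high accuracy and poor sensitivity reported here follows from the class imbalance, as a short worked example shows. The counts below are hypothetical (1000 patients with a 3% event rate, matching only the prevalence stated in the abstract), and the "classifier" is a trivial one that predicts no complication for everyone.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical cohort: 30 of 1000 patients (3%) have a severe complication.
# A classifier that always predicts "no complication" finds none of them.
always_negative = metrics(tp=0, fp=0, tn=970, fn=30)
print(always_negative)
```

Under these assumed counts, accuracy and specificity look excellent while sensitivity is zero, which is why accuracy alone is a weak criterion for rare outcomes.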

Computational Models Using Multiple Machine Learning Algorithms for Predicting Drug Hepatotoxicity with the DILIrank Dataset

International Journal of Molecular Sciences ◽

10.3390/ijms21062114 ◽

2020 ◽

Vol 21 (6) ◽

pp. 2114

Author(s):

Robert Ancuceanu ◽

Marilena Viorica Hovanet ◽

Adriana Iuliana Anghel ◽

Florentina Furtunescu ◽

Monica Neagu ◽

...

Keyword(s):

Machine Learning ◽

Liver Injury ◽

Computational Models ◽

Liver Toxicity ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Drug Induced ◽

Reference Drug ◽

Drug Induced Liver Injury

Drug-induced liver injury (DILI) remains one of the challenges in the safety profile of both authorized and candidate drugs, and predicting hepatotoxicity from the chemical structure of a substance remains a task worth pursuing. Such an approach is coherent with the current tendency for replacing non-clinical tests with in vitro or in silico alternatives. In 2016, a group of researchers from the FDA published an improved annotated list of drugs with respect to their DILI risk, constituting “the largest reference drug list ranked by the risk for developing drug-induced liver injury in humans” (DILIrank). This paper is one of the few attempting to predict liver toxicity using the DILIrank dataset. Molecular descriptors were computed with the Dragon 7.0 software, and a variety of feature selection and machine learning algorithms were implemented in the R computing environment. Nested (double) cross-validation was used to externally validate the models selected. A total of 78 models with reasonable performance were selected and stacked through several approaches, including the building of multiple meta-models. The performance of the stacked models was slightly superior to other models published. The models were applied in a virtual screening exercise on over 100,000 compounds from the ZINC database and about 20% of them were predicted to be non-hepatotoxic.
