Quantitative Interpretation Explains Machine Learning Models for Chemical Reaction Prediction and Uncovers Bias

Author(s):  
David Peter Kovacs ◽  
William McCorkindale ◽  
Alpha Lee

Organic synthesis remains a stumbling block in drug discovery. Although a plethora of machine learning models have been proposed as solutions in the literature, they suffer from being opaque black-boxes. It is neither clear if the models are making correct predictions because they inferred the salient chemistry, nor is it clear which training data they are relying on to reach a prediction. This opaqueness hinders both model developers and users. In this paper, we quantitatively interpret the Molecular Transformer, the state-of-the-art model for reaction prediction. We develop a framework to attribute predicted reaction outcomes both to specific parts of reactants, and to reactions in the training set. Furthermore, we demonstrate how to retrieve evidence for predicted reaction outcomes, and understand counterintuitive predictions by scrutinising the data. Additionally, we identify "Clever Hans" predictions where the correct prediction is reached for the wrong reason due to dataset bias. We present a new debiased dataset that provides a more realistic assessment of model performance, which we propose as the new standard benchmark for comparing reaction prediction models.
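
As a rough illustration of the input-attribution idea described above (not necessarily the authors' exact method), one simple approach is occlusion: mask one reactant token at a time and measure how much the model's likelihood of the predicted product drops. The sketch below is a minimal, hypothetical version; `score_product` is a placeholder standing in for any model that scores a product given tokenized reactants.

```python
# Hypothetical occlusion-style attribution for a reaction prediction model.
# `score_product` is a placeholder for any function returning the log-likelihood
# of a product given tokenized reactants (e.g. a sequence-to-sequence model).
from typing import Callable, List

def occlusion_attribution(
    reactant_tokens: List[str],
    product_tokens: List[str],
    score_product: Callable[[List[str], List[str]], float],
    mask_token: str = "<mask>",
) -> List[float]:
    """Return one score per reactant token: the drop in product log-likelihood
    when that token is masked. Larger drops suggest the token matters more."""
    base = score_product(reactant_tokens, product_tokens)
    scores = []
    for i in range(len(reactant_tokens)):
        occluded = reactant_tokens.copy()
        occluded[i] = mask_token
        scores.append(base - score_product(occluded, product_tokens))
    return scores

if __name__ == "__main__":
    # Toy stand-in model: the "Br" token is what drives the prediction.
    toy_score = lambda src, tgt: -1.0 if "Br" in src else -5.0
    print(occlusion_attribution(["c1ccccc1", ".", "Br"], ["c1ccccc1Br"], toy_score))
```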

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Dávid Péter Kovács ◽  
William McCorkindale ◽  
Alpha A. Lee

Abstract Organic synthesis remains a major challenge in drug discovery. Although a plethora of machine learning models have been proposed as solutions in the literature, they suffer from being opaque black-boxes. It is neither clear if the models are making correct predictions because they inferred the salient chemistry, nor is it clear which training data they are relying on to reach a prediction. This opaqueness hinders both model developers and users. In this paper, we quantitatively interpret the Molecular Transformer, the state-of-the-art model for reaction prediction. We develop a framework to attribute predicted reaction outcomes both to specific parts of reactants, and to reactions in the training set. Furthermore, we demonstrate how to retrieve evidence for predicted reaction outcomes, and understand counterintuitive predictions by scrutinising the data. Additionally, we identify Clever Hans predictions where the correct prediction is reached for the wrong reason due to dataset bias. We present a new debiased dataset that provides a more realistic assessment of model performance, which we propose as the new standard benchmark for comparing reaction prediction models.


2020 ◽  
Author(s):  
David Peter Kovacs ◽  
William McCorkindale ◽  
Alpha Lee

Organic synthesis remains a stumbling block in drug discovery. Although a plethora of machine learning models have been proposed as solutions in the literature, they suffer from being opaque black-boxes. It is neither clear if the models are making correct predictions because they inferred the salient chemistry, nor is it clear which training data they are relying on to reach a prediction. This opaqueness hinders both model developers and users. In this paper, we quantitatively interpret the Molecular Transformer, the state-of-the-art model for reaction prediction. We develop a framework to attribute predicted reaction outcomes both to specific parts of reactants, and to reactions in the training set. Furthermore, we demonstrate how to retrieve evidence for predicted reaction outcomes, and understand counterintuitive predictions by scrutinising the data. Additionally, we identify "Clever Hans" predictions where the correct prediction is reached for the wrong reason due to dataset bias. We present a new debiased dataset that provides a more realistic assessment of model performance, which we propose as the new standard benchmark for comparing reaction prediction models.


Author(s):  
Brett J. Borghetti ◽  
Joseph J. Giametta ◽  
Christina F. Rusnock

Objective: We aimed to predict operator workload from neurological data using statistical learning methods to fit neurological-to-state-assessment models. Background: Adaptive systems require real-time mental workload assessment to perform dynamic task allocations or operator augmentation as workload issues arise. Neuroergonomic measures have great potential for informing adaptive systems, and we combine these measures with models of task demand as well as information about critical events and performance to clarify the inherent ambiguity of interpretation. Method: We use machine learning algorithms on electroencephalogram (EEG) input to infer operator workload based upon Improved Performance Research Integration Tool workload model estimates. Results: Cross-participant models predict the workload of other participants, statistically distinguishing 62% of the workload changes. Machine learning models trained from Monte Carlo resampled workload profiles can be used in place of deterministic workload profiles for cross-participant modeling without incurring a significant decrease in model performance, suggesting that stochastic models can be used when limited training data are available. Conclusion: We employed a novel temporary scaffold of simulation-generated workload profile truth data during the model-fitting process. A continuous workload profile serves as the target to train our statistical machine learning models. Once trained, the workload profile scaffolding is removed and the trained model is used directly on neurophysiological data in future operator state assessments. Application: These modeling techniques demonstrate how to use neuroergonomic methods to develop operator state assessments, which can be employed in adaptive systems.
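
A minimal sketch of the scaffolding idea described above, under the assumption of generic band-power EEG features and a synthetic workload profile standing in for the simulation-generated truth data; it is not the study's pipeline, only an illustration of fitting a model against a workload profile and then applying it to neurophysiological data alone.

```python
# Illustrative "workload profile scaffolding": fit a regressor from EEG-derived
# features to a simulation-generated workload profile, then use the trained model
# on neurophysiological features alone. All data here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_windows, n_features = 500, 12                     # e.g. band powers per channel
eeg_features = rng.normal(size=(n_windows, n_features))

# Stand-in for a task-model-generated continuous workload profile (the scaffold).
workload_profile = (0.6 * eeg_features[:, 0]
                    - 0.3 * eeg_features[:, 3]
                    + rng.normal(scale=0.2, size=n_windows))

X_tr, X_te, y_tr, y_te = train_test_split(eeg_features, workload_profile, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("Held-out R^2:", round(model.score(X_te, y_te), 3))

# After training, the scaffold is removed: workload is estimated from EEG alone.
print(model.predict(X_te[:5]))
```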


2021 ◽  
Author(s):  
Bruno Barbosa Miranda de Paiva ◽  
Polianna Delfino Pereira ◽  
Claudio Moises Valiense de Andrade ◽  
Virginia Mara Reis Gomes ◽  
Maria Clara Pontello Barbosa Lima ◽  
...  

Objective: To provide a thorough comparative study among state-of-the-art machine learning methods and statistical methods for determining in-hospital mortality in COVID-19 patients using data available upon hospital admission; to study the reliability of the predictions of the most effective methods by correlating the probability of the outcome with the accuracy of the methods; and to investigate how explainable the predictions produced by the most effective methods are. Materials and Methods: De-identified data were obtained from COVID-19 positive patients in 36 participating hospitals, from March 1 to September 30, 2020. Demographic, comorbidity, clinical presentation and laboratory data were used as training data to develop COVID-19 mortality prediction models. Multiple machine learning and traditional statistical models were trained on this prediction task using a cross-validation procedure, from which we assessed performance and interpretability metrics. Results: Stacking of machine learning models improved on the previous state-of-the-art results by more than 26% in predicting the class of interest (death), achieving an AUROC of 87.1% and a macro F1 of 73.9%. We also show that some machine learning models can be very interpretable and reliable, yielding more accurate predictions while providing a good explanation of why they were made. Conclusion: The best results were obtained using the meta-learning ensemble model (Stacking). State-of-the-art explainability techniques such as SHAP values can be used to draw useful insights into the patterns learned by machine learning algorithms. Machine learning models can be more explainable than traditional statistical models while also yielding highly reliable predictions. Keywords: COVID-19; prognosis; prediction model; machine learning
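
For readers unfamiliar with stacking, the sketch below shows the general pattern with scikit-learn: base learners whose out-of-fold predictions feed a meta-learner, evaluated by cross-validated AUROC on synthetic, imbalanced data. The features and models are placeholders, not the study's actual pipeline; SHAP values could then be computed on fitted tree-based base learners to explain individual predictions.

```python
# Generic stacking ensemble for a binary outcome (e.g. in-hospital mortality),
# illustrating the pattern rather than the study's exact models or features.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced tabular data standing in for admission variables.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.85], random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    cv=5,  # out-of-fold predictions are used to train the meta-learner
)

auc = cross_val_score(stack, X, y, cv=5, scoring="roc_auc")
print("Mean AUROC:", round(auc.mean(), 3))
```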


2020 ◽  
Author(s):  
Janmajay Singh ◽  
Masahiro Sato ◽  
Tomoko Ohkuma

BACKGROUND Missing data in electronic health records are inevitable and considered to be nonrandom. Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient’s health and advocate for their inclusion in clinical prediction models. However, their effectiveness has not been comprehensively evaluated. OBJECTIVE The goal of the research is to study the effect of including informative missingness features in machine learning models for various clinically relevant outcomes and to explore the robustness of these features across patient subgroups and task settings. METHODS A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated on various criteria and across population subgroups to assess discriminative ability and calibration. RESULTS Generally improved model performance in retrospective tasks was observed on including missingness features. The extent of improvement depended on the outcome of interest (improvements in the area under the receiver operating characteristic curve [AUROC] ranged from 1.2% to 7.7%) and even on the patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9% difference in AUROC) by the model relying only on pathological features. This was despite leading to earlier detection of disease (true positives), since including these features also led to a concomitant rise in false positive detections. CONCLUSIONS This study comprehensively evaluated the effectiveness of missingness features in machine learning models. A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings, especially for administrative tasks like length-of-stay prediction, where they present the greatest benefit. While missingness features, representative of health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further.
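
The sketch below illustrates the general idea of feeding missingness indicators alongside observed values to a gated recurrent unit classifier. Dimensions, imputation choices, and data are synthetic assumptions for illustration; this is not the study's code.

```python
# Sketch: concatenate a 0/1 "observed" mask (missingness features) with the
# zero-imputed values and classify each record with a GRU. Toy dimensions only.
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, n_vars: int, hidden: int = 64):
        super().__init__()
        # Input = observed values (zero-imputed) concatenated with the mask.
        self.gru = nn.GRU(input_size=2 * n_vars, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = torch.cat([values, mask], dim=-1)        # (batch, time, 2 * n_vars)
        _, h = self.gru(x)
        return self.head(h[-1]).squeeze(-1)          # one logit per record

# Toy batch: 8 records, 48 hourly steps, 10 lab/vital variables.
values = torch.randn(8, 48, 10)
mask = (torch.rand(8, 48, 10) > 0.6).float()         # 1 = observed, 0 = missing
values = values * mask                                # zero-impute missing entries

model = GRUClassifier(n_vars=10)
print(model(values, mask).shape)                      # torch.Size([8])
```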


2021 ◽  
Author(s):  
Jian Hu ◽  
Haochang Shou

Objective: The use of wearable sensor devices on a daily basis to track real-time movements during wake and sleep has provided opportunities for automatic sleep quantification using such data. Existing algorithms for classifying sleep stages often require large training data and multiple input signals, including heart rate and respiratory data. We aimed to examine the capability of classifying sleep stages using sensible features computed directly from accelerometer data only, with the aid of advanced recurrent neural networks. Materials and Methods: We analyzed a publicly available dataset with accelerometry data in 5 s epoch length and polysomnography assessments. We developed long short-term memory (LSTM) models that take the 3-axis accelerations, angles, and temperatures from concurrent and historic observation windows to predict wake, REM and non-REM sleep. Leave-one-subject-out experiments were conducted to compare and evaluate the model performance against conventional nonsequential machine learning models using metrics such as multiclass training and testing accuracy, weighted precision, F1 score and area under the curve (AUC). Results: Our sequential analysis framework outperforms traditional non-sequential models on all model evaluation metrics. We achieved an average of 65% and a maximum of 81% validation accuracy for classifying three sleep labels, even with a relatively small training sample of clinical visitors. Two additional derived variables, local variability and range, were shown to strongly improve the model performance. Discussion: Results indicate that it is crucial to account for deep temporal dependency and to assess local variability of the features. The post-hoc analysis of individual model performance by subjects' demographic characteristics also suggests the need to include pathological samples in the training data in order to develop robust machine learning models capable of capturing both normal and anomalous sleep patterns in the population.
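
A minimal sketch of the general approach: derive per-epoch features from 3-axis accelerations (including variability- and range-style features of the kind mentioned above) and classify sleep stage per epoch with an LSTM. The windowing, feature definitions, and data are simplified placeholders, not the study's exact pipeline.

```python
# Illustrative accelerometry-to-sleep-stage pipeline: per-epoch summary features
# followed by an LSTM that outputs one wake/NREM/REM prediction per epoch.
import numpy as np
import torch
import torch.nn as nn

def epoch_features(acc: np.ndarray) -> np.ndarray:
    """acc: (n_epochs, samples_per_epoch, 3) raw accelerations -> (n_epochs, 9)."""
    mean = acc.mean(axis=1)
    local_var = acc.std(axis=1)                       # variability within each epoch
    value_range = acc.max(axis=1) - acc.min(axis=1)   # range within each epoch
    return np.concatenate([mean, local_var, value_range], axis=1)

class SleepLSTM(nn.Module):
    def __init__(self, n_features: int = 9, hidden: int = 32, n_classes: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)      # wake / NREM / REM

    def forward(self, x):                              # x: (batch, epochs, features)
        out, _ = self.lstm(x)
        return self.head(out)                          # one prediction per epoch

acc = np.random.randn(1, 120, 25, 3)                   # 1 night, 120 epochs of 5 s
feats = np.stack([epoch_features(a) for a in acc])
logits = SleepLSTM()(torch.tensor(feats, dtype=torch.float32))
print(logits.shape)                                     # (1, 120, 3)
```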


Diagnostics ◽  
2021 ◽  
Vol 11 (12) ◽  
pp. 2288
Author(s):  
Kaixiang Su ◽  
Jiao Wu ◽  
Dongxiao Gu ◽  
Shanlin Yang ◽  
Shuyuan Deng ◽  
...  

Increasingly, machine learning methods have been applied to aid in diagnosis with good results. However, some complex models can confuse physicians because they are difficult to understand, while data differences across diagnostic tasks and institutions can cause model performance to fluctuate. To address this challenge, we combined the Deep Ensemble Model (DEM) and the tree-structured Parzen estimator (TPE) and proposed an adaptive deep ensemble learning method (TPE-DEM) for dynamically evolving diagnostic task scenarios. Different from previous research that focuses on achieving better performance with a fixed-structure model, our proposed model uses TPE to efficiently aggregate simple models that are more easily understood by physicians and require less training data. In addition, our proposed model can choose the optimal number of layers and the type and number of basic learners to achieve the best performance in different diagnostic task scenarios, based on the data distribution and characteristics of the current diagnostic task. We tested our model on one dataset constructed with a partner hospital and on five UCI public datasets with different characteristics and volumes, based on various diagnostic tasks. Our performance evaluation results show that our proposed model outperforms other baseline models on different datasets. Our study provides a novel approach for building simple and understandable machine learning models in tasks with variable datasets and feature sets, and the findings have important implications for the application of machine learning models in computer-aided diagnosis.
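
To make the TPE idea concrete, the sketch below uses a tree-structured Parzen estimator (via Optuna's TPESampler) to choose the number and type of base learners for a simple ensemble. The search space, objective, and dataset are illustrative assumptions, not the TPE-DEM implementation.

```python
# Illustrative use of a tree-structured Parzen estimator (TPE) to pick the
# number of base learners and their types for a soft-voting ensemble.
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
CANDIDATES = {
    "logreg": lambda: LogisticRegression(max_iter=2000),
    "rf": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": lambda: KNeighborsClassifier(),
}

def objective(trial: optuna.Trial) -> float:
    n_learners = trial.suggest_int("n_learners", 2, 4)
    members = [
        (f"m{i}", CANDIDATES[trial.suggest_categorical(f"type_{i}", list(CANDIDATES))]())
        for i in range(n_learners)
    ]
    ensemble = VotingClassifier(members, voting="soft")
    return cross_val_score(ensemble, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=20)
print(study.best_params)
```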


10.2196/25022 ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. e25022
Author(s):  
Janmajay Singh ◽  
Masahiro Sato ◽  
Tomoko Ohkuma

Background Missing data in electronic health records are inevitable and considered to be nonrandom. Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient’s health and advocate for their inclusion in clinical prediction models. However, their effectiveness has not been comprehensively evaluated. Objective The goal of the research is to study the effect of including informative missingness features in machine learning models for various clinically relevant outcomes and to explore the robustness of these features across patient subgroups and task settings. Methods A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated on various criteria and across population subgroups to assess discriminative ability and calibration. Results Generally improved model performance in retrospective tasks was observed on including missingness features. The extent of improvement depended on the outcome of interest (improvements in the area under the receiver operating characteristic curve [AUROC] ranged from 1.2% to 7.7%) and even on the patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9% difference in AUROC) by the model relying only on pathological features. This was despite leading to earlier detection of disease (true positives), since including these features also led to a concomitant rise in false positive detections. Conclusions This study comprehensively evaluated the effectiveness of missingness features in machine learning models. A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings, especially for administrative tasks like length-of-stay prediction, where they present the greatest benefit. While missingness features, representative of health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further.


Author(s):  
Nghia H Nguyen ◽  
Dominic Picetti ◽  
Parambir S Dulai ◽  
Vipul Jairath ◽  
William J Sandborn ◽  
...  

Abstract Background and Aims There is increasing interest in machine learning-based prediction models in inflammatory bowel diseases (IBD). We synthesized and critically appraised studies comparing machine learning vs. traditional statistical models, using routinely available clinical data for risk prediction in IBD. Methods Through a systematic review up to January 1, 2021, we identified cohort studies that derived and/or validated machine learning models, based on routinely collected clinical data in patients with IBD, to predict the risk of harboring or developing adverse clinical outcomes, and that reported their predictive performance against a traditional statistical model for the same outcome. We appraised the risk of bias in these studies using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). Results We included 13 studies on machine learning-based prediction models in IBD, encompassing themes of predicting treatment response to biologics and thiopurines, predicting longitudinal disease activity and complications, and predicting outcomes in patients with acute severe ulcerative colitis. The most common machine learning models used were tree-based algorithms, which are classification approaches achieved through supervised learning. Machine learning models outperformed traditional statistical models in risk prediction. However, most models were at high risk of bias, and only one was externally validated. Conclusions Machine learning-based prediction models based on routinely collected data generally perform better than traditional statistical models in risk prediction in IBD, though they frequently have a high risk of bias. Future studies examining these approaches are warranted, with special focus on external validation and clinical applicability.
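
To make the kind of head-to-head comparison summarized above concrete, the sketch below contrasts a tree-based model with a traditional logistic regression on the same tabular features using cross-validated AUROC; the data and models are synthetic placeholders, not taken from any of the reviewed studies.

```python
# Minimal head-to-head comparison: tree-based model vs. logistic regression
# on the same features, reported as cross-validated AUROC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for routinely collected clinical variables.
X, y = make_classification(n_samples=2000, n_features=15, n_informative=6, random_state=1)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("gradient-boosted trees", GradientBoostingClassifier(random_state=1)),
]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUROC = {auc.mean():.3f} +/- {auc.std():.3f}")
```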


Author(s):  
Chenxi Huang ◽  
Shu-Xia Li ◽  
César Caraballo ◽  
Frederick A. Masoudi ◽  
John S. Rumsfeld ◽  
...  

Background: New methods such as machine learning techniques have been increasingly used to enhance the performance of risk predictions for clinical decision-making. However, commonly reported performance metrics may not be sufficient to capture the advantages of these newly proposed models for their adoption by health care professionals to improve care. Machine learning models often improve risk estimation for certain subpopulations, and these improvements may be missed by commonly reported metrics. Methods and Results: This article addresses the limitations of commonly reported metrics for performance comparison and proposes additional metrics. Our discussions cover metrics related to overall performance, discrimination, calibration, resolution, reclassification, and model implementation. Models for predicting acute kidney injury after percutaneous coronary intervention are used to illustrate the use of these metrics. Conclusions: We demonstrate that commonly reported metrics may not have sufficient sensitivity to identify improvement of machine learning models and propose the use of a comprehensive list of performance metrics for reporting and comparing clinical risk prediction models.
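
As a small companion to the discussion above, the sketch below reports several complementary metrics for a clinical risk model: discrimination (AUROC), overall probabilistic accuracy (Brier score), and calibration (a reliability curve). The data and model are synthetic placeholders, and the metric set is only a subset of those the article discusses.

```python
# Reporting discrimination, Brier score, and calibration for a risk model.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, weights=[0.9], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("AUROC:", round(roc_auc_score(y_te, probs), 3))            # discrimination
print("Brier score:", round(brier_score_loss(y_te, probs), 3))   # probability accuracy
frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=10)  # calibration
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```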

