Lessons and tips for designing a machine learning study using EHR data

Abstract Machine learning (ML) provides the ability to examine massive datasets and uncover patterns within data without relying on a priori assumptions such as specific variable associations, linearity in relationships, or prespecified statistical interactions. However, the application of ML to healthcare data has been met with mixed results, especially when using administrative datasets such as the electronic health record. The black box nature of many ML algorithms contributes to an erroneous assumption that these algorithms can overcome major data issues inherent in large administrative healthcare data. As with other research endeavors, good data and analytic design is crucial to ML-based studies. In this paper, we will provide an overview of common misconceptions for ML, the corresponding truths, and suggestions for incorporating these methods into healthcare research while maintaining a sound study design.

Download Full-text

Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data

Proceedings of the Workshop on Human-In-the-Loop Data Analytics - HILDA'19 ◽

10.1145/3328519.3329126 ◽

2019 ◽

Author(s):

Sergey Redyuk ◽

Sebastian Schelter ◽

Tammo Rukat ◽

Volker Markl ◽

Felix Biessmann

Keyword(s):

Machine Learning ◽

Black Box ◽

Learning Models ◽

Unseen Data ◽

Machine Learning Models

Download Full-text

MODES: model-based optimization on distributed embedded systems

Machine Learning ◽

10.1007/s10994-021-06014-6 ◽

2021 ◽

Author(s):

Junjie Shi ◽

Jiang Bian ◽

Jakob Richter ◽

Kuan-Hsun Chen ◽

Jörg Rahnenführer ◽

...

Keyword(s):

Machine Learning ◽

Embedded Systems ◽

Learning Model ◽

Black Box ◽

Distributed Embedded Systems ◽

Data Set ◽

Individual Model ◽

Model Based ◽

Machine Learning Model ◽

Distributed Machine Learning

AbstractThe predictive performance of a machine learning model highly depends on the corresponding hyper-parameter setting. Hence, hyper-parameter tuning is often indispensable. Normally such tuning requires the dedicated machine learning model to be trained and evaluated on centralized data to obtain a performance estimate. However, in a distributed machine learning scenario, it is not always possible to collect all the data from all nodes due to privacy concerns or storage limitations. Moreover, if data has to be transferred through low bandwidth connections it reduces the time available for tuning. Model-Based Optimization (MBO) is one state-of-the-art method for tuning hyper-parameters but the application on distributed machine learning models or federated learning lacks research. This work proposes a framework $$\textit{MODES}$$ MODES that allows to deploy MBO on resource-constrained distributed embedded systems. Each node trains an individual model based on its local data. The goal is to optimize the combined prediction accuracy. The presented framework offers two optimization modes: (1) $$\textit{MODES}$$ MODES -B considers the whole ensemble as a single black box and optimizes the hyper-parameters of each individual model jointly, and (2) $$\textit{MODES}$$ MODES -I considers all models as clones of the same black box which allows it to efficiently parallelize the optimization in a distributed setting. We evaluate $$\textit{MODES}$$ MODES by conducting experiments on the optimization for the hyper-parameters of a random forest and a multi-layer perceptron. The experimental results demonstrate that, with an improvement in terms of mean accuracy ($$\textit{MODES}$$ MODES -B), run-time efficiency ($$\textit{MODES}$$ MODES -I), and statistical stability for both modes, $$\textit{MODES}$$ MODES outperforms the baseline, i.e., carry out tuning with MBO on each node individually with its local sub-data set.

Download Full-text

Predicting optimal deep brain stimulation parameters for Parkinson’s disease using functional MRI and machine learning

Nature Communications ◽

10.1038/s41467-021-23311-9 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Alexandre Boutet ◽

Radhika Madhavan ◽

Gavin J. B. Elias ◽

Suresh E. Joel ◽

Robert Gramer ◽

...

Keyword(s):

Machine Learning ◽

Parkinson’S Disease ◽

Parkinson's Disease ◽

Deep Brain Stimulation ◽

Brain Stimulation ◽

A Priori ◽

Brain Response ◽

Brain Responses ◽

Clinical Benefits ◽

Deep Brain

AbstractCommonly used for Parkinson’s disease (PD), deep brain stimulation (DBS) produces marked clinical benefits when optimized. However, assessing the large number of possible stimulation settings (i.e., programming) requires numerous clinic visits. Here, we examine whether functional magnetic resonance imaging (fMRI) can be used to predict optimal stimulation settings for individual patients. We analyze 3 T fMRI data prospectively acquired as part of an observational trial in 67 PD patients using optimal and non-optimal stimulation settings. Clinically optimal stimulation produces a characteristic fMRI brain response pattern marked by preferential engagement of the motor circuit. Then, we build a machine learning model predicting optimal vs. non-optimal settings using the fMRI patterns of 39 PD patients with a priori clinically optimized DBS (88% accuracy). The model predicts optimal stimulation settings in unseen datasets: a priori clinically optimized and stimulation-naïve PD patients. We propose that fMRI brain responses to DBS stimulation in PD patients could represent an objective biomarker of clinical response. Upon further validation with additional studies, these findings may open the door to functional imaging-assisted DBS programming.

Download Full-text

How to fool a black box machine learning based side-channel security evaluation

Cryptography and Communications ◽

10.1007/s12095-021-00479-x ◽

2021 ◽

Author(s):

Charles-Henry Bertrand Van Ouytsel ◽

Olivier Bronchain ◽

Gaëtan Cassiers ◽

François-Xavier Standaert

Keyword(s):

Machine Learning ◽

Black Box ◽

Security Evaluation ◽

Side Channel

Download Full-text

Machine Learning for Statistical Modeling

ACM Transactions on Design Automation of Electronic Systems ◽

10.1145/3440014 ◽

2021 ◽

Vol 26 (3) ◽

pp. 1-17

Author(s):

Urmimala Roy ◽

Tanmoy Pramanik ◽

Subhendu Roy ◽

Avhishek Chatterjee ◽

Leonard F. Register ◽

...

Keyword(s):

Machine Learning ◽

Time Distribution ◽

Switching Time ◽

A Priori ◽

Random Access ◽

Spin Transfer Torque ◽

Support Vector ◽

Micromagnetic Simulations ◽

Physical Systems ◽

Computational Resources

We propose a methodology to perform process variation-aware device and circuit design using fully physics-based simulations within limited computational resources, without developing a compact model. Machine learning (ML), specifically a support vector regression (SVR) model, has been used. The SVR model has been trained using a dataset of devices simulated a priori, and the accuracy of prediction by the trained SVR model has been demonstrated. To produce a switching time distribution from the trained ML model, we only had to generate the dataset to train and validate the model, which needed ∼500 hours of computation. On the other hand, if 10 6 samples were to be simulated using the same computation resources to generate a switching time distribution from micromagnetic simulations, it would have taken ∼250 days. Spin-transfer-torque random access memory (STTRAM) has been used to demonstrate the method. However, different physical systems may be considered, different ML models can be used for different physical systems and/or different device parameter sets, and similar ends could be achieved by training the ML model using measured device data.

Download Full-text

Prediction of activity and selectivity profiles of human Carbonic Anhydrase inhibitors using machine learning classification models

Journal of Cheminformatics ◽

10.1186/s13321-021-00499-y ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Annachiara Tinivella ◽

Luca Pinzi ◽

Giulio Rastelli

Keyword(s):

Machine Learning ◽

Carbonic Anhydrase ◽

A Priori ◽

Selective Inhibition ◽

Great Promise ◽

Classification Models ◽

Machine Learning Classification ◽

Central Interest ◽

Human Carbonic Anhydrase ◽

In Silico Models

AbstractThe development of selective inhibitors of the clinically relevant human Carbonic Anhydrase (hCA) isoforms IX and XII has become a major topic in drug research, due to their deregulation in several types of cancer. Indeed, the selective inhibition of these two isoforms, especially with respect to the homeostatic isoform II, holds great promise to develop anticancer drugs with limited side effects. Therefore, the development of in silico models able to predict the activity and selectivity against the desired isoform(s) is of central interest. In this work, we have developed a series of machine learning classification models, trained on high confidence data extracted from ChEMBL, able to predict the activity and selectivity profiles of ligands for human Carbonic Anhydrase isoforms II, IX and XII. The training datasets were built with a procedure that made use of flexible bioactivity thresholds to obtain well-balanced active and inactive classes. We used multiple algorithms and sampling sizes to finally select activity models able to classify active or inactive molecules with excellent performances. Remarkably, the results herein reported turned out to be better than those obtained by models built with the classic approach of selecting an a priori activity threshold. The sequential application of such validated models enables virtual screening to be performed in a fast and more reliable way to predict the activity and selectivity profiles against the investigated isoforms.

Download Full-text

Explainable AI: A Review of Machine Learning Interpretability Methods

Entropy ◽

10.3390/e23010018 ◽

2020 ◽

Vol 23 (1) ◽

pp. 18

Author(s):

Pantelis Linardatos ◽

Vasilis Papastefanopoulos ◽

Sotiris Kotsiantis

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Black Box ◽

Learning Systems ◽

Model Complexity ◽

Learning Models ◽

New Methods ◽

Industrial Adoption ◽

Machine Learning Models ◽

The Way

Recent advances in artificial intelligence (AI) have led to its widespread industrial adoption, with machine learning systems demonstrating superhuman performance in a significant number of tasks. However, this surge in performance, has often been achieved through increased model complexity, turning such systems into “black box” approaches and causing uncertainty regarding the way they operate and, ultimately, the way that they come to decisions. This ambiguity has made it problematic for machine learning systems to be adopted in sensitive yet critical domains, where their value could be immense, such as healthcare. As a result, scientific interest in the field of Explainable Artificial Intelligence (XAI), a field that is concerned with the development of new methods that explain and interpret machine learning models, has been tremendously reignited over recent years. This study focuses on machine learning interpretability methods; more specifically, a literature review and taxonomy of these methods are presented, as well as links to their programming implementations, in the hope that this survey would serve as a reference point for both theorists and practitioners.

Download Full-text

On the Unfounded Enthusiasm for Soft Selective Sweeps III: The Supervised Machine Learning Algorithm That Isn’t

Genes ◽

10.3390/genes12040527 ◽

2021 ◽

Vol 12 (4) ◽

pp. 527

Author(s):

Eran Elhaik ◽

Dan Graur

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

A Priori ◽

Neutral Theory ◽

Dominant Mode ◽

Supervised Machine Learning ◽

Training Dataset ◽

Selective Sweeps ◽

Two Factors ◽

Negative Controls

In the last 15 years or so, soft selective sweep mechanisms have been catapulted from a curiosity of little evolutionary importance to a ubiquitous mechanism claimed to explain most adaptive evolution and, in some cases, most evolution. This transformation was aided by a series of articles by Daniel Schrider and Andrew Kern. Within this series, a paper entitled “Soft sweeps are the dominant mode of adaptation in the human genome” (Schrider and Kern, Mol. Biol. Evolut. 2017, 34(8), 1863–1877) attracted a great deal of attention, in particular in conjunction with another paper (Kern and Hahn, Mol. Biol. Evolut. 2018, 35(6), 1366–1371), for purporting to discredit the Neutral Theory of Molecular Evolution (Kimura 1968). Here, we address an alleged novelty in Schrider and Kern’s paper, i.e., the claim that their study involved an artificial intelligence technique called supervised machine learning (SML). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known empirically to be true. Curiously, Schrider and Kern did not possess a training dataset of genomic segments known a priori to have evolved either neutrally or through soft or hard selective sweeps. Thus, their claim of using SML is thoroughly and utterly misleading. In the absence of legitimate training datasets, Schrider and Kern used: (1) simulations that employ many manipulatable variables and (2) a system of data cherry-picking rivaling the worst excesses in the literature. These two factors, in addition to the lack of negative controls and the irreproducibility of their results due to incomplete methodological detail, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., S/HIC) should be taken with a huge shovel of salt.

Download Full-text