Detecting the impact of subject characteristics on machine learning-based diagnostic applications

2019 ◽  
Vol 2 (1) ◽  
Author(s):  
Elias Chaibub Neto ◽  
Abhishek Pratap ◽  
Thanneer M. Perumal ◽  
Meghasyam Tummalacherla ◽  
Phil Snyder ◽  
...  

Abstract Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications, including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using training and test sets generated from multiple repeated measures collected across a set of individuals. However, the inclusion of repeated measurements is not always appropriately taken into account in analytical evaluations of predictive performance. Assigning repeated measurements from each individual to both the training and the test sets (“record-wise” data split) is a common practice and can lead to massive underestimation of the prediction error due to the presence of “identity confounding.” In essence, these models learn to identify subjects in addition to the diagnostic signal. Here, we present a method that can be used to effectively calculate the amount of identity confounding learned by classifiers developed using a record-wise data split. By applying this method to several real datasets, we demonstrate that identity confounding is a serious issue in digital health studies and that record-wise data splits for machine learning-based applications should be avoided.
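
To illustrate the record-wise versus subject-wise contrast, the following sketch (synthetic data and a generic classifier; not the authors' permutation-based method) shows how a record-wise split lets a model exploit subject identity and inflate its apparent accuracy:

    # Record-wise vs. subject-wise splits on repeated measures.
    # Data generation and model choice are illustrative assumptions.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import GroupShuffleSplit, train_test_split

    rng = np.random.default_rng(0)
    n_subjects, n_records, n_features = 40, 20, 10

    # Each subject has an idiosyncratic feature offset (identity signal)
    # plus a weak diagnostic shift tied to the label.
    subject_ids = np.repeat(np.arange(n_subjects), n_records)
    labels = np.repeat(rng.integers(0, 2, n_subjects), n_records)
    offsets = rng.normal(0, 2, (n_subjects, n_features))
    X = offsets[subject_ids] + 0.3 * labels[:, None] + rng.normal(0, 1, (labels.size, n_features))

    def accuracy(train_idx, test_idx):
        clf = RandomForestClassifier(random_state=0).fit(X[train_idx], labels[train_idx])
        return accuracy_score(labels[test_idx], clf.predict(X[test_idx]))

    # Record-wise: the same subjects appear in both train and test.
    rw_train, rw_test = train_test_split(np.arange(labels.size), test_size=0.3, random_state=0)
    # Subject-wise: all of a subject's records stay on one side only.
    sw_train, sw_test = next(GroupShuffleSplit(test_size=0.3, random_state=0).split(X, labels, groups=subject_ids))

    print("record-wise accuracy :", accuracy(rw_train, rw_test))   # inflated by identity confounding
    print("subject-wise accuracy:", accuracy(sw_train, sw_test))   # closer to true generalization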

2021 ◽  
Author(s):  
Sebastian Johannes Fritsch ◽  
Konstantin Sharafutdinov ◽  
Moein Einollahzadeh Samadi ◽  
Gernot Marx ◽  
Andreas Schuppert ◽  
...  

BACKGROUND: During the course of the COVID-19 pandemic, a variety of machine learning models were developed to predict different aspects of the disease, such as long-term course, organ dysfunction, or ICU mortality. The number of training datasets used has increased significantly over time. However, these data now come from different waves of the pandemic, which did not always involve the same therapeutic approaches, and outcomes changed between waves. The impact of these changes on model development has not yet been studied.

OBJECTIVE: The aim of this investigation was to examine the predictive performance of models trained on data from one wave when applied to data from the other wave, and the impact of pooling these datasets. Finally, a method for comparing different datasets for heterogeneity is introduced.

METHODS: We used two datasets, from waves one and two, to develop several models predicting patient mortality. Four classification algorithms were used: logistic regression (LR), support vector machine (SVM), random forest classifier (RF), and AdaBoost classifier (ADA). We also performed mutual prediction, applying each model to the data of the wave not used for its training. We then compared the performance of models trained on a pooled dataset from both waves. The populations of the two waves were checked for heterogeneity using a convex hull analysis.

RESULTS: 63 patients from wave one (03-06/2020) and 54 from wave two (08/2020-01/2021) were evaluated. For each wave separately, we found models reaching sufficient accuracies of up to 0.79 AUROC (95% CI 0.76-0.81) for SVM on the first wave and up to 0.88 AUROC (95% CI 0.86-0.89) for RF on the second wave. After pooling the data, the AUROC decreased markedly. In the mutual prediction, models trained on the second wave's data, when applied to the first wave's data, predicted non-survivors well but classified survivors insufficiently. The opposite setting (training: first wave, test: second wave) showed the inverse behaviour, with models correctly classifying survivors and incorrectly predicting non-survivors. The convex hull analysis showed that the first and second wave populations were distributed more inhomogeneously than randomly selected sets of patients of the same size.

CONCLUSIONS: Our work demonstrates that a larger dataset is not a universal solution to all machine learning problems in clinical settings. Rather, it shows that inhomogeneous data used to develop models can lead to serious problems. With the convex hull analysis, we offer a solution to this problem: its outcome can raise concerns that pooling different datasets would introduce inhomogeneous patterns and prevent better predictive performance.
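
The abstract does not detail the convex hull procedure; one plausible minimal sketch, assuming a 2-D PCA projection and a comparison of hull volumes against random same-size subsets of the pooled data, is:

    # Sketch of a convex hull heterogeneity check. The PCA projection and
    # the volume comparison are assumptions; the authors' exact procedure
    # may differ. Placeholder data stands in for the patient features.
    import numpy as np
    from scipy.spatial import ConvexHull
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X_wave1 = rng.normal(0.0, 1.0, (63, 20))   # placeholder for wave-one features
    X_wave2 = rng.normal(0.5, 1.2, (54, 20))   # placeholder for wave-two features

    Z = PCA(n_components=2).fit_transform(np.vstack([X_wave1, X_wave2]))
    z1, z2 = Z[:len(X_wave1)], Z[len(X_wave1):]

    hull_volume = lambda pts: ConvexHull(pts).volume   # area in 2-D
    random_vols = [hull_volume(Z[rng.choice(len(Z), size=len(X_wave1), replace=False)])
                   for _ in range(1000)]               # reference distribution

    print("wave one hull:", hull_volume(z1))
    print("wave two hull:", hull_volume(z2))
    print("random subsets: %.2f +/- %.2f" % (np.mean(random_vols), np.std(random_vols)))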


2021 ◽  
Vol 25 (5) ◽  
pp. 1073-1098
Author(s):  
Nor Hamizah Miswan ◽  
Chee Seng Chan ◽  
Chong Guan Ng

Hospital readmission is a major cost for healthcare systems worldwide. If patients with a higher potential for readmission could be identified from the start, existing resources could be used more efficiently and appropriate plans could be implemented to reduce the risk of readmission. It is therefore important to identify the right target patients. Medical data is usually noisy, incomplete, and inconsistent, so before developing a prediction model it is crucial to set up the predictive pipeline carefully so that improved predictive performance can be achieved. The current study analyses the impact of different preprocessing methods on the performance of different machine learning classifiers. The preprocessing methods applied in previous hospital readmission studies were compared, and the most common approaches were highlighted: missing value imputation, feature selection, data balancing, and feature scaling. Hyperparameters were selected using Bayesian optimisation. The different preprocessing pipelines were assessed in terms of various performance metrics and computational costs. The results indicated that the preprocessing approaches helped improve the model's prediction of hospital readmission.
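
As a sketch of one such preprocessing-plus-classifier pipeline tuned with Bayesian optimisation (the steps, search space, and the use of scikit-optimize are illustrative assumptions, not the study's exact configuration):

    # One candidate preprocessing pipeline with Bayesian hyperparameter
    # search. All choices below are illustrative assumptions.
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from skopt import BayesSearchCV
    from skopt.space import Categorical, Integer, Real

    pipe = Pipeline([
        ("impute", SimpleImputer()),         # missing value imputation
        ("scale", StandardScaler()),         # feature scaling
        ("select", SelectKBest(f_classif)),  # feature selection
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),  # balancing via class weights
    ])

    search = BayesSearchCV(
        pipe,
        {
            "impute__strategy": Categorical(["mean", "median"]),
            "select__k": Integer(5, 50),     # must not exceed the feature count
            "clf__C": Real(1e-3, 1e2, prior="log-uniform"),
        },
        n_iter=30, cv=5, scoring="roc_auc",
    )
    # search.fit(X, y)   # X, y: hypothetical readmission features and labels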


2022 ◽  
Vol 9 (1) ◽  
pp. 0-0

This article investigates the impact of data-complexity and team-specific characteristics on machine learning competition scores. Data from five real-world binary classification competitions hosted on Kaggle.com were analyzed. The data-complexity characteristics were measured in four aspects: standard measures, sparsity measures, class imbalance measures, and feature-based measures. The results showed that the higher the level of data complexity, the lower the predictive ability of the machine learning model. Our empirical evidence revealed that the imbalance ratio of the target variable was the most important factor and exhibited a nonlinear relationship with the model's predictive ability: the imbalance ratio adversely affected predictive performance once it reached a certain level. However, mixed results were found for the impact of team-specific characteristics, measured by team size, team expertise, and the number of submissions, on team performance. For high-performing teams, these factors had no impact on team score.
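
For concreteness, the imbalance ratio can be computed as the majority-to-minority class ratio, a common definition that may differ from the article's exact measure:

    # Imbalance ratio of a binary target: majority count / minority count.
    import numpy as np

    def imbalance_ratio(y):
        counts = np.bincount(np.asarray(y).astype(int))
        counts = counts[counts > 0]          # ignore absent classes
        return counts.max() / counts.min()

    y = np.array([0] * 900 + [1] * 100)      # synthetic binary target
    print(imbalance_ratio(y))                # 9.0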


2021 ◽  
Vol 14 (11) ◽  
pp. 2059-2072
Author(s):  
Fatjon Zogaj ◽  
José Pablo Cambronero ◽  
Martin C. Rinard ◽  
Jürgen Cito

Automated machine learning (AutoML) promises to democratize machine learning by automatically generating machine learning pipelines with little to no user intervention. Typically, a search procedure is used to repeatedly generate and validate candidate pipelines, maximizing a predictive performance metric subject to a limited execution time budget. While this approach to generating candidates works well for small tabular datasets, the same procedure does not directly scale to larger tabular datasets with 100,000s of observations, often producing fewer candidate pipelines and yielding lower performance, given the same execution time budget. We carry out an extensive empirical evaluation of the impact that downsampling (reducing the number of rows in the input tabular dataset) has on the pipelines produced by a genetic-programming-based AutoML search for classification tasks.
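
A sketch of such row downsampling ahead of a genetic-programming-based search (TPOT is used here only as a representative AutoML system; sizes and budgets are illustrative assumptions, not the paper's setup):

    # Downsample the rows of a large tabular dataset before an AutoML
    # search so more candidate pipelines fit into a fixed time budget.
    import pandas as pd
    from tpot import TPOTClassifier

    def downsample(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
        """Return at most n_rows rows, sampled uniformly at random."""
        return df if len(df) <= n_rows else df.sample(n=n_rows, random_state=seed)

    # df = pd.read_csv("large_dataset.csv")        # hypothetical input, 100,000s of rows
    # small = downsample(df, n_rows=10_000)
    # X, y = small.drop(columns=["label"]), small["label"]
    # automl = TPOTClassifier(max_time_mins=60, random_state=0)  # fixed budget
    # automl.fit(X, y)   # the search now evaluates more candidates per budget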


2021 ◽  
Vol 50 (Supplement_1) ◽  
Author(s):  
Mari S Oba ◽  
Yoshitaka Murakami ◽  
Michihiro Sato ◽  
Takahisa Murakami ◽  
Noriyuki Iwama ◽  
...  

Abstract Background: Both pre-pregnancy body mass index and total weight gain during pregnancy are known risk factors for perinatal outcomes. However, little is known about whether repeated measurements of gestational weight gain can be utilized in the prediction of perinatal outcomes. We examined whether repeated measures of gestational weight improve the prediction of low infant birthweight.

Methods: Using data from the BOSHI study, we developed prediction models with low infant birthweight (<2500 g) as the outcome and gestational weight gain as the exposure of interest. A prediction model using only baseline values (pre-pregnancy body mass index; Model 1) was compared with a model using baseline values plus the last weight measurement (Model 2) and a model using baseline values plus trimester-specific measurements (Model 3). Model performance was assessed using c-statistics, Brier scores, and calibration plots.

Results: Among women who had full-term deliveries and measured weights, the proportion of low infant birthweights was 5%. The c-statistics of Models 1, 2, and 3 were 0.78, 0.81, and 0.83, respectively. Other assessments were relatively unchanged. The extent of the improvement in predictive performance depends not only on the exposure-outcome associations but also on the correlations among the exposure measurements.

Conclusions: The inclusion of repeated gestational weight measurements in a model for predicting low infant birthweight produced only a marginal improvement in predictive performance.

Key messages: The prediction of low infant birthweight is not substantially improved by using repeated measurements of gestational weight.
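
A minimal sketch of such a model comparison, assuming logistic regression and the feature sets of Models 1-3 (column names are illustrative assumptions):

    # Compare nested prediction models via the c-statistic (AUROC) and
    # Brier score. Feature sets mirror Models 1-3; names are assumptions.
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import brier_score_loss, roc_auc_score

    def c_statistic_and_brier(X_train, y_train, X_test, y_test):
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        p = model.predict_proba(X_test)[:, 1]
        return roc_auc_score(y_test, p), brier_score_loss(y_test, p)

    # model_features = {
    #     "Model 1": ["bmi"],                          # baseline only
    #     "Model 2": ["bmi", "last_weight"],           # + last measurement
    #     "Model 3": ["bmi", "w_t1", "w_t2", "w_t3"],  # + trimester-specific
    # }
    # for name, cols in model_features.items():
    #     print(name, c_statistic_and_brier(train[cols], train["lbw"], test[cols], test["lbw"]))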


2020 ◽  
Vol 142 (6) ◽  
Author(s):  
Michael McCartney ◽  
Matthias Haeringer ◽  
Wolfgang Polifke

Abstract This paper examines and compares commonly used machine learning algorithms in terms of their performance in the interpolation and extrapolation of flame describing functions (FDFs), based on experimental and simulation data. Algorithm performance is evaluated by interpolating and extrapolating FDFs, after which the impact of the errors on the limit cycle amplitudes is evaluated using the extended FDF (xFDF) framework. The best algorithms in interpolation and extrapolation were found to be the widely used cubic spline interpolation and the Gaussian processes (GPs) regressor. The data itself was found to be an important factor in defining the predictive performance of a model; therefore, a method of optimally selecting data points at test time using Gaussian processes was demonstrated. The aim is to allow a minimal number of data points to be collected while still providing enough information to model the FDF accurately. The extrapolation performance was shown to decay very quickly with distance from the domain, so emphasis should be put on selecting measurement points that expand the covered domain. Gaussian processes also give an indication of confidence in their predictions and are used to carry out uncertainty quantification, in order to understand model sensitivities. This was demonstrated through application to the xFDF framework.
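
A minimal sketch of GP interpolation with uncertainty-driven selection of the next measurement point (the kernel, inputs, and placeholder response are illustrative assumptions, not the paper's configuration):

    # Gaussian process regression of an FDF-like response with predictive
    # uncertainty; the next measurement is placed where the posterior
    # standard deviation is largest.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel

    X_train = np.array([[0.1], [0.3], [0.5], [0.9]])   # e.g. forcing amplitudes
    y_train = np.sin(3 * X_train).ravel()              # placeholder FDF gain

    gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
    gp.fit(X_train, y_train)

    X_grid = np.linspace(0, 1, 200).reshape(-1, 1)
    mean, std = gp.predict(X_grid, return_std=True)    # prediction + confidence

    next_point = X_grid[np.argmax(std)]                # least certain location
    print("next measurement at:", float(next_point[0]))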


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Manaf Zargoush ◽  
Alireza Sameh ◽  
Mahdi Javadi ◽  
Siyavash Shabani ◽  
Somayeh Ghazalbash ◽  
...  

Abstract Sepsis is a major public and global health concern. Every hour of delay in detecting sepsis significantly increases the risk of death, highlighting the importance of accurately predicting sepsis in a timely manner. A growing body of literature has examined developing new, or improving existing, machine learning (ML) approaches for timely and accurate predictions of sepsis. This study contributes to this literature by providing clear insights regarding the role of the recency and adequacy of historical information in predicting sepsis using ML. To this end, we implemented a deep learning model using a bidirectional long short-term memory (BiLSTM) algorithm and compared it with six other ML algorithms across numerous combinations of prediction horizons (to capture information recency) and observation windows (to capture information adequacy), using different measures of predictive performance. Our results indicated that the BiLSTM algorithm outperforms all other ML algorithms and provides good separability of the predicted risk of sepsis between septic and non-septic patients. Moreover, decreasing the prediction horizon (in favor of information recency) always boosts predictive performance; however, the impact of expanding the observation window (in favor of information adequacy) depends on the prediction horizon and the purpose of prediction. More specifically, when the prediction is responsive to the positive label (i.e., sepsis), increasing historical data improves predictive performance when the prediction horizon is short to moderate.
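
A minimal sketch of such a BiLSTM classifier over an observation window (layer sizes, window length, and feature count are illustrative assumptions):

    # BiLSTM over an observation window of vitals/labs, predicting sepsis
    # onset within a chosen prediction horizon.
    import tensorflow as tf

    observation_window = 12   # time steps of history (information adequacy)
    n_features = 40           # vitals, labs, demographics per time step

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(observation_window, n_features)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(sepsis within horizon)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auroc")])
    # model.fit(X_windows, y_labels, ...)   # labels shift with the prediction
    #                                       # horizon (information recency)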


2021 ◽  
Vol 25 ◽  
pp. 233121652110661
Author(s):  
Elaheh Shafieibavani ◽  
Benjamin Goudey ◽  
Isabell Kiral ◽  
Peter Zhong ◽  
Antonio Jimeno-Yepes ◽  
...  

While cochlear implants have helped hundreds of thousands of individuals, it remains difficult to predict the extent to which an individual's hearing will benefit from implantation. Several publications indicate that machine learning may improve the predictive accuracy of cochlear implant outcomes compared to classical statistical methods. However, existing studies are limited in terms of model validation and in evaluating the effect of factors like sample size on predictive performance. We conduct a thorough examination of machine learning approaches to predict word recognition scores (WRS) measured approximately 12 months after implantation in adults with post-lingual hearing loss. This is the largest retrospective study of cochlear implant outcomes to date, evaluating 2,489 cochlear implant recipients from three clinics. We demonstrate that while machine learning models significantly outperform linear models in predicting WRS, their overall accuracy remains limited (mean absolute error: 17.9-21.8). The models are robust across clinical cohorts, with predictive error increasing by at most 16% when evaluated on a clinic excluded from the training set. We show that predictive performance is unlikely to improve through increased sample size alone, with a doubling of the sample size estimated to improve performance by only 3% on the combined dataset. Finally, we demonstrate how the current models could support clinical decision making, highlighting that subsets of individuals can be identified who have a 94% chance of improving WRS by at least 10 percentage points after implantation, which is likely to be clinically meaningful. We discuss several implications of this analysis, focusing on the need to improve and standardize data collection.
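
A minimal sketch of the leave-one-clinic-out evaluation described above (model choice and variable names are illustrative assumptions):

    # Leave-one-clinic-out validation of a WRS regressor, reporting the
    # mean absolute error on each held-out clinic.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import LeaveOneGroupOut

    def clinic_holdout_mae(X, y, clinics):
        maes = {}
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=clinics):
            model = GradientBoostingRegressor(random_state=0).fit(X[train_idx], y[train_idx])
            clinic = np.unique(clinics[test_idx])[0]   # the single held-out clinic
            maes[clinic] = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
        return maes

    # maes = clinic_holdout_mae(X, wrs_12_months, clinic_ids)   # hypothetical arrays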

