Impact of train/test sample regimen on performance estimate stability of machine learning in cardiovascular imaging

As machine learning research in the field of cardiovascular imaging continues to grow, obtaining reliable model performance estimates is critical to develop reliable baselines and compare different algorithms. While the machine learning community has generally accepted methods such as k-fold stratified cross-validation (CV) to be more rigorous than single split validation, the standard research practice in medical fields is the use of single split validation techniques. This is especially concerning given the relatively small sample sizes of datasets used for cardiovascular imaging. We aim to examine how train-test split variation impacts the stability of machine learning (ML) model performance estimates in several validation techniques on two real-world cardiovascular imaging datasets: stratified split-sample validation (70/30 and 50/50 train-test splits), tenfold stratified CV, 10 × repeated tenfold stratified CV, bootstrapping (500 × repeated), and leave one out (LOO) validation. We demonstrate that split validation methods lead to the highest range in AUC and statistically significant differences in ROC curves, unlike the other aforementioned approaches. When building predictive models on relatively small data sets, as is often the case in medical imaging, split-sample validation techniques can produce instability in performance estimates with variations in range over 0.15 in the AUC values, and thus any of the alternate validation methods are recommended.
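
The effect described in this abstract can be reproduced in miniature. The sketch below is not the authors' pipeline: it assumes a synthetic two-feature dataset of 60 samples and a simple centroid-difference linear scorer (both hypothetical choices), and compares how much the AUC estimate moves across 100 random 70/30 stratified splits versus 100 random tenfold stratified partitions.

```python
import random

def auc(labels, scores):
    """AUC via the rank (Mann-Whitney) formulation; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_data(n_per_class, rng):
    """Synthetic data (assumption): feature 0 is shifted by the label, feature 1 is noise."""
    rows = []
    for label in (0, 1):
        for _ in range(n_per_class):
            rows.append(((rng.gauss(label, 1.0), rng.gauss(0.0, 1.0)), label))
    return rows

def by_class_shuffled(data, rng):
    """Return one shuffled list of rows per class, for stratified sampling."""
    groups = []
    for label in (0, 1):
        rows = [r for r in data if r[1] == label]
        rng.shuffle(rows)
        groups.append(rows)
    return groups

def holdout_auc(train, test):
    """Fit a centroid-difference linear scorer on train, report AUC on test."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    w = [sum(p[j] for p in pos) / len(pos) - sum(n[j] for n in neg) / len(neg)
         for j in range(2)]
    scores = [w[0] * x[0] + w[1] * x[1] for x, _ in test]
    return auc([y for _, y in test], scores)

def split_sample_auc(data, test_frac, rng):
    """One stratified split-sample estimate."""
    train, test = [], []
    for rows in by_class_shuffled(data, rng):
        k = round(len(rows) * test_frac)
        test += rows[:k]
        train += rows[k:]
    return holdout_auc(train, test)

def kfold_auc(data, k, rng):
    """One stratified k-fold estimate: the mean of the per-fold AUCs."""
    folds = [[] for _ in range(k)]
    for rows in by_class_shuffled(data, rng):
        for i, row in enumerate(rows):
            folds[i % k].append(row)
    aucs = []
    for i in range(k):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        aucs.append(holdout_auc(train, folds[i]))
    return sum(aucs) / k

rng = random.Random(0)
data = make_data(30, rng)
split_aucs = [split_sample_auc(data, 0.3, rng) for _ in range(100)]
cv_aucs = [kfold_auc(data, 10, rng) for _ in range(100)]
split_range = max(split_aucs) - min(split_aucs)
cv_range = max(cv_aucs) - min(cv_aucs)
print(f"70/30 split: AUC range over 100 splits     = {split_range:.3f}")
print(f"tenfold CV:  AUC range over 100 partitions = {cv_range:.3f}")
```

The dataset and model are the same throughout; only the random partition changes, so any spread in the printed ranges comes from the validation regimen alone.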


LncLocation: Efficient Subcellular Location Prediction of Long Non-Coding RNA-Based Multi-Source Heterogeneous Feature Fusion

International Journal of Molecular Sciences ◽

10.3390/ijms21197271 ◽

2020 ◽

Vol 21 (19) ◽

pp. 7271

Author(s):

Shiyao Feng ◽

Yanchun Liang ◽

Wei Du ◽

Wei Lv ◽

Ying Li

Keyword(s):

Machine Learning ◽

Subcellular Location ◽

Small Sample ◽

Small Data ◽

Location Prediction ◽

Learning Models ◽

Significant Information ◽

Subcellular Location Prediction ◽

Multi Classification ◽

Machine Learning Models

Recent studies show that the subcellular location of long non-coding RNAs (lncRNAs) can provide significant information on their function. Due to the lack of experimental data, the number of lncRNAs with experimentally verified subcellular localization is very limited, and the numbers of lncRNAs located in different organelles are highly imbalanced. The prediction of subcellular location of lncRNAs is therefore a small-sample, imbalanced multi-class classification problem. The imbalance of data results in poor recognition by machine learning models on small data subsets, which is a puzzling and challenging problem in the existing research. In this study, we integrate multi-source features to construct a sequence-based computational tool, lncLocation, to predict the subcellular location of lncRNAs. An autoencoder is used to enhance part of the features, and a binomial distribution-based filtering method and recursive feature elimination (RFE) are used to filter some of the features. This improves the representation ability of the data and mitigates the problem of unbalanced multi-class data. Through comprehensive experiments on different feature combinations and machine learning models, we select the optimal feature and classifier scheme to construct the subcellular location prediction tool, lncLocation. LncLocation obtains 87.78% accuracy using 5-fold cross validation on the benchmark data, which is higher than the state-of-the-art tools, and the classification performance, especially for small classes, is improved significantly.
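
To see why an imbalanced multi-class problem calls for more than overall accuracy, the following sketch computes per-class recall next to accuracy. The class names and predictions are made-up illustrations, not data from the lncLocation benchmark.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical labels: two large classes and two small ones.
y_true = ["cytoplasm"] * 8 + ["nucleus"] * 8 + ["ribosome"] * 2 + ["exosome"] * 2
# Hypothetical predictions: the large classes are perfect, the small ones are not.
y_pred = ["cytoplasm"] * 8 + ["nucleus"] * 8 + ["cytoplasm"] * 2 + ["exosome", "nucleus"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
macro_recall = sum(recall.values()) / len(recall)

print(f"accuracy     = {accuracy:.3f}")
print(f"macro recall = {macro_recall:.3f}")
for name, value in recall.items():
    print(f"  {name:<10} recall = {value:.2f}")
```

In this toy case the accuracy looks acceptable while one small class is never recognized, which is the kind of gap that class-wise reporting exposes.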


Radiomics machine learning study with a small sample size: single random training-test set split may result in unreliable results

10.21203/rs.3.rs-105766/v1 ◽

2020 ◽

Author(s):

Chansik An ◽

Yae Won Park ◽

Sung Soo Ahn ◽

Kyunghwa Han ◽

Hwiyoung Kim ◽

...

Keyword(s):

Machine Learning ◽

Standard Deviation ◽

Sample Size ◽

Model Performance ◽

Small Sample ◽

Operating Characteristics ◽

Simple Task ◽

Test Set ◽

Relative Standard ◽

Test Sets

Objective: To determine how the estimated performance of a machine learning model varies according to how a dataset is split into training and test sets, using brain tumor radiomics data under different conditions. Materials and Methods: Two binary tasks with different levels of difficulty ("simple" task, glioblastoma [GBM, n=109] vs. brain metastasis [n=58]; "difficult" task, low- [n=163] vs. high-grade [n=95] meningiomas) were performed using radiomics features from magnetic resonance imaging (MRI). For each of 1,000 different training-test set splits with a ratio of 7:3, a least absolute shrinkage and selection operator (LASSO) model was trained by 5-fold cross-validation (CV) in the training set and tested in the test set. Model stability and performance were evaluated according to the number of input features (from 1 to 50), the sample size (full vs. undersampled), and the level of difficulty. In addition to 5-fold CV without repetition, three other CV methods were compared: 5-fold CV with 100 repetitions, nested CV, and nested CV with 100 repetitions. Results: The highest mean cross-validated area under the receiver operating characteristic curve (AUC) and the highest stability (lowest AUC difference between training and testing) were achieved with 6 and 13 features for the GBM and meningioma tasks, respectively. For the simple task, simple task with undersampling, difficult task, and difficult task with undersampling, average mean AUCs were 0.947, 0.923, 0.795, and 0.764, and average AUC differences between training and testing were 0.029, 0.054, 0.053, and 0.108, respectively. Among the four CV methods, the most conservative (i.e., lowest AUC and highest relative standard deviation [RSD]) was nested CV with 100 repetitions. Conclusions: A single random split of a dataset into training and test sets may lead to an unreliable report of model performance in radiomics machine learning studies, and reporting the mean and standard deviation of model performance metrics by performing nested and/or repeated CV on the entire dataset is suggested.
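
The reporting recommended in the conclusion (mean, standard deviation, and relative standard deviation of a metric over repeated CV) reduces to a few lines. This is a minimal sketch: the AUC values are hypothetical, and RSD is taken here as the sample standard deviation divided by the mean, expressed in percent.

```python
import statistics

def summarize(aucs):
    """Return (mean, sample SD, relative SD in percent) for a list of AUCs."""
    mean = statistics.mean(aucs)
    sd = statistics.stdev(aucs)
    return mean, sd, 100.0 * sd / mean

# Hypothetical AUCs, e.g. one per repetition of a repeated CV run.
aucs = [0.78, 0.81, 0.74, 0.83, 0.79]
mean, sd, rsd = summarize(aucs)
print(f"AUC = {mean:.3f} ± {sd:.3f} (RSD {rsd:.1f}%)")
```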


Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results

PLoS ONE ◽

10.1371/journal.pone.0256152 ◽

2021 ◽

Vol 16 (8) ◽

pp. e0256152

Author(s):

Chansik An ◽

Yae Won Park ◽

Sung Soo Ahn ◽

Kyunghwa Han ◽

Hwiyoung Kim ◽

...

Keyword(s):

Machine Learning ◽

Test Performance ◽

Small Sample Size ◽

Area Under The Curve ◽

Small Sample ◽

Simple Task ◽

Test Set ◽

Validation Methods ◽

Small Sample Sizes ◽

Set Splitting

This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model and its gap from the test performance under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) “simple” task, glioblastomas [n = 109] vs. brain metastasis [n = 58] and (2) “difficult” task, low- [n = 163] vs. high-grade [n = 95] meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% from these datasets. We performed random training-test set splitting for each dataset repeatedly to create 1,000 different training-test set pairs. For each dataset pair, the least absolute shrinkage and selection operator model was trained and evaluated using various validation methods in the training set, and tested in the test set, using the area under the curve (AUC) as an evaluation metric. The AUCs in training and testing varied among different training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between training and testing was 0.039 (±0.032) for the simple task without undersampling and 0.092 (±0.071) for the difficult task with undersampling. In a training-test set pair with the difficult task without undersampling, for example, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another dataset pair with the same task, however, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and test, or generalization gap, was large, none of the validation methods helped sufficiently reduce the generalization gap. Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies, especially with small sample sizes.
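
As a small worked example, the generalization gap for the two training-test pairs quoted in this abstract can be computed directly. The pair names are labels chosen here for illustration; the AUC values are the ones reported above.

```python
# (training AUC, test AUC) for the two pairs quoted in the abstract.
pairs = {
    "pair A": (0.882, 0.667),  # high in training, much lower in testing
    "pair B": (0.709, 0.911),  # low in training, much higher in testing
}

# Generalization gap taken as training AUC minus test AUC.
gaps = {name: round(train - test, 3) for name, (train, test) in pairs.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:+.3f}")
```

The two gaps have opposite signs and similar magnitude on the same task, which illustrates why one split says little about the direction of the error.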


ORGANIC (1).pdf

10.26434/chemrxiv.5309668.v1 ◽

2017 ◽

Author(s):

Benjamin Sanchez-Lengeling ◽

Carlos Outeiral ◽

Gabriel L. Guimaraes ◽

Alan Aspuru-Guzik

Keyword(s):

Machine Learning ◽

Learning Community ◽

Chemical Species ◽

Material Design ◽

Organic Photovoltaic ◽

Generative Adversarial Networks ◽

Generative Adversarial Network ◽

Adversarial Network ◽

Adversarial Networks ◽

Photovoltaic Material

Molecular discovery seeks to generate chemical species tailored to very specific needs. In this paper, we present ORGANIC, a framework based on Objective-Reinforced Generative Adversarial Networks (ORGAN), capable of producing a distribution over molecular space that matches with a certain set of desirable metrics. This methodology combines two successful techniques from the machine learning community: a Generative Adversarial Network (GAN), to create non-repetitive sensible molecular species, and Reinforcement Learning (RL), to bias this generative distribution towards certain attributes. We explore several applications, from optimization of random physicochemical properties to candidates for drug discovery and organic photovoltaic material design.


Automatically evaluating balance using machine learning and data from a single inertial measurement unit

Journal of NeuroEngineering and Rehabilitation ◽

10.1186/s12984-021-00894-4 ◽

2021 ◽

Vol 18 (1) ◽

Author(s):

Fahad Kamran ◽

Kathryn Harrold ◽

Jonathan Zwier ◽

Wendy Carender ◽

Tian Bao ◽

...

Keyword(s):

Machine Learning ◽

Inertial Measurement Unit ◽

Physical Therapists ◽

Physical Therapist ◽

Model Performance ◽

Machine Learning Techniques ◽

Measurement Unit ◽

Balance Performance ◽

Inertial Measurement ◽

Self Assessments

Background: Recently, machine learning techniques have been applied to data collected from inertial measurement units to automatically assess balance, but rely on hand-engineered features. We explore the utility of machine learning to automatically extract important features from inertial measurement unit data for balance assessment. Findings: Ten participants with balance concerns performed multiple balance exercises in a laboratory setting while wearing an inertial measurement unit on their lower back. Physical therapists watched video recordings of participants performing the exercises and rated balance on a 5-point scale. We trained machine learning models using different representations of the unprocessed inertial measurement unit data to estimate physical therapist ratings. On a held-out test set, we compared these learned models to one another, to participants’ self-assessments of balance, and to models trained using hand-engineered features. Utilizing the unprocessed kinematic data from the inertial measurement unit provided significant improvements over both self-assessments and models using hand-engineered features (AUROC of 0.806 vs. 0.768, 0.665). Conclusions: Unprocessed data from an inertial measurement unit used as input to a machine learning model produced accurate estimates of balance performance. The ability to learn from unprocessed data presents a potentially generalizable approach for assessing balance without the need for labor-intensive feature engineering, while maintaining comparable model performance.


Predicting Future Occurrence of Acute Hypotensive Episodes Using Noninvasive and Invasive Features

Military Medicine ◽

10.1093/milmed/usaa418 ◽

2021 ◽

Vol 186 (Supplement_1) ◽

pp. 445-451

Author(s):

Yifei Sun ◽

Navid Rashedi ◽

Vikrant Vaze ◽

Parikshit Shah ◽

Ryan Halter ◽

...

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Real World ◽

Short Term Memory ◽

Model Performance ◽

Learning Technologies ◽

Machine Learning Algorithms ◽

Support Vector ◽

K Nearest Neighbor ◽

Continuous Map

Introduction: Early prediction of the acute hypotensive episode (AHE) in critically ill patients has the potential to improve outcomes. In this study, we apply different machine learning algorithms to the MIMIC III Physionet dataset, containing more than 60,000 real-world intensive care unit records, to test commonly used machine learning technologies and compare their performances. Materials and Methods: Five classification methods, including K-nearest neighbor, logistic regression, support vector machine, random forest, and a deep learning method called long short-term memory, are applied to predict an AHE 30 minutes in advance. An analysis comparing model performance when including versus excluding invasive features was conducted. To further study the pattern of the underlying mean arterial pressure (MAP), we apply linear regression to predict the continuous MAP values over the next 60 minutes. Results: Support vector machine yields the best performance in terms of recall (84%). Including the invasive features in the classification improves the performance significantly, with both recall and precision increasing by more than 20 percentage points. We were able to predict the MAP 60 minutes in the future with a root mean square error (a frequently used measure of the differences between the predicted values and the observed values) of 10 mmHg. After converting continuous MAP predictions into AHE binary predictions, we achieve a 91% recall and 68% precision. In addition to predicting AHE, the MAP predictions provide clinically useful information regarding the timing and severity of the AHE occurrence. Conclusion: We were able to predict AHE with precision and recall above 80% 30 minutes in advance with the large real-world dataset. The predictions of the regression model can provide a more fine-grained, interpretable signal to practitioners. Model performance is improved by the inclusion of invasive features in predicting AHE, when compared to predicting the AHE based on only the available, restricted set of noninvasive technologies. This demonstrates the importance of exploring more noninvasive technologies for AHE prediction.
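
The step of converting continuous MAP predictions into binary AHE predictions, and then scoring them, might look like the sketch below. The abstract does not state the AHE definition, so the 60 mmHg cutoff is an assumption made here for illustration, and the MAP values are invented.

```python
import math

# Assumption: an AHE is flagged when MAP falls below this cutoff.
# The abstract does not give the threshold; 60 mmHg is illustrative only.
AHE_THRESHOLD_MMHG = 60.0

def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

def precision_recall(pred_flags, true_flags):
    """Precision and recall for boolean predictions against boolean truth."""
    tp = sum(p and t for p, t in zip(pred_flags, true_flags))
    fp = sum(p and not t for p, t in zip(pred_flags, true_flags))
    fn = sum(t and not p for p, t in zip(pred_flags, true_flags))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical MAP values in mmHg at six time points.
observed = [72.0, 58.0, 55.0, 80.0, 59.0, 66.0]
predicted = [70.0, 61.0, 54.0, 75.0, 57.0, 59.0]

true_flags = [m < AHE_THRESHOLD_MMHG for m in observed]
pred_flags = [m < AHE_THRESHOLD_MMHG for m in predicted]

error = rmse(predicted, observed)
precision, recall = precision_recall(pred_flags, true_flags)
print(f"RMSE = {error:.2f} mmHg, precision = {precision:.2f}, recall = {recall:.2f}")
```

Values close to the cutoff show how a regression error of a few mmHg can flip the binary label in either direction.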


Forecasting and trading cryptocurrencies with machine learning under changing market conditions

Financial Innovation ◽

10.1186/s40854-020-00217-x ◽

2021 ◽

Vol 7 (1) ◽

Author(s):

Helder Sebastião ◽

Pedro Godinho

Keyword(s):

Machine Learning ◽

Linear Models ◽

Test Sample ◽

Trading Strategies ◽

Network Activity ◽

Machine Learning Techniques ◽

Support Vector ◽

Success Rates ◽

Market Conditions ◽

Sharpe Ratios

This study examines the predictability of three major cryptocurrencies (bitcoin, ethereum, and litecoin) and the profitability of trading strategies devised upon machine learning techniques (e.g., linear models, random forests, and support vector machines). The models are validated in a period characterized by unprecedented turmoil and tested in a period of bear markets, allowing the assessment of whether the predictions are good even when the market direction changes between the validation and test periods. The classification and regression methods use attributes from trading and network activity for the period from August 15, 2015 to March 03, 2019, with the test sample beginning on April 13, 2018. For the test period, five out of 18 individual models have success rates of less than 50%. The trading strategies are built on model assembling. The ensemble assuming that five models produce identical signals (Ensemble 5) achieves the best performance for ethereum and litecoin, with annualized Sharpe ratios of 80.17% and 91.35% and annualized returns (after proportional round-trip trading costs of 0.5%) of 9.62% and 5.73%, respectively. These positive results support the claim that machine learning provides robust techniques for exploring the predictability of cryptocurrencies and for devising profitable trading strategies in these markets, even under adverse market conditions.
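
For readers unfamiliar with the metrics quoted here, the sketch below shows one common way to annualize a Sharpe ratio and to net out a proportional round-trip cost. The conventions are assumptions, not the paper's stated method: daily returns, 365 periods per year, a zero risk-free rate, and a cost applied multiplicatively per trade. The return series is invented.

```python
import math
import statistics

def annualized_sharpe(daily_returns, periods_per_year=365):
    """Mean over sample SD of per-period returns, scaled by sqrt(periods).

    Assumes a zero risk-free rate and 365 trading days per year.
    """
    mean = statistics.mean(daily_returns)
    sd = statistics.stdev(daily_returns)
    return mean / sd * math.sqrt(periods_per_year)

def net_trade_return(gross_return, round_trip_cost=0.005):
    """Return of one trade after a proportional round-trip cost (0.5% here)."""
    return (1.0 + gross_return) * (1.0 - round_trip_cost) - 1.0

# Hypothetical daily strategy returns.
daily_returns = [0.01, -0.01, 0.02, 0.0]
sharpe = annualized_sharpe(daily_returns)
net = net_trade_return(0.02)
print(f"annualized Sharpe = {sharpe:.2f}")
print(f"2.00% gross trade after costs = {net:.4%}")
```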


Automobile tire life prediction based on image processing and machine learning technology

Advances in Mechanical Engineering ◽

10.1177/16878140211002727 ◽

2021 ◽

Vol 13 (3) ◽

pp. 168781402110027

Author(s):

Jianchen Zhu ◽

Kaixin Han ◽

Shenlong Wang

Keyword(s):

Machine Learning ◽

Image Processing ◽

Life Prediction ◽

Traffic Accidents ◽

Confusion Matrix ◽

Small Sample ◽

Image Feature ◽

Engineering Practice ◽

Learning Technology ◽

K Nearest Neighbor

With economic growth, automobiles have become an irreplaceable means of transportation and travel. Tires are important parts of automobiles, and their wear causes a large number of traffic accidents. Therefore, predicting tire life has become one of the key factors determining vehicle safety. This paper presents a tire life prediction method based on image processing and machine learning. We first build an original image database as the initial sample. Since there are usually only a few sample image libraries in engineering practice, we propose a new image feature extraction and expression method that shows excellent performance for a small sample database. We extract the texture features of the tire image by using the gray-gradient co-occurrence matrix (GGCM) and the Gauss-Markov random field (GMRF), and classify the extracted features by using the K-nearest neighbor (KNN) classifier. We then conduct experiments and predict the wear life of automobile tires. The experimental results are estimated by using the mean average precision (MAP) and confusion matrix as evaluation criteria. Finally, we verify the effectiveness and accuracy of the proposed method for predicting tire life. The obtained results are expected to be used for real-time prediction of tire life, thereby reducing tire-related traffic accidents.
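
The classification and evaluation steps named in this abstract (a K-nearest neighbor classifier scored with a confusion matrix) can be sketched as follows. The two-dimensional feature vectors and the "new"/"worn" labels are placeholders invented here; they stand in for the GGCM and GMRF texture features, which are not reproduced.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority label among the k training rows closest to query (squared Euclidean)."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in the order of labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical texture feature vectors with wear labels.
train = [
    ((0.10, 0.20), "new"), ((0.20, 0.10), "new"), ((0.15, 0.25), "new"),
    ((0.80, 0.90), "worn"), ((0.90, 0.80), "worn"), ((0.85, 0.75), "worn"),
]
test = [((0.12, 0.18), "new"), ((0.88, 0.82), "worn"), ((0.30, 0.30), "new")]

labels = ["new", "worn"]
y_true = [label for _, label in test]
y_pred = [knn_predict(train, features) for features, _ in test]
matrix = confusion_matrix(y_true, y_pred, labels)

for label, row in zip(labels, matrix):
    print(f"true {label:<5} -> predicted {dict(zip(labels, row))}")
```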


Lateral-size control of exfoliated transition-metal–oxide 2D materials by machine learning on small data

Nanoscale ◽

10.1039/d0nr08684c ◽

2021 ◽

Vol 13 (6) ◽

pp. 3853-3859

Author(s):

Ryosuke Mizuguchi ◽

Yasuhiko Igarashi ◽

Hiroaki Imai ◽

Yuya Oaki

Keyword(s):

Machine Learning ◽

Transition Metal ◽

Metal Oxide ◽

2D Materials ◽

Size Control ◽

Transition Metal Oxide ◽

Small Data ◽

Lateral Size

Lateral sizes of the exfoliated transition-metal–oxide nanosheets were predicted and controlled by the assistance of machine learning.


Patient-Specific Predictive Antibiogram in Decision Support for Empiric Antibiotic Treatment

Infection Control and Hospital Epidemiology ◽

10.1017/ice.2020.1205 ◽

2020 ◽

Vol 41 (S1) ◽

pp. s521-s522

Author(s):

Debarka Sengupta ◽

Vaibhav Singh ◽

Seema Singh ◽

Dinesh Tewari ◽

Mudit Kapoor ◽

...

Keyword(s):

Machine Learning ◽

Antimicrobial Resistance ◽

Model Building ◽

Medical Center ◽

Bacterial Species ◽

Model Performance ◽

The United States ◽

Patient Specific ◽

Gradient Boosting ◽

Comparative Performance

Background: The rising trend of antibiotic resistance imposes a heavy burden on healthcare both clinically and economically (US$55 billion), with 23,000 estimated annual deaths in the United States as well as increased length of stay and morbidity. Machine-learning–based methods have, of late, been used to leverage patients’ clinical history and demographic information to predict antimicrobial resistance. We developed a machine-learning model ensemble that maximizes the accuracy of such a drug-sensitivity versus resistivity classification system compared to the existing best-practice methods. Methods: We first performed a comprehensive analysis of the association between infecting bacterial species and patient factors, including patient demographics, comorbidities, and certain healthcare-specific features. We leveraged the predictable nature of these complex associations to infer patient-specific antibiotic sensitivities. Various base learners, including k-NN (k-nearest neighbors) and gradient boosting machine (GBM), were used to train an ensemble model for confident prediction of antimicrobial susceptibilities. Base learner selection and model performance evaluation were performed using a variety of standard metrics, namely accuracy, precision, recall, F1 score, and Cohen κ. Results: We validated performance on the MIMIC-III database, which holds deidentified clinical data of 53,423 distinct patient admissions between 2001 and 2012 in the intensive care units (ICUs) of the Beth Israel Deaconess Medical Center in Boston, Massachusetts. From ~11,000 positive cultures, we used 4 major specimen types, namely urine, sputum, blood, and pus swab, for evaluation of the model performance. Figure 1 shows the receiver operating characteristic (ROC) curves obtained for bloodstream infection cases upon model building and prediction on a 70:30 split of the data. We obtained area under the curve (AUC) values of 0.88, 0.92, 0.92, and 0.94 for urine, sputum, blood, and pus swab samples, respectively. Figure 2 shows the comparative performance of our proposed method as well as some off-the-shelf classification algorithms. Conclusions: Highly accurate, patient-specific predictive antibiogram (PSPA) data can aid clinicians significantly in antibiotic recommendation in the ICU, thereby accelerating patient recovery and curbing antimicrobial resistance. Funding: This study was supported by Circle of Life Healthcare Pvt. Ltd. Disclosures: None
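
The metrics listed in the Methods (accuracy, precision, recall, F1 score, and Cohen κ) all follow from the four cells of a binary confusion matrix. The sketch below uses invented counts, not results from the study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and Cohen's kappa from confusion counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Expected agreement by chance, from the marginal totals.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }

# Hypothetical counts for a resistant (positive) vs. sensitive (negative) classifier.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name:<9} = {value:.3f}")
```

Cohen κ is lower than accuracy here because it discounts the agreement that the class frequencies alone would produce.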
