Doing more with less

2021 ◽  
Vol 14 (11) ◽  
pp. 2059-2072
Author(s):  
Fatjon Zogaj ◽  
José Pablo Cambronero ◽  
Martin C. Rinard ◽  
Jürgen Cito

Automated machine learning (AutoML) promises to democratize machine learning by automatically generating machine learning pipelines with little to no user intervention. Typically, a search procedure is used to repeatedly generate and validate candidate pipelines, maximizing a predictive performance metric subject to a limited execution-time budget. While this approach works well for small tabular datasets, it does not directly scale to larger tabular datasets with hundreds of thousands of observations, often producing fewer candidate pipelines and yielding lower performance given the same execution-time budget. We carry out an extensive empirical evaluation of the impact that downsampling (reducing the number of rows in the input tabular dataset) has on the pipelines produced by a genetic-programming-based AutoML search for classification tasks.
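The budget argument above can be illustrated with a toy sketch: if each candidate evaluation costs time proportional to the number of rows, downsampling directly multiplies the number of candidates a search can try within the same budget. All names below are illustrative, not the authors' implementation.

```python
import random

def downsample(rows, fraction, seed=0):
    """Return a random subset of rows of the given fraction
    (the row-reduction idea studied in the paper)."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)

def candidates_within_budget(rows, cost_per_row, budget):
    """Toy stand-in for an AutoML search loop: each candidate evaluation
    costs time proportional to the number of rows, so fewer rows means
    more candidates explored within the same time budget."""
    candidates = 0
    spent = 0.0
    while spent + cost_per_row * len(rows) <= budget:
        spent += cost_per_row * len(rows)
        candidates += 1
    return candidates

full = list(range(100_000))
small = downsample(full, 0.1)
# Same budget, 10x fewer rows: roughly 10x more candidate pipelines.
n_full = candidates_within_budget(full, 1e-5, budget=10.0)
n_small = candidates_within_budget(small, 1e-5, budget=10.0)
```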

2021 ◽  
Author(s):  
Sebastian Johannes Fritsch ◽  
Konstantin Sharafutdinov ◽  
Moein Einollahzadeh Samadi ◽  
Gernot Marx ◽  
Andreas Schuppert ◽  
...  

BACKGROUND During the course of the COVID-19 pandemic, a variety of machine learning models were developed to predict different aspects of the disease, such as long-term courses, organ dysfunction or ICU mortality. The number of training datasets used has increased significantly over time. However, these data come from different waves of the pandemic, which did not always involve the same therapeutic approaches and differed in outcomes. The impact of these changes on model development has not yet been studied. OBJECTIVE The aim of this investigation was to examine the predictive performance of several models trained on data from one wave when applied to data from the other wave, and the impact of pooling these datasets. Finally, a method for comparing the heterogeneity of different datasets is introduced. METHODS We used two datasets, from waves one and two, to develop several models predicting patient mortality. Four classification algorithms were used: logistic regression (LR), support vector machine (SVM), random forest classifier (RF) and AdaBoost classifier (ADA). We also performed mutual prediction, applying each model to the data of the wave not used for its training. Then, we compared the performance of models trained on a pooled dataset from both waves. The populations from the different waves were checked for heterogeneity using a convex hull analysis. RESULTS 63 patients from wave one (03-06/2020) and 54 from wave two (08/2020-01/2021) were evaluated. For the waves considered separately, models reached sufficient accuracies of up to 0.79 AUROC (95% CI 0.76-0.81) for SVM on the first wave and up to 0.88 AUROC (95% CI 0.86-0.89) for RF on the second wave. After pooling the data, the AUROC decreased markedly. In mutual prediction, models trained on the second wave's data, when applied to the first wave's data, predicted non-survivors well but classified survivors insufficiently. The opposite setup (training: first wave, test: second wave) showed the inverse behaviour, with models correctly classifying survivors and incorrectly predicting non-survivors. The convex hull analysis of the first- and second-wave populations showed a more inhomogeneous distribution of the underlying data compared to randomly selected sets of patients of the same size. CONCLUSIONS Our work demonstrates that a larger dataset is not a universal solution to all machine learning problems in clinical settings. Rather, it shows that inhomogeneous data used to develop models can lead to serious problems. With the convex hull analysis, we offer a solution to this problem: the outcome of such an analysis can raise concerns that pooling different datasets would introduce inhomogeneous patterns, preventing better predictive performance.
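The convex hull comparison described above can be sketched in two dimensions (the paper does not publish its exact procedure; this is a minimal illustration using Andrew's monotone-chain hull, with `overlap_fraction` as a hypothetical summary of how much one population falls inside the other's hull):

```python
def convex_hull(points):
    """Andrew's monotone-chain convex hull for 2-D points, returned
    in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(hull, p):
    """True if p lies inside or on the convex polygon (CCW order)."""
    n = len(hull)
    for i in range(n):
        a, b = hull[i], hull[(i + 1) % n]
        if (b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0]) < 0:
            return False
    return True

def overlap_fraction(wave_a, wave_b):
    """Fraction of wave-B patients falling inside the hull of wave A;
    a low value suggests heterogeneous populations."""
    hull = convex_hull(wave_a)
    return sum(inside_hull(hull, p) for p in wave_b) / len(wave_b)
```

In practice each patient would be a point in a (possibly dimension-reduced) feature space rather than in the plane.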


2021 ◽  
Vol 25 (5) ◽  
pp. 1073-1098
Author(s):  
Nor Hamizah Miswan ◽  
Chee Seng Chan ◽  
Chong Guan Ng

Hospital readmission is a major cost for healthcare systems worldwide. If patients with a higher risk of readmission could be identified early, existing resources could be used more efficiently, and appropriate plans could be implemented to reduce the risk of readmission. It is therefore important to identify the right target patients. Medical data are usually noisy, incomplete, and inconsistent, so before developing a prediction model it is crucial to set up the predictive pipeline carefully to achieve improved predictive performance. The current study analyses the impact of different preprocessing methods on the performance of different machine learning classifiers. The preprocessing methods applied in previous hospital readmission studies were compared, and the most common approaches, such as missing-value imputation, feature selection, data balancing, and feature scaling, were highlighted. Hyperparameters were selected using Bayesian optimisation. The different preprocessing pipelines were assessed using various performance metrics and computational costs. The results indicated that the preprocessing approaches helped improve the models' prediction of hospital readmission.
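Two of the surveyed preprocessing steps, missing-value imputation and feature scaling, can be sketched minimally (mean imputation and min-max scaling are common choices; the study's exact pipelines are not reproduced here):

```python
def mean_impute(column):
    """Replace missing values (None) with the column mean, a common
    imputation choice in readmission studies."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_scale(column):
    """Scale a numeric column to [0, 1]; many classifiers are sensitive
    to feature magnitude, which is why scaling appears in most pipelines."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

raw = [10, None, 30, 20]
prepared = min_max_scale(mean_impute(raw))
```

The order matters: imputing before scaling keeps the imputed value on the same scale as the observed ones.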


Author(s):  
Michael McCartney ◽  
Matthias Haeringer ◽  
Wolfgang Polifke

Abstract This paper examines and compares the performance of commonly used machine learning algorithms in interpolating and extrapolating flame describing functions (FDFs), based on experimental and simulation data. Algorithm performance is evaluated by interpolating and extrapolating FDFs; the impact of the resulting errors on limit cycle amplitudes is then evaluated using the xFDF framework. The best algorithms for interpolation and extrapolation were found to be the widely used cubic spline interpolation and the Gaussian process regressor. The data itself was found to be an important factor in determining the predictive performance of a model; therefore, a method of optimally selecting data points at test time using Gaussian processes was demonstrated. The aim is to allow a minimal number of data points to be collected while still providing enough information to model the FDF accurately. Extrapolation performance was shown to decay very quickly with distance from the domain, so emphasis should be put on selecting measurement points that expand the covered domain. Gaussian processes also provide a confidence estimate for their predictions, which is used to carry out uncertainty quantification and understand model sensitivities. This was demonstrated through application to the xFDF framework.
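A minimal Gaussian process regressor makes the uncertainty behaviour described above concrete: the posterior standard deviation stays small near the training data and grows rapidly when extrapolating. This sketch assumes a squared-exponential kernel with unit variance and is not the paper's implementation.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-6):
    """GP regression: posterior mean and standard deviation.
    The growing std far from the data mirrors the observation that
    extrapolation quality decays quickly outside the covered domain."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks.T @ alpha
    v = np.linalg.solve(K, Ks)
    var = 1.0 - np.sum(Ks * v, axis=0)  # prior variance is 1 for this kernel
    return mean, np.sqrt(np.maximum(var, 0.0))

x = np.linspace(0.0, 1.0, 8)
y = np.sin(2 * np.pi * x)
mean, std = gp_predict(x, y, np.array([0.5, 3.0]))
# std is small at 0.5 (inside the data) and large at 3.0 (far outside).
```

The same posterior variance is what drives the optimal data-point selection mentioned in the abstract: new measurements are placed where the predictive uncertainty is highest.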


2020 ◽  
Author(s):  
Alex Sun ◽  
Bridget Scanlon ◽  
Himanshu Save ◽  
Ashraf Rateb

The GRACE satellite mission and its follow-on, GRACE-FO, have provided unprecedented opportunities to quantify the impact of climate extremes and human activities on total water storage at large scales. The approximately one-year data gap between the two GRACE missions needs to be filled to maintain data continuity and maximize mission benefits. There is strong interest in using machine learning (ML) algorithms to reconstruct GRACE-like data to fill this gap. So far, most studies attempted to train and select a single ML algorithm to work for global basins. However, hydrometeorological predictors may exhibit strong spatial variability which, in turn, may affect the performance of ML models. Existing studies have already shown that no single algorithm consistently outperformed others over all global basins. In this study, we applied an automated machine learning (AutoML) workflow to perform GRACE data reconstruction. AutoML represents a new paradigm for optimal model structure selection, hyperparameter tuning, and model ensemble stacking, addressing some of the most challenging issues related to ML applications. We demonstrated the AutoML workflow over the conterminous U.S. (CONUS) using six types of ML algorithms and multiple groups of meteorological and climatic variables as predictors. Results indicate that the AutoML-assisted gap filling achieved satisfactory performance over the CONUS. For the testing period (2014/06–2017/06), the mean gridwise Nash-Sutcliffe efficiency is around 0.85, the mean correlation coefficient is around 0.95, and the mean normalized root-mean square error is about 0.09. Trained models maintain good performance when extrapolating to the mission gap and to GRACE-FO periods (after 2017/06). 
Results further suggest that no single algorithm provides the best predictive performance over the entire CONUS, stressing the importance of using an end-to-end workflow to train, optimize, and combine multiple machine learning models to deliver robust performance, especially when building large-scale hydrological prediction systems and when predictor importance exhibits strong spatial variability.
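The Nash-Sutcliffe efficiency reported above has a simple closed form; a minimal sketch:

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations.
    1.0 is a perfect match; 0.0 is no better than predicting the mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    var = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / var

obs = [1.0, 2.0, 3.0, 4.0]
perfect = nash_sutcliffe(obs, obs)          # 1.0
baseline = nash_sutcliffe(obs, [2.5] * 4)   # 0.0: predicting the mean
```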


2020 ◽  
Author(s):  
Fenglong Yang ◽  
Quan Zou

Abstract Due to the concerted efforts to utilize microbial features to improve disease prediction capabilities, automated machine learning (AutoML) systems designed to remove the tedium of manually performing ML tasks are in great demand. Here we developed mAML, an ML model-building pipeline that can automatically and rapidly generate optimized and interpretable models for personalized microbial classification tasks in a reproducible way. The pipeline is deployed on a web-based platform; the server is user-friendly, flexible, and designed to scale according to specific requirements. The pipeline exhibits high performance on 13 benchmark datasets, including both binary and multi-class classification tasks. In addition, to facilitate the application of mAML and expand the human disease-related microbiome learning repository, we developed the GMrepo ML repository (GMrepo Microbiome Learning repository) from the GMrepo database. The repository comprises 120 microbial classification tasks for 85 human-disease phenotypes, covering 12,429 metagenomic samples and 38,643 amplicon samples. The mAML pipeline and the GMrepo ML repository are expected to be important resources for research in microbiology and algorithm development. Database URL: http://39.100.246.211:8050/Home


2022 ◽  
Vol 9 (1) ◽  
pp. 0-0

This article investigates the impact of data-complexity and team-specific characteristics on machine learning competition scores. Data from five real-world binary classification competitions hosted on Kaggle.com were analyzed. The data-complexity characteristics were measured in four aspects: standard measures, sparsity measures, class imbalance measures, and feature-based measures. The results showed that the higher the data complexity, the lower the predictive ability of the machine learning models. Our empirical evidence revealed that the imbalance ratio of the target variable was the most important factor and exhibited a nonlinear relationship with the models' predictive abilities: the imbalance ratio adversely affected predictive performance once it reached a certain level. However, mixed results were found for the impact of team-specific characteristics, measured by team size, team expertise, and the number of submissions, on team performance. For high-performing teams, these factors had no impact on team score.
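The imbalance ratio highlighted above is commonly defined as the majority-to-minority class ratio (the article's exact measure may differ); a minimal sketch:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Majority-to-minority class count ratio; 1.0 means a perfectly
    balanced target variable, larger values mean stronger imbalance."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

balanced = imbalance_ratio([0, 1, 0, 1])        # 1.0
skewed = imbalance_ratio([0, 0, 0, 0, 0, 1])    # 5.0
```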


2020 ◽  
Vol 73 (4) ◽  
pp. 285-295 ◽  
Author(s):  
Dongwoo Chae

Machine learning (ML) is revolutionizing anesthesiology research. Unlike classical research methods that are largely inference-based, ML is geared more towards making accurate predictions. ML is a field of artificial intelligence concerned with developing algorithms and models to perform prediction tasks in the absence of explicit instructions. Most ML applications, despite being highly variable in the topics that they deal with, generally follow a common workflow. For classification tasks, a researcher typically tests various ML models and compares the predictive performance with the reference logistic regression model. The main advantage of ML lies in its ability to deal with many features with complex interactions and its specific focus on maximizing predictive performance. However, emphasis on data-driven prediction can sometimes neglect mechanistic understanding. This article mainly focuses on the application of supervised ML to electronic health record (EHR) data. The main limitation of EHR-based studies is in the difficulty of establishing causal relationships. However, the associated low cost and rich information content provide great potential to uncover hitherto unknown correlations. In this review, the basic concepts of ML are introduced along with important terms that any ML researcher should know. Practical tips regarding the choice of software and computing devices are also provided. Towards the end, several examples of successful ML applications in anesthesiology are discussed. The goal of this article is to provide a basic roadmap to novice ML researchers working in the field of anesthesiology.


Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Fenglong Yang ◽  
Quan Zou

Abstract Due to the concerted efforts to utilize microbial features to improve disease prediction capabilities, automated machine learning (AutoML) systems aiming to remove the tedium of manually performing ML tasks are in great demand. Here we developed mAML, an ML model-building pipeline that can automatically and rapidly generate optimized and interpretable models for personalized microbiome-based classification tasks in a reproducible way. The pipeline is deployed on a web-based platform; the server is user-friendly and flexible and has been designed to scale according to specific requirements. The pipeline exhibits high performance on 13 benchmark datasets, including both binary and multi-class classification tasks. In addition, to facilitate the application of mAML and expand the human disease-related microbiome learning repository, we developed the GMrepo ML repository (GMrepo Microbiome Learning repository) from the GMrepo database. The repository comprises 120 microbiome-based classification tasks for 85 human-disease phenotypes, covering 12,429 metagenomic samples and 38,643 amplicon samples. The mAML pipeline and the GMrepo ML repository are expected to be important resources for research in microbiology and algorithm development. Database URL: http://lab.malab.cn/soft/mAML


Author(s):  
Gilles Ottervanger ◽  
Mitra Baratchi ◽  
Holger H. Hoos

Abstract Early time series classification (EarlyTSC) involves the prediction of a class label based on partial observation of a given time series. Most EarlyTSC algorithms consider the trade-off between accuracy and earliness as two competing objectives, controlled via a single dedicated hyperparameter. Obtaining insights into this trade-off requires finding a set of non-dominated (Pareto efficient) classifiers. So far, this has been approached through manual hyperparameter tuning. Since the trade-off hyperparameters only provide indirect control over the earliness-accuracy trade-off, manual tuning is tedious and tends to result in many sub-optimal hyperparameter settings. This complicates the search for optimal hyperparameter settings and forms a hurdle for the application of EarlyTSC to real-world problems. To address these issues, we propose an automated approach to hyperparameter tuning and algorithm selection for EarlyTSC, building on developments in the fast-moving research area known as automated machine learning (AutoML). To deal with the challenging task of optimising two conflicting objectives in early time series classification, we propose MultiETSC, a system for multi-objective algorithm selection and hyperparameter optimisation (MO-CASH) for EarlyTSC. MultiETSC can potentially leverage any existing or future EarlyTSC algorithm and produces a set of Pareto optimal algorithm configurations from which a user can choose a posteriori. As an additional benefit, our proposed framework can incorporate and leverage time-series classification algorithms not originally designed for EarlyTSC to improve performance on EarlyTSC; we demonstrate this property using a newly defined, “naïve” fixed-time algorithm. In an extensive empirical evaluation of our new approach on a benchmark of 115 data sets, we show that MultiETSC performs substantially better than baseline methods, ranking highest (avg. rank 1.98) compared to conceptually simpler single-algorithm (2.98) and single-objective alternatives (4.36).
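The Pareto-efficient set that MultiETSC returns can be illustrated with a small sketch, assuming earliness is a fraction of the series observed (to be minimised) and accuracy is to be maximised (illustrative values, not the paper's results):

```python
def pareto_front(configs):
    """Return the non-dominated configurations in the earliness-accuracy
    plane. A config (e, a) is dominated if another config needs no more
    of the series (e2 <= e) and is at least as accurate (a2 >= a), while
    being strictly better in at least one of the two."""
    front = []
    for e, a in configs:
        dominated = any(
            e2 <= e and a2 >= a and (e2 < e or a2 > a)
            for e2, a2 in configs
        )
        if not dominated:
            front.append((e, a))
    return sorted(front)

configs = [(0.2, 0.60), (0.5, 0.80), (0.5, 0.70), (0.9, 0.85)]
# (0.5, 0.70) is dominated by (0.5, 0.80); the rest trade off earliness
# against accuracy and stay on the front.
front = pareto_front(configs)
```

From such a front, a user can pick a configuration a posteriori, e.g. the most accurate one that still classifies within half the series.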

