scholarly journals The impact of data-complexity and team characteristics on performance in the classification model

2022 ◽  
Vol 9 (1) ◽  
pp. 0-0

This article investigates the impact of data-complexity and team-specific characteristics on machine learning competition scores. Data from five real-world binary classification competitions hosted on Kaggle.com were analyzed. The data-complexity characteristics were measured in four aspects including standard measures, sparsity measures, class imbalance measures, and feature-based measures. The results showed that the higher the level of the data-complexity characteristics was, the lower the predictive ability of the machine learning model was as well. Our empirical evidence revealed that the imbalance ratio of the target variable was the most important factor and exhibited a nonlinear relationship with the model’s predictive abilities. The imbalance ratio adversely affected the predictive performance when it reached a certain level. However, mixed results were found for the impact of team-specific characteristics measured by team size, team expertise, and the number of submissions on team performance. For high-performing teams, these factors had no impact on team score.

Metabolomics ◽  
2019 ◽  
Vol 15 (12) ◽  
Author(s):  
Kevin M. Mendez ◽  
Stacey N. Reinke ◽  
David I. Broadhurst

Abstract Introduction Metabolomics is increasingly being used in the clinical setting for disease diagnosis, prognosis and risk prediction. Machine learning algorithms are particularly important in the construction of multivariate metabolite prediction. Historically, partial least squares (PLS) regression has been the gold standard for binary classification. Nonlinear machine learning methods such as random forests (RF), kernel support vector machines (SVM) and artificial neural networks (ANN) may be more suited to modelling possible nonlinear metabolite covariance, and thus provide better predictive models. Objectives We hypothesise that for binary classification using metabolomics data, non-linear machine learning methods will provide superior generalised predictive ability when compared to linear alternatives, in particular when compared with the current gold standard PLS discriminant analysis. Methods We compared the general predictive performance of eight archetypal machine learning algorithms across ten publicly available clinical metabolomics data sets. The algorithms were implemented in the Python programming language. All code and results have been made publicly available as Jupyter notebooks. Results There was only marginal improvement in predictive ability for SVM and ANN over PLS across all data sets. RF performance was comparatively poor. The use of out-of-bag bootstrap confidence intervals provided a measure of uncertainty of model prediction such that the quality of metabolomics data was observed to be a bigger influence on generalised performance than model choice. Conclusion The size of the data set, and choice of performance metric, had a greater influence on generalised predictive performance than the choice of machine learning algorithm.


Author(s):  
Joseph D. Romano ◽  
Trang T. Le ◽  
Weixuan Fu ◽  
Jason H. Moore

AbstractAutomated machine learning (AutoML) and artificial neural networks (ANNs) have revolutionized the field of artificial intelligence by yielding incredibly high-performing models to solve a myriad of inductive learning tasks. In spite of their successes, little guidance exists on when to use one versus the other. Furthermore, relatively few tools exist that allow the integration of both AutoML and ANNs in the same analysis to yield results combining both of their strengths. Here, we present TPOT-NN—a new extension to the tree-based AutoML software TPOT—and use it to explore the behavior of automated machine learning augmented with neural network estimators (AutoML+NN), particularly when compared to non-NN AutoML in the context of simple binary classification on a number of public benchmark datasets. Our observations suggest that TPOT-NN is an effective tool that achieves greater classification accuracy than standard tree-based AutoML on some datasets, with no loss in accuracy on others. We also provide preliminary guidelines for performing AutoML+NN analyses, and recommend possible future directions for AutoML+NN methods research, especially in the context of TPOT.


2021 ◽  
Vol 251 ◽  
pp. 01017
Author(s):  
Zhixiang Lu

With the vigorous development of the sharing economy, the short-term rental industry has also spawned many emerging industries that belong to the sharing economy. However, due to the impact of the COVID-19 pandemic in 2020, many sharing economy industries, including the short-term housing leasing industry, have been affected. This study takes the rental information of 1,004 short-term rental houses in New York in April 2020 as an example, through machine learning and quantitative analysis, we conducted statistical and visual analysis on the impact of different factors on the housing rental status. This project is based on the machine learning model to predict the changes in the rental status of the house on the time series. The results show that the prediction accuracy of the random forest model has reached more than 94%, and the prediction accuracy of the logistic model has reached more than 74%. At the same time, we have further explored the impact of time span differences and regional differences on the housing rental status.


Author(s):  
Xianping Du ◽  
Onur Bilgen ◽  
Hongyi Xu

Abstract Machine learning for classification has been used widely in engineering design, for example, feasible domain recognition and hidden pattern discovery. Training an accurate machine learning model requires a large dataset; however, high computational or experimental costs are major issues in obtaining a large dataset for real-world problems. One possible solution is to generate a large pseudo dataset with surrogate models, which is established with a smaller set of real training data. However, it is not well understood whether the pseudo dataset can benefit the classification model by providing more information or deteriorates the machine learning performance due to the prediction errors and uncertainties introduced by the surrogate model. This paper presents a preliminary investigation towards this research question. A classification-and-regressiontree model is employed to recognize the design subspaces to support design decision-making. It is implemented on the geometric design of a vehicle energy-absorbing structure based on finite element simulations. Based on a small set of real-world data obtained by simulations, a surrogate model based on Gaussian process regression is employed to generate pseudo datasets for training. The results showed that the tree-based method could help recognize feasible design domains efficiently. Furthermore, the additional information provided by the surrogate model enhances the accuracy of classification. One important conclusion is that the accuracy of the surrogate model determines the quality of the pseudo dataset and hence, the improvements in the machine learning model.


2021 ◽  
Author(s):  
Sebastian Johannes Fritsch ◽  
Konstantin Sharafutdinov ◽  
Moein Einollahzadeh Samadi ◽  
Gernot Marx ◽  
Andreas Schuppert ◽  
...  

BACKGROUND During the course of the COVID-19 pandemic, a variety of machine learning models were developed to predict different aspects of the disease, such as long-term causes, organ dysfunction or ICU mortality. The number of training datasets used has increased significantly over time. However, these data now come from different waves of the pandemic, not always addressing the same therapeutic approaches over time as well as changing outcomes between two waves. The impact of these changes on model development has not yet been studied. OBJECTIVE The aim of the investigation was to examine the predictive performance of several models trained with data from one wave predicting the second wave´s data and the impact of a pooling of these data sets. Finally, a method for comparison of different datasets for heterogeneity is introduced. METHODS We used two datasets from wave one and two to develop several predictive models for mortality of the patients. Four classification algorithms were used: logistic regression (LR), support vector machine (SVM), random forest classifier (RF) and AdaBoost classifier (ADA). We also performed a mutual prediction on the data of that wave which was not used for training. Then, we compared the performance of models when a pooled dataset from two waves was used. The populations from the different waves were checked for heterogeneity using a convex hull analysis. RESULTS 63 patients from wave one (03-06/2020) and 54 from wave two (08/2020-01/2021) were evaluated. For both waves separately, we found models reaching sufficient accuracies up to 0.79 AUROC (95%-CI 0.76-0.81) for SVM on the first wave and up 0.88 AUROC (95%-CI 0.86-0.89) for RF on the second wave. After the pooling of the data, the AUROC decreased relevantly. In the mutual prediction, models trained on second wave´s data showed, when applied on first wave´s data, a good prediction for non-survivors but an insufficient classification for survivors. The opposite situation (training: first wave, test: second wave) revealed the inverse behaviour with models correctly classifying survivors and incorrectly predicting non-survivors. The convex hull analysis for the first and second wave populations showed a more inhomogeneous distribution of underlying data when compared to randomly selected sets of patients of the same size. CONCLUSIONS Our work demonstrates that a larger dataset is not a universal solution to all machine learning problems in clinical settings. Rather, it shows that inhomogeneous data used to develop models can lead to serious problems. With the convex hull analysis, we offer a solution for this problem. The outcome of such an analysis can raise concerns if the pooling of different datasets would cause inhomogeneous patterns preventing a better predictive performance.


2021 ◽  
Vol 25 (5) ◽  
pp. 1073-1098
Author(s):  
Nor Hamizah Miswan ◽  
Chee Seng Chan ◽  
Chong Guan Ng

Hospital readmission is a major cost for healthcare systems worldwide. If patients with a higher potential of readmission could be identified at the start, existing resources could be used more efficiently, and appropriate plans could be implemented to reduce the risk of readmission. Therefore, it is important to predict the right target patients. Medical data is usually noisy, incomplete, and inconsistent. Hence, before developing a prediction model, it is crucial to efficiently set up the predictive model so that improved predictive performance is achieved. The current study aims to analyse the impact of different preprocessing methods on the performance of different machine learning classifiers. The preprocessing applied by previous hospital readmission studies were compared, and the most common approaches highlighted such as missing value imputation, feature selection, data balancing, and feature scaling. The hyperparameters were selected using Bayesian optimisation. The different preprocessing pipelines were assessed using various performance metrics and computational costs. The results indicated that the preprocessing approaches helped improve the model’s prediction of hospital readmission.


2016 ◽  
Vol 7 (2) ◽  
pp. 43-71 ◽  
Author(s):  
Sangeeta Lal ◽  
Neetu Sardana ◽  
Ashish Sureka

Logging is an important yet tough decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code construct. The prediction performances of these models are limited due to the class-imbalance problem since the number of logged code constructs is small as compared to non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. The authors first analyze the performances of J48, RF, and SVM classifiers for catch-blocks and if-blocks logged code constructs prediction on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, LogIm model improves the performance of baseline classifiers, J48, RF, and SVM, by 7.38%, 9.24%, and 4.6% for catch-blocks, and 12.11%, 14.95%, and 19.13% for if-blocks logging prediction.


2019 ◽  
Vol 2019 ◽  
pp. 1-9 ◽  
Author(s):  
Patricio Wolff ◽  
Manuel Graña ◽  
Sebastián A. Ríos ◽  
Maria Begoña Yarza

Background. Hospital readmission prediction in pediatric hospitals has received little attention. Studies have focused on the readmission frequency analysis stratified by disease and demographic/geographic characteristics but there are no predictive modeling approaches, which may be useful to identify preventable readmissions that constitute a major portion of the cost attributed to readmissions.Objective. To assess the all-cause readmission predictive performance achieved by machine learning techniques in the emergency department of a pediatric hospital in Santiago, Chile.Materials. An all-cause admissions dataset has been collected along six consecutive years in a pediatric hospital in Santiago, Chile. The variables collected are the same used for the determination of the child’s treatment administrative cost.Methods. Retrospective predictive analysis of 30-day readmission was formulated as a binary classification problem. We report classification results achieved with various model building approaches after data curation and preprocessing for correction of class imbalance. We compute repeated cross-validation (RCV) with decreasing number of folders to assess performance and sensitivity to effect of imbalance in the test set and training set size.Results. Increase in recall due to SMOTE class imbalance correction is large and statistically significant. The Naive Bayes (NB) approach achieves the best AUC (0.65); however the shallow multilayer perceptron has the best PPV and f-score (5.6 and 10.2, resp.). The NB and support vector machines (SVM) give comparable results if we consider AUC, PPV, and f-score ranking for all RCV experiments. High recall of deep multilayer perceptron is due to high false positive ratio. There is no detectable effect of the number of folds in the RCV on the predictive performance of the algorithms.Conclusions. We recommend the use of Naive Bayes (NB) with Gaussian distribution model as the most robust modeling approach for pediatric readmission prediction, achieving the best results across all training dataset sizes. The results show that the approach could be applied to detect preventable readmissions.


Sign in / Sign up

Export Citation Format

Share Document