Prediction intervals with random forests

2019 ◽  
Vol 29 (1) ◽  
pp. 205-229 ◽  
Author(s):  
Marie-Hélène Roy ◽  
Denis Larocque

The classical and most commonly used approach to building prediction intervals is the parametric approach. However, its main drawback is that its validity and performance depend strongly on the assumed functional link between the covariates and the response. This research investigates new methods that improve the performance of prediction intervals with random forests. Two aspects are explored: the method used to build the forest and the method used to build the prediction interval. Four methods to build the forest are investigated, three from the classification and regression tree (CART) paradigm and the transformation forest method. For CART forests, two alternative splitting criteria are investigated in addition to the default least-squares splitting rule. We also present and evaluate the performance of five flexible methods for constructing prediction intervals, yielding 20 distinct method variations. To reliably attain the desired confidence level, we include a calibration procedure performed on the out-of-bag information provided by the forest. The 20 method variations are thoroughly investigated and compared to five alternative methods through simulation studies and in real data settings. The results show that the proposed methods are very competitive: they outperform commonly used methods both in simulation settings and with real data.
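One simple way to build calibrated prediction intervals from a random forest, in the spirit of the out-of-bag (OOB) calibration described above, is to widen a point prediction by a quantile of the OOB residuals. The sketch below is a minimal illustrative variant on simulated data, not the authors' exact procedure.

```python
# Sketch: prediction intervals from a random forest, calibrated on
# out-of-bag (OOB) residuals. Illustrative only; the paper's methods
# are more elaborate.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=500)

forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# OOB residuals act as a built-in calibration set: each training point is
# predicted only by trees that did not see it.
oob_resid = np.abs(y - forest.oob_prediction_)
# Quantile chosen so the symmetric interval targets ~90% coverage.
q = np.quantile(oob_resid, 0.9)

X_new = rng.uniform(-2, 2, size=(5, 3))
pred = forest.predict(X_new)
lower, upper = pred - q, pred + q
```

Because the OOB predictions are made by trees that never saw the corresponding training point, the residual quantile is an honest estimate of out-of-sample error, which is what makes the calibration reliable.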

10.2196/18910 ◽  
2020 ◽  
Vol 8 (7) ◽  
pp. e18910
Author(s):  
Debbie Rankin ◽  
Michaela Black ◽  
Raymond Bond ◽  
Jonathan Wallace ◽  
Maurice Mulvenna ◽  
...  

Background The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. Objective This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. Methods A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression tree, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. Results A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data.
When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. Conclusions The results of this study are promising with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
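The core of the evaluation protocol described above is training one model on real data and one on synthetic data, then testing both on the same held-out real data. The sketch below illustrates that loop; the "synthetic" set here is a crude Gaussian-noise stand-in, not one of the CART, parametric, or Bayesian network generators used in the study, and the dataset is a public scikit-learn example rather than one of the 19 health datasets.

```python
# Sketch of the train-on-synthetic / test-on-real protocol.
# The noise-perturbed "synthetic" data is a stand-in for illustration only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stand-in synthetic data: real rows perturbed with small Gaussian noise.
rng = np.random.default_rng(0)
X_synth = X_train + rng.normal(scale=0.05 * X_train.std(axis=0),
                               size=X_train.shape)

acc = {}
for name, Xtr in [("real", X_train), ("synthetic", X_synth)]:
    model = DecisionTreeClassifier(random_state=0).fit(Xtr, y_train)
    # Both models are always evaluated on held-out REAL data.
    acc[name] = accuracy_score(y_test, model.predict(X_test))
```

Comparing `acc["real"]` with `acc["synthetic"]` gives the accuracy deviation the study reports per model and per generator.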


2019 ◽  
Vol 9 (2) ◽  
Author(s):  
Kim Ward ◽  
Chantal Larose

Objectives: This research brief explores literature addressing developmental education to identify successful interventions in first-year math courses in higher education. Our goal is to describe the relationship between students’ academic practices and their final course grade in their first-year math courses. Method: Data on 3,249 students have been gathered and analyzed using descriptive statistics and predictive analytics. We describe the Math program, which includes a supplemental support component, and the environment under which it was created. We then examine the relationship between students’ participation in supplemental support and their academic performance. Results: We used classification and regression tree algorithms to obtain a model that gave us data-driven guidelines to aid with future student interventions and success in their first-year math courses. Conclusions: Students’ fulfillment of the supplemental support requirements by specified deadlines is a key predictor of students’ midterm and final course grades. Implications for Theory and/or Practice: This work provides a roadmap for student interventions and increasing student success with first-year mathematics courses. Keywords: First-year mathematics courses, supplemental support, higher education
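A classification and regression tree of the kind described above yields readable, data-driven rules. The sketch below fits a small tree on a synthetic toy dataset; the feature names (`support_completed`, `deadline_met`) and the passing rule are hypothetical, not the study's actual variables.

```python
# Toy CART sketch: predict course outcome from supplemental-support
# participation. Features and outcome rule are invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 400
support_completed = rng.integers(0, 2, n)  # fulfilled supplemental support?
deadline_met = rng.integers(0, 2, n)       # met the specified deadline?
# Toy rule: passing is much more likely when both requirements are met.
passed = ((support_completed & deadline_met)
          | (rng.random(n) < 0.2)).astype(int)

X = np.column_stack([support_completed, deadline_met])
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, passed)
# The fitted tree prints as if/else rules an advisor could act on.
print(export_text(tree, feature_names=["support_completed", "deadline_met"]))
```

The appeal of CART in this setting is exactly this interpretability: the splits translate directly into intervention thresholds.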


Author(s):  
K Sumanth Reddy ◽  
Gaddam Pranith ◽  
Karre Varun ◽  
Thipparthy Surya Sai Teja

The compressive strength of concrete plays an important role in determining its durability and performance. With the rapid growth of materials engineering, finalizing an appropriate mix proportion to obtain a desired compressive strength has become a cumbersome and laborious task, and the problem is further complicated by the difficulty of obtaining a rational relation between the materials used and the strength obtained. Advances in computational methods make it possible to derive such a relation with machine learning techniques, which reduce the influence of outliers and of unwanted variables in the determination of compressive strength. In this paper, four basic machine learning techniques, multilayer perceptron neural network (MLP), support vector machine (SVM), linear regression (LR), and classification and regression tree (CART), are used to develop models for determining the compressive strength for two different sets of data (ingredients). Among the techniques used, SVM provides better results than the others, but SVM cannot be considered a universal model: recent literature has shown that such models need more data, and the dynamics of the attributes involved play an important role in determining a model's efficacy.
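A comparison of the four model families named above can be set up with cross-validation, as sketched below. The data here is a synthetic stand-in; the study's actual ingredient datasets are not reproduced.

```python
# Sketch: compare MLP, SVM, LR, and CART regressors by cross-validated R^2
# on synthetic stand-in data for a concrete-mix problem.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))  # stand-in mix proportions
y = 30 * X[:, 0] + 10 * X[:, 1] ** 2 + rng.normal(scale=2, size=300)

models = {
    "MLP": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                        random_state=0),
    "SVM": SVR(),
    "LR": LinearRegression(),
    "CART": DecisionTreeRegressor(random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
```

On real mix data, scaling the inputs (e.g. with `StandardScaler` in a pipeline) matters for the MLP and SVM; it is omitted here to keep the sketch short.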


Author(s):  
Wassim R. Abou Ghaida ◽  
Ayman Baklizi

We consider the prediction of future observations from the log-logistic distribution. The data are assumed to be hybrid right censored, with possible left censoring. Several point predictors are derived: the best unbiased predictor (BUP), the conditional median predictor, and the maximum likelihood predictor. Prediction intervals are derived using suitable pivotal quantities, along with intervals based on the highest density. We conducted a simulation study to compare the point and interval predictors. It is found that the BUP and the highest density interval (HDI) have the best overall performance. An illustrative example based on real data is given.
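For intuition, the sketch below fits a log-logistic distribution (scipy's `fisk`) to complete, uncensored data and forms a median point predictor and an equal-tailed 90% prediction interval for a future observation. This is an illustrative simplification only: the paper's actual setting (hybrid right censoring, the BUP, and highest-density intervals) is more involved.

```python
# Illustrative sketch: log-logistic prediction on complete data.
# scipy parametrizes the log-logistic as "fisk" with shape c.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = stats.fisk.rvs(c=3.0, scale=2.0, size=200, random_state=rng)

# Fit by maximum likelihood, fixing the location at 0.
c_hat, loc_hat, scale_hat = stats.fisk.fit(data, floc=0)

# Median of the fitted distribution as a simple point predictor.
median_pred = stats.fisk.ppf(0.5, c_hat, loc=loc_hat, scale=scale_hat)
# Equal-tailed 90% prediction interval (a simpler stand-in for the HDI).
lower, upper = stats.fisk.ppf([0.05, 0.95], c_hat, loc=loc_hat,
                              scale=scale_hat)
```

An HDI would instead pick the shortest interval with the required coverage, which for a skewed distribution like the log-logistic is narrower than the equal-tailed one.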


2021 ◽  
pp. 1-8
Author(s):  
Binod Balakrishnan ◽  
Heather VanDongen-Trimmer ◽  
Irene Kim ◽  
Sheila J. Hanson ◽  
Liyun Zhang ◽  
...  

<b><i>Background:</i></b> The Glasgow Coma Scale (GCS), used to classify the severity of traumatic brain injury (TBI), is associated with mortality and functional outcomes. However, GCS can be affected by sedation and neuromuscular blockade. The GCS-Pupil (GCS-P) score, calculated as GCS minus the Pupil Reactivity Score (PRS), was shown to better predict outcomes in a retrospective cohort of adult TBI patients. We evaluated the applicability of GCS-P to a large retrospective pediatric severe TBI (sTBI) cohort. <b><i>Methods:</i></b> Admissions to pediatric intensive care units in the Virtual Pediatric Systems (VPS, LLC) database from 2010 to 2015 with sTBI were included. We collected GCS, PRS (number of nonreactive pupils), cardiac arrest, abusive head trauma status, illness severity scores, pediatric cerebral performance category (PCPC) score, and mortality. GCS-P was calculated as GCS minus PRS. χ<sup>2</sup> or Fisher’s exact test and Mann-Whitney U test compared categorical and continuous variables, respectively. Classification and regression tree analysis identified thresholds of GCS-P and GCS along with other independent factors, which were further examined using multivariable regression analysis to identify factors independently associated with mortality and unfavorable PCPC at PICU discharge. <b><i>Results:</i></b> Among the 2,682 patients included in the study, mortality was 23%, increasing from 4.7% for PRS = 0 to 80% for PRS = 2. GCS-P identified more severely injured patients with GCS-P scores 1 and 2 who had worse outcomes. GCS-P ≤ 2 had higher odds for mortality, OR = 68.4 (95% CI = 50.6–92.4), and unfavorable PCPC, OR = 17.3 (95% CI = 8.1–37.0), compared to GCS ≤ 5. GCS-P ≤ 2 also had higher specificity and positive predictive value for both mortality and unfavorable PCPC compared to GCS ≤ 5.
<b><i>Conclusions:</i></b> GCS-P, by incorporating pupil reactivity to GCS scoring, is more strongly associated with mortality and poor functional outcome at PICU discharge in children with sTBI.
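The GCS-P score used above is a simple arithmetic adjustment: GCS minus PRS, where PRS counts nonreactive pupils (0, 1, or 2). A minimal helper with range checks:

```python
# GCS-P = GCS - PRS. GCS ranges 3-15; PRS is the number of nonreactive
# pupils (0-2), so GCS-P ranges from 1 (3 - 2) to 15 (15 - 0).

def gcs_pupil(gcs: int, nonreactive_pupils: int) -> int:
    """Return the GCS-Pupil (GCS-P) score."""
    if not 3 <= gcs <= 15:
        raise ValueError("GCS must be between 3 and 15")
    if nonreactive_pupils not in (0, 1, 2):
        raise ValueError("PRS is the number of nonreactive pupils (0-2)")
    return gcs - nonreactive_pupils
```

By extending the scale downward below 3, GCS-P separates the most severely injured patients, who all collapse to GCS = 3 on the unadjusted scale, into distinct strata (GCS-P 1 and 2) with different observed outcomes.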

