Prediction intervals with random forests

2019 ◽  
Vol 29 (1) ◽  
pp. 205-229 ◽  
Author(s):  
Marie-Hélène Roy ◽  
Denis Larocque

The classical and most commonly used approach to building prediction intervals is the parametric approach. However, its main drawback is that its validity and performance depend strongly on the assumed functional link between the covariates and the response. This research investigates new methods that improve the performance of prediction intervals with random forests. Two aspects are explored: the method used to build the forest and the method used to build the prediction interval. Four methods to build the forest are investigated, three from the classification and regression tree (CART) paradigm and the transformation forest method. For CART forests, two alternative splitting criteria are investigated in addition to the default least-squares splitting rule. We also present and evaluate the performance of five flexible methods for constructing prediction intervals, yielding 20 distinct method variations. To reliably attain the desired confidence level, we include a calibration procedure performed on the out-of-bag information provided by the forest. The 20 method variations are thoroughly investigated and compared to five alternative methods through simulation studies and in real data settings. The results show that the proposed methods are very competitive: they outperform commonly used methods both in simulation settings and with real data.
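One simple way to build calibrated prediction intervals from a random forest, in the spirit of the out-of-bag (OOB) calibration described above, is to widen a point prediction by a quantile of the OOB residuals. The sketch below is a minimal illustrative variant on simulated data, not the authors' exact procedure.

```python
# Sketch: prediction intervals from a random forest, calibrated on
# out-of-bag (OOB) residuals. Illustrative only; the paper's methods
# are more elaborate.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=500)

forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# OOB residuals act as a built-in calibration set: each training point is
# predicted only by trees that did not see it.
oob_resid = np.abs(y - forest.oob_prediction_)
# Quantile chosen so the symmetric interval targets ~90% coverage.
q = np.quantile(oob_resid, 0.9)

X_new = rng.uniform(-2, 2, size=(5, 3))
pred = forest.predict(X_new)
lower, upper = pred - q, pred + q
```

Because the OOB predictions are made by trees that never saw the corresponding training point, the residual quantile is an honest estimate of out-of-sample error, which is what makes the calibration reliable.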

10.2196/18910 ◽  
2020 ◽  
Vol 8 (7) ◽  
pp. e18910
Author(s):  
Debbie Rankin ◽  
Michaela Black ◽  
Raymond Bond ◽  
Jonathan Wallace ◽  
Maurice Mulvenna ◽  
...  

Background The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. Objective This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. Methods A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression tree, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. Results A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data.
When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. Conclusions The results of this study are promising with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
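The core of the evaluation protocol described above is training one model on real data and one on synthetic data, then testing both on the same held-out real data. The sketch below illustrates that loop; the "synthetic" set here is a crude Gaussian-noise stand-in, not one of the CART, parametric, or Bayesian network generators used in the study, and the dataset is a public scikit-learn example rather than one of the 19 health datasets.

```python
# Sketch of the train-on-synthetic / test-on-real protocol.
# The noise-perturbed "synthetic" data is a stand-in for illustration only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stand-in synthetic data: real rows perturbed with small Gaussian noise.
rng = np.random.default_rng(0)
X_synth = X_train + rng.normal(scale=0.05 * X_train.std(axis=0),
                               size=X_train.shape)

acc = {}
for name, Xtr in [("real", X_train), ("synthetic", X_synth)]:
    model = DecisionTreeClassifier(random_state=0).fit(Xtr, y_train)
    # Both models are always evaluated on held-out REAL data.
    acc[name] = accuracy_score(y_test, model.predict(X_test))
```

Comparing `acc["real"]` with `acc["synthetic"]` gives the accuracy deviation the study reports per model and per generator.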


2019 ◽  
Vol 9 (2) ◽  
Author(s):  
Kim Ward ◽  
Chantal Larose

Objectives: This research brief explores literature addressing developmental education to identify successful interventions in first-year math courses in higher education. Our goal is to describe the relationship between students’ academic practices and their final course grade in their first-year math courses. Method: Data on 3,249 students have been gathered and analyzed using descriptive statistics and predictive analytics. We describe the Math program, which includes a supplemental support component, and the environment under which it was created. We then examine the relationship between students’ participation in supplemental support and their academic performance. Results: We used classification and regression tree algorithms to obtain a model that gave us data-driven guidelines to aid with future student interventions and success in their first-year math courses. Conclusions: Students’ fulfillment of the supplemental support requirements by specified deadlines is a key predictor of students’ midterm and final course grades. Implications for Theory and/or Practice: This work provides a roadmap for student interventions and increasing student success with first-year mathematics courses. Keywords: First-year mathematics courses, supplemental support, higher education
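A classification and regression tree of the kind described above yields readable, data-driven rules. The sketch below fits a small tree on a synthetic toy dataset; the feature names (`support_completed`, `deadline_met`) and the passing rule are hypothetical, not the study's actual variables.

```python
# Toy CART sketch: predict course outcome from supplemental-support
# participation. Features and outcome rule are invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 400
support_completed = rng.integers(0, 2, n)  # fulfilled supplemental support?
deadline_met = rng.integers(0, 2, n)       # met the specified deadline?
# Toy rule: passing is much more likely when both requirements are met.
passed = ((support_completed & deadline_met)
          | (rng.random(n) < 0.2)).astype(int)

X = np.column_stack([support_completed, deadline_met])
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, passed)
# The fitted tree prints as if/else rules an advisor could act on.
print(export_text(tree, feature_names=["support_completed", "deadline_met"]))
```

The appeal of CART in this setting is exactly this interpretability: the splits translate directly into intervention thresholds.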


Author(s):  
K Sumanth Reddy ◽  
Gaddam Pranith ◽  
Karre Varun ◽  
Thipparthy Surya Sai Teja

The compressive strength of concrete plays an important role in determining its durability and performance. With the rapid growth of materials engineering, finalizing an appropriate mix proportion to obtain a desired compressive strength has become a cumbersome and laborious task, and the problem is further complicated by the difficulty of obtaining a rational relation between the materials used and the strength obtained. Advances in computational methods make it possible to derive such a relation with machine learning techniques, which reduce the influence of outliers and of unwanted variables in the determination of compressive strength. In this paper, four basic machine learning techniques, multilayer perceptron neural network (MLP), support vector machine (SVM), linear regression (LR), and classification and regression tree (CART), are used to develop models for determining the compressive strength for two different sets of data (ingredients). Among the techniques used, SVM provides better results than the others, but SVM cannot be considered a universal model: recent literature has shown that such models need more data, and the dynamics of the attributes involved play an important role in determining a model's efficacy.
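A comparison of the four model families named above can be set up with cross-validation, as sketched below. The data here is a synthetic stand-in; the study's actual ingredient datasets are not reproduced.

```python
# Sketch: compare MLP, SVM, LR, and CART regressors by cross-validated R^2
# on synthetic stand-in data for a concrete-mix problem.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))  # stand-in mix proportions
y = 30 * X[:, 0] + 10 * X[:, 1] ** 2 + rng.normal(scale=2, size=300)

models = {
    "MLP": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                        random_state=0),
    "SVM": SVR(),
    "LR": LinearRegression(),
    "CART": DecisionTreeRegressor(random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
```

On real mix data, scaling the inputs (e.g. with `StandardScaler` in a pipeline) matters for the MLP and SVM; it is omitted here to keep the sketch short.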


Author(s):  
Wassim R. Abou Ghaida ◽  
Ayman Baklizi

We consider the prediction of future observations from the log-logistic distribution. The data are assumed to be hybrid right censored, with possible left censoring. Several point predictors are derived: the best unbiased predictor (BUP), the conditional median predictor, and the maximum likelihood predictor. Prediction intervals are derived using suitable pivotal quantities, along with intervals based on the highest density. We conducted a simulation study to compare the point and interval predictors. It is found that the BUP and the highest density interval (HDI) have the best overall performance. An illustrative example based on real data is given.
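For intuition, the sketch below fits a log-logistic distribution (scipy's `fisk`) to complete, uncensored data and forms a median point predictor and an equal-tailed 90% prediction interval for a future observation. This is an illustrative simplification only: the paper's actual setting (hybrid right censoring, the BUP, and highest-density intervals) is more involved.

```python
# Illustrative sketch: log-logistic prediction on complete data.
# scipy parametrizes the log-logistic as "fisk" with shape c.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = stats.fisk.rvs(c=3.0, scale=2.0, size=200, random_state=rng)

# Fit by maximum likelihood, fixing the location at 0.
c_hat, loc_hat, scale_hat = stats.fisk.fit(data, floc=0)

# Median of the fitted distribution as a simple point predictor.
median_pred = stats.fisk.ppf(0.5, c_hat, loc=loc_hat, scale=scale_hat)
# Equal-tailed 90% prediction interval (a simpler stand-in for the HDI).
lower, upper = stats.fisk.ppf([0.05, 0.95], c_hat, loc=loc_hat,
                              scale=scale_hat)
```

An HDI would instead pick the shortest interval with the required coverage, which for a skewed distribution like the log-logistic is narrower than the equal-tailed one.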


2021 ◽  
pp. 1-8
Author(s):  
Binod Balakrishnan ◽  
Heather VanDongen-Trimmer ◽  
Irene Kim ◽  
Sheila J. Hanson ◽  
Liyun Zhang ◽  
...  

<b><i>Background:</i></b> The Glasgow Coma Scale (GCS), used to classify the severity of traumatic brain injury (TBI), is associated with mortality and functional outcomes. However, GCS can be affected by sedation and neuromuscular blockade. The GCS-Pupil (GCS-P) score, calculated as GCS minus the Pupil Reactivity Score (PRS), was shown to better predict outcomes in a retrospective cohort of adult TBI patients. We evaluated the applicability of GCS-P to a large retrospective pediatric severe TBI (sTBI) cohort. <b><i>Methods:</i></b> Admissions to pediatric intensive care units in the Virtual Pediatric Systems (VPS, LLC) database from 2010 to 2015 with sTBI were included. We collected GCS, PRS (number of nonreactive pupils), cardiac arrest, abusive head trauma status, illness severity scores, pediatric cerebral performance category (PCPC) score, and mortality. GCS-P was calculated as GCS minus PRS. χ<sup>2</sup> or Fisher’s exact test and Mann-Whitney U test compared categorical and continuous variables, respectively. Classification and regression tree analysis identified thresholds of GCS-P and GCS along with other independent factors, which were further examined using multivariable regression analysis to identify factors independently associated with mortality and unfavorable PCPC at PICU discharge. <b><i>Results:</i></b> Among the 2,682 patients included in the study, mortality was 23%, increasing from 4.7% for PRS = 0 to 80% for PRS = 2. GCS-P identified more severely injured patients with GCS-P scores 1 and 2 who had worse outcomes. GCS-P ≤ 2 had higher odds for mortality, OR = 68.4 (95% CI = 50.6–92.4), and unfavorable PCPC, OR = 17.3 (95% CI = 8.1–37.0), compared to GCS ≤ 5. GCS-P ≤ 2 also had higher specificity and positive predictive value for both mortality and unfavorable PCPC compared to GCS ≤ 5.
<b><i>Conclusions:</i></b> GCS-P, by incorporating pupil reactivity to GCS scoring, is more strongly associated with mortality and poor functional outcome at PICU discharge in children with sTBI.
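The GCS-P score used above is a simple arithmetic adjustment: GCS minus PRS, where PRS counts nonreactive pupils (0, 1, or 2). A minimal helper with range checks:

```python
# GCS-P = GCS - PRS. GCS ranges 3-15; PRS is the number of nonreactive
# pupils (0-2), so GCS-P ranges from 1 (3 - 2) to 15 (15 - 0).

def gcs_pupil(gcs: int, nonreactive_pupils: int) -> int:
    """Return the GCS-Pupil (GCS-P) score."""
    if not 3 <= gcs <= 15:
        raise ValueError("GCS must be between 3 and 15")
    if nonreactive_pupils not in (0, 1, 2):
        raise ValueError("PRS is the number of nonreactive pupils (0-2)")
    return gcs - nonreactive_pupils
```

By extending the scale downward below 3, GCS-P separates the most severely injured patients, who all collapse to GCS = 3 on the unadjusted scale, into distinct strata (GCS-P 1 and 2) with different observed outcomes.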

