Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing (Preprint)

2020 ◽  
Author(s):  
Debbie Rankin ◽  
Michaela Black ◽  
Raymond Bond ◽  
Jonathan Wallace ◽  
Maurice Mulvenna ◽  
...  

BACKGROUND The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. OBJECTIVE This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. METHODS A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression tree, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. RESULTS A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data.
When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. CONCLUSIONS The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must preserve both individual privacy and data utility in order to instill confidence in health care departments when using such data to inform policy decision-making.
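The train-on-synthetic, test-on-real protocol described in the abstract can be sketched as follows. This is a minimal illustration: the naive per-class Gaussian synthesizer, the toy dataset, and the single decision tree are assumptions for demonstration, not the study's generators (CART, parametric, Bayesian network) or its full model suite.

```python
# Train one model on real data and one on synthetic data; test both on
# held-out REAL data, then compare accuracies (the "deviation").
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Crude stand-in synthesizer: per-class independent Gaussians fit to the
# real training data (a real study would use a dedicated generator).
X_syn, y_syn = [], []
for c in np.unique(y_tr):
    Xc = X_tr[y_tr == c]
    X_syn.append(rng.normal(Xc.mean(axis=0), Xc.std(axis=0), size=Xc.shape))
    y_syn.append(np.full(len(Xc), c))
X_syn, y_syn = np.vstack(X_syn), np.concatenate(y_syn)

model_real = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
model_syn = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)

acc_real = accuracy_score(y_te, model_real.predict(X_te))
acc_syn = accuracy_score(y_te, model_syn.predict(X_te))
deviation = acc_real - acc_syn  # per-model deviation, as reported in the study
```

Both models face the same real test set, so `deviation` isolates how much utility is lost by training on the synthetic substitute.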

10.2196/18910 ◽  
2020 ◽  
Vol 8 (7) ◽  
pp. e18910


2021 ◽  
Vol 10 (6) ◽  
pp. 3794-3801
Author(s):  
Yusuf Aliyu Adamu

Malaria is a life-threatening disease that causes deaths globally; its early prediction is necessary for preventing rapid transmission. In this work, an enhanced ensemble learning approach for predicting malaria outbreaks is suggested. Using a mean-based splitting strategy, the dataset is randomly partitioned into smaller groups. The splits are then modelled using a classification and regression tree, and an accuracy-based weighted aging classifier ensemble is used to construct a homogeneous ensemble from the several classification and regression tree models, ensuring higher performance is achieved. Seven different algorithms were tested, along with one ensemble method that combines all seven classifiers. The accuracy, precision, and sensitivity achieved by the proposed method are 93%, 92%, and 100%, respectively, outperforming both the individual machine learning classifiers and the ensemble method used in this research. The correlation between the variables used is established, as is how each factor contributes to malaria incidence. The results indicate that malaria outbreaks can be predicted successfully using the suggested technique.
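The partition-then-weight scheme described above can be sketched as follows. This is a hedged illustration of an accuracy-weighted homogeneous CART ensemble over random partitions of the training data; the paper's exact mean-based splitting and weighted aging mechanism, and its malaria dataset, are not reproduced here.

```python
# Accuracy-weighted homogeneous ensemble of CART models, each fit on one
# random partition of the training data, combined by weighted majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Partition the training set into smaller groups, one CART per group.
rng = np.random.default_rng(1)
idx = rng.permutation(len(X_tr))
models, weights = [], []
for part in np.array_split(idx, 5):
    tree = DecisionTreeClassifier(random_state=1).fit(X_tr[part], y_tr[part])
    models.append(tree)
    # Weight each member by its accuracy on the full training set.
    weights.append(accuracy_score(y_tr, tree.predict(X_tr)))

# Weighted majority vote across the ensemble members (binary problem).
votes = np.zeros((len(X_te), 2))
for w, m in zip(weights, models):
    votes[np.arange(len(X_te)), m.predict(X_te)] += w
y_pred = votes.argmax(axis=1)
acc = accuracy_score(y_te, y_pred)
```

Weighting members by training accuracy lets stronger trees dominate the vote, which is the core idea behind accuracy-based weighted ensembling.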


2021 ◽  
pp. 1-18
Author(s):  
Shashikant Rathod ◽  
Leena Phadke ◽  
Uttam Chaskar ◽  
Chetankumar Patil

BACKGROUND: According to the World Health Organization, one in ten adults will have Type 2 Diabetes Mellitus (T2DM) in the next few years. Autonomic dysfunction is one of the significant complications of T2DM. It is usually assessed by standard Ewing's tests and resting Heart Rate Variability (HRV) indices. OBJECTIVE: Resting HRV has limited use in screening due to its large intra- and inter-individual variations. Therefore, a combined approach of resting and orthostatic-challenge HRV measurement with a machine learning technique was used in the present study. METHODS: A total of 213 subjects of both genders between 20 and 70 years of age participated in this study from March 2018 to December 2019 at Smt. Kashibai Navale Medical College and General Hospital (SKNMCGH) in Pune, India. The volunteers were categorized according to their glycemic status as control (n = 51, euglycemic) and T2DM (n = 162). Short-term ECG signals at rest and after an orthostatic challenge were recorded. The HRV indices were extracted from the ECG signal as per the HRV Task Force guidelines. RESULTS: We observed significant differences in time-domain, frequency-domain, and non-linear resting HRV indices between the control and T2DM groups. A blunted autonomic response to the orthostatic challenge, quantified by percentage difference, was observed in T2DM compared to the control group. HRV patterns during rest and the orthostatic challenge were modelled by various machine learning algorithms. The classification and regression tree (CART) model showed the best performance among all the machine learning algorithms, with an accuracy of 84.04%, a sensitivity of 89.51%, and a specificity of 66.67%, with an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.78, compared to resting HRV alone (75.12% accuracy, 86.42% sensitivity, 39.22% specificity, AUC of 0.63) for differentiating autonomic dysfunction between non-diabetic controls and T2DM.
CONCLUSION: It was possible to develop a Classification and Regression Tree (CART) model to detect autonomic dysfunction. The percentage difference between resting and orthostatic-challenge HRV indices captures the blunted autonomic response. The developed CART model differentiates autonomic dysfunction in T2DM better when using both resting and orthostatic-challenge HRV data than when using resting HRV data alone. Thus, monitoring HRV parameters with the CART model during rest and after an orthostatic challenge may be a better alternative for detecting autonomic dysfunction in T2DM than resting HRV alone.
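The percentage-difference feature construction described above can be sketched as follows. All values, feature names, and the labelling rule here are hypothetical placeholders; the study used Task Force HRV indices extracted from real ECG recordings.

```python
# Percentage difference between resting and orthostatic-challenge HRV
# indices, combined with the resting values and fed to a CART classifier.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # scikit-learn's CART

rng = np.random.default_rng(42)
n = 200
# Hypothetical resting indices (e.g. SDNN, RMSSD, HF power) for n subjects.
rest = rng.uniform(20, 60, size=(n, 3))
# Simulated post-challenge indices: a multiplicative autonomic response.
ortho = rest * rng.uniform(0.6, 1.1, size=(n, 3))

# Percentage difference quantifies the response to the orthostatic challenge;
# a strongly negative value corresponds to a blunted/abnormal response.
pct_diff = 100.0 * (ortho - rest) / rest
X = np.hstack([rest, pct_diff])               # resting + response features
y = (pct_diff.mean(axis=1) > -15).astype(int)  # hypothetical label rule

cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
acc = cart.score(X, y)  # training accuracy of the shallow CART
```

The point of the sketch is the feature engineering: the classifier sees both the resting state and the relative change induced by the challenge, mirroring the paper's combined approach.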


2014 ◽  
pp. 115-123
Author(s):  
Rachid Beghdad

The purpose of this study is to identify some higher-level KDD features and to train the resulting set with an appropriate machine learning technique in order to classify and predict attacks. To achieve that, a two-step approach is proposed. First, Fisher's ANOVA technique was used to deduce the important features. Second, four types of classification trees, ID3, C4.5, classification and regression tree (CART), and random tree (RndT), were tested to classify and detect attacks. According to our tests, RndT leads to the best results, which is why the classification and prediction results of this technique are presented here in detail. Some of the remaining results will be used later to make comparisons. We used the KDD'99 datasets to evaluate the considered algorithms. For these evaluations, only the four-attack-category case was considered. Our simulations show the efficiency of our approach, and also that it is very competitive with some similar previous works.
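The two-step pipeline above (ANOVA-based feature selection, then a tree classifier) can be sketched as follows. Synthetic data stands in for KDD'99 here, and scikit-learn's `f_classif` is used as a stand-in for Fisher's ANOVA ranking; the exact feature set and tree variants of the study are not reproduced.

```python
# Step 1: rank features by ANOVA F-score and keep the top k.
# Step 2: train and cross-validate a tree classifier on the reduced set.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# KDD'99 records have 41 features; we mimic that dimensionality.
X, y = make_classification(n_samples=500, n_features=41, n_informative=8,
                           random_state=0)

# Step 1: ANOVA F-test keeps the 10 most discriminative features.
selector = SelectKBest(f_classif, k=10).fit(X, y)
X_sel = selector.transform(X)

# Step 2: tree classifier evaluated with 5-fold cross-validation.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_sel, y, cv=5)
mean_acc = scores.mean()
```

Filtering with ANOVA before training keeps the trees small and fast, which matters at KDD'99 scale.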


Author(s):  
Pardomuan Robinson Sihombing ◽  
Istiqomatul Fajriyah Yuliati

This study examines the application of several machine learning methods, taking the imbalanced-data problem into account, in classification modeling for determining the risk of low birth weight (LBW) births, which is expected to serve as a solution for reducing LBW births in Indonesia. The machine learning methods used are Classification and Regression Tree (CART), Naïve Bayes, Random Forest, and Support Vector Machine (SVM). Classification modeling using a resampling technique on imbalanced data and a large dataset is shown to improve classification accuracy, particularly for the minority class, as seen from the higher sensitivity values compared with the original (untreated) data. Furthermore, among the classification models tested, the random forest model gives the best performance based on the highest sensitivity, specificity, G-mean, and AUC values. The most important/influential variables in classifying LBW risk are birth spacing and birth order, antenatal care visits, and maternal age.
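The resample-then-classify approach above can be sketched as follows. This is a hedged illustration on synthetic data: minority-class oversampling via `sklearn.utils.resample` stands in for whatever resampling scheme the study applied, and random forest feature importances stand in for its variable-importance analysis.

```python
# Oversample the minority class to balance the training set, fit a random
# forest, then check sensitivity (minority-class recall) on held-out data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced toy data: ~90% majority (class 0), ~10% minority (class 1).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

# Oversample the minority class (with replacement) up to the majority count.
X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
X_up = resample(X_min, n_samples=len(X_maj), random_state=3)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int),
                        np.ones(len(X_up), dtype=int)])

rf = RandomForestClassifier(random_state=3).fit(X_bal, y_bal)
sensitivity = recall_score(y_te, rf.predict(X_te))  # minority-class recall
importances = rf.feature_importances_  # ranks the most influential variables
```

Evaluating on the untouched (still imbalanced) test set is essential: resampling only the training data avoids inflating the reported sensitivity.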


2021 ◽  
Author(s):  
Fida Dankar ◽  
Mahmoud K. Ibrahim ◽  
Leila Ismail

BACKGROUND Synthetic datasets are gradually emerging as a solution for fast and inclusive health data sharing. Multiple synthetic data generators have been introduced in the last decade, fueled by advances in machine learning, yet their utility is not well understood. A few recent papers have tried to compare the utility of synthetic data generators, but each focused on different evaluation metrics and presented conclusions targeted at specific analyses. OBJECTIVE This work aims to understand the overall utility (referred to as quality) of four recent synthetic data generators by identifying multiple criteria for high-utility synthetic data. METHODS We investigate commonly used utility metrics for masked data evaluation and classify them into criteria/categories depending on the function they attempt to preserve: attribute fidelity, bivariate fidelity, population fidelity, and application fidelity. We then chose a representative metric from each of the identified categories based on popularity and consistency. The set of metrics together, referred to as the quality criteria, is used to evaluate the overall utility of four recent synthetic data generators across 19 datasets of different sizes and feature counts. Moreover, correlations between the identified metrics are investigated in an attempt to streamline synthetic data utility evaluation. RESULTS Our results indicate that a non-parametric machine learning synthetic data generator (Synthpop) provides the best utility values across all quality criteria, along with the highest stability. It displays the best overall accuracy in supervised machine learning and often agrees with the real dataset on the learning model with the highest accuracy. On another front, our results suggest no strong correlation between the different metrics, which implies that all categories/dimensions are required when evaluating the overall utility of synthetic data.
CONCLUSIONS The paper used four quality criteria to identify the synthesizer with the best overall utility. The results are promising, with small decreases in accuracy observed for the winning synthesizer when its models are tested on real datasets (in comparison with models trained on real data). Further research into a single (overall) quality measure would greatly help data holders in optimizing the utility of the released dataset.
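One bivariate-fidelity-style utility metric can be illustrated as follows: the mean absolute difference between the real and synthetic pairwise correlation matrices, where smaller values indicate higher fidelity. This is a generic example of the metric family, not necessarily one of the representative metrics chosen in the paper.

```python
# Compare how well two candidate synthetic datasets preserve the pairwise
# correlation structure of a real dataset.
import numpy as np

rng = np.random.default_rng(7)
real = rng.multivariate_normal(
    [0, 0, 0], [[1, .8, .2], [.8, 1, .1], [.2, .1, 1]], 500)
good_syn = real + rng.normal(0, 0.1, real.shape)  # tracks the real data
bad_syn = rng.normal(0, 1, real.shape)            # independent noise

def corr_distance(a, b):
    """Mean absolute difference of the pairwise Pearson correlation matrices."""
    return np.abs(np.corrcoef(a.T) - np.corrcoef(b.T)).mean()

d_good = corr_distance(real, good_syn)  # small: bivariate structure preserved
d_bad = corr_distance(real, bad_syn)    # large: correlations destroyed
```

Because the paper found the different utility categories largely uncorrelated, a metric like this would be reported alongside attribute-, population-, and application-fidelity scores rather than on its own.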

