Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing (Preprint)

2020 ◽  
Author(s):  
Debbie Rankin ◽  
Michaela Black ◽  
Raymond Bond ◽  
Jonathan Wallace ◽  
Maurice Mulvenna ◽  
...  

BACKGROUND The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. OBJECTIVE This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. METHODS A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression tree, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. RESULTS A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data.
When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. CONCLUSIONS The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must preserve both individual privacy and data utility in order to instill confidence in health care departments when using such data to inform policy decision-making.
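The train-on-synthetic, test-on-real protocol described in the abstract can be sketched as follows. This is a minimal illustration: the naive per-class Gaussian synthesizer, the toy dataset, and the single decision tree are assumptions for demonstration, not the study's generators (CART, parametric, Bayesian network) or its full model suite.

```python
# Train one model on real data and one on synthetic data; test both on
# held-out REAL data, then compare accuracies (the "deviation").
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Crude stand-in synthesizer: per-class independent Gaussians fit to the
# real training data (a real study would use a dedicated generator).
X_syn, y_syn = [], []
for c in np.unique(y_tr):
    Xc = X_tr[y_tr == c]
    X_syn.append(rng.normal(Xc.mean(axis=0), Xc.std(axis=0), size=Xc.shape))
    y_syn.append(np.full(len(Xc), c))
X_syn, y_syn = np.vstack(X_syn), np.concatenate(y_syn)

model_real = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
model_syn = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)

acc_real = accuracy_score(y_te, model_real.predict(X_te))
acc_syn = accuracy_score(y_te, model_syn.predict(X_te))
deviation = acc_real - acc_syn  # per-model deviation, as reported in the study
```

Both models face the same real test set, so `deviation` isolates how much utility is lost by training on the synthetic substitute.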

10.2196/18910 ◽  
2020 ◽  
Vol 8 (7) ◽  
pp. e18910


2021 ◽  
Vol 10 (6) ◽  
pp. 3794-3801
Author(s):  
Yusuf Aliyu Adamu

Malaria is a life-threatening disease that causes deaths globally; its early prediction is necessary for preventing rapid transmission. In this work, an enhanced ensemble learning approach for predicting malaria outbreaks is suggested. Using a mean-based splitting strategy, the dataset is randomly partitioned into smaller groups. The splits are then modelled using a classification and regression tree, and an accuracy-based weighted aging classifier ensemble is used to construct a homogeneous ensemble from the several classification and regression tree models, ensuring higher performance is achieved. Seven different algorithms were tested, along with one ensemble method that combines all seven classifiers. The accuracy, precision, and sensitivity achieved by the proposed method are 93%, 92%, and 100%, respectively, outperforming both the individual machine learning classifiers and the ensemble method used in this research. The correlation between the variables used is established, as is how each factor contributes to malaria incidence. The results indicate that malaria outbreaks can be predicted successfully using the suggested technique.
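The partition-then-weight scheme described above can be sketched as follows. This is a hedged illustration of an accuracy-weighted homogeneous CART ensemble over random partitions of the training data; the paper's exact mean-based splitting and weighted aging mechanism, and its malaria dataset, are not reproduced here.

```python
# Accuracy-weighted homogeneous ensemble of CART models, each fit on one
# random partition of the training data, combined by weighted majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Partition the training set into smaller groups, one CART per group.
rng = np.random.default_rng(1)
idx = rng.permutation(len(X_tr))
models, weights = [], []
for part in np.array_split(idx, 5):
    tree = DecisionTreeClassifier(random_state=1).fit(X_tr[part], y_tr[part])
    models.append(tree)
    # Weight each member by its accuracy on the full training set.
    weights.append(accuracy_score(y_tr, tree.predict(X_tr)))

# Weighted majority vote across the ensemble members (binary problem).
votes = np.zeros((len(X_te), 2))
for w, m in zip(weights, models):
    votes[np.arange(len(X_te)), m.predict(X_te)] += w
y_pred = votes.argmax(axis=1)
acc = accuracy_score(y_te, y_pred)
```

Weighting members by training accuracy lets stronger trees dominate the vote, which is the core idea behind accuracy-based weighted ensembling.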


2021 ◽  
pp. 1-18
Author(s):  
Shashikant Rathod ◽  
Leena Phadke ◽  
Uttam Chaskar ◽  
Chetankumar Patil

BACKGROUND: According to the World Health Organization, one in ten adults will have Type 2 Diabetes Mellitus (T2DM) in the next few years. Autonomic dysfunction is one of the significant complications of T2DM. It is usually assessed by standard Ewing's tests and resting Heart Rate Variability (HRV) indices. OBJECTIVE: Resting HRV has limited use in screening due to its large intra- and inter-individual variations. Therefore, a combined approach of resting and orthostatic-challenge HRV measurement with a machine learning technique was used in the present study. METHODS: A total of 213 subjects of both genders between 20 and 70 years of age participated in this study from March 2018 to December 2019 at Smt. Kashibai Navale Medical College and General Hospital (SKNMCGH) in Pune, India. The volunteers were categorized according to their glycemic status as control (n = 51, euglycemic) and T2DM (n = 162). Short-term ECG signals at rest and after an orthostatic challenge were recorded. The HRV indices were extracted from the ECG signal as per the HRV Task Force guidelines. RESULTS: We observed significant differences in time-domain, frequency-domain, and non-linear resting HRV indices between the control and T2DM groups. A blunted autonomic response to the orthostatic challenge, quantified by percentage difference, was observed in T2DM compared to the control group. HRV patterns during rest and the orthostatic challenge were modelled by various machine learning algorithms. The classification and regression tree (CART) model showed the best performance among all the machine learning algorithms, with an accuracy of 84.04%, a sensitivity of 89.51%, and a specificity of 66.67%, with an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.78, compared to resting HRV alone (75.12% accuracy, 86.42% sensitivity, 39.22% specificity, AUC of 0.63) for differentiating autonomic dysfunction between non-diabetic controls and T2DM.
CONCLUSION: It was possible to develop a Classification and Regression Tree (CART) model to detect autonomic dysfunction. The percentage difference between resting and orthostatic-challenge HRV indices captures the blunted autonomic response. The developed CART model differentiates autonomic dysfunction in T2DM better when using both resting and orthostatic-challenge HRV data than when using resting HRV data alone. Thus, monitoring HRV parameters with the CART model during rest and after an orthostatic challenge may be a better alternative for detecting autonomic dysfunction in T2DM than resting HRV alone.
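The percentage-difference feature construction described above can be sketched as follows. All values, feature names, and the labelling rule here are hypothetical placeholders; the study used Task Force HRV indices extracted from real ECG recordings.

```python
# Percentage difference between resting and orthostatic-challenge HRV
# indices, combined with the resting values and fed to a CART classifier.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # scikit-learn's CART

rng = np.random.default_rng(42)
n = 200
# Hypothetical resting indices (e.g. SDNN, RMSSD, HF power) for n subjects.
rest = rng.uniform(20, 60, size=(n, 3))
# Simulated post-challenge indices: a multiplicative autonomic response.
ortho = rest * rng.uniform(0.6, 1.1, size=(n, 3))

# Percentage difference quantifies the response to the orthostatic challenge;
# a strongly negative value corresponds to a blunted/abnormal response.
pct_diff = 100.0 * (ortho - rest) / rest
X = np.hstack([rest, pct_diff])               # resting + response features
y = (pct_diff.mean(axis=1) > -15).astype(int)  # hypothetical label rule

cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
acc = cart.score(X, y)  # training accuracy of the shallow CART
```

The point of the sketch is the feature engineering: the classifier sees both the resting state and the relative change induced by the challenge, mirroring the paper's combined approach.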


2014 ◽  
pp. 115-123
Author(s):  
Rachid Beghdad

The purpose of this study is to identify some higher-level KDD features and to train the resulting set with an appropriate machine learning technique in order to classify and predict attacks. To achieve that, a two-step approach is proposed. First, Fisher's ANOVA technique was used to deduce the important features. Second, four types of classification trees, ID3, C4.5, classification and regression tree (CART), and random tree (RndT), were tested to classify and detect attacks. According to our tests, RndT leads to the best results, which is why the classification and prediction results of this technique are presented here in detail. Some of the remaining results will be used later to make comparisons. We used the KDD'99 datasets to evaluate the considered algorithms. For these evaluations, only the four-attack-category case was considered. Our simulations show the efficiency of our approach, and also that it is very competitive with some similar previous works.
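The two-step pipeline above (ANOVA-based feature selection, then a tree classifier) can be sketched as follows. Synthetic data stands in for KDD'99 here, and scikit-learn's `f_classif` is used as a stand-in for Fisher's ANOVA ranking; the exact feature set and tree variants of the study are not reproduced.

```python
# Step 1: rank features by ANOVA F-score and keep the top k.
# Step 2: train and cross-validate a tree classifier on the reduced set.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# KDD'99 records have 41 features; we mimic that dimensionality.
X, y = make_classification(n_samples=500, n_features=41, n_informative=8,
                           random_state=0)

# Step 1: ANOVA F-test keeps the 10 most discriminative features.
selector = SelectKBest(f_classif, k=10).fit(X, y)
X_sel = selector.transform(X)

# Step 2: tree classifier evaluated with 5-fold cross-validation.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_sel, y, cv=5)
mean_acc = scores.mean()
```

Filtering with ANOVA before training keeps the trees small and fast, which matters at KDD'99 scale.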


Author(s):  
Pardomuan Robinson Sihombing ◽  
Istiqomatul Fajriyah Yuliati

This study examines the application of several machine learning methods, taking the imbalanced-data problem into account, in classification modeling for determining the risk of low birth weight (LBW) births, which is expected to serve as a solution for reducing LBW births in Indonesia. The machine learning methods used are Classification and Regression Tree (CART), Naïve Bayes, Random Forest, and Support Vector Machine (SVM). Classification modeling using a resampling technique on imbalanced data and a large dataset is shown to improve classification accuracy, particularly for the minority class, as seen from the higher sensitivity values compared with the original (untreated) data. Furthermore, among the classification models tested, the random forest model gives the best performance based on the highest sensitivity, specificity, G-mean, and AUC values. The most important/influential variables in classifying LBW risk are birth spacing and birth order, antenatal care visits, and maternal age.
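The resample-then-classify approach above can be sketched as follows. This is a hedged illustration on synthetic data: minority-class oversampling via `sklearn.utils.resample` stands in for whatever resampling scheme the study applied, and random forest feature importances stand in for its variable-importance analysis.

```python
# Oversample the minority class to balance the training set, fit a random
# forest, then check sensitivity (minority-class recall) on held-out data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced toy data: ~90% majority (class 0), ~10% minority (class 1).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

# Oversample the minority class (with replacement) up to the majority count.
X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
X_up = resample(X_min, n_samples=len(X_maj), random_state=3)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int),
                        np.ones(len(X_up), dtype=int)])

rf = RandomForestClassifier(random_state=3).fit(X_bal, y_bal)
sensitivity = recall_score(y_te, rf.predict(X_te))  # minority-class recall
importances = rf.feature_importances_  # ranks the most influential variables
```

Evaluating on the untouched (still imbalanced) test set is essential: resampling only the training data avoids inflating the reported sensitivity.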


2021 ◽  
Author(s):  
Fida Dankar ◽  
Mahmoud K. Ibrahim ◽  
Leila Ismail

BACKGROUND Synthetic datasets are gradually emerging as a solution for fast and inclusive health data sharing. Multiple synthetic data generators have been introduced in the last decade, fueled by advances in machine learning, yet their utility is not well understood. A few recent papers have tried to compare the utility of synthetic data generators, but each focused on different evaluation metrics and presented conclusions targeted at specific analyses. OBJECTIVE This work aims to understand the overall utility (referred to as quality) of four recent synthetic data generators by identifying multiple criteria for high-utility synthetic data. METHODS We investigate commonly used utility metrics for masked data evaluation and classify them into criteria/categories depending on the function they attempt to preserve: attribute fidelity, bivariate fidelity, population fidelity, and application fidelity. We then chose a representative metric from each of the identified categories based on popularity and consistency. The set of metrics together, referred to as the quality criteria, is used to evaluate the overall utility of four recent synthetic data generators across 19 datasets of different sizes and feature counts. Moreover, correlations between the identified metrics are investigated in an attempt to streamline synthetic data utility evaluation. RESULTS Our results indicate that a non-parametric machine learning synthetic data generator (Synthpop) provides the best utility values across all quality criteria, along with the highest stability. It displays the best overall accuracy in supervised machine learning and often agrees with the real dataset on the learning model with the highest accuracy. On another front, our results suggest no strong correlation between the different metrics, which implies that all categories/dimensions are required when evaluating the overall utility of synthetic data.
CONCLUSIONS The paper used four quality criteria to identify the synthesizer with the best overall utility. The results are promising, with small decreases in accuracy observed for the winning synthesizer when its models are tested on real datasets (in comparison with models trained on real data). Further research into a single (overall) quality measure would greatly help data holders in optimizing the utility of the released dataset.
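One bivariate-fidelity-style utility metric can be illustrated as follows: the mean absolute difference between the real and synthetic pairwise correlation matrices, where smaller values indicate higher fidelity. This is a generic example of the metric family, not necessarily one of the representative metrics chosen in the paper.

```python
# Compare how well two candidate synthetic datasets preserve the pairwise
# correlation structure of a real dataset.
import numpy as np

rng = np.random.default_rng(7)
real = rng.multivariate_normal(
    [0, 0, 0], [[1, .8, .2], [.8, 1, .1], [.2, .1, 1]], 500)
good_syn = real + rng.normal(0, 0.1, real.shape)  # tracks the real data
bad_syn = rng.normal(0, 1, real.shape)            # independent noise

def corr_distance(a, b):
    """Mean absolute difference of the pairwise Pearson correlation matrices."""
    return np.abs(np.corrcoef(a.T) - np.corrcoef(b.T)).mean()

d_good = corr_distance(real, good_syn)  # small: bivariate structure preserved
d_bad = corr_distance(real, bad_syn)    # large: correlations destroyed
```

Because the paper found the different utility categories largely uncorrelated, a metric like this would be reported alongside attribute-, population-, and application-fidelity scores rather than on its own.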

