Application of Bayesian networks to generate synthetic health data

Author(s):  
Dhamanpreet Kaur ◽  
Matthew Sobiesk ◽  
Shubham Patil ◽  
Jin Liu ◽  
Puran Bhagat ◽  
...  

Abstract
Objective: This study seeks to develop a fully automated method of generating synthetic data from a real dataset that could be employed by medical organizations to distribute health data to researchers, reducing the need for access to real data. We hypothesize that the application of Bayesian networks will improve upon the predominant existing method, medBGAN, in handling the complexity and dimensionality of healthcare data.
Materials and Methods: We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data.
Results: Our Bayesian network model outperformed or equaled medBGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules.
Discussion: Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types than existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools.
Conclusion: We conclude that the application of Bayesian networks is a promising option for generating realistic synthetic health data that preserves the features of the original data without compromising data privacy.
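The core generation step (learn a graphical structure with conditional probability tables, then draw synthetic records by ancestral sampling) can be illustrated with a minimal sketch. The two-field network, field names, and toy records below are invented for illustration; the paper's models are learned automatically from full patient datasets such as MIMIC-III.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy records (smoker, heart_disease) standing in for real patient fields.
records = [("yes", 1), ("yes", 1), ("yes", 0), ("no", 0),
           ("no", 0), ("no", 0), ("no", 1), ("yes", 1)]

# Learn P(smoker) and the conditional table P(disease | smoker)
# for the fixed structure smoker -> disease.
p_smoker = Counter(s for s, _ in records)
total = sum(p_smoker.values())
cond = defaultdict(Counter)
for s, d in records:
    cond[s][d] += 1

def sample_record(rng):
    # Ancestral sampling: draw the parent first, then the child
    # from its conditional distribution given the sampled parent.
    s = rng.choices(list(p_smoker), [p_smoker[k] / total for k in p_smoker])[0]
    child = cond[s]
    d = rng.choices(list(child), [child[k] / sum(child.values()) for k in child])[0]
    return (s, d)

rng = random.Random(0)
synthetic = [sample_record(rng) for _ in range(1000)]
print(Counter(s for s, _ in synthetic))
```

Because every synthetic record is drawn from the learned distribution rather than copied, marginals and associations are preserved in expectation without reproducing any individual row.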

2021 ◽  
Author(s):  
Aiden Smith ◽  
Paul Lambert ◽  
Mark Rutherford

Abstract
Background: A lack of availability of data and statistical code published alongside journal articles is a significant barrier to open scientific discourse and to the reproducibility of research. Information governance restrictions inhibit the active dissemination of individual-level data to accompany published manuscripts. Realistic, accurate time-to-event synthetic data can accelerate methodological development in survival analysis and beyond by enabling researchers to access and test published methods using data similar to that on which they were developed.
Methods: This paper presents methods to accurately replicate the covariate patterns and survival times found in real-world datasets using simulation techniques, without compromising individual patient identifiability. We model the joint covariate distribution of the original data using covariate-specific sequential conditional regression models, then fit a complex flexible parametric survival model from which to simulate survival times conditional on individual covariate patterns. We recreate the administrative censoring mechanism using the last observed follow-up date information from the initial dataset. Metrics for evaluating the accuracy of the synthetic data, and the non-identifiability of individuals from the original dataset, are presented.
Results: We successfully create a synthetic version of an example colon cancer dataset of 9064 patients that shows good similarity to both the covariate distributions and the survival times of the original data, without containing any exact information from the original data, therefore allowing it to be published openly alongside research.
Conclusions: We evaluate the effectiveness of the simulation methods for constructing synthetic data and provide evidence that it is almost impossible for a given patient in the original data to be identified from their individual unique date information. Simulated datasets produced with this methodology could be made available alongside published research without breaching data privacy protocols, allowing data and code to accompany methodological or applied manuscripts and greatly improving the transparency and accessibility of medical research.
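The simulation of survival times with an administrative censoring mechanism can be sketched with inverse-transform sampling. The constant exponential hazard below is a simplifying assumption for illustration; the paper fits a flexible parametric survival model conditional on individual covariate patterns.

```python
import math
import random

def simulate_survival(hazard, admin_censor_time, n, rng):
    """Draw exponential survival times T = -ln(U)/hazard by inverse transform,
    then apply administrative censoring at a fixed last-follow-up time."""
    out = []
    for _ in range(n):
        t = -math.log(rng.random()) / hazard
        if t > admin_censor_time:
            out.append((admin_censor_time, 0))  # censored at end of follow-up
        else:
            out.append((t, 1))                  # event observed
    return out

# Hypothetical parameters: hazard 0.2/year, follow-up ends after 5 years.
rng = random.Random(42)
data = simulate_survival(hazard=0.2, admin_censor_time=5.0, n=2000, rng=rng)
events = sum(e for _, e in data)
print(events / len(data))  # observed event fraction, ~1 - exp(-0.2 * 5)
```

In the paper's setup the hazard would vary per patient (via the fitted flexible parametric model), and the censoring time would come from each patient's last observed follow-up date.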


Machine Learning (ML) models are applied in a variety of tasks, such as network intrusion detection or malware classification. Yet these models are vulnerable to a class of malicious inputs known as adversarial examples: slightly perturbed inputs that are classified incorrectly by the ML model. The mitigation of these adversarial inputs remains an open problem. As a step towards understanding adversarial examples, we show that they are not drawn from the same distribution as the original data and can thus be detected using statistical tests. Using this knowledge, we introduce a complementary approach to identify specific inputs that are adversarial. Specifically, we augment our ML model with an additional output in which the model is trained to classify all adversarial inputs.
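The statistical-test idea can be illustrated with a two-sample Kolmogorov–Smirnov statistic: if perturbed inputs come from a different distribution, the gap between empirical CDFs grows. The one-dimensional Gaussian data and the fixed 0.5 shift standing in for an adversarial perturbation are invented for illustration; the paper's tests operate on model inputs, not scalar samples.

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in set(a) | set(b):
        fa = bisect.bisect_right(a, v) / len(a)
        fb = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(fa - fb))
    return d

rng = random.Random(1)
clean = [rng.gauss(0, 1) for _ in range(500)]
# Hypothetical stand-in for adversarial inputs: the same data, slightly shifted.
perturbed = [x + 0.5 for x in clean]
print(ks_statistic(clean, clean[:250]), ks_statistic(clean, perturbed))
```

A large statistic on the perturbed batch relative to a held-out clean batch flags a distribution shift, which is the detection signal the abstract describes.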


2021 ◽  
Vol 3 (2) ◽  
pp. 333-356
Author(s):  
Pavlos Papadopoulos ◽  
Will Abramson ◽  
Adam J. Hall ◽  
Nikolaos Pitropakis ◽  
William J. Buchanan

A common privacy issue in traditional machine learning is that data needs to be disclosed for training. In situations with highly sensitive data, such as healthcare records, accessing this information is challenging and often prohibited. Fortunately, privacy-preserving technologies have been developed to overcome this hurdle by distributing the computation of the training and ensuring data privacy for its owners. Distributing the computation across multiple participating entities, however, introduces new privacy complications and risks. In this paper, we present a privacy-preserving decentralised workflow that facilitates trusted federated learning among participants. Our proof-of-concept defines a trust framework instantiated using the decentralised identity technologies being developed under the Hyperledger Aries/Indy/Ursa projects. Only entities in possession of Verifiable Credentials issued by the appropriate authorities are able to establish secure, authenticated communication channels authorised to participate in a federated learning workflow related to mental health data.
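The gating idea, where only holders of an issuer-granted credential may join training, can be sketched with a toy stand-in. The HMAC tag below is only an illustration of issuer-verifiable credentials; real Aries/Indy credentials use DIDs and CL/BBS+ signatures with selective disclosure, not shared-key MACs.

```python
import hashlib
import hmac

# Hypothetical issuer signing key; in Aries/Indy this role is played by
# a DID-anchored credential issuer, not a shared secret.
ISSUER_KEY = b"hypothetical-issuer-secret"

def issue_credential(subject: str) -> str:
    """Toy 'Verifiable Credential': an HMAC tag over the subject identifier,
    checkable by any verifier holding the issuer key."""
    return hmac.new(ISSUER_KEY, subject.encode(), hashlib.sha256).hexdigest()

def may_join_training(subject: str, credential: str) -> bool:
    # Constant-time comparison avoids leaking tag bytes through timing.
    expected = issue_credential(subject)
    return hmac.compare_digest(expected, credential)

cred = issue_credential("hospital-a")
print(may_join_training("hospital-a", cred))  # holder of a valid credential joins
print(may_join_training("hospital-b", cred))  # a replayed credential is rejected
```

The workflow in the paper layers this kind of check under authenticated channels, so that model updates are only exchanged between verified participants.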


2021 ◽  
Vol 11 (16) ◽  
pp. 7360
Author(s):  
Andreea Bianca Popescu ◽  
Ioana Antonia Taca ◽  
Cosmin Ioan Nita ◽  
Anamaria Vizitiu ◽  
Robert Demeter ◽  
...  

Data privacy is a major concern when accessing and processing sensitive medical data. A promising approach among privacy-preserving techniques is homomorphic encryption (HE), which allows computations to be performed on encrypted data. Currently, HE still faces practical limitations related to high computational complexity, noise accumulation, and sole applicability at the bit or small-integer level. We propose herein an encoding method that enables typical HE schemes to operate on real-valued numbers of arbitrary precision and size. The approach is evaluated on two real-world scenarios relying on EEG signals: seizure detection and prediction of predisposition to alcoholism. A supervised machine learning-based approach is formulated, and training is performed using a direct (non-iterative) fitting method that requires a fixed and deterministic number of steps. Experiments on synthetic data of varying size and complexity are performed to determine the impact on runtime and error accumulation. The computational time for training the models increases but remains manageable, while the inference time remains in the order of milliseconds. The prediction performance of the models operating on encoded and encrypted data is comparable to that of standard models operating on plaintext data.
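One common way to let integer-only HE schemes handle reals is fixed-point encoding, sketched below on plaintext integers (no actual encryption), with the scale tracked through addition and multiplication. The scale value is an assumption for illustration, and the paper's actual encoding may differ.

```python
SCALE = 10**6  # fixed-point scale; the precision chosen here is an assumption

def encode(x, scale=SCALE):
    """Map a real value to an integer so integer-only HE circuits can act
    on it; rounding bounds the encoding error by 1/(2*scale)."""
    return round(x * scale)

def decode(n, scale=SCALE):
    return n / scale

# Addition of two encodings keeps the scale; multiplication of two
# encodings squares it, so the decoder must track scale growth.
a, b = 3.141592, -2.718281
enc_sum = encode(a) + encode(b)
enc_prod = encode(a) * encode(b)
print(decode(enc_sum), decode(enc_prod, SCALE**2))
```

The scale growth under multiplication is one reason noise and ciphertext size accumulate in practice, which is the limitation the abstract's encoding method addresses.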


2020 ◽  
Vol 15 ◽  
Author(s):  
Deeksha Saxena ◽  
Mohammed Haris Siddiqui ◽  
Rajnish Kumar

Background: Deep learning (DL) is an artificial neural network-driven framework with multiple levels of representation, in which non-linear modules are combined so that the representation can be raised from a lower to a more abstract level. Though DL is used widely in almost every field, it has brought a particular breakthrough in the biological sciences, where it is used in disease diagnosis and clinical trials. DL can be combined with machine learning, but at times both are used individually as well. DL can be a better choice than classical machine learning, as the former does not require intermediate feature extraction and works well with larger datasets. DL is one of the most discussed fields among scientists and researchers these days for diagnosing and solving various biological problems. However, deep learning models need further refinement and experimental validation to be more productive. Objective: To review the available DL models and datasets that are used in disease diagnosis. Methods: Available DL models and their applications in disease diagnosis were reviewed, discussed, and tabulated. Types of datasets and some of the popular disease-related data sources for DL were highlighted. Results: We have analyzed the frequently used DL methods and data types and discussed some of the recent deep learning models used for solving different biological problems. Conclusion: The review presents useful insights into DL methods, data types, and the selection of DL models for disease diagnosis.


Author(s):  
Samir Bandyopadhyay Sr ◽  
SHAWNI DUTTA

BACKGROUND
In recent days, the Covid-19 coronavirus has had an immense impact on social and economic life around the world. The objective of this study is to determine whether it is feasible to use machine learning methods to evaluate how closely prediction results match the original data on Confirmed-Negative-Released-Death cases of Covid-19. For this purpose, a verification method is proposed in this paper that uses deep-learning neural networks. In this framework, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are combined for training on the dataset, and the prediction results are tallied against the results predicted by clinical doctors. The prediction results are validated against the original data using predefined metrics. The experimental results show that the proposed approach is useful in generating suitable results for this critical disease outbreak, and it helps doctors with further verification of the virus. The outbreak of the coronavirus grows exponentially, making it difficult to control with limited clinical staff handling a huge number of patients within a reasonable time. It is therefore necessary to build an automated model, based on a machine learning approach, as a corrective measure after the decision of clinical doctors. It could be a promising supplementary confirmation method for frontline clinical doctors. The proposed method has a high prediction rate and works fast for probable accurate identification of the disease. The performance analysis shows that a high rate of accuracy is obtained by the proposed method.
OBJECTIVE
Validation of COVID-19 disease
METHODS
Machine Learning
RESULTS
90%
CONCLUSIONS
The combined LSTM-GRU based RNN model provides comparatively better results in terms of prediction of confirmed, released, negative, and death cases on the data. This paper presented a novel method that could automatically recheck reported cases of COVID-19. The data-driven RNN-based model is capable of providing an automated tool for confirming and estimating the current position of this pandemic, assessing its severity, and assisting government and health workers in making good policy decisions. It could be a promising supplementary rechecking method for frontline clinical doctors, and it is essential for improving the accuracy of the detection process.
CLINICALTRIAL
2020-04-03 3:22:36 PM
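The gated recurrence at the heart of the GRU component can be sketched at hidden size 1 with toy weights. This is illustrative only; the paper trains full LSTM and GRU layers on the case-count time series, and the weights below are invented.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def gru_step(x, h, w):
    """One scalar GRU step: the update gate z decides how much of the
    previous hidden state h to keep, and the reset gate r controls how
    much of h feeds into the candidate state."""
    z = sigmoid(w["wz"] * x + w["uz"] * h)
    r = sigmoid(w["wr"] * x + w["ur"] * h)
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h))
    return (1 - z) * h + z * h_cand

# Hypothetical trained weights and a toy normalized case-count sequence.
w = {"wz": 0.5, "uz": 0.1, "wr": 0.4, "ur": 0.2, "wh": 0.9, "uh": 0.3}
h = 0.0
for x in [1.0, 0.5, -0.2]:
    h = gru_step(x, h, w)
print(h)
```

Because tanh bounds the candidate state, the hidden state stays in (-1, 1), which keeps long rollouts over a case-count sequence numerically stable.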


2021 ◽  
Vol 21 (2) ◽  
pp. 1-31
Author(s):  
Bjarne Pfitzner ◽  
Nico Steckhan ◽  
Bert Arnrich

Data privacy is a very important issue. Especially in fields like medicine, it is paramount to abide by the existing privacy regulations to preserve patients' anonymity. However, data is required for research and for training machine learning models that could help gain insight into complex correlations or personalised treatments that might otherwise stay undiscovered. Such models generally scale with the amount of data available, but the current situation often prohibits building large databases across sites. It would therefore be beneficial to be able to combine similar or related data from different sites all over the world while still preserving data privacy. Federated learning has been proposed as a solution, because it relies on sharing machine learning models instead of the raw data itself, meaning private data never leaves the site or device on which it was collected. Federated learning is an emerging research area, and many domains have been identified for the application of its methods. This systematic literature review provides an extensive look at the concept of and research into federated learning and its applicability to confidential healthcare datasets.


Author(s):  
Ying Wang ◽  
Yiding Liu ◽  
Minna Xia

Big data is characterized by multiple sources and heterogeneity. In this study, a hybrid forest-fire analysis system is built on the Hadoop and Spark big data platforms. The platform combines big data analysis and processing technology and draws on research results from different technical fields, such as forest fire monitoring. In this system, Hadoop's HDFS is used to store all kinds of data, the Spark module provides various big data analysis methods, and visualization tools such as ECharts, ArcGIS, and Unity3D are used to visualize the analysis results. Finally, an experiment on forest fire point detection is designed to corroborate the feasibility and effectiveness of the system and to provide meaningful guidance for follow-up research and for the establishment of a forest fire monitoring and visualized early-warning big data platform. However, this experiment has two shortcomings: more data types should be selected, and compatibility would be better if the original data could be converted to XML format. We expect these problems to be addressed in follow-up research.


Entropy ◽  
2021 ◽  
Vol 23 (4) ◽  
pp. 460
Author(s):  
Samuel Yen-Chi Chen ◽  
Shinjae Yoo

Distributed training across several quantum computers could significantly improve training time, and sharing the learned model rather than the data could potentially improve data privacy, as training would happen where the data is located. One potential scheme to achieve this is federated learning (FL), which consists of several clients or local nodes learning on their own data and a central node aggregating the models collected from those local nodes. However, to the best of our knowledge, no work has yet been done on quantum machine learning (QML) in a federated setting. In this work, we present federated training of hybrid quantum-classical machine learning models, although our framework could be generalized to pure quantum machine learning models. Specifically, we consider a quantum neural network (QNN) coupled with a classical pre-trained convolutional model. Our distributed federated learning scheme achieved almost the same trained-model accuracy while making distributed training significantly faster. It demonstrates a promising future research direction for scaling and for privacy.
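The aggregation step at the central node can be sketched with weighted federated averaging (FedAvg-style). The flat lists of floats below stand in for quantum-circuit or neural-network parameters, and the client sizes are invented for illustration.

```python
def fed_avg(client_weights, client_sizes):
    """Weighted federated averaging: the server averages each parameter
    across clients, weighting by local dataset size, so larger clients
    pull the global model further toward their local optimum."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical clients with different amounts of local data.
global_model = fed_avg([[1.0, 2.0], [3.0, 4.0]], client_sizes=[100, 300])
print(global_model)  # [2.5, 3.5]
```

Only these parameter vectors cross the network; each client's training data stays on the quantum device that holds it, which is the privacy property the abstract highlights.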


Sensors ◽  
2021 ◽  
Vol 21 (7) ◽  
pp. 2503
Author(s):  
Taro Suzuki ◽  
Yoshiharu Amano

This paper proposes a method for detecting non-line-of-sight (NLOS) multipath, which causes large positioning errors in a global navigation satellite system (GNSS). We use the GNSS signal correlation output, the most primitive GNSS signal processing output, to detect NLOS multipath based on machine learning. The shape of the multi-correlator output is distorted by NLOS multipath, and features of that shape are used to discriminate the NLOS multipath. We implement two supervised learning methods, a support vector machine (SVM) and a neural network (NN), and compare their performance. In addition, we propose an automated method of collecting LOS and NLOS training signals for machine learning. The evaluation of the proposed NLOS detection method in an urban environment confirmed that the NN outperformed the SVM, correctly discriminating 97.7% of NLOS signals.
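The kind of shape feature fed to such a classifier can be sketched from a multi-correlator output vector: under NLOS multipath the correlation peak arrives late and is skewed. The two features and the toy triangle-shaped vectors below are illustrative assumptions, not the exact feature set from the paper.

```python
def correlator_features(corr):
    """Two simple shape features from a multi-correlator output:
    peak offset from the prompt (centre) tap, and left/right energy
    asymmetry around the centre tap."""
    peak = max(range(len(corr)), key=corr.__getitem__)
    centre = (len(corr) - 1) / 2
    left = sum(corr[: len(corr) // 2])
    right = sum(corr[len(corr) // 2 + 1:])
    asymmetry = (right - left) / (right + left)
    return peak - centre, asymmetry

# A symmetric (LOS-like) correlation triangle vs. a delayed, skewed (NLOS-like) one.
los = [0, 1, 2, 3, 2, 1, 0]
nlos = [0, 0, 1, 2, 3, 2, 1]
print(correlator_features(los), correlator_features(nlos))
```

An SVM or NN trained on such feature vectors then separates LOS from NLOS signals, which is the classification task the abstract evaluates.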

