Web-Based Privacy-Preserving Multicenter Medical Data Analysis Tools Via Threshold Homomorphic Encryption: Design and Development Study

Background Data sharing in multicenter medical research can improve the generalizability of research, accelerate progress, enhance collaborations among institutions, and lead to new discoveries from data pooled from multiple sources. Despite these benefits, many medical institutions are unwilling to share their data, as sharing may cause sensitive information to be leaked to researchers, other institutions, and unauthorized users. Great progress has been made in the development of secure machine learning frameworks based on homomorphic encryption in recent years; however, nearly all such frameworks use a single secret key and lack a description of how to securely evaluate the trained model, which makes them impractical for multicenter medical applications. Objective The aim of this study is to provide a privacy-preserving machine learning (eg, logistic regression) protocol for multiple data providers and researchers. This protocol allows researchers to train models and then evaluate them on medical data from multiple sources while providing privacy protection for both the sensitive data and the learned model. Methods We adapted a novel threshold homomorphic encryption scheme to guarantee privacy requirements. We devised new relinearization key generation techniques for greater scalability and multiplicative depth and new model training strategies for simultaneously training multiple models through x-fold cross-validation. Results Using a client-server architecture, we evaluated the performance of our protocol. The experimental results demonstrated that, with 10-fold cross-validation, our privacy-preserving logistic regression model training and evaluation over 10 attributes in a data set of 49,152 samples took approximately 7 minutes and 20 minutes, respectively. Conclusions We present the first privacy-preserving multiparty logistic regression model training and evaluation protocol based on threshold homomorphic encryption. Our protocol is practical for real-world use and may promote multicenter medical research to some extent.
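
The 10-fold cross-validation mentioned in the abstract partitions the samples into 10 disjoint test folds, each paired with the remaining samples as training data. The following is a minimal plaintext sketch of that partitioning only, not the encrypted protocol from the study; the function name k_fold_indices is hypothetical, and the folds are contiguous for simplicity (in practice the indices would usually be shuffled first).

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Sample count taken from the abstract; k = 10 as in the reported experiment.
folds = list(k_fold_indices(49_152, 10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Because 49,152 is not divisible by 10, the folds differ in size by at most one sample. Each of the 10 training sets yields its own model, which is presumably what the abstract means by training multiple models simultaneously.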

Web-Based Privacy-Preserving Multicenter Medical Data Analysis Tools Via Threshold Homomorphic Encryption: Design and Development Study (Preprint)

10.2196/preprints.22555 ◽

2020 ◽

Author(s):

Yao Lu ◽

Tianshu Zhou ◽

Yu Tian ◽

Shiqiang Zhu ◽

Jingsong Li

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Logistic Regression Model ◽

Cross Validation ◽

Homomorphic Encryption ◽

Privacy Preserving ◽

Medical Data ◽

Multiple Sources ◽

Model Training ◽

Fold Cross Validation

AN APPLICATION OF 5-FOLD CROSS VALIDATION ON A BINARY LOGISTIC REGRESSION MODEL

Advances and Applications in Statistics ◽

10.17654/as049060443 ◽

2016 ◽

Vol 49 (6) ◽

pp. 443-451

Author(s):

A. M. C. H. Attanayake ◽

D. D. M. Jayasundara ◽

T. S. G. Peiris

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Logistic Regression Model ◽

Cross Validation ◽

Binary Logistic Regression ◽

Binary Logistic Regression Model ◽

Fold Cross Validation

First-Trimester Screening for Gestational Diabetes Mellitus Based on Maternal Characteristics and History

Fetal Diagnosis and Therapy ◽

10.1159/000369970 ◽

2014 ◽

Vol 38 (1) ◽

pp. 14-21 ◽

Cited By ~ 29

Author(s):

Argyro Syngelaki ◽

Alice Pastides ◽

Reena Kotecha ◽

Alan Wright ◽

Ranjit Akolekar ◽

...

Keyword(s):

Diabetes Mellitus ◽

Logistic Regression ◽

Gestational Diabetes ◽

Gestational Diabetes Mellitus ◽

Logistic Regression Model ◽

Cross Validation ◽

Nice Guidelines ◽

Maternal Characteristics ◽

History Of ◽

Fold Cross Validation

Objectives: To develop and validate a prediction model for gestational diabetes mellitus (GDM) at 11-13 weeks' gestation based on maternal characteristics and history and to compare its performance with the method recommended by the National Institute for Health and Care Excellence (NICE) and five other published prediction models. Methods: A predictive logistic regression model for GDM was developed from 1,827 cases (2.4%) who developed GDM and 73,334 unaffected controls. A 5-fold cross-validation study was performed to validate this model and to compare its performance with those of the NICE guidelines and the previously published models. Results: In the logistic regression model, maternal age, weight, height, racial origin, family history of diabetes, use of ovulation drugs, birth weight, and previous history of GDM were found to be significant predictors of GDM. In screening for GDM in the 5-fold cross-validation study, detection rates (DRs) were higher (p < 0.0001) for the proposed model (DR = 83.2%) than for the NICE guidelines (DR = 77.5%) for a false positive rate of approximately 40% (determined by NICE). The area under the receiver operating characteristic curve of the new model was higher (p < 0.0001) than that of the previous five models (0.823 vs. 0.688-0.786). Conclusions: Early effective screening for GDM can be achieved based on maternal characteristics and history.
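
The two metrics reported in this abstract, detection rate at a fixed false positive rate and area under the receiver operating characteristic curve, can be computed from predicted risk scores as sketched below. This is an illustrative sketch only: the function names are hypothetical, and the scores and labels are made-up toy values, not data from the study.

```python
from typing import Sequence


def detection_rate_at_fpr(scores: Sequence[float], labels: Sequence[int],
                          target_fpr: float) -> float:
    """Detection rate (sensitivity) at the loosest threshold whose
    false positive rate does not exceed target_fpr."""
    controls = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    cases = [s for s, y in zip(scores, labels) if y == 1]
    if not controls or not cases:
        raise ValueError("need at least one case and one control")
    allowed = int(target_fpr * len(controls))  # controls allowed to screen positive
    if allowed >= len(controls):
        return 1.0  # every sample screens positive
    # Screen positive when the score is strictly above this control's score.
    threshold = controls[allowed]
    return sum(s > threshold for s in cases) / len(cases)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


# Toy data (hypothetical): 4 cases (label 1) and 5 controls (label 0).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0]
dr = detection_rate_at_fpr(toy_scores, toy_labels, target_fpr=0.40)
area = auc(toy_scores, toy_labels)
print(dr, area)
```

Fixing the false positive rate first, as the abstract does with the roughly 40% rate implied by the NICE method, makes the detection rates of two screening methods directly comparable.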

Improving Geospatial Agreement by Hybrid Optimization in Logistic Regression-Based Landslide Susceptibility Modelling

Frontiers in Earth Science ◽

10.3389/feart.2021.713803 ◽

2021 ◽

Vol 9 ◽

Author(s):

Deliang Sun ◽

Haijia Wen ◽

Jiahui Xu ◽

Yalan Zhang ◽

Danzhou Wang ◽

...

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Landslide Susceptibility ◽

Logistic Regression Model ◽

Cross Validation ◽

Dominant Factor ◽

Mountainous Area ◽

Prediction Ability ◽

Before And After ◽

Fold Cross Validation

This study aims to develop a logistic regression model of landslide susceptibility based on GeoDetector for dominant-factor screening and 10-fold cross validation for training sample optimization. First, Fengjie county, a typical mountainous area, was selected as the study area since it experienced 1,522 landslides from 2001 to 2016. Second, 22 factors were selected as the initial conditioning factors, and a geospatial database was established with a grid of 30 m precision. Factor detection of the geographic detector and the stepwise regression method included in logistic regression were used to screen out the dominant factors from the database. Then, based on the sample dataset with a 1:10 ratio of landslides to nonlandslides, 10-fold cross validation was used to select the optimized sample to train the logistic regression model of landslide susceptibility in the study area. Finally, the accuracy and efficiency of the two models before and after screening out the dominant factors were evaluated and compared. The results showed that the total accuracy of both models was more than 0.9, and the area under the receiver operating characteristic curve was more than 0.8, indicating that the models before and after factor screening both had high reliability and good prediction ability. In addition, the screened factors had an active leading role in the geospatial distribution of the historical landslides, indicating that the screened dominant factors have individual rationality. By improving the geospatial agreement between landslide susceptibility and actual landslide-prone areas through the screening of dominant factors and the optimization of the training samples, a simple, efficient, and reliable logistic-regression–based landslide susceptibility model can be constructed.

Logistic regression model training based on the approximate homomorphic encryption

BMC Medical Genomics ◽

10.1186/s12920-018-0401-7 ◽

2018 ◽

Vol 11 (S4) ◽

Cited By ~ 35

Author(s):

Andrey Kim ◽

Yongsoo Song ◽

Miran Kim ◽

Keewoo Lee ◽

Jung Hee Cheon

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Logistic Regression Model ◽

Homomorphic Encryption ◽

Model Training

Multi-Party Privacy-Preserving Logistic Regression with Poor Quality Data Filtering for IoT Contributors

Electronics ◽

10.3390/electronics10172049 ◽

2021 ◽

Vol 10 (17) ◽

pp. 2049

Author(s):

Kennedy Edemacu ◽

Jong Wook Kim

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Data Quality ◽

Logistic Regression Model ◽

Homomorphic Encryption ◽

Poor Quality ◽

Privacy Preserving ◽

Quality Data ◽

Data Filtering ◽

Poor Quality Data

Nowadays, the internet of things (IoT) is used to generate data in several application domains. Logistic regression, a standard machine learning algorithm with a wide application range, is often built on such data. Nevertheless, building a powerful and effective logistic regression model requires large amounts of data. Thus, collaboration between multiple IoT participants has often been the go-to approach. However, privacy concerns and poor data quality are two challenges that threaten the success of such a setting. Several studies have proposed different methods to address the privacy concern, but to the best of our knowledge, little attention has been paid to the poor data quality problem in the multi-party logistic regression model. Thus, in this study, we propose a multi-party privacy-preserving logistic regression framework with poor quality data filtering for IoT data contributors to address both problems. Specifically, we propose a new metric, gradient similarity, in a distributed setting, which we employ to filter out parameters from data contributors with poor quality data. To solve the privacy challenge, we employ homomorphic encryption. Theoretical analysis and experimental evaluations using real-world datasets demonstrate that our proposed framework is privacy-preserving and robust against poor quality data.
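
The abstract does not define its gradient similarity metric, so the sketch below only illustrates the general idea of filtering contributors by how well their gradients agree. It assumes, purely for illustration, cosine similarity against the coordinate-wise mean gradient and a hypothetical cutoff of 0.0; it works on plaintext gradients and should not be read as the framework's actual metric or protocol.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_contributors(gradients: List[List[float]], cutoff: float) -> List[int]:
    """Return indices of contributors whose gradient is more similar than
    `cutoff` to the coordinate-wise mean gradient (an assumed reference)."""
    dim = len(gradients[0])
    mean = [sum(g[i] for g in gradients) / len(gradients) for i in range(dim)]
    return [idx for idx, g in enumerate(gradients)
            if cosine_similarity(g, mean) > cutoff]


# Toy gradients (hypothetical): the last contributor points the opposite way.
toy_gradients = [
    [1.0, 2.0, 0.5],
    [0.9, 2.1, 0.4],
    [1.1, 1.9, 0.6],
    [-1.0, -2.0, -0.5],
]
kept = filter_contributors(toy_gradients, cutoff=0.0)
print(kept)
```

A mean reference is itself pulled toward outliers, so a more robust reference (for example a coordinate-wise median) may be preferable when many contributors are unreliable.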

Prediction of K562 Cells Functional Inhibitors Based on Machine Learning Approaches

Current Pharmaceutical Design ◽

10.2174/1381612825666191107092214 ◽

2020 ◽

Vol 25 (40) ◽

pp. 4296-4302 ◽

Cited By ~ 2

Author(s):

Yuan Zhang ◽

Zhenyan Han ◽

Qian Gao ◽

Xiaoyi Bai ◽

Chi Zhang ◽

...

Keyword(s):

Machine Learning ◽

Inclusion Bodies ◽

Cross Validation ◽

Independent Set ◽

K562 Cells ◽

Machine Learning Algorithms ◽

Learning Approaches ◽

Validation Test ◽

Excess Number ◽

Fold Cross Validation

Background: β-thalassemia is a common monogenic genetic disease that is very harmful to human health. The disease arises from the deletion of or defects in β-globin, which reduces synthesis of the β-globin chain, resulting in a relative excess of α-chains. Inclusion bodies deposited on the cell membrane reduce the ability of red blood cells to deform, and the resulting massive destruction of these cells in the spleen causes a group of hereditary haemolytic diseases. Methods: In this work, machine learning algorithms were employed to build a prediction model for inhibitors against K562 based on 117 inhibitors and 190 non-inhibitors. Results: The overall accuracy (ACC) of a 10-fold cross-validation test and an independent set test using Adaboost were 83.1% and 78.0%, respectively, surpassing Bayes Net, Random Forest, Random Tree, C4.5, SVM, KNN and Bagging. Conclusion: This study indicated that Adaboost could be applied to build a learning model in the prediction of inhibitors against K562 cells.

Privacy Preserving Machine Learning with Homomorphic Encryption and Federated Learning

Future Internet ◽

10.3390/fi13040094 ◽

2021 ◽

Vol 13 (4) ◽

pp. 94

Author(s):

Haokun Fang ◽

Quan Qian

Keyword(s):

Machine Learning ◽

Homomorphic Encryption ◽

Privacy Preserving ◽

Great Success ◽

Learning Framework ◽

Computational Overhead ◽

Important Concern ◽

Speed Up ◽

Key Length ◽

Core Idea

Privacy protection has become an important concern with the great success of machine learning. This paper proposes a multi-party privacy-preserving machine learning framework, named PFMLP, based on partially homomorphic encryption and federated learning. The core idea is that all learning parties transmit only gradients encrypted with homomorphic encryption. In experiments, the model trained by PFMLP has almost the same accuracy, with a deviation of less than 1%. Considering the computational overhead of homomorphic encryption, we use an improved Paillier algorithm which can speed up the training by 25–28%. Moreover, comparisons on encryption key length, the learning network structure, number of learning clients, etc. are also discussed in detail in the paper.
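
The additive property that makes Paillier suitable for exchanging encrypted gradients can be shown with a toy implementation: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts. The sketch below is textbook Paillier with g = n + 1, not the improved variant the abstract refers to, whose details are not given here. The key is built from two small Mersenne primes and is far too short to be secure, and the fixed-point SCALE and all function names are assumptions made for illustration.

```python
import math
import secrets

SCALE = 10**6  # assumed fixed-point scale for encoding real-valued gradients


def keygen(p: int, q: int):
    """Return (n, (lam, mu)) for textbook Paillier with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, Python 3.8+
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^m mod n^2 equals 1 + m*n, so no exponentiation is needed for g^m.
    return ((1 + (m % n) * n) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def encode(x: float, n: int) -> int:
    return round(x * SCALE) % n  # negative values wrap around modulo n


def decode(m: int, n: int) -> float:
    if m > n // 2:  # upper half of the range represents negative values
        m -= n
    return m / SCALE


# Toy key from two Mersenne primes (insecure, demonstration only).
n, priv = keygen(2**31 - 1, 2**61 - 1)

# Hypothetical gradients from two parties.
party_a = [0.125, -0.5]
party_b = [0.25, 0.75]
enc_a = [encrypt(n, encode(g, n)) for g in party_a]
enc_b = [encrypt(n, encode(g, n)) for g in party_b]

# The aggregator multiplies ciphertexts; this adds the underlying plaintexts.
enc_sum = [(ca * cb) % (n * n) for ca, cb in zip(enc_a, enc_b)]
summed = [decode(decrypt(n, priv, c), n) for c in enc_sum]
print(summed)
```

The aggregator never sees an individual gradient in the clear; only the holder of the private key can decrypt, and then only the sum.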

Study on Yang-Xu Using Body Constitution Questionnaire and Blood Variables in Healthy Volunteers

Evidence-based Complementary and Alternative Medicine ◽

10.1155/2016/9437382 ◽

2016 ◽

Vol 2016 ◽

pp. 1-7 ◽

Cited By ~ 7

Author(s):

Hong-Jhang Chen ◽

Yii-Jeng Lin ◽

Pei-Chen Wu ◽

Wei-Hsiang Hsu ◽

Wan-Chung Hu ◽

...

Keyword(s):

Healthy Subjects ◽

Logistic Regression Model ◽

Cross Validation ◽

Blood Biomarkers ◽

Metabolic Characteristics ◽

Body Constitution ◽

Leave One Out ◽

The Relationship ◽

Fold Cross Validation ◽

Blood Variables

Traditional Chinese medicine (TCM) formulates treatment according to body constitution (BC) differentiation. Different constitutions have specific metabolic characteristics and different susceptibility to certain diseases. This study aimed to assess the Yang-Xu constitution using a body constitution questionnaire (BCQ) and clinical blood variables. A BCQ was employed to assess the clinical manifestation of Yang-Xu. A logistic regression model was used to explore the relationship between BC scores and biomarkers. Leave-one-out cross-validation (LOOCV) and K-fold cross-validation were performed to evaluate the accuracy of a predictive model in practice. Decision trees (DTs) were used to determine the possible relationships between blood biomarkers and BC scores. According to the BCQ analysis, 49% of participants without any BC were classified as healthy subjects. Among them, 130 samples were selected for further analysis and divided into two groups. One group comprised healthy subjects without any BC (68%), while subjects of the other group, named the sub-healthy group, had three BCs (32%). Six biomarkers, CRE, TSH, HB, MONO, RBC, and LH, were found to have the greatest impact on BCQ outcomes in Yang-Xu subjects. This study indicated significant biochemical differences in Yang-Xu subjects, which may provide a connection between blood variables and the Yang-Xu BC.

Logistic Regression Model for Loan Prediction: A Machine Learning Approach

10.1109/eti4.051663.2021.9619201 ◽

2021 ◽

Author(s):

Richa Manglani ◽

Anuja Bokhare

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Regression Model ◽

Logistic Regression Model ◽

Learning Approach ◽

Machine Learning Approach
