Exploring the Privacy-Preserving Properties of Word Embeddings: Algorithmic Validation Study (Preprint)

2020 ◽  
Author(s):  
Mohamed Abdalla ◽  
Moustafa Abdalla ◽  
Graeme Hirst ◽  
Frank Rudzicz

BACKGROUND: Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models.
OBJECTIVE: This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information.
METHODS: We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each.
RESULTS: We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and to associate sensitive information with specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient’s name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient.
CONCLUSIONS: Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
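The name-reconstruction attack the abstract describes can be sketched in miniature. The sketch below is a deliberate simplification, not the paper's algorithm: it assumes we already have embedding vectors for candidate first names and surnames (the names and vectors here are invented), and it pairs each first name with its nearest surname by cosine similarity, exploiting the fact that tokens that co-occur in the corpus end up with similar vectors.

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain tuples)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(y * y for y in v) ** 0.5
    return dot / (nu * nv)

# Toy embeddings: a first name and the surname of the same (hypothetical)
# patient co-occur in the notes, so their vectors point in similar directions.
emb = {
    "alice":  (0.9, 0.1, 0.0),
    "zhang":  (0.8, 0.2, 0.1),   # co-occurs with "alice"
    "robert": (0.1, 0.9, 0.2),
    "singh":  (0.0, 0.8, 0.3),   # co-occurs with "robert"
}

first_names = ["alice", "robert"]
surnames = ["zhang", "singh"]

# Reconstruct full names: pair each first name with the surname whose
# vector is closest to it in embedding space.
reconstructed = {
    f: max(surnames, key=lambda s: cosine(emb[f], emb[s]))
    for f in first_names
}
print(reconstructed)  # {'alice': 'zhang', 'robert': 'singh'}
```

The same distance comparison underlies the billing-code finding: the vector of a patient's name sits measurably closer to codes billed for that patient than to codes that were not.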

10.2196/18055 ◽  
2020 ◽  
Vol 22 (7) ◽  
pp. e18055


2019 ◽  
Vol 9 (6) ◽  
pp. 1196-1204 ◽  
Author(s):  
Rafiullah Khan ◽  
Muhammad Arshad Islam ◽  
Mohib Ullah ◽  
Muhammad Aleem ◽  
Muhammad Azhar Iqbal

The increasing use of web search engines (WSEs) for searching healthcare information has resulted in a growing number of users posting personal health information online. A recent survey demonstrates that over 80% of patients use WSEs to seek health information. However, WSEs store these users' queries to analyze user behavior and to support result ranking, personalization, targeted advertisements, and other activities. Because health-related queries contain privacy-sensitive information, storing them may infringe the user's privacy. Therefore, privacy-preserving web search techniques such as anonymizing networks, profile obfuscation, and private information retrieval (PIR) protocols are used to protect the user's privacy. In this paper, we propose the Privacy Exposure Measure (PEM), a technique that enables users to control their privacy exposure while using PIR protocols. PEM assesses the similarity between the user's profile and a query before the query is posted to the WSE and assists the user in avoiding privacy exposure. The experiments demonstrate a 37.2% difference between user profiles created through the PEM-powered PIR protocol and ordinary user profiles. Moreover, PEM offers more privacy to the user even in the case of a machine-learning attack.
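The profile-versus-query check that PEM performs can be sketched roughly as follows. This is an assumption-laden illustration, not the paper's method: it uses Jaccard overlap between query terms and profile terms as a stand-in similarity measure, and the `pem_check` function, the profile terms, and the 0.2 threshold are all invented for the example.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def pem_check(query: str, profile_terms: set, threshold: float = 0.2) -> bool:
    """Return True if the query looks safe to post to the WSE directly;
    False if it overlaps the user's profile enough that posting it would
    leak profile information (so it should instead be obfuscated or
    routed through the PIR protocol)."""
    query_terms = set(query.lower().split())
    return jaccard(query_terms, profile_terms) < threshold

# Hypothetical user profile built from past health-related queries.
profile = {"diabetes", "insulin", "glucose", "metformin"}

print(pem_check("insulin glucose dosage", profile))     # False: too similar to the profile
print(pem_check("weather forecast tomorrow", profile))  # True: unrelated, safe to post
```

A real measure would likely weight terms (e.g. by TF-IDF) rather than treat them uniformly, but the control flow is the same: assess similarity first, post only when exposure stays below the user's chosen threshold.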


2013 ◽  
Vol 41 (S1) ◽  
pp. 37-41 ◽  
Author(s):  
Khaled El Emam ◽  
Ester Moher

Even though health care provider reporting of diseases to public health authorities is common, providers often under-report, including for notifiable diseases; frequently, under-reporting occurs by wide margins. Two causal factors for this under-reporting by providers have been that: (1) disclosing data may violate their patients’ privacy, and (2) disclosed data may be used to evaluate their performance. A reluctance to disclose information due to privacy concerns exists despite the U.S. Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule permitting disclosures of personal health information (PHI) for public health purposes without patient authorization. On the other hand, such patient privacy concerns are somewhat justified: there have been documented breaches of patient information from public health data custodians. A common way to address this privacy issue is to de-identify patient data before it is disclosed to public health authorities.


2018 ◽  
Vol 27 (01) ◽  
pp. 060-066
Author(s):  
Linda Kloss ◽  
Melanie Brodnik ◽  
Laurie Rinehart-Thompson

Objectives: To assess the current health data access and disclosure environment for potential privacy-protecting mechanisms that enable legitimate use of personal health information while preserving the rights of individuals. To identify the gaps and challenges between increasing requests and expanding uses of such information and the regulations, technologies, and management practices that permit appropriate access and disclosure while guarding against harmful misuse of such information. Methods: A scoping literature review focused on (1) regulations affecting access and disclosure of personal health information, (2) the uses of health information that challenge access and disclosure boundaries, and (3) privacy management practices that may help mitigate gaps in protecting patient privacy. Results: Countries and jurisdictions are developing laws, regulations, and public policies to balance the privacy rights of individuals and the unprecedented opportunities to advance health and health care through expanded uses of health data. Regulations and guidance are evolving, but they are outpaced by the increasing demand for and the challenges of managing access and disclosure. Mechanisms such as consent and authorization may not always be adequate. Mechanisms that advance principled stewardship are more important than ever. Conclusions: Access and disclosure management are important dimensions of privacy management practices. This is a volatile period in which diverging public policies may reveal how best to balance access and disclosure of personal health information by individuals and by institutional custodians of the information. Approaches to access and disclosure management, including the roles of individuals, should be a focus for research and study in the years ahead.


Author(s):  
Mete Akgün ◽  
Ali Burak Ünal ◽  
Bekir Ergüner ◽  
Nico Pfeifer ◽  
Oliver Kohlbacher

Abstract Motivation The use of genome data for diagnosis and treatment is becoming increasingly common. Researchers need access to as many genomes as possible to interpret the patient genome, to obtain statistical patterns and to reveal disease–gene relationships. The sensitive information contained in genome data and the high risk of re-identification increase the privacy and security concerns associated with sharing such data. In this article, we present an approach to identify disease-associated variants and genes while ensuring patient privacy. The proposed method uses secure multi-party computation to find disease-causing mutations under specific inheritance models without sacrificing the privacy of individuals. It discloses only the variants or genes obtained as a result of the analysis. Thus, the vast majority of patient data can be kept private. Results Our prototype implementation performs analyses on thousands of genomes in milliseconds, and the runtime scales logarithmically with the number of patients. We present the first privacy-preserving analyses of genomic data based on inheritance models (recessive, dominant and compound heterozygous) to find disease-causing mutations. Furthermore, we re-implement the privacy-preserving methods (MAX, SETDIFF and INTERSECTION) proposed in a previous study. Our MAX, SETDIFF and INTERSECTION implementations are 2.5, 1122 and 341 times faster than the corresponding operations of the state-of-the-art protocol, respectively. Availability and implementation https://gitlab.com/DIFUTURE/privacy-preserving-genomic-diagnosis. Supplementary information Supplementary data are available at Bioinformatics online.
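The inheritance-model filters that such a protocol evaluates can be illustrated in the clear. The sketch below shows only the plaintext predicates, with invented gene/variant labels and a toy genotype encoding; in the actual system the same logic runs under secure multi-party computation so that individual genotypes are never revealed, and only the surviving variants or genes are disclosed.

```python
# Toy genotype encoding per variant: 0 = ref/ref, 1 = heterozygous, 2 = hom. alt.
# patients[i] maps "GENE:variant" labels to patient i's genotype.
patients = [
    {"G1:v1": 2, "G2:v1": 1, "G2:v2": 1},
    {"G1:v1": 2, "G2:v1": 1, "G2:v2": 1},
]

def recessive_candidates(patients):
    """Variants homozygous-alternate in every affected patient."""
    shared = set(patients[0])
    for p in patients[1:]:
        shared &= set(p)
    return sorted(v for v in shared if all(p[v] == 2 for p in patients))

def compound_het_candidates(patients):
    """Genes carrying at least 2 distinct heterozygous variants in every patient."""
    candidate_genes = {v.split(":")[0] for v in patients[0]}
    return sorted(
        g for g in candidate_genes
        if all(sum(1 for v, gt in p.items()
                   if v.startswith(g + ":") and gt == 1) >= 2
               for p in patients)
    )

print(recessive_candidates(patients))     # ['G1:v1']
print(compound_het_candidates(patients))  # ['G2']
```

Under the secure variant of this computation, each party would hold only secret shares of the genotype values, and the equality and counting tests above would be evaluated jointly without any party learning the inputs.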

