Privacy Exposure Measure: A Privacy-Preserving Technique for Health-Related Web Search

2019 ◽  
Vol 9 (6) ◽  
pp. 1196-1204 ◽  
Author(s):  
Rafiullah Khan ◽  
Muhammad Arshad Islam ◽  
Mohib Ullah ◽  
Muhammad Aleem ◽  
Muhammad Azhar Iqbal

The increasing use of web search engines (WSEs) to search for healthcare information has resulted in a growing number of users exposing personal health information online. A recent survey reports that over 80% of patients use WSEs to seek health information. However, WSEs store users' queries to analyze user behavior, rank results, personalize search, target advertisements, and perform other activities. Because health-related queries contain privacy-sensitive information, this practice may infringe on users' privacy. Privacy-preserving web search techniques such as anonymizing networks, profile obfuscation, and private information retrieval (PIR) protocols are therefore used to protect users' privacy. In this paper, we propose the Privacy Exposure Measure (PEM), a technique that enables users to control their privacy exposure while using PIR protocols. PEM assesses the similarity between the user's profile and a query before the query is posted to the WSE and helps the user avoid privacy exposure. Experiments show a 37.2% difference between user profiles built through the PEM-powered PIR protocol and typical user profiles. Moreover, PEM offers the user greater privacy even in the case of a machine-learning attack.
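The core of a PEM-style check is comparing each new query against the user's accumulated search profile before it is released to the WSE, and flagging queries that would reveal too much. The sketch below illustrates one plausible form of that check; the bag-of-words profile, cosine similarity, and the 0.3 threshold are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch of a profile-vs-query similarity gate (not the authors' exact PEM).
# Assumptions: the profile is a bag-of-words over past queries, similarity is cosine,
# and a query whose similarity exceeds a threshold is treated as privacy-exposing.
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def exposes_profile(profile_queries: list[str], new_query: str, threshold: float = 0.3) -> bool:
    """Return True if the new query looks too similar to the user's search history."""
    profile = Counter(t for q in profile_queries for t in q.lower().split())
    query = Counter(new_query.lower().split())
    return cosine_similarity(profile, query) >= threshold

history = ["diabetes diet plan", "insulin dosage schedule", "blood sugar monitor"]
print(exposes_profile(history, "insulin pump price"))      # overlaps with history: likely True
print(exposes_profile(history, "weather forecast paris"))  # unrelated topic: likely False
```

A query flagged this way could then be withheld, generalized, or routed differently by the PIR layer, which is the kind of user-side control the abstract describes.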

Author(s):  
Michael Snyder

What other types of personal health information can be readily collected? Most health-related measurements are administered in or through a doctor’s office and are typically taken when we are sick; the measurements that are taken when we are healthy are infrequent, and we often...


2020 ◽  
Vol 2020 ◽  
pp. 1-11 ◽  
Author(s):  
Rafiullah Khan ◽  
Arshad Ahmad ◽  
Alhuseen Omar Alsayed ◽  
Muhammad Binsawad ◽  
Muhammad Arshad Islam ◽  
...  

With the advancement of ICT, web search engines (WSEs) have become a preferred source for finding health-related information published on the Internet. Google alone receives more than one billion health-related queries daily. However, to provide the results most relevant to the user, WSEs maintain user profiles. These profiles may contain private and sensitive information such as the user's health condition and disease status. Because health-related queries contain privacy-sensitive information, this raises serious concerns: the identity of a user is exposed and may be misused by the WSE and third parties. One well-known solution for preserving privacy is to issue queries via a peer-to-peer private information retrieval protocol, such as useless user profile (UUP), thereby hiding the user's identity from the WSE. This paper investigates the level of protection offered by UUP. For this purpose, we present the QuPiD (query profile distance) attack, a machine learning-based attack that evaluates the effectiveness of UUP in privacy protection. The QuPiD attack determines the distance between the user's profile (web search history) and an incoming query using our proposed novel feature vector. For comparison, experiments were conducted using ten classification algorithms belonging to the tree-based, rule-based, lazy-learner, metaheuristic, and Bayesian families. Furthermore, two subsets of an America Online (AOL) dataset (a noisy and a clean dataset) were used for experimentation. The results show that the proposed QuPiD attack associates more than 70% of queries with the correct user at a precision of over 72% on the clean dataset, and more than 40% of queries with the correct user at 70% precision on the noisy dataset.
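A QuPiD-style linkage attack can be framed as supervised classification: given features relating a candidate query to a user's search history, predict whether the query belongs to that user. The sketch below uses a simple lexical-overlap feature vector and a random forest from scikit-learn; these features and this classifier are illustrative assumptions and do not reproduce the paper's feature vector or its ten-algorithm comparison.

```python
# Sketch of a QuPiD-style linkage attack as binary classification (illustrative only).
# The feature vector here is a crude lexical relation between a query and a user's history.
from sklearn.ensemble import RandomForestClassifier

def query_profile_features(history: list[str], query: str) -> list[float]:
    """Relate a query to a user's search history with a few overlap features."""
    hist_terms = set(t for q in history for t in q.lower().split())
    q_terms = set(query.lower().split())
    overlap = len(hist_terms & q_terms)
    return [
        overlap,                                   # shared terms with the profile
        overlap / len(q_terms) if q_terms else 0,  # fraction of query covered by profile
        len(q_terms),                              # query length
    ]

history = ["diabetes diet plan", "insulin dosage schedule", "blood sugar monitor"]
# Training pairs: (query, 1 if it belongs to this user, else 0)
train = [
    ("insulin pump price", 1), ("blood sugar after meals", 1),
    ("cheap flights to rome", 0), ("football scores today", 0),
]
X = [query_profile_features(history, q) for q, _ in train]
y = [label for _, label in train]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([query_profile_features(history, "diabetes insulin therapy")]))  # likely [1]
```

In the shuffled-query setting that UUP creates, an adversary would score every incoming query against every known profile this way and assign it to the best-matching user, which is the linkage the paper measures.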


10.2196/18055 ◽  
2020 ◽  
Vol 22 (7) ◽  
pp. e18055
Author(s):  
Mohamed Abdalla ◽  
Moustafa Abdalla ◽  
Graeme Hirst ◽  
Frank Rudzicz

Background: Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models. Objective: This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information. Methods: We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each. Results: We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and associated sensitive information to specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient’s name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient. Conclusions: Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
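The attack's key signal is that, when names survive deidentification, their embedding vectors sit measurably closer to diagnoses that co-occur with them in the notes than to diagnoses that do not. The sketch below demonstrates this effect with a tiny synthetic corpus and gensim's word2vec; the corpus, the leaked name tokens, and the billing-code tokens are invented for illustration and are not the authors' data or experimental setup.

```python
# Illustrative sketch: a name left in a "deidentified" corpus drifts toward the
# diagnoses it co-occurs with. Toy corpus and tokens are invented (gensim 4.x).
from gensim.models import Word2Vec

# Tiny synthetic "clinical notes" in which a leaked name co-occurs with one billing code.
sentences = [
    ["patient", "john_smith", "seen", "for", "code_E11", "followup"],   # diabetes code
    ["john_smith", "reports", "stable", "glucose", "code_E11"],
    ["patient", "jane_doe", "seen", "for", "code_I10", "followup"],     # hypertension code
    ["jane_doe", "blood", "pressure", "controlled", "code_I10"],
] * 50  # repeat so the toy model has enough co-occurrence statistics

model = Word2Vec(sentences, vector_size=32, window=3, min_count=1, seed=0, epochs=20)

# Compare the name's similarity to a code billed for that patient vs. one that is not.
billed = model.wv.similarity("john_smith", "code_E11")
not_billed = model.wv.similarity("john_smith", "code_I10")
print(f"similarity to billed code: {billed:.3f}, to non-billed code: {not_billed:.3f}")
# If the billed-code similarity is consistently higher, the embedding leaks the association.
```

Scaled up to a real clinical corpus, ranking all billing codes by similarity to a recovered name is what allows sensitive information to be attributed to incompletely deidentified patients, as the abstract describes.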
