Identifying disease-causing mutations with privacy protection

Author(s):  
Mete Akgün ◽  
Ali Burak Ünal ◽  
Bekir Ergüner ◽  
Nico Pfeifer ◽  
Oliver Kohlbacher

Abstract Motivation The use of genome data for diagnosis and treatment is becoming increasingly common. Researchers need access to as many genomes as possible to interpret a patient’s genome, to identify statistical patterns and to reveal disease–gene relationships. The sensitive information contained in genome data and the high risk of re-identification increase the privacy and security concerns associated with sharing such data. In this article, we present an approach to identify disease-associated variants and genes while ensuring patient privacy. The proposed method uses secure multi-party computation to find disease-causing mutations under specific inheritance models without sacrificing the privacy of individuals. It discloses only the variants or genes obtained as a result of the analysis; thus, the vast majority of patient data can be kept private. Results Our prototype implementation performs analyses on thousands of genomes in milliseconds, and the runtime scales logarithmically with the number of patients. We present the first inheritance-model-based (recessive, dominant and compound heterozygous) privacy-preserving analyses of genomic data to find disease-causing mutations. Furthermore, we re-implement the privacy-preserving methods (MAX, SETDIFF and INTERSECTION) proposed in a previous study. Our MAX, SETDIFF and INTERSECTION implementations are 2.5, 1122 and 341 times faster, respectively, than the corresponding operations of the state-of-the-art protocol. Availability and implementation https://gitlab.com/DIFUTURE/privacy-preserving-genomic-diagnosis. Supplementary information Supplementary data are available at Bioinformatics online.
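The inheritance-model filters that the protocol evaluates under secure multi-party computation can be sketched in the clear (with no privacy protection) roughly as follows. Genotypes are encoded as alternate-allele counts (0 = reference, 1 = heterozygous, 2 = homozygous-alternate); all variant and gene names below are illustrative, not taken from the paper.

```python
from collections import defaultdict

# Genotype maps: variant id -> alt-allele count (0=ref, 1=het, 2=hom-alt).

def recessive_candidates(child, mother, father):
    """Variants where the child is homozygous-alt and both parents are carriers."""
    return {v for v, g in child.items()
            if g == 2 and mother.get(v) == 1 and father.get(v) == 1}

def compound_het_candidates(child, mother, father, gene_of):
    """Genes carrying two heterozygous child variants, one from each parent."""
    by_gene = defaultdict(set)
    for v, g in child.items():
        if g == 1:
            by_gene[gene_of[v]].add(v)
    hits = set()
    for gene, variants in by_gene.items():
        from_mom = {v for v in variants if mother.get(v, 0) >= 1 and father.get(v, 0) == 0}
        from_dad = {v for v in variants if father.get(v, 0) >= 1 and mother.get(v, 0) == 0}
        if from_mom and from_dad:
            hits.add(gene)
    return hits

child   = {"v1": 2, "v2": 1, "v3": 1}
mother  = {"v1": 1, "v2": 1}
father  = {"v1": 1, "v3": 1}
gene_of = {"v1": "GENE_A", "v2": "GENE_B", "v3": "GENE_B"}
print(recessive_candidates(child, mother, father))              # {'v1'}
print(compound_het_candidates(child, mother, father, gene_of))  # {'GENE_B'}
```

The paper’s contribution is to evaluate comparisons like these on secret-shared genotypes, so that only the surviving variants or genes are ever revealed.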

2019 ◽  
Author(s):  
Nour Almadhoun ◽  
Erman Ayday ◽  
Özgür Ulusoy

Abstract Motivation The rapid progress in genome sequencing has led to high availability of genomic data. However, due to growing privacy concerns about participants’ sensitive information, access to the results and data of genomic studies is restricted to trusted individuals only. On the other hand, paving the way to biomedical discoveries requires granting open access to genomic databases. Privacy-preserving mechanisms can be a solution for granting wider access to such data while protecting their owners. In particular, there has been growing interest in applying the concept of differential privacy (DP) when sharing summary statistics about genomic data. DP provides a mathematically rigorous approach, but it does not consider the dependence between tuples in a database, which may degrade the privacy guarantees offered by DP. Results In this work, focusing on genomic databases, we show this drawback of DP and propose techniques to mitigate it. First, using a real-world genomic dataset, we demonstrate the feasibility of an inference attack on differentially private query results by utilizing the correlations between the tuples in the dataset. The results show that an adversary can infer sensitive genomic data about a user from differentially private query results by exploiting correlations between the genomes of family members. Second, we propose a mechanism for privacy-preserving sharing of statistics from genomic datasets that attains privacy guarantees while taking the dependence between tuples into consideration. By evaluating our mechanism on different genomic datasets, we empirically demonstrate that it can achieve up to 50% better privacy than traditional DP-based solutions. Availability https://github.com/nourmadhoun/Differential-privacy-genomic-inference-attack. Supplementary information Supplementary data are available at Bioinformatics online.
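The DP baseline the attack targets is the standard Laplace mechanism. A minimal sketch for a counting query (e.g. "how many participants carry a given variant?", which has sensitivity 1) might look like the following; this is the generic textbook mechanism, not the paper’s dependence-aware variant, and the numbers are illustrative.

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of Laplace(0, scale).
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, sensitivity=1.0):
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    return true_count + laplace_noise(sensitivity / epsilon)

random.seed(0)
noisy = dp_count(true_count=42, epsilon=1.0)
print(round(noisy, 2))
```

The paper’s point is that when tuples are correlated (e.g. relatives’ genomes), the effective guarantee of this per-row mechanism is weaker than epsilon suggests, so the noise budget must account for the dependence.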


2018 ◽  
Vol 2018 (4) ◽  
pp. 104-124 ◽  
Author(s):  
Gilad Asharov ◽  
Shai Halevi ◽  
Yehuda Lindell ◽  
Tal Rabin

Abstract The growing availability of genomic data holds great promise for advancing medicine and research, but unlocking its full potential requires adequate methods for protecting the privacy of the individuals whose genome data we use. One example of this tension is running a Similar Patient Query on remote genomic data: in this setting, a doctor who holds the genome of his or her patient may try to find other individuals with “close” genomic data, and use the data of these individuals to help diagnose and find effective treatment for that patient’s conditions. This is clearly a desirable mode of operation, but the privacy exposure implications are considerable, so we would like to carry out the above “closeness” computation in a privacy-preserving manner. In this work we put forward a new approach for highly efficient secure computation of an approximation of the Similar Patient Query problem. We present contributions on two fronts: first, an approximation method designed with the goal of achieving efficient private computation; second, further optimizations of the two-party protocol. Our tests indicate that the approximation method works well: it returns the exact closest records in 98% of the queries and a very good approximation otherwise. As for speed, our protocol implementation takes just a few seconds to run on databases with thousands of records, each thousands of alleles long, and it scales almost linearly with both the database size and the length of the sequences in it. As an example, on the datasets of the recent iDASH competition, after a one-time preprocessing of around 12 seconds, it takes around a second to find the five nearest records to a query in a size-500 dataset of length-3500 sequences. This is 2–3 orders of magnitude faster than using state-of-the-art secure protocols with existing edit distance algorithms.
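Stripped of the approximation and of the secure two-party protocol, the underlying Similar Patient Query is a nearest-neighbor search over allele sequences. A non-private baseline using a simple per-position (Hamming) mismatch count, with made-up toy data, could look like this:

```python
def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))

def closest_k(database, query, k):
    # Return the k records closest to the query under Hamming distance.
    return sorted(database, key=lambda rec: hamming(rec, query))[:k]

db = ["AAGT", "AAGG", "CCGT", "AAAA"]
print(closest_k(db, "AAGT", k=2))  # ['AAGT', 'AAGG']
```

The paper replaces exact edit distance with an approximation designed to be cheap under secure computation, then runs this kind of top-k closeness search without either party seeing the other’s sequences.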


2019 ◽  
Vol 16 (8) ◽  
pp. 3538-3543
Author(s):  
Adegunwa ◽  
Oluwabiyi Akinkunmi ◽  
Muhammad Ehsan Rana

In recent times, especially since the beginning of the new millennium, governments, industry players, IT firms and business enterprises have given more consideration to the use of data in their decision-making and operational processes. These data, which usually contain information about users, clients and customers, are collected using a variety of infrastructure, instruments and techniques. Technological breakthroughs in the health industry and the digitalization of medical records, i.e., their transformation into Electronic Health Records (EHRs), open the possibility of accessing health records in real time anywhere through the use of big data, with the aim of reducing cost and increasing profits within the healthcare industry. However, with this advancement, threats to the privacy and security of healthcare records have inevitably crept in through malicious attacks. This paper addresses privacy- and security-related issues associated with big data, i.e., Privacy Preserving Data Publishing (PPDP) methods useful to the medical world. It explores various methods and techniques that can render data anonymous, i.e., untraceable to the original data owners, through anonymization processes. This restricts the possibility of patient privacy infringement by malicious elements while keeping the data available for analytical purposes. The anonymization process here is carried out by data publishers, who act as intermediaries between data owners and data recipients and ensure that the privacy of data owners is preserved at all times.


2020 ◽  
Author(s):  
Mohamed Abdalla ◽  
Moustafa Abdalla ◽  
Graeme Hirst ◽  
Frank Rudzicz

BACKGROUND Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models. OBJECTIVE This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information. METHODS We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each. RESULTS We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and associated sensitive information to specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient’s name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient. CONCLUSIONS Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
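The distance signal the authors report can be illustrated with invented toy vectors (nothing below comes from their embeddings): a name that co-occurs in the notes with a billing code ends up measurably closer to that code’s vector than to an unrelated code’s.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

name_vec       = [0.9, 0.1, 0.2]  # hypothetical embedding of a patient name
billed_code    = [0.8, 0.2, 0.1]  # code that co-occurs with the name in notes
unrelated_code = [0.1, 0.9, 0.8]  # code never billed for that patient

print(cosine_distance(name_vec, billed_code)
      < cosine_distance(name_vec, unrelated_code))  # True
```

This gap is exactly what makes a released embedding exploitable: an attacker only needs the trained vectors, not the underlying notes, to rank candidate diagnoses for a name that survived deidentification.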


2019 ◽  
Author(s):  
Shamila Mohammed

BACKGROUND Over the past decade, genomic data have become increasingly important in medical research. As the cost of genomic sequencing falls day by day, genomic data can be included in routine medical care, where they are used to detect and prevent inherited diseases. However, using these data for research increases the risk of leaking sensitive genetic information about individuals to unauthorized users. Many issues and challenges currently exist in preserving the privacy of genomic data; identity tracing, completion and attribute disclosure attacks are three attacks on genomic data that must be mitigated. Moreover, genomic data are difficult to access, integrate and analyze for useful decision making. This paper discusses the available sequencing methods for genomic data, where and how genomic data are useful for prediction in various applications, and a vision of genomic analytics for extracting useful patterns from these data. Although many attempts have been made on this topic, all existing works are strictly rule based, i.e., they offer no quantitative measurement of the risk of privacy breaches of genotype and phenotype information. Here, privacy-preserving linkage of genotype and phenotype information across different locations means linking genotypes stored in a sequencing facility with phenotypes stored in an electronic health record. This article discusses several aspects of genomic privacy, with a focus on identified security vulnerabilities and their possible solutions, and takes a clear-cut approach to the question of whether we need to protect genomic data or whether that need is a myth, while aiming to accelerate discoveries with the best prediction tools.
Finally, we list several genomic data protection techniques against re-identification attacks and provide a systematic comparison of existing genomic privacy-preserving methodologies (attempts made by researchers over the previous decade) in Appendix A. OBJECTIVE The importance of genomic data; comparison of existing genomic data privacy preservation methods. METHODS Re-identification; cryptographic methods. RESULTS Comparison of different methods. CONCLUSIONS Privacy is a sensitive issue and needs to be protected from the outside world and from malicious (unauthorized) users. Toward this concern, this article shares several suggestions and opinions with respect to genomic data (and other types of data). The paper begins with an introduction to genomic data and its characteristics, and to its scope and importance in medical care. We highlight related work in this area, explain the evolution of genomic sequencing and various metrics for measuring its performance, and then explain, with the help of a use case, where and why genomic data are useful. We describe how genomic data differ from other types of big data, discuss several serious concerns, challenges and research gaps, and point out opportunities for future researchers in genomic privacy. We also briefly compare genomic privacy with other types of privacy. We find that genomic privacy in particular must be protected and requires attention from the research community, and we ask the computer science community to develop techniques for data privacy and confidentiality protection that work on, and are tested against, real-world problems.


Sensors ◽  
2020 ◽  
Vol 20 (7) ◽  
pp. 1898 ◽  
Author(s):  
Junsong Fu ◽  
Na Wang ◽  
Yuanyuan Cai

Electronic medical records (EMRs) are extremely important for patients’ treatment, doctors’ diagnoses and medical technology development. In recent years, distributed healthcare blockchain systems have been researched to solve the information-island problem of centralized healthcare service systems. However, a series of important problems remain, such as the security of patients’ sensitive information, cross-institutional data sharing, medical quality and efficiency. In this paper, we establish a lightweight privacy-preserving mechanism for a healthcare blockchain system. First, we apply an interleaving encoder to encrypt the original EMRs. This hides the sensitive information in the EMRs to protect patient privacy. Second, a (t, n)-threshold lightweight message sharing scheme is presented: each EMR is mapped to n different short shares, and the original can be reconstructed from any t of them. The EMR shares, rather than the original EMRs, are stored in the blockchain nodes. This guarantees high security for EMR sharing and improves data reconstruction efficiency. Third, the indexes of the stored EMR shares are employed to generate blocks that are chained together and finally form a blockchain. Authorized data users or institutions can recover an EMR by requesting at least t of its shares from the blockchain nodes. In this way, the healthcare blockchain system can not only facilitate cross-institutional sharing but also provide proper protection for the EMRs. The security proof and analysis indicate that the proposed scheme protects the privacy and security of patients’ medical information. The simulation results show that our proposed scheme is more efficient than similar schemes in the literature in terms of energy consumption and storage space, and that the healthcare blockchain system is more stable with the proposed message sharing scheme.
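The (t, n)-threshold idea is that of Shamir-style secret sharing: a secret (here, an encoded EMR chunk) is split into n shares, any t of which reconstruct it, while fewer reveal nothing. The paper presents a lightweight variant, so the following is only a generic sketch over a prime field, with all parameters illustrative.

```python
import random

P = 2_147_483_647  # prime field modulus (2^31 - 1)

def split(secret, t, n):
    # Random degree-(t-1) polynomial with the secret as constant term;
    # share i is the point (i, f(i)).
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term.
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % P
                den = den * (xj - xm) % P
        secret = (secret + yj * num * pow(den, P - 2, P)) % P
    return secret

random.seed(1)
shares = split(secret=123456, t=3, n=5)
print(reconstruct(shares[:3]))   # 123456
print(reconstruct(shares[1:4]))  # 123456
```

Storing the shares on different blockchain nodes means no single node ever holds an EMR, while any t cooperating nodes can serve an authorized reconstruction request.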


Data Science ◽  
2021 ◽  
Vol 4 (2) ◽  
pp. 121-150
Author(s):  
Chang Sun ◽  
Lianne Ippel ◽  
Andre Dekker ◽  
Michel Dumontier ◽  
Johan van Soest

Combining and analysing sensitive data from multiple sources offers considerable potential for knowledge discovery. However, a number of issues pose problems for such analyses, including technical barriers, privacy restrictions, security concerns and trust issues. Privacy-preserving distributed data mining (PPDDM) techniques aim to overcome these challenges by extracting knowledge from partitioned data while minimizing the release of sensitive information. This paper reports the results and findings of a systematic review of PPDDM techniques from 231 scientific articles published in the past 20 years. We summarize the state of the art, compare the problems they address and identify the outstanding challenges in the field. The review identifies the consequences of the lack of standard criteria for evaluating new PPDDM methods and proposes comprehensive evaluation criteria with 10 key factors. We discuss the ambiguous definitions of privacy and the confusion between privacy and security in the field, and provide suggestions on how to write a clear and applicable privacy description for new PPDDM techniques. The findings from our review enhance understanding of the challenges of applying theoretical PPDDM methods to real-life use cases and of the importance of involving legal, ethical and social experts in implementing PPDDM methods. This comprehensive review will serve as a helpful guide to past research and future opportunities in the area of PPDDM.


10.2196/18055 ◽  
2020 ◽  
Vol 22 (7) ◽  
pp. e18055
Author(s):  
Mohamed Abdalla ◽  
Moustafa Abdalla ◽  
Graeme Hirst ◽  
Frank Rudzicz

Background Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models. Objective This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information. Methods We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each. Results We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and associated sensitive information to specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient’s name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient. Conclusions Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.


Author(s):  
S. Karthiga Devi ◽  
B. Arputhamary

Today, the volume of healthcare data generated is increasing rapidly as the number of patients in each hospital grows. These data are vital for decision making and for delivering the best care to patients. Healthcare providers are now faced with collecting, managing, storing and securing huge amounts of sensitive protected health information. As a result, an increasing number of healthcare organizations are turning to cloud-based services. Cloud computing offers a viable, secure alternative to premises-based healthcare solutions, and its infrastructure is characterized by high-volume storage and high throughput. Privacy and security are the two most important concerns in cloud-based healthcare services. Healthcare organizations should have electronic medical records in order to use the cloud infrastructure. This paper surveys the challenges of cloud computing in healthcare and the benefits of cloud techniques for the healthcare industry.


2016 ◽  
Vol 6 (1) ◽  
pp. 75-81 ◽  
Author(s):  
B. Blobel ◽  
D. M. Lopez ◽  
C. Gonzalez
