Privacy-Preserving String Search on Encrypted Genomic Data using a Generalized Suffix Tree

In this paper, we present approaches for performing principal component analysis (PCA) clustering on distributed heterogeneous genomic datasets with privacy protection. The approaches allow data providers to collaborate to identify gene profiles from a global viewpoint while protecting the sensitive genomic data from possible privacy leaks. We further develop a framework for privacy-preserving PCA-based gene clustering, which includes two types of participants: data providers and a trusted central site (TCS). Two methodologies are employed: Collective PCA (C-PCA) and Repeating PCA (R-PCA). C-PCA requires local sites to transmit a sample of the original data to the TCS and can be applied to any heterogeneous datasets. R-PCA requires that all local sites have the same or a similar number of columns, but releases no original data. Experiments on five independent genomic datasets show that both C-PCA and R-PCA maintain very good accuracy compared with the centralized scenario.


Efficient and Privacy-preserving Similar Patients Query Scheme over Outsourced Genomic Data

IEEE Transactions on Cloud Computing ◽

10.1109/tcc.2021.3131287 ◽

2021 ◽

pp. 1-1

Author(s):

Dan Zhu ◽

Hui Zhu ◽

Xiangyu Wang ◽

Rongxing Lu ◽

Dengguo Feng

Keyword(s):

Genomic Data ◽

Privacy Preserving


Sketching Algorithms for Genomic Data Analysis and Querying in a Secure Enclave

10.1101/468355 ◽

2018 ◽

Author(s):

Can Kockan ◽

Kaiyuan Zhu ◽

Natnatee Dokmai ◽

Nikolai Karpov ◽

Oguzhan Kulekci ◽

...

Keyword(s):

Data Analysis ◽

Data Structures ◽

Homomorphic Encryption ◽

Genomic Medicine ◽

Genomic Data ◽

Privacy Preserving ◽

Snp Analysis ◽

Data Set ◽

Genomic Data Analysis ◽

Cryptographic Techniques

Current practices in collaborative genomic data analysis (e.g. PCAWG) require all involved parties either to exchange individual patient data and perform all analysis locally, or to use a trusted server that maintains all data so that analysis is performed at a single site (e.g. the Cancer Genome Collaboratory). Since both approaches involve sharing genomic sequence data, which is typically not feasible due to privacy issues, collaborative data analysis remains a rarity in genomic medicine. In order to facilitate efficient and effective collaborative or remote genomic computation, we introduce SkSES (Sketching algorithms for Secure Enclave based genomic data analysiS), a computational framework for performing data analysis and querying on multiple, individually encrypted genomes from several institutions in an untrusted cloud environment. Unlike other techniques for secure/privacy preserving genomic data analysis, which typically rely on sophisticated cryptographic techniques with prohibitively large computational overheads, SkSES utilizes the secure enclaves supported by current generation microprocessor architectures such as Intel's SGX. The key conceptual contribution of SkSES is its use of sketching data structures that can fit in the limited memory available in a secure enclave. While streaming/sketching algorithms have been developed for many applications in computer science, their feasibility in genomics has remained largely unexplored. On the other hand, even though privacy and security issues are becoming critical in genomic medicine, available cryptographic techniques based on, e.g., homomorphic encryption or garbled circuits fail to address the performance demands of this rapidly growing field. The alternative offered by Intel's SGX, a combination of hardware and software solutions for secure data analysis, is severely limited by the relatively small size of a secure enclave, a private region of the memory protected from other processes. SkSES addresses this limitation through the use of sketching data structures to support efficient, secure and privacy preserving SNP analysis across individually encrypted VCF files from multiple institutions. In particular, SkSES gives users the ability to query for the "k" most significant SNPs among any set of user specified SNPs and for any value of "k", even when the total number of SNPs to be maintained is far beyond the memory capacity of the secure enclave.

Results: We tested SkSES on the complete iDASH-2017 competition data set, comprising 1000 case and 1000 control samples related to an unknown phenotype. SkSES was able to identify the top SNPs with respect to the chi-squared statistic, among any user specified subset of SNPs across this data set of 2000 individually encrypted complete human genomes, quickly and accurately, demonstrating the feasibility of secure and privacy preserving computation for genomic medicine via Intel's SGX.

Availability: https://github.com/ndokmai/sgx-genome-variants-search

Contact: [email protected]
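
To make the "k" most significant SNPs query concrete, the following is a minimal plain-Python sketch of ranking SNPs by a chi-squared statistic and keeping only the top "k" in a bounded heap. It is an illustration under stated assumptions, not the SkSES implementation: it runs entirely in memory, uses no sketching data structure and no secure enclave, and assumes each SNP is summarized by a 2x2 table of alternate/reference allele counts in cases and controls. The function names `chi_squared_2x2` and `top_k_snps` and the input layout are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

# Assumed input layout (hypothetical): for each SNP,
# (case_alt, case_ref, control_alt, control_ref) allele counts.
Counts = Tuple[int, int, int, int]


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero row or column total makes the statistic undefined;
        # treat such a SNP as uninformative.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def top_k_snps(counts: Dict[str, Counts], k: int) -> List[Tuple[str, float]]:
    """Return the k SNPs with the largest chi-squared statistic.

    A min-heap of size at most k is kept, so memory for the ranking
    itself grows with k rather than with the number of SNPs scanned.
    """
    if k <= 0:
        return []
    heap: List[Tuple[float, str]] = []
    for snp_id, (a, b, c, d) in counts.items():
        score = chi_squared_2x2(a, b, c, d)
        if len(heap) < k:
            heapq.heappush(heap, (score, snp_id))
        elif score > heap[0][0]:
            # Replace the current weakest candidate.
            heapq.heapreplace(heap, (score, snp_id))
    # Report the survivors from most to least significant.
    return [(snp_id, score) for score, snp_id in sorted(heap, reverse=True)]


if __name__ == "__main__":
    example = {
        "snp_a": (10, 10, 10, 10),  # no association
        "snp_b": (20, 0, 0, 20),    # strong association
        "snp_c": (15, 5, 5, 15),    # moderate association
    }
    for snp_id, score in top_k_snps(example, k=2):
        print(snp_id, score)
```

The bounded heap only limits the memory used for the final ranking; the exact per-SNP counts above would still have to be stored somewhere, which is the part the abstract says SkSES replaces with sketching data structures.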
