Open Imputation Server provides secure Imputation services with provable genomic privacy

As DNA sequencing data becomes available for personal use, genomic privacy is becoming a major challenge. Nevertheless, the outsourcing of high-throughput genomic data analysis is performed using pipelines that tend to overlook these challenges. Results: We present a client-server outsourcing framework for genotype imputation, an important step in genomic data analysis. Genotype data are encrypted by the client, and the server operates only on the encrypted data and never observes them in plaintext. The cloud-based framework can benefit from virtually unlimited computational resources while providing provable confidentiality. Availability: The server is publicly available at https://www.secureomics.org/OpenImpute. Users can test and use the imputation server anonymously, without registration.

Denoising of Aligned Genomic Data

Scientific Reports ◽

10.1038/s41598-019-51418-z ◽

2019 ◽

Vol 9 (1) ◽

Author(s):

Irena Fischer-Hwang ◽

Idoia Ochoa ◽

Tsachy Weissman ◽

Mikel Hernaez

Keyword(s):

State Of The Art ◽

Variant Calling ◽

Genomic Data ◽

Clinical Settings ◽

Data Sets ◽

Sequencing Data ◽

Denoising Method ◽

Data Set ◽

Genomic Data Analysis ◽

Variant Identification

Abstract: Noise in genomic sequencing data is known to affect various stages of genomic data analysis pipelines. Variant identification is an important step in many of these pipelines, and is increasingly being used in clinical settings to aid medical practice. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual-chromosome and whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to the identification of almost 2,000 additional true variants and the elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE.

Denoising of Aligned Genomic Data

10.1101/590372 ◽

2019 ◽

Author(s):

Irena Fischer-Hwang ◽

Idoia Ochoa ◽

Tsachy Weissman ◽

Mikel Hernaez

Keyword(s):

State Of The Art ◽

Variant Calling ◽

Genomic Data ◽

Clinical Settings ◽

Data Sets ◽

Sequencing Data ◽

Denoising Method ◽

Data Set ◽

Genomic Data Analysis ◽

Variant Identification

Abstract: Noise in genomic sequencing data is known to affect various stages of genomic data analysis pipelines. Variant identification is an important step in many of these pipelines, and is increasingly being used in clinical settings to aid medical practice. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual-chromosome and whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to the identification of almost 2,000 additional true variants and the elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE.

Ultra-Fast Homomorphic Encryption Models enable Secure Outsourcing of Genotype Imputation

10.1101/2020.07.02.183459 ◽

2020 ◽

Author(s):

Miran Kim ◽

Arif Harmanci ◽

Jean-Philippe Bossuat ◽

Sergiu Carpov ◽

Jung Hee Cheon ◽

...

Keyword(s):

Data Analysis ◽

Homomorphic Encryption ◽

Genomic Data ◽

Genetic Data ◽

Genotype Imputation ◽

Privacy Concerns ◽

Data Owner ◽

Lower Accuracy ◽

Genomic Data Analysis ◽

In Transit

Abstract: Genotype imputation is a fundamental step in genomic data analyses such as GWAS, in which missing variant genotypes are predicted from the existing genotypes of nearby ‘tag’ variants. Imputation greatly decreases genotyping cost and provides high-quality estimates of common variant genotypes. As population panels grow, e.g., through the TOPMed Project, genotype imputation is becoming more accurate, but it requires substantial computational power. Although researchers can outsource genotype imputation, privacy concerns may prohibit sharing genetic data with an untrusted imputation service. To address this problem, we developed the first fully secure genotype imputation by utilizing ultra-fast homomorphic encryption (HE) techniques that can evaluate millions of imputation models in seconds. In HE-based methods, the genotype data is end-to-end encrypted, i.e., encrypted in transit, at rest, and, most importantly, in analysis, and can be decrypted only by the data owner. We compared secure imputation with three state-of-the-art non-secure methods under different settings. We found that HE-based methods provide full genetic data security with comparable or slightly lower accuracy. In addition, HE-based methods have time and memory requirements that are comparable to, or even lower than, those of the non-secure methods. We provide five different implementations and workflows that make use of three cutting-edge HE schemes (BFV, CKKS, TFHE) developed by the top contestants of the iDASH19 Genome Privacy Challenge. Our results provide strong evidence that HE-based methods can practically perform resource-intensive computations for high-throughput genetic data analysis. In addition, the publicly available codebases provide a reference for the development of secure genomic data analysis methods.
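To illustrate what "encrypted in analysis" can mean in practice, here is a minimal sketch; it is not the implementation described in the abstract above. It substitutes the textbook Paillier cryptosystem, which is only additively homomorphic, for the BFV, CKKS and TFHE schemes named by the authors, and it assumes that an imputation model is a linear score over tag genotypes with integer weights. All function names, the example weights and the key size are hypothetical, and the primes are far too small for real use.

```python
import math
import random


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier key pair from two known Mersenne primes (insecure size)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+); valid for g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Client side: encrypt the integer m under the public key n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    # (1 + n)^m = 1 + m*n (mod n^2); r^n adds the randomness.
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    """Client side: decrypt c and decode it as a signed integer."""
    lam, mu = priv
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m


def server_impute(n, enc_tags, weights, intercept):
    """Server side: compute Enc(intercept + sum(w_i * g_i)) without any key.

    Multiplying ciphertexts adds the plaintexts; raising a ciphertext to
    the power w multiplies its plaintext by w.
    """
    n2 = n * n
    acc = (1 + (intercept % n) * n) % n2  # public intercept, no randomness
    for c, w in zip(enc_tags, weights):
        acc = acc * pow(c, w % n, n2) % n2
    return acc


# Hypothetical usage: tag genotypes are 0/1/2 alternate-allele counts.
n, priv = keygen()
enc_tags = [encrypt(n, g) for g in [0, 1, 2]]            # client encrypts
enc_score = server_impute(n, enc_tags, [3, -2, 5], 1)    # server sees only ciphertexts
print(decrypt(n, priv, enc_score))                       # client decrypts: 9
```

Real-valued model weights would need fixed-point scaling before encryption, and the lattice-based schemes named in the abstract also support packing many values into one ciphertext, which this sketch does not attempt.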

Characterizing Performance Variation of Genomic Data Analysis Workflows on the Public Cloud

2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech) ◽

10.1109/dasc-picom-cbdcom-cyberscitech49142.2020.00116 ◽

2020 ◽

Author(s):

David Perez ◽

Ling-Hong Hung ◽

Sonia Xu ◽

Ka Yee Yeung ◽

Wes Lloyd

Keyword(s):

Data Analysis ◽

Genomic Data ◽

Public Cloud ◽

The Public ◽

Performance Variation ◽

Genomic Data Analysis

Statistical Genetics for Genomic Data Analysis

Springer Handbook of Engineering Statistics ◽

10.1007/978-1-84628-288-1_32 ◽

2006 ◽

pp. 591-605

Author(s):

Jae Lee

Keyword(s):

Data Analysis ◽

Genomic Data ◽

Statistical Genetics ◽

Genomic Data Analysis

Introduction to R for Genomic Data Analysis

Computational Genomics with R ◽

10.1201/9780429084317-2 ◽

2020 ◽

pp. 23-66

Author(s):

Altuna Akalin

Keyword(s):

Data Analysis ◽

Genomic Data ◽

Genomic Data Analysis

EagleImp: Fast and Accurate Genome-wide Phasing and Imputation in a Single Tool

10.1101/2022.01.11.475810 ◽

2022 ◽

Author(s):

Lars Wienbrandt ◽

David Ellinghaus

Keyword(s):

Memory Management ◽

Imputation Accuracy ◽

Simulated Data ◽

Genotype Imputation ◽

Whole Genome Sequencing Data ◽

Common Variants ◽

Sequencing Data ◽

1000 Genomes ◽

Genome Wide ◽

Reference Genomes

Background: Reference-based phasing and genotype imputation algorithms have been developed with sublinear theoretical runtime behaviour, but runtimes are still high in practice when large genome-wide reference datasets are used. Methods: We developed EagleImp, software with algorithmic and technical improvements and new features for accurate and accelerated phasing and imputation in a single tool. Results: We compared the accuracy and runtime of EagleImp with Eagle2, PBWT and prominent imputation servers using whole-genome sequencing data from the 1000 Genomes Project, the Haplotype Reference Consortium and simulated data with more than 1 million reference genomes. EagleImp is 2 to 10 times faster (depending on the single or multiprocessor configuration selected) than Eagle2/PBWT, with the same or better phasing and imputation quality in all tested scenarios. For common variants investigated in typical GWAS studies, EagleImp provides the same or higher imputation accuracy than the Sanger Imputation Service, the Michigan Imputation Server and the newly developed TOPMed Imputation Server, despite their larger (not publicly available) reference panels. It has many new features, including automated chromosome splitting and memory management at runtime to avoid job aborts, fast reading and writing of large files, and various user-configurable algorithm and output options. Conclusions: Due to the technical optimisations, EagleImp can perform fast and accurate reference-based phasing and imputation for future very large reference panels with more than 1 million genomes. EagleImp is freely available for download from https://github.com/ikmb/eagleimp.

Spatially constrained tumour growth affects the patterns of clonal selection and neutral drift in cancer genomic data

10.1101/544536 ◽

2019 ◽

Cited By ~ 3

Author(s):

Kate Chkhaidze ◽

Timon Heide ◽

Benjamin Werner ◽

Marc J. Williams ◽

Weini Huang ◽

...

Keyword(s):

Next Generation Sequencing ◽

Tumour Growth ◽

Evolutionary Dynamics ◽

Clonal Selection ◽

Genomic Data ◽

Confounding Factors ◽

Data Generation ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Abstract: Quantification of the effect of spatial tumour sampling on the patterns of mutations detected in next-generation sequencing data is largely lacking. Here we use a spatial stochastic cellular automaton model of tumour growth that accounts for somatic mutations, selection, drift and spatial constraints to simulate multi-region sequencing data derived from spatial sampling of a neoplasm. We show that the spatial structure of a solid cancer has a major impact on the detection of clonal selection and genetic drift from bulk sequencing data and single-cell sequencing data. Our results indicate that spatial constraints can introduce significant sampling biases when performing multi-region bulk sampling and that such bias becomes a major confounding factor for the measurement of the evolutionary dynamics of human tumours. We present a statistical inference framework that takes into account the spatial effects of a growing tumour and allows inferring the evolutionary dynamics from patient genomic data. Our analysis shows that measuring cancer evolution using next-generation sequencing while accounting for the numerous confounding factors requires a mechanistic model-based approach that captures the sources of noise in the data.

Summary: Sequencing the DNA of cancer cells from human tumours has become one of the main tools to study cancer biology. However, sequencing data are complex and often difficult to interpret. In particular, the way in which the tissue is sampled and the data are collected significantly impacts the interpretation of the results. We argue that understanding cancer genomic data requires mathematical models and computer simulations that tell us what we expect the data to look like, with the aim of understanding the impact of confounding factors and biases in the data generation step. In this study, we develop a spatial simulation of tumour growth that also simulates the data generation process, and demonstrate that biases in the sampling step and current technological limitations severely impact the interpretation of the results. We then provide a statistical framework that can be used to overcome these biases and more robustly measure aspects of the biology of tumours from the data.
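As a rough illustration of the kind of model the abstract describes, the following is a toy sketch and not the authors' simulator. It assumes a 2D square lattice, neutral growth (no selection), at most one new mutation per division, and a square "bulk sample"; every name and parameter here is hypothetical. The spatial constraint is that a cell can divide only if one of its four neighbouring sites is empty.

```python
import random
from collections import Counter


def grow_tumour(n_cells, mut_prob=0.5, seed=0):
    """Grow a tumour on a 2D lattice; returns {(x, y): frozenset of mutation ids}."""
    rng = random.Random(seed)
    cells = {(0, 0): frozenset()}  # founder cell carries no mutations
    next_mut = 0
    while len(cells) < n_cells:
        x, y = rng.choice(list(cells))  # O(n) per step, acceptable for a toy
        free = [
            (x + dx, y + dy)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if (x + dx, y + dy) not in cells
        ]
        if not free:
            continue  # interior cells are boxed in and cannot divide
        muts = cells[(x, y)]
        if rng.random() < mut_prob:
            muts = muts | {next_mut}  # daughter gains one new unique mutation
            next_mut += 1
        cells[rng.choice(free)] = muts
    return cells


def bulk_sample(cells, centre, radius):
    """Fraction of cells carrying each mutation inside a square region."""
    cx, cy = centre
    region = [
        muts
        for (x, y), muts in cells.items()
        if abs(x - cx) <= radius and abs(y - cy) <= radius
    ]
    if not region:
        return {}
    counts = Counter(m for muts in region for m in muts)
    return {m: c / len(region) for m, c in counts.items()}


# Hypothetical usage: compare two spatially separated bulk samples.
tumour = grow_tumour(500, seed=1)
core = bulk_sample(tumour, (0, 0), 3)
edge = bulk_sample(tumour, (8, 0), 3)
print(len(core), len(edge))  # number of distinct mutations seen in each sample
```

Because only cells at the tumour surface can divide, samples taken from different regions tend to report different mutation frequencies even without selection, which is the kind of sampling bias the abstract refers to.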
