Iterative learning of single individual haplotypes from high-throughput DNA sequencing data

Abstract: High-throughput DNA sequencing enables detection of copy number variations (CNVs) on the genome-wide scale with finer resolution compared to array-based methods, but suffers from biases and artifacts that lead to false discoveries and low sensitivity. We describe CODEX2, a statistical framework for full-spectrum CNV profiling that is sensitive for variants with both common and rare population frequencies and that is applicable to study designs with and without negative control samples. We demonstrate and evaluate CODEX2 on whole-exome and targeted sequencing data, where biases are the most prominent. CODEX2 outperforms existing methods and, in particular, significantly improves sensitivity for common CNVs.

A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v34i01.5414 ◽ 2020 ◽ Vol 34 (01) ◽ pp. 719-726

Author(s): Ziqi Ke ◽ Haris Vikalo

Keyword(s): DNA Sequencing ◽ Single Individual ◽ Sequencing Data ◽ Genomic Component ◽ Source Codes ◽ Sequencing Errors ◽ Supplementary Document ◽ Viral Communities ◽ High Throughput DNA Sequencing ◽ Sequencing Platforms

Reconstructing components of a genomic mixture from data obtained by means of DNA sequencing is a challenging problem encountered in a variety of applications including single individual haplotyping and studies of viral communities. High-throughput DNA sequencing platforms oversample mixture components to provide massive amounts of reads whose relative positions can be determined by mapping the reads to a known reference genome; assembly of the components, however, requires discovery of the reads' origin – an NP-hard problem that the existing methods struggle to solve with the required level of accuracy. In this paper, we present a learning framework based on a graph auto-encoder designed to exploit structural properties of sequencing data. The algorithm is a neural network which essentially trains to ignore sequencing errors and infers the posterior probabilities of the origin of sequencing reads. Mixture components are then reconstructed by finding consensus of the reads determined to originate from the same genomic component. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework reliably assembles haplotypes and reconstructs viral communities, often significantly outperforming state-of-the-art techniques. Source code, datasets, and a supplementary document are available at https://github.com/WuLoli/GAEseq.
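
The final step described in the abstract, reconstructing each component as the consensus of the reads assigned to it, can be illustrated with a minimal Python sketch. This is not the GAEseq implementation: the function name `consensus_by_component`, the `GAP` marker for uncovered positions, and the toy reads are all hypothetical, and the read-to-component assignments are assumed to be given already (for instance, by taking the most probable origin of each read).

```python
from collections import Counter

GAP = "-"  # hypothetical marker for positions a read does not cover


def consensus_by_component(reads, assignments, num_components):
    """Return one majority-vote consensus string per component.

    reads: equal-length strings over the alleles plus GAP
    assignments: component index for each read (assumed already inferred)
    """
    if not reads:
        return [""] * num_components
    length = len(reads[0])
    # counts[k][p] tallies the alleles seen at position p among reads of component k
    counts = [[Counter() for _ in range(length)] for _ in range(num_components)]
    for read, comp in zip(reads, assignments):
        for pos, allele in enumerate(read):
            if allele != GAP:
                counts[comp][pos][allele] += 1
    result = []
    for comp_counts in counts:
        # Positions with no coverage stay as GAP in the consensus
        seq = [c.most_common(1)[0][0] if c else GAP for c in comp_counts]
        result.append("".join(seq))
    return result


# Toy data: the last read carries an error at position 0 ("A" instead of "T")
reads = ["AC-T", "ACG-", "-CGT", "TG-A", "TGC-", "AGCA"]
assignments = [0, 0, 0, 1, 1, 1]
haplotypes = consensus_by_component(reads, assignments, 2)
print(haplotypes)
```

In this toy input the erroneous allele is outvoted by the two correct reads covering the same position, which is the sense in which consensus tolerates sequencing errors once read origins are known.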

Enhancing the detection of barcoded reads in high throughput DNA sequencing data by controlling the false discovery rate

BMC Bioinformatics ◽ 10.1186/1471-2105-15-264 ◽ 2014 ◽ Vol 15 (1) ◽ Cited By ~ 8

Author(s): Tilo Buschmann ◽ Rong Zhang ◽ Douglas E Brash ◽ Leonid V Bystrykh

Keyword(s): DNA Sequencing ◽ False Discovery Rate ◽ High Throughput ◽ Sequencing Data ◽ False Discovery ◽ High Throughput DNA Sequencing

A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction

10.1101/837674 ◽ 2019

Author(s): Ziqi Ke ◽ Haris Vikalo

Keyword(s): DNA Sequencing ◽ Single Individual ◽ Sequencing Data ◽ Genomic Component ◽ Source Codes ◽ Sequencing Errors ◽ Learning Framework ◽ Viral Communities ◽ High Throughput DNA Sequencing ◽ Sequencing Platforms

Abstract: Reconstructing components of a genomic mixture from data obtained by means of DNA sequencing is a challenging problem encountered in a variety of applications including single individual haplotyping and studies of viral communities. High-throughput DNA sequencing platforms oversample mixture components to provide massive amounts of reads whose relative positions can be determined by mapping the reads to a known reference genome; assembly of the components, however, requires discovery of the reads’ origin – an NP-hard problem that the existing methods struggle to solve with the required level of accuracy. In this paper, we present a learning framework based on a graph auto-encoder designed to exploit structural properties of sequencing data. The algorithm is a neural network which essentially trains to ignore sequencing errors and infers the posterior probabilities of the origin of sequencing reads. Mixture components are then reconstructed by finding consensus of the reads determined to originate from the same genomic component. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework reliably assembles haplotypes and reconstructs viral communities, often significantly outperforming state-of-the-art techniques. Source code and datasets are publicly available at https://github.com/WuLoli/GAEseq.
