A multi-task convolutional deep neural network for variant calling in single molecule sequencing

2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Ruibang Luo ◽  
Fritz J. Sedlazeck ◽  
Tak-Wah Lam ◽  
Michael C. Schatz

Abstract
The accurate identification of DNA sequence variants is an important but challenging task in genomics. It is particularly difficult for single molecule sequencing, which has a per-nucleotide error rate of ~5%–15%. Meeting this demand, we developed Clairvoyante, a multi-task five-layer convolutional neural network model for predicting variant type (SNP or indel), zygosity, alternative allele, and indel length from aligned reads. For the well-characterized NA12878 human sample, Clairvoyante achieved 99.73%, 97.68%, and 95.36% precision on known variants, and 98.65%, 92.57%, and 87.26% F1-score for whole-genome analysis, using Illumina, PacBio, and Oxford Nanopore data, respectively. Training on a second human sample shows that Clairvoyante is sample-agnostic and finds variants in less than two hours on a standard server. Furthermore, we identified 3,135 variants that are missed using Illumina but supported independently by both PacBio and Oxford Nanopore reads. Clairvoyante is available open source (https://github.com/aquaskyline/Clairvoyante), with modules to train, utilize, and visualize the model.
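The abstract describes a multi-task network with four prediction heads: variant type, zygosity, alternative allele, and indel length. The sketch below shows only how such per-head probability outputs could be decoded into a single call; the head dimensions and category labels are illustrative assumptions, not Clairvoyante's exact output layout.

```python
import numpy as np

def decode_call(alt_base_p, zygosity_p, var_type_p, indel_len_p):
    """Decode four multi-task head outputs into one variant call.

    Each argument is a probability vector from one head; the label
    sets here are hypothetical stand-ins for the real model's classes.
    """
    bases = ["A", "C", "G", "T"]
    zygosities = ["het", "hom"]
    var_types = ["ref", "SNP", "insertion", "deletion"]
    return {
        "alt_base": bases[int(np.argmax(alt_base_p))],
        "zygosity": zygosities[int(np.argmax(zygosity_p))],
        "type": var_types[int(np.argmax(var_type_p))],
        "indel_len": int(np.argmax(indel_len_p)),  # 0 = no indel
    }

# A site where the heads agree on a heterozygous C SNP with no indel.
call = decode_call(
    alt_base_p=np.array([0.05, 0.90, 0.03, 0.02]),
    zygosity_p=np.array([0.80, 0.20]),
    var_type_p=np.array([0.05, 0.90, 0.03, 0.02]),
    indel_len_p=np.array([0.95, 0.02, 0.01, 0.01, 0.005, 0.005]),
)
```

Joint decoding across heads (rather than a single flat classification) is what lets one network answer several related questions about the same site at once.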


2019 ◽  
Author(s):  
Ruibang Luo ◽  
Chak-Lim Wong ◽  
Yat-Sing Wong ◽  
Chi-Ian Tang ◽  
Chi-Man Liu ◽  
...  

Abstract
Single-molecule sequencing technologies have emerged in recent years and revolutionized structural variant calling, complex genome assembly, and epigenetic mark detection. However, the lack of a highly accurate small variant caller has limited the new technologies from being more widely used. In this study, we present Clair, the successor to Clairvoyante, a program for fast and accurate germline small variant calling using single molecule sequencing data. For ONT data, Clair achieves the best precision, recall, and speed compared to several competing programs, including Clairvoyante, Longshot, and Medaka. By studying the missed variants and benchmarking intentionally overfitted models, we found that Clair may be approaching the limit of possible accuracy for germline small variant calling using pileup data and deep neural networks. Clair requires only a conventional CPU for variant calling and is an open-source project available at https://github.com/HKU-BAL/Clair.
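The abstract's key input representation is pileup data: per-position summaries of the bases observed in overlapping reads. The toy function below illustrates that idea only; the candidate-site threshold `min_af` and the feature layout are assumptions for illustration, not Clair's actual features.

```python
from collections import Counter

def pileup_counts(column_bases, min_af=0.25):
    """Summarize one reference position from bases in overlapping reads.

    A toy stand-in for the pileup summaries a caller like Clair feeds
    to its network: base counts, depth, and candidate alleles whose
    allele fraction exceeds a (hypothetical) threshold.
    """
    counts = Counter(column_bases)
    depth = sum(counts.values())
    candidates = {b: n / depth for b, n in counts.items()
                  if n / depth >= min_af}
    return depth, counts, candidates

# Ten reads cover this position; 7 support A, 3 support G, so both
# alleles pass the 25% allele-fraction cut (a plausible het site).
depth, counts, candidates = pileup_counts("AAAGAAGGAA")
```

Collapsing reads to per-column counts is what makes pileup-based calling cheap enough to run on a conventional CPU, at the cost of discarding read-level haplotype information.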


2021 ◽  
Author(s):  
Pei Wu ◽  
Chao Liu ◽  
Ou Wang ◽  
Xia Zhao ◽  
Fang Chen ◽  
...  

Abstract
In this paper, we report AsmMix, a pipeline capable of producing contiguous, high-quality diploid genome assemblies. The pipeline consists of two steps. In the first step, two sets of assemblies are generated: one based on co-barcoded reads, which is highly accurate and haplotype-resolved but contains many gaps; the other based on single-molecule sequencing reads, which is contiguous but error-prone. In the second step, the two sets of assemblies are compared and integrated into a haplotype-resolved assembly with fewer errors. We test our pipeline on a dataset for the human genome NA24385, perform variant calling from the resulting assemblies, and compare the calls against the GIAB benchmark. We show that the AsmMix pipeline produces highly contiguous, accurate, and haplotype-resolved assemblies. In particular, the assembly-mixing step effectively reduces small-scale errors in the long-read assembly.
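The second step described above integrates an accurate but gapped assembly into an error-prone but contiguous one. The sketch below shows the core patching idea on toy strings, assuming alignment coordinates are already known; a real pipeline such as AsmMix would compute them with a whole-genome aligner and handle length-changing edits.

```python
def patch_assembly(long_read_seq, accurate_seq, start):
    """Replace the span of an error-prone long-read contig that an
    accurate co-barcoded contig aligns to.

    `start` is the (assumed, precomputed) alignment offset of the
    accurate contig on the long-read contig.
    """
    end = start + len(accurate_seq)
    return long_read_seq[:start] + accurate_seq + long_read_seq[end:]

draft = "ACGTACGTTTACGT"   # toy long-read contig containing an error
patch = "ACGA"             # accurate co-barcoded contig aligned at offset 4
fixed = patch_assembly(draft, patch, 4)
```

Because the accurate contigs are haplotype-resolved, patching with them corrects small-scale errors without collapsing the two haplotypes.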


2020 ◽  
Vol 2 (4) ◽  
pp. 220-227 ◽  
Author(s):  
Ruibang Luo ◽  
Chak-Lim Wong ◽  
Yat-Sing Wong ◽  
Chi-Ian Tang ◽  
Chi-Man Liu ◽  
...  

2021 ◽  
Author(s):  
Toshimitsu Aritake ◽  
Hideitsu Hino ◽  
Shigeyuki Namiki ◽  
Daisuke Asanuma ◽  
Kenzo Hirose ◽  
...  

2018 ◽  
Author(s):  
Ou Wang ◽  
Robert Chin ◽  
Xiaofang Cheng ◽  
Michelle Ka Wu ◽  
Qing Mao ◽  
...  

Obtaining accurate sequences from long DNA molecules is very important for genome assembly and other applications. Here we describe single tube long fragment read (stLFR), a technology that enables this at low cost. It is based on adding the same barcode sequence to sub-fragments of the original long DNA molecule (DNA co-barcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique co-barcoding of over 8 million 20–300 kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high-quality variant calling and phasing into contigs with an N50 of up to 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high-quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.
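The "~3.6 billion unique barcodes" figure is consistent with a combinatorial split-pool design of three rounds with 1536 barcode segments each (1536³ ≈ 3.6 × 10⁹); the round and segment counts below are an assumption used only to make the arithmetic concrete. A birthday-problem estimate then shows why 50 million barcodes per sample is "practically non-redundant".

```python
# Combinatorial barcode space: three assumed split-pool rounds of
# 1536 segments each reproduce the abstract's ~3.6 billion figure.
rounds, segments = 3, 1536
space = segments ** rounds                    # 3,623,878,656 barcodes

# Birthday-problem estimate: probability that a given fragment shares
# its barcode with at least one of the other n - 1 fragments when
# 50 million barcodes are drawn from the full space.
n = 50_000_000
p_collision = 1 - (1 - 1 / space) ** (n - 1)  # ~1.4% per fragment
```

With only ~1–2% of fragments expected to share a barcode, co-barcoded read groups can be treated as coming from a single long molecule with high confidence.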


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Anand Ramachandran ◽  
Steven S. Lumetta ◽  
Eric W. Klee ◽  
Deming Chen

Abstract
Background: Modern next-generation and third-generation sequencing methods, such as the Illumina and PacBio Circular Consensus Sequencing platforms, provide accurate sequencing data. Parallel developments in deep learning have enabled the application of deep neural networks to variant calling, surpassing the accuracy of classical approaches in many settings. DeepVariant, arguably the most popular among such methods, transforms the problem of variant calling into one of image recognition, where a deep neural network analyzes sequencing data formatted as images, achieving high accuracy. In this paper, we explore an alternative approach to designing deep neural networks for variant calling: we use meticulously designed network architectures and customized variant inference functions that account for the underlying nature of sequencing data, instead of converting the problem to one of image recognition.
Results: Results from 27 whole-genome variant calling experiments spanning Illumina, PacBio, and hybrid Illumina-PacBio settings suggest that our method allows vastly smaller deep neural networks to outperform the Inception-v3 architecture used in DeepVariant for indel and substitution-type variant calls. For example, our method reduces the number of indel call errors by up to 18%, 55%, and 65% for Illumina, PacBio, and hybrid Illumina-PacBio variant calling, respectively, compared to a similarly trained DeepVariant pipeline. In these cases, our models are between 7 and 14 times smaller.
Conclusions: We believe that the improved accuracy and problem-specific customization of our models will enable more accurate pipelines and further method development in the field. HELLO is available at https://github.com/anands-repo/hello.
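The headline numbers above are relative reductions, so the absolute effect depends on the baseline error count. The snippet below just makes that arithmetic explicit; only the percentages come from the abstract, and the baseline count is a made-up placeholder.

```python
# Relative indel-error reductions reported per platform setting.
reductions = {"Illumina": 0.18, "PacBio": 0.55, "Hybrid": 0.65}

def remaining_errors(baseline, reduction):
    """Errors left after applying a stated relative reduction."""
    return baseline * (1 - reduction)

# With a hypothetical baseline of 1,000 indel errors, a 55% reduction
# (PacBio setting) leaves 450 errors; an 18% reduction leaves 820.
pacbio_left = remaining_errors(1000, reductions["PacBio"])
illumina_left = remaining_errors(1000, reductions["Illumina"])
```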


2019 ◽  
Vol 117 (1) ◽  
pp. 60-67 ◽  
Author(s):  
Leonhard Möckl ◽  
Anish R. Roy ◽  
Petar N. Petrov ◽  
W. E. Moerner

Background fluorescence, especially when it exhibits undesired spatial features, is a primary factor for reduced image quality in optical microscopy. Structured background is particularly detrimental when analyzing single-molecule images for 3-dimensional localization microscopy or single-molecule tracking. Here, we introduce BGnet, a deep neural network with a U-net-type architecture, as a general method to rapidly estimate the background underlying the image of a point source with excellent accuracy, even when point-spread function (PSF) engineering is in use to create complex PSF shapes. We trained BGnet to extract the background from images of various PSFs and show that the identification is accurate for a wide range of different interfering background structures constructed from many spatial frequencies. Furthermore, we demonstrate that the obtained background-corrected PSF images, for both simulated and experimental data, lead to a substantial improvement in localization precision. Finally, we verify that structured background estimation with BGnet results in higher quality of superresolution reconstructions of biological structures.
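The core operation BGnet enables is estimate-then-subtract: predict the structured background under a point-source image and remove it before localization. In the toy sketch below, a simple box blur plays the role of the learned U-net background estimator (an assumption for illustration only); the PSF and sinusoidal background are synthetic.

```python
import numpy as np

def box_blur(img, k=5):
    """Crude smooth-background estimate via a k-by-k box average,
    standing in for BGnet's learned U-net estimator."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

# Synthetic single-molecule image: a Gaussian PSF on a structured
# (sinusoidal) background.
yy, xx = np.mgrid[0:32, 0:32]
psf = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / 4.0)  # point source
background = 0.5 + 0.3 * np.sin(xx / 5.0)               # structured bg
image = psf + background

# Estimate and subtract the background; the PSF peak dominates again.
corrected = image - box_blur(image, k=7)
peak = np.unravel_index(np.argmax(corrected), corrected.shape)
```

A learned estimator improves on this blur stand-in precisely because it can separate the engineered PSF shape from background structure at similar spatial frequencies.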

