2-kupl: mapping-free variant detection from DNA-seq data of matched samples

Abstract. Background: The detection of genome variants, including point mutations, indels and structural variants, is a fundamental and challenging computational problem. We address here the problem of variant detection between two deep-sequencing (DNA-seq) samples, such as two human samples from an individual patient, or two samples from distinct bacterial strains. The preferred strategy in such a case is to align each sample to a common reference genome, collect all variants and compare these variants between samples. Such mapping-based protocols have several limitations. DNA sequences with large indels, aggregated mutations and structural variants are hard to map to the reference. Furthermore, DNA sequences cannot be mapped reliably to genomic low complexity regions and repeats. Results: We introduce 2-kupl, a k-mer based, mapping-free protocol to detect variants between two DNA-seq samples. On simulated and actual data, 2-kupl achieves higher accuracy than other mapping-free protocols. Applying 2-kupl to prostate cancer whole exome sequencing data, we identify a number of candidate variants in hard-to-map regions and propose potential novel recurrent variants in this disease. Conclusions: We developed a mapping-free protocol for variant calling between matched DNA-seq samples. Our protocol is suitable for variant detection in unmappable genome regions or in the absence of a reference genome.
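
The general idea of comparing two samples by k-mers, without a reference genome, can be illustrated with a minimal Python sketch. This is not the 2-kupl algorithm itself: the function names, the value of `k` and the `min_count` threshold are assumptions chosen for illustration, and the sketch ignores reverse-complement strands and sequencing errors beyond a simple count filter.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (length-k substring) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_specific_kmers(case_reads, control_reads, k=5, min_count=2):
    """Return k-mers seen at least min_count times in the case sample
    and never in the control sample (hypothetical thresholds)."""
    case = count_kmers(case_reads, k)
    control = count_kmers(control_reads, k)
    return {kmer for kmer, n in case.items()
            if n >= min_count and kmer not in control}


# Toy data: the case reads carry a single C>T substitution.
control_reads = ["ACGTACGTGA", "CGTACGTGAC"]
case_reads = ["ACGTATGTGA", "CGTATGTGAC", "ACGTATGTGA"]

specific = sample_specific_kmers(case_reads, control_reads, k=5, min_count=2)
print(sorted(specific))
```

In this toy case a single substitution yields exactly `k` case-specific k-mers, all overlapping the variant position, which is why such k-mers can in principle be grouped back into one variant signal.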

2-kupl: mapping-free variant detection from DNA-seq data of matched samples

10.1101/2021.01.17.427048 ◽

2021 ◽

Author(s):

Yunfeng Wang ◽

Haoliang Xue ◽

Christine Pourcel ◽

Yang Du ◽

Daniel Gautheret

Keyword(s):

Dna Sequences ◽

Point Mutations ◽

Low Complexity ◽

Structural Variants ◽

Bacterial Strains ◽

Common Reference ◽

Whole Exome ◽

Two Samples ◽

Large Indels ◽

Variant Detection

Abstract: The detection of genome variants, including point mutations, indels and structural variants, is a fundamental and challenging computational problem. We address here the problem of variant detection between two deep-sequencing (DNA-seq) samples, such as two human samples from an individual patient, or two samples from distinct bacterial strains. The preferred strategy in such a case is to align each sample to a common reference genome, collect all variants and compare these variants between samples. Such mapping-based protocols have several limitations. DNA sequences with large indels, aggregated mutations and structural variants are hard to map to the reference. Furthermore, DNA sequences cannot be mapped reliably to genomic low complexity regions and repeats. Herein, we introduce 2-kupl, a k-mer based, mapping-free protocol to detect variants between two DNA-seq samples. On simulated and actual data, 2-kupl achieves a higher precision than other mapping-free protocols. Applying 2-kupl to prostate cancer whole exome data, we identify a number of candidate variants in hard-to-map regions and propose potential novel recurrent variants in this disease.

Blacklisting variants common in private cohorts but not in public databases optimizes human exome analysis

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1808403116 ◽

2018 ◽

Vol 116 (3) ◽

pp. 950-959 ◽

Cited By ~ 16

Author(s):

Patrick Maffucci ◽

Benedetta Bigio ◽

Franck Rapaport ◽

Aurélie Cobat ◽

Alessandro Borghesi ◽

...

Keyword(s):

Reference Genome ◽

Low Complexity ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Human Patient ◽

Reference Genome Assembly ◽

Exome Analysis ◽

User Friendly ◽

Computational Analyses ◽

Generation Sequencing

Computational analyses of human patient exomes aim to filter out as many nonpathogenic genetic variants (NPVs) as possible, without removing the true disease-causing mutations. This involves comparing the patient’s exome with public databases to remove reported variants inconsistent with disease prevalence, mode of inheritance, or clinical penetrance. However, variants frequent in a given exome cohort, but absent or rare in public databases, have also been reported and treated as NPVs, without rigorous exploration. We report the generation of a blacklist of variants frequent within an in-house cohort of 3,104 exomes. This blacklist did not remove known pathogenic mutations from the exomes of 129 patients and decreased the number of NPVs remaining in the 3,104 individual exomes by a median of 62%. We validated this approach by testing three other independent cohorts of 400, 902, and 3,869 exomes. The blacklist generated from any given cohort removed a substantial proportion of NPVs (11–65%). We analyzed the blacklisted variants computationally and experimentally. Most of the blacklisted variants corresponded to false signals generated by incomplete reference genome assembly, location in low-complexity regions, bioinformatic misprocessing, or limitations inherent to cohort-specific private alleles (e.g., due to sequencing kits, and genetic ancestries). Finally, we provide our precalculated blacklists, together with ReFiNE, a program for generating customized blacklists from any medium-sized or large in-house cohort of exome (or other next-generation sequencing) data via a user-friendly public web server. This work demonstrates the power of extracting variant blacklists from private databases as a specific in-house but broadly applicable tool for optimizing exome analysis.
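
The blacklisting idea described above (variants frequent in a private cohort but absent or rare in public databases) can be sketched in a few lines of Python. This is an illustrative assumption about how such a filter might look, not the ReFiNE implementation: the function name, the variant identifiers and both frequency thresholds are hypothetical.

```python
from collections import Counter


def build_blacklist(cohort_exomes, public_freq,
                    cohort_min_freq=0.01, public_max_freq=0.001):
    """cohort_exomes: list of sets of variant IDs, one set per exome.
    public_freq: dict mapping variant ID -> frequency in a public database.
    Returns variants common in the cohort but absent or rare publicly.
    Both thresholds are hypothetical defaults."""
    n = len(cohort_exomes)
    carriers = Counter(v for exome in cohort_exomes for v in exome)
    return {v for v, c in carriers.items()
            if c / n >= cohort_min_freq
            and public_freq.get(v, 0.0) <= public_max_freq}


# Toy cohort of four exomes (made-up variant IDs).
cohort = [
    {"chr1:100A>G", "chr2:200C>T"},
    {"chr1:100A>G"},
    {"chr1:100A>G", "chr3:300G>A"},
    {"chr2:200C>T", "chr1:100A>G"},
]
public = {"chr2:200C>T": 0.2}  # common in the public database

blacklist = build_blacklist(cohort, public,
                            cohort_min_freq=0.5, public_max_freq=0.001)
filtered = [exome - blacklist for exome in cohort]
print(blacklist)
```

Here only the variant carried by every exome and missing from the public table is blacklisted; the publicly common variant and the private singleton are left for the usual filters.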

Aquila_stLFR: assembly based variant calling package for stLFR and hybrid assembly for linked-reads

10.1101/742239 ◽

2019 ◽

Cited By ~ 1

Author(s):

Xin Zhou ◽

Lu Zhang ◽

Xiaodong Fang ◽

Yichen Liu ◽

David L. Dill ◽

...

Keyword(s):

De Novo ◽

Low Cost ◽

Variant Calling ◽

Hybrid Assembly ◽

Structural Variants ◽

Sequencing Data ◽

Single Tube ◽

Large Numbers ◽

Key Characteristics ◽

Hybrid Assemblies

Abstract: Human diploid genome assembly enables identifying maternal and paternal genetic variations. Algorithms based on 10x linked-read sequencing have been developed for de novo assembly, variant calling and haplotyping. Another linked-read technology, single tube long fragment read (stLFR), has recently provided a low-cost single tube solution that can enable long fragment data. However, no existing software is available for human diploid assembly and variant calls. We develop Aquila stLFR to adapt to the key characteristics of stLFR. Aquila stLFR assembles near perfect diploid assembled contigs, and the assembly-based variant calling shows that Aquila stLFR detects large numbers of structural variants which were not easily spanned by Illumina short-reads. Furthermore, the hybrid assembly mode Aquila hybrid allows a hybrid assembly based on both stLFR and 10x linked-reads libraries, demonstrating that these two technologies can always be complementary to each other for assembly to improve contiguity and variant detection, regardless of the assembly quality of the library itself from a single sequencing technology. The overlapped structural variants (SVs) from two independent sequencing data sets of the same individual, and the SVs from hybrid assemblies, provide us with a high-confidence profile to study them. Availability: Source code and documentation are available on https://github.com/maiziex/Aquila_stLFR.

SvABA: Genome-wide detection of structural variants and indels by local assembly

10.1101/105080 ◽

2017 ◽

Cited By ~ 9

Author(s):

Jeremiah Wala ◽

Pratiti Bandopadhayay ◽

Noah Greenwald ◽

Ryan O’Rourke ◽

Ted Sharpe ◽

...

Keyword(s):

Variant Calling ◽

Accurate Method ◽

Structural Variants ◽

Sequencing Data ◽

Cancer Driver ◽

Insertion And Deletion ◽

Genome Wide ◽

Cancer Genomes ◽

Local Assembly ◽

Genomic Regions

Abstract: Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at-scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA’s performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs, and substantially improved detection performance for variants in the 20-300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (< 1,000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types, and found that templated-sequence insertions occur in ~4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized SVs.

Whisper: Read sorting allows robust mapping of sequencing data

10.1101/240358 ◽

2017 ◽

Author(s):

Sebastian Deorowicz ◽

Agnieszka Debudaj-Grabysz ◽

Adam Gudyś ◽

Szymon Grabowski

Keyword(s):

Reference Genome ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

Sequencing Data ◽

Suffix Arrays ◽

Link Type ◽

Mapping Tool ◽

Reverse Complement ◽

Comparable Accuracy

Abstract. Motivation: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. Mistakes made at this computationally challenging stage cannot be recovered easily. Results: We present Whisper, an accurate and high-performance mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk results in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known Bowtie2 and BWA-MEM tools at a comparable accuracy (validated in a variant calling pipeline). Availability: Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/. Contact: [email protected]. Supplementary information: Supplementary data are available at the publisher's Web site.
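
The principle of looking reads up in suffix arrays built for a reference and its reverse complement can be shown with a toy Python sketch. It is only an illustration under strong simplifying assumptions (exact matches only, a naive suffix-array construction, no parallelism or on-disk storage), and none of the function names come from Whisper.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (toy sizes only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, sa, pattern):
    """Binary-search the suffix array for exact occurrences of pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def map_reads(reference, reads):
    """Map sorted, de-duplicated reads to both strands of the reference."""
    rc = reverse_complement(reference)
    sa_fwd, sa_rc = build_suffix_array(reference), build_suffix_array(rc)
    hits = {}
    for read in sorted(set(reads)):
        fwd = [(p, "+") for p in find_exact(reference, sa_fwd, read)]
        # Convert reverse-complement coordinates back to forward coordinates.
        rev = [(len(reference) - p - len(read), "-")
               for p in find_exact(rc, sa_rc, read)]
        hits[read] = fwd + rev
    return hits


hits = map_reads("GATTACAGGC", ["TTAC", "TGTA", "CCCC"])
print(hits)
```

Sorting the reads before lookup means that similar reads are processed consecutively; in this sketch that only fixes the processing order, whereas a real mapper can exploit the ordering for efficiency.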

NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks

10.1101/2019.12.29.890418 ◽

2019 ◽

Cited By ~ 1

Author(s):

Umair Ahsan ◽

Qian Liu ◽

Li Fang ◽

Kai Wang

Keyword(s):

Deep Neural Network ◽

Deep Neural Networks ◽

Variant Calling ◽

Sequencing Data ◽

Long Reads ◽

Novel Variants ◽

Long Read ◽

Variant Detection ◽

Genomic Regions ◽

Haplotype Information

Abstract: Variant (SNPs/indels) detection from high-throughput sequencing data remains an important yet unresolved problem. Long-read sequencing enables variant detection in difficult-to-map genomic regions that short-read sequencing cannot reliably examine (for example, only ~80% of genomic regions are marked as “high-confidence region” to have SNP/indel calls in the Genome In A Bottle project); however, the high per-base error rate poses unique challenges in variant detection. Existing methods on long-read data typically rely on analyzing pileup information from neighboring bases surrounding a candidate variant, similar to short-read variant callers, yet the benefits of much longer read length are not fully exploited. Here we present a deep neural network called NanoCaller, which detects SNPs by examining pileup information solely from other nonadjacent candidate SNPs that share the same long reads using long-range haplotype information. Using the SNPs it has called, NanoCaller phases long reads and performs local realignment on two sets of phased reads to call indels by another deep neural network. Extensive evaluation on 5 human genomes (sequenced by Nanopore and PacBio long-read techniques) demonstrated that NanoCaller greatly improved performance in difficult-to-map regions, compared to other long-read variant callers. We experimentally validated 41 novel variants in difficult-to-map regions in a widely-used benchmarking genome, which could not be reliably detected previously. We extensively evaluated the run-time characteristics and the sensitivity of parameter settings of NanoCaller to different characteristics of sequencing data. Finally, we achieved the best performance in Nanopore-based variant calling from MHC regions in the PrecisionFDA Variant Calling Challenge on Difficult-to-Map Regions by ensemble calling. In summary, by incorporating haplotype information in deep neural networks, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing data.

Building a Chinese pan-genome of 486 individuals

Communications Biology ◽

10.1038/s42003-021-02556-6 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Qiuhui Li ◽

Shilin Tian ◽

Bin Yan ◽

Chi Man Liu ◽

Tak-Wah Lam ◽

...

Keyword(s):

Genome Sequence ◽

Dna Sequences ◽

Reference Genome ◽

Variant Calling ◽

Han Chinese ◽

Pan Genome ◽

The Common ◽

Reference Sequences ◽

Genomic Regions ◽

Common Sequence

Abstract: Pan-genome sequence analysis of human population ancestry is critical for expanding and better defining human genome sequence diversity. However, the amount of genetic variation missing from current human reference sequences is still unknown. Here, we used 486 deep-sequenced Han Chinese genomes to identify 276 Mbp of DNA sequences that, to our knowledge, are absent in the current human reference. We classified these sequences into individual-specific and common sequences, and propose that the common sequence size is uncapped with a growing population. The 46.646 Mbp common sequences obtained from the 486 individuals improved the accuracy of variant calling and mapping rate when added to the reference genome. We also analyzed the genomic positions of these common sequences and found that they came from genomic regions characterized by high mutation rate and low pathogenicity. Our study authenticates the Chinese pan-genome as representative of DNA sequences specific to the Han Chinese population missing from the GRCh38 reference genome and establishes the newly defined common sequences as candidates to supplement the current human reference.

LEVIATHAN: efficient discovery of large structural variants by leveraging long-range information from Linked-Reads data

10.1101/2021.03.25.437002 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Fabrice Legeai ◽

Claire Lemaitre

Keyword(s):

Long Range ◽

Model Organism ◽

Variant Calling ◽

Simulated Data ◽

Structural Variants ◽

Sequencing Data ◽

High Quality ◽

Short Reads ◽

Structural Variant ◽

Human Data

Linked-Reads technologies, popularized by 10x Genomics, combine the high quality and low cost of short-reads sequencing with long-range information by adding barcodes that tag reads originating from the same long DNA fragment. Thanks to their high quality and long-range information, such reads are particularly useful for various applications such as genome scaffolding and structural variant calling. As a result, multiple structural variant calling methods were developed within the last few years. However, these methods were mainly tested on human data, and do not run well on non-human organisms, for which reference genomes are highly fragmented, or sequencing data display high levels of heterozygosity. Moreover, even on human data, most tools still require large amounts of computing resources. We present LEVIATHAN, a new structural variant calling tool that aims to address these issues, and in particular to scale better and apply to a wide variety of organisms. Our method relies on a barcode index, which allows quick comparison of all possible pairs of regions in terms of the number of barcodes they share. Region pairs sharing a sufficient number of barcodes are then considered as potential structural variants, and complementary, classical short-read methods are applied to further refine the breakpoint coordinates. Our experiments on simulated data underline that our method compares well to the state-of-the-art, both in terms of recall and precision, and also in terms of resource consumption. Moreover, LEVIATHAN was successfully applied to a real dataset from a non-model organism, while all other tools either failed to run or required unreasonable amounts of resources. LEVIATHAN is implemented in C++, supported on Linux platforms, and available under AGPL-3.0 License at https://github.com/morispi/LEVIATHAN.
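
The barcode-index idea (recording which barcodes occur in each genomic region, then flagging region pairs that share many barcodes) can be sketched in Python as follows. This is a simplified illustration, not LEVIATHAN's C++ implementation: the function names, the fixed `region_size` binning and the `min_shared` threshold are assumptions, and the sketch compares all pairs naively without excluding adjacent regions, which share barcodes for trivial reasons.

```python
from collections import defaultdict
from itertools import combinations


def build_barcode_index(alignments, region_size):
    """alignments: iterable of (chrom, pos, barcode) tuples.
    Returns a dict mapping (chrom, region number) -> set of barcodes."""
    index = defaultdict(set)
    for chrom, pos, barcode in alignments:
        index[(chrom, pos // region_size)].add(barcode)
    return index


def candidate_region_pairs(index, min_shared):
    """Return (region_a, region_b, shared_count) for every pair of regions
    sharing at least min_shared barcodes (hypothetical threshold)."""
    pairs = []
    for a, b in combinations(sorted(index), 2):
        shared = len(index[a] & index[b])
        if shared >= min_shared:
            pairs.append((a, b, shared))
    return pairs


# Toy alignments: two distant chr1 regions share barcodes BC1 and BC2.
alignments = [
    ("chr1", 100, "BC1"), ("chr1", 150, "BC2"),
    ("chr1", 5200, "BC1"), ("chr1", 5300, "BC2"),
    ("chr1", 9100, "BC3"), ("chr2", 400, "BC3"),
]
index = build_barcode_index(alignments, region_size=1000)
pairs = candidate_region_pairs(index, min_shared=2)
print(pairs)
```

Candidate pairs found this way would still need breakpoint refinement with read-level evidence, as the abstract notes.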

Rapid methicillin resistance diversification in Staphylococcus epidermidis colonizing human neonates

Nature Communications ◽

10.1038/s41467-021-26392-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Manoshi S. Datta ◽

Idan Yelin ◽

Ori Hochwald ◽

Imad Kassis ◽

Liron Borenstein-Levin ◽

...

Keyword(s):

Staphylococcus Epidermidis ◽

Methicillin Resistance ◽

Point Mutations ◽

Gene Content ◽

Whole Genome Sequence ◽

Patient Specific ◽

Gene Gain ◽

Nucleotide Polymorphisms ◽

Structural Variants ◽

Bacterial Strains

Abstract: Early in life, infants are colonized with multiple bacterial strains whose differences in gene content can have important health consequences. Metagenomics-based approaches have revealed gene content differences between different strains co-colonizing newborns, but less is known about the rate, mechanism, and phenotypic consequences of gene content diversification within strains. Here, focusing on Staphylococcus epidermidis, we whole-genome sequence and phenotype more than 600 isolates from newborns. Within days of birth, infants are co-colonized with a highly personalized repertoire of S. epidermidis strains, which are spread across the newborn body. Comparing the genomes of multiple isolates of each strain, we find very little evidence of adaptive evolution via single-nucleotide polymorphisms. By contrast, we observe gene content differences even between otherwise genetically identical cells, including variation of the clinically important methicillin resistance gene, mecA, suggesting rapid gene gain and loss events at rates higher than point mutations. Mapping the genomic architecture of structural variants by long-read Nanopore sequencing, we find that deleted regions were always flanked by direct repeats, consistent with site-specific recombination. However, we find that even within a single genetic background, recombination occurs at multiple, often non-canonical repeats, leading to the rapid evolution of patient-specific diverse structural variants in the SCCmec island and to differences in antibiotic resistance.

Standard operating procedure for somatic variant refinement of tumor sequencing data

10.1101/266262 ◽

2018 ◽

Cited By ~ 1

Author(s):

Erica K. Barnell ◽

Peter Ronning ◽

Katie M. Campbell ◽

Kilannin Krysiak ◽

Benjamin J. Ainscough ◽

...

Keyword(s):

Massively Parallel Sequencing ◽

Variant Calling ◽

Standard Operating Procedure ◽

Sequencing Data ◽

Optimal Method ◽

Somatic Variant ◽

Standard Operating ◽

Variant Detection ◽

Manual Review

Abstract. Purpose: Manual review of aligned sequencing reads is required to develop a high-quality list of somatic variants from massively parallel sequencing (MPS) data. Despite widespread use in analyzing MPS data, there has been little attempt to describe methods for manual review, resulting in high inter- and intra-lab variability in somatic variant detection and characterization of tumors. Methods: Open source software was used to develop an optimal method for manual review setup. We also developed a systematic approach to visually inspect each variant during manual review. Results: We present a standard operating procedure for somatic variant refinement for use by manual reviewers. The approach is enhanced through representative examples of 4 different manual review categories that indicate a reviewer’s confidence in the somatic variant call and 19 annotation tags that contextualize commonly observed sequencing patterns during manual review. Representative examples provide detailed instructions on how to classify variants during manual review to rectify lack of confidence in automated somatic variant detection. Conclusion: Standardization of somatic variant refinement through systematization of manual review will improve the consistency and reproducibility of identifying true somatic variants after automated variant calling.
