Protocol for variant calling in SARS-CoV-2 enabling long indel detection v1 (protocols.io.btrunm6w)

Reliable detection of somatic variations is of critical importance in cancer research. Lancet is an accurate and sensitive somatic variant caller that detects SNVs and indels by jointly analyzing reads from tumor and matched normal samples using colored de Bruijn graphs. Extensive experimental comparison on synthetic and real whole-genome sequencing datasets demonstrates that Lancet has better accuracy, especially for indel detection, than widely used somatic callers such as MuTect, MuTect2, LoFreq, Strelka, and Strelka2. Lancet features a reliable variant scoring system, which is essential for variant prioritization, and detects low-frequency mutations without sacrificing the sensitivity to call longer insertions and deletions that its local assembly engine provides. In addition to genome-wide analysis, Lancet allows inspection of somatic variants in graph space, which augments the traditional read alignment visualization to help confirm a variant of interest. Lancet is available as an open-source program at https://github.com/nygenome/lancet.
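
The colored de Bruijn graph idea mentioned above can be illustrated with a toy sketch. This is not Lancet's implementation: the function names, the labels "tumor" and "normal", and the choice of k are all assumptions made for illustration, and a real caller adds local assembly, error pruning, and path scoring on top of this.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq from left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_colored_graph(tumor_reads, normal_reads, k):
    """Build a toy colored de Bruijn graph (hypothetical helper).

    Returns (colors, edges): colors maps each k-mer to the set of
    samples it was seen in, edges maps each k-mer to its successors.
    """
    colors = defaultdict(set)
    edges = defaultdict(set)
    for label, reads in (("tumor", tumor_reads), ("normal", normal_reads)):
        for read in reads:
            prev = None
            for km in kmers(read, k):
                colors[km].add(label)
                if prev is not None:
                    edges[prev].add(km)
                prev = km
    return colors, edges


def tumor_only_kmers(colors):
    """k-mers supported by tumor reads only: candidate somatic signal."""
    return sorted(km for km, seen in colors.items() if seen == {"tumor"})


# Toy example: the tumor read carries a T>G substitution at position 3.
colors, edges = build_colored_graph(["ACGGAC"], ["ACGTAC"], k=3)
candidates = tumor_only_kmers(colors)
```

In this toy case the tumor-only k-mers are exactly those that overlap the substituted base, which is the kind of branch a graph-based caller would go on to inspect.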

Machine learning-based detection of insertions and deletions in the human genome

10.1101/628222 ◽

2019 ◽

Author(s):

Charles Curnin ◽

Rachel L. Goldfeder ◽

Shruti Marwaha ◽

Devon Bonner ◽

Daryl Waggott ◽

...

Keyword(s):

Machine Learning ◽

Variant Calling ◽

Single Nucleotide ◽

Reading Frame ◽

Insertions And Deletions ◽

Indel Detection ◽

Novel Approach ◽

Indel Calling ◽

Benchmark Datasets ◽

The Impact

Abstract. Insertions and deletions (indels) make a critical contribution to human genetic variation. While indel calling has improved significantly, it lags dramatically in performance relative to single-nucleotide variant calling, something of particular concern for clinical genomics where larger scale disruption of the open reading frame can commonly cause disease. Here, we present a machine learning-based approach to the detection of indel breakpoints called Scotch. This novel approach improves sensitivity to larger variants dramatically by leveraging sequencing metrics and signatures of poor read alignment. We also introduce a meta-analytic indel caller, called Metal, that performs a “smart intersection” of Scotch and currently available tools to be maximally sensitive to large variants. We use new benchmark datasets and Sanger sequencing to compare Scotch and Metal to current gold standard indel callers, achieving unprecedented levels of precision and recall. We demonstrate the impact of these improvements by applying this tool to a cohort of patients with undiagnosed disease, generating plausible novel candidates in 21 out of 26 undiagnosed cases. We highlight the diagnosis of one patient with a 498-bp deletion in HNRNPA1 missed by traditional indel-detection tools.

Effect of lossy compression of quality scores on variant calling

10.1101/029843 ◽

2015 ◽

Cited By ~ 1

Author(s):

Idoia Ochoa ◽

Mikel Hernaez ◽

Rachel Goldfeder ◽

Tsachy Weissman ◽

Euan Ashley

Keyword(s):

Dna Sequencing ◽

Consensus Sequence ◽

Variant Calling ◽

Simulated Data ◽

Genomic Data ◽

Original Data ◽

Lossy Compression ◽

Sequencing Data ◽

Indel Detection ◽

The Cost

Recent advancements in sequencing technology have led to a drastic reduction in the cost of genome sequencing. This development has generated an unprecedented amount of genomic data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms to identify SNPs and INDELs from next-generation DNA sequencing data use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. We analyze several lossy compressors introduced recently in the literature. Specifically, we investigate how the output of the variant caller when using the original data (uncompressed) differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets such as the GIAB (Genome In A Bottle) consensus sequence for NA12878 and simulated data, we are able to analyze how accurate the output of the variant calling is, both for the original data and that previously lossily compressed. We show that lossy compression can significantly alleviate the storage while maintaining variant calling performance comparable to that with the uncompressed data. Further, in some cases lossy compression can lead to variant calling performance which is superior to that using the uncompressed file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors. 
The Supplementary Data can be found at http://web.stanford.edu/~iochoa/supplementEffectLossy.zip.
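
To make the idea of lossy quality-score compression concrete, here is a minimal sketch of one simple scheme, fixed binning of Phred scores. It is not one of the compressors evaluated in the paper: the bin boundaries, representative values, and function names are assumptions chosen for illustration, and Phred+33 encoding is assumed.

```python
# Hypothetical bins: (lowest score, highest score, representative score).
BINS = [
    (0, 1, 0),
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
    (40, 93, 40),
]


def bin_quality(q):
    """Map one Phred score to the representative value of its bin."""
    for low, high, representative in BINS:
        if low <= q <= high:
            return representative
    raise ValueError(f"Phred score out of range: {q}")


def compress_quality_string(qual, offset=33):
    """Replace every score in a FASTQ quality string by its bin value.

    The output uses at most len(BINS) distinct symbols, so a general
    purpose compressor can encode it in fewer bits than the original.
    """
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)


original = "I#5"  # Phred scores 40, 2, 20
binned = compress_quality_string(original)
```

The evaluation described above then amounts to running the same variant caller on the original and the binned qualities and comparing both call sets against a truth set.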

Performance Assessment of Variant Calling Pipelines using Human Whole Exome Sequencing and Simulated data

10.1101/359109 ◽

2018 ◽

Author(s):

Manojkumar Kumaran ◽

Umadevi Subramanian ◽

Bharanidharan Devarajan

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Reference Genome ◽

Variant Calling ◽

Simulated Data ◽

Variant Call ◽

Human Reference Genome ◽

Indel Detection ◽

Whole Exome ◽

Clinical Variants

Abstract. Whole exome sequencing (WES) is a time-consuming technology for identifying clinical variants, and it demands accurate variant calling tools. Currently available tools compromise accuracy when predicting specific types of variants. It is therefore important to identify the best combinations of aligner and variant caller for detecting SNVs and InDels separately. Moreover, many important aspects of InDel detection are overlooked when comparing the performance of tools. One such aspect is the detection of InDels with respect to base-pair length. To assess the performance of variant callers (especially for InDels) in combination with different aligners, 20 automated pipelines were developed and evaluated using the gold reference variant dataset (NA12878) from the Genome in a Bottle (GiaB) consortium for human whole exome sequencing. Additionally, simulated exome data from two human reference genome sequences (GRCh37 and GRCh38) were used to compare the performance of the pipelines. By analyzing various performance metrics, we observed that the BWA and Novoalign aligners performed better with the DeepVariant and SAMtools callers for detecting SNVs, and with DeepVariant and GATK for InDels. Altogether, DeepVariant with BWA and Novoalign performed best. Further, we showed that merging the top-performing pipelines improved the accuracy of the variant call set. Collectively, this study should help investigators improve sensitivity and accuracy in detecting specific variants.
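
The kind of comparison described above, scoring a call set against a truth set and breaking the result down by InDel length, can be sketched as follows. This is an illustrative simplification rather than the authors' pipeline: variants are matched by exact (chromosome, position, ref, alt) identity, whereas real benchmarking normalizes representations first, and the size bins and function names are assumptions.

```python
def size_bin(variant):
    """Assign a variant (chrom, pos, ref, alt) to a hypothetical size class."""
    _, _, ref, alt = variant
    length = abs(len(alt) - len(ref))
    if length == 0:
        return "SNV"
    return "indel 1-5 bp" if length <= 5 else "indel >5 bp"


def metrics(truth, calls):
    """Precision, recall and F1 from exact matching of two variant sets."""
    tp = len(truth & calls)
    fp = len(calls - truth)
    fn = len(truth - calls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


def metrics_by_size(truth, calls):
    """Compute the same metrics separately for each size class."""
    bins = {size_bin(v) for v in truth | calls}
    return {
        b: metrics({v for v in truth if size_bin(v) == b},
                   {v for v in calls if size_bin(v) == b})
        for b in bins
    }


truth = {("1", 100, "A", "AT"), ("1", 200, "ACGTACGT", "A"), ("1", 300, "G", "C")}
calls = {("1", 100, "A", "AT"), ("1", 300, "G", "C"), ("1", 400, "T", "TA")}
overall = metrics(truth, calls)
per_size = metrics_by_size(truth, calls)
```

In this toy example the overall figures hide the fact that the only long deletion in the truth set is missed entirely, which is why stratifying by length is informative.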

Improved indel detection in DNA and RNA via realignment with ABRA2

Bioinformatics ◽

10.1093/bioinformatics/btz033 ◽

2019 ◽

Vol 35 (17) ◽

pp. 2966-2973 ◽

Cited By ~ 12

Author(s):

Lisle E Mose ◽

Charles M Perou ◽

Joel S Parker

Keyword(s):

Variant Calling ◽

Substantial Improvement ◽

Supplementary Information ◽

Common Source ◽

Data Types ◽

Indel Detection ◽

Dna And Rna ◽

Wide Range ◽

Whole Genomes ◽

Variant Detection

Abstract. Motivation: Genomic variant detection from next-generation sequencing has become established as an extremely important component of research and clinical diagnoses in both cancer and Mendelian disorders. Insertions and deletions (indels) are a common source of variation and can frequently impact functionality, thus making their detection vitally important. While substantial effort has gone into detecting indels from DNA, there is still opportunity for improvement. Further, detection of indels from RNA-Seq data has largely been an afterthought and offers another critical area for variant detection. Results: We present here ABRA2, a redesign of the original ABRA implementation that offers support for realignment of both RNA and DNA short reads. The process results in improved accuracy and scalability including support for human whole genomes. Results demonstrate substantial improvement in indel detection for a variety of data types, including those that were not previously supported by ABRA. Further, ABRA2 results in broad improvements to variant calling accuracy across a wide range of post-processing workflows including whole genomes, targeted exomes and transcriptome sequencing. Availability and implementation: ABRA2 is implemented in a combination of Java and C/C++ and is freely available to all from: https://github.com/mozack/abra2. Supplementary information: Supplementary data are available at Bioinformatics online.

Whole Genome Sequencing Refines Knowledge on the Population Structure of Mycobacterium bovis from a Multi-Host Tuberculosis System

Microorganisms ◽

10.3390/microorganisms9081585 ◽

2021 ◽

Vol 9 (8) ◽

pp. 1585

Author(s):

Ana C. Reis ◽

Liliana C. M. Salvador ◽

Suelee Robbe-Austerman ◽

Rogério Tenreiro ◽

Ana Botelho ◽

...

Keyword(s):

Population Structure ◽

Whole Genome Sequencing ◽

Wild Boar ◽

Genome Sequencing ◽

Mycobacterium Bovis ◽

Red Deer ◽

Variable Number Tandem Repeat ◽

Variant Calling ◽

Whole Genome ◽

Network Analyses

Classical molecular analyses of Mycobacterium bovis based on spoligotyping and Variable Number Tandem Repeat (MIRU-VNTR) brought the first insights into the epidemiology of animal tuberculosis (TB) in Portugal, showing high genotypic diversity of circulating strains that mostly cluster within the European 2 clonal complex. Previous surveillance provided valuable information on the prevalence and spatial occurrence of TB and highlighted prevalent genotypes in areas where livestock and wild ungulates are sympatric. However, links at the wildlife–livestock interfaces were established mainly via classical genotype associations. Here, we apply whole genome sequencing (WGS) to cattle, red deer and wild boar isolates to reconstruct the M. bovis population structure in a multi-host, multi-region disease system and to explore links at a fine genomic scale between M. bovis from wildlife hosts and cattle. Whole genome sequences of 44 representative M. bovis isolates, obtained between 2003 and 2015 from three TB hotspots, were compared through single nucleotide polymorphism (SNP) variant calling analyses. Consistent with previous results combining classical genotyping with Bayesian population admixture modelling, SNP-based phylogenies support the branching of this M. bovis population into five genetic clades, three with apparent geographic specificities, as well as the establishment of an SNP catalogue specific to each clade, which may be explored in the future as phylogenetic markers. The core genome alignment of SNPs was integrated within a spatiotemporal metadata framework to further structure this M. bovis population by host species and TB hotspots, providing a baseline for network analyses in different epidemiological and disease control contexts. WGS of M. bovis isolates from Portugal is reported for the first time in this pilot study, refining the spatiotemporal context of TB at the wildlife–livestock interface and providing further support to the key role of red deer and wild boar on disease maintenance. The SNP diversity observed within this dataset supports the natural circulation of M. bovis for a long time period, as well as multiple introduction events of the pathogen in this Iberian multi-host system.

DEEPGEN™: A Novel Variant Calling Assay for Low Frequency Variants

Genes ◽

10.3390/genes12040507 ◽

2021 ◽

Vol 12 (4) ◽

pp. 507

Author(s):

Bernd Timo Hermann ◽

Sebastian Pfeil ◽

Nicole Groenke ◽

Samuel Schaible ◽

Robert Kunze ◽

...

Keyword(s):

Cancer Detection ◽

Genetic Variants ◽

Liquid Biopsy ◽

Hot Spot ◽

Treatment Success ◽

Low Frequency ◽

Variant Calling ◽

Subsequent Treatment ◽

Precision Oncology ◽

Orthogonal Comparison

Detection of genetic variants in clinically relevant genomic hot-spot regions has become a promising application of next-generation sequencing technology in precision oncology. Effective personalized diagnostics requires the detection of variants with often very low frequencies. This can be achieved by targeted, short-read sequencing that provides high sequencing depths. Although rare genetic variants can contain crucial information for early cancer detection and subsequent treatment success, an inevitable level of background noise usually limits the accuracy of low frequency variant calling assays. To address this challenge, we developed DEEPGEN™, a variant calling assay intended for the detection of low frequency variants within liquid biopsy samples. We processed reference samples with validated mutations of known frequencies (0%–0.5%) to determine DEEPGEN™’s performance and minimal input requirements. Our findings confirm DEEPGEN™’s effectiveness in discriminating between signal and noise down to 0.09% variant allele frequency and an LOD(90) at 0.18%. A superior sensitivity was also confirmed by orthogonal comparison to a commercially available liquid biopsy-based assay for cancer detection.
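
Why background noise limits low frequency calling can be shown with a small worked calculation. The sketch below is not the assay's method: it substitutes a plain binomial noise model, and the per-base error rate, read counts, and function names are assumptions chosen for illustration.

```python
import math


def vaf(alt_reads, depth):
    """Variant allele frequency: fraction of reads supporting the variant."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_reads / depth


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1.

    Terms are computed in log space so that large depths do not overflow.
    For extremely small tails the final subtraction loses precision.
    """
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    below = 0.0
    for i in range(k):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        below += math.exp(log_term)
    return max(0.0, 1.0 - below)


# Hypothetical site: 18 supporting reads at 20,000x depth, assumed
# background error rate of 1 in 10,000 per read at this position.
observed_vaf = vaf(18, 20000)
p_noise = binomial_tail(18, 20000, 1e-4)
```

Under these assumed numbers, noise alone would produce about two supporting reads on average, so 18 reads is unlikely to be background; at lower depth or higher error rates the same allele frequency becomes much harder to separate from noise.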

Increased yields of duplex sequencing data by a series of quality control tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab002 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Gundula Povysil ◽

Monika Heinzl ◽

Renato Salazar ◽

Nicholas Stoler ◽

Anton Nekrutenko ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Data Loss ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Consensus Sequences ◽

Sequencing Errors ◽

Data Output ◽

Reverse Strand ◽

Duplex Sequencing

Abstract. Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we also developed a tool that re-examines variant calls from raw reads and provides different summary data that categorize the confidence level of a variant call by a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
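
The family grouping and consensus steps described above, and the way reads get lost along the way, can be sketched in a few lines. This is a simplified illustration, not the authors' toolset: it assumes reads are already trimmed to equal length and oriented to the same reference strand, and the strand labels "ab"/"ba", thresholds, and function names are assumptions.

```python
from collections import Counter, defaultdict


def consensus(seqs, min_fraction=0.7):
    """Per-position majority vote; positions without a clear majority become N."""
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= min_fraction else "N")
    return "".join(out)


def duplex_consensus(reads, min_family_size=3):
    """Build single-strand (SSCS) and duplex (DCS) consensus sequences.

    reads: iterable of (tag, strand, sequence), strand being "ab" or "ba".
    Families smaller than min_family_size are dropped, which is one of
    the places where data are lost.
    """
    families = defaultdict(list)
    for tag, strand, seq in reads:
        families[(tag, strand)].append(seq)

    sscs = {key: consensus(seqs) for key, seqs in families.items()
            if len(seqs) >= min_family_size}

    dcs = {}
    for (tag, strand), seq in sscs.items():
        if strand == "ab" and (tag, "ba") in sscs:
            partner = sscs[(tag, "ba")]
            # Keep a base only where both strands agree.
            dcs[tag] = "".join(a if a == b else "N" for a, b in zip(seq, partner))
    return sscs, dcs


reads = [
    ("AAT", "ab", "ACGT"), ("AAT", "ab", "ACGT"), ("AAT", "ab", "ACGA"),
    ("AAT", "ba", "ACGT"), ("AAT", "ba", "ACGT"), ("AAT", "ba", "ACGT"),
    ("CCG", "ab", "ACGT"), ("CCG", "ab", "ACGT"),  # family too small
]
sscs, dcs = duplex_consensus(reads)
```

Here eight reads yield a single duplex consensus: the "CCG" family is discarded for being too small and has no partner strand, which mirrors the low DCS yield discussed in the abstract.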

Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases

Genetics in Medicine ◽

10.1038/s41436-020-01084-8 ◽

2021 ◽

Author(s):

Shilpa Nadimpalli Kobren ◽

Dustin Baldridge ◽

Matt Velinder ◽

Joel B. Krier ◽

...

Keyword(s):

Online Survey ◽

Variant Calling ◽

Theoretical Method ◽

The United States ◽

Genomic Sequencing ◽

Biomedical Data ◽

Sequencing Data ◽

Multimodal Data ◽

Undiagnosed Diseases ◽

Undiagnosed Diseases Network

Abstract. Purpose: Genomic sequencing has become an increasingly powerful and relevant tool to be leveraged for the discovery of genetic aberrations underlying rare, Mendelian conditions. Although the computational tools incorporated into diagnostic workflows for this task are continually evolving and improving, we nevertheless sought to investigate commonalities across sequencing processing workflows to reveal consensus and standard practice tools and highlight exploratory analyses where technical and theoretical method improvements would be most impactful. Methods: We collected details regarding the computational approaches used by a genetic testing laboratory and 11 clinical research sites in the United States participating in the Undiagnosed Diseases Network via meetings with bioinformaticians, online survey forms, and analyses of internal protocols. Results: We found that tools for processing genomic sequencing data can be grouped into four distinct categories. Whereas well-established practices exist for initial variant calling and quality control steps, there is substantial divergence across sites in later stages for variant prioritization and multimodal data integration, demonstrating a diversity of approaches for solving the most mysterious undiagnosed cases. Conclusion: The largest differences across diagnostic workflows suggest that advances in structural variant detection, noncoding variant interpretation, and integration of additional biomedical data may be especially promising for solving chronically undiagnosed cases.

Clinical-grade whole-genome sequencing and 3′ transcriptome analysis of colorectal cancer patients

Genome Medicine ◽

10.1186/s13073-021-00852-8 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Agata Stodolna ◽

Miao He ◽

Mahesh Vasipalli ◽

Zoya Kingsbury ◽

Jennifer Becq ◽

...

Keyword(s):

Colorectal Cancer ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Transcriptome Analysis ◽

Variant Calling ◽

Standard Of Care ◽

Genomic Variation ◽

Whole Genome ◽

Clinical Grade ◽

Pathway Gene

Abstract. Background: Clinical-grade whole-genome sequencing (cWGS) has the potential to become the standard of care within the clinic because of its breadth of coverage and lack of bias towards certain regions of the genome. Colorectal cancer presents a difficult treatment paradigm, with over 40% of patients presenting at diagnosis with metastatic disease. We hypothesised that cWGS coupled with 3′ transcriptome analysis would give new insights into colorectal cancer. Methods: Patients underwent PCR-free whole-genome sequencing and alignment and variant calling using a standardised pipeline to output SNVs, indels, SVs and CNAs. Additional insights into the mutational signatures and tumour biology were gained by the use of 3′ RNA-seq. Results: Fifty-four patients were studied in total. Driver analysis identified the Wnt pathway gene APC as the only consistently mutated driver in colorectal cancer. Alterations in the PI3K/mTOR pathways were seen as previously observed in CRC. Multiple private CNAs, SVs and gene fusions were unique to individual tumours. Approximately 30% of patients had a tumour mutational burden of > 10 mutations/Mb of DNA, suggesting suitability for immunotherapy. Conclusions: Clinical whole-genome sequencing offers a potential avenue for the identification of private genomic variation that may confer sensitivity to targeted agents and offer patients new options for targeted therapies.
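
The tumour mutational burden figure quoted above is a simple ratio, shown here as a worked example. The mutation counts and the callable genome size are made-up numbers, the function names are hypothetical, and which mutation types are counted and how the denominator is defined are assumptions that vary between pipelines; only the 10 mutations/Mb cut-off comes from the abstract.

```python
def tumour_mutational_burden(n_mutations, callable_bases):
    """Somatic mutations per megabase of callable sequence."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return n_mutations / (callable_bases / 1_000_000)


def exceeds_threshold(tmb, threshold=10.0):
    """True when TMB is strictly above the cut-off (> 10 mutations/Mb)."""
    return tmb > threshold


# Hypothetical tumours, each with 3 Gb of callable genome.
tmb_low = tumour_mutational_burden(30_000, 3_000_000_000)   # 10.0
tmb_high = tumour_mutational_burden(45_000, 3_000_000_000)  # 15.0
```

With a strict inequality, a tumour sitting exactly at 10 mutations/Mb does not pass the cut-off, so the choice of denominator can change the classification of borderline cases.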
