UNMASC: tumor-only variant calling with unmatched normal controls

NAR Cancer ◽  
2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Paul Little ◽  
Heejoon Jo ◽  
Alan Hoyle ◽  
Angela Mazul ◽  
Xiaobei Zhao ◽  
...  

Abstract Despite years of progress, mutation detection in cancer samples continues to require significant manual review as a final step. Expert review is particularly challenging in cases where tumors are sequenced without matched normal control DNA. Attempts have been made to call somatic point mutations without a matched normal sample by removing well-known germline variants, utilizing unmatched normal controls, and constructing decision rules to classify sequencing errors and private germline variants. With budgetary constraints related to computational and sequencing costs, finding the appropriate number of controls is a crucial step in identifying somatic variants. Our approach utilizes public databases for canonical somatic variants as well as germline variants and leverages information gathered about nearby positions in the normal controls. Our cohort of targeted capture panel sequencing of tumor and normal samples, spanning varying tumor types and demographics, served as a benchmark for our tumor-only variant calling pipeline, allowing us to observe how our ability to correctly classify variants changes with the number of unmatched normals. With our benchmarked samples, approximately ten normal controls were needed to maintain 94% sensitivity, 99% specificity and 76% positive predictive value, far outperforming comparable methods. Our approach, called UNMASC, also serves as a supplement to traditional tumor-with-matched-normal variant calling workflows and can potentially extend to other concerns arising from the analysis of next-generation sequencing data.
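The general tumor-only filtering idea described above can be sketched in a few lines; the rule set below (germline-database removal, recurrence across unmatched normals, rescue of catalogued somatic hotspots) and its threshold are illustrative assumptions, not UNMASC's actual decision rules.

```python
# Minimal sketch of tumor-only filtering with unmatched normal controls.
# Thresholds, database contents, and the rescue rule are illustrative
# assumptions, not UNMASC's decision rules.

def classify_candidate(variant, normal_panel, germline_db, somatic_db,
                       max_normal_hits=1):
    """Return 'somatic', 'germline', or 'artifact' for a candidate variant.

    variant      -- (chrom, pos, ref, alt) tuple
    normal_panel -- list of sets, each holding variants seen in one
                    unmatched normal control
    germline_db  -- set of well-known germline variants (population database)
    somatic_db   -- set of canonical somatic hotspots (cancer database)
    """
    # Canonical somatic hotspots are kept even if seen in a control,
    # since hotspot positions can also be noisy in normals.
    if variant in somatic_db:
        return "somatic"

    # Known population variants are treated as private germline calls.
    if variant in germline_db:
        return "germline"

    # Recurrence across unmatched normals flags position-specific
    # artifacts or common germline variation missing from the database.
    hits = sum(variant in normal for normal in normal_panel)
    if hits > max_normal_hits:
        return "artifact"

    return "somatic"
```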

2018 ◽  
Author(s):  
Rebecca F. Halperin ◽  
Winnie S. Liang ◽  
Sidharth Kulkarni ◽  
Erica E. Tassone ◽  
Jonathan Adkins ◽  
...  

Abstract Archival tumor samples represent a potentially rich resource of annotated specimens for translational genomics research. However, standard variant calling approaches require a matched normal sample from the same individual, which is often not available in the retrospective setting, making it difficult to distinguish between true somatic variants and germline variants that are private to the individual. Archival sections often contain adjacent normal tissue, but this normal tissue can include infiltrating tumor cells. Comparative somatic variant callers are designed to exclude variants present in the normal sample, so a novel approach is required to leverage sequencing of adjacent normal tissue for somatic variant calling. Here we present LumosVar 2.0, a software package designed to jointly analyze multiple samples from the same patient. The approach is based on the concept that the allelic fraction of somatic variants, but not germline variants, would be reduced in samples with low tumor content. LumosVar 2.0 estimates allele-specific copy number and tumor sample fractions from the data, and uses this model to determine expected allelic fractions for somatic and germline variants and classify variants accordingly. To evaluate the use of LumosVar 2.0 for jointly calling somatic variants from tumor and adjacent normal samples, we used a glioblastoma dataset with matched high tumor content, low tumor content, and germline exome sequencing data (to define true somatic variants) available for each patient. We show that both sensitivity and positive predictive value are improved by analyzing the high-tumor and low-tumor samples jointly compared to analyzing the samples individually or to in-silico pooling of the two samples. Finally, we applied this approach to a set of breast and prostate archival tumor samples for which normal samples were not available for germline sequencing, but tumor blocks containing adjacent normal tissue were available for sequencing. Joint analysis using LumosVar 2.0 detected several variants, including known cancer hotspot mutations, that were not detected by standard somatic variant calling tools using the adjacent normal as a reference. Together, these results demonstrate the potential utility of leveraging paired tissue samples to improve somatic variant calling when a constitutional DNA sample is not available.
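The core intuition, that a heterozygous germline variant sits near 50% allele fraction regardless of tumor content while a somatic variant's expected fraction scales with tumor fraction, can be illustrated with a toy binomial comparison. The sketch below assumes a diploid genome with one mutant copy and ignores the allele-specific copy number modeling that LumosVar 2.0 actually performs.

```python
# Simplified contrast between germline and somatic expected allele
# fractions across samples with different tumor content. Assumes a diploid
# genome and a single mutant copy; allele-specific copy number is ignored.
from math import lgamma, log

def log_binom(k, n, p):
    """Log binomial likelihood of k alt reads out of n at allele fraction p."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def classify(alt_counts, depths, tumor_fractions):
    """Per-sample alt read counts, depths, and tumor fractions
    (e.g. a high-tumor block and an adjacent low-tumor 'normal')."""
    ll_germline = sum(log_binom(k, n, 0.5)            # het germline: ~50%
                      for k, n in zip(alt_counts, depths))
    ll_somatic = sum(log_binom(k, n, t / 2.0)         # one mutant copy of two
                     for k, n, t in zip(alt_counts, depths, tumor_fractions))
    return "somatic" if ll_somatic > ll_germline else "germline"

# A variant at ~35% in a 70%-tumor sample but ~5% in a 10%-tumor
# adjacent-normal sample is classified as somatic:
print(classify([35, 5], [100, 100], [0.7, 0.1]))
```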


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Gundula Povysil ◽  
Monika Heinzl ◽  
Renato Salazar ◽  
Nicholas Stoler ◽  
Anton Nekrutenko ◽  
...  

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition with the purpose of understanding data loss and implementing modifications to maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we also developed a tool that re-examines variant calls from raw reads and provides summary data that categorize the confidence level of a variant call through a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
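As a rough illustration of the family/consensus step that precedes such an analysis, the sketch below groups reads by (tag, strand), builds a per-strand majority consensus, and emits a duplex consensus only where both strands agree; the family-size threshold, consensus rules, and omitted strand re-orientation are simplifying assumptions.

```python
# Minimal sketch of duplex consensus building. Real pipelines operate on
# aligned reads with base qualities and re-orient the reverse strand;
# here reads are plain strings and the consensus rules are simplified.
from collections import Counter, defaultdict

def strand_consensus(reads, min_family_size=3):
    """Majority base per position across reads sharing a tag and strand."""
    if len(reads) < min_family_size:
        return None                      # family too small -> data loss
    length = min(len(r) for r in reads)
    return "".join(Counter(r[i] for r in reads).most_common(1)[0][0]
                   for i in range(length))

def duplex_consensus(families):
    """families maps (tag, strand) -> list of read sequences."""
    by_tag = defaultdict(dict)
    for (tag, strand), reads in families.items():
        by_tag[tag][strand] = strand_consensus(reads)
    duplexes = {}
    for tag, strands in by_tag.items():
        fwd, rev = strands.get("+"), strands.get("-")
        if fwd and rev:
            # Keep a position only where both strands agree ('N' otherwise).
            duplexes[tag] = "".join(a if a == b else "N"
                                    for a, b in zip(fwd, rev))
    return duplexes
```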


2020 ◽  
Vol 2 (Supplement_4) ◽  
pp. iv3-iv14
Author(s):  
Niha Beig ◽  
Kaustav Bera ◽  
Pallavi Tiwari

Abstract Neuro-oncology largely consists of malignancies of the brain and central nervous system, including both primary and metastatic tumors. Currently, a significant clinical challenge in neuro-oncology is to tailor therapies for patients based on a priori knowledge of their survival outcome or treatment response to conventional or experimental therapies. Radiomics, or the quantitative extraction of subvisual data from conventional radiographic imaging, has recently emerged as a powerful data-driven approach to offer insights into clinically relevant questions related to diagnosis, prediction, prognosis, as well as assessment of treatment response. Furthermore, radiogenomic approaches provide a mechanism to establish statistical correlations of radiomic features with point mutations and next-generation sequencing data, further leveraging the potential of routine MRI scans to serve as “virtual biopsy” maps. In this review, we provide an introduction to radiomic and radiogenomic approaches in neuro-oncology, including a brief description of the workflow involving preprocessing, tumor segmentation, and extraction of “hand-crafted” features from the segmented region of interest, as well as identifying radiogenomic associations that could ultimately lead to the development of reliable prognostic and predictive models in neuro-oncology applications. Lastly, we discuss the promise of radiomic and radiogenomic approaches in personalizing treatment decisions in neuro-oncology, as well as the challenges of clinical adoption, which will rely heavily on their demonstrated resilience to nonstandardization in imaging protocols across sites and scanners, as well as on their ability to demonstrate reproducibility across large multi-institutional cohorts.
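As a toy illustration of the “hand-crafted” feature-extraction step, the sketch below computes a few first-order intensity statistics over a segmented region of interest; real radiomic pipelines additionally extract shape, texture, and filtered-image features after careful preprocessing, none of which is attempted here.

```python
# Toy extraction of first-order radiomic features from a segmented ROI.
# Real pipelines add shape, texture, and filtered-image features after
# careful intensity normalization; this is only an illustrative sketch.
import numpy as np

def first_order_features(image, mask, bins=64):
    """image: 2D/3D intensity array; mask: boolean array marking the ROI."""
    roi = image[mask]
    counts, _ = np.histogram(roi, bins=bins)
    p = counts[counts > 0] / roi.size           # bin probabilities
    return {
        "mean": float(roi.mean()),
        "variance": float(roi.var()),
        "skewness": float(((roi - roi.mean()) ** 3).mean() / roi.std() ** 3),
        "entropy": float(-(p * np.log2(p)).sum()),
    }

# Example on a synthetic image with a square "tumor" region:
img = np.random.default_rng(0).normal(100, 15, size=(64, 64))
mask = np.zeros_like(img, dtype=bool)
mask[20:40, 20:40] = True
print(first_order_features(img, mask))
```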


2018 ◽  
Author(s):  
Daniel P Cooke ◽  
David C Wedge ◽  
Gerton Lunter

Haplotype-based variant callers, which consider physical linkage between variant sites, are currently among the best tools for germline variation discovery and genotyping from short-read sequencing data. However, almost all such tools were designed specifically for detecting common germline variation in diploid populations, and give sub-optimal results in other scenarios. Here we present Octopus, a versatile haplotype-based variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. We show that Octopus accurately calls de novo mutations in parent-offspring trios and germline variants in individuals, including SNVs, indels, and small complex replacements such as microinversions. In addition, using a carefully designed synthetic-tumour data set derived from clean sequencing data from a sample with known germline haplotypes and from mutations observed in a large cohort of tumour samples, we show that Octopus accurately characterizes germline and somatic variation in tumours, both with and without a paired normal sample. Sequencing reads and prior information are combined to phase called genotypes of arbitrary ploidy, including those with somatic mutations. Octopus also outputs realigned evidence BAMs to aid validation and interpretation.
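Octopus's haplotype-aware model is far richer than can be shown here, but the Bayesian genotyping backbone can be illustrated with per-site diploid genotype posteriors computed from base calls and qualities; the genotype priors and error model in the sketch below are illustrative assumptions, not Octopus's.

```python
# Per-site diploid genotype posteriors from a pileup of base calls and
# Phred qualities, as a stand-in for a (much richer) haplotype-based
# Bayesian model. Genotype priors and the error model are assumptions.
from math import exp, log

def genotype_posteriors(bases, quals, ref="A", alt="C"):
    priors = {"ref/ref": 0.998, "ref/alt": 0.001, "alt/alt": 0.001}
    alleles = {"ref/ref": (ref, ref), "ref/alt": (ref, alt), "alt/alt": (alt, alt)}
    log_post = {}
    for gt, (a1, a2) in alleles.items():
        ll = log(priors[gt])
        for b, q in zip(bases, quals):
            e = 10 ** (-q / 10.0)                 # Phred error probability
            p1 = (1 - e) if b == a1 else e / 3    # read drawn from haplotype 1
            p2 = (1 - e) if b == a2 else e / 3    # ... or from haplotype 2
            ll += log(0.5 * p1 + 0.5 * p2)
        log_post[gt] = ll
    m = max(log_post.values())
    weights = {gt: exp(v - m) for gt, v in log_post.items()}
    total = sum(weights.values())
    return {gt: w / total for gt, w in weights.items()}

# Twelve reads at one site: seven reference 'A' and five alternate 'C'
# calls, all at Q30, strongly support a heterozygous genotype.
print(genotype_posteriors("AAAAAAACCCCC", [30] * 12))
```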


2019 ◽  
Vol 3 (4) ◽  
pp. 399-409 ◽  
Author(s):  
Brandon Jew ◽  
Jae Hoon Sul

Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.
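As a small illustration of the QC stage, the sketch below masks genotypes with low depth or genotype quality and fails a variant whose call rate drops too far; the thresholds are common illustrative defaults rather than recommendations from the review.

```python
# Illustrative post-calling QC filters. Thresholds (depth, genotype
# quality, call rate) are common defaults, not prescriptions.
def qc_filter_variant(genotype_records, min_depth=10, min_gq=20,
                      min_call_rate=0.95):
    """genotype_records: list of dicts with 'GT', 'DP', 'GQ' per individual.
    Returns (passes, filtered_genotypes): low-quality genotypes are set to
    missing ('./.') and the variant fails if too many are missing."""
    filtered = []
    for rec in genotype_records:
        if rec["DP"] < min_depth or rec["GQ"] < min_gq:
            filtered.append("./.")          # mask unreliable genotype
        else:
            filtered.append(rec["GT"])
    call_rate = sum(gt != "./." for gt in filtered) / len(filtered)
    return call_rate >= min_call_rate, filtered

samples = [{"GT": "0/1", "DP": 35, "GQ": 99},
           {"GT": "0/0", "DP": 4,  "GQ": 12},   # masked: low depth/quality
           {"GT": "1/1", "DP": 28, "GQ": 87}]
print(qc_filter_variant(samples))
```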


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Aranka Steyaert ◽  
Pieter Audenaert ◽  
Jan Fostier

Abstract Background De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. k+1-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and the presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. Results To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. Conclusions We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
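The CRF itself is beyond a short example, but the problem it solves can be illustrated: a naive estimator rounds each node's or arc's coverage against the average coverage of single-copy k-mers in isolation, and coverage variability then produces assignments that violate flow conservation between a node and its incident arcs, which a joint model would repair. The estimator and toy coverages below are illustrative, not Detox's implementation.

```python
# Naive node-multiplicity estimation for a de Bruijn graph, plus a simple
# conservation-of-flow check as a stand-in for joint (CRF-based)
# assignment: a node's multiplicity should match the summed multiplicities
# of its incoming arcs. Coverage values are illustrative.
def naive_multiplicity(coverage, mean_unique_coverage):
    """Round coverage against the average coverage of single-copy k-mers
    (a multiplicity of 0 would indicate a sequencing error)."""
    return round(coverage / mean_unique_coverage)

def flow_inconsistencies(node_cov, in_arc_cov, mean_cov):
    """Flag nodes whose naive multiplicity disagrees with the summed
    naive multiplicities of their incoming arcs."""
    bad = []
    for node, cov in node_cov.items():
        arcs = in_arc_cov.get(node, [])
        if not arcs:
            continue
        m_node = naive_multiplicity(cov, mean_cov)
        m_in = sum(naive_multiplicity(c, mean_cov) for c in arcs)
        if m_node != m_in:
            bad.append((node, m_node, m_in))
    return bad

# A two-copy repeat node whose coverage dipped to 41 looks single-copy in
# isolation, yet its two incoming arcs (29x and 33x) imply multiplicity 2.
node_cov = {"ACG": 31, "CGT": 41, "GTA": 30}
in_arc_cov = {"CGT": [29, 33]}
print(flow_inconsistencies(node_cov, in_arc_cov, mean_cov=30.0))
```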


2017 ◽  
Author(s):  
Jade C.S. Chung ◽  
Swaine L. Chen

Abstract Next-generation sequencing data are accompanied by quality scores that quantify sequencing error. Inaccuracies in these quality scores propagate through all subsequent analyses; thus base quality score recalibration is a standard step in many next-generation sequencing workflows, resulting in improved variant calls. Current base quality score recalibration algorithms rely on the assumption that sequencing errors are already known; for human resequencing data, relatively complete variant databases facilitate this. However, because existing databases are still incomplete, recalibration remains inaccurate; moreover, most organisms do not have variant databases, exacerbating the inaccuracy for non-human data. To overcome these logical and practical problems, we introduce Lacer, which recalibrates base quality scores without assuming knowledge of correct and incorrect bases and without requiring knowledge of common variants. Lacer is the first logically sound, fully general, and truly accurate base recalibrator. Lacer enhances variant identification accuracy for resequencing data of human as well as other organisms (which are not accessible to current recalibrators), simultaneously improving and extending the benefits of base quality score recalibration to nearly all ongoing sequencing projects. Lacer is available at: https://github.com/swainechen/lacer.
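For context, a bare-bones empirical recalibration table of the kind standard recalibrators build is sketched below: bases are binned by reported quality and the empirical Phred score is computed from mismatches at sites assumed to be non-variant. That assumption is exactly what Lacer avoids, so the sketch illustrates the conventional approach rather than Lacer's method.

```python
# Bare-bones base quality recalibration table: bin bases by reported
# quality, count mismatches against the reference at sites assumed to be
# non-variant, and report the empirical Phred quality per bin.
from collections import defaultdict
from math import log10

def recalibration_table(observations):
    """observations: iterable of (reported_quality, is_mismatch) pairs."""
    totals, mismatches = defaultdict(int), defaultdict(int)
    for q, is_mismatch in observations:
        totals[q] += 1
        mismatches[q] += int(is_mismatch)
    table = {}
    for q in sorted(totals):
        err = (mismatches[q] + 1) / (totals[q] + 2)     # pseudocounts
        table[q] = round(-10 * log10(err), 1)           # empirical quality
    return table

# Reported Q30 bases that actually mismatch ~1% of the time recalibrate
# to roughly Q20:
obs = [(30, i % 100 == 0) for i in range(10000)]
print(recalibration_table(obs))
```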


2021 ◽  
Author(s):  
Gelana Khazeeva ◽  
Karolis Sablauskas ◽  
Bart van der Sanden ◽  
Wouter Steyaert ◽  
Michael Kwint ◽  
...  

De novo mutations (DNMs) are an important cause of genetic disorders. The accurate identification of DNMs from sequencing data is therefore fundamental to rare disease research and diagnostics. Unfortunately, identifying reliable DNMs remains a major challenge due to sequence errors, uneven coverage, and mapping artifacts. Here, we developed a deep convolutional neural network (CNN) DNM caller, DeNovoCNN, which encodes alignments of sequencing reads for a trio as 160×164 resolution images. DeNovoCNN was trained on DNMs from whole-exome sequencing (WES) of 2003 trios, achieving on average 99.2% recall and 93.8% precision. We find that DeNovoCNN has increased recall/sensitivity and precision compared to existing de novo calling approaches (GATK, DeNovoGear, Samtools) based on the Genome in a Bottle reference dataset. Sanger validations of DNMs called in both exome and genome datasets confirm that DeNovoCNN outperforms existing methods. Most importantly, we show that DeNovoCNN is robust against different exome sequencing and analysis approaches, thereby allowing it to be applied to other datasets. DeNovoCNN is freely available and can be run on existing alignment (BAM/CRAM) and variant calling (VCF) files from WES and WGS without the need to re-call variants.
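A minimal network of this general shape is sketched below; only the 160×164 input size comes from the abstract, while the three-channel (child/father/mother) encoding, layer sizes, and single de novo probability output are assumptions made for the illustration.

```python
# Illustrative CNN for trio-based de novo calling on 160x164 pileup images.
# The three-channel encoding, layer sizes, and binary output are
# assumptions for the sketch; only the input resolution is taken from the
# abstract.
import torch
import torch.nn as nn

class TrioDNMNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # -> (N, 32, 1, 1)
        )
        self.classifier = nn.Linear(32, 1)    # P(de novo) via sigmoid

    def forward(self, x):                     # x: (N, 3, 160, 164)
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.classifier(h))

model = TrioDNMNet()
batch = torch.randn(4, 3, 160, 164)           # 4 candidate sites
print(model(batch).shape)                     # torch.Size([4, 1])
```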


2018 ◽  
Author(s):  
Tamsen Dunn ◽  
Gwenn Berry ◽  
Dorothea Emig-Agius ◽  
Yu Jiang ◽  
Serena Lei ◽  
...  

Abstract Motivation Next-Generation Sequencing (NGS) technology is transitioning quickly from research labs to clinical settings. The diagnosis and treatment selection for many acquired and autosomal conditions necessitate a method for accurately detecting somatic and germline variants, suitable for the clinic. Results We have developed Pisces, a rapid, versatile and accurate small variant calling suite designed for somatic and germline amplicon sequencing applications. Pisces accuracy is achieved by four distinct modules: the Pisces Read Stitcher, the Pisces Variant Caller, the Pisces Variant Quality Recalibrator, and the Pisces Variant Phaser. Each module incorporates a number of novel algorithmic strategies aimed at reducing noise or increasing the likelihood of detecting a true variant. Availability Pisces is distributed under an open source license and can be downloaded from https://github.com/Illumina/Pisces. Pisces is available on the BaseSpace™ SequenceHub as part of the TruSeq Amplicon workflow and the Illumina Ampliseq Workflow. Pisces is distributed on Illumina sequencing platforms such as the MiSeq™, and is included in the Praxis™ Extended RAS Panel test, which was recently approved by the FDA for the detection of multiple RAS gene mutations. Contact [email protected] Supplementary information Supplementary data are available online.
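As an illustration of what a read-stitching step does, the sketch below merges two overlapping mates into one consensus read, taking the higher-quality base where they disagree; the explicit overlap argument and quality arithmetic are simplifications, not the Pisces Read Stitcher's logic.

```python
# Sketch of stitching two overlapping mates into one consensus read,
# taking the higher-quality base where the mates disagree. The overlap is
# given explicitly here; a real stitcher derives it from the alignment.
def stitch(read1, qual1, read2, qual2, overlap):
    """read2 begins `overlap` bases before the end of read1."""
    start2 = len(read1) - overlap
    seq, qual = list(read1[:start2]), list(qual1[:start2])
    for i in range(overlap):
        b1, q1 = read1[start2 + i], qual1[start2 + i]
        b2, q2 = read2[i], qual2[i]
        if b1 == b2:
            seq.append(b1)
            qual.append(q1 + q2)            # agreement boosts confidence
        else:
            seq.append(b1 if q1 >= q2 else b2)
            qual.append(abs(q1 - q2))       # disagreement lowers it
    seq += list(read2[overlap:])
    qual += list(qual2[overlap:])
    return "".join(seq), qual

# Two 8 bp mates overlapping by 4 bases stitch into one 12 bp read.
print(stitch("ACGTACGT", [30] * 8, "ACGTTTGA", [30] * 8, overlap=4))
```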


2017 ◽  
Author(s):  
Merly Escalona ◽  
Sara Rocha ◽  
David Posada

Abstract Motivation Advances in sequencing technologies have made it feasible to obtain massive datasets for phylogenomic inference, often consisting of large numbers of loci from multiple species and individuals. The phylogenomic analysis of next-generation sequencing (NGS) data implies a complex computational pipeline in which multiple technical and methodological decisions that can influence the final tree are necessary, such as those related to coverage, assembly, mapping, variant calling and/or phasing. Results To assess the influence of these variables we introduce NGSphy, an open-source tool for the simulation of Illumina reads/read counts obtained from haploid/diploid individual genomes with thousands of independent gene families evolving under a common species tree. In order to resemble real NGS experiments, NGSphy includes multiple options to model sequencing coverage (depth) heterogeneity across species, individuals and loci, including off-target or uncaptured loci. For comprehensive simulations covering multiple evolutionary scenarios, parameter values for the different replicates can be sampled from user-defined statistical distributions. Availability Source code, full documentation and tutorials including a quick start guide are available at http://github.com/merlyescalona/ngsphy. Contact [email protected], [email protected]
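A toy version of that layered coverage heterogeneity is sketched below: an experiment-wide depth is modulated by per-individual and per-locus multipliers drawn from chosen distributions, with some loci dropped as off-target; the distributions and parameters are illustrative assumptions rather than NGSphy defaults.

```python
# Toy layered coverage heterogeneity: an experiment-wide depth is modulated
# by per-individual and per-locus multipliers drawn from chosen
# distributions, and some loci are dropped as off-target. Distribution
# choices and parameters are illustrative assumptions.
import random

def expected_depths(individuals, loci, base_depth=100,
                    off_target_rate=0.1, seed=1):
    rng = random.Random(seed)
    ind_factor = {i: rng.gammavariate(10, 0.1) for i in individuals}   # mean 1
    loc_factor = {l: rng.gammavariate(5, 0.2) for l in loci}           # mean 1
    depths = {}
    for i in individuals:
        for l in loci:
            if rng.random() < off_target_rate:
                depths[(i, l)] = 0                      # uncaptured locus
            else:
                depths[(i, l)] = base_depth * ind_factor[i] * loc_factor[l]
    return depths

print(expected_depths(["ind1", "ind2"], ["locus1", "locus2", "locus3"]))
```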

