Accurate Reference-Free Somatic Variant-Calling by Integrating Genomic, Sequencing and Population Data

10.1101/383703 ◽

2018 ◽

Author(s):

Ren X. Sun ◽

Christopher M. Lalansingh ◽

Shadrielle Melijah G. Espiritu ◽

Cindy Q. Yao ◽

Takafumi N. Yamaguchi ◽

...

Keyword(s):

Sequence Data ◽

Human Cancer ◽

Variant Calling ◽

Population Data ◽

Single Nucleotide Variants ◽

Reference Tissue ◽

Sequencing Technologies ◽

Distant Tissue ◽

Tumor Types ◽

Reference Samples

Abstract The detection of somatic single nucleotide variants (SNVs) is critical in both research and clinical applications. Studies of human cancer typically use matched normal (reference) samples from a distant tissue to increase SNV prediction accuracy. This process both doubles sequencing costs and poses challenges when reference samples are not readily available, such as for many cell-lines. To address these challenges, we created S22S: an approach for the prediction of somatic mutations without need for matched reference tissue. S22S takes underlying sequence data, augments them with genomic background context and population frequency information, and classifies SNVs as somatic or non-somatic. We validated S22S using primary tumor/normal pairs from four tumor types, spanning two different sequencing technologies. S22S robustly identifies somatic SNVs, with the area under the precision recall curve reaching 0.97 in kidney clear cell carcinoma, comparable to the best tumor/normal analysis pipelines. S22S is freely available at http://labs.oicr.on.ca/Boutros-lab/software/s22s.
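
The abstract reports an area under the precision-recall curve without saying how it was estimated. As a rough illustration only, the sketch below computes average precision, one common step-wise estimator of that area, from per-variant labels and classifier scores. The function name, the toy labels and the toy scores are assumptions made for this example and do not come from S22S.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for a truly somatic SNV, 0 for a non-somatic one.
    scores: classifier confidence that each SNV is somatic.
    Tied scores are ranked in input order; this sketch does not treat ties specially.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank variants from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            area += true_pos / rank  # precision at the point where recall increases
    return area / total_pos


# Hypothetical toy data: three somatic SNVs, two non-somatic ones.
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.1]
print(round(average_precision(toy_labels, toy_scores), 3))
```

A value close to 1 means that nearly all high-scoring calls are truly somatic across the whole range of recall, which is how a figure such as 0.97 would be read.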

Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks

10.1101/2021.03.04.433952 ◽

2021 ◽

Author(s):

Kishwar Shafin ◽

Trevor Pesout ◽

Pi-Chuan Chang ◽

Maria Nattestad ◽

Alexey Kolesnikov ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

High Accuracy ◽

Superior Performance ◽

Read Length ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Short Read ◽

Long Read

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data has demonstrated a long read length, but current interpretation methods for its novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single nucleotide variant identification method at the whole-genome scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with performance superior to the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio-HiFi-polished).
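
The Q35+ and Q40+ figures are Phred-scaled quality values, where Q = -10 * log10(p) and p is the per-base error probability. The short sketch below converts between the two scales to show what those thresholds mean in errors per base. The function names are illustrative and are not part of PEPPER-Margin-DeepVariant.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred-scaled quality value Q to a per-base error probability."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


for q in (35, 40):
    p = phred_to_error_rate(q)
    print(f"Q{q}: about one error per {round(1 / p):,} bases")
```

Each step of 10 on the Phred scale is a tenfold drop in error rate, so a Q40 assembly has roughly a third of the per-base errors of a Q35 assembly.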

Comprehensive fundamental somatic variant calling and quality management strategies for human cancer genomes

Briefings in Bioinformatics ◽

10.1093/bib/bbaa083 ◽

2020 ◽

Author(s):

Xiaoyu He ◽

Shanyu Chen ◽

Ruilin Li ◽

Xinyin Han ◽

Zhipeng He ◽

...

Keyword(s):

Genome Sequencing ◽

High Throughput Sequencing ◽

Cancer Genomics ◽

Sequence Data ◽

Human Cancer ◽

Management Strategies ◽

Variant Calling ◽

Cancer Genome ◽

Sequencing Data ◽

Cancer Genome Sequencing

Abstract Next-generation sequencing (NGS) technology has revolutionised human cancer research, particularly via detection of genomic variants with its ultra-high-throughput sequencing and increasing affordability. However, the inundation of rich cancer genomics data has resulted in significant challenges in its exploration and translation into biological insights. One of the difficulties in cancer genome sequencing is software selection. Currently, multiple tools are widely used to process NGS data in four stages: raw sequence data pre-processing and quality control (QC), sequence alignment, variant calling and annotation and visualisation. However, the differences between these NGS tools, including their installation, merits, drawbacks and application, have not been fully appreciated. Therefore, a systematic review of the functionality and performance of NGS tools is required to provide cancer researchers with guidance on software and strategy selection. Another challenge is the multidimensional QC of sequencing data because QC can not only report varied sequence data characteristics but also reveal deviations in diverse features and is essential for a meaningful and successful study. However, monitoring of QC metrics in specific steps including alignment and variant calling is neglected in certain pipelines such as the ‘Best Practices Workflows’ in GATK. In this review, we investigated the most widely used software for the fundamental analysis and QC of cancer genome sequencing data and provided instructions for selecting the most appropriate software and pipelines to ensure precise and efficient conclusions. We further discussed the prospects and new research directions for cancer genomics.

Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing

Nature Communications ◽

10.1038/s41467-019-12493-y ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 26

Author(s):

Peter Edge ◽

Vikas Bansal

Keyword(s):

Single Molecule ◽

Variant Calling ◽

Small Scale ◽

Whole Genome ◽

Limited Information ◽

Single Nucleotide Variants ◽

Pacific Biosciences ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read

Abstract Whole-genome sequencing using sequencing technologies such as Illumina enables the accurate detection of small-scale variants but provides limited information about haplotypes and variants in repetitive regions of the human genome. Single-molecule sequencing (SMS) technologies such as Pacific Biosciences and Oxford Nanopore generate long reads that can potentially address the limitations of short-read sequencing. However, the high error rate of SMS reads makes it challenging to detect small-scale variants in diploid genomes. We introduce a variant calling method, Longshot, which leverages the haplotype information present in SMS reads to accurately detect and phase single-nucleotide variants (SNVs) in diploid genomes. We demonstrate that Longshot achieves very high accuracy for SNV detection using whole-genome Pacific Biosciences data, outperforms existing variant calling methods, and enables variant detection in duplicated regions of the genome that cannot be mapped using short reads.

Assessing Bos taurus introgression in the UOA Bos indicus assembly

Genetics Selection Evolution ◽

10.1186/s12711-021-00688-1 ◽

2021 ◽

Vol 53 (1) ◽

Author(s):

Maulana M. Naji ◽

Yuri T. Utsunomiya ◽

Johann Sölkner ◽

Benjamin D. Rosen ◽

Gábor Mészáros

Keyword(s):

Bos Taurus ◽

Sequence Data ◽

Variant Calling ◽

Principal Component ◽

Reference Sequence ◽

Sequencing Analysis ◽

Single Nucleotide Variants ◽

Reference Allele ◽

Brahman Cattle ◽

Reference Genomes

Abstract Background Reference genomes are essential in the analysis of genomic data. As the cost of sequencing decreases, multiple reference genomes are being produced within species to alleviate problems such as low mapping accuracy and reference allele bias in variant calling that can be associated with the alignment of divergent samples to a single reference individual. The latest reference sequence adopted by the scientific community for the analysis of cattle data is ARS_UCD1.2, built from the DNA of a Hereford cow (Bos taurus taurus—B. taurus). A complementary genome assembly, UOA_Brahman_1, was recently built to represent the other cattle subspecies (Bos taurus indicus—B. indicus) from a Brahman cow haplotype to further support analysis of B. indicus data. In this study, we aligned the sequence data of 15 B. taurus and B. indicus breeds to each of these references. Results The alignment of B. taurus individuals against UOA_Brahman_1 detected up to five million more single-nucleotide variants (SNVs) compared to that against ARS_UCD1.2. Similarly, the alignment of B. indicus individuals against ARS_UCD1.2 resulted in one and a half million more SNVs than that against UOA_Brahman_1. The number of SNVs with nearly fixed alternative alleles also increased in the alignments with cross-subspecies. Interestingly, the alignment of B. taurus cattle against UOA_Brahman_1 revealed regions with a smaller than expected number of counts of SNVs with nearly fixed alternative alleles. Since B. taurus introgression represents on average 10% of the genome of Brahman cattle, we suggest that these regions comprise taurine DNA as opposed to indicine DNA in the UOA_Brahman_1 reference genome. Principal component and admixture analyses using genotypes inferred from this region support these taurine-introgressed loci. Overall, the flagged taurine segments represent 13.7% of the UOA_Brahman_1 assembly. 
The genes located within these segments were previously reported to be under positive selection in Brahman cattle, and include functional candidate genes implicated in feed efficiency, development and immunity. Conclusions We report a list of taurine segments that are in the UOA_Brahman_1 assembly, which will be useful for the interpretation of interesting genomic features (e.g., signatures of selection, runs of homozygosity, increased mutation rate, etc.) that could appear in future re-sequencing analysis of indicine cattle.

Longshot: accurate variant calling in diploid genomes using single-molecule long read sequencing

10.1101/564443 ◽

2019 ◽

Cited By ~ 1

Author(s):

Peter Edge ◽

Vikas Bansal

Keyword(s):

Single Molecule ◽

Variant Calling ◽

Whole Genome ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Short Reads ◽

Pacific Biosciences ◽

Sequencing Technologies ◽

Accurate Detection ◽

Oxford Nanopore

Abstract Short-read sequencing technologies such as Illumina enable the accurate detection of single nucleotide variants (SNVs) and short insertion/deletion variants in human genomes but are unable to provide information about haplotypes and variants in repetitive regions of the genome. Single-molecule sequencing technologies such as Pacific Biosciences and Oxford Nanopore generate long reads (≥ 10 kb in length) that can potentially address these limitations of short reads. However, the high error rate of SMS reads makes it challenging to detect small-scale variants in diploid genomes. We introduce a variant calling method, Longshot, that leverages the haplotype information present in SMS reads to enable the accurate detection and phasing of single nucleotide variants in diploid genomes. Using whole-genome Pacific Biosciences data for multiple human individuals, we demonstrate that Longshot achieves very high accuracy for SNV detection (precision ≥0.992 and recall ≥0.96) that is significantly better than existing variant calling methods. Longshot can also call SNVs with good accuracy using whole-genome Oxford Nanopore data. Finally, we demonstrate that it enables the discovery of variants in duplicated regions of the genome that cannot be mapped using short reads. Longshot is freely available at https://github.com/pjedge/longshot.
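
Precision and recall for a call set are usually computed against a truth set of known variants. The sketch below shows the arithmetic on sets of (chromosome, position, ref, alt) tuples. The variant records are invented for illustration, and the exact-match comparison is a simplifying assumption: it is not how Longshot was benchmarked, and real comparisons also need variant normalization and confident-region filtering.

```python
def precision_recall(called, truth):
    """Return (precision, recall) for a call set against a truth set.

    Variants are hashable records such as (chrom, pos, ref, alt), and a call
    counts as a true positive only if it matches a truth record exactly.
    """
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)
    precision = true_pos / len(called) if called else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data, not taken from any real benchmark.
truth_set = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 1500, "G", "A"),
    ("chr2", 3000, "T", "C"),
}
call_set = {
    ("chr1", 1000, "A", "G"),  # true positive
    ("chr2", 1500, "G", "A"),  # true positive
    ("chr3", 500, "A", "T"),   # false positive
}
precision, recall = precision_recall(call_set, truth_set)
print(f"precision={precision:.3f} recall={recall:.3f}")
```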

A reference dataset of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree

10.1101/055541 ◽

2016 ◽

Cited By ~ 18

Author(s):

Michael A. Eberle ◽

Epameinondas Fritzilas ◽

Peter Krusche ◽

Morten Källberg ◽

Benjamin L. Moore ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Objective Assessment ◽

Variant Calling ◽

Whole Genome Sequence ◽

Reference Dataset ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Genome Wide ◽

Transmission Information

Abstract Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalogue of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of seventeen individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased “platinum” variant catalogue of 4.7 million single nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and eleven children of this pedigree. Platinum genotypes are highly concordant with the current catalogue of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%), and add a validated truth catalogue that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission (“non-platinum”) revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.
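
The idea of validating calls by inheritance can be illustrated with a much simpler test than the one the abstract describes. The sketch below checks whether a child's genotype at a single autosomal site can be formed by taking one allele from each parent. This per-site trio check is a deliberate simplification and an assumption of this example: the study used haplotype transmission across the whole pedigree, which is stricter. The function name and genotype encoding are illustrative.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Check one autosomal diploid site in a parent-child trio.

    Each genotype is a pair of allele codes, e.g. (0, 1) for a heterozygote
    with 0 = reference and 1 = alternate. Allele order within a pair is ignored.
    """
    child_sorted = tuple(sorted(child))
    # The child must carry one allele from the mother and one from the father.
    return any(
        tuple(sorted((m_allele, f_allele))) == child_sorted
        for m_allele, f_allele in product(mother, father)
    )


print(is_mendelian_consistent((0, 1), (0, 0), (1, 1)))  # heterozygous child of two homozygotes
print(is_mendelian_consistent((1, 1), (0, 0), (0, 1)))  # child carries an allele the mother lacks
```

A site that fails this test points to either a genotyping error or a genuine de novo event, which matches the two explanations the abstract gives for its "non-platinum" variants.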

Unsuspected somatic mosaicism for FBN1 gene contributes to Marfan syndrome

Genetics in Medicine ◽

10.1038/s41436-020-01078-6 ◽

2021 ◽

Author(s):

Pauline Arnaud ◽

Hélène Morel ◽

Olivier Milleron ◽

Laurent Gouya ◽

Christine Francannet ◽

...

Keyword(s):

Marfan Syndrome ◽

Somatic Mosaicism ◽

Variant Calling ◽

Copy Number Variations ◽

Pathogenic Variant ◽

Single Nucleotide Variants ◽

Bioinformatics Analyses ◽

Single Nucleotide ◽

Fbn1 Gene ◽

Pathogenic Variants

Abstract Purpose Individuals with mosaic pathogenic variants in the FBN1 gene are mainly described in the course of familial screening. In the literature, almost all these mosaic individuals are asymptomatic. In this study, we report the experience of our team on more than 5,000 Marfan syndrome (MFS) probands. Methods Next-generation sequencing (NGS) capture technology allowed us to identify five cases of MFS probands who harbored a mosaic pathogenic variant in the FBN1 gene. Results These five sporadic mosaic probands displayed classical features usually seen in Marfan syndrome. Combined with the results of the literature, these rare findings concerned both single-nucleotide variants and copy-number variations. Conclusion This underestimated finding should not be overlooked in the molecular diagnosis of MFS patients and warrants an adaptation of the parameters used in bioinformatics analyses. The five present cases of symptomatic MFS probands harboring a mosaic FBN1 pathogenic variant reinforce the fact that apparently asymptomatic mosaic parents should have a complete clinical examination and a regular cardiovascular follow-up. We advise that individuals with a typical MFS for whom no single-nucleotide pathogenic variant or exon deletion/duplication was identified should be tested by NGS capture panel with an adapted variant calling analysis.

Whole genome resequencing and custom genotyping unveil clonal lineages in ‘Malbec’ grapevines (Vitis vinifera L.)

Scientific Reports ◽

10.1038/s41598-021-87445-y ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Luciano Calderón ◽

Nuria Mauri ◽

Claudio Muñoz ◽

Pablo Carbonell-Bejerano ◽

Laura Bree ◽

...

Keyword(s):

Genetic Diversity ◽

Somatic Mutations ◽

Clonal Propagation ◽

Variant Calling ◽

Vitis Vinifera L ◽

Whole Genome ◽

Single Nucleotide Variants ◽

Genome Resequencing ◽

Diversity Pattern ◽

Whole Genome Resequencing

Abstract Grapevine cultivars are clonally propagated to preserve their varietal attributes. However, genetic variations accumulate due to the occurrence of somatic mutations. This process is anthropically influenced through plant transportation, clonal propagation and selection. Malbec is a cultivar that is well-appreciated for the elaboration of red wine. It originated in Southwestern France and was introduced in Argentina during the 1850s. In order to study the clonal genetic diversity of Malbec grapevines, we generated whole-genome resequencing data for four accessions with different clonal propagation records. A stringent variant calling procedure was established to identify reliable polymorphisms among the analyzed accessions. The latter procedure retrieved 941 single nucleotide variants (SNVs). A reduced set of the detected SNVs was corroborated through Sanger sequencing, and employed to custom-design a genotyping experiment. We successfully genotyped 214 Malbec accessions using 41 SNVs, and identified 14 genotypes that clustered in two genetically divergent clonal lineages. These lineages were associated with the time span of clonal propagation of the analyzed accessions in Argentina and Europe. Our results show the usefulness of this approach for the study of the scarce intra-cultivar genetic diversity in grapevines. We also provide evidence on how human actions might have driven the accumulation of different somatic mutations, ultimately shaping the Malbec genetic diversity pattern.

scSNV: accurate dscRNA-seq SNV co-expression analysis using duplicate tag collapsing

Genome Biology ◽

10.1186/s13059-021-02364-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Gavin W. Wilson ◽

Mathieu Derouet ◽

Gail E. Darling ◽

Jonathan C. Yeung

Keyword(s):

Genetic Variants ◽

False Positive ◽

Variant Calling ◽

Call Rate ◽

Rna Seq ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Variant Call ◽

Two Samples ◽

Co Detection

Abstract Identifying single nucleotide variants has become common practice for droplet-based single-cell RNA-seq experiments; however, presently, a pipeline does not exist to maximize variant calling accuracy. Furthermore, molecular duplicates generated in these experiments have not been utilized to optimally detect variant co-expression. Herein, we introduce scSNV designed from the ground up to “collapse” molecular duplicates and accurately identify variants and their co-expression. We demonstrate that scSNV is fast, with a reduced false-positive variant call rate, and enables the co-detection of genetic variants and A>G RNA edits across twenty-two samples.
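
The abstract does not describe how scSNV collapses molecular duplicates. As a generic illustration of the idea, the sketch below groups reads at one site by cell barcode and molecular tag (UMI) and keeps a single majority-vote base per molecule, so that PCR copies of the same molecule are counted once. The function name, the input layout and the tie-handling rule are assumptions of this example and should not be read as scSNV's method.

```python
from collections import Counter, defaultdict


def collapse_duplicates(reads):
    """Collapse reads at one site into one consensus base per molecule.

    reads: iterable of (cell_barcode, umi, base) tuples.
    Returns {(cell_barcode, umi): consensus_base}. Molecules whose two most
    common bases are tied are dropped as ambiguous.
    """
    groups = defaultdict(Counter)
    for barcode, umi, base in reads:
        groups[(barcode, umi)][base] += 1

    consensus = {}
    for molecule, counts in groups.items():
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # no majority base for this molecule
        consensus[molecule] = ranked[0][0]
    return consensus


# Hypothetical reads covering a single genomic position.
toy_reads = [
    ("AAAC", "TTG", "A"),
    ("AAAC", "TTG", "A"),
    ("AAAC", "TTG", "G"),  # likely an error within the same molecule
    ("AAAC", "CCA", "G"),
    ("GGTA", "TTG", "A"),
    ("GGTA", "TTG", "G"),  # tie, so this molecule is dropped
]
print(collapse_duplicates(toy_reads))
```

Counting molecules rather than reads keeps a single heavily amplified fragment from dominating the evidence for a variant, which is one plausible route to the lower false-positive rate the abstract reports.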

Structural variant detection in cancer genomes: computational challenges and perspectives for precision oncology

npj Precision Oncology ◽

10.1038/s41698-021-00155-6 ◽

2021 ◽

Vol 5 (1) ◽

Author(s):

Ianthe A. E. M. van Belzen ◽

Alexander Schönhuth ◽

Patrick Kemmeren ◽

Jayne Y. Hehir-Kwa

Keyword(s):

Intratumor Heterogeneity ◽

Precision Oncology ◽

Single Nucleotide Variants ◽

Full Spectrum ◽

Single Nucleotide ◽

Sequencing Technologies ◽

Cancer Genomes ◽

Genomic Aberrations

Abstract Cancer is generally characterized by acquired genomic aberrations in a broad spectrum of types and sizes, ranging from single nucleotide variants to structural variants (SVs). At least 30% of cancers have a known pathogenic SV used in diagnosis or treatment stratification. However, research into the role of SVs in cancer has been limited due to difficulties in detection. Biological and computational challenges confound SV detection in cancer samples, including intratumor heterogeneity, polyploidy, and distinguishing tumor-specific SVs from germline and somatic variants present in healthy cells. Classification of tumor-specific SVs is challenging due to inconsistencies in detected breakpoints, derived variant types and biological complexity of some rearrangements. Full-spectrum SV detection with high recall and precision requires integration of multiple algorithms and sequencing technologies to rescue variants that are difficult to resolve through individual methods. Here, we explore current strategies for integrating SV callsets and for enabling the use of tumor-specific SVs in precision oncology.