SomaticSignatures: Inferring Mutational Signatures from Single Nucleotide Variants

Mutational signatures are patterns in the occurrence of somatic single nucleotide variants (SNVs) that can reflect underlying mutational processes. The SomaticSignatures package provides flexible, interoperable, and easy-to-use tools that identify such signatures in cancer sequencing data. It facilitates large-scale, cross-dataset estimation of mutational signatures, implements existing methods for pattern decomposition, supports extension through user-defined methods and integrates with Bioconductor workflows. The R package SomaticSignatures is available as part of the Bioconductor project (R Core Team, 2014; Gentleman et al., 2004). Its documentation provides additional details on the methodology and demonstrates applications to biological datasets.

Implications of Genetic Distance to Reference and De Novo Genome Assembly for Clinical Genomics in Africans

10.1101/2020.09.25.20201780 ◽

2020 ◽

Author(s):

Daniel Shriner ◽

Adebowale Adeyemo ◽

Charles Rotimi

Keyword(s):

Genetic Distance ◽

De Novo ◽

Reference Sequence ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

De Novo Genome Assembly ◽

Single Nucleotide ◽

Clinical Genomics ◽

Advantages And Disadvantages ◽

False Discovery

In clinical genomics, variant calling from short-read sequencing data typically relies on a pan-genomic, universal human reference sequence. A major limitation of this approach is that the number of reads that incorrectly map or fail to map increases as the reads diverge from the reference sequence. In the context of genome sequencing of genetically diverse Africans, we investigate the advantages and disadvantages of using a de novo assembly of the read data as the reference sequence in single sample calling. Conditional on sufficient read depth, the alignment-based and assembly-based approaches yielded comparable sensitivity and false discovery rates for single nucleotide variants when benchmarked against a gold standard call set. The alignment-based approach yielded coverage of an additional 270.8 Mb over which sensitivity was lower and the false discovery rate was higher. Although both approaches detected and missed clinically relevant variants, the assembly-based approach identified more such variants than the alignment-based approach. Of particular relevance to individuals of African descent, the assembly-based approach identified four heterozygous genotypes containing the sickle allele whereas the alignment-based approach identified no occurrences of the sickle allele. Variant annotation using dbSNP and gnomAD identified systematic biases in these databases due to underrepresentation of Africans. Using the counts of homozygous alternate genotypes from the alignment-based approach as a measure of genetic distance to the reference sequence GRCh38.p12, we found that the numbers of misassemblies, total variant sites, potentially novel single nucleotide variants (SNVs), and certain variant classes (e.g., splice acceptor variants, stop loss variants, missense variants, synonymous variants, and variants absent from gnomAD) were significantly correlated with genetic distance. In contrast, genomic coverage and other variant classes (e.g., ClinVar pathogenic or likely pathogenic variants, start loss variants, stop gain variants, splice donor variants, incomplete terminal codons, variants with CADD score ≥20) were not correlated with genetic distance. With improvement in coverage, the assembly-based approach can offer a viable alternative to the alignment-based approach, with the advantage that it can obviate the need to generate diverse human reference sequences or collections of alternate scaffolds.

Highly multiplexed, fast and accurate nanopore sequencing for verification of synthetic DNA constructs and sequence libraries

Synthetic Biology ◽

10.1093/synbio/ysz025 ◽

2019 ◽

Vol 4 (1) ◽

Cited By ~ 4

Author(s):

Andrew Currin ◽

Neil Swainston ◽

Mark S Dunstan ◽

Adrian J Jervis ◽

Paul Mulherin ◽

...

Keyword(s):

Synthetic Biology ◽

Dna Sequencing ◽

Cost Effective ◽

Polymorphism Analysis ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Synthetic Dna ◽

Design Build ◽

Hardware Costs

Synthetic biology utilizes the Design–Build–Test–Learn pipeline for the engineering of biological systems. Typically, this requires the construction of specifically designed, large and complex DNA assemblies. The availability of cheap DNA synthesis and automation enables high-throughput assembly approaches, which generate a heavy demand for DNA sequencing to verify correctly assembled constructs. Next-generation sequencing is ideally positioned to perform this task; however, with expensive hardware costs and bespoke data analysis requirements, few laboratories utilize this technology in-house. Here a workflow for highly multiplexed sequencing is presented, capable of fast and accurate sequence verification of DNA assemblies using nanopore technology. A novel sample barcoding system using polymerase chain reaction is introduced, and sequencing data are analyzed through a bespoke analysis algorithm. Crucially, this algorithm overcomes the problem of high-error rate nanopore data (which typically prevents identification of single nucleotide variants) through statistical analysis of strand bias, permitting accurate sequence analysis with single-base resolution. As an example, 576 constructs (6 × 96 well plates) were processed in a single workflow in 72 h (from Escherichia coli colonies to analyzed data). Given our procedure’s low hardware costs and highly multiplexed capability, this provides cost-effective access to powerful DNA sequencing for any laboratory, with applications beyond synthetic biology including directed evolution, single nucleotide polymorphism analysis and gene synthesis.

Twenty-Five Years of Propagation in Suspension Cell Culture Results in Substantial Alterations of the Arabidopsis Thaliana Genome

Genes ◽

10.3390/genes10090671 ◽

2019 ◽

Vol 10 (9) ◽

pp. 671 ◽

Cited By ~ 2

Author(s):

Pucker ◽

Rückert ◽

Stracke ◽

Viehöver ◽

Kalinowski ◽

...

Keyword(s):

Cell Culture ◽

Arabidopsis Thaliana ◽

Large Scale ◽

Suspension Cell ◽

Reference Sequence ◽

Model Organisms ◽

Suspension Cell Culture ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Haploid Genome Size

Arabidopsis thaliana is one of the best studied plant model organisms. Besides cultivation in greenhouses, cells of this plant can also be propagated in suspension cell culture. At7 is one such cell line that was established about 25 years ago. Here, we report the sequencing and the analysis of the At7 genome. Large scale duplications and deletions compared to the Columbia-0 (Col-0) reference sequence were detected. The number of deletions exceeds the number of insertions, thus indicating that a haploid genome size reduction is ongoing. Patterns of small sequence variants differ from the ones observed between A. thaliana accessions, e.g., the number of single nucleotide variants matches the number of insertions/deletions. RNA-Seq analysis reveals that disrupted alleles are less frequent in the transcriptome than the native ones.

neoepiscope improves neoepitope prediction with multivariant phasing

Bioinformatics ◽

10.1093/bioinformatics/btz653 ◽

2019 ◽

Vol 36 (3) ◽

pp. 713-720 ◽

Cited By ~ 5

Author(s):

Mary A Wood ◽

Austin Nguyen ◽

Adam J Struck ◽

Kyle Ellrott ◽

Abhinav Nellore ◽

...

Keyword(s):

False Negative ◽

Supplementary Information ◽

Supplementary File ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Somatic Variant ◽

Negative Results ◽

Multiple Datasets ◽

False Negative Results

Motivation: The vast majority of tools for neoepitope prediction from DNA sequencing of complementary tumor and normal patient samples do not consider germline context or the potential for the co-occurrence of two or more somatic variants on the same mRNA transcript. Without consideration of these phenomena, existing approaches are likely to produce both false-positive and false-negative results, resulting in an inaccurate and incomplete picture of the cancer neoepitope landscape. We developed neoepiscope chiefly to address this issue for single nucleotide variants (SNVs) and insertions/deletions (indels). Results: Herein, we illustrate how germline and somatic variant phasing affects neoepitope prediction across multiple datasets. We estimate that up to ∼5% of neoepitopes arising from SNVs and indels may require variant phasing for their accurate assessment. neoepiscope is performant, flexible and supports several major histocompatibility complex binding affinity prediction tools. Availability and implementation: neoepiscope is available on GitHub at https://github.com/pdxgx/neoepiscope under the MIT license. Scripts for reproducing results described in the text are available at https://github.com/pdxgx/neoepiscope-paper under the MIT license. Additional data from this study, including summaries of variant phasing incidence and benchmarking wallclock times, are available in Supplementary Files 1, 2 and 3. Supplementary File 1 contains Supplementary Table 1, Supplementary Figures 1 and 2, and descriptions of Supplementary Tables 2–8. Supplementary File 2 contains Supplementary Tables 2–6 and 8. Supplementary File 3 contains Supplementary Table 7. Raw sequencing data used for the analyses in this manuscript are available from the Sequence Read Archive under accessions PRJNA278450, PRJNA312948, PRJNA307199, PRJNA343789, PRJNA357321, PRJNA293912, PRJNA369259, PRJNA305077, PRJNA306070, PRJNA82745 and PRJNA324705; from the European Genome-phenome Archive under accessions EGAD00001004352 and EGAD00001002731; and by direct request to the authors. Supplementary information: Supplementary data are available at Bioinformatics online.

ABEMUS: platform-specific and data-informed detection of somatic SNVs in cfDNA

Bioinformatics ◽

10.1093/bioinformatics/btaa016 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2665-2674

Author(s):

Nicola Casiraghi ◽

Francesco Orlando ◽

Yari Ciani ◽

Jenny Xiang ◽

Andrea Sboner ◽

...

Keyword(s):

Cancer Patients ◽

R Package ◽

Circulating Tumor Dna ◽

Supplementary Information ◽

Sequencing Error ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Liquid Biopsies ◽

Non Invasive ◽

Cross Platform

Motivation: The use of liquid biopsies for cancer patients enables the non-invasive tracking of treatment response and tumor dynamics through single or serial blood draw tests. Next-generation sequencing assays allow for the simultaneous interrogation of extended sets of somatic single-nucleotide variants (SNVs) in circulating cell-free DNA (cfDNA), a mixture of DNA molecules originating from both normal and tumor tissue cells. However, low circulating tumor DNA (ctDNA) fractions, together with sequencing background noise and potential tumor heterogeneity, challenge the ability to confidently call SNVs. Results: We present a computational methodology, called Adaptive Base Error Model in Ultra-deep Sequencing data (ABEMUS), which combines platform-specific genetic knowledge and empirical signal to readily detect and quantify somatic SNVs in cfDNA. We tested the capability of our method to analyze data generated using different platforms with distinct sequencing error properties, and we compared the performance of ABEMUS with that of other popular SNV callers on both synthetic and real cancer patient sequencing data. Results show that ABEMUS performs better in most of the tested conditions, proving its reliability in calling somatic SNVs at low variant allele frequencies in plasma samples with low ctDNA levels. Availability and implementation: ABEMUS is cross-platform and can be installed as an R package. The source code is maintained on GitHub at http://github.com/cibiobcg/abemus, and it is also available at the official CRAN R repository. Supplementary information: Supplementary data are available at Bioinformatics online.

The discrepancy among single nucleotide variants detected by DNA and RNA high throughput sequencing data

BMC Genomics ◽

10.1186/s12864-017-4022-x ◽

2017 ◽

Vol 18 (S6) ◽

Cited By ~ 16

Author(s):

Yan Guo ◽

Shilin Zhao ◽

Quanhu Sheng ◽

David C Samuels ◽

Yu Shyr

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Dna And Rna ◽

High Throughput Sequencing Data

Differential Genome-Wide Mutational Patterns in Indolent B-Cell Lymphomas

Blood ◽

10.1182/blood-2018-99-116174 ◽

2018 ◽

Vol 132 (Supplement 1) ◽

pp. 4102-4102

Author(s):

Julieta Haydee Sepulveda Yanez ◽

Diego Alvarez ◽

Jose Fernandez-Goycoolea ◽

Cornelis A.M. van Bergen ◽

Hendrik Veelken ◽

...

Keyword(s):

B Cell ◽

Cell Lymphoma ◽

R Package ◽

Cpg Methylation ◽

Unknown Etiology ◽

Sequencing Data ◽

B Cell Lymphomas ◽

B Cell Malignancies ◽

Mutational Signatures ◽

Mutational Landscape

Introduction: In recent years, strategies have been developed to identify specific mutation patterns within next-generation sequencing data. Distinct mutational patterns can be linked to underlying mutagenic processes in human cancer. One approach analyzes single base substitutions in the context of their neighboring bases as trinucleotides. The relative prevalence of all possible 96 altered trinucleotides defines distinctive mutational signatures. The activity of activation-induced cytidine deaminase (AID) initiates a specific mutational process in B cells. AID induces deamination of deoxycytidine into deoxyuridine. Subsequent mechanisms to repair the resulting mismatch lead to different genomic alterations that can be assigned to three mutational signatures: a canonical signature characterized by C>T/G transitions at WRCY motifs, a non-canonical signature defined by A>C transversions at WAN motifs, and a third AID signature characterized by C>T transitions at RCG motifs with preference for methylated CpG (W: A or T; R: purine; Y: pyrimidine; N: any nucleotide). The latter signature has specifically been designated as AID-mediated CpG-methylation-dependent mutagenesis. AID activity has been linked to the pathogenesis of several B-cell lymphomas, including follicular lymphoma (FL), chronic lymphocytic leukemia (CLL), and mantle cell lymphoma (MCL). Therefore, we searched for the contribution of different AID signatures in these B-cell malignancies. Methods: We analyzed the mutational landscape in whole exome (WES) and whole genome (WGS) sequencing data from 41 FL, 30 CLL, 2 MBL, and 43 MCL cases. Somatic variants were called by comparison of tumor and germline DNA with an in-house developed pipeline. Mutational signatures were defined according to the 96-base substitution model (Alexandrov et al. 2013) by unsupervised machine learning using the SomaticSignatures R package (Gehring et al. 2015). In addition, the MutationalPatterns R package (Blokzijl et al. 2018) was executed for comparison to mutational signatures defined in COSMIC. Results: In unsupervised analyses of FL, CLL/MBL, and MCL cases, 77% of the mutation spectrum variance was attributable to four signatures (S1-4). In FL, the mutational landscape was dominated by S4 characterized by mutations in both canonical and non-canonical AID motifs (40%, 95% CI: 35-76%). The second most frequent signature (S2; 27%, 21-49%) was characterized by C>A transitions in the context of the non-canonical AID and the CpG hotspot motifs (RCG). The mutational landscape of CLL and MBL was strongly dominated by signature S3 (50%, 45-95%). S3 contains mutations in RCG motifs as well as mutations in non-canonical AID motifs (NTW), but with a lower contribution than in S4. In contrast, the mutational landscape of MCL was dominated by S1 (31%, 24-55%) characterized by C>T transitions in the RCG motif in addition to a striking prevalence of the TCT>TTT transition that is known to be associated with the activity of APOBEC enzymes. In comparison to the mutational signatures in COSMIC, the lymphomas analyzed here carry a strong similarity to the COSMIC signatures 1, 5, and 25. These signatures are observed across a wide spectrum of cancer types and are either of unknown etiology (S5 and S25) or associated with age (S1). Conclusions: The most common point mutations in CLL/MBL and FL are C>T transitions and indicate a strong influence of AID on their mutational landscape. In the indolent B-cell malignancies, all three known AID-related signatures, i.e. canonical, non-canonical, and CpG-methylation-dependent, can be found. In contrast, the genomic landscape of MCL is dominated by variants in CpG-methylation-dependent mutagenesis sites and by an APOBEC-related motif. In addition to AID-related signatures, we also found consensus signatures described in COSMIC such as the age-related spontaneous deamination signature 1. Our work independently confirms the role of AID in B-cell lymphoma pathogenesis but points to disease-specific mechanisms that modulate AID in the respective lymphoma cell of origin. In addition, our data suggest that distinctive repair mechanisms operate in different entities. Disclosures: No relevant conflicts of interest to declare.
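The 96-type substitution model mentioned in this abstract can be illustrated with a short sketch. It is not taken from SomaticSignatures or MutationalPatterns; the function names (`mutation_types`, `classify_snv`) and the `T[C>T]T` label format are assumptions made for illustration. The count of 96 follows from 6 pyrimidine-referenced substitution classes times 4 possible 5' bases times 4 possible 3' bases, with purine-referenced variants folded onto the opposite strand.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Six substitution classes, expressed with a pyrimidine (C or T) reference base.
SUBSTITUTIONS = [("C", "A"), ("C", "G"), ("C", "T"),
                 ("T", "A"), ("T", "C"), ("T", "G")]


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mutation_types():
    """List all 6 x 4 x 4 = 96 labels, e.g. 'T[C>T]T'."""
    return [f"{five}[{ref}>{alt}]{three}"
            for (ref, alt), five, three in product(SUBSTITUTIONS, BASES, BASES)]


def classify_snv(context, alt):
    """Map an SNV to one of the 96 labels.

    context: trinucleotide with the reference base in the middle.
    alt: the alternative base on the same strand as context.
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or len(alt) != 1 or set(context + alt) - set(BASES):
        raise ValueError("expected a trinucleotide and a single base over ACGT")
    if context[1] == alt:
        raise ValueError("alternative base equals the reference base")
    # Purine reference: report the equivalent change on the opposite strand.
    if context[1] in "AG":
        context, alt = reverse_complement(context), reverse_complement(alt)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"
```

Counting the labels returned by `classify_snv` per sample gives one column of the motif matrix that decomposition methods then factor into signatures.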

Misannotation of multiple-nucleotide variants risks misdiagnosis

Wellcome Open Research ◽

10.12688/wellcomeopenres.15420.1 ◽

2019 ◽

Vol 4 ◽

pp. 145

Author(s):

Matthew N. Wakeling ◽

Thomas W. Laver ◽

Kevin Colclough ◽

Andrew Parish ◽

Sian Ellard ◽

...

Keyword(s):

Best Practices ◽

False Negative ◽

Simulated Data ◽

Sequencing Analysis ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Public Resources ◽

Next Generation Sequencing Analysis ◽

Optimal Approach

Multiple nucleotide variants (MNVs) are miscalled by the most widely utilised next generation sequencing (NGS) analysis pipelines, presenting the potential for missing diagnoses that would previously have been made by standard Sanger (dideoxy) sequencing. These variants, which should be treated as a single insertion-deletion mutation event, are commonly called as separate single nucleotide variants. This can result in misannotation, incorrect amino acid predictions and potentially false positive and false negative diagnostic results. This risk will be increased as confirmatory Sanger sequencing of single nucleotide variants (SNVs) ceases to be standard practice. Using simulated data and re-analysis of sequencing data from a diagnostic targeted gene panel, we demonstrate that the widely adopted pipeline, GATK best practices, results in miscalling of MNVs and that alternative tools can call these variants correctly. The adoption of calling methods that annotate MNVs correctly would present a solution for individual laboratories; however, GATK best practices are the basis for important public resources such as the gnomAD database. We suggest that integrating a solution into these guidelines would be the optimal approach.

Identification of single nucleotide variants using position-specific error estimation in deep sequencing data

10.1101/475947 ◽

2018 ◽

Author(s):

Dimitrios Kleftogiannis ◽

Marco Punta ◽

Anuradha Jayaram ◽

Shahneen Sandhu ◽

Stephen Q. Wong ◽

...

Keyword(s):

Deep Sequencing ◽

Low Frequency ◽

Poisson Model ◽

Real Data ◽

Analytical Sensitivity ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Deep Sequencing Data ◽

Targeted Deep Sequencing

Background: Targeted deep sequencing is a highly effective technology to identify known and novel single nucleotide variants (SNVs) with many applications in translational medicine, disease monitoring and cancer profiling. However, identification of SNVs using deep sequencing data is a challenging computational problem as different sequencing artifacts limit the analytical sensitivity of SNV detection, especially at low variant allele frequencies (VAFs). Methods: To address the problem of relatively high noise levels in amplicon-based deep sequencing data (e.g. with the Ion AmpliSeq technology) in the context of SNV calling, we have developed a new bioinformatics tool called AmpliSolve. AmpliSolve uses a set of normal samples to model position-specific, strand-specific and nucleotide-specific background artifacts (noise), and deploys a Poisson model-based statistical framework for SNV detection. Results: Our tests on both synthetic and real data indicate that AmpliSolve achieves a good trade-off between precision and sensitivity, even at VAF below 5% and as low as 1%. We further validate AmpliSolve by applying it to the detection of SNVs in 96 circulating tumor DNA samples at three clinically relevant genomic positions and compare the results to digital droplet PCR experiments. Conclusions: AmpliSolve is a new tool for in-silico estimation of background noise and for detection of low frequency SNVs in targeted deep sequencing data. Although AmpliSolve has been specifically designed for and tested on amplicon-based libraries sequenced with the Ion Torrent platform it can, in principle, be applied to other sequencing platforms as well. AmpliSolve is freely available at https://github.com/dkleftogi/AmpliSolve.
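The general idea of testing a candidate SNV against a Poisson background model can be sketched as follows. This is not AmpliSolve's implementation: the function names, the pooled error-rate estimate, and the significance threshold `alpha` are assumptions chosen for illustration. The sketch estimates a background error rate for one position, strand and nucleotide from normal samples, then asks how likely the observed number of alternative reads would be if they were all noise.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), using only the standard library.

    Adequate for moderate lam; very large lam underflows exp(-lam), and
    extremely small tail probabilities lose precision in the subtraction.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def background_error_rate(normal_alt_counts, normal_depths):
    """Pooled error rate at one position/strand/nucleotide across normal samples."""
    total_depth = sum(normal_depths)
    if total_depth == 0:
        raise ValueError("normal samples have no coverage at this position")
    return sum(normal_alt_counts) / total_depth


def evaluate_candidate(alt_count, depth, error_rate, alpha=0.01):
    """Return (p_value, is_variant) under a Poisson noise model.

    The expected number of erroneous alternative reads is error_rate * depth.
    alpha is a hypothetical threshold, not a value from the source.
    """
    expected_errors = error_rate * depth
    p_value = poisson_sf(alt_count, expected_errors)
    return p_value, p_value < alpha
```

For example, `background_error_rate([1, 0, 2], [1000, 1000, 1000])` gives a pooled rate that can be passed to `evaluate_candidate` together with the tumor sample's alternative read count and depth. A production caller would also correct for multiple testing across positions, which this sketch omits.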

25 years of propagation in suspension cell culture results in substantial alterations of the Arabidopsis thaliana genome

10.1101/710624 ◽

2019 ◽

Author(s):

Boas Pucker ◽

Christian Rückert ◽

Ralf Stracke ◽

Prisca Viehöver ◽

Jörn Kalinowski ◽

...

Keyword(s):

Cell Culture ◽

Arabidopsis Thaliana ◽

Large Scale ◽

Suspension Cell ◽

Reference Sequence ◽

Model Organisms ◽

Suspension Cell Culture ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Haploid Genome Size

Arabidopsis thaliana is one of the best studied plant model organisms. Besides cultivation in greenhouses, cells of this plant can also be propagated in suspension cell culture. At7 is one such cell line that was established about 25 years ago. Here, we report the sequencing and the analysis of the At7 genome. Large scale duplications and deletions compared to the Col-0 reference sequence were detected. The number of deletions exceeds the number of insertions, thus indicating that a haploid genome size reduction is ongoing. Patterns of small sequence variants differ from the ones observed between A. thaliana accessions, e.g., the number of single nucleotide variants matches the number of insertions/deletions. RNA-Seq analysis reveals that disrupted alleles are less frequent in the transcriptome than the native ones.
