De novo mutational signature discovery in tumor genomes using SparseSignatures

Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or “mutational signatures”. Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.
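The combination described above (a fixed background signature plus a sparsity penalty on the remaining signatures) can be illustrated with a small matrix factorization sketch. This is not the SparseSignatures implementation: it swaps in plain multiplicative updates with an L1 penalty, uses a squared-error objective, and runs on a made-up 4 x 6 count matrix rather than real mutation categories. The function names, the penalty weight `lam`, and the toy data are all assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def objective(counts, exposures, signatures, lam):
    """Half the squared error plus an L1 penalty on non-background signatures."""
    fit = matmul(exposures, signatures)
    sq_err = sum((c - f) ** 2 for cr, fr in zip(counts, fit) for c, f in zip(cr, fr))
    l1 = sum(sum(row) for row in signatures[1:])  # row 0 is the background
    return 0.5 * sq_err + lam * l1


def fit_with_background(counts, background, n_extra, lam=0.1, n_iter=200, seed=0):
    """Factor counts ~ exposures @ signatures, keeping signature row 0 fixed."""
    rng = random.Random(seed)
    n_samples, n_cats = len(counts), len(counts[0])
    k = n_extra + 1
    signatures = [list(background)]
    signatures += [[rng.random() + 0.1 for _ in range(n_cats)] for _ in range(n_extra)]
    exposures = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_samples)]
    eps = 1e-12  # guards against division by zero
    history = [objective(counts, exposures, signatures, lam)]
    for _ in range(n_iter):
        # Exposure update: E <- E * (X S^T) / (E S S^T)
        s_t = transpose(signatures)
        num = matmul(counts, s_t)
        den = matmul(exposures, matmul(signatures, s_t))
        exposures = [[e * a / (b + eps) for e, a, b in zip(er, ar, br)]
                     for er, ar, br in zip(exposures, num, den)]
        # Signature update for non-background rows only:
        # S <- S * (E^T X) / (E^T E S + lam); the penalty shrinks small entries.
        e_t = transpose(exposures)
        num = matmul(e_t, counts)
        den = matmul(matmul(e_t, exposures), signatures)
        for r in range(1, k):
            signatures[r] = [s * a / (b + lam + eps)
                             for s, a, b in zip(signatures[r], num[r], den[r])]
        history.append(objective(counts, exposures, signatures, lam))
    return exposures, signatures, history


# Toy data (hypothetical): 4 samples, 6 mutation categories.
toy_counts = [[12, 3, 0, 5, 1, 9], [10, 4, 1, 4, 0, 8],
              [2, 15, 7, 1, 6, 2], [3, 14, 9, 0, 7, 1]]
toy_background = [0.3, 0.1, 0.05, 0.15, 0.05, 0.35]
exposures, signatures, history = fit_with_background(toy_counts, toy_background, n_extra=2)
print(round(history[0], 2), "->", round(history[-1], 2))
```

Raising `lam` in this sketch pushes more entries of the non-background signatures toward zero, while the background row is never modified.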

De Novo Mutational Signature Discovery in Tumor Genomes using SparseSignatures

10.1101/384834 ◽

2018 ◽

Cited By ~ 5

Author(s):

Avantika Lal ◽

Keli Liu ◽

Robert Tibshirani ◽

Arend Sidow ◽

Daniele Ramazzotti

Keyword(s):

Cross Validation ◽

De Novo ◽

State Of The Art ◽

Point Mutations ◽

Simulated Data ◽

Large Datasets ◽

Genome Sequences ◽

Mutational Signatures ◽

Mutational Signature ◽

Current State

Abstract Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or “mutational signatures”. Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates DNA replication error as a background, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using standard metrics. We then apply SparseSignatures to whole genome sequences of 147 tumors from pancreatic cancer, discovering 8 signatures in addition to the background.

Genotyping structural variants in pangenome graphs using the vg toolkit

10.1101/654566 ◽

2019 ◽

Cited By ~ 7

Author(s):

Glenn Hickey ◽

David Heller ◽

Jean Monlong ◽

Jonas A. Sibbesen ◽

Jouni Sirén ◽

...

Keyword(s):

De Novo ◽

State Of The Art ◽

Effective Means ◽

Point Mutations ◽

Structural Variants ◽

Short Read ◽

Yeast Strains ◽

Sequencing Studies ◽

Long Read

Abstract Structural variants (SVs) remain challenging to represent and study relative to point mutations despite their demonstrated importance. We show that variation graphs, as implemented in the vg toolkit, provide an effective means for leveraging SV catalogs for short-read SV genotyping experiments. We benchmarked vg against state-of-the-art SV genotypers using three sequence-resolved SV catalogs generated by recent long-read sequencing studies. In addition, we use assemblies from 12 yeast strains to show that graphs constructed directly from aligned de novo assemblies improve genotyping compared to graphs built from intermediate SV catalogs in the VCF format.

Overlap graph-based generation of haplotigs for diploids and polyploids

10.1101/378356 ◽

2018 ◽

Author(s):

Jasmijn A. Baaijens ◽

Alexander Schönhuth

Keyword(s):

Recent Work ◽

Genome Assembly ◽

De Novo ◽

Iterative Scheme ◽

State Of The Art ◽

Simulated Data ◽

Specific Sequence ◽

New Approach ◽

Link Type ◽

Polyploid Genome

Abstract Haplotype-aware genome assembly plays an important role in genetics, medicine, and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference-independent haplotig computation has not yet reached maturity. We present POLYTE (POLYploid genome fitTEr) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings. POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++.

CaMuS: simultaneous fitting and de novo imputation of cancer mutational signature

Scientific Reports ◽

10.1038/s41598-020-75753-8 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Maria Cartolano ◽

Nima Abedpour ◽

Viktor Achter ◽

Tsun-Po Yang ◽

Sandra Ackermann ◽

...

Keyword(s):

De Novo ◽

Probability Distributions ◽

Simulated Data ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Mutational Signatures ◽

Computational Performance ◽

Reliable Parameter ◽

Similar Accuracy ◽

Mutational Processes

Abstract The identification of the mutational processes operating in tumour cells has implications for cancer diagnosis and therapy. These processes leave mutational patterns on the cancer genomes, which are referred to as mutational signatures. Recently, 81 mutational signatures have been inferred using computational algorithms on sequencing data of 23,879 samples. However, these published signatures may not always offer a comprehensive view on the biological processes underlying tumour types that are not included or underrepresented in the reference studies. To circumvent this problem, we designed CaMuS (Cancer Mutational Signatures) to construct de novo signatures while simultaneously fitting publicly available mutational signatures. Furthermore, we propose to estimate signature similarity by comparing probability distributions using the Hellinger distance. We applied CaMuS to infer signatures of mutational processes in poorly studied cancer types. We used whole genome sequencing data of 56 neuroblastoma, thus providing evidence for the versatility of CaMuS. Using simulated data, we compared the performance of CaMuS to sigfit, a recently developed algorithm with comparable inference functionalities. CaMuS and sigfit reconstructed the simulated datasets with similar accuracy; however two main features may argue for CaMuS over sigfit: (i) superior computational performance and (ii) a reliable parameter selection method to avoid spurious signatures.
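The Hellinger distance mentioned above compares two discrete probability distributions and ranges from 0 (identical) to 1 (no shared support). A minimal sketch follows; it is not taken from the CaMuS code base, and the function name and the normalization of the inputs are assumptions made for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two non-negative vectors of equal length.

    Each vector is first normalized to sum to 1, so raw mutation counts
    or already-normalized signatures can both be passed in.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("each vector must have a positive sum")
    total = sum((math.sqrt(a / sum_p) - math.sqrt(b / sum_q)) ** 2
                for a, b in zip(p, q))
    return math.sqrt(total / 2)


# Two signatures with disjoint support are maximally distant.
print(hellinger([1, 0], [0, 1]))  # 1.0
# A signature compared with a rescaled copy of itself has distance 0.
print(hellinger([2, 6, 2], [1, 3, 1]))  # 0.0
```

Because the distance is bounded and symmetric, a fixed threshold on it can be used to decide whether a de novo signature matches a reference signature; the choice of threshold is left open here.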

The Rater Bundle Model

Journal of Educational and Behavioral Statistics ◽

10.3102/10769986026003283 ◽

2001 ◽

Vol 26 (3) ◽

pp. 283-306 ◽

Cited By ~ 30

Author(s):

Mark Wilson ◽

Machteld Hoskens

Keyword(s):

Item Response ◽

Conditional Independence ◽

State Of The Art ◽

Simulated Data ◽

Response Model ◽

Item Response Model ◽

Student Work ◽

Response Models ◽

Item Response Models ◽

Current State

In this article an item response model is introduced for repeated ratings of student work, which we have called the Rater Bundle Model (RBM). Development of this model was motivated by the observation that when repeated ratings occur, the assumption of conditional independence is violated, and hence current state-of-the-art item response models, such as the rater facets model, that ignore this violation, underestimate measurement error, and overestimate reliability. In the rater bundle model these dependencies are explicitly parameterized. The model is applied to both real and simulated data to illustrate the approach.

Metagenomics Strain Resolution on Assembly Graphs

10.1101/2020.09.06.284828 ◽

2020 ◽

Author(s):

Christopher Quince ◽

Sergey Nurk ◽

Sebastien Raguideau ◽

Robert James ◽

Orkun S. Soyer ◽

...

Keyword(s):

De Novo ◽

State Of The Art ◽

Single Copy ◽

Community Members ◽

Bioinformatics Pipeline ◽

Read Mapping ◽

Bayesian Algorithm ◽

Anaerobic Digestor ◽

Current State ◽

Core Genes

Abstract We introduce a novel bioinformatics pipeline, STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo, when multiple metagenome samples from the same community are available. STRONG performs coassembly, followed by binning into metagenome assembled genomes (MAGs), but uniquely it stores the coassembly graph prior to simplification of variants. This enables the subgraphs for individual single-copy core genes (SCGs) in each MAG to be extracted. It can then thread back reads from the samples to compute per-sample coverages for the unitigs in these graphs. These graphs and their unitig coverages are then used in a Bayesian algorithm, BayesPaths, that determines the number of strains present, their sequences or haplotypes on the SCGs and their abundances in each of the samples. Our approach both avoids the ambiguities of read mapping and allows more of the information on co-occurrence of variants in reads to be utilised than if variants were treated independently, whilst at the same time exploiting the correlation of variants across samples that occurs when they are linked in the same strain. We compare STRONG to the current state of the art on synthetic communities and demonstrate that we can recover more strains, more accurately, and with a realistic estimate of uncertainty deriving from the variational Bayesian algorithm employed for the strain resolution. On a real anaerobic digestor time series we obtained strain-resolved SCGs for over 300 MAGs that for abundant community members match those observed from long Nanopore reads.

Combining phenome-driven drug-target interaction prediction with patients’ electronic health records-based clinical corroboration toward drug discovery

Bioinformatics ◽

10.1093/bioinformatics/btaa451 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i436-i444 ◽

Cited By ~ 2

Author(s):

Mengshi Zhou ◽

Chunlei Zheng ◽

Rong Xu

Keyword(s):

Drug Discovery ◽

Electronic Health Records ◽

Drug Target ◽

Cross Validation ◽

De Novo ◽

State Of The Art ◽

P Value ◽

Prediction System ◽

Health Records ◽

Drug Candidates

Abstract Motivation: Predicting drug–target interactions (DTIs) using human phenotypic data has the potential to eliminate the translational gap between animal experiments and clinical outcomes in humans. One challenge in human phenome-driven DTI predictions is integrating and modeling diverse drug and disease phenotypic relationships. Leveraging large amounts of clinically observed phenotypes of drugs and diseases and electronic health records (EHRs) of 72 million patients, we developed a novel integrated computational drug discovery approach by seamlessly combining DTI prediction and clinical corroboration. Results: We developed a network-based DTI prediction system (TargetPredict) by modeling 855 904 phenotypic and genetic relationships among 1430 drugs, 4251 side effects, 1059 diseases and 17 860 genes. We systematically evaluated TargetPredict in de novo cross-validation and compared it to a state-of-the-art phenome-driven DTI prediction approach. We applied TargetPredict in identifying novel repositioned candidate drugs for Alzheimer’s disease (AD), a disease affecting over 5.8 million people in the United States. We evaluated the clinical efficiency of top repositioned drug candidates using EHRs of over 72 million patients. The area under the receiver operating characteristic (ROC) curve was 0.97 in the de novo cross-validation when evaluated using 910 drugs. TargetPredict outperformed a state-of-the-art phenome-driven DTI prediction system as measured by precision–recall curves [measured by average precision (MAP): 0.28 versus 0.23, P-value < 0.0001]. The EHR-based case–control studies identified that prescriptions of top-ranked repositioned drugs are significantly associated with lower odds of AD diagnosis. For example, we showed that the prescription of liraglutide, a type 2 diabetes drug, is significantly associated with decreased risk of AD diagnosis [adjusted odds ratios (AORs): 0.76; 95% confidence intervals (CI) (0.70, 0.82), P-value < 0.0001]. In summary, our integrated approach that seamlessly combines computational DTI prediction and large-scale patients’ EHRs-based clinical corroboration has high potential in rapidly identifying novel drug targets and drug candidates for complex diseases. Availability and implementation: nlp.case.edu/public/data/TargetPredict.

Helixer: Cross-species gene annotation of large eukaryotic genomes using deep learning

Bioinformatics ◽

10.1093/bioinformatics/btaa1044 ◽

2020 ◽

Author(s):

Felix Stiehler ◽

Marvin Steinborn ◽

Stephan Scholz ◽

Daniela Dey ◽

Andreas P M Weber ◽

...

Keyword(s):

Deep Learning ◽

De Novo ◽

State Of The Art ◽

Gene Annotation ◽

Land Plant ◽

Supplementary Information ◽

Current State ◽

Learning Capabilities ◽

Vertebrate Model ◽

Eukaryotic Genomes

Abstract Motivation: Current state-of-the-art tools for the de novo annotation of genes in eukaryotic genomes have to be specifically fitted for each species and still often produce annotations that can be improved much further. The fundamental algorithmic architecture for these tools has remained largely unchanged for about two decades, limiting learning capabilities. Here, we set out to improve the cross-species annotation of genes from DNA sequence alone with the help of deep learning. The goal is to eliminate the dependency on a closely related gene model while also improving the predictive quality in general with a fundamentally new architecture. Results: We present Helixer, a framework for the development and usage of a cross-species deep learning model that improves significantly on performance and generalizability when compared to more traditional methods. We evaluate our approach by building a single vertebrate model for the base-wise annotation of 186 animal genomes and a separate land plant model for 51 plant genomes. Our predictions are shown to be much less sensitive to the length of the genome than those of a current state-of-the-art tool. We also present two novel post-processing techniques that each worked to further strengthen our annotations and show in-depth results of an RNA-Seq based comparison of our predictions. Our method does not yet produce comprehensive gene models but rather outputs base-pair-wise probabilities. Availability: The source code of this work is available at https://github.com/weberlab-hhu/Helixer under the GNU General Public License v3.0. The trained models are available at https://doi.org/10.5281/zenodo.3974409. Supplementary information: Supplementary data are available at Bioinformatics online.

SUITOR: selecting the number of mutational signatures through cross-validation

10.1101/2021.07.28.454269 ◽

2021 ◽

Author(s):

Donghyuk Lee ◽

Difei Wang ◽

Xiaohong R. Yang ◽

Jianxin Shi ◽

Maria Teresa Landi ◽

...

Keyword(s):

Breast Cancer ◽

Cross Validation ◽

Cancer Genomics ◽

De Novo ◽

Optimal Number ◽

Prediction Errors ◽

Mutational Signatures ◽

Breast Cancer Study ◽

Almost All

For de novo mutational signature analysis, the critical first step is to decide how many signatures should be expected in a cancer genomics study. An incorrect number could mislead downstream analyses. Here we present SUITOR (Selecting the nUmber of mutatIonal signaTures thrOugh cRoss-validation), an unsupervised cross-validation method that requires few assumptions and no numerical approximations to select the optimal number of signatures without overfitting the data. In vitro studies and in silico simulations demonstrated that SUITOR can correctly identify signatures, some of which were missed by other widely used methods. Applied to 2,540 whole-genome sequenced tumors across 22 cancer types, SUITOR selected signatures with the smallest prediction errors and almost all signatures of breast cancer selected by SUITOR were validated in an independent breast cancer study. SUITOR is a powerful tool to select the optimal number of mutational signatures, facilitating downstream analyses with etiological or therapeutic importance.
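The idea of choosing the number of signatures by prediction error on held-out data can be sketched as follows. This is not the SUITOR algorithm: it substitutes a squared-error factorization with masked multiplicative updates, and it holds out individual matrix cells in K folds. The function names, fold scheme, and iteration count are assumptions made for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def masked(mask, values):
    """Zero out every cell whose mask entry is 0."""
    return [[m * v for m, v in zip(mr, vr)] for mr, vr in zip(mask, values)]


def masked_nmf(x, mask, k, n_iter=150, seed=0):
    """Fit x ~ w @ h with rank k, using only cells where mask is 1."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    eps = 1e-12  # guards against division by zero
    mx = masked(mask, x)
    for _ in range(n_iter):
        h_t = transpose(h)
        num, den = matmul(mx, h_t), matmul(masked(mask, matmul(w, h)), h_t)
        w = [[wi * a / (b + eps) for wi, a, b in zip(wr, ar, br)]
             for wr, ar, br in zip(w, num, den)]
        w_t = transpose(w)
        num, den = matmul(w_t, mx), matmul(w_t, masked(mask, matmul(w, h)))
        h = [[hi * a / (b + eps) for hi, a, b in zip(hr, ar, br)]
             for hr, ar, br in zip(h, num, den)]
    return w, h


def select_rank(x, candidates, n_folds=5, seed=0):
    """Return (best rank, {rank: mean squared error on held-out cells})."""
    n, m = len(x), len(x[0])
    cells = [(i, j) for i in range(n) for j in range(m)]
    if len(cells) < n_folds:
        raise ValueError("need at least one cell per fold")
    random.Random(seed).shuffle(cells)
    errors = {}
    for k in candidates:
        total = 0.0
        for fold in range(n_folds):
            held = cells[fold::n_folds]  # cells hidden during fitting
            mask = [[1] * m for _ in range(n)]
            for i, j in held:
                mask[i][j] = 0
            w, h = masked_nmf(x, mask, k, seed=seed)
            fit = matmul(w, h)
            total += sum((x[i][j] - fit[i][j]) ** 2 for i, j in held)
        errors[k] = total / len(cells)
    return min(errors, key=errors.get), errors
```

A call such as `select_rank(counts, [1, 2, 3, 4])` returns the candidate rank with the lowest held-out error together with the full error table, so the error curve can be inspected rather than trusting the minimum alone.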

SUITOR: selecting the number of mutational signatures through cross-validation

10.21203/rs.3.rs-67930/v1 ◽

2020 ◽

Author(s):

Donghyuk Lee ◽

Difei Wang ◽

Xiaohong Yang ◽

Jianxin Shi ◽

Maria Teresa Landi ◽

...

Keyword(s):

Cross Validation ◽

Cancer Genomics ◽

De Novo ◽

Optimal Number ◽

Independent Study ◽

Prediction Errors ◽

Mutational Signatures ◽

Cancer Types ◽

Almost All

Abstract For de novo mutational signature analysis, the critical first step is to decide how many signatures should be expected in a cancer genomics study. An incorrect number could mislead downstream analyses. Here we present SUITOR (selecting the number of mutational signatures through cross-validation), an unsupervised cross-validation method that requires few assumptions and no numerical approximations to select the optimal number of signatures without overfitting the data. In vitro studies and in silico simulations demonstrated that SUITOR can correctly identify signatures, some of which were missed by other widely used methods. Applied to 1,536 whole-genome sequenced tumors across eight cancer types, SUITOR selected signatures with the smallest prediction errors and almost all signatures of breast cancer selected by SUITOR were validated in an independent study. SUITOR is a powerful tool to select the optimal number of mutational signatures, facilitating downstream analyses with etiological or therapeutic importance.
