Mutational signature learning with supervised negative binomial non-negative matrix factorization

2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i154-i160 ◽  
Author(s):  
Xinrui Lyu ◽  
Jean Garret ◽  
Gunnar Rätsch ◽  
Kjong-Van Lehmann

Abstract Motivation Understanding the underlying mutational processes of cancer patients has been a long-standing goal in the community and promises to provide new insights that could improve cancer diagnoses and treatments. Mutational signatures are summaries of the mutational processes, and improving the derivation of mutational signatures can yield new discoveries previously obscured by technical and biological confounders. Results from existing mutational signature extraction methods depend on the size of the available patient cohort and focus solely on the analysis of mutation count data, without exploiting metadata. Results Here we present a supervised method that utilizes cancer type as metadata to extract more distinctive signatures. More specifically, we use a negative binomial non-negative matrix factorization and add a support vector machine loss. We show that mutational signatures extracted by our proposed method have a lower reconstruction error and are designed to be more predictive of cancer type than those generated by unsupervised methods. Unlike existing unsupervised signature extraction methods, this design reduces the need for elaborate post-processing strategies to recover most of the known signatures. Signatures extracted by a supervised model used in conjunction with cancer-type labels are also more robust, especially when using small and potentially cancer-type limited patient cohorts. Finally, we adapted our model such that molecular features can be utilized to derive a corresponding mutational signature. We used APOBEC expression and MUTYH mutation status to demonstrate the possibilities that arise from this ability. We conclude that our method, which exploits available metadata, improves the quality of mutational signatures and helps derive more interpretable representations. Availability and implementation https://github.com/ratschlab/SNBNMF-mutsig-public. 
Supplementary information Supplementary data are available at Bioinformatics online.
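
The combination of a negative binomial reconstruction term with a support vector machine loss, as described in the abstract above, can be illustrated with a small objective function. This is only a sketch under stated assumptions, not the authors' implementation: the dispersion `r`, the weight `lam`, the binary labels in {-1, +1}, and the linear classifier `(w, b)` acting on the exposure matrix `H` are all hypothetical choices made for illustration.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def nb_nll(V, M, r):
    """Negative binomial negative log-likelihood of counts V with mean M.

    r is the dispersion parameter (assumed shared by all entries).
    """
    total = 0.0
    for v_row, m_row in zip(V, M):
        for k, mu in zip(v_row, m_row):
            mu = max(mu, 1e-12)  # guard against log(0)
            total -= (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                      + r * math.log(r / (r + mu))
                      + k * math.log(mu / (r + mu)))
    return total


def hinge_loss(H, labels, w, b):
    """Binary SVM hinge loss; each column of H holds one sample's exposures."""
    loss = 0.0
    for j, y in enumerate(labels):
        score = sum(w_k * H[k][j] for k, w_k in enumerate(w)) + b
        loss += max(0.0, 1.0 - y * score)
    return loss


def supervised_objective(V, W, H, labels, w, b, r=10.0, lam=1.0):
    """Reconstruction term plus lam times the classification term."""
    return nb_nll(V, matmul(W, H), r) + lam * hinge_loss(H, labels, w, b)
```

In this sketch `V` is the mutation count matrix (mutation types by samples), `W` holds the signatures and `H` the exposures. A multi-class cancer-type label would need a multi-class hinge loss, which is omitted here for brevity.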

2021 ◽  
Author(s):  
David Chen ◽  
Gurjit S. Randhawa ◽  
Maximillian P.M. Soltysiak ◽  
Camila P.E. de Souza ◽  
Lila Kari ◽  
...  

Abstract Summary SomaticSiMu is an in silico simulator of mutations in genome sequences. SomaticSiMu simulates single and double base substitutions, and single base insertions and deletions in an input genomic sequence to mimic mutational signatures. The tool is the first mutational signature simulator featuring a graphical user interface, control of mutation rates, and built-in visualization tools of the simulated mutations. SomaticSiMu generates simulated FASTA sequences and mutational catalogs with imposed mutational signatures. The reliability of SomaticSiMu to simulate mutational signatures was affirmed by supervised machine learning classification of simulated sequences with different mutation types and burdens, and mutational signature extraction from simulated mutational catalogs. SomaticSiMu is useful in validating sequence classification and mutational signature extraction tools. Availability and Implementation SomaticSiMu is written in Python 3.8.3. The open-source code, documentation, and tutorials are available at https://github.com/HillLab/SomaticSiMu under the terms of the Creative Commons Attribution 4.0 International License. Supplementary information Supplementary data are appended.
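
The idea of imposing a signature on a sequence can be sketched in a few lines. The function below is a hypothetical illustration and does not reproduce SomaticSiMu's code or interface: it assumes a signature is given as a dictionary from `(trinucleotide, alternate base)` to a per-site mutation probability, handles single base substitutions only, and returns both the mutated sequence and a catalog of the mutations it introduced.

```python
import random


def simulate_sbs(sequence, signature, rng):
    """Impose single base substitutions on a sequence.

    signature: dict {(trinucleotide, alt_base): per-site probability}
               (an assumed, simplified representation of a signature).
    rng:       a random.Random instance, passed in for reproducibility.
    Returns the mutated sequence and a catalog of mutation counts.
    """
    mutated = list(sequence)
    catalog = {}
    # Only interior positions have a complete trinucleotide context.
    for i in range(1, len(sequence) - 1):
        context = sequence[i - 1:i + 2]  # context is read from the original sequence
        for alt in "ACGT":
            p = signature.get((context, alt), 0.0)
            if alt != context[1] and rng.random() < p:
                mutated[i] = alt
                catalog[(context, alt)] = catalog.get((context, alt), 0) + 1
                break  # at most one substitution per position
    return "".join(mutated), catalog
```

A call such as `simulate_sbs("ACTGACTGA", {("CTG", "A"): 0.5}, random.Random(1))` mutates each T in a CTG context to A with probability 0.5 and leaves all other positions unchanged.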


2018 ◽  
Vol 35 (8) ◽  
pp. 1395-1403 ◽  
Author(s):  
Yuan Luo ◽  
Chengsheng Mao ◽  
Yiben Yang ◽  
Fei Wang ◽  
Faraz S Ahmad ◽  
...  

Abstract Motivation Hypertension is a heterogeneous syndrome in need of improved subtyping using phenotypic and genetic measurements with the goal of identifying subtypes of patients who share similar pathophysiologic mechanisms and may respond more uniformly to targeted treatments. Existing machine learning approaches often face challenges in integrating phenotype and genotype information and presenting to clinicians an interpretable model. We aim to provide informed patient stratification based on phenotype and genotype features. Results In this article, we present a hybrid non-negative matrix factorization (HNMF) method to integrate phenotype and genotype information for patient stratification. HNMF simultaneously approximates the phenotypic and genetic feature matrices using different appropriate loss functions, and generates patient subtypes, phenotypic groups and genetic groups. Unlike previous methods, HNMF approximates the phenotypic matrix under Frobenius loss, and the genetic matrix under Kullback-Leibler (KL) loss. We propose an alternating projected gradient method to solve the approximation problem. Simulation shows HNMF converges fast and accurately to the true factor matrices. On a real-world clinical dataset, we used the patient factor matrix as features and examined the association of these features with indices of cardiac mechanics. We compared HNMF with six different models using phenotype or genotype features alone, with or without NMF, or using joint NMF with only one type of loss. We also compared HNMF with three recently published methods for integrative clustering analysis, including iClusterBayes, Bayesian joint analysis and JIVE. HNMF significantly outperforms all comparison models. HNMF also reveals intuitive phenotype–genotype interactions that characterize cardiac abnormalities. Availability and implementation Our code is publicly available on GitHub at https://github.com/yuanluo/hnmf. 
Supplementary information Supplementary data are available at Bioinformatics online.
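
The hybrid loss described above, Frobenius for the phenotypic matrix and KL for the genetic matrix, can be written down as a single objective. The sketch below is an assumption-laden illustration rather than the HNMF code: it assumes both matrices share one patient factor matrix `U`, uses the generalized KL divergence, and introduces a hypothetical weight `alpha` to balance the two terms.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def frobenius_loss(X, Y):
    """Squared Frobenius norm of X - Y."""
    return sum((x - y) ** 2 for xr, yr in zip(X, Y) for x, y in zip(xr, yr))


def kl_loss(X, Y, eps=1e-12):
    """Generalized KL divergence D(X || Y) for non-negative matrices."""
    total = 0.0
    for xr, yr in zip(X, Y):
        for x, y in zip(xr, yr):
            y = max(y, eps)  # guard against division by zero
            if x > 0:
                total += x * math.log(x / y)
            total += y - x
    return total


def hybrid_loss(X_pheno, X_geno, U, V_pheno, V_geno, alpha=1.0):
    """Frobenius loss on phenotypes plus alpha times KL loss on genotypes.

    U is the shared patient factor matrix (patients x factors); V_pheno and
    V_geno map factors to phenotypic and genetic features respectively.
    """
    return (frobenius_loss(X_pheno, matmul(U, V_pheno))
            + alpha * kl_loss(X_geno, matmul(U, V_geno)))
```

An alternating scheme would hold two of `U`, `V_pheno`, `V_geno` fixed while taking a projected gradient step on the third; that optimization loop is left out here.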


2018 ◽  
Author(s):  
Laura Cantini ◽  
Ulykbek Kairov ◽  
Aurélien de Reyniès ◽  
Emmanuel Barillot ◽  
François Radvanyi ◽  
...  

Abstract Motivation Matrix factorization methods are widely exploited in order to reduce dimensionality of transcriptomic datasets to the action of few hidden factors (metagenes). Applying such methods to similar independent datasets should yield reproducible inter-series outputs, though this has never been demonstrated. Results We systematically test state-of-the-art methods of matrix factorization on several transcriptomic datasets of the same cancer type. Inspired by concepts of evolutionary bioinformatics, we design a new framework based on Reciprocally Best Hit (RBH) graphs in order to benchmark the method’s reproducibility. We show that a particular protocol of application of Independent Component Analysis (ICA), accompanied by a stabilisation procedure, leads to a significant increase in the inter-series output reproducibility. Moreover, we show that the signals detected through this method are systematically more interpretable than those of other state-of-the-art methods. We developed a user-friendly tool BIODICA for performing the Stabilized ICA-based RBH meta-analysis. We apply this methodology to the study of colorectal cancer (CRC) for which 14 independent publicly available transcriptomic datasets can be collected. The resulting RBH graph maps the landscape of interconnected factors that can be associated to biological processes or to technological artefacts. These factors can be used as clinical biomarkers or robust and tumor-type specific transcriptomic signatures of tumoral cells or tumoral microenvironment. Their intensities in different samples shed light on the mechanistic basis of CRC molecular subtyping. Availability The BIODICA tool is available from https://github.com/LabBandSB/… Supplementary information Supplementary data are available at Bioinformatics online.


Blood ◽  
2018 ◽  
Vol 132 (Supplement 1) ◽  
pp. 3115-3115
Author(s):  
Kate E Ridout ◽  
Pauline Robbe ◽  
Doriane Cavalieri ◽  
Jennifer Becq ◽  
Miao He ◽  
...  

Abstract Background Chronic Lymphocytic Leukemia (CLL) is characterised by a highly heterogeneous natural history and treatment response. Indeed, 50% of immunoglobulin heavy chain variable region (IgHV) hypermutated patients have an excellent progression free survival (PFS) after chemoimmunotherapy. Conversely, 25% of FCR treated patients relapse within 24 months (high risk CLL). Recent studies have shown that complex karyotype with or without TP53 disruption predicts relapse after BCL2 therapy and BTK inhibitors. However, TP53 is the only marker for which routine testing is available. Overall, nearly 80% of patients relapsing after frontline FCR do not present a known poor risk genomic marker. Additional candidate genomic predictors of poor outcome including mutations in coding regions of NOTCH1, SF3B1 and RPS15, non-coding regions of NOTCH1 and enhancer regions of PAX5, telomere length, IgHV status, and DNA Damage Repair (DDR) germline mutations including TP53 and ATM have been reported in CLL. Further, the role of mutational signatures and regions of kataegis also merit additional investigation in progressive CLL. Evaluating all candidate predictors requires complex, time-consuming, multi-modality testing outside the scope of routine clinical diagnostic practice; however, in isolation, each has low predictive value. Here, we show preliminary data on a novel patient stratification method based on whole genome sequencing (WGS) data incorporating multiple genomic features in a single test. Patients and Methods Tumor (peripheral blood) and germline (saliva) samples were collected from 321 patients from 6 UK trials via the Genomics England CLL pilot: ARCTIC (n=61), AdMIRe (n=64), CLL 210 (n=30), CLEAR (n=12), RIAltO (n=88) and FLAIR (n=66). We performed WGS on the HiSeqX (Illumina). 
After read alignment, we detected somatic variants using Strelka 2.4.7 for small variant detection (SNVs and indels), Manta 0.28.0 for structural variant (SV) detection, and Canvas 1.3.1 for copy number variant (CNV) detection (Illumina). Non-coding regions were annotated with information from primary CLL, CLL cell lines and B-cell ENCODE databases. Mutational signatures and putative regions of kataegis were calculated based on Alexandrov et al. (Nature, 2013) and Lawrence et al. (Nature, 2013). Telomere lengths were assessed using Telomerecat. Data aggregation was performed using contingency tables combined with non-negative matrix factorization. Results Mean coverage was 94.2X for tumor and 28.5X for germline samples. We found a median of 9172 SNPs/sample after filtering and 2348 indels/sample across 321 patients. High risk CLL was enriched for genomic complexity and poor prognostic mutations. The most frequently mutated genes were SF3B1 (17%), TP53 (13%), NOTCH1 (12%), IGLL5 (12%), and ATM (11%). Analysis of non-coding regions using DNA methylation markers, ATAC-seq and Hi-C revealed potential candidate regions associated with early relapse. Using CNA and SV data, we identified interesting patterns of genomic complexity and structural variants, including a trend towards enrichment of del8p in Relapse/Refractory and FCR non-responders. Additionally, we investigated mutation signatures and kataegis across coding and non-coding regions of the genome. We correlated exonic regions of DDR genes in germline data with clinical outcomes and extended this to genes mutated in both tumor and germline data, termed germline-tumor double-hits. We examined the relationship between the Alexandrov hypermutation signature, IgHV status (determined by % homology to the reference genome) and PFS, and combined mutational density at the Ig locus with mutation signature aiming to predict IgHV status. 
Finally, we produced a binary contingency matrix, using non-negative matrix factorization to cluster the samples. This method highlighted patient groups with shared genomic profiles. Conclusion We present preliminary data on a patient stratification method derived from WGS of 321 paired germline and CLL trial samples. Our predictive signature includes driver gene mutations, CNAs, IgHV status, genomic complexity, telomere length, overall mutation burden and genes with germline-tumor double-hits. Our comprehensive, NGS-based patient stratification attempts to predict patient outcome in a single sequencing run. Disclosures Becq: Illumina: Employment. He: Illumina: Employment. Ross: Illumina: Employment. Bentley: Illumina: Employment. Pettitt: Celgene: Research Funding; Gilead: Research Funding; Roche: Research Funding; GSK/Novartis: Research Funding; Napp: Research Funding; AstraZeneca: Research Funding; Chugai: Research Funding. Hillmen: Novartis: Research Funding; Gilead Sciences, Inc.: Honoraria, Research Funding; Alexion Pharmaceuticals, Inc: Consultancy, Honoraria; F. Hoffmann-La Roche Ltd: Research Funding; Celgene: Research Funding; Acerta: Membership on an entity's Board of Directors or advisory committees; Abbvie: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; Pharmacyclics: Research Funding; Janssen: Consultancy, Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding. Schuh: Giles, Roche, Janssen, AbbVie: Honoraria.
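
Clustering samples by factorizing a binary contingency matrix, as mentioned in the abstract above, can be sketched with a plain NMF. The code below is an illustrative assumption, not the study's pipeline: it uses the standard multiplicative updates for the Frobenius objective, a hypothetical rank `k`, random initialization with a fixed seed, and assigns each sample (row) to the factor with the largest weight.

```python
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(X, k, iterations=500, seed=0, eps=1e-9):
    """Factorize X (samples x features) as W @ H with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update H with W fixed.
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Update W with H fixed.
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H


def cluster_samples(W):
    """Assign each sample to the factor with the largest weight in its row of W."""
    return [max(range(len(row)), key=row.__getitem__) for row in W]
```

Each row of the input would be one patient and each column one binary genomic feature (for example, presence of a driver mutation); the choice of `k` and the use of a hard argmax assignment are simplifications.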


2019 ◽  
Vol 35 (21) ◽  
pp. 4307-4313 ◽  
Author(s):  
Laura Cantini ◽  
Ulykbek Kairov ◽  
Aurélien de Reyniès ◽  
Emmanuel Barillot ◽  
François Radvanyi ◽  
...  

Abstract Motivation Matrix factorization (MF) methods are widely used in order to reduce dimensionality of transcriptomic datasets to the action of few hidden factors (metagenes). MF algorithms have never been compared based on the between-datasets reproducibility of their outputs in similar independent datasets. Lack of this knowledge might have a crucial impact when generalizing the predictions made in a study to others. Results We systematically test widely used MF methods on several transcriptomic datasets collected from the same cancer type (14 colorectal, 8 breast and 4 ovarian cancer transcriptomic datasets). Inspired by concepts of evolutionary bioinformatics, we design a novel framework based on Reciprocally Best Hit (RBH) graphs in order to benchmark the MF methods for their ability to produce generalizable components. We show that a particular protocol of application of independent component analysis (ICA), accompanied by a stabilization procedure, leads to a significant increase in the between-datasets reproducibility. Moreover, we show that the signals detected through this method are systematically more interpretable than those of other standard methods. We developed a user-friendly tool for performing the Stabilized ICA-based RBH meta-analysis. We apply this methodology to the study of colorectal cancer (CRC) for which 14 independent transcriptomic datasets can be collected. The resulting RBH graph maps the landscape of interconnected factors associated to biological processes or to technological artifacts. These factors can be used as clinical biomarkers or robust and tumor-type specific transcriptomic signatures of tumoral cells or tumoral microenvironment. Their intensities in different samples shed light on the mechanistic basis of CRC molecular subtyping. Availability and implementation The RBH construction tool is available from http://goo.gl/DzpwYp Supplementary information Supplementary data are available at Bioinformatics online.
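
The Reciprocally Best Hit construction can be made concrete for two datasets. This is a minimal sketch under assumptions not stated in the abstract: components are compared over a shared, identically ordered gene list, similarity is the absolute Pearson correlation (absolute because the sign of an ICA component is arbitrary), and no correlation threshold is applied.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def reciprocal_best_hits(components_a, components_b):
    """Return pairs (i, j) where component i of dataset A and component j of
    dataset B are each other's best match by absolute correlation."""
    sim = [[abs(pearson(a, b)) for b in components_b] for a in components_a]
    best_for_a = [max(range(len(components_b)), key=lambda j: row[j])
                  for row in sim]
    best_for_b = [max(range(len(components_a)), key=lambda i: sim[i][j])
                  for j in range(len(components_b))]
    return [(i, j) for i, j in enumerate(best_for_a) if best_for_b[j] == i]
```

Repeating this over every pair of datasets and taking the union of the returned pairs as edges would give a graph whose nodes are components; how the published tool weights or filters these edges is not specified here.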


2019 ◽  
Vol 39 (03) ◽  
pp. 334-340 ◽  
Author(s):  
Jean-Charles Nault ◽  
Eric Letouzé

Abstract Each hepatocellular carcinoma displays dozens of mutations in driver and passenger genes. The analysis of the types of substitutions and their trinucleotide context defines mutational signatures that recapitulate the endogenous and exogenous mutational processes operative in tumor cells. Aristolochic acid is present in plants from the genus Aristolochia and causes chronic nephropathy. Moreover, aristolochic acid has genotoxic properties responsible for the occurrence of urothelial carcinoma. Metabolites of aristolochic acid form DNA adducts on adenine residues leading to a specific mutational signature with almost exclusively A:T to T:A transversions, preferentially in a CTG trinucleotide context. Interestingly, this mutational fingerprint has been identified in a subset of hepatocellular carcinomas suggesting that aristolochic acid is a new risk factor for hepatocellular carcinoma. More data are warranted to capture the real impact of exposure to aristolochic acid in hepatocellular carcinoma occurrence worldwide.
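
How a substitution and its trinucleotide context are turned into a signature channel can be shown with a short example. The sketch follows the widely used convention of reporting each substitution from the pyrimidine strand, which yields 6 substitution types in 16 contexts (96 channels); the function names are hypothetical and the input format (reference trinucleotide plus alternate base) is an assumption.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sbs_channel(ref_trinucleotide, alt_base):
    """Return the pyrimidine-normalized channel, e.g. 'C[T>A]G'.

    ref_trinucleotide: reference bases at positions -1, 0, +1.
    alt_base:          mutated base at position 0.
    """
    ref = ref_trinucleotide[1]
    if ref == alt_base:
        raise ValueError("alternate base equals reference base")
    if ref in "AG":
        # Purine reference: report the mutation on the opposite strand.
        ref_trinucleotide = "".join(
            COMPLEMENT[base] for base in reversed(ref_trinucleotide))
        alt_base = COMPLEMENT[alt_base]
        ref = ref_trinucleotide[1]
    return f"{ref_trinucleotide[0]}[{ref}>{alt_base}]{ref_trinucleotide[2]}"


def count_channels(mutations):
    """Tally channels for an iterable of (ref_trinucleotide, alt_base) pairs."""
    counts = {}
    for trinucleotide, alt in mutations:
        channel = sbs_channel(trinucleotide, alt)
        counts[channel] = counts.get(channel, 0) + 1
    return counts
```

With this convention, an A>T change in a CAG context and a T>A change in a CTG context both map to the channel `C[T>A]G`, which is the context the abstract associates with aristolochic acid exposure.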


2021 ◽  
Vol 11 (3) ◽  
pp. 1040
Author(s):  
Seokjin Lee ◽  
Minhan Kim ◽  
Seunghyeon Shin ◽  
Sooyoung Park ◽  
Youngho Jeong

In this paper, feature extraction methods are developed based on the non-negative matrix factorization (NMF) algorithm to be applied in weakly supervised sound event detection. Recently, the development of various features and systems has been attempted to tackle the problems of acoustic scene classification and sound event detection. However, most of these systems use data-independent spectral features, e.g., Mel-spectrogram, log-Mel-spectrum, and gammatone filterbank. Some data-dependent feature extraction methods, including the NMF-based methods, recently demonstrated the potential to tackle the problems mentioned above for long-term acoustic signals. In this paper, we further develop the recently proposed NMF-based feature extraction method to enable its application in weakly supervised sound event detection. To achieve this goal, we develop a strategy for training the frequency basis matrix using a heterogeneous database consisting of strongly- and weakly-labeled data. Moreover, we develop a non-iterative version of the NMF-based feature extraction method so that the proposed feature extraction method can be applied as a part of the model structure similar to the modern “on-the-fly” transform method for the Mel-spectrogram. To detect the sound events, the temporal basis is calculated using the NMF method and then used as a feature for the mean-teacher-model-based classifier. The results are improved for the event-wise post-processing method. To evaluate the proposed system, simulations of the weakly supervised sound event detection were conducted using the Detection and Classification of Acoustic Scenes and Events 2020 Task 4 database. The results reveal that the proposed system has F1-score performance comparable with the Mel-spectrogram and gammatonegram and exhibits 3–5% better performance than the log-Mel-spectrum and constant-Q transform.


2019 ◽  
Vol 35 (14) ◽  
pp. i492-i500
Author(s):  
Welles Robinson ◽  
Roded Sharan ◽  
Mark D M Leiserson

Abstract Motivation Somatic mutations result from processes related to DNA replication or environmental/lifestyle exposures. Knowing the activity of mutational processes in a tumor can inform personalized therapies, early detection, and understanding of tumorigenesis. Computational methods have revealed 30 validated signatures of mutational processes active in human cancers, where each signature is a pattern of single base substitutions. However, half of these signatures have no known etiology, and some similar signatures have distinct etiologies, making patterns of mutation signature activity hard to interpret. Existing mutation signature detection methods do not consider tumor-level clinical/demographic (e.g. smoking history) or molecular features (e.g. inactivations to DNA damage repair genes). Results To begin to address these challenges, we present the Tumor Covariate Signature Model (TCSM), the first method to directly model the effect of observed tumor-level covariates on mutation signatures. To this end, our model uses methods from Bayesian topic modeling to change the prior distribution on signature exposure conditioned on a tumor’s observed covariates. We also introduce methods for imputing covariates in held-out data and for evaluating the statistical significance of signature-covariate associations. On simulated and real data, we find that TCSM outperforms both non-negative matrix factorization and topic modeling-based approaches, particularly in recovering the ground truth exposure to similar signatures. We then use TCSM to discover five mutation signatures in breast cancer and predict homologous recombination repair deficiency in held-out tumors. We also discover four signatures in a combined melanoma and lung cancer cohort—using cancer type as a covariate—and provide statistical evidence to support earlier claims that three lung cancers from The Cancer Genome Atlas are misdiagnosed metastatic melanomas. 
Availability and implementation TCSM is implemented in Python 3 and available at https://github.com/lrgr/tcsm, along with a data workflow for reproducing the experiments in the paper. Supplementary information Supplementary data are available at Bioinformatics online.

