Generating realistic null hypothesis of cancer mutational landscapes using SigProfilerSimulator

Author(s):  
Erik N. Bergstrom ◽  
Mark Barnes ◽  
Iñigo Martincorena ◽  
Ludmil B. Alexandrov

ABSTRACT Performing a statistical test requires a null hypothesis. In cancer genomics, a key challenge is the fast generation of accurate somatic mutational landscapes that can be used as a realistic null hypothesis for making biological discoveries. Here we present SigProfilerSimulator, a powerful tool that is capable of simulating the mutational landscapes of thousands of cancer genomes at different resolutions within seconds. Applying SigProfilerSimulator to 2,144 whole-genome sequenced cancers reveals: (i) that most doublet base substitutions are not due to two adjacent single base substitutions but likely occur as single genomic events; (ii) that an extended sequencing context of ±2 bp is required to more completely capture the patterns of substitution mutational signatures in human cancer; (iii) information on the false-positive discovery rate of commonly used bioinformatics tools for detecting driver genes. SigProfilerSimulator’s breadth of features allows one to construct a tailored null hypothesis and use it for evaluating the accuracy of other bioinformatics tools or for downstream statistical analysis for biological discoveries.

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Erik N. Bergstrom ◽  
Mark Barnes ◽  
Iñigo Martincorena ◽  
Ludmil B. Alexandrov

Abstract Background: Performing a statistical test requires a null hypothesis. In cancer genomics, a key challenge is the fast generation of accurate somatic mutational landscapes that can be used as a realistic null hypothesis for making biological discoveries. Results: Here we present SigProfilerSimulator, a powerful tool that is capable of simulating the mutational landscapes of thousands of cancer genomes at different resolutions within seconds. Applying SigProfilerSimulator to 2144 whole-genome sequenced cancers reveals: (i) that most doublet base substitutions are not due to two adjacent single base substitutions but likely occur as single genomic events; (ii) that an extended sequencing context of ±2 bp is required to more completely capture the patterns of substitution mutational signatures in human cancer; (iii) information on the false-positive discovery rate of commonly used bioinformatics tools for detecting driver genes. Conclusions: SigProfilerSimulator’s breadth of features allows one to construct a tailored null hypothesis and use it for evaluating the accuracy of other bioinformatics tools or for downstream statistical analysis for biological discoveries. SigProfilerSimulator is freely available at https://github.com/AlexandrovLab/SigProfilerSimulator with extensive documentation at https://osf.io/usxjz/wiki/home/.
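To make the idea of a context-preserving null concrete, here is a minimal sketch in plain Python. It is not SigProfilerSimulator's code or API: the function name `simulate_null`, the toy sequence, and the choice of a ±1 bp (trinucleotide) context are assumptions made purely for illustration, and the sketch ignores strand, the substituted base, and chromosome structure. Each observed mutation is moved to a random position that shares its trinucleotide context, so the per-context counts of the simulated sample match the observed ones.

```python
import random
from collections import defaultdict


def simulate_null(sequence, mutation_positions, seed=0):
    """Reassign each mutation to a random position with the same trinucleotide context.

    Hypothetical helper for illustration only. Positions are 0-based and must be
    interior (not the first or last base) and distinct.
    """
    rng = random.Random(seed)

    # Index every interior position of the sequence by its trinucleotide context.
    by_context = defaultdict(list)
    for i in range(1, len(sequence) - 1):
        by_context[sequence[i - 1:i + 2]].append(i)

    # Count how many observed mutations fall in each context.
    needed = defaultdict(int)
    for pos in mutation_positions:
        needed[sequence[pos - 1:pos + 2]] += 1

    simulated = []
    for context, count in needed.items():
        # Sample without replacement so no position is mutated twice.
        simulated.extend(rng.sample(by_context[context], count))
    return sorted(simulated)


if __name__ == "__main__":
    toy_sequence = "ACGTACGTTACGATCGACGT"  # toy data, not a real genome
    observed = [1, 5, 10]                  # all three sit in an "ACG" context
    print(simulate_null(toy_sequence, observed, seed=1))
```

Repeating the simulation with different seeds yields a set of null samples; a statistic of interest (for example, the number of adjacent mutation pairs) can then be compared between the observed data and that null distribution.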


2021 ◽  
Author(s):  
Erik N Bergstrom ◽  
Jens-Christian Luebeck ◽  
Mia Petljak ◽  
Vineet Bafna ◽  
Paul S. Mischel ◽  
...  

Clustered somatic mutations are common in cancer genomes with prior analyses revealing several types of clustered single-base substitutions, including doublet- and multi-base substitutions, diffuse hypermutation termed omikli, and longer strand-coordinated events termed kataegis. Here, we provide a comprehensive characterization of clustered substitutions and clustered small insertions and deletions (indels) across 2,583 whole-genome sequenced cancers from 30 cancer types. While only 3.7% of substitutions and 0.9% of indels were found to be clustered, they contributed 8.4% and 6.9% of substitution and indel drivers, respectively. Multiple distinct mutational processes gave rise to clustered indels including signatures enriched in tobacco smokers and homologous-recombination deficient cancers. Doublet-base substitutions were caused by at least 12 mutational processes, while the majority of multi-base substitutions were generated by either tobacco smoking or exposure to ultraviolet light. Omikli events, previously attributed to the activity of APOBEC3 deaminases, accounted for a large proportion of clustered substitutions. However, only 16.2% of omikli matched APOBEC3 patterns with experimental validation confirming additional mutational processes giving rise to omikli. Kataegis was generated by multiple mutational processes with 76.1% of all kataegic events exhibiting AID/APOBEC3-associated mutational patterns. Co-occurrence of APOBEC3 kataegis and extrachromosomal-DNA (ecDNA) was observed in 31% of samples with ecDNA. Multiple distinct APOBEC3 kataegic events were observed on most mutated ecDNA. ecDNA containing known cancer genes exhibited both positive selection and kataegic hypermutation. Our results reveal the diversity of clustered mutational processes in human cancer and the role of APOBEC3 in recurrently mutating and fueling the evolution of ecDNA.


2021 ◽  
Author(s):  
Mia Petljak ◽  
Kevan Chu ◽  
Alexandra Dananberg ◽  
Erik N. Bergstrom ◽  
Patrick von Morgen ◽  
...  

ABSTRACT The APOBEC3 family of cytidine deaminases is widely speculated to be a major source of somatic mutations in cancer [1–3]. However, causal links between APOBEC3 enzymes and mutations in human cancer cells have not been established. The identity of the APOBEC3 paralog(s) that may act as prime drivers of mutagenesis and the mechanisms underlying different APOBEC3-associated mutational signatures are unknown. To directly investigate the roles of APOBEC3 enzymes in cancer mutagenesis, candidate APOBEC3 genes were deleted from cancer cell lines recently found to naturally generate APOBEC3-associated mutations in episodic bursts [4]. Deletion of the APOBEC3A paralog severely diminished the acquisition of mutations of speculative APOBEC3 origins in breast cancer and lymphoma cell lines. APOBEC3 mutational burdens were undiminished in APOBEC3B knockout cell lines. APOBEC3A deletion reduced the appearance of the clustered mutation types kataegis and omikli, which are frequently found in cancer genomes. The uracil glycosylase UNG and the translesion polymerase REV1 were found to play critical roles in the generation of mutations induced by APOBEC3A. These data represent the first evidence for a long-postulated hypothesis that APOBEC3 deaminases generate prevalent clustered and non-clustered mutational signatures in human cancer cells, identify APOBEC3A as a driver of episodic mutational bursts, and dissect the roles of the relevant enzymes in generating the associated mutations in breast cancer and B cell lymphoma cell lines.


2021 ◽  
Author(s):  
John Maciejowski ◽  
Mia Petljak ◽  
Kevan Chu ◽  
Alexandra Dananberg ◽  
Erik Bergstrom ◽  
...  

Abstract The APOBEC3 family of cytidine deaminases is widely speculated to be a major source of somatic mutations in cancer [1–3]. However, causal links between APOBEC3 enzymes and mutations in human cancer cells have not been established. The identity of the APOBEC3 paralog(s) that may act as prime drivers of mutagenesis and the mechanisms underlying different APOBEC3-associated mutational signatures are unknown. To directly investigate the roles of APOBEC3 enzymes in cancer mutagenesis, candidate APOBEC3 genes were deleted from cancer cell lines recently found to naturally generate APOBEC3-associated mutations in episodic bursts [4]. Deletion of the APOBEC3A paralog severely diminished the acquisition of mutations of speculative APOBEC3 origins in breast cancer and lymphoma cell lines. APOBEC3 mutational burdens were undiminished in APOBEC3B knockout cell lines. APOBEC3A deletion reduced the appearance of the clustered mutation types kataegis and omikli, which are frequently found in cancer genomes. The uracil glycosylase UNG and the translesion polymerase REV1 were found to play critical roles in the generation of mutations induced by APOBEC3A. These data represent the first evidence for a long-postulated hypothesis that APOBEC3 deaminases generate prevalent clustered and non-clustered mutational signatures in human cancer cells, identify APOBEC3A as a driver of episodic mutational bursts, and dissect the roles of the relevant enzymes in generating the associated mutations in breast cancer and B cell lymphoma cell lines.


2018 ◽  
Author(s):  
Ludmil B Alexandrov ◽  
Jaegil Kim ◽  
Nicholas J Haradhvala ◽  
Mi Ni Huang ◽  
Alvin WT Ng ◽  
...  

ABSTRACT Somatic mutations in cancer genomes are caused by multiple mutational processes each of which generates a characteristic mutational signature. Using 84,729,690 somatic mutations from 4,645 whole cancer genome and 19,184 exome sequences encompassing most cancer types we characterised 49 single base substitution, 11 doublet base substitution, four clustered base substitution, and 17 small insertion and deletion mutational signatures. The substantial dataset size compared to previous analyses enabled discovery of new signatures, separation of overlapping signatures and decomposition of signatures into components that may represent associated, but distinct, DNA damage, repair and/or replication mechanisms. Estimation of the contribution of each signature to the mutational catalogues of individual cancer genomes revealed associations with exogenous and endogenous exposures and defective DNA maintenance processes. However, many signatures are of unknown cause. This analysis provides a systematic perspective on the repertoire of mutational processes contributing to the development of human cancer including a comprehensive reference set of mutational signatures in human cancer.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Marleen M. Nieboer ◽  
Luan Nguyen ◽  
Jeroen de Ridder

Abstract Over the past years, large consortia have been established to fuel the sequencing of whole genomes of many cancer patients. Despite the increased abundance of tools to study the impact of SNVs, non-coding SVs have been largely ignored in these data. Here, we introduce svMIL2, an improved version of our Multiple Instance Learning-based method to study the effect of somatic non-coding SVs disrupting boundaries of TADs and CTCF loops in 1646 cancer genomes. We demonstrate that svMIL2 predicts pathogenic non-coding SVs with an average AUC of 0.86 across 12 cancer types, and identifies non-coding SVs affecting well-known driver genes. The disruption of active (super) enhancers in open chromatin regions appears to be a common mechanism by which non-coding SVs exert their pathogenicity. Finally, our results reveal that the contribution of pathogenic non-coding SVs as opposed to driver SNVs may vary greatly between cancers, with notably high numbers of genes being disrupted by pathogenic non-coding SVs in ovarian and pancreatic cancer. Taken together, our machine learning method offers a potent way to prioritize putatively pathogenic non-coding SVs and leverage non-coding SVs to identify driver genes. Moreover, our analysis of 1646 cancer genomes demonstrates the importance of including non-coding SVs in cancer diagnostics.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Cesim Erten ◽  
Aissa Houdjedj ◽  
Hilal Kazan

Abstract Background: Recent cancer genomic studies have generated detailed molecular data on a large number of cancer patients. A key remaining problem in cancer genomics is the identification of driver genes. Results: We propose BetweenNet, a computational approach that integrates genomic data with a protein-protein interaction network to identify cancer driver genes. BetweenNet utilizes a measure based on betweenness centrality on patient-specific networks to identify the so-called outlier genes that correspond to dysregulated genes for each patient. Setting up the relationship between the mutated genes and the outliers through a bipartite graph, it employs a random-walk process on the graph, which provides the final prioritization of the mutated genes. We compare BetweenNet against state-of-the-art cancer gene prioritization methods on lung, breast, and pan-cancer datasets. Conclusions: Our evaluations show that BetweenNet is better at recovering known cancer genes based on multiple reference databases. Additionally, we show that the GO terms and the reference pathways enriched in BetweenNet ranked genes and those that are enriched in known cancer genes overlap significantly when compared to the overlaps achieved by the rankings of the alternative methods.
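The final ranking step described above, a random walk over a bipartite graph linking mutated genes to outlier genes, can be illustrated with a small sketch. This is not BetweenNet's implementation: the abstract does not specify the walk, so the restart probability, the uniform start over mutated genes, the function name `random_walk_scores`, and the gene labels are all assumptions chosen for illustration.

```python
def random_walk_scores(edges, mutated_genes, restart=0.3, iterations=100):
    """Score mutated genes with a random walk with restart on an undirected graph.

    Hypothetical helper for illustration only. `edges` is a list of
    (mutated_gene, outlier_gene) pairs forming a bipartite graph.
    """
    # Build an undirected adjacency structure from the edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seeds = [g for g in mutated_genes if g in neighbors]
    if not seeds:
        raise ValueError("no mutated gene appears in the graph")

    # The walker starts (and restarts) uniformly on the mutated genes.
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in neighbors}
    prob = dict(start)

    for _ in range(iterations):
        # With probability `restart` jump back to a seed, otherwise
        # move to a uniformly chosen neighbor of the current node.
        nxt = {n: restart * start[n] for n in neighbors}
        for n, nbrs in neighbors.items():
            share = (1.0 - restart) * prob[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        prob = nxt

    return {g: prob[g] for g in seeds}


if __name__ == "__main__":
    # Toy graph: M1 and M2 are mutated genes, O1 and O2 are outlier genes.
    toy_edges = [("M1", "O1"), ("M1", "O2"), ("M2", "O2")]
    scores = random_walk_scores(toy_edges, ["M1", "M2"])
    for gene in sorted(scores, key=scores.get, reverse=True):
        print(gene, round(scores[gene], 4))
```

In this toy graph the mutated gene connected to more outliers receives the higher score, which is the intuition behind using the walk to prioritize mutated genes.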


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ege Ülgen ◽  
O. Uğur Sezerman

Abstract Background: Cancer develops due to “driver” alterations. Numerous approaches exist for predicting cancer drivers from cohort-scale genomics data. However, methods for personalized analysis of driver genes are underdeveloped. In this study, we developed a novel personalized/batch analysis approach for driver gene prioritization utilizing somatic genomics data, called driveR. Results: Combining genomics information and prior biological knowledge, driveR accurately prioritizes cancer driver genes via a multi-task learning model. Testing on 28 different datasets, this study demonstrates that driveR performs adequately, achieving a median AUC of 0.684 (range 0.651–0.861) on the 28 batch analysis test datasets, and a median AUC of 0.773 (range 0–1) on the 5157 personalized analysis test samples. Moreover, it outperforms existing approaches, achieving a significantly higher median AUC than all of MutSigCV (Wilcoxon rank-sum test p < 0.001), DriverNet (p < 0.001), OncodriveFML (p < 0.001) and MutPanning (p < 0.001) on batch analysis test datasets, and a significantly higher median AUC than DawnRank (p < 0.001) and PRODIGY (p < 0.001) on personalized analysis datasets. Conclusions: This study demonstrates that the proposed method is an accurate and easy-to-utilize approach for prioritizing driver genes in cancer genomes in personalized or batch analyses. driveR is available on CRAN: https://cran.r-project.org/package=driveR.
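For readers unfamiliar with the AUC figures quoted above, the following sketch shows how an AUC can be computed for a ranked gene list. It is not taken from driveR: the function name `auc_from_scores` and the toy scores and labels are hypothetical. The AUC is computed as the fraction of (positive, negative) pairs in which the positive item receives the higher score, with ties counted as half.

```python
def auc_from_scores(scores, labels):
    """Return the AUC: the probability that a random positive outscores a random negative.

    Hypothetical helper for illustration only. `labels` uses 1 for a known
    driver gene and 0 otherwise; ties between scores count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    # Compare every positive score against every negative score.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.3, 0.1]  # toy prioritization scores
    toy_labels = [1, 0, 1, 0]          # 1 = known driver in this toy example
    print(auc_from_scores(toy_scores, toy_labels))  # prints 0.75
```

An AUC of 0.5 corresponds to a ranking no better than chance, and 1.0 to a ranking that places every known driver above every non-driver.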

