LOTUS: a Single- and Multitask Machine Learning Algorithm for the Prediction of Cancer Driver Genes

Abstract Cancer driver genes, i.e., oncogenes and tumor suppressor genes, are involved in the acquisition of important functions in tumors, providing a selective growth advantage, allowing uncontrolled proliferation and avoiding apoptosis. It is therefore important to identify these driver genes, both for the fundamental understanding of cancer and to help find new therapeutic targets. Although the most frequently mutated driver genes have been identified, it is believed that many more remain to be discovered, particularly driver genes specific to some cancer types.

In this paper we propose a new computational method called LOTUS to predict new driver genes. LOTUS is a machine learning-based approach that integrates various types of data in a versatile manner, including information about gene mutations and protein-protein interactions. In addition, LOTUS can predict cancer driver genes in a pan-cancer setting as well as for specific cancer types, using a multitask learning strategy to share information across cancer types.

We empirically show that LOTUS outperforms three other state-of-the-art driver gene prediction methods, both in terms of intrinsic consistency and prediction accuracy, and provide predictions of new cancer genes across many cancer types.

Author summary Cancer development is driven by mutations and dysfunction of important, so-called cancer driver genes, which could be the targets of targeted therapies. While a number of such cancer genes have already been identified, it is believed that many more remain to be discovered. To help prioritize experimental investigations of candidate genes, several computational methods have been proposed to rank promising candidates based on their mutations in large cohorts of cancer cases, or on their interactions with known driver genes in biological networks. We propose LOTUS, a new computational approach to identify genes with high oncogenic potential. LOTUS implements a machine learning approach to learn an oncogenic potential score from known driver genes, and brings two novelties compared to existing methods. First, it makes it easy to combine heterogeneous information into the scoring function, which we illustrate by learning a scoring function from both known mutations in large cancer cohorts and interactions in biological networks. Second, using a multitask learning strategy, it can predict different driver genes for different cancer types, while sharing information between them to improve the prediction for every type. We provide experimental results showing that LOTUS significantly outperforms several state-of-the-art cancer gene prediction tools.

Prediction of cancer driver genes through network-based moment propagation of mutation scores

Bioinformatics ◽

10.1093/bioinformatics/btaa452 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i508-i515 ◽

Cited By ~ 1

Author(s):

Anja C Gumpinger ◽

Kasper Lage ◽

Heiko Horn ◽

Karsten Borgwardt

Keyword(s):

Biological Networks ◽

Predictive Performance ◽

Supplementary Information ◽

Summary Statistics ◽

Cancer Genes ◽

Major Step ◽

Score Distribution ◽

Driver Genes ◽

Cancer Driver ◽

Cancer Driver Genes

Abstract Motivation: Gaining a comprehensive understanding of the genetics underlying cancer development and progression is a central goal of biomedical research. Its accomplishment promises key mechanistic, diagnostic and therapeutic insights. One major step in this direction is the identification of genes that drive the emergence of tumors upon mutation. Recent advances in the field of computational biology have shown the potential of combining genetic summary statistics that represent the mutational burden in genes with biological networks, such as protein–protein interaction networks, to identify cancer driver genes. Those approaches superimpose the summary statistics on the nodes in the network, followed by an unsupervised propagation of the node scores through the network. However, this unsupervised setting does not leverage any knowledge on well-established cancer genes, a potentially valuable resource to improve the identification of novel cancer drivers. Results: We develop a novel node embedding that enables classification of cancer driver genes in a supervised setting. The embedding combines a representation of the mutation score distribution in a node’s local neighborhood with network propagation. We leverage the knowledge of well-established cancer driver genes to define a positive class, resulting in a partially labeled dataset, and develop a cross-validation scheme to enable supervised prediction. The proposed node embedding followed by a supervised classification improves the predictive performance compared with baseline methods and yields a set of promising genes that constitute candidates for further biological validation. Availability and implementation: Code available at https://github.com/BorgwardtLab/MoProEmbeddings. Supplementary information: Supplementary data are available at Bioinformatics online.
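
As a rough illustration of the idea described in this abstract (superimposing per-gene mutation scores on a network, propagating them, and summarizing the score distribution in each node's neighborhood), here is a minimal Python sketch. It is not the authors' MoProEmbeddings code: the toy graph, the function names `propagate` and `moment_embedding`, the neighbor-averaging propagation step, and the choice of mean and variance as the moments are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def propagate(adj, scores):
    """One assumed propagation step: a node's new score is the mean of its neighbors' scores."""
    return {
        node: mean(scores[nb] for nb in nbs) if nbs else scores[node]
        for node, nbs in adj.items()
    }


def moment_embedding(adj, scores, n_steps=2):
    """Hypothetical embedding: for each node, the (mean, variance) of the scores in
    its closed neighborhood, computed on the raw scores and after each propagation step."""
    emb = {node: [] for node in adj}
    current = dict(scores)
    for _ in range(n_steps + 1):
        for node, nbs in adj.items():
            local = [current[node]] + [current[nb] for nb in nbs]
            emb[node].extend([mean(local), pvariance(local)])
        current = propagate(adj, current)
    return emb


# Toy undirected network (assumed data): gene -> list of neighbors.
adj = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}
# Assumed per-gene mutation scores.
scores = {"A": 1.0, "B": 0.0, "C": 2.0, "D": 0.0}
embedding = moment_embedding(adj, scores, n_steps=1)
```

Each gene ends up with a fixed-length feature vector (here two moments per propagation level), which is the kind of input a supervised classifier trained on known driver genes could consume.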

OncoVar: an integrated database and analysis platform for oncogenic driver variants in cancers

Nucleic Acids Research ◽

10.1093/nar/gkaa1033 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D1289-D1301 ◽

Cited By ~ 2

Author(s):

Tao Wang ◽

Shasha Ruan ◽

Xiaolu Zhao ◽

Xiaohui Shi ◽

Huajing Teng ◽

...

Keyword(s):

Cancer Genome ◽

The Cancer Genome Atlas ◽

Driver Mutations ◽

Cancer Genes ◽

Driver Genes ◽

Cancer Driver ◽

Cancer Cell Population ◽

Cancer Types ◽

Neutral Mutations ◽

Analysis Platform

Abstract The prevalence of neutral mutations in cancer cell populations makes it difficult to distinguish cancer-causing driver mutations from passenger mutations. To systematically prioritize the oncogenic ability of somatic mutations and cancer genes, we constructed a platform, OncoVar (https://oncovar.org/), which employs published bioinformatics algorithms and incorporates known driver events to identify driver mutations and driver genes. We identified 20 162 cancer driver mutations, 814 driver genes and 2360 pathogenic pathways with high confidence by reanalyzing 10 769 exomes from 33 cancer types in The Cancer Genome Atlas (TCGA) and 1942 genomes from 18 cancer types in the International Cancer Genome Consortium (ICGC). OncoVar provides four points of view, ‘Mutation’, ‘Gene’, ‘Pathway’ and ‘Cancer’, to help researchers visualize the relationships between cancers and driver variants. Importantly, identification of actionable driver alterations provides promising druggable targets and repurposing opportunities for combinational therapies. OncoVar provides a user-friendly interface for browsing, searching and downloading somatic driver mutations, driver genes and pathogenic pathways in various cancer types. This platform will facilitate the identification of cancer drivers across individual cancer cohorts and help rank mutations or genes for better decision-making among clinical oncologists, cancer researchers and the broad scientific community interested in cancer precision medicine.

Interpreting pathways to discover cancer driver genes with Moonlight

Nature Communications ◽

10.1038/s41467-019-13803-0 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 9

Author(s):

Antonio Colaprico ◽

Catharina Olsen ◽

Matthew H. Bailey ◽

Gabriel J. Odom ◽

Thilde Terkelsen ◽

...

Keyword(s):

Tumor Suppressors ◽

Molecular Mechanisms ◽

Dual Role ◽

Tissue Type ◽

Driver Gene ◽

Cancer Genes ◽

Driver Genes ◽

Cancer Driver ◽

Therapeutic Decisions ◽

Cancer Driver Genes

Abstract Cancer driver gene alterations influence cancer development, occurring in oncogenes, tumor suppressors, and dual role genes. Discovering dual role cancer genes is difficult because of their elusive context-dependent behavior. We define oncogenic mediators as genes controlling biological processes. With them, we classify cancer driver genes, unveiling their roles in cancer mechanisms. To this end, we present Moonlight, a tool that incorporates multiple -omics data to identify critical cancer driver genes. With Moonlight, we analyze 8000+ tumor samples from 18 cancer types, discovering 3310 oncogenic mediators, 151 having dual roles. By incorporating additional data (amplification, mutation, DNA methylation, chromatin accessibility), we reveal 1000+ cancer driver genes, corroborating known molecular mechanisms. Additionally, we confirm critical cancer driver genes by analysing cell-line datasets. We discover inactivation of tumor suppressors in intron regions and that tissue type and subtype indicate dual role status. These findings help explain tumor heterogeneity and could guide therapeutic decisions.

A CATH domain functional family based approach to identify putative cancer driver genes and driver mutations

10.1101/399014 ◽

2018 ◽

Author(s):

Paul Ashford ◽

Camilla S.M. Pang ◽

Aurelio A. Moya-García ◽

Tolulope Adeyelu ◽

Christine A. Orengo

Keyword(s):

Zinc Finger Protein ◽

Point Mutations ◽

Driver Mutations ◽

Cancer Genes ◽

Driver Genes ◽

Recurrent Point ◽

Cancer Driver ◽

Functional Sites ◽

Cancer Driver Genes ◽

Family Based

Tumour sequencing identifies highly recurrent point mutations in cancer driver genes, but rare functional mutations are hard to distinguish from large numbers of passengers. We developed a novel computational platform applying a multi-modal approach to filter out passengers and more robustly identify putative driver genes. The primary filter identifies enrichment of cancer mutations in CATH functional families (CATH-FunFams) – structurally and functionally coherent sets of evolutionarily related domains. Using structural representatives from CATH-FunFams, we subsequently seek enrichment of mutations in 3D and show that these mutation clusters have a very significant tendency to lie close to known functional sites or conserved sites predicted using CATH-FunFams. Our third filter identifies enrichment of putative driver genes in functionally coherent protein network modules confirmed by literature analysis to be cancer associated.

Our approach is complementary to other domain enrichment approaches exploiting Pfam families, but benefits from more functionally coherent groupings of domains. Using a set of mutations from 22 cancers we detect 151 putative cancer drivers, of which 79 are not listed in cancer resources and include recently validated cancer genes EPHA7, DCC netrin-1 receptor and zinc-finger protein ZNF479.

DriverRWH: Discovering Cancer Driver Genes By Random Walk On a Gene Mutation Hypergraph

10.21203/rs.3.rs-1192205/v1 ◽

2021 ◽

Author(s):

Chenye Wang ◽

Junhan Shi ◽

Jiansheng Cai ◽

Yusen Zhang ◽

Xiaoqi Zheng ◽

...

Keyword(s):

Random Walk ◽

Candidate Genes ◽

Gene Mutation ◽

Network Data ◽

Cumulative Number ◽

Driver Genes ◽

Cancer Driver ◽

Cancer Types ◽

Mutation Data ◽

Cancer Driver Genes

Abstract Background: Recent advances in next-generation sequencing technologies have helped investigators generate massive amounts of cancer genomic data. A critical challenge in cancer genomics is the identification of a few driver mutation genes from a much larger number of passenger mutation genes. However, the majority of existing computational approaches underuse the co-occurrence information of mutations within individuals, which is deemed important in tumorigenesis and tumor progression. Driver gene lists predicted by these tools are prone to false positives, and recent research is far from achieving the ultimate goal of discovering a complete catalog of driver genes. Results: To make full use of co-mutation information, we present a random walk algorithm referred to as DriverRWH on a weighted gene mutation hypergraph model, using somatic mutation data and molecular interaction network data to prioritize candidate driver genes. Applied to tumor samples of different cancer types from The Cancer Genome Atlas (TCGA), DriverRWH shows significantly better performance than state-of-the-art prioritization methods in terms of the area under the curve (AUC) scores and the cumulative number of known driver genes recovered in top-ranked candidate genes. DriverRWH recovers approximately 50% of known driver genes in the top 30 ranked candidate genes for more than half of the cancer types. In addition, DriverRWH is highly robust to perturbations in the mutation data and gene functional network data. Conclusion: DriverRWH is effective across various cancer types in prioritizing cancer driver genes and provides considerable improvement over other tools with a better balance of precision and sensitivity. It can be a useful tool for detecting potential driver genes and facilitating targeted cancer therapies.
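
To make the notion of a random walk on a weighted mutation hypergraph more concrete, the following Python sketch uses the generic textbook formulation: from a gene, the walker picks an incident hyperedge with probability proportional to its weight, then picks a gene inside that hyperedge uniformly, with a restart probability `alpha`. This is not the DriverRWH implementation. Treating each tumor sample as one hyperedge, the unit weights, the uniform restart vector, the value of `alpha`, and the toy data are all assumptions made for illustration.

```python
def hypergraph_transition(hyperedges, weights):
    """Transition probabilities of a generic hypergraph random walk.
    hyperedges: list of non-empty sets of genes; weights: positive weight per hyperedge."""
    genes = sorted({g for e in hyperedges for g in e})
    # Weighted degree of a gene = total weight of the hyperedges containing it.
    degree = {g: 0.0 for g in genes}
    for e, w in zip(hyperedges, weights):
        for g in e:
            degree[g] += w
    P = {u: {v: 0.0 for v in genes} for u in genes}
    for e, w in zip(hyperedges, weights):
        for u in e:
            for v in e:
                # Choose hyperedge e from u (w / degree[u]), then gene v within e (1 / |e|).
                P[u][v] += (w / degree[u]) * (1.0 / len(e))
    return genes, P


def random_walk_with_restart(genes, P, restart, alpha=0.3, n_iter=100):
    """Iterate r <- (1 - alpha) * P^T r + alpha * restart for a fixed number of steps."""
    r = dict(restart)
    for _ in range(n_iter):
        r = {
            v: (1 - alpha) * sum(r[u] * P[u][v] for u in genes) + alpha * restart[v]
            for v in genes
        }
    return r


# Toy data (assumed): each hyperedge is the set of genes mutated in one sample.
hyperedges = [{"TP53", "KRAS"}, {"TP53", "PIK3CA", "KRAS"}, {"PIK3CA"}]
weights = [1.0, 1.0, 1.0]
genes, P = hypergraph_transition(hyperedges, weights)
restart = {g: 1.0 / len(genes) for g in genes}
scores = random_walk_with_restart(genes, P, restart)
ranking = sorted(genes, key=scores.get, reverse=True)
```

The stationary scores can then be sorted to prioritize candidate genes; a method such as the one described in the abstract would additionally incorporate the molecular interaction network, which this sketch leaves out.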

Recent selection is a major force driving cancer evolution

10.1101/2021.12.27.474305 ◽

2021 ◽

Author(s):

Langyu Gu ◽

Guofen Yang

Keyword(s):

Asian Population ◽

Incidence Rates ◽

Human Populations ◽

Cancer Evolution ◽

South Asian Population ◽

Driver Genes ◽

Cancer Driver ◽

Cancer Types ◽

Cancer Driver Genes ◽

Recent Selection

Cancer is one of the most threatening diseases to humans. Understanding the evolution of cancer genes is helpful for therapy management. However, systematic investigations of the evolution of cancer driver genes are sparse. Using comparative genomic analysis, population genetics analysis and computational molecular evolutionary analysis, we examined the evolution of 568 cancer driver genes of 66 cancer types across the primate phylogeny (long timescale selection), and in modern human populations from the 1000 Genomes Project (recent selection). We found that recent selection pressures, rather than long timescale selection, significantly affect the evolution of cancer driver genes in humans. Cancer driver genes related to morphological traits and local adaptation are under positive selection in different human populations. The African population showed the largest extent of divergence compared to other populations. It is worth noting that the cancer types corresponding to positively selected genes exhibited population-specific patterns, with the South Asian population having the fewest cancer types. This helps explain why the South Asian population usually has low cancer incidence rates. Population-specific patterns of cancer types whose driver genes are under positive selection also give clues to explain discrepancies in cancer incidence rates between geographical populations, such as the high incidence rate of Wilms tumour in the African population and of Ewing's sarcoma in the European population. Our findings are thus helpful for understanding cancer evolution and providing guidance for further precision medicine.

Contextual Classifications of Cancer Driver Genes

10.1101/715508 ◽

2019 ◽

Author(s):

Pramod Chandrashekar ◽

Navid Ahmadinejad ◽

Junwen Wang ◽

Aleksandar Sekulic ◽

Jan B. Egan ◽

...

Keyword(s):

Computational Method ◽

Cancer Type ◽

Sequencing Data ◽

Multiple Cancer ◽

Driver Genes ◽

Cancer Driver ◽

Link Type ◽

Mutational Hotspots ◽

Cancer Types ◽

Cancer Driver Genes

Abstract Functions of cancer driver genes depend on cellular contexts that vary substantially across tissues and organs. Distinguishing oncogenes (OGs) and tumor suppressor genes (TSGs) for each cancer type is critical to identifying clinically actionable targets. However, current resources for context-aware classifications of cancer drivers are limited. In this study, we show that the direction and magnitude of somatic selection of missense and truncating mutations of a gene are suggestive of its contextual activities. By integrating these features with ratiometric and conservation measures, we developed a computational method to categorize OGs and TSGs using exome sequencing data. This new method, named genes under selection in tumors (GUST), shows an overall accuracy of 0.94 when tested on manually curated benchmarks. Application of GUST to 10,172 tumor exomes of 33 cancer types identified 98 OGs and 179 TSGs, >70% of which promote tumorigenesis in only one cancer type. In broad-spectrum drivers shared across multiple cancer types, we found heterogeneous mutational hotspots modifying distinct functional domains, implicating the synchrony of convergent and divergent disease mechanisms. We further discovered two novel OGs and 28 novel TSGs with high confidence. The GUST program is available at https://github.com/liliulab/gust. A database with pre-computed classifications is available at https://liliulab.shinyapps.io/gust

Evaluating the Evaluation of Cancer Driver Genes

10.1101/060426 ◽

2016 ◽

Cited By ~ 1

Author(s):

Collin J. Tokheim ◽

Nickolas Papadopoulos ◽

Kenneth W. Kinzler ◽

Bert Vogelstein ◽

Rachel Karchin

Keyword(s):

Machine Learning ◽

False Positive ◽

Gold Standard ◽

Gene Prediction ◽

Mutation Rates ◽

Driver Gene ◽

Prediction Methods ◽

Driver Genes ◽

Cancer Driver ◽

Human Cancers

Abstract Sequencing has identified millions of somatic mutations in human cancers, but distinguishing cancer driver genes remains a major challenge. Numerous methods have been developed to identify driver genes, but evaluation of the performance of these methods is hindered by the lack of a gold standard, i.e., bona fide driver gene mutations. Here, we establish an evaluation framework that can be applied when a gold standard is not available. We used this framework to compare the performance of eight driver gene prediction methods. One of these methods, newly described here, incorporated a machine learning-based ratiometric approach. We show that the driver genes predicted by each of these eight methods vary widely. Moreover, the p-values reported by several of the methods were inconsistent with the uniform values expected, thus calling into question the assumptions that were used to generate them. Finally, we evaluated the potential effects of unexplained variability in mutation rates on false positive driver gene predictions. Our analysis points to the strengths and weaknesses of each of the currently available methods and offers guidance for improving them in the future.

Significance Modern large-scale sequencing of human cancers seeks to comprehensively discover mutated genes that confer a selective advantage to cancer cells. Key to this effort has been development of computational algorithms to find genes that drive cancer, based on their patterns of mutation in large patient cohorts. However, since there is no generally accepted gold standard of driver genes, it has been difficult to quantitatively compare these methods. We present a new machine learning method for driver gene prediction and a rigorous protocol to evaluate and compare prediction methods. Our results suggest that most current methods do not adequately account for heterogeneity in the number of mutations expected by chance and consequently have many false positive calls. The problem is most acute for cancers with high mutation rates, and comprehensive discovery of drivers in these cancers may be more difficult than currently anticipated.
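
The observation that reported p-values can deviate from the uniform distribution expected for non-driver genes can be checked in several ways. The sketch below shows one generic option, the Kolmogorov-Smirnov distance between the empirical p-value distribution and Uniform(0, 1). It is not necessarily the metric used in the paper; the function name `ks_distance_from_uniform` and the example p-values are assumptions made for illustration.

```python
def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF of the p-values and the Uniform(0, 1) CDF."""
    ps = sorted(p_values)
    n = len(ps)
    return max(
        # Gap just after each sorted p-value, and gap just before it.
        max((i + 1) / n - p, p - i / n)
        for i, p in enumerate(ps)
    )


# Assumed example inputs: evenly spread p-values versus strongly inflated ones.
well_calibrated = [(i + 0.5) / 10 for i in range(10)]
inflated = [0.001] * 10

d_calibrated = ks_distance_from_uniform(well_calibrated)
d_inflated = ks_distance_from_uniform(inflated)
```

A small distance is consistent with well-calibrated p-values, whereas a distance near 1 means almost all genes receive very small p-values, the pattern that tends to go with many false positive calls.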

MEXCOWalk: Mutual Exclusion and Coverage Based Random Walk to Identify Cancer Modules

10.1101/547653 ◽

2019 ◽

Author(s):

Rafsan Ahmed ◽

Ilyes Baali ◽

Cesim Erten ◽

Evis Hoxha ◽

Hilal Kazan

Keyword(s):

Random Walk ◽

Mutual Exclusion ◽

Risk Scores ◽

Cancer Genes ◽

Multiple Cancer ◽

Driver Genes ◽

Cancer Driver ◽

Cancer Data ◽

Cancer Types ◽

Pan Cancer

Abstract Motivation: Genomic analyses from large cancer cohorts have revealed the mutational heterogeneity problem which hinders the identification of driver genes based only on mutation profiles. One way to tackle this problem is to incorporate the fact that genes act together in functional modules. The connectivity knowledge present in existing protein-protein interaction networks together with mutation frequencies of genes and the mutual exclusivity of cancer mutations can be utilized to increase the accuracy of identifying cancer driver modules. Results: We present a novel edge-weighted random walk-based approach that incorporates connectivity information in the form of protein-protein interactions, mutual exclusion, and coverage to identify cancer driver modules. MEXCOWalk outperforms several state-of-the-art computational methods on TCGA pan-cancer data in terms of recovering known cancer genes, providing modules that are capable of classifying normal and tumor samples, and that are enriched for mutations in specific cancer types. Furthermore, the risk scores determined with output modules can stratify patients into low-risk and high-risk groups in multiple cancer types. MEXCOWalk identifies modules containing both well-known cancer genes and putative cancer genes that are rarely mutated in the pan-cancer data. The data, the source code, and useful scripts are available at: https://github.com/abu-compbio/[email protected]

Directional association test reveals high-quality putative cancer driver biomarkers including noncoding RNAs

BMC Medical Genomics ◽

10.1186/s12920-019-0565-9 ◽

2019 ◽

Vol 12 (S7) ◽

Cited By ~ 2

Author(s):

Hua Zhong ◽

Mingzhou Song

Keyword(s):

Myeloid Leukemia ◽

Noncoding Rnas ◽

Causative Role ◽

Cancer Genes ◽

High Quality ◽

Driver Genes ◽

Cancer Driver ◽

Functional Relationships ◽

Model Free ◽

Cancer Driver Genes

Abstract Background: Most statistical methods used to identify cancer driver genes are either biased due to the choice of assumed parametric models or insensitive to directional relationships important for causal inference. To overcome modeling biases and directional insensitivity, a recent statistical functional chi-squared test (FunChisq) detects directional association via model-free functional dependency. FunChisq examines patterns pointing from independent to dependent variables arising from linear, non-linear, or many-to-one functional relationships. Meanwhile, the Functional Annotation of Mammalian Genome 5 (FANTOM5) project surveyed gene expression at over 200,000 transcription start sites (TSSs) in nearly all human tissue types, primary cell types, and cancer cell lines. The data cover TSSs originating from both coding and noncoding genes. For the vast number of uncharacterized human TSSs that may exhibit complex patterns in cancer versus normal tissues, the model-free property of FunChisq provides an unprecedented opportunity to assess the evidence for a gene’s directional effect on human cancer. Results: We first evaluated FunChisq and six other methods using 719 curated cancer genes on the FANTOM5 data. FunChisq performed best in distinguishing known cancer driver genes from non-cancer genes. We also show the capacity of FunChisq to reveal non-monotonic patterns of functional association, to which typical differential analysis methods such as the t-test are insensitive. Further applying FunChisq to screen unannotated TSSs in FANTOM5, we predicted 1108 putative cancer driver noncoding RNAs, stronger than 90% of curated cancer driver genes. Next, we compared leukemia samples against other samples in FANTOM5 and FunChisq predicted 332/79 potential biomarkers for lymphoid/myeloid leukemia, stronger than the TSSs of all 87/100 known driver genes in lymphoid/myeloid leukemia. Conclusions: This study demonstrated the advantage of FunChisq in revealing directional association, especially in detecting non-monotonic patterns. Here, we also provide the most comprehensive catalog of high-quality biomarkers that may play a causative role in human cancers, including putative cancer driver noncoding RNAs and lymphoid/myeloid leukemia specific biomarkers.
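
For readers unfamiliar with the functional chi-squared idea, the Python sketch below computes a statistic of that kind from a contingency table whose rows are levels of the candidate cause X and whose columns are levels of the effect Y. It follows our reading of the commonly stated form of the statistic: the sum over rows of each row's chi-squared deviation from a uniform Y, minus the chi-squared deviation of the Y marginal from uniform. It should be treated as an assumption to verify against the FunChisq publication or package, not as the authors' implementation. The function name `fun_chisq` and the toy tables are hypothetical.

```python
def fun_chisq(table):
    """table[i][j] = number of samples with X at level i (row) and Y at level j (column).
    Returns the (unnormalized) functional chi-squared statistic for the direction X -> Y."""
    s = len(table[0])                        # number of Y levels
    n = sum(sum(row) for row in table)       # total sample size
    stat = 0.0
    for row in table:
        n_i = sum(row)
        if n_i == 0:
            continue                         # an unobserved X level contributes nothing
        expected = n_i / s                   # uniform Y within this X level
        stat += sum((n_ij - expected) ** 2 / expected for n_ij in row)
    col_expected = n / s                     # uniform Y marginal
    for j in range(s):
        n_j = sum(row[j] for row in table)
        stat -= (n_j - col_expected) ** 2 / col_expected
    return stat


# Toy tables (assumed data).
perfect_function = [[5, 0], [0, 5]]   # Y is fully determined by X
no_association = [[5, 5], [5, 5]]     # Y is unrelated to X

stat_function = fun_chisq(perfect_function)
stat_none = fun_chisq(no_association)
```

The statistic is large when each level of X concentrates its samples in a single level of Y and is zero when Y looks the same for every X. Because rows and columns play different roles, swapping them generally gives a different value, which is what makes the test directional.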
