Deep learning for cancer type classification

Abstract: Genetic information is becoming more readily available and is increasingly being used to predict patient cancer types as well as their subtypes. Most classification methods thus far utilize somatic mutations as independent features for classification and are limited by study power. To address these limitations, we propose DeepCues, a deep learning model that utilizes convolutional neural networks to derive features from DNA sequencing data for disease classification and relevant gene discovery. Using whole-exome sequencing, germline variants and somatic mutations, including insertions and deletions, are interactively amalgamated as features. In this study, we applied DeepCues to a dataset from TCGA to classify seven different types of major cancers and obtained an overall accuracy of 77.6%. We compared DeepCues to conventional methods and demonstrated a significant overall improvement (p = 8.8E-25). Using DeepCues, we found that the top 20 genes associated with breast cancer have a 40% overlap with the top 20 breast cancer genes in the COSMIC database. These data support DeepCues as a novel method to improve the representational resolution of both germline variants and somatic mutations interactively and their power in predicting cancer types, as well as the genes involved in each cancer.
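The 40% figure above is a top-k overlap: 8 of the 20 highest-ranked genes appear in both lists. A minimal sketch of that calculation follows. The function name `top_k_overlap` and the placeholder gene labels are hypothetical and are not taken from DeepCues or COSMIC; they only illustrate the arithmetic.

```python
def top_k_overlap(ranked_a, ranked_b, k=20):
    """Return the fraction of the top-k items shared by two ranked lists, plus the shared items."""
    top_a = set(ranked_a[:k])
    top_b = set(ranked_b[:k])
    shared = top_a & top_b
    return len(shared) / k, sorted(shared)


# Placeholder rankings (hypothetical labels, not real results).
model_genes = [f"GENE_{i}" for i in range(20)]          # GENE_0 .. GENE_19
reference_genes = [f"GENE_{i}" for i in range(12, 32)]  # GENE_12 .. GENE_31

fraction, shared = top_k_overlap(model_genes, reference_genes)
print(f"{len(shared)} shared genes, overlap = {fraction:.0%}")
```

With these placeholder lists the two rankings share eight labels, which reproduces the 8/20 = 40% proportion reported in the abstract.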


Deep learning for cancer type classification and driver gene identification

BMC Bioinformatics, 2021, Vol 22 (S4). DOI: 10.1186/s12859-021-04400-4

Author(s): Zexian Zeng, Chengsheng Mao, Andy Vo, Xiaoyu Li, Janna Ore Nugent, ...

Keyword(s): Breast Cancer, Deep Learning, Somatic Mutations, Disease Classification, Driver Gene, Cancer Type, Sequencing Data, Germline Variants, Insertions And Deletions, Novel Method

Abstract. Background: Genetic information is becoming more readily available and is increasingly being used to predict patient cancer types as well as their subtypes. Most classification methods thus far utilize somatic mutations as independent features for classification and are limited by study power. We aim to develop a novel method to effectively explore the landscape of genetic variants, including germline variants and small insertions and deletions, for cancer type prediction. Results: We proposed DeepCues, a deep learning model that utilizes convolutional neural networks to derive features in an unbiased manner from raw cancer DNA sequencing data for disease classification and relevant gene discovery. Using raw whole-exome sequencing as features, germline variants and somatic mutations, including insertions and deletions, were interactively amalgamated for feature generation and cancer prediction. We applied DeepCues to a dataset from TCGA to classify seven different types of major cancers and obtained an overall accuracy of 77.6%. We compared DeepCues to conventional methods and demonstrated a significant overall improvement (p < 0.001). Strikingly, the top 20 breast cancer-relevant genes identified by DeepCues had a 40% overlap with the top 20 known breast cancer driver genes. Conclusion: Our results support DeepCues as a novel method to improve the representational resolution of DNA sequencing data and its power in deriving features from raw sequences for cancer type prediction, as well as discovering new cancer-relevant genes.
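To make the idea of a convolutional network deriving features from raw sequence more concrete, the sketch below one-hot encodes a short DNA string and slides a single filter across it. This is an illustration only and is not the DeepCues architecture: the abstract does not describe the encoding, the filter sizes, or the layers, so the helper names `one_hot` and `conv1d` and the toy "TAG" filter are all assumptions.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-element 0/1 vector; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(encoded, kernel):
    """Slide one filter over the sequence (no padding) and apply ReLU to each window score."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        activations.append(max(0.0, total))
    return activations


# Toy filter (assumed for illustration): weight 1 on each base of the 3-mer "TAG".
kernel = one_hot("TAG")
activations = conv1d(one_hot("GATTAGA"), kernel)
print(activations)  # highest score where the window matches the filter exactly
```

In a trained network the filter weights are learned rather than hand-set, and many filters are stacked; the sketch shows only how one filter turns a window of raw bases into a numeric feature.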


Rate of incidental germline findings detected by tumor-normal matched sequencing in cancer types lacking hereditary cancer testing guidelines.

Journal of Clinical Oncology, 2021, Vol 39 (15_suppl), pp. 10582-10582. DOI: 10.1200/jco.2021.39.15_suppl.10582

Author(s): Timothy A. Yap, Arya Ashok, Jessica Stoll, Anna Ewa Schwarzbach, Kimberly L. Blackwell, ...

Keyword(s): Bile Duct, Hereditary Cancer, Next Generation Sequencing Data, Cancer Type, Cancer Genes, Sequencing Data, Single Nucleotide Variants, Germline Variants, Cancer Types, Germline Testing

10582 Background: Up to 10% of all cancers are associated with hereditary cancer syndromes; however, guidelines for germline testing are currently limited to patients and families with specific cancer types (ovarian, breast, prostate, pancreatic, etc.). Although germline alterations have been shown in genes associated with cancers such as bile-duct, head & neck, brain, bladder, esophageal, and lung cancers, genetic testing is not routinely offered (PMID: 28873162). In such cancers, a guidelines-based approach may fail to detect cancer risk variants found by tumor-normal (T/N) matched sequencing. Here, we report the prevalence of incidental germline findings in patients with the aforementioned 6 cancer types and highlight frequently mutated genes by cancer type. Methods: We retrospectively analyzed next-generation sequencing data from de-identified records of 19,630 patients tested using the Tempus|xT T/N matched assay. Incidental germline findings (i.e., single nucleotide variants and small insertions/deletions) detected in 50 hereditary cancer genes were determined for: bile duct (n = 466), head & neck (n = 673), esophageal (n = 395), brain (n = 1,391), bladder (n = 810), and lung (n = 5,544), where n = total patients. For comparison, we also included 4 cancer types that frequently undergo germline testing: ovarian (n = 2,042), breast (n = 3,542), prostate (n = 2,146), and pancreatic (n = 2,621). Results: We detected incidental pathogenic/likely pathogenic germline variants (P/LPV) in 6.5% (601/9,279) of patients diagnosed with the 6 selected cancer types lacking hereditary cancer testing guidelines. The highest prevalence of P/LPV was identified in patients with bladder (8%), brain (6.9%), and lung (6.5%) cancers. Frequently mutated genes (Table) include ATM (n = 62), BRCA2 (n = 60), BRCA1 (n = 33), APC (n = 27), and CHEK2 (n = 21). Of note, the Ashkenazi Jewish variant (p.I1307K) was the most frequent mutation in APC. For cancer types where patients frequently undergo germline testing, the rates of incidental germline findings in descending order were ovarian (15%), breast (12%), prostate (9.4%), and pancreatic (8.5%) cancers. Conclusions: In addition to enhanced variant calling, T/N matched sequencing may identify germline variants missed by a guidelines-based approach to testing. The identification of such germline findings may have clinical implications for the patient, as well as at-risk family members, thereby resulting in the opportunity for genetic counseling and risk-stratified intervention. [Table: see text]
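The cohort sizes and the headline prevalence reported in this abstract can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated in the Methods and Results; the variable names are illustrative.

```python
# Patients per cancer type, as stated in the Methods.
no_guideline = {
    "bile duct": 466,
    "head & neck": 673,
    "esophageal": 395,
    "brain": 1391,
    "bladder": 810,
    "lung": 5544,
}
guideline = {
    "ovarian": 2042,
    "breast": 3542,
    "prostate": 2146,
    "pancreatic": 2621,
}
plpv_carriers = 601  # patients with incidental P/LPV among the six no-guideline cancer types

no_guideline_total = sum(no_guideline.values())
all_patients = no_guideline_total + sum(guideline.values())
prevalence_pct = round(100 * plpv_carriers / no_guideline_total, 1)

print(no_guideline_total)  # denominator used for the 6.5% figure
print(all_patients)        # full retrospective cohort
print(prevalence_pct)
```

The six no-guideline cohorts sum to the 9,279 denominator, all ten cohorts sum to the 19,630 patients analyzed, and 601/9,279 rounds to the reported 6.5%.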


Pathway-based dissection of the genomic heterogeneity of cancer hallmarks’ acquisition with SLAPenrich

2016. DOI: 10.1101/077701. Cited By ~ 6

Author(s): Francesco Iorio, Luz Garcia-Alonso, Jonathan S. Brammeld, Iñigo Martincorena, David R. Wille, ...

Keyword(s): Somatic Mutations, Population Level, Computational Method, Driver Mutations, Cancer Type, Cancer Genes, Driver Genes, Cancer Hallmarks, Pathway Gene, Cancer Types

Abstract: Cancer hallmarks are evolutionary traits required by a tumour to develop. While extensively characterised, the way these traits are achieved through the accumulation of somatic mutations in key biological pathways is not fully understood. To shed light on this subject, we characterised the landscape of pathway alterations associated with somatic mutations observed in 4,415 patients across ten cancer types, using 374 orthogonal pathway gene-sets mapped onto canonical cancer hallmarks. Towards this end, we developed SLAPenrich: a computational method based on population-level statistics, freely available as an open source R package. Assembling the identified pathway alterations into sets of hallmark signatures allowed us to connect somatic mutations to clinically interpretable cancer mechanisms. Further, we explored the heterogeneity of these signatures, in terms of the ratio of altered pathways associated with each individual hallmark, assuming that this is reflective of the extent of selective advantage provided to the cancer type under consideration. Our analysis revealed the predominance of certain hallmarks in specific cancer types, thus suggesting different evolutionary trajectories across cancer lineages. Finally, although many pathway alteration enrichments are guided by somatic mutations in frequently altered high-confidence cancer genes, excluding these driver mutations preserves the hallmark heterogeneity signatures, and thus the detected hallmarks’ predominance across cancer types. As a consequence, we propose the hallmark signatures as a ground truth to characterise tails of infrequent genomic alterations and identify potential novel cancer driver genes and networks.


Classification of breast cancer types, sub-types and grade from histopathological images using deep learning technique

Health and Technology, 2021. DOI: 10.1007/s12553-021-00592-0

Author(s): Elbetel Taye Zewdie, Abel Worku Tessema, Gizeaddis Lamesgin Simegn

Keyword(s): Breast Cancer, Deep Learning, Learning Technique, Histopathological Images, Cancer Types


Deep learning with evolutionary and genomic profiles for identifying cancer subtypes

Journal of Bioinformatics and Computational Biology, 2019, Vol 17 (03), pp. 1940005. DOI: 10.1142/s0219720019400055. Cited By ~ 2

Author(s): Chun-Yu Lin, Peiying Ruan, Ruiming Li, Jinn-Moon Yang, Simon See, ...

Keyword(s): Breast Cancer, Deep Learning, Evolutionary Conservation, Core Gene, Breast Cancer Subtypes, Gene Set, Cancer Subtypes, The Core, Cancer Types, Core Genes

Cancer subtype identification is an unmet need in precision diagnosis. Recently, evolutionary conservation has been indicated to contain informative signatures for functional significance in cancers. However, the importance of evolutionary conservation in distinguishing cancer subtypes remains largely unclear. Here, we identified the evolutionarily conserved genes (i.e. core genes) and observed that they are primarily involved in cellular pathways relevant to cell growth and metabolism. Using these core genes, we developed two novel strategies, namely a feature-based strategy (FES) and an image-based strategy (IMS), by integrating their evolutionary and genomic profiles with a deep learning algorithm. In comparison with the FES using a random gene set and the strategy using the PAM50 classifier, the core gene set-based FES achieved a higher accuracy for identifying breast cancer subtypes. The IMS and FES using the core gene set yielded better performance than the other strategies in classifying both breast cancer subtypes and multiple cancer types. Moreover, the IMS is reproducible even when using different gene expression data (i.e. RNA-seq and microarray). Comprehensive analysis of eight cancer types demonstrates that our evolutionary conservation-based models represent a valid and helpful approach for identifying cancer subtypes and that the core gene set offers distinguishable clues of cancer subtypes.


Immunohistological Expression of SOX-10 in Triple-Negative Breast Cancer: A Descriptive Analysis of 113 Samples

International Journal of Molecular Sciences, 2020, Vol 21 (17), pp. 6407. DOI: 10.3390/ijms21176407. Cited By ~ 1

Author(s): Katharina Kriegsmann, Christa Flechtenmacher, Jörg Heil, Jörg Kriegsmann, Gunhild Mechtersheimer, ...

Keyword(s): Breast Cancer, Triple Negative Breast Cancer, Triple Negative, Descriptive Analysis, Biological Significance, Genetic Alterations, Prognostic Impact, Cancer Genes, Sequencing Data, Disease Free

Background: SRY-related HMG-box 10 (SOX-10) is commonly expressed in triple negative breast cancer (TNBC). However, data on the biological significance of SOX-10 expression are limited. Therefore, we investigated immunohistological SOX-10 expression in TNBC and correlated the results with genetic alterations and clinical data. Methods: A tissue microarray including 113 TNBC cases was stained for SOX-10. Immunohistological data for AR, BCL2, CD117, p53 and Vimentin were available from a previous study. Semiconductor-based panel sequencing data including commonly altered breast cancer genes were also available from a previous investigation. SOX-10 expression was correlated with clinicopathological, immunohistochemical and genetic data. Results: SOX-10 was significantly associated with CD117 and Vimentin, but not with AR expression. An association of SOX-10 with BCL2, EGFR or p53 staining was not observed. SOX-10-positive tumors more often harbored TP53 mutations but less frequently harbored mutations of PIK3CA or alterations of the PI3K pathway. SOX-10 expression had no prognostic impact on disease-free, distant disease-free, or overall survival. Conclusions: While SOX-10 might have value as a differential diagnostic marker to identify metastases of TNBC, its biological role remains to be investigated.


Germline variants associated with leukocyte genes predict tumor recurrence in breast cancer patients

npj Precision Oncology, 2019, Vol 3 (1). DOI: 10.1038/s41698-019-0100-7. Cited By ~ 9

Author(s): Jean-Sébastien Milanese, Chabane Tibiche, Jinfeng Zou, Zhigang Meng, Andre Nantel, ...

Keyword(s): Breast Cancer, Cancer Patients, Clinical Outcomes, Cell Function, Recurrence Score, Oncotype Dx, Breast Cancer Patients, Sequencing Data, Germline Variants, Gene Signatures

Abstract: Germline variants such as BRCA1/2 play an important role in tumorigenesis and clinical outcomes of cancer patients. However, only a small fraction (i.e., 5–10%) of inherited variants has been associated with clinical outcomes (e.g., BRCA1/2, APC, TP53, PTEN and so on). The challenge remains in using these inherited germline variants to predict clinical outcomes of the cancer patient population. In an attempt to solve this issue, we applied our recently developed algorithm, eTumorMetastasis, which constructs predictive models, to exome sequencing data from ER+ breast cancer patients (n = 755). Gene signatures derived from the genes containing functional germline variants significantly distinguished recurred and non-recurred patients in two independent ER+ breast cancer cohorts (n = 200 and 295, P = 1.4 × 10⁻³). Furthermore, we compared our results with the widely known Oncotype DX test (i.e., Oncotype DX breast cancer recurrence score) and outperformed it in prediction for both high- and low-risk groups. Finally, we found that recurred patients possessed a higher rate of germline variants. In addition, the inherited germline variants from these gene signatures were predominantly enriched in T cell function, antigen presentation, and cytokine interactions, likely impairing the adaptive and innate immune response and thus favoring a pro-tumorigenic environment. Hence, germline genomic information could be used for developing non-invasive genomic tests for predicting patients’ outcomes in breast cancer.


Noncoding RNAs and Deep Learning Neural Network Discriminate Multi-Cancer Types

Cancers, 2022, Vol 14 (2), pp. 352. DOI: 10.3390/cancers14020352

Author(s): Anyou Wang, Rong Hai, Paul J. Rider, Qianchuan He

Keyword(s): Neural Network, Deep Learning, Cancer Screening, Detection System, Population Level, Cancer Type, Multiple Cancer, Data Set, Cancer Types, Deep Learning Neural Network

Detecting cancers at early stages can dramatically reduce mortality rates. Therefore, practical cancer screening at the population level is needed. To develop a comprehensive detection system to classify multiple cancer types, we integrated an artificial intelligence deep learning neural network and noncoding RNA biomarkers selected from massive data. Our system can accurately detect cancer vs. healthy objects with 96.3% AUC of ROC (area under the receiver operating characteristic curve), and it surprisingly reaches 78.77% AUC when validated by real-world raw data from a completely independent data set. Even when validated with raw exosome data from blood, our system can reach 72% AUC. Moreover, our system significantly outperforms conventional machine learning models, such as random forest. Intriguingly, with no more than six biomarkers, our approach can easily discriminate any individual cancer type vs. normal with 99% to 100% AUC. Furthermore, a comprehensive marker panel can simultaneously multi-classify common cancers with a stable 82.15% accuracy rate for heterogeneous cancerous tissues and conditions. This detection system provides a promising practical framework for automatic cancer screening at the population level. Key points: (1) We developed a practical cancer screening system, which is simple, accurate, affordable, and easy to operate. (2) Our system classifies cancers vs. normal in a binary fashion with >96% AUC. (3) In total, 26 individual cancer types can be easily detected by our system with 99 to 100% AUC. (4) The system can detect multiple cancer types simultaneously with >82% accuracy.
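The AUC values quoted above summarize how well a classifier's scores rank cancer samples above normal ones. As a rough illustration, the sketch below computes ROC AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties count as half). The function name `roc_auc` and the toy labels and scores are hypothetical; they are not the authors' data or pipeline.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise ranking: the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = cancer, 0 = normal; scores are made-up classifier outputs.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

An AUC of 1.0 means every cancer sample outranks every normal sample, while 0.5 corresponds to random ranking, which is why the drop from 96.3% to 78.77% on independent data is worth noting.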


Comprehensive analysis of clustered mutations in cancer reveals recurrent APOBEC3 mutagenesis of ecDNA

2021. DOI: 10.1101/2021.05.27.445689

Author(s): Erik N Bergstrom, Jens-Christian Luebeck, Mia Petljak, Vineet Bafna, Paul S. Mischel, ...

Keyword(s): Somatic Mutations, Human Cancer, Cancer Genes, Base Substitutions, Cancer Genomes, Cancer Types, Mutational Processes, Comprehensive Characterization

Clustered somatic mutations are common in cancer genomes with prior analyses revealing several types of clustered single-base substitutions, including doublet- and multi-base substitutions, diffuse hypermutation termed omikli, and longer strand-coordinated events termed kataegis. Here, we provide a comprehensive characterization of clustered substitutions and clustered small insertions and deletions (indels) across 2,583 whole-genome sequenced cancers from 30 cancer types. While only 3.7% of substitutions and 0.9% of indels were found to be clustered, they contributed 8.4% and 6.9% of substitution and indel drivers, respectively. Multiple distinct mutational processes gave rise to clustered indels including signatures enriched in tobacco smokers and homologous-recombination deficient cancers. Doublet-base substitutions were caused by at least 12 mutational processes, while the majority of multi-base substitutions were generated by either tobacco smoking or exposure to ultraviolet light. Omikli events, previously attributed to the activity of APOBEC3 deaminases, accounted for a large proportion of clustered substitutions. However, only 16.2% of omikli matched APOBEC3 patterns with experimental validation confirming additional mutational processes giving rise to omikli. Kataegis was generated by multiple mutational processes with 76.1% of all kataegic events exhibiting AID/APOBEC3-associated mutational patterns. Co-occurrence of APOBEC3 kataegis and extrachromosomal-DNA (ecDNA) was observed in 31% of samples with ecDNA. Multiple distinct APOBEC3 kataegic events were observed on most mutated ecDNA. ecDNA containing known cancer genes exhibited both positive selection and kataegic hypermutation. Our results reveal the diversity of clustered mutational processes in human cancer and the role of APOBEC3 in recurrently mutating and fueling the evolution of ecDNA.
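Clustered mutations are generally identified from the distance between neighbouring mutations on the same chromosome. The sketch below groups positions whose consecutive gaps fall under a fixed threshold and reports the clustered fraction. The 1,000 bp threshold, the minimum cluster size, and the function name `cluster_by_distance` are assumptions made for illustration; the abstract does not state how clusters were called in the study.

```python
def cluster_by_distance(positions, max_gap=1000, min_size=2):
    """Group positions on one chromosome into runs whose neighbours are at most max_gap apart."""
    clusters = []
    current = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current run.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_size:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_size:
        clusters.append(current)
    return clusters


# Made-up mutation coordinates on a single chromosome.
positions = [100, 350, 900, 50_000, 50_400, 120_000]
clusters = cluster_by_distance(positions)
clustered_fraction = sum(len(c) for c in clusters) / len(positions)
print(clusters, round(clustered_fraction, 3))
```

A fixed threshold is the simplest choice; an analysis across thousands of genomes would more plausibly adapt the threshold to each sample's mutation burden, which this sketch does not attempt.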


Germline genetic variants associated with leukocyte-genes predict tumor recurrence in breast cancer patients

2018. DOI: 10.1101/312355

Author(s): Jean-Sébastien Milanese, Chabane Tibiche, Jinfeng Zou, Zhi Gang Meng, Andre Nantel, ...

Keyword(s): Breast Cancer, Cancer Patients, Clinical Outcomes, Genetic Variants, Cell Function, Breast Cancer Patients, Sequencing Data, Germline Variants, Gene Signatures, Cytokine Interactions

Abstract: Germline genetic variants such as BRCA1/2 play an important role in tumorigenesis and clinical outcomes of cancer patients. However, only a small fraction (i.e., 5-10%) of inherited variants has been associated with clinical outcomes (e.g., BRCA1/2, APC, TP53, PTEN and so on). The challenge remains in using these inherited germline variants to predict clinical outcomes of the cancer patient population. In an attempt to solve this issue, we applied our recently developed algorithm, eTumorMetastasis, which constructs predictive models, to exome sequencing data from ER+ breast cancer patients (n = 755). Gene signatures derived from the genes containing functional germline genetic variants significantly distinguished recurred and non-recurred patients in two independent ER+ breast cancer cohorts (n = 200 and 295, P = 1.4 × 10⁻³). Furthermore, we found that recurred patients possessed a higher rate of germline genetic variants. In addition, the inherited germline variants from these gene signatures were predominantly enriched in T cell function, antigen presentation and cytokine interactions, likely impairing the adaptive and innate immune response and thus favoring a pro-tumorigenic environment. Hence, germline genomic information could be used for developing non-invasive genomic tests for predicting patients’ outcomes (or drug response) in breast cancer, other cancer types and even other complex diseases.
