Cancer subtype classification and modeling by pathway attention and propagation

Abstract Motivation Biological pathways are an important source of curated knowledge about biological processes. Thus, cancer subtype classification based on pathways will be very useful for understanding differences in biological mechanisms among cancer subtypes. However, pathways cover only a fraction of the entire gene set (about one-third of human genes in KEGG), and they are fragmented. For this reason, there are few computational methods that use pathways for cancer subtype classification. Results We present an explainable deep-learning model with an attention mechanism and network propagation for cancer subtype classification. Each pathway is modeled by a graph convolutional network. Then, a multi-attention-based ensemble model combines several hundred pathways in an explainable manner. Lastly, network propagation on a pathway–gene network explains why gene expression profiles differ among subtypes. In experiments with five TCGA cancer datasets, our method achieved very good classification accuracy and, additionally, identified subtype-specific pathways and biological functions. Availability and implementation The source code is available at http://biohealth.snu.ac.kr/software/GCN_MAE. Supplementary information Supplementary data are available at Bioinformatics online.
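
The network propagation step mentioned in this abstract can be illustrated with a generic random walk with restart. This is only a minimal sketch under stated assumptions: the abstract does not specify the propagation rule, so the function name `propagate`, the `restart` parameter, and the toy pathway–gene graph below are all hypothetical and not taken from the authors' code.

```python
def propagate(adjacency, seed_scores, restart=0.5, tol=1e-9, max_iter=1000):
    """Random walk with restart on a weighted graph (illustrative only).

    adjacency: square list-of-lists of non-negative edge weights.
    seed_scores: initial node scores, e.g. an assumed pathway importance;
                 must have a positive sum.
    """
    n = len(adjacency)
    # Column-normalize so each column sums to 1 (isolated nodes keep zeros).
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    W = [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0
          for j in range(n)] for i in range(n)]
    total = sum(seed_scores)
    p0 = [s / total for s in seed_scores]
    p = p0[:]
    for _ in range(max_iter):
        # One step: diffuse along edges, then return to the seeds with
        # probability `restart`.
        nxt = [(1 - restart) * sum(W[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy graph: node 0 is a pathway linked to genes 1 and 2;
# gene 2 is additionally linked to gene 3, which is outside the pathway.
toy_adjacency = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
toy_scores = propagate(toy_adjacency, [1.0, 0.0, 0.0, 0.0])
```

In this toy setting the seed score placed on the pathway node spreads to its member genes and, more weakly, to the neighboring gene outside the pathway, which is the general intuition behind using propagation to reach genes that pathways do not cover.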

Identifying Cancer Subtypes Using a Residual Graph Convolution Model on a Sample Similarity Network

Genes ◽

10.3390/genes13010065 ◽

2021 ◽

Vol 13 (1) ◽

pp. 65

Author(s):

Wei Dai ◽

Wenhao Yue ◽

Wei Peng ◽

Xiaodong Fu ◽

Li Liu ◽

...

Keyword(s):

Expression Profiles ◽

Expression Patterns ◽

Gene Expression Profiles ◽

Convolutional Network ◽

Similarity Network ◽

Cancer Subtypes ◽

Subtype Classification ◽

Convolution Model ◽

Cancer Subtype ◽

Residual Graph

Cancer subtype classification helps us understand the pathogenesis of cancer and develop new cancer drugs and treatments from which patients would benefit most. Most previous studies detect cancer subtypes by extracting features from individual samples, ignoring their associations with other samples. We believe that the interactions among cancer samples can help identify cancer subtypes. This work proposes a cancer subtype classification method based on a residual graph convolutional network and a sample similarity network. First, we constructed a sample similarity network based on cancer gene co-expression patterns. Then, the gene expression profiles of cancer samples, used as initial features, and the sample similarity network were passed into a two-layer graph convolutional network (GCN) model. We introduced the initial features into the GCN model to avoid over-smoothing during the training process. Finally, the classification of cancer subtypes was obtained through a softmax activation function. Our model was applied to breast invasive carcinoma (BRCA), glioblastoma multiforme (GBM) and lung cancer (LUNG) datasets. The accuracy of our model reached 82.58%, 85.13% and 79.18% for BRCA, GBM and LUNG, respectively, outperforming the existing methods. Survival analysis of our results indicates that the cancer subtypes identified by our model have significant clinical features. Moreover, we can leverage our model to detect the essential genes enriched in gene ontology (GO) terms and the biological pathways related to a cancer subtype.
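
The pipeline described above (sample similarity network, two-layer GCN, re-injection of the initial features, softmax) can be sketched as a plain forward pass. This is a hedged illustration, not the authors' implementation: the abstract does not say exactly how the initial features are introduced, so the additive skip term `X @ W_res`, the symmetric adjacency normalization, and all function and weight names here are assumptions.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def normalize_adjacency(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def relu(M):
    return [[max(0.0, v) for v in row] for row in M]


def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def residual_gcn_forward(A, X, W0, W1, W_res):
    """Two GCN layers plus an assumed skip connection from the input X.

    A: sample-by-sample similarity matrix (non-negative weights).
    X: sample-by-gene expression features.
    W0, W1: layer weights; W_res: hypothetical projection of X to class space.
    """
    A_norm = normalize_adjacency(A)
    H1 = relu(matmul(matmul(A_norm, X), W0))   # first graph convolution
    H2 = matmul(matmul(A_norm, H1), W1)        # second graph convolution
    skip = matmul(X, W_res)                    # re-inject initial features
    logits = [[h + s for h, s in zip(hr, sr)] for hr, sr in zip(H2, skip)]
    return softmax_rows(logits)                # per-sample subtype probabilities
```

The skip term keeps each sample's own expression profile visible in the output even after two rounds of neighborhood averaging, which is one common way to counter over-smoothing; the weights would be learned by minimizing cross-entropy on labeled samples.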

Robust partial reference-free cell composition estimation from tissue expression

Bioinformatics ◽

10.1093/bioinformatics/btaa184 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3431-3438

Author(s):

Ziyi Li ◽

Zhenxing Guo ◽

Ying Cheng ◽

Peng Jin ◽

Hao Wu

Keyword(s):

Expression Profiles ◽

Gene Expression Profiles ◽

Real Data ◽

Estimation Procedure ◽

Free Cell ◽

Biological Information ◽

Supplementary Information ◽

Tissue Samples ◽

Cell Composition ◽

Heterogeneous Tissues

Abstract Motivation In the analysis of high-throughput omics data from tissue samples, estimating and accounting for cell composition have been recognized as important steps. High cost, intensive labor requirements and technical limitations hinder cell composition quantification using cell-sorting or single-cell technologies. Computational methods for cell composition estimation are available, but they are either limited by the availability of a reference panel or suffer from low accuracy. Results We introduce TOAST/-P and TOAST/+P (TOols for the Analysis of heterogeneouS Tissues), two partial reference-free algorithms for estimating cell composition of heterogeneous tissues based on their gene expression profiles. TOAST/-P and TOAST/+P incorporate additional biological information, including cell-type-specific markers and prior knowledge of compositions, in the estimation procedure. Extensive simulation studies and real data analyses demonstrate that the proposed methods provide more accurate and robust cell composition estimation than existing methods. Availability and implementation The proposed methods TOAST/-P and TOAST/+P are implemented as part of the R/Bioconductor package TOAST at https://bioconductor.org/packages/TOAST. Supplementary information Supplementary data are available at Bioinformatics online.

Identify Breast Cancer Subtypes by Gene Expression Profiles

Journal of Data Science ◽

10.6339/jds.2004.02(2).210 ◽

2021 ◽

Vol 2 (2) ◽

pp. 165-175

Author(s):

Grace S. Shieh ◽

Chy-Huei Bai ◽

Chih Lee

Keyword(s):

Breast Cancer ◽

Gene Expression ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Breast Cancer Subtypes ◽

Cancer Subtypes

Cancer classification of single-cell gene expression data by neural network

Bioinformatics ◽

10.1093/bioinformatics/btz772 ◽

2019 ◽

Cited By ~ 3

Author(s):

Bong-Hyun Kim ◽

Kijin Yu ◽

Peter C W Lee

Keyword(s):

Neural Network ◽

Gene Expression ◽

Single Cell ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Cancer Classification ◽

Supplementary Information ◽

Support Vector ◽

K Nearest Neighbors ◽

Normal Tissues

Abstract Motivation Cancer classification based on gene expression profiles has provided insight into the causes of cancer and cancer treatment. Recently, machine learning-based approaches have been attempted in downstream cancer analysis to address the large differences in gene expression values, as determined by single-cell RNA sequencing (scRNA-seq). Results We designed cancer classifiers that can identify 21 types of cancers and normal tissues based on bulk RNA-seq as well as scRNA-seq data. Training was performed with 7398 cancer samples and 640 normal samples from 21 tumors and normal tissues in TCGA based on the 300 most significant genes expressed in each cancer. Then, we compared neural network (NN), support vector machine (SVM), k-nearest neighbors (kNN) and random forest (RF) methods. The NN performed consistently better than the other methods. We further applied our approach to scRNA-seq data transformed by kNN smoothing and found that our model successfully classified cancer types and normal samples. Availability and implementation Cancer classification by neural network. Supplementary information Supplementary data are available at Bioinformatics online.
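
Of the four classifier families compared above, kNN is the simplest to show in a few lines. The sketch below is a generic majority-vote kNN classifier on toy two-gene profiles, given only to make the comparison concrete; it is not the authors' code, it is unrelated to the kNN smoothing step applied to scRNA-seq data, and the function name, data and labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label a query expression profile by majority vote of its k nearest
    training samples under Euclidean distance."""
    dists = sorted((math.dist(x, query), label)
                   for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two "genes", two classes.
toy_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
         [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
toy_y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
toy_label = knn_predict(toy_X, toy_y, [4.8, 5.2])
```

A real comparison would use the selected genes as features and held-out samples for evaluation; the abstract reports that the neural network outperformed kNN and the other baselines in that setting.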

NITUMID: Nonnegative matrix factorization-based Immune-TUmor MIcroenvironment Deconvolution

Bioinformatics ◽

10.1093/bioinformatics/btz748 ◽

2019 ◽

Author(s):

Daiwei Tang ◽

Seyoung Park ◽

Hongyu Zhao

Keyword(s):

Tumor Microenvironment ◽

Matrix Factorization ◽

Nonnegative Matrix Factorization ◽

Expression Profiles ◽

Mrna Level ◽

Nonnegative Matrix ◽

Gene Expression Profiles ◽

Cell Types ◽

Supplementary Information ◽

Mrna Levels

Abstract Motivation A number of computational methods have been proposed recently to profile the tumor microenvironment (TME) from bulk RNA data, and they have proved useful for understanding microenvironment differences among therapeutic response groups. However, these methods cannot account for tumor proportion or variable mRNA levels across cell types. Results In this article, we propose a Nonnegative Matrix Factorization-based Immune-TUmor MIcroenvironment Deconvolution (NITUMID) framework for TME profiling that addresses these limitations. It is designed to provide robust estimates of tumor and immune cell proportions simultaneously, while accommodating mRNA level differences across cell types. Through comprehensive simulations and real data analyses, we demonstrate that NITUMID not only accurately estimates tumor fractions and cell types’ mRNA levels, which are currently unavailable in other methods, but also outperforms most existing deconvolution methods in regular cell type profiling accuracy. Moreover, we show that NITUMID can detect clinical and prognostic signals from gene expression profiles in tumors more effectively than other methods. Availability and implementation The algorithm is implemented in R. The source code can be downloaded at https://github.com/tdw1221/NITUMID. Supplementary information Supplementary data are available at Bioinformatics online.
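
For readers unfamiliar with the underlying technique, the sketch below shows plain nonnegative matrix factorization with the standard Lee–Seung multiplicative updates, factoring a genes-by-samples matrix V into a signature matrix W and a mixing matrix H. It is emphatically not the NITUMID algorithm (which is implemented in R and, per the abstract, adds handling of tumor proportion and mRNA level differences); the function names, the rank and the column normalization used to read H as proportions are illustrative assumptions.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def nmf(V, rank, n_iter=500, seed=0, eps=1e-12):
    """Factor non-negative V (genes x samples) as W (genes x rank) times
    H (rank x samples) using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return W, H


def reconstruction_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    R = matmul(W, H)
    return sum((v - r) ** 2 for vr, rr in zip(V, R) for v, r in zip(vr, rr))


def proportions(H):
    """Rescale each sample's column of H to sum to 1."""
    col_sums = [sum(col) for col in zip(*H)]
    return [[H[k][j] / col_sums[j] for j in range(len(col_sums))]
            for k in range(len(H))]
```

Interpreting the normalized columns of H as cell-type proportions is only meaningful when W is anchored to known signatures or markers, which is the kind of additional structure that dedicated deconvolution methods supply.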

Information theoretic sub-network mining characterizes breast cancer subtypes in terms of cancer core mechanisms

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720016440029 ◽

2016 ◽

Vol 14 (05) ◽

pp. 1644002 ◽

Cited By ~ 1

Author(s):

Jinwoo Park ◽

Benjamin Hur ◽

Sungmin Rhee ◽

Sangsoo Lim ◽

Min-Su Kim ◽

...

Keyword(s):

Breast Cancer ◽

Regulatory Network ◽

Cancer Biology ◽

Breast Cancer Subtype ◽

Breast Cancer Subtypes ◽

Information Theoretic ◽

Cancer Subtypes ◽

Network Mining ◽

Subtype Classification ◽

Cancer Subtype

A breast cancer subtype classification scheme, PAM50, based on genetic information is widely accepted for clinical applications. On the other hand, experimental cancer biology studies have been successful in revealing the mechanisms of breast cancer, and the hallmarks of cancer have now been determined to explain the core mechanisms of tumorigenesis. Thus, it is important to understand how the breast cancer subtypes are related to the cancer core mechanisms, but multiple studies are yet to address the hallmarks of breast cancer subtypes. Therefore, a new approach that can explain the differences among breast cancer subtypes in terms of cancer hallmarks is needed. We developed an information theoretic sub-network mining algorithm, differentially expressed sub-network and pathway analysis (DeSPA), that retrieves tumor-related genes by mining a gene regulatory network (GRN) of transcription factors and miRNAs. With extensive experiments on The Cancer Genome Atlas (TCGA) breast cancer sequencing data, we showed that our approach was able to select genes that belong to cancer core pathways such as the DNA replication, cell cycle and p53 pathways while keeping the accuracy of breast cancer subtype classification comparable to that of PAM50. In addition, our method produces a regulatory network of TFs, miRNAs and their target genes that distinguishes breast cancer subtypes, which is confirmed by experimental studies in the literature.

A Cascade Flexible Neural Forest Model for Cancer Subtypes Classification on Gene Expression Data

Computational Intelligence and Neuroscience ◽

10.1155/2021/6480456 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Lianxin Zhong ◽

Qingfang Meng ◽

Yuehui Chen

Keyword(s):

Gene Expression ◽

Sample Size ◽

Gene Expression Data ◽

Small Sample Size ◽

Small Sample ◽

Expression Data ◽

Cancer Subtypes ◽

Subtype Classification ◽

Cancer Subtype

The correct classification of cancer subtypes is of great significance for the in-depth study of cancer pathogenesis and the realization of accurate treatment for cancer patients. In recent years, the classification of cancer subtypes using deep neural networks and gene expression data has become a hot topic. However, most classifiers face the challenges of overfitting and low classification accuracy when dealing with small-sample-size, high-dimensional biological data. In this paper, the Cascade Flexible Neural Forest (CFNForest) model is proposed to accomplish cancer subtype classification. CFNForest extends the traditional flexible neural tree (FNT) structure to an FNT Group Forest by exploiting a bagging ensemble strategy, and it can automatically generate the model’s structure and parameters. To deepen the FNT Group Forest without introducing new hyperparameters, a multilayer cascade framework was used to design the FNT Group Forest model, which transforms features between levels and improves the performance of the model. The proposed CFNForest model also improves operational efficiency and robustness through a sample selection mechanism between layers and by setting different weights for the output of each layer. To accomplish cancer subtype classification, FNT Group Forests with different feature sets were used to enrich the structural diversity of the model, which makes it more suitable for processing small-sample-size datasets. The experiments on RNA-seq gene expression data showed that CFNForest effectively improves the accuracy of cancer subtype classification. The classification results show good robustness.

Logic-based Analysis of Gene Expression Data Predicts Pathway Crosstalk between TNF, TGFB1 and EGF in Basal-like Breast Cancer

10.1101/614933 ◽

2019 ◽

Author(s):

Kyuri Jo ◽

Beatriz Santos Buitrago ◽

Minsu Kim ◽

Sungmin Rhee ◽

Carolyn Talcott ◽

...

Keyword(s):

Breast Cancer ◽

Gene Expression ◽

Signaling Pathways ◽

Expression Profiles ◽

Formal System ◽

Gene Expression Profiles ◽

Breast Cancer Subtypes ◽

Post Translational Modifications ◽

Cancer Subtypes ◽

Pathway Crosstalk

Abstract For breast cancer, clinically important subtypes are well characterised at the molecular level in terms of gene expression profiles. In addition, signaling pathways in breast cancer have been extensively studied as therapeutic targets due to their roles in tumor growth and metastasis. However, it is challenging to put signaling pathways and gene expression profiles together to characterise biological mechanisms of breast cancer subtypes since many signaling events result from post-translational modifications, rather than gene expression differences. We present a logic-based approach to explain the differences in gene expression profiles among breast cancer subtypes using Pathway Logic and transcriptional network information. Pathway Logic is a rewriting-logic-based formal system for modeling biological pathways including post-translational modifications. The proposed method demonstrated its utility by constructing subtype-specific paths from key receptors (TNFR, TGFBR1 and EGFR) to key transcription factor (TF) regulators (RELA, ATF2, SMAD3 and ELK1) and identifying potential pathway crosstalk via TFs in basal-specific paths, which could provide novel insight into aggressive breast cancer subtypes. Availability Analysis results are available at http://epigenomics.snu.ac.kr/PL/

Identification of significant genes with invasive promotion in non-functional pituitary adenoma via bioinformatical analysis

10.21203/rs.3.rs-146994/v1 ◽

2021 ◽

Author(s):

An Shuo Wang ◽

Hao Xu ◽

Ming Hui Zeng ◽

Fei Wang

Keyword(s):

Pituitary Adenoma ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Pituitary Tumors ◽

Receptor Interaction ◽

Protein Protein Interaction ◽

Normal Tissues ◽

Non Invasive ◽

Pathway Gene ◽

Underlying Mechanisms

Abstract Background Non-functional pituitary adenoma (NFPA) is a disease with a high incidence that accounts for a large part of pituitary tumors and plays a pivotal role. Invasive NFPAs, which show no endocrine manifestations or space-occupying symptoms at early stages, account for about 30 percent of NFPAs. The purpose of the present work was to identify significant genes that promote invasion and their underlying mechanisms. Methods The gene expression profile GSE51618 was obtained from the GEO database. The dataset contains 4 non-invasive NFPA tissues, 3 invasive NFPA tissues and 3 normal tissues. Differentially expressed genes (DEGs) between non-invasive NFPA tissues and invasive NFPA tissues were identified with the GEO2R online tool. There were a total of 226 up-regulated genes and 298 down-regulated genes. Next, we used the Database for Annotation, Visualization and Integrated Discovery (DAVID) to analyze Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and gene ontology (GO) terms, and also used Kaplan Meier Plotter. Then the protein-protein interaction (PPI) network of these DEGs was visualized in Cytoscape with the Search Tool for the Retrieval of Interacting Genes (STRING). There were a total of 141 up-regulated genes and 171 down-regulated genes. From the PPI network analyzed with the Molecular Complex Detection (MCODE) plug-in, all 141 up-regulated genes were selected. Results After reanalysis with GO and DAVID, five genes (ATP2B3, ADCYAP1R1, PTGER2, FSHβ, HTR4) were found to be significantly enriched in the cAMP signaling pathway, neuroactive ligand-receptor interaction and renin secretion. Conclusions Using integrated bioinformatics methods, we have identified five significantly up-regulated DEGs that promote invasion in invasive NFPAs, which could be potential therapeutic targets for patients with invasive NFPAs.

Adversarial deconfounding autoencoder for learning robust gene expression embeddings

Bioinformatics ◽

10.1093/bioinformatics/btaa796 ◽

2020 ◽

Vol 36 (Supplement_2) ◽

pp. i573-i582

Author(s):

Ayse B Dincer ◽

Joseph D Janizek ◽

Su-In Lee

Keyword(s):

Gene Expression ◽

Neural Networks ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Supplementary Information ◽

Biological Variables ◽

Large Numbers ◽

Latent Space ◽

Complex Models ◽

Unsupervised Neural Networks

Abstract Motivation The increasing number of gene expression profiles has enabled the use of complex models, such as deep unsupervised neural networks, to extract a latent space from these profiles. However, expression profiles, especially when collected in large numbers, inherently contain variations introduced by technical artifacts (e.g. batch effects) and uninteresting biological variables (e.g. age) in addition to the true signals of interest. These sources of variation, called confounders, produce embeddings that fail to transfer to different domains, i.e. an embedding learned from one dataset with a specific confounder distribution does not generalize to different distributions. To remedy this problem, we attempt to disentangle confounders from true signals to generate biologically informative embeddings. Results In this article, we introduce the Adversarial Deconfounding AutoEncoder (AD-AE) approach to deconfounding gene expression latent spaces. The AD-AE model consists of two neural networks: (i) an autoencoder to generate an embedding that can reconstruct original measurements, and (ii) an adversary trained to predict the confounder from that embedding. We jointly train the networks to generate embeddings that can encode as much information as possible without encoding any confounding signal. By applying AD-AE to two distinct gene expression datasets, we show that our model can (i) generate embeddings that do not encode confounder information, (ii) conserve the biological signals present in the original space and (iii) generalize successfully across different confounder domains. We demonstrate that AD-AE outperforms the standard autoencoder and other deconfounding approaches. Availability and implementation Our code and data are available at https://gitlab.cs.washington.edu/abdincer/ad-ae. Supplementary information Supplementary data are available at Bioinformatics online.
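
The two-network setup described above is commonly written as a pair of coupled objectives. The formulation below is a generic adversarial-deconfounding sketch, assumed for illustration rather than quoted from the paper: the encoder $f$, decoder $g$, adversary $h$, confounder label $c$, trade-off weight $\lambda$ and the specific loss terms are all hypothetical notation.

```latex
% Autoencoder step: reconstruct x well while making the adversary fail.
\min_{\theta_f,\,\theta_g}\;
  \mathbb{E}_{x}\!\left[\, \lVert x - g(f(x)) \rVert_2^2 \,\right]
  \;-\; \lambda\, \mathbb{E}_{(x,c)}\!\left[\, \ell_{\mathrm{adv}}\big(c,\; h(f(x))\big) \,\right]

% Adversary step: predict the confounder c from the embedding f(x).
\min_{\theta_h}\;
  \mathbb{E}_{(x,c)}\!\left[\, \ell_{\mathrm{adv}}\big(c,\; h(f(x))\big) \,\right]
```

Alternating these two steps pushes the embedding $f(x)$ toward retaining what is needed for reconstruction while removing what the adversary can use to recover $c$; a larger $\lambda$ trades reconstruction quality for stronger confounder removal.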
