Identification of deregulation mechanisms specific to cancer subtypes

In many cancers, mechanisms of gene regulation can be severely altered. Identifying deregulated genes, which do not follow the regulation processes that exist between transcription factors and their target genes, is important for better understanding the development of the disease. We propose a methodology to detect deregulation mechanisms with a particular focus on cancer subtypes. This strategy is based on the comparison between tumoral and healthy cells. First, we use gene expression data from healthy cells to infer a reference gene regulatory network. Then, we compare it with gene expression levels in tumor samples to detect deregulated target genes. Finally, we measure the ability of each transcription factor to explain these deregulations. We apply our method to a public bladder cancer data set derived from The Cancer Genome Atlas project and confirm that it captures hallmarks of cancer subtypes. We also show that it enables the discovery of new potential biomarkers.


Identification of Deregulated Transcription Factors Involved in Specific Bladder Cancer Subtypes

10.29007/v7qj ◽ 2020

Author(s): Magali Champion ◽ Julien Chiquet ◽ Pierre Neuvial ◽ Mohamed Elati ◽ François Radvanyi ◽ ...

Keyword(s): Gene Expression ◽ Bladder Cancer ◽ Transcription Factor ◽ Transcription Factors ◽ Target Genes ◽ The Cancer Genome Atlas ◽ Reference Network ◽ Data Set ◽ Cancer Subtypes ◽ Cancer Data

Comparison between tumoral and healthy cells may reveal abnormal regulation behaviors between a transcription factor and the genes it regulates, even when the transcription factor itself is not differentially expressed. We propose a methodology for the identification of transcription factors involved in the deregulation of genes in tumoral cells. This strategy is based on the inference of a reference gene regulatory network that connects transcription factors to their downstream targets using gene expression data. Gene expression levels in tumor samples are then compared to this reference network to detect deregulated target genes. Finally, a linear model is used to measure the ability of each transcription factor to explain these deregulations. We assess the performance of our method by numerical experiments on a public bladder cancer data set derived from The Cancer Genome Atlas project. We identify genes known for their implication in the development of specific bladder cancer subtypes as well as new potential biomarkers.
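The detection step described above can be illustrated with a minimal sketch. It is not the authors' inference procedure: it assumes a single known regulator per target, fits an ordinary least-squares line on healthy samples as the "reference", and flags tumor samples whose target expression deviates strongly from what the regulator predicts. The function names, the `z_cut` threshold, and all expression values are hypothetical.

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (one regulator, one target)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def deregulated_samples(tf_ref, target_ref, tf_tumor, target_tumor, z_cut=3.0):
    """Flag tumor samples whose target departs from the healthy reference fit."""
    a, b = fit_line(tf_ref, target_ref)
    # Residual spread in healthy samples sets the scale for "abnormal".
    sigma = stdev(y - (a + b * x) for x, y in zip(tf_ref, target_ref))
    return [abs(y - (a + b * x)) / sigma > z_cut
            for x, y in zip(tf_tumor, target_tumor)]

# Synthetic expression values (assumed, for illustration only).
tf_healthy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target_healthy = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
tf_tumor = [2.0, 4.0, 5.0]
target_tumor = [4.0, 8.1, 2.0]  # third sample: regulator is high, target is not

flags = deregulated_samples(tf_healthy, target_healthy, tf_tumor, target_tumor)
print(flags)
```

In the third tumor sample the regulator lies within its healthy range, so a differential-expression test on the regulator alone would not single it out; only the regulator-target relationship is broken.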


Mining The Cancer Genome Atlas gene expression data for lineage markers in distinguishing bladder urothelial carcinoma and prostate adenocarcinoma

Scientific Reports ◽ 10.1038/s41598-021-85993-x ◽ 2021 ◽ Vol 11 (1)

Author(s): Ewe Seng Ch’ng

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ The Cancer Genome Atlas ◽ Relative Importance ◽ Expression Data ◽ Gene Expressions ◽ Urothelial Carcinomas ◽ Cancer Genome Atlas ◽ Lineage Markers ◽ Genome Atlas

Distinguishing bladder urothelial carcinomas from prostate adenocarcinomas for poorly differentiated carcinomas derived from the bladder neck entails the use of a panel of lineage markers. Publicly available The Cancer Genome Atlas (TCGA) gene expression data provide an avenue to examine the utility of these markers. This study aimed to verify the expression of urothelial and prostate lineage markers in the respective carcinomas and to determine the relative importance of these markers in making this distinction. Gene expression values for these markers were downloaded from the TCGA Pan-Cancer database for bladder and prostate carcinomas. Differential gene expression of these markers was analyzed. Standard linear discriminant analyses were applied to establish the relative importance of these markers in lineage determination and to construct the model that best makes the distinction. This study shows that all urothelial lineage genes except for the gene for uroplakin III were significantly expressed in bladder urothelial carcinomas (p < 0.001). In descending order of importance to distinguish from prostate adenocarcinomas, genes for uroplakin II, S100P, GATA3 and thrombomodulin had high discriminant loadings (> 0.3). All prostate lineage genes were significantly expressed in prostate adenocarcinomas (p < 0.001). In descending order of importance to distinguish from bladder urothelial carcinomas, genes for NKX3.1, prostate specific antigen (PSA), prostate-specific acid phosphatase, prostein, and prostate-specific membrane antigen had high discriminant loadings (> 0.3). A combination of gene expression values for uroplakin II, S100P, NKX3.1 and PSA approached 100% accuracy in tumor classification in both the training and validation sets. Mining of gene expression data thus shows that a combination of four lineage markers helps distinguish between bladder urothelial carcinomas and prostate adenocarcinomas.
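As a rough illustration of how a linear discriminant separates two tumor classes from marker expression, the sketch below implements a two-class Fisher discriminant for two features in plain Python. It is an assumption-laden toy, not the study's analysis: the data are synthetic, the two features merely stand in for one urothelial and one prostate marker, and the weights it returns are discriminant coefficients rather than the discriminant loadings reported in the abstract.

```python
from statistics import mean

def fisher_lda(class_a, class_b):
    """Two-class Fisher discriminant for 2 features: w = Sw^-1 (mu_a - mu_b)."""
    def centroid(rows):
        return [mean(r[0] for r in rows), mean(r[1] for r in rows)]

    def scatter(rows, mu):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for r in rows:
            d = [r[0] - mu[0], r[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    mu_a, mu_b = centroid(class_a), centroid(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    # Pooled within-class scatter matrix and its closed-form 2x2 inverse.
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    # Decision threshold: projection of the midpoint between the centroids.
    mid = [(mu_a[0] + mu_b[0]) / 2, (mu_a[1] + mu_b[1]) / 2]
    return w, w[0] * mid[0] + w[1] * mid[1]

def classify(sample, w, threshold):
    score = w[0] * sample[0] + w[1] * sample[1]
    return "bladder" if score > threshold else "prostate"

# Synthetic (urothelial marker, prostate marker) expression pairs.
bladder = [(8.0, 1.0), (7.5, 1.5), (9.0, 0.5), (8.5, 2.0)]
prostate = [(1.0, 9.0), (2.0, 8.0), (1.5, 9.5), (0.5, 8.5)]

w, threshold = fisher_lda(bladder, prostate)
print(classify((7.0, 2.0), w, threshold))
```

The sign of each weight indicates which class a marker pushes toward; comparing magnitudes is only meaningful when the features are on comparable scales.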


Explainable autoencoder-based representation learning for gene expression data

10.1101/2021.12.21.473742 ◽ 2021

Author(s): Yang Yu ◽ Pathum Kossinna ◽ Wenyuan Liao ◽ Qingrun Zhang

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Hidden Variables ◽ Representation Learning ◽ The Cancer Genome Atlas ◽ Expression Data ◽ Rna Seq ◽ Gene Expression Data Analysis ◽ Cancer Genome Atlas ◽ Modern Machine

Modern machine learning methods have been extensively utilized in gene expression data analysis. In particular, autoencoders (AE) have been employed in processing noisy and heterogeneous RNA-Seq data. However, AEs usually produce "black-box" hidden variables that are difficult to interpret, hindering downstream experimental validation and clinical translation. To bridge the gap between complicated models and biological interpretations, we developed a tool, XAE4Exp (eXplainable AutoEncoder for Expression data), which integrates AE and SHapley Additive exPlanations (SHAP), a flagship technique in the field of eXplainable AI (XAI). It quantitatively evaluates the contribution of each gene to the hidden structure learned by an AE, substantially improving the explainability of AE outcomes. By applying XAE4Exp to The Cancer Genome Atlas (TCGA) breast cancer gene expression data, we identified genes that are not differentially expressed, and pathways in various cancer-related classes. This tool will enable researchers and practitioners to analyze high-dimensional expression data intuitively, paving the way towards broader uses of deep learning.


The Analysis of Gene Expression Data Incorporating Tumor Purity Information

Frontiers in Genetics ◽ 10.3389/fgene.2021.642759 ◽ 2021 ◽ Vol 12

Author(s): Seungjun Ahn ◽ Tyler Grimes ◽ Somnath Datta

Keyword(s): Gene Expression ◽ Tumor Cells ◽ Gene Expression Data ◽ The Cancer Genome Atlas ◽ Data Sets ◽ Expression Data ◽ Tumor Purity ◽ Robust Model ◽ Differential Network ◽ Cancer Genome Atlas

The tumor microenvironment is composed of tumor cells, stroma cells, immune cells, blood vessels, and other associated non-cancerous cells. Gene expression measurements on tumor samples are an average over cells in the microenvironment. However, research questions often seek answers about tumor cells rather than the surrounding non-tumor tissue. Previous studies have suggested that the tumor purity (TP), the proportion of tumor cells in a solid tumor sample, has a confounding effect on differential expression (DE) analysis of high vs. low survival groups. We investigate three ways of incorporating the TP information in two statistical methods used for analyzing gene expression data, namely, differential network (DN) analysis and DE analysis. Analysis 1 ignores the TP information completely, Analysis 2 uses a truncated sample by removing the low TP samples, and Analysis 3 uses TP as a covariate in the underlying statistical models. We use three gene expression data sets related to three different cancers from The Cancer Genome Atlas (TCGA) for our investigation. The networks from Analysis 2 show a greater amount of differential connectivity between the two networks than those from Analysis 1 in all three cancer data sets. Similarly, Analysis 1 identified more differentially expressed genes than Analysis 2. Results of DN and DE analyses using Analysis 3 were mostly consistent with those of Analysis 1 across the three cancers. However, Analysis 3 identified additional cancer-related genes in both DN and DE analyses. Our findings suggest that using TP as a covariate in a linear model is appropriate for DE analysis, but a more robust model is needed for DN analysis. However, because true DN or DE patterns are not known for the empirical data sets, simulated data sets can be used to study the statistical properties of these methods in future studies.
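The three ways of handling tumor purity can be sketched for a single gene with a small least-squares example. This is only an illustration under stated assumptions: the data are synthetic and constructed so that expression depends on purity alone, the 0.5 purity cutoff for the truncated sample is arbitrary, and the `solve` and `ols` helpers are hypothetical stand-ins for whatever statistical models the study actually used.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients; each design row starts with the intercept 1.0."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Synthetic data: group 1 samples happen to be purer; expression depends
# on purity only, so any "group effect" is pure confounding.
group = [0, 0, 0, 0, 1, 1, 1, 1]
purity = [0.3, 0.4, 0.5, 0.6, 0.6, 0.7, 0.8, 0.9]
expr = [1.0 + 4.0 * tp for tp in purity]

# Analysis 1: ignore purity.
a1 = ols([[1.0, g] for g in group], expr)

# Analysis 2: drop low-purity samples (cutoff of 0.5 is an assumption).
keep = [i for i, tp in enumerate(purity) if tp >= 0.5]
a2 = ols([[1.0, group[i]] for i in keep], [expr[i] for i in keep])

# Analysis 3: include purity as a covariate.
a3 = ols([[1.0, g, tp] for g, tp in zip(group, purity)], expr)

print(round(a1[1], 3), round(a2[1], 3), round(a3[1], 3))
```

In this toy setting the group coefficient is spurious in Analyses 1 and 2 and vanishes once purity enters the model, which is the intuition behind treating purity as a covariate in DE analysis.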


Abstract A46: A comprehensive genomic pan-cancer analysis comparing males and females using The Cancer Genome Atlas gene expression data

10.1158/1557-3265.pmccavuln16-a46 ◽ 2017

Author(s): YuanYuan Li ◽ David M. Umbach ◽ Leping Li

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Cancer Genome ◽ The Cancer Genome Atlas ◽ Expression Data ◽ Cancer Genome Atlas ◽ Males And Females ◽ Pan Cancer ◽ Genome Atlas


XGRN: Reconstruction of Biological Networks Based on Boosted Trees Regression

Computation ◽ 10.3390/computation9040048 ◽ 2021 ◽ Vol 9 (4) ◽ pp. 48

Author(s): Georgios N. Dimitrakopoulos

Keyword(s): Gene Expression ◽ Regression Model ◽ Gene Expression Data ◽ Biological Networks ◽ High Performance ◽ Regulatory Networks ◽ Target Genes ◽ Biological Information ◽ Expression Data ◽ Gene Regulatory

In Systems Biology, the complex relationships between different entities in the cells are modeled and analyzed using networks. Towards this aim, a rich variety of gene regulatory network (GRN) inference algorithms has been developed in recent years. However, most algorithms rely solely on gene expression data to reconstruct the network. Due to possible expression profile similarity, predictions can contain connections between biologically unrelated genes. Therefore, previously known biological information, such as experimentally validated interactions between transcription factors and target genes, should also be considered by computational methods to obtain more consistent results. In this work, we propose XGBoost for gene regulatory networks (XGRN), a supervised algorithm which combines gene expression data with previously known interactions for GRN inference. The key idea of our method is to train a regression model for each known interaction of the network and then use this model to predict new interactions. The regression is performed by XGBoost, a state-of-the-art algorithm using an ensemble of decision trees. In detail, XGRN learns a regression model based on the gene expression of the two interactors and then provides predictions using as input the gene expression of other candidate interactors. Application to benchmark data sets and a large real single-cell RNA-Seq experiment resulted in high performance compared to other unsupervised and supervised methods, demonstrating the ability of XGRN to provide reliable predictions.


Case-Based Retrieval Framework for Gene Expression Data

Cancer Informatics ◽ 10.4137/cin.s22371 ◽ 2015 ◽ Vol 14 ◽ pp. CIN.S22371 ◽ Cited By ~ 4

Author(s): Ali Anaissi ◽ Madhu Goyal ◽ Daniel R. Catchpoole ◽ Ali Braytee ◽ Paul J. Kennedy

Keyword(s): Gene Expression ◽ Prostate Cancer ◽ Gene Expression Data ◽ Childhood Leukemia ◽ Data Sets ◽ Expression Data ◽ Data Set ◽ Cancer Data ◽ Leukemia Data ◽ Case Based

Background: The process of retrieving similar cases in a case-based reasoning system is considered a big challenge for gene expression data sets. The huge number of gene expression values generated by microarray technology leads to complex data sets and similarity measures for high-dimensional data are problematic. Hence, gene expression similarity measurements require numerous machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process.
Methods: This article proposes a case-based retrieval framework that uses a k-nearest-neighbor classifier with a weighted-feature-based similarity to retrieve previously treated patients based on their gene expression profiles.
Results: The proposed methodology is validated on several data sets: a childhood leukemia data set collected from The Children's Hospital at Westmead, as well as the Colon cancer, the National Cancer Institute (NCI), and the Prostate cancer data sets. Results obtained by the proposed framework in retrieving patients of the data sets who are similar to new patients are as follows: 96% accuracy on the childhood leukemia data set, 95% on the NCI data set, 93% on the Colon cancer data set, and 98% on the Prostate cancer data set.
Conclusion: The designed case-based retrieval framework is an appropriate choice for retrieving previous patients who are similar to a new patient, on the basis of their gene expression data, for better diagnosis and treatment of childhood leukemia. Moreover, this framework can be applied to other gene expression data sets using some or all of its steps.
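The retrieval idea, k-nearest neighbors under a feature-weighted similarity, can be sketched as follows. The abstract does not say how the feature weights are obtained, so the sketch simply takes them as given; the weighted Euclidean distance, the function names, and the toy profiles are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def weighted_distance(a, b, weights):
    """Euclidean distance in which each gene contributes according to its weight."""
    return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def retrieve(query, cases, weights, k=3):
    """Return the k stored (profile, label) cases closest to the query profile."""
    return sorted(cases, key=lambda c: weighted_distance(query, c[0], weights))[:k]

def vote(neighbours):
    """Majority label among the retrieved cases."""
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Toy case base: genes 1 and 2 separate the classes, gene 3 is noise.
cases = [
    ([1.0, 1.0, 9.0], "A"),
    ([1.2, 0.9, 0.1], "A"),
    ([0.9, 1.1, 9.0], "A"),
    ([5.0, 5.0, 0.2], "B"),
    ([5.1, 4.8, 0.0], "B"),
    ([4.9, 5.2, 0.1], "B"),
]
query = [1.1, 1.0, 0.1]

uniform = vote(retrieve(query, cases, [1.0, 1.0, 1.0]))
weighted = vote(retrieve(query, cases, [1.0, 1.0, 0.0]))
print(uniform, weighted)
```

With uniform weights the noisy third gene pulls two class-B cases into the neighborhood; down-weighting it restores the expected retrieval, which is why feature weighting matters in high-dimensional expression data.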


A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data

BMC Genomics ◽ 10.1186/s12864-017-3906-0 ◽ 2017 ◽ Vol 18 (1) ◽ Cited By ~ 49

Author(s): Yuanyuan Li ◽ Kai Kang ◽ Juno M. Krahn ◽ Nicole Croutwater ◽ Kevin Lee ◽ ...

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Cancer Genome ◽ Cancer Classification ◽ The Cancer Genome Atlas ◽ Expression Data ◽ Cancer Genome Atlas ◽ Pan Cancer ◽ Genome Atlas


One-Step Robust Low-Rank Subspace Segmentation for Tumor Sample Clustering

Computational Intelligence and Neuroscience ◽ 10.1155/2021/9990297 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-16

Author(s): Jian Liu ◽ Yuhu Cheng ◽ Xuesong Wang ◽ Shuguang Ge

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Low Rank ◽ Expression Data ◽ Clustering Methods ◽ Data Set ◽ Tumor Sample ◽ Cancer Subtypes ◽ Subspace Segmentation ◽ One Step

Clustering of tumor samples can help identify cancer types and discover new cancer subtypes, which is essential for effective cancer treatment. Although many traditional clustering methods have been proposed for tumor sample clustering, advanced algorithms with better performance are still needed. Low-rank subspace clustering has become a popular approach in recent years. In this paper, we propose a novel one-step robust low-rank subspace segmentation method (ORLRS) for clustering tumor samples. For a gene expression data set, we seek its lowest-rank representation matrix and the noise matrix. By imposing a discrete constraint on the low-rank matrix, ORLRS learns the cluster indicators of subspaces directly, without performing spectral clustering, i.e., it performs the clustering task in one step. To improve the robustness of the method, a capped norm is adopted to remove extreme outliers in the noise matrix. Furthermore, we develop an efficient algorithm to solve the ORLRS problem. Experiments on several tumor gene expression data sets demonstrate the effectiveness of ORLRS.


ComHub: Community predictions of hubs in gene regulatory networks

10.1101/840959 ◽ 2019

Author(s): Julia Åkesson ◽ Zelmina Lubovac-Pilav ◽ Rasmus Magnusson ◽ Mika Gustafsson

Keyword(s): Gene Expression ◽ Gene Regulatory Networks ◽ Drug Targets ◽ Regulatory Networks ◽ Network Inference ◽ Target Genes ◽ Data Sets ◽ Data Set ◽ Gene Regulatory ◽ Inference Methods

Summary: Hub transcription factors, regulating many target genes in gene regulatory networks (GRNs), play important roles as disease regulators and potential drug targets. However, while numerous methods have been developed to predict individual regulator-gene interactions from gene expression data, few methods focus on inferring these hubs. We have developed ComHub, a tool to predict hubs in GRNs. ComHub makes a community prediction of hubs by averaging over predictions by a compendium of network inference methods. Benchmarking ComHub against the DREAM5 challenge data and an independent data set of human gene expression showed robust performance of ComHub across all data sets. Lastly, we implemented ComHub both to work with predefined networks and to perform standard network inference, which we believe will make it generally applicable.
Availability: Code is available at https://gitlab.com/Gustafsson-lab/[email protected], [email protected]
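The community-averaging idea can be sketched in a few lines. This is not the ComHub code or its interface: it assumes each inference method has already been reduced to a list of predicted (regulator, target) edges, uses the regulator's out-degree as the hub score, and averages that score across methods. The method names and edges are made up.

```python
from collections import Counter

def out_degree(edges):
    """Number of predicted targets per regulator in one method's edge list."""
    return Counter(regulator for regulator, _ in edges)

def community_hub_scores(predictions):
    """Average each regulator's out-degree over all inference methods."""
    degrees = [out_degree(edges) for edges in predictions.values()]
    regulators = set().union(*degrees)
    scores = {r: sum(d[r] for d in degrees) / len(degrees) for r in regulators}
    # Highest average out-degree first; ties broken by name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical top-edge predictions from three inference methods.
predictions = {
    "method_a": [("TF1", "g1"), ("TF1", "g2"), ("TF2", "g1")],
    "method_b": [("TF1", "g1"), ("TF1", "g3"), ("TF1", "g4"), ("TF3", "g2")],
    "method_c": [("TF2", "g5"), ("TF1", "g2")],
}

ranking = community_hub_scores(predictions)
print(ranking[0])
```

Averaging over methods dampens the idiosyncrasies of any single inference algorithm; a regulator missing from one method's edge list simply contributes an out-degree of zero for that method.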
