Smoothing Gene Expression Data with Network Information Improves Consistency of Regulated Genes

Gene set analysis methods have become a widely used tool for including prior biological knowledge in the statistical analysis of gene expression data. Advantages of these methods include increased sensitivity, easier interpretation and more conformity in the results. However, gene set methods do not employ all the available information about gene relations. Genes are arranged in complex networks where the network distances contain detailed information about inter-gene dependencies. We propose a method that uses gene networks to smooth gene expression data with the aim of reducing the number of false positives and identifying important subnetworks. Gene dependencies are extracted from the network topology and are used to smooth genewise test statistics. To find the optimal degree of smoothing, we propose using a criterion that considers the correlation between the network and the data. The network smoothing is shown to improve the ability to identify important genes in simulated data. Applied to a real data set, the smoothing accentuates parts of the network with a high density of differentially expressed genes.
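
As an illustration of the smoothing idea, the following minimal Python sketch spreads genewise test statistics over a network using a regularized graph Laplacian. The kernel choice, the function name `smooth_statistics` and the parameter `alpha` are assumptions made for this example; the abstract does not specify the smoothing operator or the criterion used to pick the degree of smoothing.

```python
import numpy as np

def smooth_statistics(adjacency, stats, alpha=1.0):
    """Smooth genewise statistics over an undirected gene network.

    Hypothetical helper: solves (I + alpha * L) s = z, where L is the
    graph Laplacian and z the raw statistics. alpha = 0 leaves the
    statistics unchanged; larger alpha shares more signal with neighbors.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    stats = np.asarray(stats, dtype=float)
    # Graph Laplacian: degree matrix minus adjacency matrix.
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    system = np.eye(len(stats)) + alpha * laplacian
    return np.linalg.solve(system, stats)

# Three genes on a path 0 - 1 - 2; only gene 0 has a large raw statistic.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
raw = np.array([3.0, 0.0, 0.0])
smoothed = smooth_statistics(path, raw, alpha=1.0)
```

With this kernel the sum of the statistics is preserved while isolated peaks are damped, so a gene surrounded by unaffected neighbors loses weight relative to a gene inside a dense group of affected genes.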

graphsim: An R package for simulating gene expression data from graph structures of biological pathways

10.1101/2020.03.02.972471 ◽

2020 ◽

Author(s):

S. Thomas Kelly ◽

Michael A. Black

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Networks ◽

Regulatory Networks ◽

Large Scale ◽

Simulated Data ◽

R Package ◽

Biological Pathways ◽

Graph Structure ◽

Expression Data

Summary: Transcriptomic analysis is used to capture the molecular state of a cell or sample in many biological and medical applications. In addition to identifying alterations in activity at the level of individual genes, understanding changes in the gene networks that regulate fundamental biological mechanisms is also an important objective of molecular analysis. As a result, databases that describe biological pathways are increasingly used to assist with the interpretation of results from large-scale genomics studies. Incorporating information from biological pathways and gene regulatory networks into a genomic data analysis is a popular strategy, and there are many methods that provide this functionality for gene expression data. When developing or comparing such methods, it is important to gain an accurate assessment of their performance. Simulation-based validation studies are frequently used for this. This necessitates the use of simulated data that correctly accounts for pathway relationships and correlations. Here we present a versatile statistical framework to simulate correlated gene expression data from biological pathways, by sampling from a multivariate normal distribution derived from a graph structure. This procedure has been released as the graphsim R package on CRAN and GitHub (https://github.com/TomKellyGenetics/graphsim) and is compatible with any graph structure that can be described using the igraph package. This package allows the simulation of biological pathways from a graph structure based on a statistical model of gene expression.
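
The general idea of sampling correlated expression from a graph can be sketched in Python as follows. This is not the graphsim API (graphsim is an R package and its own derivation of the covariance matrix may differ); the functions `graph_covariance` and `simulate_expression`, and the choice of a Laplacian-based precision matrix, are assumptions made purely for illustration.

```python
import numpy as np

def graph_covariance(adjacency, alpha=1.0):
    """Build a positive-definite covariance from an undirected graph.

    Assumption for this sketch: the precision matrix is I + alpha * L,
    with L the graph Laplacian, so genes that are close in the graph
    end up more strongly correlated.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    precision = np.eye(adjacency.shape[0]) + alpha * laplacian
    cov = np.linalg.inv(precision)
    # Remove tiny numerical asymmetries introduced by the inversion.
    return (cov + cov.T) / 2.0

def simulate_expression(adjacency, n_samples, alpha=1.0, mean=0.0, seed=None):
    """Draw samples (rows) of gene expression (columns) from N(mean, cov)."""
    cov = graph_covariance(adjacency, alpha)
    rng = np.random.default_rng(seed)
    mu = np.full(cov.shape[0], mean, dtype=float)
    return rng.multivariate_normal(mu, cov, size=n_samples)

# A linear pathway of four genes: 0 - 1 - 2 - 3.
pathway = np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]])
expression = simulate_expression(pathway, n_samples=50, seed=1)
```

In this toy pathway the covariance between two genes decays with their distance along the path, which is the kind of pathway-aware correlation structure a simulation study needs.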

Efficient Proximal Gradient Algorithm for Inference of Differential Gene Networks

10.1101/450130 ◽

2018 ◽

Author(s):

Chen Wang ◽

Feng Gao ◽

Georgios B. Giannakis ◽

Gennaro D’Urso ◽

Xiaodong Cai

Keyword(s):

Gene Expression ◽

Computer Simulations ◽

Gene Expression Data ◽

Gene Networks ◽

Gene Set Enrichment Analysis ◽

Gradient Algorithm ◽

Superior Performance ◽

Expression Data ◽

Gene Set ◽

Proximal Gradient Algorithm

Background: Gene networks in living cells can change depending on conditions such as different environments, tissue types, disease states, and development stages. Identifying the differential changes in gene networks is very important for understanding the molecular basis of various biological processes. While existing algorithms can be used to infer two gene networks separately from gene expression data under two different conditions, and then to identify network changes, such an approach does not exploit the data jointly, and it is thus suboptimal. A more desirable approach is to infer the two gene networks jointly, which can yield improved estimates of network changes. Results: In this paper, we developed a proximal gradient algorithm for differential network (ProGAdNet) inference that jointly infers two gene networks under different conditions and then identifies changes in the network structure. Computer simulations demonstrated that ProGAdNet outperformed existing algorithms in terms of inference accuracy, and was much faster than a similar approach for joint inference of gene networks. Gene expression data of breast tumors and normal tissues in the TCGA database were analyzed with ProGAdNet, which revealed that 268 genes were involved in the changed network edges. Gene set enrichment analysis of this set of 268 genes identified a number of gene sets related to breast cancer or other types of cancer, which corroborated that the gene set identified by ProGAdNet was very informative about the cancer disease status. A software package implementing ProGAdNet and the computer simulations is available upon request. Conclusion: With its superior performance over existing algorithms, ProGAdNet provides a valuable tool for finding changes in gene networks, which may aid the discovery of gene-gene interactions that change under different conditions.
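
For readers unfamiliar with proximal gradient methods, the sketch below shows the basic iteration on a plain lasso problem. It is not the ProGAdNet objective, which the abstract does not spell out; the function names and the single l1 penalty are assumptions chosen to keep the example small. Each step takes a gradient step on the smooth least-squares term and then applies the proximal operator of the penalty, which for the l1 norm is soft-thresholding.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_objective(X, y, b, lam):
    """0.5 * ||y - X b||^2 + lam * ||b||_1."""
    return 0.5 * np.sum((y - X @ b) ** 2) + lam * np.sum(np.abs(b))

def proximal_gradient_lasso(X, y, lam, n_iter=500):
    """Minimize the lasso objective with a fixed-step proximal gradient loop."""
    # Step size 1/L, where L is the Lipschitz constant of the gradient
    # of the least-squares term (squared spectral norm of X).
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)                          # gradient step
        b = soft_threshold(b - step * grad, step * lam)   # proximal step
    return b

# With an identity design the solution is soft-thresholding of y itself.
b_hat = proximal_gradient_lasso(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0)
```

A joint method for two conditions would add further penalty terms, for example on the difference between the two coefficient vectors, and would need the corresponding proximal operator in place of plain soft-thresholding.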

Clustering of gene expression profiles: creating initialization-independent clusterings by eliminating unstable genes

Journal of Integrative Bioinformatics ◽

10.1515/jib-2010-134 ◽

2010 ◽

Vol 7 (3) ◽

Author(s):

Wim De Mulder ◽

Martin Kuiper ◽

René Boel

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Profiles ◽

Clustering Algorithms ◽

Gene Expression Profiles ◽

Biological Data ◽

Biological Knowledge ◽

Expression Data ◽

Data Set ◽

Cluster Membership

Summary: Clustering is an important approach in the analysis of biological data, and often a first step to identify interesting patterns of coexpression in gene expression data. Because of the high complexity and diversity of gene expression data, many genes cannot be easily assigned to a cluster, but even if the dissimilarity of these genes with all other gene groups is large, they will finally be forced to become members of a cluster. In this paper we show how to detect such elements, called unstable elements. We have developed an approach for iterative clustering algorithms in which unstable elements are deleted, making the iterative algorithm less dependent on initial centers. Although the approach is unsupervised, it is less likely that the clusters into which the reduced data set is subdivided contain false positives. This clustering yields a more differentiated approach for biological data, since the cluster analysis is divided into two parts: the pruned data set is divided into highly consistent clusters in an unsupervised way and the removed, unstable elements for which no meaningful cluster exists in unsupervised terms can be given a cluster with the use of biological knowledge and information about the likelihood of cluster membership. We illustrate our framework on both an artificial and real biological data set.
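
One possible way to flag initialization-dependent elements is sketched below in Python. The abstract does not describe the authors' detection rule, so everything here is an assumption for illustration: the helper names, the use of k-means, and the rule that an element is unstable when its co-membership with most other elements changes between runs started from different initial centers.

```python
import numpy as np

def kmeans_labels(data, init_centers, n_iter=50):
    """Plain Lloyd iterations from the given initial centers."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):  # keep the old center if a cluster empties
                centers[k] = data[labels == k].mean(axis=0)
    return labels

def unstable_elements(data, inits, threshold=0.5):
    """Indices of elements whose cluster co-membership depends on the start.

    Hypothetical rule: a pair is inconsistent if it is clustered together
    in some runs but not in others; an element is unstable if more than
    `threshold` of its pairs are inconsistent.
    """
    data = np.asarray(data, dtype=float)
    runs = [kmeans_labels(data, centers) for centers in inits]
    together = np.mean([lab[:, None] == lab[None, :] for lab in runs], axis=0)
    inconsistent = (together > 0) & (together < 1)
    return np.where(inconsistent.mean(axis=1) > threshold)[0]

# Two tight groups plus one profile halfway between them (index 6).
profiles = [[0.0], [0.1], [-0.1], [10.0], [10.1], [9.9], [5.0]]
starts = [[[0.0], [6.0]], [[4.0], [10.0]]]
flagged = unstable_elements(profiles, starts)
```

Removing the flagged elements and re-clustering the remainder gives clusters that no longer depend on which of the two starts was used in this toy example.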

COMBINING MICROARRAYS AND BIOLOGICAL KNOWLEDGE FOR ESTIMATING GENE NETWORKS VIA BAYESIAN NETWORKS

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972000400048x ◽

2004 ◽

Vol 02 (01) ◽

pp. 77-98 ◽

Cited By ~ 64

Author(s):

SEIYA IMOTO ◽

TOMOYUKI HIGUCHI ◽

TAKAO GOTO ◽

KOUSUKE TASHIRO ◽

SATORU KUHARA ◽

...

Keyword(s):

Gene Expression ◽

Bayesian Networks ◽

Gene Expression Data ◽

Protein Interactions ◽

Gene Networks ◽

Estimation Method ◽

Microarray Gene Expression Data ◽

Biological Knowledge ◽

Expression Data ◽

Microarray Gene Expression

We propose a statistical method for estimating a gene network based on Bayesian networks from microarray gene expression data together with biological knowledge including protein-protein interactions, protein-DNA interactions, binding site information, existing literature and so on. Microarray data do not contain enough information for constructing gene networks accurately in many cases. Our method adds biological knowledge to the estimation method of gene networks under a Bayesian statistical framework, and also controls the trade-off between microarray information and biological knowledge automatically. We conduct Monte Carlo simulations to show the effectiveness of the proposed method. We analyze Saccharomyces cerevisiae gene expression data as an application.

Inferring Differential Networks by Integrating Gene Expression Data With Additional Knowledge

Frontiers in Genetics ◽

10.3389/fgene.2021.760155 ◽

2021 ◽

Vol 12 ◽

Author(s):

Chen Liu ◽

Dehan Cai ◽

WuCha Zeng ◽

Yun Huang

Keyword(s):

Gene Expression ◽

Prostate Cancer ◽

Ovarian Cancer ◽

Gene Expression Data ◽

Gene Networks ◽

Biological Knowledge ◽

Platinum Resistance ◽

Expression Data ◽

Androgen Resistance ◽

Differential Networks

Evidence increasingly indicates the involvement of gene network rewiring in disease development and cell differentiation. With the accumulation of high-throughput gene expression data, it is now possible to infer the changes of gene networks between two different states or cell types via computational approaches. However, the distribution diversity of multi-platform gene expression data and the sparseness and high noise rate of single-cell RNA sequencing (scRNA-seq) data raise new challenges for existing differential network estimation methods. Furthermore, most existing methods rely purely on gene expression data, and ignore the additional information provided by various existing biological knowledge. In this study, to address these challenges, we propose a general framework, named weighted joint sparse penalized D-trace model (WJSDM), to infer differential gene networks by integrating multi-platform gene expression data and multiple sources of prior biological knowledge. Firstly, a non-paranormal graphical model is employed to tackle gene expression data with missing values. Then we propose a weighted group bridge penalty to integrate multi-platform gene expression data and various existing biological knowledge. Experimental results on synthetic data demonstrate the effectiveness of our method in inferring differential networks. We apply our method to the gene expression data of ovarian cancer and the scRNA-seq data of circulating tumor cells of prostate cancer, and infer the differential network associated with platinum resistance of ovarian cancer and anti-androgen resistance of prostate cancer. By analyzing the estimated differential networks, we gain some important biological insights about the mechanisms underlying platinum resistance of ovarian cancer and anti-androgen resistance of prostate cancer.

CancerInSilico: An R/Bioconductor package for combining mathematical and statistical modeling to simulate time course bulk and single cell gene expression data in cancer

10.1101/328807 ◽

2018 ◽

Author(s):

Thomas D Sherman ◽

Luciane T Kagohara ◽

Raymon Cao ◽

Raymond Cheng ◽

Matthew Satriano ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Gene Expression Data ◽

Time Course ◽

Ground Truth ◽

Real Data ◽

Cellular Systems ◽

Expression Data ◽

Bioconductor Package ◽

Data Set

Abstract: Bioinformatics techniques to analyze time course bulk and single cell omics data are advancing. The absence of a known ground truth of the dynamics of molecular changes challenges benchmarking their performance on real data. Realistic simulated time-course datasets are essential to assess the performance of time course bioinformatics algorithms. We develop an R/Bioconductor package, CancerInSilico, to simulate bulk and single cell transcriptional data from a known ground truth obtained from mathematical models of cellular systems. This package contains a general R infrastructure for running cell-based models and simulating gene expression data based on the model states. We show how to use this package to simulate a gene expression data set and consequently benchmark analysis methods on this data set with a known ground truth. The package is freely available via Bioconductor: http://bioconductor.org/packages/CancerInSilico/

Gene Set Correlation Analysis and Visualization Using Gene Expression Data

Current Bioinformatics ◽

10.2174/1574893615999200629124444 ◽

2020 ◽

Vol 15 ◽

Author(s):

Chen-An Tsai ◽

James J. Chen

Keyword(s):

Gene Expression ◽

Correlation Analysis ◽

Gene Expression Data ◽

Differentially Expressed Gene ◽

Differentially Expressed ◽

Superior Performance ◽

Expression Data ◽

Gene Set ◽

Gene Sets ◽

Set Correlation

Background: Gene set enrichment analyses (GSEA) provide a useful and powerful approach to identify differentially expressed gene sets with prior biological knowledge. Several GSEA algorithms have been proposed to perform enrichment analyses on groups of genes. However, many of these algorithms have focused on identification of differentially expressed gene sets in a given phenotype. Objective: In this paper, we propose a gene set analytic framework, Gene Set Correlation Analysis (GSCoA), that simultaneously measures within- and between-gene-set variation to identify sets of genes enriched for differential expression and highly co-related pathways. Methods: We apply co-inertia analysis to the comparisons of cross-gene sets in gene expression data to measure the co-structure of expression profiles in pairs of gene sets. Co-inertia analysis (CIA) is a multivariate method to identify trends or co-relationships in multiple datasets that contain the same samples. The objective of CIA is to seek ordinations (dimension reduction diagrams) of two gene sets such that the squared covariance between the projections of the gene sets on successive axes is maximized. Simulation studies illustrate that CIA offers superior performance in identifying co-relationships between gene sets in all simulation settings when compared to correlation-based gene set methods. Results and Conclusion: We also combine between-gene set CIA and GSEA to discover the relationships between gene sets significantly associated with phenotypes. In addition, we provide a graphical technique for visualizing and simultaneously exploring the associations between and within gene sets and their interaction and network. We then demonstrate integration of within- and between-gene-set variation using CIA and GSEA, applied to the p53 gene expression data using the c2 curated gene sets. Ultimately, the GSCoA approach provides an attractive tool for identification and visualization of novel associations between pairs of gene sets by integrating co-relationships between gene sets into gene set analysis.
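
The core computation behind co-inertia analysis can be sketched in a few lines of Python. This is a simplified version with uniform sample and variable weights, not the GSCoA implementation; the function name `co_inertia` and its return values are assumptions for this example. Given two gene sets measured on the same samples, the singular value decomposition of their cross-covariance matrix yields pairs of axes whose projections have maximal covariance, and the RV coefficient summarizes the overall co-structure.

```python
import numpy as np

def co_inertia(X, Y):
    """Simplified co-inertia analysis of two samples-by-genes matrices.

    Returns the axes for X, the singular values (covariances between the
    paired projections), the axes for Y, and the RV coefficient.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)   # center each gene across samples
    Yc = Y - Y.mean(axis=0)
    cross = Xc.T @ Yc / (n - 1)
    U, s, Vt = np.linalg.svd(cross, full_matrices=False)
    # RV coefficient: total squared cross-covariance, normalized by the
    # within-set covariance structure of each gene set.
    sxx = Xc.T @ Xc / (n - 1)
    syy = Yc.T @ Yc / (n - 1)
    rv = np.sum(cross ** 2) / np.sqrt(np.sum(sxx ** 2) * np.sum(syy ** 2))
    return U, s, Vt.T, rv

# Two toy gene sets on 30 shared samples; the second depends on the first.
rng = np.random.default_rng(0)
set_a = rng.normal(size=(30, 4))
set_b = set_a[:, :3] @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(30, 5))
axes_a, covariances, axes_b, rv = co_inertia(set_a, set_b)
```

The covariance between the projections `set_a @ axes_a[:, 0]` and `set_b @ axes_b[:, 0]` equals the first singular value, which is the quantity CIA maximizes on the first pair of axes.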

On the limits of active module identification

Briefings in Bioinformatics ◽

10.1093/bib/bbab066 ◽

2021 ◽

Author(s):

Olga Lazareva ◽

Jan Baumbach ◽

Markus List ◽

David B Blumenthal

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Small Diameter ◽

Extensive Study ◽

Biological Knowledge ◽

Expression Data ◽

Module Identification ◽

Ppi Networks ◽

Novel Algorithms ◽

Context Specific

Abstract: In network and systems medicine, active module identification methods (AMIMs) are widely used for discovering candidate molecular disease mechanisms. To this end, AMIMs combine network analysis algorithms with molecular profiling data, most commonly, by projecting gene expression data onto generic protein–protein interaction (PPI) networks. Although active module identification has led to various novel insights into complex diseases, there is increasing awareness in the field that the combination of gene expression data and PPI network is problematic because up-to-date PPI networks have a very small diameter and are subject to both technical and literature bias. In this paper, we report the results of an extensive study where we analyzed for the first time whether widely used AMIMs really benefit from using PPI networks. Our results clearly show that, except for the recently proposed AMIM DOMINO, the tested AMIMs do not produce biologically more meaningful candidate disease modules on widely used PPI networks than on random networks with the same node degrees. AMIMs hence mainly learn from the node degrees and mostly fail to exploit the biological knowledge encoded in the edges of the PPI networks. This has far-reaching consequences for the field of active module identification. In particular, we suggest that novel algorithms are needed which overcome the degree bias of most existing AMIMs and/or work with customized, context-specific networks instead of generic PPI networks.
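
A random network with the same node degrees, as used for the comparison described above, can be generated by repeated double edge swaps. The Python sketch below is one common way to do this and is not taken from the study itself; the function names, the swap count and the seed are assumptions for illustration. The input is assumed to be a simple undirected graph without self-loops or duplicate edges.

```python
import random
from collections import Counter

def node_degrees(edges):
    """Degree of every node in an undirected edge list."""
    return Counter(node for edge in edges for node in edge)

def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomize an undirected simple graph while keeping all node degrees.

    Each successful swap replaces edges (a, b) and (c, d) with (a, d) and
    (c, b), which leaves the degree of a, b, c and d unchanged.
    """
    rng = random.Random(seed)
    edge_list = [tuple(edge) for edge in edges]
    edge_set = {frozenset(edge) for edge in edge_list}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:   # pick one of the two possible rewirings
            c, d = d, c
        if len({a, b, c, d}) < 4:
            continue             # would create a self-loop
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue             # would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

# Toy network: a ring of ten nodes plus two chords.
original = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
randomized = degree_preserving_rewire(original, n_swaps=20, seed=1)
```

Running a module identification method on both `original` and `randomized` then indicates how much of its output is explained by node degrees alone.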
