The expression tractability of a biological trait

Mapping Intimacies ◽

10.1101/278770 ◽

2018 ◽

Author(s):

Li Liu ◽

Jianguo Wang ◽

Jianrong Yang ◽

Xionglei He

Keyword(s):

Gene Expression ◽

Morphological Traits ◽

A Priori ◽

Thermodynamic System ◽

Expression Data ◽

Expression Trait ◽

Biological Trait ◽

Modern Molecular Biology ◽

Gene Modules ◽

Recurrent Patterns

Abstract: Understanding how gene expression is translated to phenotype is central to modern molecular biology, but success is contingent on the intrinsic tractability of the specific traits under examination. However, an a priori estimate of trait tractability from the perspective of gene expression is unavailable. Motivated by the concept of entropy in a thermodynamic system, we here propose such an estimate (ST) by gauging the number (N) of different expression states that underlie the same trait abnormality, with large ST corresponding to large N. By analyzing over 200 yeast morphological traits we show that ST is constrained by natural selection, which builds co-regulated gene modules to minimize the total number of possible expression states. We further show that ST is a good measure of the titer of recurrent patterns of an expression-trait relationship, predicting the extent to which the trait could be deterministically understood with gene expression data.

3145 An Evaluation of Machine Learning and Traditional Statistical Methods for Discovery in Large-Scale Translational Data

Journal of Clinical and Translational Science ◽

10.1017/cts.2019.8 ◽

2019 ◽

Vol 3 (s1) ◽

pp. 2-2

Author(s):

Megan C Hollister ◽

Jeffrey D. Blume

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Random Forest ◽

Gene Expression Data ◽

Large Scale ◽

Second Generation ◽

A Priori ◽

Expression Data ◽

P Values ◽

Machine Learning Methods

OBJECTIVES/SPECIFIC AIMS: To examine and compare the claims in Bzdok, Altman, and Krzywinski under a broader set of conditions by using unbiased methods of comparison; to explore how to accurately use various machine learning and traditional statistical methods in large-scale translational research by estimating their accuracy statistics; and then to identify the methods with the best performance characteristics. METHODS/STUDY POPULATION: We conducted a simulation study with a microarray of gene expression data. We maintained the original structure proposed by Bzdok, Altman, and Krzywinski. The structure for gene expression data includes a total of 40 genes from 20 people, of whom 10 are phenotype positive and 10 are phenotype negative. In order to find a statistical difference, 25% of the genes were set to be dysregulated across phenotype. This dysregulation forced the positive and negative phenotypes to have different mean population expressions. Additional variance was included to simulate genetic variation across the population. We also allowed for within-person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in the simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values. RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels via a machine learning algorithm outperform classical methods. Almost uniformly, the machine learning methods did not yield improved accuracy statistics, and they depend heavily on the a priori chosen number of dysregulated genes. DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this and their additional complexity, machine learning approaches would not be preferable. Of all the approaches, the second-generation p-value appears to offer significant benefit for the cost of a priori defining a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.
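
Two of the classical corrections named in this abstract, Bonferroni and Benjamini-Hochberg, are simple enough to sketch directly. The following is a minimal illustration in plain Python, not the authors' simulation code; the function names and the example p-values are hypothetical, and the second-generation p-value is not covered.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up false discovery rate procedure)."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so adjusted values stay monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def flag_dysregulated(adjusted, alpha=0.05):
    """Indices of genes whose adjusted p-value falls below a pre-specified level."""
    return [i for i, p in enumerate(adjusted) if p < alpha]


# Hypothetical per-gene p-values, for illustration only.
example_p = [0.01, 0.04, 0.03, 0.005]
bh_hits = flag_dysregulated(benjamini_hochberg(example_p))
bonf_hits = flag_dysregulated(bonferroni(example_p))
```

Bonferroni controls the family-wise error rate and is therefore the more conservative of the two, which is one reason the two corrections can flag different numbers of genes at the same significance level.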

A priori, de novo mathematical exploration of gene expression mechanism via regression viewpoint with briefly cataloged modeling antiquity

International Journal of Biomathematics ◽

10.1142/s1793524517500061 ◽

2016 ◽

Vol 10 (01) ◽

pp. 1750006

Author(s):

Shaurya Jauhari ◽

S. A. M. Rizvi

Keyword(s):

Gene Expression ◽

De Novo ◽

A Priori ◽

Mathematical Framework ◽

Reaction Synthesis ◽

Expression Data ◽

Boolean Models ◽

Mathematical Exploration ◽

Mathematical Techniques ◽

Expression Mechanism

Various algorithms have been devised to mathematically model the dynamic mechanisms underlying gene expression data. Gillespie’s stochastic simulation algorithm (GSSA) has been foundational for simulating chemical reaction systems and has since been refined. Several other mathematical techniques, such as differential equations, thermodynamic models and Boolean models, have been implemented to represent gene function effectively. We present a novel mathematical framework of gene expression that models the transcription and translation phases, which is a departure from conventional modeling approaches. These subprocesses are inherent to every gene expression, which is implicitly an experimental outcome. We foresee that a general model can be formulated for the basal translation or transcription values that correspond to a particular assay.
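
For orientation, the Gillespie algorithm mentioned above can be sketched for a two-stage transcription and translation model. This is a generic textbook-style illustration, not the framework proposed in the paper; the reaction set, the rate names (k_tx, k_tl, d_m, d_p) and their values are assumptions chosen only for the example.

```python
import random


def gillespie_two_stage(k_tx, k_tl, d_m, d_p, t_end, seed=0):
    """Simulate mRNA (m) and protein (p) counts with Gillespie's direct method.

    Reactions (all rates are hypothetical):
      transcription      0 -> m      rate k_tx
      translation        m -> m + p  rate k_tl * m
      mRNA decay         m -> 0      rate d_m * m
      protein decay      p -> 0      rate d_p * p
    """
    if k_tx <= 0:
        raise ValueError("k_tx must be positive so the total rate never reaches zero")
    rng = random.Random(seed)
    t, m, p = 0.0, 0, 0
    trajectory = [(t, m, p)]
    while True:
        rates = [k_tx, k_tl * m, d_m * m, d_p * p]
        total = sum(rates)
        # Waiting time to the next reaction is exponential with the total rate.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Choose which reaction fires, with probability proportional to its rate.
        r = rng.random() * total
        if r < rates[0]:
            m += 1
        elif r < rates[0] + rates[1]:
            p += 1
        elif r < rates[0] + rates[1] + rates[2]:
            m -= 1
        elif p > 0:
            p -= 1
        trajectory.append((t, m, p))
    return trajectory


# Example run with made-up rates.
path = gillespie_two_stage(k_tx=2.0, k_tl=5.0, d_m=1.0, d_p=0.5, t_end=10.0, seed=1)
```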

DNA Methylation Module Network-Based Prognosis and Molecular Typing of Cancer

Genes ◽

10.3390/genes10080571 ◽

2019 ◽

Vol 10 (8) ◽

pp. 571 ◽

Cited By ~ 4

Author(s):

Ze-Jia Cui ◽

Xiong-Hui Zhou ◽

Hong-Yu Zhang

Keyword(s):

Gene Expression ◽

Dna Methylation ◽

Molecular Typing ◽

Core Gene ◽

Cancer Prognosis ◽

Methylation Data ◽

Expression Data ◽

Module Network ◽

The Core ◽

Gene Modules

Achieving cancer prognosis and molecular typing is critical for cancer treatment. Previous studies have identified some gene signatures for the prognosis and typing of cancer based on gene expression data. Some studies have shown that DNA methylation is associated with cancer development, progression, and metastasis. In addition, DNA methylation data are more stable than gene expression data in cancer prognosis. Therefore, in this work, we focused on DNA methylation data. Some prior studies have shown that gene modules are more reliable in cancer prognosis than are gene signatures and that gene modules are not isolated. However, few studies have considered cross-talk among the gene modules, which may allow some important gene modules for cancer to be overlooked. Therefore, we constructed a gene co-methylation network based on the DNA methylation data of cancer patients, and detected the gene modules in the co-methylation network. Then, by permutation testing, cross-talk between every two modules was identified; thus, the module network was generated. Next, the core gene modules in the module network of cancer were identified using the K-shell method, and these core gene modules were used as features to study the prognosis and molecular typing of cancer. Our method was applied to three types of cancer (breast invasive carcinoma, skin cutaneous melanoma, and uterine corpus endometrial carcinoma). Based on the core gene modules identified by the constructed DNA methylation module networks, we can not only distinguish the prognosis of cancer patients but also perform molecular typing of cancer. These results indicate that our method has important application value for the diagnosis of cancer and may reveal potential carcinogenic mechanisms.
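
The K-shell step can be illustrated with a short sketch. This is the standard k-shell (k-core) pruning procedure written in plain Python under the assumption of an undirected module network given as a symmetric adjacency dictionary; the module names are hypothetical and the code is not taken from the paper.

```python
def k_shell(adjacency):
    """Assign each node its k-shell index by iterative pruning.

    `adjacency` maps each node to the set of its neighbours and must be
    symmetric (undirected network).
    """
    # Work on a copy so the caller's network is left untouched.
    neighbours = {node: set(nbrs) for node, nbrs in adjacency.items()}
    shell = {}
    k = 0
    while neighbours:
        # Nodes whose remaining degree is at most k belong to shell k.
        removable = [n for n, nbrs in neighbours.items() if len(nbrs) <= k]
        if not removable:
            k += 1
            continue
        for node in removable:
            shell[node] = k
            for nbr in neighbours[node]:
                neighbours[nbr].discard(node)
            del neighbours[node]
    return shell


# Hypothetical module network: a triangle of modules, one pendant module, one isolated module.
network = {
    "M1": {"M2", "M3", "M4"},
    "M2": {"M1", "M3"},
    "M3": {"M1", "M2"},
    "M4": {"M1"},
    "M5": set(),
}
shells = k_shell(network)
core_modules = [m for m, s in shells.items() if s == max(shells.values())]
```

Nodes in the highest shell form the innermost core of the network, which is the sense in which "core gene modules" is used in the sketch.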

Significant Shortest Paths For The Detection Of Putative Disease Modules

10.1101/2020.04.01.019844 ◽

2020 ◽

Author(s):

Daniele Pepe

Keyword(s):

Gene Expression ◽

Structural Equation ◽

Topological Analysis ◽

Shortest Paths ◽

Enrichment Analysis ◽

Equation Modeling ◽

Expression Data ◽

Gene Modules ◽

Disease Modules

Abstract Background: The characterization of diseases in terms of perturbed gene modules was recently introduced for the analysis of gene expression data. Several approaches have been proposed in the literature, but most of them are inductive: starting directly from the data, they try to infer key gene networks potentially associated with the biological phenomenon studied. However, they ignore the biological information already available to characterize the gene modules. Here we propose the detection of perturbed gene modules using a combination of data-driven and hypothesis-driven approaches relying on biological metabolic pathways and significant shortest paths tested by structural equation modeling. The procedure was tested on microarray experiments on infliximab response in patients with inflammatory bowel disease. Starting from differentially expressed genes (DEGs) and pathway analysis, significant shortest paths between DEGs were found and merged together. The final disease module was validated principally by comparing the genes in the module with those already associated with the disease, using the Wang semantic similarity index, and by enrichment analysis based on Disease Ontology. Finally, a topological analysis of the module via centrality measures and the identification of the cut vertices revealed important nodes in the network, such as the TNF gene, and other potential drug target genes, such as p65 and PTPN6. Conclusions: Here we propose a downstream method for the characterization of disease modules from gene expression data. The core of the method is the identification of significant shortest paths between DEGs by structural equation modeling. This allows a mixed approach based on data and on the biological knowledge encoded in biological pathways. Other methods described here, such as enrichment analysis and topological analysis, served to validate the procedure. The results obtained were promising, considering the genes and their connections found in the putative disease modules.
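
The idea of finding shortest paths between DEGs and merging them into one module can be sketched as follows. This is a simplified illustration that assumes a directed, unweighted pathway graph and uses breadth-first search; the significance test by structural equation modeling described in the abstract is omitted, and the node names are placeholders rather than real genes.

```python
from collections import deque
from itertools import permutations


def shortest_path(graph, source, target):
    """Breadth-first search on a directed, unweighted graph.

    Returns the list of nodes from source to target, or None if unreachable.
    """
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def merge_shortest_paths(graph, degs):
    """Union of the edges on one shortest path between every ordered pair of DEGs."""
    edges = set()
    for a, b in permutations(degs, 2):
        path = shortest_path(graph, a, b)
        if path:
            edges.update(zip(path, path[1:]))
    return edges


# Placeholder pathway graph; "A" and "C" play the role of DEGs.
pathway = {"A": ["B", "D"], "B": ["C"], "D": ["E"], "E": ["C"], "C": []}
module_edges = merge_shortest_paths(pathway, ["A", "C"])
```

In the approach the abstract describes, only paths that pass the structural equation modeling test would be kept before merging; here every shortest path found is merged.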

QUBIC2: a novel and robust biclustering algorithm for analyses and interpretation of large-scale RNA-Seq data

Bioinformatics ◽

10.1093/bioinformatics/btz692 ◽

2019 ◽

Vol 36 (4) ◽

pp. 1143-1149 ◽

Cited By ~ 9

Author(s):

Juan Xie ◽

Anjun Ma ◽

Yu Zhang ◽

Bingqiang Liu ◽

Sha Cao ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Gaussian Model ◽

Functional Gene ◽

Superior Performance ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Gene Modules

Abstract Motivation: The biclustering of large-scale gene expression data holds promising potential for detecting condition-specific functional gene modules (i.e. biclusters). However, existing methods do not adequately address a comprehensive detection of all significant bicluster structures and have limited power when applied to expression data generated by RNA-Sequencing (RNA-Seq), especially single-cell RNA-Seq (scRNA-Seq) data, where massive zero and low expression values are observed. Results: We present a new biclustering algorithm, QUalitative BIClustering algorithm Version 2 (QUBIC2), which is empowered by: (i) a novel left-truncated mixture of Gaussian model for an accurate assessment of multimodality in zero-enriched expression data, (ii) a fast and efficient dropouts-saving expansion strategy for functional gene modules optimization using information divergency and (iii) a rigorous statistical test for the significance of all the identified biclusters in any organism, including those without substantial functional annotations. QUBIC2 demonstrated considerably improved performance in detecting biclusters compared to five other widely used algorithms on various benchmark datasets from E. coli, human and simulated data. QUBIC2 also showcased robust and superior performance on gene expression data generated by microarray, bulk RNA-Seq and scRNA-Seq. Availability and implementation: The source code of QUBIC2 is freely available at https://github.com/OSU-BMBL/QUBIC2. Supplementary information: Supplementary data are available at Bioinformatics online.

A resource for analyzing C. elegans’ gene expression data using transcriptional gene modules and module-weighted annotations

10.1101/678482 ◽

2019 ◽

Author(s):

Michael Cary ◽

Katie Podshivalova ◽

Cynthia Kenyon

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Patterns ◽

Expression Data ◽

Functional Interpretation ◽

C Elegans ◽

Experimental Organism ◽

Gene Modules ◽

Term Analysis ◽

Do So

Abstract: Identification of gene co-expression patterns (gene modules) is widely used for grouping functionally-related genes during transcriptomic data analysis. An organism-wide atlas of high quality fundamental gene modules would provide a powerful tool for unbiased detection of biological signals from gene expression data. Here, using a method of independent component analysis we call DEXICA, we have defined and optimized 209 modules that broadly represent transcriptional wiring of the key experimental organism C. elegans. Interrogation of these modules reveals processes that are activated in long-lived mutants in cases where traditional analyses of differentially-expressed genes fail to do so. Using this resource, users can easily identify active modules in their gene expression data and access detailed descriptions of each module. Additionally, we show that modules can inform the strength of the association between a gene and an annotation (e.g. GO term). Analysis of “module-weighted annotations” improves on several aspects of traditional annotation-enrichment tests and can aid in functional interpretation of poorly annotated genes. Interactive access to the resource is provided at http://genemodules.org/.

Network-based cancer genomic data integration for pattern discovery

BMC Genomic Data ◽

10.1186/s12863-021-01004-y ◽

2021 ◽

Vol 22 (S1) ◽

Author(s):

Fangfang Zhu ◽

Jiang Li ◽

Juan Liu ◽

Wenwen Min

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Profiles ◽

Mirna Gene ◽

Interaction Network ◽

Gene Interaction ◽

Functional Modules ◽

Expression Data ◽

Gene Interaction Network ◽

Gene Modules

Abstract Background: Since genes involved in the same biological modules usually present correlated expression profiles, many computational methods have been proposed to identify gene functional modules based on expression profile data. Recently, the Sparse Singular Value Decomposition (SSVD) method has been proposed to bicluster gene expression data to identify gene modules. However, this model can only handle gene expression data in which no gene interaction information is integrated. Ignoring the prior gene interaction information may make the identified gene modules hard to interpret biologically. Results: In this paper, we develop a Sparse Network-regularized SVD (SNSVD) method that integrates a prior gene interaction network from a protein-protein interaction network and gene expression data to identify underlying gene functional modules. The results on a set of simulated data show that SNSVD is more effective than the traditional SVD-based methods. Further experimental results on real cancer genomic data show that most co-expressed modules are not only significantly enriched in GO/KEGG pathways, but also correspond to dense sub-networks in the prior gene interaction network. In addition, we use our method to identify ten differentially co-expressed miRNA-gene modules by integrating matched miRNA and mRNA expression data of breast cancer from The Cancer Genome Atlas (TCGA). Several important breast cancer related miRNA-gene modules are discovered. Conclusions: All the results demonstrate that SNSVD can overcome the drawbacks of SSVD and capture more biologically relevant functional modules by incorporating a prior gene interaction network. These identified functional modules may provide a new perspective for understanding the diagnostics, occurrence and progression of cancer.
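
To make the biclustering idea concrete, here is a minimal sketch of a sparse rank-one decomposition obtained by alternating soft-thresholded power iterations. It is a simplified stand-in for the general SSVD idea and deliberately leaves out the network regularization that distinguishes SNSVD; the penalty values, function names and toy matrix are assumptions made for the example.

```python
import math


def soft_threshold(vec, lam):
    """Shrink each entry toward zero by lam; entries smaller than lam become zero."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in vec]


def normalize(vec):
    """Scale to unit Euclidean length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def sparse_rank_one(matrix, lam_u, lam_v, n_iter=50):
    """Alternating sparse power iterations for one pair of sparse singular vectors.

    Rows are treated as genes and columns as samples; the nonzero entries of
    u and v mark the genes and samples of one candidate bicluster.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    u = [0.0] * n_rows
    for _ in range(n_iter):
        xv = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        u = normalize(soft_threshold(xv, lam_u))
        xtu = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        v = normalize(soft_threshold(xtu, lam_v))
    return u, v


# Toy expression matrix with one strong 2 x 2 block (values are made up).
toy = [
    [5.0, 5.0, 0.1, 0.0],
    [5.0, 5.0, 0.0, 0.1],
    [0.1, 0.0, 0.2, 0.1],
    [0.0, 0.1, 0.1, 0.2],
]
u, v = sparse_rank_one(toy, lam_u=0.5, lam_v=0.5)
module_genes = [i for i, x in enumerate(u) if x != 0]
module_samples = [j for j, x in enumerate(v) if x != 0]
```

A network-regularized variant would add a penalty that favours selecting genes connected in the prior interaction network; how SNSVD formulates that penalty is not specified in the abstract and is not reproduced here.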

Clustering gene expression data using adaptive double self-organizing map

Physiological Genomics ◽

10.1152/physiolgenomics.00138.2002 ◽

2003 ◽

Vol 14 (1) ◽

pp. 35-46 ◽

Cited By ~ 15

Author(s):

Habtom Ressom ◽

Dali Wang ◽

Padma Natarajan

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Human Error ◽

A Priori ◽

Self Organizing Map ◽

Expression Data ◽

Number Of Clusters ◽

Model Based Clustering ◽

Free Parameters ◽

Self Organizing

This paper presents a novel clustering technique known as adaptive double self-organizing map (ADSOM). ADSOM has a flexible topology and performs clustering and cluster visualization simultaneously, thereby requiring no a priori knowledge about the number of clusters. ADSOM is developed based on a recently introduced technique known as double self-organizing map (DSOM). DSOM combines features of the popular self-organizing map (SOM) with two-dimensional position vectors, which serve as a visualization tool to decide how many clusters are needed. Although DSOM addresses the problem of identifying an unknown number of clusters, its free parameters are difficult to control to guarantee correct results and convergence. ADSOM updates its free parameters during training, and it allows convergence of its position vectors to a fairly consistent number of clusters provided that its initial number of nodes is greater than the expected number of clusters. The number of clusters can be identified by visually counting the clusters formed by the position vectors after training. A novel index is introduced based on hierarchical clustering of the final locations of position vectors. The index allows automated detection of the number of clusters, thereby reducing human error that could be incurred from counting clusters visually. The reliability of ADSOM in identifying the number of clusters is demonstrated by applying it to publicly available gene expression data from multiple biological systems such as yeast, human, and mouse. ADSOM’s performance in detecting the number of clusters is compared with a model-based clustering method.

DSS: A biclustering method to identify diverse and state specific gene modules in gene expression data

2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC) ◽

10.1109/smc.2016.7844279 ◽

2016 ◽

Author(s):

Jungrim Kim ◽

Yunku Yeu ◽

Jeongwoo Kim ◽

Youngmi Yoon ◽

Sanghyun Park

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Specific Gene ◽

Expression Data ◽

Gene Modules

Biomarker Screening And Prediction Model Construction of Esophageal Carcinoma Based On Bioinformatics

10.21203/rs.3.rs-915949/v1 ◽

2021 ◽

Author(s):

Yanzhou Zhang ◽

Qing Zhu ◽

Xiufeng Cao ◽

Bin Ni

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Differentially Expressed Genes ◽

Cross Validation ◽

Prediction Models ◽

Differentially Expressed ◽

Expression Data ◽

Rna Seq ◽

Gene Modules ◽

Fold Cross Validation

Abstract Background and objective: Esophageal cancer (ESCA) ranks eleventh in incidence and eighth in mortality among malignant tumors worldwide. Because effective early diagnostic approaches are lacking, many patients miss the optimal treatment window and are already at an advanced stage at first diagnosis. Continuing advances in high-throughput sequencing technologies and analytical techniques have provided novel concepts and approaches for the study of cancer biomarkers in esophageal cancer. The development of cancer is a complex biological process involving multiple genes, interacting factors and multiple stages. This process includes mutations in proto-oncogenes, changes in transcript expression profiles, and abnormalities of protein structure, function, or expression levels. Studying the molecular mechanism of ESCA with high-throughput sequencing technology will lay a theoretical foundation for the early diagnosis and targeted therapy of ESCA.

Materials and methods: Two commonly used public databases, UCSC XENA and GEO, were searched; one UCSC XENA RNA-seq dataset and two GEO datasets were included in this study. Differential expression analysis was performed using limma in R. Weighted gene co-expression network analysis (WGCNA) was used to analyze the transcriptome expression profile of 181 ESCA tissues and 181 normal tissues as controls and to construct a topological network. We constructed gene modules and searched for modules closely related to ESCA, and gene ontology (GO) and KEGG pathway enrichment analyses were performed to probe the functions of the DEGs and of the differentially expressed hub genes in key modules. By combining the results of the differential expression analysis with the WGCNA results (hub genes), we obtained 30 differentially expressed genes in a module closely related to ESCA. Next, we extracted the expression data of these genes from the normalized transcriptome expression data to construct an ESCA predictive model. Then, ten-fold cross validation combined with machine learning algorithms was used to construct prediction models for ESCA. Finally, we verified the four screened biomarkers that were used to build the predictive model against the GEO datasets.

Results: Differential expression analysis was conducted with the limma package, and differentially expressed genes were defined as |log2FC|>1 and adj.P.Val < 0.01. According to the limma results, a total of 15814 genes were up-regulated in ESCA and a total of 6176 genes were down-regulated in ESCA. A total of 7 gene modules were identified by WGCNA, 2 of which are strongly correlated with ESCA (Brown module: R2=0.87, Lightcyan module: R2=-0.75, both P <0.001). The brown module is closely related to ESCA. Combining the WGCNA results with the differentially expressed genes revealed 4419 differentially expressed genes in the brown module that were closely related to ESCA. Thirty hub genes were selected as the top 30 by kWithin in the brown module, and all of them are differentially expressed. GO analysis of the differentially expressed genes from the brown module revealed that these genes belong to immunoglobulin complex, “chromosome, centromeric region”, condensed chromosome, “immunoglobulin complex, circulating”, “condensed chromosome, centromeric region”, and other components; that they participate in molecular functions such as antigen binding, immunoglobulin receptor binding, ATPase activity, cadherin binding, DNA helicase activity, etc.; and that they are involved in biological processes such as adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains, mitotic nuclear division, lymphocyte mediated immunity, nuclear division, and DNA replication. KEGG pathway analysis shows that the differentially expressed genes of the brown module are mainly enriched in signaling pathways such as cell cycle, pathogenic Escherichia coli infection, DNA replication, IL-17 signaling pathway and human T-cell leukemia virus 1 infection. This sheds new light on the molecular mechanisms of ESCA development. Twelve ESCA prediction models, constructed from the expression matrix of the 30 genes in 362 subjects by using 10-fold cross-validation combined with machine learning algorithms, showed good prediction performance in the validation dataset; among them, models from the gbm, BoostGLM and C5.0 algorithms showed higher accuracy than those from other algorithms. Although the transparent or semi-transparent models constructed by the JRip, PART, and Rpart algorithms have acceptable accuracy in the validation dataset, their sensitivity is lower. From a comprehensive perspective, two black-box models, gbm and BoostGLM, were selected as the final models. This study has successfully constructed ESCA prediction models with accuracies higher than 0.97. Finally, three of the four screened biomarkers were validated.

Conclusions: In the current study, differential expression analysis and WGCNA of ESCA-related RNA-seq data available in public databases were used to screen DEGs and genes closely related to ESCA. Results from GO and KEGG analysis further revealed the underlying mechanisms of ESCA. Normalized gene expression data were fed to several different machine learning techniques, and 10-fold cross validation was used to construct high-accuracy ESCA predictive models. Eventually, several ESCA predictive models with accuracy higher than 0.96 in the validation group were constructed. In addition, three biomarkers (G3BP1, CHEK1 and MOB1A) were screened and validated; in particular, G3BP1 may be a potential therapeutic target, as overall survival analysis has shown it to be an adverse prognostic factor. The current study lays the basis for applying RNA-seq data to the early genetic diagnosis of ESCA and identifies a prognostic marker that might contribute to the treatment of ESCA.
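
The DEG definition quoted in the Results (|log2FC|>1 and adj.P.Val < 0.01) amounts to a simple filter over a table of per-gene statistics. The sketch below applies that rule in plain Python rather than in R with limma; the gene identifiers and values are hypothetical and do not come from the study.

```python
def classify_genes(results, fc_cutoff=1.0, p_cutoff=0.01):
    """Split genes into up- and down-regulated sets.

    `results` maps a gene identifier to (log2 fold change, adjusted p-value).
    A gene is called differentially expressed when |log2FC| > fc_cutoff and
    the adjusted p-value is below p_cutoff.
    """
    up, down = [], []
    for gene, (log2_fc, adj_p) in results.items():
        if abs(log2_fc) > fc_cutoff and adj_p < p_cutoff:
            if log2_fc > 0:
                up.append(gene)
            else:
                down.append(gene)
    return sorted(up), sorted(down)


# Hypothetical per-gene statistics, for illustration only.
table = {
    "gene_a": (2.3, 0.001),    # passes both cutoffs, up-regulated
    "gene_b": (-1.5, 0.005),   # passes both cutoffs, down-regulated
    "gene_c": (0.8, 0.0001),   # fold change too small
    "gene_d": (3.0, 0.02),     # adjusted p-value too large
}
up_genes, down_genes = classify_genes(table)
```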