A priori, de novo mathematical exploration of gene expression mechanism via regression viewpoint with briefly cataloged modeling antiquity

Various algorithms have been devised to model mathematically the dynamics underlying gene expression data. Gillespie’s stochastic simulation algorithm (GSSA) has been foundational for simulating chemical reaction systems and has since been refined. Several other mathematical techniques, such as differential equations, thermodynamic models and Boolean models, have been used to represent gene function effectively. We present a novel mathematical framework of gene expression that models the transcription and translation phases, a departure from conventional modeling approaches. These subprocesses are inherent to every gene expression, which is implicitly an experimental outcome. We anticipate that basal transcription or translation values corresponding to a particular assay can be modeled in a general way.
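
As an illustration of the Gillespie approach mentioned in this abstract, here is a minimal sketch of a stochastic simulation of a two-stage gene expression model (transcription, translation, and first-order decay of mRNA and protein). The reaction scheme, the function name and all rate constants are assumptions chosen for illustration; they are not taken from the paper, whose own framework is regression-based.

```python
import random


def gillespie_gene_expression(k_tx=2.0, k_tl=5.0, d_m=1.0, d_p=0.5,
                              t_end=50.0, seed=0):
    """Simulate mRNA (m) and protein (p) copy numbers with Gillespie's direct method.

    Hypothetical reactions and rates (illustrative only):
      transcription: 0 -> m      at rate k_tx
      mRNA decay:    m -> 0      at rate d_m * m
      translation:   m -> m + p  at rate k_tl * m
      protein decay: p -> 0      at rate d_p * p
    """
    rng = random.Random(seed)
    t, m, p = 0.0, 0, 0
    trajectory = [(t, m, p)]
    while True:
        rates = [k_tx, d_m * m, k_tl * m, d_p * p]
        total = sum(rates)  # always positive because k_tx > 0
        # Waiting time to the next reaction is exponential with rate `total`.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Pick which reaction fires, with probability proportional to its rate.
        r = rng.random() * total
        if r < rates[0]:
            m += 1
        elif r < rates[0] + rates[1]:
            m -= 1
        elif r < rates[0] + rates[1] + rates[2]:
            p += 1
        elif p > 0:  # guard against a floating-point edge case when p == 0
            p -= 1
        trajectory.append((t, m, p))
    return trajectory


if __name__ == "__main__":
    traj = gillespie_gene_expression()
    print("events simulated:", len(traj) - 1)
    print("final state (t, mRNA, protein):", traj[-1])
```

Under these assumed rates the long-run mean copy numbers would be k_tx / d_m for mRNA and k_tx * k_tl / (d_m * d_p) for protein; a single trajectory fluctuates around those values.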


Supervised classification for gene network reconstruction

Biochemical Society Transactions ◽

10.1042/bst0311497 ◽

2003 ◽

Vol 31 (6) ◽

pp. 1497-1502 ◽

Cited By ~ 11

Author(s):

L.A. Soinov

Keyword(s):

Gene Expression ◽

Gene Networks ◽

Supervised Classification ◽

Biochemical Network ◽

Expression Data ◽

Clustering Methods ◽

Boolean Models ◽

Experimental Conditions ◽

Expression Levels ◽

Mathematical Techniques

One of the central problems of functional genomics is revealing gene expression networks, that is, the relationships between genes that reflect observations of how the expression level of each gene affects those of others. Microarray data are currently a major source of information about the interplay of biochemical network participants in living cells. Various mathematical techniques, such as differential equations, Bayesian and Boolean models and several statistical methods, have been applied to expression data in attempts to extract the underlying knowledge. Unsupervised clustering methods are often considered the necessary first step in visualization and analysis of the expression data. As for supervised classification, the problem mainly addressed so far has been how to find discriminative genes separating various samples or experimental conditions. Numerous methods have been applied to identify genes that help to predict treatment outcome or to confirm a diagnosis, as well as to identify primary elements of gene regulatory circuits. However, less attention has been devoted to using supervised learning to uncover relationships between genes and/or their products. To start filling this gap, a machine-learning approach for gene network reconstruction is described here. This approach is based on building classifiers, that is, functions that determine the state of a gene's transcription machinery from the expression levels of other genes. The method can be applied to various cases where relationships between gene expression levels could be expected.


3145 An Evaluation of Machine Learning and Traditional Statistical Methods for Discovery in Large-Scale Translational Data

Journal of Clinical and Translational Science ◽

10.1017/cts.2019.8 ◽

2019 ◽

Vol 3 (s1) ◽

pp. 2-2

Author(s):

Megan C Hollister ◽

Jeffrey D. Blume

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Random Forest ◽

Gene Expression Data ◽

Large Scale ◽

Second Generation ◽

A Priori ◽

Expression Data ◽

P Values ◽

Machine Learning Methods

OBJECTIVES/SPECIFIC AIMS: To examine and compare the claims in Bzdok, Altman, and Krzywinski under a broader set of conditions by using unbiased methods of comparison. To explore how to accurately use various machine learning and traditional statistical methods in large-scale translational research by estimating their accuracy statistics. Then we will identify the methods with the best performance characteristics. METHODS/STUDY POPULATION: We conducted a simulation study with a microarray of gene expression data. We maintained the original structure proposed by Bzdok, Altman, and Krzywinski. The structure for gene expression data includes a total of 40 genes from 20 people, of whom 10 are phenotype positive and 10 are phenotype negative. In order to find a statistical difference, 25% of the genes were set to be dysregulated across phenotype. This dysregulation forced the positive and negative phenotypes to have different mean population expressions. Additional variance was included to simulate genetic variation across the population. We also allowed for within-person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in the simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values. RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels via a machine learning algorithm outperform classical methods. Almost uniformly, the machine learning methods did not yield improved accuracy statistics, and they depend heavily on the a priori chosen number of dysregulated genes. DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this and their additional complexity, machine learning approaches would not be preferable. Of all the approaches, the second-generation p-value appears to offer significant benefit for the cost of defining a priori a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.
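
Two of the classical procedures compared in this abstract, the Bonferroni and Benjamini-Hochberg adjustments, can be sketched in a few lines. The following is a minimal illustration with hypothetical function names and made-up p-values; it is not the authors' simulation code and does not cover the second-generation p-value or the machine learning methods.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag p-values that pass the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag p-values rejected by the Benjamini-Hochberg step-up procedure.

    Find the largest rank k such that p_(k) <= (k / m) * alpha, then reject
    the k smallest p-values. Results are returned in the input order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


if __name__ == "__main__":
    # Illustrative p-values only; not data from the study.
    p = [0.001, 0.012, 0.025, 0.2, 0.9]
    print("Bonferroni:        ", bonferroni(p))
    print("Benjamini-Hochberg:", benjamini_hochberg(p))
```

Bonferroni controls the family-wise error rate and is the stricter of the two; Benjamini-Hochberg controls the false discovery rate, which is why it typically flags more genes at the same nominal level.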


Finding Influential Genes Using Gene Expression Data and Boolean Models of Metabolic Networks

2016 IEEE 16th International Conference on Bioinformatics and Bioengineering (BIBE) ◽

10.1109/bibe.2016.25 ◽

2016 ◽

Author(s):

Takeyuki Tamura ◽

Tatsuya Akutsu ◽

Chun-Yu Lin ◽

Jinn-Moon Yang

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Metabolic Networks ◽

Expression Data ◽

Boolean Models


ADAPTS: Automated Deconvolution Augmentation of Profiles for Tissue Specific cells

10.1101/633958 ◽

2019 ◽

Author(s):

Samuel A Danziger ◽

David L Gibbs ◽

Ilya Shmulevich ◽

Mark McConnell ◽

Matthew WB Trotter ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Gene Expression Data ◽

Immune Cell ◽

De Novo ◽

Cell Types ◽

Expression Data ◽

Cell Type ◽

Data Set ◽

Rnaseq Data

Immune cell infiltration of tumors can be an important component for determining patient outcomes, e.g. by inferring immune cell presence by deconvolving gene expression data drawn from a heterogeneous mix of cell types. One particularly powerful family of deconvolution techniques uses signature matrices of genes that uniquely identify each cell type as determined from cell type purified gene expression data. Many methods of this type have been recently published, often including new signature matrices appropriate for a single purpose, such as investigating a specific type of tumor. The package ADAPTS helps users make the most of this expanding knowledge base by introducing a framework for cell type deconvolution. ADAPTS implements modular tools for customizing signature matrices for new tissue types by adding custom cell types or building new matrices de novo, including from single cell RNAseq data. It includes a common interface to several popular deconvolution algorithms that use a signature matrix to estimate the proportion of cell types present in heterogeneous samples. ADAPTS also implements a novel method for clustering cell types into groups that are hard to distinguish by deconvolution and then re-splitting those clusters using hierarchical deconvolution. We demonstrate that the techniques implemented in ADAPTS improve the ability to reconstruct the cell types present in a single cell RNAseq data set in a blind predictive analysis. ADAPTS is currently available for use in R on CRAN and GitHub.
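
The general idea of signature-matrix deconvolution can be illustrated with a small sketch: given a signature matrix S (genes by cell types) and a bulk expression vector y, estimate non-negative cell type fractions f such that S f approximates y. The code below is written in Python with a plain projected gradient descent; it is not the ADAPTS API (ADAPTS is an R package), and the function name, the toy signature matrix and the solver choice are all assumptions made for illustration.

```python
def deconvolve(signature, mixture, n_iter=500):
    """Estimate cell type fractions by non-negative least squares.

    signature: list of rows, one per gene, each with one value per cell type.
    mixture:   list with one bulk expression value per gene.
    Uses projected gradient descent on ||S f - y||^2 with f >= 0, then
    rescales the estimate so the fractions sum to one.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # The squared Frobenius norm bounds the largest eigenvalue of S^T S,
    # so its reciprocal is a safe (conservative) step size.
    step = 1.0 / sum(v * v for row in signature for v in row)
    f = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(signature[g][c] * f[c] for c in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        grad = [
            sum(signature[g][c] * residual[g] for g in range(n_genes))
            for c in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        f = [max(0.0, f[c] - step * grad[c]) for c in range(n_types)]
    total = sum(f)
    return [v / total for v in f] if total > 0 else f


if __name__ == "__main__":
    # Toy signature matrix: 4 marker genes, 2 hypothetical cell types.
    S = [[10.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 7.0]]
    true_fractions = [0.3, 0.7]
    y = [sum(s * f for s, f in zip(row, true_fractions)) for row in S]
    print(deconvolve(S, y))
```

In practice a dedicated solver (for example a non-negative least squares routine or support vector regression, as used by several published deconvolution methods) would replace the hand-written loop; the sketch only shows the shape of the problem.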


Use of relevancy and complementary information for discriminatory gene selection from high-dimensional gene expression data

PLoS ONE ◽

10.1371/journal.pone.0230164 ◽

2021 ◽

Vol 16 (10) ◽

pp. e0230164

Author(s):

Md Nazmul Haque ◽

Sadia Sharmin ◽

Amin Ahsan Ali ◽

Abu Ashfaqur Sajib ◽

Mohammad Shoyaib

Keyword(s):

Gene Expression ◽

Mutual Information ◽

Gene Expression Data ◽

Gene Selection ◽

De Novo ◽

Expression Profiles ◽

Biological Data ◽

Expression Data ◽

Finite Sample ◽

Key Genes

With the advent of high-throughput technologies, life sciences are generating a huge amount of varied biomolecular data. Global gene expression profiles provide a snapshot of all the genes that are transcribed in a cell or in a tissue under a particular condition. The high dimensionality of such gene expression data (i.e., a very large number of features/genes analyzed with a much smaller number of samples) makes it difficult to identify de novo the key genes (biomarkers) that truly contribute to a particular phenotype or condition, such as cancer. For identifying the key genes from gene expression data, mutual information (MI) is one of the most successful criteria in the existing literature. However, the correction of MI for finite samples is not taken into account in this regard. It is also important to incorporate dynamic discretization of genes for more relevant gene selection, although this is not considered in the available methods. In addition, current studies usually suggest removing redundant genes, which is particularly inappropriate for biological data, as a group of genes may connect to each other to produce downstream proteins. Thus, despite their redundancy, genes that provide additional useful information about the disease need to be included. Addressing these issues, we propose a Mutual information based Gene Selection method (MGS) for selecting informative genes. Moreover, to rank these selected genes, we extend MGS and propose two ranking methods on the selected genes: MGSf, based on frequency, and MGSrf, based on Random Forest. The proposed method not only obtained better classification rates than recently reported methods on gene expression datasets derived from different gene expression studies, but also detected the key genes relevant to pathways with a causal relationship to the disease, which indicates that it will also be able to find the responsible genes in data for an unknown disease.
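
To make the mutual information criterion concrete, the following minimal sketch computes the plug-in MI estimate between a discretized gene and a class label and ranks genes by it. It is deliberately the naive estimator, without the finite-sample correction or dynamic discretization that the abstract argues for, and it is not the MGS method; the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in estimate of mutual information (in bits) between two
    equally long sequences of discrete values."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((count_x[a] / n) * (count_y[b] / n)))
        for (a, b), c in count_xy.items()
    )


def rank_genes(expression, labels):
    """Return gene names sorted by decreasing MI with the class labels.

    expression: dict mapping gene name -> list of discretized values,
                one value per sample.
    """
    scores = {g: mutual_information(v, labels) for g, v in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy data: 4 samples, two hypothetical genes, binary phenotype.
    labels = [0, 0, 1, 1]
    expression = {
        "gene_a": [0, 0, 1, 1],  # tracks the phenotype exactly
        "gene_b": [0, 1, 0, 1],  # independent of the phenotype
    }
    print(rank_genes(expression, labels))
```

With only a handful of samples the plug-in estimate is biased upward, which is one reason a finite-sample correction matters for real expression data.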


Systems analyses of key metabolic modules of floral and extrafloral nectaries of cotton

10.1101/857771 ◽

2019 ◽

Author(s):

Elizabeth C. Chatt ◽

Siti-Nabilla Mahalim ◽

Nur-Aziatull Mohd-Fadzil ◽

Rahul Roy ◽

Peter M. Klinkenberg ◽

...

Keyword(s):

Gene Expression ◽

Gossypium Hirsutum ◽

Gene Expression Data ◽

Plasma Membranes ◽

De Novo ◽

Starch Synthesis ◽

Extrafloral Nectaries ◽

Starch Degradation ◽

Expression Data ◽

Ideal System

Nectar is a primary reward mediating plant-animal mutualisms to improve plant fitness and reproductive success. In Gossypium hirsutum (cotton), four distinct trichomatic nectaries develop, one floral and three extrafloral. The secreted floral and extrafloral nectars serve different purposes, with the floral nectar attracting bees to promote pollination and the extrafloral nectar attracting predatory insects as a means of indirect resistance from herbivores. Cotton therefore provides an ideal system to contrast mechanisms of nectar production and nectar composition between floral and extrafloral nectaries. Here, we report the transcriptome, ultrastructure, and metabolite spatial distribution using mass spectrometric imaging of the four cotton nectary types throughout development. Additionally, the secreted nectar metabolomes were defined and were jointly composed of 197 analytes, 60 of which were identified. Integration of these datasets supports the coordination of merocrine-based and eccrine-based models of nectar synthesis. The nectary ultrastructure supports the merocrine-based model due to the abundance of rough endoplasmic reticulum positioned parallel to the cell walls and profusion of vesicles fusing to the plasma membranes. The eccrine-based model, which consists of a progression from starch synthesis to starch degradation and to sucrose biosynthesis, was supported by gene expression data. This demonstrates conservation of the eccrine-based model for the first time in both trichomatic and extrafloral nectaries. Lastly, nectary gene expression data provided evidence to support de novo synthesis of amino acids detected in the secreted nectars.

One sentence summary: The eccrine-based model of nectar synthesis and secretion is conserved in both trichomatic and extrafloral nectaries determined by a system-based comparison of cotton (Gossypium hirsutum) nectaries.


Clustering gene expression data using adaptive double self-organizing map

Physiological Genomics ◽

10.1152/physiolgenomics.00138.2002 ◽

2003 ◽

Vol 14 (1) ◽

pp. 35-46 ◽

Cited By ~ 15

Author(s):

Habtom Ressom ◽

Dali Wang ◽

Padma Natarajan

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Human Error ◽

A Priori ◽

Self Organizing Map ◽

Expression Data ◽

Number Of Clusters ◽

Model Based Clustering ◽

Free Parameters ◽

Self Organizing

This paper presents a novel clustering technique known as adaptive double self-organizing map (ADSOM). ADSOM has a flexible topology and performs clustering and cluster visualization simultaneously, thereby requiring no a priori knowledge about the number of clusters. ADSOM is developed based on a recently introduced technique known as double self-organizing map (DSOM). DSOM combines features of the popular self-organizing map (SOM) with two-dimensional position vectors, which serve as a visualization tool to decide how many clusters are needed. Although DSOM addresses the problem of identifying an unknown number of clusters, its free parameters are difficult to control to guarantee correct results and convergence. ADSOM updates its free parameters during training, and it allows convergence of its position vectors to a fairly consistent number of clusters provided that its initial number of nodes is greater than the expected number of clusters. The number of clusters can be identified by visually counting the clusters formed by the position vectors after training. A novel index is introduced based on hierarchical clustering of the final locations of position vectors. The index allows automated detection of the number of clusters, thereby reducing human error that could be incurred from counting clusters visually. The reliability of ADSOM in identifying the number of clusters is demonstrated by applying it to publicly available gene expression data from multiple biological systems such as yeast, human, and mouse. ADSOM’s performance in detecting the number of clusters is compared with a model-based clustering method.


Complex Networks, Gene Expression and Cancer Complexity: A Brief Review of Methodology and Applications

Current Bioinformatics ◽

10.2174/1574893614666191017093504 ◽

2020 ◽

Vol 15 (6) ◽

pp. 629-655

Author(s):

A.C. Iliopoulos ◽

G. Beis ◽

P. Apostolou ◽

I. Papasotiriou

Keyword(s):

Gene Expression ◽

Cancer Research ◽

Complex Networks ◽

Gene Network ◽

Network Inference ◽

Mathematical Framework ◽

Cancer Gene ◽

Expression Data ◽

Gene Network Inference ◽

Basic Features

In this brief survey, various aspects of cancer complexity, and how this complexity can be confronted using modern complex network theory and gene expression datasets, are described. In particular, the causes and the basic features of cancer complexity, as well as the challenges it brings, are underlined, while the importance of gene expression data in cancer research and in reverse engineering of gene co-expression networks is highlighted. In addition, an introduction to the corresponding theoretical and mathematical framework of graph theory and complex networks is provided. The basics of network reconstruction, along with the limitations of gene network inference, enrichment and survival analysis, evolution, robustness-resilience and cascades in complex networks, are described. Finally, an indicative and suggestive example of a cancer gene co-expression network inference and analysis is given.


Computing Minimal Boolean Models of Gene Regulatory Networks

10.1101/2021.05.22.445266 ◽

2021 ◽

Author(s):

Guy Karlebach ◽

Peter N Robinson

Keyword(s):

Gene Expression ◽

Gene Regulatory Networks ◽

Regulatory Networks ◽

A Priori ◽

Linear Constraints ◽

Boolean Models ◽

Real World Data ◽

Gene Regulatory ◽

Network States ◽

Logical Rules

Models of Gene Regulatory Networks (GRNs) capture the dynamics of the regulatory processes that occur within the cell as a means to understand the variability observed in gene expression between different conditions. Possibly the simplest mathematical construct used for modeling is the Boolean network, which dictates a set of logical rules for transition between states described as Boolean vectors. Due to the complexity of gene regulation and the limitations of experimental technologies, in most cases knowledge about regulatory interactions and Boolean states is partial. In addition, the logical rules themselves are not known a priori. Our goal in this work is to present a methodology for inferring this information from the data, and to provide a measure for comparing network states under different biological conditions. Methods: We present a novel methodology for integrating experimental data and performing a search for the optimal consistent structure via optimization of a linear objective function under a set of linear constraints. We also present a statistical approach for testing the similarity of network states under different conditions. Results: Our methodology finds the optimal model using an experimental gene expression dataset from human CD4 T-cells and shows that network states are different between healthy controls and rheumatoid arthritis patients. Conclusion: The problem can be solved optimally using real-world data. Properties of the inferred network show the importance of a general approach. Significance: Our methodology will enable researchers to obtain a better understanding of the function of gene regulatory networks and their biological role.
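
The Boolean network construct described in this abstract (logical rules that map one Boolean state vector to the next) can be shown with a small sketch. The three-gene network, its rules and the function names below are hypothetical and serve only to illustrate synchronous state transitions and attractor detection; the sketch does not implement the authors' inference method, which solves an optimization problem under linear constraints.

```python
def step(state, rules):
    """Apply every rule to the current state at once (synchronous update)."""
    return tuple(rule(state) for rule in rules)


def find_attractor(state, rules):
    """Iterate from `state` until a state repeats; return the repeating cycle."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state, rules)
    return seen[seen.index(state):]


# Hypothetical three-gene network with state (A, B, C):
#   A is repressed by C; B is activated by A; C requires both A and B.
RULES = [
    lambda s: not s[2],      # A' = NOT C
    lambda s: s[0],          # B' = A
    lambda s: s[0] and s[1], # C' = A AND B
]


if __name__ == "__main__":
    start = (True, False, False)
    for s in find_attractor(start, RULES):
        print(s)
```

Because the state space is finite and the update is deterministic, every trajectory must eventually enter a cycle, so the loop in `find_attractor` always terminates.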


BICORN: An R package for integrative inference of de novo cis-regulatory modules

10.1101/560557 ◽

2019 ◽

Author(s):

Xi Chen

Keyword(s):

Gene Expression ◽

Transcription Factor ◽

Gene Expression Data ◽

Regulatory Networks ◽

Target Genes ◽

De Novo ◽

R Package ◽

Expression Data ◽

Regulatory Modules ◽

Regulatory Module

BICORN is an R package developed to integrate prior transcription factor binding information and gene expression data for cis-regulatory module (CRM) inference. BICORN searches for a list of candidate CRMs from binary bindings on potential target genes. Applying Gibbs sampling, BICORN samples CRMs for each gene using the fitting performance of transcription factor activities and regulation strengths of TFs in each CRM on gene expression. Consequently, sparse regulatory networks are inferred as functional CRMs regulating target genes. The BICORN package is implemented in R and is available at https://cran.r-project.org/web/packages/BICORN/index.html.
