Encoding Gene Expression Using Deep Autoencoders for Expression Inference

Author(s):  
Raju Bhukya

Gene expression of an organism contains all the information that characterises its observable traits. Researchers have invested abundant time and money to quantitatively measure expression in laboratories. Because such techniques are too expensive to be widely used, the correlation between the expressions of certain genes was exploited to develop statistical solutions. Pioneered by the National Institutes of Health Library of Integrated Network-Based Cellular Signatures (NIH LINCS) program, expression inference techniques have seen many improvements over the years. The Deep Learning for Gene Expression (D-GEX) project at the University of California, Irvine approached the problem from a machine learning perspective, leading to the development of a multi-layer feedforward neural network that infers target gene expressions from clinically measured landmark expressions. Still, the huge number of genes to be inferred from a limited set of known expressions vexed the researchers. Ignoring possible correlation between target genes, they partitioned the target genes randomly and built separate networks to infer their expressions. This paper proposes that the dimensionality of the target set can be virtually reduced using deep autoencoders. Feedforward networks are then used to predict the coded representation of target expressions. In spite of the reconstruction error of the autoencoder, overall prediction error on the microarray-based Gene Expression Omnibus (GEO) dataset was reduced by 6.6% compared to D-GEX. An improvement of 16.64% was obtained on cross-platform normalized data obtained by combining the GEO dataset and an RNA-Seq-based 1000G dataset.
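The encode, predict, decode pipeline described in this abstract can be illustrated with a minimal sketch. It is not the authors' model: a one-dimensional principal-direction projection stands in for the deep autoencoder, an ordinary least-squares line stands in for the feedforward network, and the toy data and all function names are hypothetical.

```python
# Linear stand-ins only (assumption): a 1-D principal-direction projection
# replaces the deep autoencoder, and a least-squares line replaces the
# feedforward network. All data below are made up for illustration.

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def fit_encoder(targets, iterations=50):
    """Return (mean, unit direction) of a 1-D linear code via power iteration."""
    mu = mean_vector(targets)
    centered = [[v - m for v, m in zip(row, mu)] for row in targets]
    w = [1.0] * len(mu)  # arbitrary starting direction
    for _ in range(iterations):
        # Multiply w by the unnormalised covariance: sum_i (x_i . w) * x_i
        scores = [sum(a * b for a, b in zip(row, w)) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(len(mu))]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return mu, w

def encode(row, mu, w):
    """Compress a target expression vector to a single code value."""
    return sum((v - m) * d for v, m, d in zip(row, mu, w))

def decode(code, mu, w):
    """Reconstruct the target expression vector from its code."""
    return [m + code * d for m, d in zip(mu, w)]

def fit_line(xs, ys):
    """Least-squares fit ys ~ a * xs + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Toy data: one landmark value per sample, three correlated target genes.
landmarks = [0.0, 1.0, 2.0, 3.0]
targets = [[1.0 + 1.0 * x, 2.0 + 2.0 * x, 0.5 + 2.0 * x] for x in landmarks]

mu, w = fit_encoder(targets)                        # step 1: learn the code
codes = [encode(row, mu, w) for row in targets]
a, b = fit_line(landmarks, codes)                   # step 2: landmarks -> code

def predict_targets(landmark):
    """Step 3: predict the code, then decode it into target expressions."""
    return decode(a * landmark + b, mu, w)
```

Because the predictor only has to output the low-dimensional code, a single model can cover all targets; the decoder then restores the full target vector, at the cost of whatever reconstruction error the encoder introduces (zero in this idealised toy case).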

2019 ◽  
Vol 17 (3) ◽  
pp. 422-431
Author(s):  
Raju Bhukya ◽  
Achyuth Ashok

In the field of molecular biology, gene expression is a term that encompasses all the information contained in an organism’s genome. Although researchers have developed several clinical techniques to quantitatively measure the expressions of an organism's genes, these techniques are too costly to be used extensively. The NIH LINCS program revealed that human gene expressions are highly correlated. Further research at the University of California, Irvine (UCI) led to the development of D-GEX, a Multi Layer Perceptron (MLP) model that was trained to predict unknown target expressions from previously identified landmark expressions. Owing to hardware limitations, however, the D-GEX authors split the target genes into different sets and constructed separate models to profile the whole genome. This paper proposes an alternative solution that combines a deep autoencoder with an MLP to overcome this bottleneck and improve prediction performance. The microarray-based Gene Expression Omnibus (GEO) dataset was employed to train the neural networks. Experimental results show that this new model, abbreviated as E-GEX, outperforms D-GEX by 16.64% in terms of overall prediction accuracy on the GEO dataset. The models were further tested on an RNA-Seq-based 1000G dataset, and E-GEX was found to be 49.23% more accurate than D-GEX.


2020 ◽  
Vol 16 ◽  
pp. 117693432092057
Author(s):  
Lijun Yu ◽  
Meiyan Wei ◽  
Fengyan Li

Despite advances in the treatment of cervical cancer (CC), the prognosis of patients with CC remains to be improved. This study aimed to explore candidate gene targets for CC. CC datasets were downloaded from the Gene Expression Omnibus database. Genes with similar expression trends in varying steps of CC development were clustered using Short Time-series Expression Miner (STEM) software. Gene functions were then analyzed using the Gene Ontology (GO) database and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. Protein interactions among genes of interest were predicted, followed by drug-target genes and prognosis-associated genes. The expressions of the predicted genes were determined using real-time quantitative polymerase chain reaction (RT-qPCR) and Western blotting. Red and green profiles with upward and downward gene expressions, respectively, were screened using STEM software. Genes with increased expression were significantly enriched in DNA replication, cell-cycle-related biological processes, and the p53 signaling pathway. Based on the predicted results of the Drug-Gene Interaction database, 17 drug-gene interaction pairs, including 3 red profile genes (TOP2A, RRM2, and POLA1) and 16 drugs, were obtained. The Cancer Genome Atlas data analysis showed that high POLA1 expression was significantly correlated with prolonged survival, indicating that POLA1 is protective against CC. RT-qPCR and Western blotting showed that the expressions of TOP2A, RRM2, and POLA1 gradually increased in the multistep process of CC. TOP2A, RRM2, and POLA1 may be targets for the treatment of CC. However, further studies are needed to validate our findings.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Huahe Zhu ◽  
Shun Wang ◽  
Cong Shan ◽  
Xiaoqian Li ◽  
Bo Tan ◽  
...  

Abstract Xuan-bai-cheng-qi decoction (XCD), a traditional Chinese medicine (TCM) prescription, has been widely used in China to treat a variety of respiratory diseases, especially serious infectious diseases such as acute lung injury (ALI). Due to the complexity of its chemical constituents, however, the underlying pharmacological mechanism of action of XCD is still unclear. To explore its protective mechanism in ALI, a network pharmacology experiment was first conducted to construct a component-target network of XCD, which identified 46 active components and 280 predicted target genes. Then, RNA sequencing (RNA-seq) was used to screen differentially expressed genes (DEGs) between ALI model rats treated with and without XCD, and 753 DEGs were found. By overlapping the target genes identified using network pharmacology with the DEGs identified using RNA-seq, followed by protein–protein interaction (PPI) network analysis, 6 kernel targets were screened out as closely relevant to ALI treatment: vascular endothelial growth factor (VEGF), mammalian target of rapamycin (mTOR), AKT1, hypoxia-inducible factor-1α (HIF-1α), phosphoinositide 3-kinase (PI3K), and phosphatase and tensin homolog deleted on chromosome ten (PTEN). Verification experiments in LPS-induced ALI model rats showed that XCD could alleviate pathological injury of lung tissue by attenuating the release of proinflammatory cytokines such as tumor necrosis factor (TNF)-α, interleukin (IL)-6, and IL-1β. Meanwhile, both the mRNA and protein expression levels of PI3K, mTOR, HIF-1α, and VEGF in the lung tissues were down-regulated with XCD treatment. Therefore, regulation of the PI3K/mTOR/HIF-1α/VEGF signaling pathway by XCD is probably a crucial mechanism underlying the protective effect of XCD in ALI.


mSystems ◽  
2020 ◽  
Vol 5 (6) ◽  
Author(s):  
Kumari Sonal Choudhary ◽  
Julia A. Kleinmanns ◽  
Katherine Decker ◽  
Anand V. Sastry ◽  
Ye Gao ◽  
...  

ABSTRACT Escherichia coli uses two-component systems (TCSs) to respond to environmental signals. TCSs affect gene expression and are parts of E. coli’s global transcriptional regulatory network (TRN). Here, we identified the regulons of five TCSs in E. coli MG1655: BaeSR and CpxAR, which were stimulated by ethanol stress; KdpDE and PhoRB, induced by limiting potassium and phosphate, respectively; and ZraSR, stimulated by zinc. We analyzed RNA-seq data using independent component analysis (ICA). ChIP-exo data were used to validate condition-specific target gene binding sites. Based on these data, we do the following: (i) identify the target genes for each TCS; (ii) show how the target genes are transcribed in response to stimulus; and (iii) reveal novel relationships between TCSs, which indicate noncognate inducers for various response regulators, such as BaeR to iron starvation, CpxR to phosphate limitation, and PhoB and ZraR to cell envelope stress. Our understanding of the TRN in E. coli is thus notably expanded. IMPORTANCE E. coli is a common commensal microbe found in the human gut microenvironment; however, some strains cause diseases like diarrhea, urinary tract infections, and meningitis. E. coli’s two-component systems (TCSs) modulate target gene expression, especially related to virulence, pathogenesis, and antimicrobial peptides, in response to environmental stimuli. Thus, it is of utmost importance to understand the transcriptional regulation of TCSs to infer bacterial environmental adaptation and disease pathogenicity. Utilizing a combinatorial approach integrating RNA sequencing (RNA-seq), independent component analysis, chromatin immunoprecipitation coupled with exonuclease treatment (ChIP-exo), and data mining, we suggest five different modes of TCS transcriptional regulation. Our data further highlight noncognate inducers of TCSs, which emphasizes the cross-regulatory nature of TCSs in E. coli and suggests that TCSs may have a role beyond their cognate functionalities. In summary, these results can lead to an understanding of the metabolic capabilities of bacteria and correctly predict complex phenotype under diverse conditions, especially when further incorporated with genome-scale metabolic models.


2017 ◽  
Vol 29 (1) ◽  
pp. 173
Author(s):  
Z. Jiang ◽  
J. Sun ◽  
S. Marjani ◽  
H. Dong ◽  
X. Zheng ◽  
...  

Appropriate reference genes for accurate normalization in RT-PCR are essential for the study of gene expression. Ideal reference genes should not only have stable expression across stages of embryo development, but also be expressed at comparable levels to the target genes. Using RNA-seq data from in vivo-produced bovine oocytes and embryos from the 2-cell to blastocyst stage (Jiang et al., 2014 BMC Genomics 15, 756), we tried to establish a catalogue of all reference genes for RT-PCR analysis. One-way ANOVA generated 4055 genes that did not differ across stages. To reduce this list, we used the entire RNA-seq data set and first removed genes with a FPKM (fragments per kilobase of transcript per million mapped reads) of <1, and then rescaled each gene’s expression values within a range of 0 to 1. We subsequently calculated the expression variance for each gene across all stages. By assuming that the calculated variances follow a Gaussian distribution and that the majority of the genes do not have a stable expression level, a gene was classified as a reference if its variance significantly deviated (P < 0.05) from these assumptions. We identified 346 potential reference genes, all of which were among the candidates from the ANOVA analysis. We arbitrarily assigned genes in this list to high (FPKM ≥ 100), medium (10 < FPKM < 100), and low expression levels (FPKM ≤ 10), and 37, 154, and 155 genes, respectively, fell into these groups. Surprisingly, none of the commonly used reference genes, such as GAPDH, PPIA, ACTB, PRL15, GUSB, and H3F2A, were identified as being stably expressed across in vivo development. This is consistent with findings of prior RT-PCR studies (Robert et al. 2002 Biol. Reprod. 67, 1465–1472; Ross et al. 2010 Cell Reprogram. 12, 709–717). 
The following gene ontology terms were significantly enriched for the 346 genes: cell cycle, translation, transport, chromatin, cell division, and metabolic process, indicating that the early embryos maintained constant levels of genes involved in fundamental biological functions. Finally, we performed RT-PCR to validate the RNA-seq results using different bovine in vivo-derived oocytes and embryos (n = 3/stage). We successfully validated 10 selected genes, including those in the high (CS, PGD, and ACTR3), medium (CCT5, MRPL47, COG2, CRT9, and HELLS), and low expression groups (CDC23 and TTF1). In conclusion, we recommend the use of reference genes that are expressed at comparable levels to target genes. This study offers a useful resource to aid in the appropriate selection of reference genes, which will improve the accuracy of quantitative gene expression analyses across bovine embryo pre-implantation development.
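The screening steps in this abstract (drop weakly expressed genes, rescale each gene to a 0 to 1 range, compute its variance across stages, then group by expression level) can be sketched roughly as follows. Several details are assumptions, since the abstract does not specify them: the FPKM cutoff is applied to the mean across stages, rescaling divides by the gene's peak value, and the Gaussian significance test is omitted in favour of a simple ranking. Gene names and values are hypothetical.

```python
from statistics import mean, pvariance

def scaled_variance(fpkm_by_stage):
    """Variance across stages after rescaling into the 0-1 range.

    Assumption: rescaling divides by the gene's peak FPKM.
    """
    peak = max(fpkm_by_stage)
    return pvariance([v / peak for v in fpkm_by_stage])

def expression_tier(mean_fpkm):
    """Tier thresholds as quoted in the abstract."""
    if mean_fpkm >= 100:
        return "high"
    if mean_fpkm > 10:
        return "medium"
    return "low"

# Hypothetical FPKM values for four genes across four developmental stages.
profiles = {
    "geneA": [120.0, 118.0, 122.0, 121.0],
    "geneB": [5.0, 60.0, 2.0, 40.0],
    "geneC": [0.2, 0.1, 0.3, 0.2],
    "geneD": [20.0, 21.0, 19.0, 20.0],
}

# Step 1: remove genes with FPKM < 1 (assumption: judged on the mean).
expressed = {g: v for g, v in profiles.items() if mean(v) >= 1.0}

# Steps 2-3: rank the remaining genes from most to least stable.
ranked = sorted(expressed, key=lambda g: scaled_variance(expressed[g]))

# Step 4: label each candidate with its expression tier.
tiers = {g: expression_tier(mean(expressed[g])) for g in ranked}
```

In this toy input, `geneA` and `geneD` rank as the most stable candidates in the high and medium tiers respectively, which mirrors the recommendation to pick a reference gene expressed at a level comparable to the target gene.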


Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Constance M Smith ◽  
James A Kadin ◽  
Richard M Baldarelli ◽  
Jonathan S Beal ◽  
Olin Blodgett ◽  
...  

Abstract The Gene Expression Database (GXD), an extensive community resource of curated expression information for the mouse, has developed an RNA-Seq and Microarray Experiment Search (http://www.informatics.jax.org/gxd/htexp_index). This tool allows users to quickly and reliably find specific experiments in ArrayExpress and the Gene Expression Omnibus (GEO) that study endogenous gene expression in wild-type and mutant mice. Standardized metadata annotations, curated by GXD, allow users to specify the anatomical structure, developmental stage, mutated gene, strain and sex of samples of interest, as well as the study type and key parameters of the experiment. These searches, powered by controlled vocabularies and ontologies, can be combined with free text searching of experiment titles and descriptions. Search result summaries include link-outs to ArrayExpress and GEO, providing easy access to the expression data itself. Links to the PubMed entries for accompanying publications are also included. More information about this tool and GXD can be found at the GXD home page (http://www.informatics.jax.org/expression.shtml). Database URL: http://www.informatics.jax.org/expression.shtml


2014 ◽  
Vol 2014 ◽  
pp. 1-8
Author(s):  
Tzu-Hao Chang ◽  
Shih-Lin Wu ◽  
Wei-Jen Wang ◽  
Jorng-Tzong Horng ◽  
Cheng-Wei Chang

Microarrays are widely used to assess gene expressions. Most microarray studies focus primarily on identifying differential gene expressions between conditions (e.g., cancer versus normal cells), for discovering the major factors that cause diseases. Because previous studies have not identified the correlations of differential gene expression between conditions, crucial but abnormal regulations that cause diseases might have been disregarded. This paper proposes an approach for discovering the condition-specific correlations of gene expressions within biological pathways. Because analyzing gene expression correlations is time consuming, an Apache Hadoop cloud computing platform was implemented. Three microarray data sets of breast cancer were collected from the Gene Expression Omnibus, and pathway information from the Kyoto Encyclopedia of Genes and Genomes was applied for discovering meaningful biological correlations. The results showed that adopting the Hadoop platform considerably decreased the computation time. Several correlations of differential gene expressions were discovered between the relapse and nonrelapse breast cancer samples, and most of them were involved in cancer regulation and cancer-related pathways. The results showed that breast cancer recurrence might be highly associated with the abnormal regulations of these gene pairs, rather than with their individual expression levels. The proposed method was computationally efficient and reliable, and stable results were obtained when different data sets were used. The proposed method is effective in identifying meaningful biological regulation patterns between conditions.
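The idea of condition-specific correlation can be illustrated with a small single-machine sketch. It assumes Pearson correlation as the measure (the abstract does not name one), leaves out the Hadoop distribution entirely, and uses hypothetical gene names and expression values.

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def correlation_shift(cond_a, cond_b):
    """For every gene pair, correlation in condition B minus condition A."""
    genes = sorted(cond_a)
    return {
        (g1, g2): pearson(cond_b[g1], cond_b[g2]) - pearson(cond_a[g1], cond_a[g2])
        for g1, g2 in combinations(genes, 2)
    }

# Hypothetical expression values for three genes in four samples per group.
nonrelapse = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [5.0, 3.0, 6.0, 4.0],
}
relapse = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [8.0, 6.0, 4.0, 2.0],
    "g3": [5.0, 3.0, 6.0, 4.0],
}

shifts = correlation_shift(nonrelapse, relapse)
# The pair with the largest absolute shift is the strongest candidate for
# a condition-specific change in regulation.
top_pair = max(shifts, key=lambda pair: abs(shifts[pair]))
```

Here `g1` and `g2` have identical individual value distributions in both groups, yet their correlation flips sign between groups; that is the kind of pairwise change that a per-gene differential expression test would miss. The number of pairs grows quadratically with the number of genes, which is why the computation is a natural fit for a distributed platform.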


2007 ◽  
Vol 25 (18_suppl) ◽  
pp. 7669-7669 ◽  
Author(s):  
C. Huang ◽  
D. Liu ◽  
J. Nakano ◽  
S. Ishikawa ◽  
H. Yokomise ◽  
...  

7669 Background: The thymidylate synthase (TS) expression is related to 5-FU sensitivity. The survivin expression is associated with tumor apoptosis, an indicator to predict the efficacy of chemotherapy. Recently, TS and Survivin have been reported to be E2F1 target genes. We investigate the clinical significance of the E2F1 gene expression in relation to gene expressions of TS and Survivin among non-small cell lung cancer (NSCLC). Methods: One hundred and twenty-seven NSCLC patients were investigated. Quantitative RT-PCR was performed to evaluate gene expressions of E2F1, TS, and survivin. The Ki-67 proliferation index and the apoptotic index using TUNEL method were also evaluated. Results: The E2F1 gene expression was significantly higher in stage II to III tumors than in stage I tumors (p=0.006). The E2F1 gene expression significantly correlated with the Ki-67 proliferation index (p<0.001), while no correlation was observed between the E2F1 gene expression and the apoptotic index. Regarding E2F1-target genes, the E2F1 gene expression significantly correlated with the TS gene expression (p<0.001). The E2F1 gene expression also significantly correlated with the survivin gene expression (p<0.001). The TS expression and the survivin expression significantly correlated with the Ki-67 proliferation index (p<0.001 and p<0.001, respectively). There was a significant inverse relationship between the survivin expression and the apoptotic index (p<0.001). The overall survival was significantly lower in patients with high-E2F1 tumors than in those with low-E2F1 tumors (p=0.002), especially among patients with stage II to III NSCLCs (p=0.018). The Cox regression analysis demonstrated that the E2F1 status was a significant prognostic factor for NSCLC patients (p=0.026). Conclusions: The present study revealed the E2F1 gene expression to correlate with TS and survivin gene expressions, and tumor proliferation. 
E2F1 overexpression could occur to produce more aggressive tumors with high proliferation rate and chemo-resistance during progression of NSCLCs. The suppression of E2F1 by RNA interference would be a useful strategy for cancer gene therapy. No significant financial relationships to disclose.


2020 ◽  
Author(s):  
Rwik Sen ◽  
Ezra Lencer ◽  
Elizabeth A. Geiger ◽  
Kenneth L. Jones ◽  
Tamim H. Shaikh ◽  
...  

Abstract Congenital Heart Defects (CHDs) are the most common form of birth defects, observed in 4-10/1000 live births. CHDs result in a wide range of structural and functional abnormalities of the heart which significantly affect quality of life and mortality. CHDs are often seen in patients with mutations in epigenetic regulators of gene expression, like the genes implicated in Kabuki syndrome, KMT2D and KDM6A, which play important roles in normal heart development and function. Here, we examined the role of two epigenetic histone modifying enzymes, KMT2D and KDM6A, in the expression of genes associated with early heart and neural crest cell (NCC) development. Using CRISPR/Cas9 mediated mutagenesis of kmt2d, kdm6a and kdm6al in zebrafish, we show that cardiac and NCC gene expression is reduced, which corresponds to affected cardiac morphology and reduced heart rates. To translate our results to a human pathophysiological context and compare transcriptomic targets of KMT2D and KDM6A across species, we performed RNA sequencing (RNA-seq) of lymphoblastoid cells from Kabuki Syndrome patients carrying mutations in KMT2D and KDM6A. We compared the human RNA-seq datasets with RNA-seq datasets obtained from mouse and zebrafish. Our comparative interspecies analysis revealed common targets of KMT2D and KDM6A, which are shared between species, and these target genes are reduced in expression in the zebrafish mutants. Taken together, our results show that KMT2D and KDM6A regulate common and unique genes across humans, mice, and zebrafish for early cardiac and overall development, which can contribute to the understanding of epigenetic dysregulation in CHDs.


2021 ◽  
Vol 18 (17) ◽  
Author(s):  
Micheal Olaolu AROWOLO ◽  
Marion Olubunmi ADEBIYI ◽  
Chiebuka Timothy NNODIM ◽  
Sulaiman Olaniyi ABDULSALAM ◽  
Ayodele Ariyo ADEBIYI

As mosquito parasites breed across many parts of sub-Saharan Africa, infected cells follow an unpredictable and erratic life period. Millions of individual parasites exhibit gene expression. Ribonucleic acid sequencing (RNA-seq) is a popular transcriptional technique that has improved the detection of major genetic probes. RNA-seq analysis generally benefits from machine learning techniques, since it involves the computational interpretation of gene expression. For this study, a feature selection algorithm combining an adaptive genetic algorithm (A-GA) with recursive feature elimination (RFE), denoted A-GA-RFE, was used to extract important information from a high-dimensional gene expression malaria vector RNA-seq dataset. Support Vector Machine (SVM) kernels were used as the classification algorithms to evaluate its predictive performance. The feasibility of this study was confirmed using an RNA-seq dataset from the mosquito Anopheles gambiae. The technique achieved accuracy rates of 98.3 % and 96.7 %. HIGHLIGHTS: dimensionality reduction method based on feature selection; classification using a support vector machine; classification of a malaria vector dataset using an adaptive GA-RFE-SVM.

