Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study

2020 ◽  
Vol 36 (15) ◽  
pp. 4301-4308
Author(s):  
Stephan Seifert ◽  
Sven Gundlach ◽  
Olaf Junge ◽  
Silke Szymczak

Abstract Motivation High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets. Results The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate. Availability and implementation An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality controlled experimental datasets downloaded from Gene Expression Omnibus (GEO). Supplementary information Supplementary data are available at Bioinformatics online.

2020 ◽  
Vol 7 (Supplement_1) ◽  
pp. S516-S517
Author(s):  
Kulachanya Suwanwongse ◽  
Nehad Shabarek

Abstract Background Human immunodeficiency virus (HIV) disease progression differs between the sexes: women usually progress to acquired immunodeficiency syndrome (AIDS) faster than men. The mechanisms underlying this sex difference in HIV progression are unclear. We conducted a bioinformatics analysis of differentially expressed genes (DEGs) in women and men with HIV disease to understand the sex-based differences in HIV pathogenesis. Methods We obtained microarray data from the Gene Expression Omnibus (GEO) database using our pre-defined search strategy and analyzed the data using the GEO2R platform. A t-test was used to identify DEGs between females and males with HIV disease. The Database for Annotation, Visualization, and Integrated Discovery (DAVID) was used to systematically extract the biological features and processes of the retrieved DEGs via gene ontology (GO) analysis. A systematic search was performed to evaluate the function of each DEG and its possible association with HIV. Results One gene expression profiling dataset was retrieved: GSE140713, comprising HIV-1-infected samples from 40 males and 10 females. The GEO2R analysis yielded 19 DEGs (Table 1). The GO analysis results are shown in Tables 2 and 3. Following the systematic search, we found two DEGs for which previous studies reported an association with HIV: DDX3X (20 studies) and PDS5 (1 study). We propose that DDX3X (t = 5.3, p = 0.0037) is responsible for the sex differences in HIV progression because: 1. DDX3X is needed in the HIV-1 life cycle. 2. Several studies confirmed a positive correlation between DDX3X expression and HIV-1 replication. 3. Our study found up-regulated DDX3X expression in women, consistent with the fact that women progress to AIDS faster than men. 4. Our GO analysis showed that female up-regulated genes were enriched in the positive regulation of gene expression pathway, which can be explained by DDX3X and its underlying mechanism.
Table 1: DEGs in women and men with HIV-1 disease Table 2: GO functional enrichment pathway analyses of all retrieved DEGs Table 3: GO functional enrichment pathway analyses of down- and up-regulated clusters of DEGs Conclusion Aberrant DDX3X expression may contribute to sex-based differences in HIV disease. Drugs modifying DDX3X gene expression will be beneficial in the treatment of HIV, especially in addressing HIV drug resistance, because current anti-HIV drugs target viral components, which poses a risk of viral mutation. Disclosures All Authors: No reported disclosures
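
The per-gene group comparison described in the Methods can be illustrated with a minimal sketch of a two-sample t statistic. It assumes Welch's unequal-variance form and uses made-up expression values (the function name `welch_t` and the numbers are hypothetical, not taken from GSE140713). GEO2R itself relies on the limma package and moderated statistics, so this only illustrates the idea and is not the platform's computation.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)  # sample variances (n - 1)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical log2 expression values for a single gene.
female = [8.1, 8.4, 8.3, 8.6]
male = [7.2, 7.5, 7.1, 7.4, 7.3]
t_stat, dof = welch_t(female, male)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A positive statistic here means higher expression in the first group. In a real analysis the test would be repeated for every gene and the p-values adjusted for multiple testing.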


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Fanyan Meng ◽  
Ningna Du ◽  
Daoming Xu ◽  
Li Kuai ◽  
Lanying Liu ◽  
...  

Ankylosing spondylitis (AS) is an autoimmune disease that mainly affects the spinal joints, sacroiliac joints, and adjacent soft tissues. We conducted a bioinformatics analysis to explore the molecular mechanisms related to AS pathogenesis and to uncover novel potential molecular targets for the treatment of AS. The GSE25101 profiles, containing gene expression data extracted from the blood of 16 AS patients and 16 matched controls, were acquired from the Gene Expression Omnibus (GEO) database. Background correction and standardization were carried out using the transcript per million (TPM) method. Comparing AS patients with the normal group, we identified 199 upregulated and 121 downregulated differentially expressed genes (DEGs) using the limma R package. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway and Gene Ontology (GO) biological process enrichment analyses revealed that the upregulated DEGs were mainly associated with the spliceosome, ribosome, RNA-catabolic process, electron transport chain, etc., whereas the downregulated DEGs primarily participated in T cell-associated pathways and processes. Analysis of the protein-protein interaction (PPI) network revealed that the hub genes, comprising MRPL13, MRPL22, LSM3, COX7A2, COX7C, EP300, PTPRC, and CD4, could be treatment targets in AS. Our data provide new clues to the features of AS and to more promising treatment targets for AS.
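
For reference, the transcripts-per-million scaling named above can be sketched as follows. The counts, gene lengths and gene names are hypothetical. The sketch shows the standard TPM definition for sequencing counts, not the authors' actual preprocessing of GSE25101.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts to transcripts per million (TPM)."""
    # Step 1: reads per kilobase, which corrects for gene length.
    rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    # Step 2: rescale so that the values of one sample sum to one million.
    scale = sum(rpk.values()) / 1_000_000
    return {gene: value / scale for gene, value in rpk.items()}


# Hypothetical counts and gene lengths (in kilobases) for one sample.
counts = {"geneA": 500, "geneB": 1000, "geneC": 250}
lengths_kb = {"geneA": 1.0, "geneB": 4.0, "geneC": 0.5}
values = tpm(counts, lengths_kb)
print(values)
```

Because every sample is rescaled to the same total, TPM values are comparable across samples in terms of relative abundance.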


2019 ◽  
Vol 36 (8) ◽  
pp. 2608-2610
Author(s):  
Aritro Nath ◽  
Jeremy Chang ◽  
R Stephanie Huang

Abstract Summary MicroRNAs (miRNAs) are critical post-transcriptional regulators of gene expression. Due to challenges in accurate profiling of small RNAs, a vast majority of public transcriptome datasets lack reliable miRNA profiles. However, the biological consequence of miRNA activity in the form of altered protein-coding gene (PCG) expression can be captured using machine-learning algorithms. Here, we present iMIRAGE (imputed miRNA activity from gene expression), a convenient tool to predict miRNA expression using PCG expression of the test datasets. The iMIRAGE package provides an integrated workflow for normalization and transformation of miRNA and PCG expression data, along with the option to utilize predicted miRNA targets to impute miRNA activity from independent test PCG datasets. Availability and implementation The iMIRAGE package for R, along with package documentation and vignette, is available at https://aritronath.github.io/iMIRAGE/index.html. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Weiguang Mao ◽  
Javad Rahimikollu ◽  
Ryan Hausler ◽  
Maria Chikina

Abstract Motivation RNA-seq technology provides unprecedented power in the assessment of the transcription abundance and can be used to perform a variety of downstream tasks such as inference of gene-correlation network and eQTL discovery. However, raw gene expression values have to be normalized for nuisance biological variation and technical covariates, and different normalization strategies can lead to dramatically different results in the downstream study. Results We describe a generalization of singular value decomposition-based reconstruction for which the common techniques of whitening, rank-k approximation and removing the top k principal components are special cases. Our simple three-parameter transformation, DataRemix, can be tuned to reweigh the contribution of hidden factors and reveal otherwise hidden biological signals. In particular, we demonstrate that the method can effectively prioritize biological signals over noise without leveraging external dataset-specific knowledge, and can outperform normalization methods that make explicit use of known technical factors. We also show that DataRemix can be efficiently optimized via a Thompson sampling approach, which makes it feasible for computationally expensive objectives such as eQTL analysis. Finally, we apply our method to the Religious Orders Study and Memory and Aging Project dataset, and we report what to our knowledge is the first replicable trans-eQTL effect in human brain. Availability and implementation DataRemix is an R package which is freely available at GitHub (https://github.com/wgmao/DataRemix). Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
Mathias N Stokholm ◽  
Maria B Rabaglino ◽  
Haja N Kadarmideen

Transcriptomic data are often expensive and difficult to generate in large cohorts compared with genomic data. It is therefore often important to integrate multiple transcriptomic datasets, from both microarray and next generation sequencing (NGS) platforms, across similar experiments or clinical trials to improve analytical power and the discovery of novel transcripts and genes. However, transcriptomic data integration presents a few challenges, including re-annotation and batch effect removal. We developed the Gene Expression Data Integration (GEDI) R package to enable transcriptomic data integration by combining already existing R packages. With just four functions, the GEDI R package makes constructing a transcriptomic data integration pipeline straightforward. Together, the functions overcome the complications in transcriptomic data integration by automatically re-annotating the data and removing the batch effect. The removal of the batch effect is verified with Principal Component Analysis, and the data integration is verified using a logistic regression model with forward stepwise feature selection. To demonstrate the functionalities of the GEDI package, we integrated five bovine endometrial transcriptomic datasets from the NCBI Gene Expression Omnibus. The datasets included Affymetrix, Agilent and RNA-sequencing data. Furthermore, we compared the GEDI package to already existing tools and found that GEDI is the only tool that provides a full transcriptomic data integration pipeline including verification of both batch effect removal and data integration.


2021 ◽  
Author(s):  
Mohib kakar ◽  
Muhammad Mehboob ◽  
Muhammad Akram ◽  
Imran Iqbal ◽  
Hafza Ijaz ◽  
...  

Abstract Objective The goal of this study was to identify possible core genes associated with hepatocellular carcinoma (HCC) pathogenesis and prognosis. Methods The Gene Expression Omnibus (GEO) contains datasets of gene expression, miRNA and methylation patterns of diseased and healthy/control patients. The GSE62232 dataset was selected from GEO. A total of 91 samples were included, comprising 81 HCC samples and 10 healthy control samples. GSE62232 was analyzed with GEO2R, and functional enrichment analysis was performed to extract meaningful information from the set of DEGs. A protein-protein relationship network search method was used to extract interacting genes. The MCC method was used to rank the top 10 genes according to their importance. Hub genes in the network were analyzed using GEPIA to estimate the effect of their differential expression on cancer progression. Results We identified the top 10 hub genes with the cytoHubba plugin. These genes include the cell cycle regulatory cyclins and cyclin-dependent proteins CCNA2, CCNB1 and CDK1. The pathogenesis and prognosis of HCC may be directly linked with these genes. Conclusion In this analysis, we identified critical genes for HCC that merit further study as diagnostic and predictive biomarkers and that could support selective molecular therapy for HCC.
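
The MCC (maximal clique centrality) ranking mentioned in the Methods can be sketched as below. This assumes the definition as we understand it from the cytoHubba publication: a node's score is the sum of (|C| - 1)! over all maximal cliques C that contain it. The helper names and the four-node toy network are hypothetical, and the basic Bron-Kerbosch recursion is used for clique enumeration purely for brevity. It is not a reimplementation of the plugin.

```python
from math import factorial


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def mcc_scores(adj):
    """Sum (|C| - 1)! over the maximal cliques C containing each node."""
    scores = {node: 0 for node in adj}
    for clique in maximal_cliques(adj):
        if len(clique) < 2:
            continue  # assumption: isolated nodes keep a score of zero
        for node in clique:
            scores[node] += factorial(len(clique) - 1)
    return scores


# Hypothetical toy network: a triangle A-B-C plus a pendant node D.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")]
scores = mcc_scores(build_adjacency(edges))
ranking = sorted(scores, key=lambda node: (-scores[node], node))
print(scores, ranking)
```

Nodes that sit in large, densely connected cliques receive factorially larger scores, which is why such rankings tend to favour tightly co-regulated gene groups such as cell cycle regulators.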


2021 ◽  
Author(s):  
Mohib kakar ◽  
Muhammad Mehboob ◽  
Muhammad Akram ◽  
Imran Iqbal ◽  
Hafza Ijaz ◽  
...  

Abstract The goal of this study was to identify possible core genes associated with hepatocellular carcinoma (HCC) pathogenesis and prognosis. Gene Expression Omnibus (GEO) contains datasets of gene expression, miRNA and methylation patterns of diseased and healthy/control patients. The GSE62232 dataset was selected from GEO. A total of 91 samples were included, comprising 81 HCC samples and 10 healthy control samples. GSE62232 was analyzed with GEO2R, and functional enrichment analysis was performed to extract meaningful information from the set of DEGs. A protein-protein relationship network search method was used to extract interacting genes. The MCC method was used to rank the top 10 genes according to their importance. Hub genes in the network were analyzed using GEPIA to estimate the effect of their differential expression on cancer progression. We identified the top 10 hub genes with the cytoHubba plugin. These genes include the cell cycle regulatory cyclins and cyclin-dependent proteins CCNA2, CCNB1 and CDK1. The pathogenesis and prognosis of HCC may be directly linked with these genes. In this analysis, we identified critical genes for HCC that merit further study as diagnostic and predictive biomarkers and that could support selective molecular therapy for HCC.


2020 ◽  
Vol 36 (9) ◽  
pp. 2932-2933 ◽  
Author(s):  
Angela Serra ◽  
Laura Aliisa Saarimäki ◽  
Michele Fratello ◽  
Veer Singh Marwah ◽  
Dario Greco

Abstract Motivation The analysis of dose-dependent effects on gene expression is gaining attention in the field of toxicogenomics. Currently available computational methods are usually limited to specific omics platforms or biological annotations and are able to analyse only one experiment at a time. Results We developed the software BMDx with a graphical user interface for the Benchmark Dose (BMD) analysis of transcriptomics data. We implemented an approach based on the fitting of multiple models and the selection of the optimal model based on the Akaike Information Criterion. The BMDx tool takes as input a gene expression matrix and a phenotype table, and computes the BMD, its related values, and IC50/EC50 estimations. It reports interactive tables and plots that the user can investigate for further details of the fitting, dose effects and functional enrichment. BMDx allows a fast and convenient comparison of the BMD values of a transcriptomics experiment at different time points and an effortless way to interpret the results. Furthermore, BMDx allows multiple experiments to be analysed and compared at once. Availability and implementation BMDx is implemented as an R/Shiny software and is available at https://github.com/Greco-Lab/BMDx/. Supplementary information Supplementary data are available at Bioinformatics online.
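
The idea of fitting several models and keeping the one with the lowest Akaike Information Criterion can be sketched as follows. This is a minimal illustration under stated assumptions: only two candidate models (constant and straight line) are compared, the least-squares form AIC = n ln(RSS/n) + 2k with Gaussian errors is used, and the dose-response values are invented. BMDx fits a richer family of dose-response models, and none of the function names below come from its code.

```python
import math


def fit_constant(x, y):
    """Least-squares fit of y = c; returns (RSS, number of parameters)."""
    mean_y = sum(y) / len(y)
    rss = sum((yi - mean_y) ** 2 for yi in y)
    return rss, 1


def fit_linear(x, y):
    """Least-squares fit of y = a + b*x; returns (RSS, number of parameters)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return rss, 2


def aic(rss, k, n):
    """AIC for a least-squares fit with Gaussian errors (constants dropped)."""
    return n * math.log(rss / n) + 2 * k


# Hypothetical expression of one gene measured at increasing doses.
doses = [0, 1, 2, 4, 8, 16]
response = [1.0, 1.2, 1.3, 1.9, 2.8, 4.9]

candidates = {"constant": fit_constant, "linear": fit_linear}
aic_by_model = {}
for name, fit in candidates.items():
    rss, k = fit(doses, response)
    aic_by_model[name] = aic(rss, k, len(doses))

best = min(aic_by_model, key=aic_by_model.get)
print(best, aic_by_model)
```

The 2k penalty term is what keeps a more flexible model from being chosen merely because it has more parameters.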


2020 ◽  
Vol 19 ◽  
pp. 153303382097748
Author(s):  
Shao-wei Zhang ◽  
Nan Zhang ◽  
Na Wang

Background: Esophageal cancer (EC) is a primary malignant tumor originating from the esophageal epithelium. Surgical resection is a potential treatment for EC, but it is only appropriate for patients who have locally resectable lesions. However, most patients with EC are at a late stage when diagnosed. Therefore, there is an urgent need to further explore the pathogenesis of EC to enable early diagnosis and treatment. Methods: We downloaded 2 expression profile datasets (GSE92396 and GSE100942) from the Gene Expression Omnibus (GEO) database. GEO2R was used to identify the differentially expressed genes (DEGs) between EC and control samples. Functional enrichment analysis was performed with the DAVID tool, a protein–protein interaction (PPI) network was constructed, and hub genes were identified. The impact of hub gene expression on overall survival and their expression based on immunohistochemistry were analyzed. Associated microRNAs were also predicted. Results: A total of 36 common DEGs were identified. The GO and KEGG analyses showed that the variations were predominantly concentrated in the extracellular matrix (ECM), ECM organization, DNA binding, platelet activation, and ECM-receptor interactions. COL3A1 and POSTN were more highly expressed in EC tissues than in healthy tissues. Analysis of pathologic stages showed that when COL3A1 and POSTN were highly expressed, the pathologic stage of EC patients was relatively high (P < 0.005). Conclusions: COL3A1 and POSTN may play an important role in the occurrence and advancement of EC. These genes could provide novel ideas and a basis for the diagnosis and targeted treatment of EC.
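
Identifying DEGs shared by two datasets amounts to intersecting the per-dataset results. The sketch below assumes, beyond what the abstract states, that a gene counts as common only if it changes in the same direction in both datasets. The gene names and directions are placeholders, not results from GSE92396 or GSE100942.

```python
def common_degs(degs_a, degs_b):
    """Return genes called differentially expressed in both datasets
    with the same direction ("up" or "down")."""
    return {
        gene: direction
        for gene, direction in degs_a.items()
        if degs_b.get(gene) == direction
    }


# Hypothetical DEG calls from two datasets.
dataset_1 = {"gene1": "up", "gene2": "down", "gene3": "up", "gene4": "up"}
dataset_2 = {"gene1": "up", "gene2": "up", "gene4": "up", "gene5": "down"}

shared = common_degs(dataset_1, dataset_2)
print(sorted(shared))
```

Dropping the direction check would reduce this to a plain set intersection of gene names, which yields a larger but less consistent list.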


2021 ◽  
Author(s):  
YiQun Ma ◽  
LISHI SHAO ◽  
CHEN SHI ◽  
JIAPING WANG

Abstract Background: Infection with hepatitis C virus (HCV) can cause hepatic fibrosis and cirrhosis, thereby significantly increasing the risk of HCC development. Many prior studies have shown that oncogenesis and cancer progression are governed by competing endogenous RNA (ceRNA) networks composed of long non-coding RNAs (lncRNAs), microRNAs (miRNAs), and mRNAs. We therefore sought to identify a novel ceRNA network related to HCV-associated HCC and to evaluate its prognostic relevance. Methods: Differentially expressed genes (DEGs) in the GSE140845 Gene Expression Omnibus (GEO) dataset were identified using NetworkAnalyst and were subjected to Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, and Reactome analyses. In addition, a protein-protein interaction (PPI) network was generated, and key hub genes were detected. Expression levels of the hub genes and of their upstream lncRNAs and miRNAs, together with the associated survival analyses, were assessed using appropriate bioinformatics databases. Predicted target relationships were additionally used to establish putative ceRNA networks for HCV-related HCC. Results: In total, 372 upregulated and 360 downregulated significant DEGs were identified. Functional enrichment analyses suggested that DE-mRNAs were associated with nuclear division, the cell cycle, and ATPase activity. The top 11 genes with the greatest degree of connectivity among these DE-mRNAs were selected for subsequent prognostic evaluation. The differential expression of six of these candidate mRNAs (BUB1, BUB1B, CDC20, CDC45, CDK1, NDC80) in liver tissue was validated. After further analyses of the expression and prognostic relevance of the miRNAs and lncRNAs predicted to lie upstream of these DE-mRNAs, we identified 22 miRNAs and 4 lncRNAs significantly associated with poorer HCV-related HCC prognosis.
By combining the results of these analyses, we also identified the BUB1-hsa-miR-193a-3p-MALAT1 ceRNA sub-network as being related to the survival of these patients. Conclusion: This study provides novel insights into the mRNA-miRNA-lncRNA ceRNA network and reveals potential lncRNA biomarkers in HCV-related HCC.
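
Selecting the genes with the greatest degree of connectivity in a PPI network can be sketched as a simple degree count. The edge list and node names below are hypothetical placeholders, and `top_hubs` is an illustrative helper rather than a function of any tool named in the abstract.

```python
from collections import Counter


def top_hubs(edges, n):
    """Rank nodes of an undirected network by degree and return the top n."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Sort by descending degree, then by name for a deterministic order.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical interaction pairs (each pair listed once).
ppi_edges = [
    ("g1", "g2"), ("g1", "g3"), ("g1", "g4"),
    ("g2", "g3"), ("g4", "g5"),
]
hubs = top_hubs(ppi_edges, 2)
print(hubs)
```

Degree is the simplest hub criterion. It ignores how a node's neighbours are connected to each other, which clique-based scores take into account.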

