GEDI: an R package for integration of transcriptomic data from multiple high-throughput platforms

2021 ◽  
Author(s):  
Mathias N Stokholm ◽  
Maria B Rabaglino ◽  
Haja N Kadarmideen

Transcriptomic data are often expensive and difficult to generate in large cohorts compared with genomic data. It is therefore often important to integrate multiple transcriptomic datasets, from both microarray and next generation sequencing (NGS) platforms, across similar experiments or clinical trials to improve analytical power and the discovery of novel transcripts and genes. However, transcriptomic data integration presents a few challenges, including re-annotation and batch effect removal. We developed the Gene Expression Data Integration (GEDI) R package to enable transcriptomic data integration by combining existing R packages. With just four functions, the GEDI R package makes constructing a transcriptomic data integration pipeline straightforward. Together, the functions overcome the complications in transcriptomic data integration by automatically re-annotating the data and removing the batch effect. The removal of the batch effect is verified with Principal Component Analysis, and the data integration is verified using a logistic regression model with forward stepwise feature selection. To demonstrate the functionalities of the GEDI package, we integrated five bovine endometrial transcriptomic datasets from the NCBI Gene Expression Omnibus. The datasets included Affymetrix, Agilent and RNA-sequencing data. Furthermore, we compared the GEDI package to existing tools and found that GEDI is the only tool that provides a full transcriptomic data integration pipeline, including verification of both batch effect removal and data integration.

2017 ◽  
Author(s):  
Maren Büttner ◽  
Zhichao Miao ◽  
F Alexander Wolf ◽  
Sarah A Teichmann ◽  
Fabian J Theis

Abstract: Single-cell transcriptomics is a versatile tool for exploring heterogeneous cell populations. As with all genomics experiments, batch effects can hamper data integration and interpretation. The success of batch effect correction is often evaluated by visual inspection of dimension-reduced representations such as principal component analysis. This is inherently imprecise due to the high number of genes and non-normal distribution of gene expression. Here, we present a k-nearest neighbour batch effect test (kBET, https://github.com/theislab/kBET) to quantitatively measure batch effects. kBET is easier to interpret, more sensitive and more robust than visual evaluation and other measures of batch effects. We use kBET to assess commonly used batch regression and normalisation approaches, and quantify the extent to which they remove batch effects while preserving biological variability. Our results illustrate that batch correction based on log-transformation or scran pooling followed by ComBat reduced the batch effect while preserving structure across data sets. Finally, we show that kBET can pinpoint successful data integration methods across multiple data sets, in this case from different publications all charting mouse embryonic development. This has important implications for future data integration efforts, which will be central to projects such as the Human Cell Atlas where data for the same tissue may be generated in multiple locations around the world. [Before final publication, we will upload the R package to Bioconductor]
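The idea behind a k-nearest-neighbour batch test can be sketched in a few lines. The following is a simplified illustration, not the kBET implementation: it assumes exactly two batches, uses plain Euclidean distance, tests every sample rather than a subsample, and applies a chi-squared test with one degree of freedom. The function name `knn_batch_rejection_rate` and the toy data are hypothetical.

```python
import math

def knn_batch_rejection_rate(points, batches, k=10, alpha=0.05):
    """Fraction of samples whose k-nearest-neighbour batch mix differs
    from the global batch mix (sketch for exactly two batch labels)."""
    labels = sorted(set(batches))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two batches")
    n = len(points)
    global_frac = [batches.count(lab) / n for lab in labels]
    rejected = 0
    for i, p in enumerate(points):
        # Indices of the k nearest neighbours of sample i, excluding itself.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        # Pearson chi-squared statistic: observed vs expected batch counts.
        stat = 0.0
        for lab, frac in zip(labels, global_frac):
            observed = sum(1 for j in neighbours if batches[j] == lab)
            expected = frac * k
            stat += (observed - expected) ** 2 / expected
        # Survival function of chi-squared with 1 degree of freedom.
        p_value = math.erfc(math.sqrt(stat / 2.0))
        if p_value < alpha:
            rejected += 1
    return rejected / n

# Toy data (made up): 40 one-dimensional samples, 20 per batch.
batches = ["A", "B"] * 20
mixed = [(float(i),) for i in range(40)]                # batches interleaved
separated = [(float(i // 2 + (100 if i % 2 else 0)),)   # batch B shifted away
             for i in range(40)]

mixed_rate = knn_batch_rejection_rate(mixed, batches)
separated_rate = knn_batch_rejection_rate(separated, batches)
print(mixed_rate, separated_rate)
```

A rejection rate near zero suggests that neighbourhoods are well mixed across batches, while a rate near one suggests a strong batch effect.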


2020 ◽  
Vol 26 (29) ◽  
pp. 3619-3630
Author(s):  
Saumya Choudhary ◽  
Dibyabhaba Pradhan ◽  
Noor S. Khan ◽  
Harpreet Singh ◽  
George Thomas ◽  
...  

Background: Psoriasis is a chronic immune-mediated skin disorder with a global prevalence of 0.2-11.4%. Although mortality is rare, the severity of the disease is reflected in its accompanying comorbidities, which have even led to psychological problems among several patients. The cause and the disease mechanism still remain elusive. Objective: To identify potential therapeutic targets and affected pathways for better insight into the disease pathogenesis. Method: The gene expression profiles GSE13355 and GSE14905 were retrieved from the NCBI Gene Expression Omnibus database. The GEO profiles were integrated and the DEGs of lesional and non-lesional psoriasis skin were identified using the affy package in R. The Kyoto Encyclopaedia of Genes and Genomes pathways of the DEGs were analyzed using clusterProfiler. Cytoscape V3.7.1 was used to construct the protein interaction network and analyze the interactome map of candidate proteins encoded by the DEGs. Functionally relevant clusters were detected with Cytohubba and MCODE. Results: A total of 1013 genes were differentially expressed in lesional skin, of which 557 were upregulated and 456 were downregulated. Seven dysregulated genes were extracted in non-lesional skin. The disease gene network of these DEGs revealed 75 newly identified differentially expressed genes that might have a role in the development and progression of the disease. GO analysis revealed keratinocyte differentiation and positive regulation of cytokine production to be the most enriched biological process and molecular function. Cytokine-cytokine receptor interaction was the most enriched pathway. Among the 1013 DEGs identified in the lesional group, 36 were found to have an altered genetic signature, including IL1B and STAT3, which are also reported as hub genes. CCNB1, CCNA2, CDK1, IL1B, CXCL8, MKI67, ESR1, UBE2C, STAT1 and STAT3 were the top 10 hub genes.
Conclusion: The hub genes, genomically altered DEGs and other newly identified dysregulated genes would improve our understanding of psoriasis pathogenesis. Moreover, the hub genes could be explored as potential therapeutic targets for psoriasis.


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Fanyan Meng ◽  
Ningna Du ◽  
Daoming Xu ◽  
Li Kuai ◽  
Lanying Liu ◽  
...  

Ankylosing spondylitis (AS) is an autoimmune disease that mainly affects the spinal joints, sacroiliac joints, and adjacent soft tissues. We conducted a bioinformatics analysis to explore the molecular mechanisms related to AS pathogenesis and to uncover novel potential molecular targets for the treatment of AS. The GSE25101 profiles, containing gene expression data extracted from the blood of 16 AS patients and 16 matched controls, were acquired from the Gene Expression Omnibus (GEO) database. Background correction and standardization were carried out using the transcript per million (TPM) method. Comparing AS patients with the normal group using the limma R package, we identified 199 upregulated and 121 downregulated differentially expressed genes (DEGs). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway and Gene Ontology (GO) biological process enrichment analyses revealed that the upregulated DEGs were mainly associated with the spliceosome, ribosome, RNA-catabolic process, electron transport chain, etc. The downregulated DEGs primarily participated in T cell-associated pathways and processes. Analysis of the protein-protein interaction (PPI) network revealed that the hub genes, comprising MRPL13, MRPL22, LSM3, COX7A2, COX7C, EP300, PTPRC, and CD4, could be treatment targets in AS. Our data furnish new hints to uncover the features of AS and to explore more promising treatment targets for AS.
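Hub genes in a PPI network are commonly ranked by how many interaction partners each node has. The sketch below shows only that simplest criterion (node degree); the abstract does not state which ranking method was used, and the edge list, node names and function names here are made up for illustration.

```python
from collections import Counter

def node_degrees(edges):
    """Count distinct interaction partners per node in an undirected network."""
    # Drop self-loops and treat (a, b) and (b, a) as the same interaction.
    unique = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    degree = Counter()
    for a, b in unique:
        degree[a] += 1
        degree[b] += 1
    return degree

def top_hubs(edges, n=3):
    """Return the n highest-degree nodes, ties broken alphabetically."""
    degree = node_degrees(edges)
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical interaction list (placeholder names, not real PPI data).
edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P3", "P1"),  # duplicate of ("P1", "P3")
    ("P4", "P5"),
]
hubs = top_hubs(edges, n=2)
print(hubs)
```

Degree is only one of several centrality measures; betweenness or clique-based scores can rank the same network differently.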


2019 ◽  
Author(s):  
Bastian Seelbinder ◽  
Thomas Wolf ◽  
Steffen Priebe ◽  
Sylvie McNamara ◽  
Silvia Gerber ◽  
...  

Abstract: In transcriptomics, the study of the total set of RNAs transcribed by the cell, RNA sequencing (RNA-seq) has become the standard tool for analysing gene expression. The primary goal is the detection of genes whose expression changes significantly between two or more conditions, either for a single species or for two or more interacting species at the same time (dual RNA-seq, triple RNA-seq and so forth). The analysis of RNA-seq can be simplified as many steps of the data pre-processing can be standardised in a pipeline. In this publication we present the “GEO2RNAseq” pipeline for complete, quick and concurrent pre-processing of single, dual, and triple RNA-seq data. It covers all pre-processing steps starting from raw sequencing data to the analysis of differentially expressed genes, including various tables and figures to report intermediate and final results. Raw data may be provided in FASTQ format or can be downloaded automatically from the Gene Expression Omnibus repository. GEO2RNAseq strongly incorporates experimental as well as computational metadata. GEO2RNAseq is implemented in R, lightweight, easy to install via Conda and easy to use, but still very flexible through using modular programming and offering many extensions and alternative workflows. GEO2RNAseq is publicly available at https://anaconda.org/xentrics/r-geo2rnaseq and https://bitbucket.org/thomas_wolf/geo2rnaseq/overview, including source code, installation instructions, and comprehensive package documentation.


2021 ◽  
Vol 30 (4) ◽  
pp. 444-452
Author(s):  
Kyung-Wan Baek ◽  
So-Jeong Kim ◽  
Ji-Seok Kim ◽  
Sun-Ok Kwon

PURPOSE: This study evaluates the differences in the expression of genes frequently analyzed in the field of exercise science between the skeletal muscle tissue and various cell types that comprise the skeletal muscle tissue. METHODS: We summarized the genes and proteins expressed in the skeletal muscle that were published in the “Exercise Science” journal from 2015 to present. Thereafter, we selected the 15 genes and proteins that were most often analyzed in the skeletal muscle. These genes and proteins were horizontally compared for expression differences in skeletal muscle components and cultured cells based on NCBI Gene Expression Omnibus DataSets. RESULTS: The most analyzed genes (encoding analyzed proteins) in skeletal muscle tissues in “Exercise Science” were PPARGC1A, PPARD, MTOR, MAP1LC3A, MAP1LC3B, PRKAA1, AKT1, SLC2A4, MAPK1, COX4I1, MAPK14, MEF2A, MAPK8, RPS6KB1, and SOD1. Among them, PPARGC1A, AKT1, SLC2A4, MAPK1, and COX4I1 were specifically expressed in the skeletal muscle. However, expression of other genes was found to be significantly affected in other cell types of the skeletal muscle tissue. CONCLUSIONS: Genes such as PPARGC1A, which are specifically expressed in the skeletal muscle, may be analyzed without pretreating (such as perfusion) the skeletal muscle tissue. However, expression of other genes may depend on the skeletal muscle cell type. Thus, in such instances, pretreatment, such as perfusion and isolation, should be considered.


2021 ◽  
Author(s):  
Pegah Einaliyan ◽  
Ali Owfi ◽  
Mohammadamin Mahmanzar ◽  
Taha Aghajanzadeh ◽  
Morteza Hadizadeh ◽  
...  

Abstract. Background: Currently, non-alcoholic fatty liver disease (NAFLD) is one of the most common chronic liver diseases in the world. In short-term forecasts up to 2025, non-alcoholic steatohepatitis (NASH) with fibrosis is one of the leading causes of liver transplantation. Cohort studies revealed that NASH carries a higher risk of fibrosis progression among NAFLD patients. Identifying differentially expressed genes helps to determine NASH pathogenic pathways, make more accurate diagnoses, and prescribe appropriate treatment. Methods and Results: In this study, we found 11 NASH datasets by searching the Gene Expression Omnibus (GEO) database. Subsequently, NASH datasets with low quality-control scores were excluded. Four datasets were analyzed with R/Bioconductor packages. Then, all integrated genes were imported into Cytoscape to illustrate the protein-protein interaction network. The degree of every node was calculated to determine the hub genes with critical roles in the network. Possible correlations between expression profiles of mutual DEGs were identified employing Principal Component Analysis (PCA). Primary analyzed data were filtered based on gene expression (logFC > 1 or logFC < −1) and adjusted P-value (< 0.05). Ultimately, among 379 DEGs, we selected the top 10 genes (MYC, JUN, EGR1, FOS, CCL2, IL1B, CXCL8, PTGS2, IL6, SERPINE1) as candidates among up- and downregulated genes, and critical pathways such as IL-6, IL-17, TGF-β, and TNF-α were identified. Conclusion: The present study suggests important DEGs, biological processes, and critical pathways involved in the pathogenesis of NASH. Further investigations are needed to clarify the exact mechanisms underlying the development and progression of NASH.
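The filtering step described above (logFC above 1 or below −1, adjusted P-value below 0.05) can be illustrated with a minimal sketch. The `GeneResult` record, the `filter_degs` function and the example values are hypothetical; the thresholds are the ones stated in the abstract.

```python
from dataclasses import dataclass

@dataclass
class GeneResult:
    gene: str      # gene identifier
    log_fc: float  # log fold change between conditions
    adj_p: float   # multiple-testing adjusted P-value

def filter_degs(results, lfc_cutoff=1.0, p_cutoff=0.05):
    """Split significant genes into upregulated and downregulated lists."""
    up = [r.gene for r in results
          if r.adj_p < p_cutoff and r.log_fc > lfc_cutoff]
    down = [r.gene for r in results
            if r.adj_p < p_cutoff and r.log_fc < -lfc_cutoff]
    return up, down

# Made-up example rows, not data from the study.
results = [
    GeneResult("GENE_A", 2.3, 0.001),
    GeneResult("GENE_B", -1.8, 0.010),
    GeneResult("GENE_C", 0.4, 0.002),  # significant but small change
    GeneResult("GENE_D", 3.1, 0.200),  # large change but not significant
]
up, down = filter_degs(results)
print(up, down)
```

Both conditions must hold at once, so a gene with a large fold change but a weak adjusted P-value is excluded, as is a significant gene with a small change.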


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Sepideh Dashti ◽  
Mohammad Taheri ◽  
Soudeh Ghafouri-Fard

Abstract: Breast cancer is a highly heterogeneous disorder characterized by dysregulation of expression of numerous genes and cascades. In the current study, we aim to use a systems biology strategy to identify key genes and signaling pathways in breast cancer. We retrieved data from two microarray datasets (GSE65194 and GSE45827) from the NCBI Gene Expression Omnibus database. R packages were used for identification of differentially expressed genes (DEGs), assessment of gene ontology and pathway enrichment evaluation. The DEGs were integrated to construct a protein–protein interaction network. Next, hub genes were recognized using the Cytoscape software, and lncRNA–mRNA co-expression analysis was performed to evaluate the potential roles of lncRNAs. Finally, the clinical importance of the obtained genes was assessed using Kaplan–Meier survival analysis. In the present study, 887 DEGs, including 730 upregulated and 157 downregulated DEGs, were detected between breast cancer and normal samples. By combining the results of functional analysis, MCODE, CytoNCA and CytoHubba, two hub genes, MAD2L1 and CCNB1, were selected. We also identified 12 lncRNAs with significant correlation with the MAD2L1 and CCNB1 genes. According to the Kaplan–Meier plotter database, MAD2L1, CCNA2, RAD51-AS1 and LINC01089 have the greatest predictive potential among all candidate hub genes. Our study offers a framework for recognition of the mRNA–lncRNA network in breast cancer and detection of important pathways that could be used as therapeutic targets in this kind of cancer.
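Kaplan–Meier survival analysis rests on the product-limit estimator: at each time where an event occurs, the running survival probability is multiplied by the fraction of at-risk subjects who survive that time. A minimal sketch follows, with a hypothetical function name and made-up follow-up data; it does not reproduce the Kaplan–Meier plotter database or any result from the study.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times  : follow-up time per subject
    events : 1 if the event was observed, 0 if the subject was censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        at_risk -= removed
    return curve

# Made-up follow-up data: five subjects, two of them censored.
curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects count towards the risk set up to their last follow-up time but never lower the curve themselves, which is why the subject censored at time 8 produces no step.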


2020 ◽  
Vol 36 (15) ◽  
pp. 4301-4308
Author(s):  
Stephan Seifert ◽  
Sven Gundlach ◽  
Olaf Junge ◽  
Silke Szymczak

Abstract. Motivation: High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets. Results: The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate. Availability and implementation: An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality controlled experimental datasets downloaded from Gene Expression Omnibus (GEO). Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (10) ◽  
pp. 3115-3123 ◽  
Author(s):  
Teng Fei ◽  
Tianwei Yu

Abstract. Motivation: Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data. Results: We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods. Availability and implementation: The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts. Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Joshua D Fortriede ◽  
Troy J Pells ◽  
Stanley Chu ◽  
Praneet Chaturvedi ◽  
DongZhuo Wang ◽  
...  

Abstract: Xenbase (www.xenbase.org) is a knowledge base for researchers and biomedical scientists who employ the amphibian Xenopus as a model organism in biomedical research to gain a deeper understanding of developmental and disease processes. Through expert curation and automated data provisioning from various sources, Xenbase strives to integrate the body of knowledge on Xenopus genomics and biology together with the visualization of biologically significant interactions. Most current studies utilize next generation sequencing (NGS), but until now the results of different experiments were difficult to compare and not integrated with other Xenbase content. Xenbase has developed a suite of tools, interfaces and data processing pipelines that transforms NCBI Gene Expression Omnibus (GEO) NGS content into deeply integrated gene expression and chromatin data, mapping all aligned reads to the most recent genome builds. This content can be queried and visualized via multiple tools and also provides the basis for future automated ‘gene expression as a phenotype’ and gene regulatory network analyses.

