GEOlimma: differential expression analysis and feature selection using pre-existing microarray data

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Liangqun Lu ◽  
Kevin A. Townsend ◽  
Bernie J. Daigle

Abstract Background Differential expression and feature selection analyses are essential steps for the development of accurate diagnostic/prognostic classifiers of complicated human diseases using transcriptomics data. These steps are particularly challenging due to the curse of dimensionality and the presence of technical and biological noise. A promising strategy for overcoming these challenges is the incorporation of pre-existing transcriptomics data in the identification of differentially expressed (DE) genes. This approach has the potential to improve the quality of selected genes, increase classification performance, and enhance biological interpretability. While a number of methods have been developed that use pre-existing data for differential expression analysis, existing methods do not leverage the identities of experimental conditions to create a robust metric for identifying DE genes. Results In this study, we propose a novel differential expression and feature selection method—GEOlimma—which combines pre-existing microarray data from the Gene Expression Omnibus (GEO) with the widely-applied Limma method for differential expression analysis. We first quantify differential gene expression across 2481 pairwise comparisons from 602 curated GEO Datasets, and we convert differential expression frequencies to DE prior probabilities. Genes with high DE prior probabilities show enrichment in cell growth and death, signal transduction, and cancer-related biological pathways, while genes with low prior probabilities were enriched in sensory system pathways. We then applied GEOlimma to four differential expression comparisons within two human disease datasets and performed differential expression, feature selection, and supervised classification analyses. Our results suggest that use of GEOlimma provides greater experimental power to detect DE genes compared to Limma, due to its increased effective sample size. 
Furthermore, in a supervised classification analysis using GEOlimma as a feature selection method, we observed similar or better classification performance than Limma given small, noisy subsets of an asthma dataset. Conclusions Our results demonstrate that GEOlimma is a more effective method for differential gene expression and feature selection analyses compared to the standard Limma method. Due to its focus on gene-level differential expression, GEOlimma also has the potential to be applied to other high-throughput biological datasets.
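
The step from differential expression frequencies to DE prior probabilities can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pseudocount smoothing, the function names, the gene labels, and the counts are assumptions made for illustration. The only values taken from outside the sketch are the 2481 comparisons reported in the abstract and 0.01, the documented default for the `proportion` argument of limma's `eBayes`. The sketch assumes the method's output is a log posterior odds score in which a uniform prior can be exchanged for a gene-specific one.

```python
import math


def de_prior_probabilities(de_counts, n_comparisons, pseudocount=1.0):
    """Turn per-gene DE counts into smoothed prior probabilities.

    The pseudocount is an assumption; it keeps every prior strictly
    between 0 and 1 so that the logit below is always defined.
    """
    return {
        gene: (count + pseudocount) / (n_comparisons + 2.0 * pseudocount)
        for gene, count in de_counts.items()
    }


def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1.0 - p))


def adjust_log_odds(b_stat, gene_prior, default_prior=0.01):
    """Replace a uniform prior with a gene-specific prior in a log-odds score."""
    return b_stat - logit(default_prior) + logit(gene_prior)


# Hypothetical counts of comparisons in which each gene was called DE.
counts = {"GENE_A": 620, "GENE_B": 12}
priors = de_prior_probabilities(counts, n_comparisons=2481)

# With identical evidence (log odds of 0), the frequently DE gene ranks higher.
adjusted = {gene: adjust_log_odds(0.0, p) for gene, p in priors.items()}
```

Under these assumptions, two genes with the same evidence in the new dataset are separated only by how often they were differentially expressed in the pre-existing data.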

2019 ◽  
Author(s):  
Liangqun Lu ◽  
Kevin A. Townsend ◽  
Bernie J. Daigle

Abstract Background Differential expression and feature selection analyses are essential steps for the development of accurate diagnostic/prognostic classifiers of complicated human diseases using transcriptomics data. These steps are particularly challenging due to the curse of dimensionality and the presence of technical and biological noise. A promising strategy for overcoming these challenges is the incorporation of pre-existing transcriptomics data in the identification of differentially expressed (DE) genes. This approach has the potential to improve the quality of selected genes, increase classification performance, and enhance biological interpretability. While a number of methods have been developed that use pre-existing data for differential expression analysis, existing methods do not leverage the identities of experimental conditions to create a robust metric for identifying DE genes. Results In this study, we propose a novel differential expression and feature selection method, GEOlimma, which combines pre-existing microarray data from the Gene Expression Omnibus (GEO) with the widely-applied Limma method for differential expression analysis. We first quantify differential gene expression across 2481 pairwise comparisons from 602 curated GEO Datasets, and we convert differential expression frequencies to DE prior probabilities. Genes with high DE prior probabilities show enrichment in cell growth and death, signal transduction, and cancer-related biological pathways, while genes with low prior probabilities were enriched in sensory system pathways. We then applied GEOlimma to four differential expression comparisons within two human disease datasets and performed differential expression, feature selection, and supervised classification analyses. Our results suggest that use of GEOlimma provides greater experimental power to detect DE genes compared to Limma, due to its increased effective sample size.
Furthermore, in a supervised classification analysis using GEOlimma as a feature selection method, we observed similar or better classification performance than Limma given small, noisy subsets of an asthma dataset. Conclusions Our results demonstrate that GEOlimma is a more effective method for differential gene expression and feature selection analyses compared to the standard Limma method. Due to its focus on gene-level differential expression, GEOlimma also has the potential to be applied to other high-throughput biological datasets.


2014 ◽  
Vol 2014 ◽  
pp. 1-8 ◽  
Author(s):  
Yan Guo ◽  
Shilin Zhao ◽  
Fei Ye ◽  
Quanhu Sheng ◽  
Yu Shyr

Background. After a decade of microarray technology dominating the field of high-throughput gene expression profiling, the introduction of RNAseq has revolutionized gene expression research. While RNAseq provides more abundant information than microarray, its analysis has proved considerably more complicated. To date, no consensus has been reached on the best approach for RNAseq-based differential expression analysis. Not surprisingly, different studies have drawn different conclusions as to the best approach to identify differentially expressed genes based upon their own criteria and scenarios considered. Furthermore, the lack of effective quality control may lead to misleading results interpretation and erroneous conclusions. To solve these aforementioned problems, we propose a simple yet safe and practical rank-sum approach for RNAseq-based differential gene expression analysis named MultiRankSeq. MultiRankSeq first performs quality control assessment. For data meeting the quality control criteria, MultiRankSeq compares the study groups using several of the most commonly applied analytical methods and combines their results to generate a new rank-sum interpretation. MultiRankSeq provides a unique analysis approach to RNAseq differential expression analysis. MultiRankSeq is written in R, and it is easily applicable. Detailed graphical and tabular analysis reports can be generated with a single command line.
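
The rank-sum idea can be sketched in a few lines of Python. This is an illustration of combining per-method rankings in general, not MultiRankSeq's own code (the package itself is written in R). The method labels, gene labels, and p-values are hypothetical, and the sketch assumes every method reports a p-value for the same set of genes.

```python
def rank_ascending(scores):
    """Rank genes by p-value (1 = smallest); tied genes share the average rank."""
    ordered = sorted(scores, key=scores.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of genes tied with the gene at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for gene in ordered[i:j + 1]:
            ranks[gene] = average
        i = j + 1
    return ranks


def rank_sum(pvalues_by_method):
    """Sum each gene's rank across methods; a smaller total means stronger support."""
    totals = {}
    for pvalues in pvalues_by_method.values():
        for gene, rank in rank_ascending(pvalues).items():
            totals[gene] = totals.get(gene, 0.0) + rank
    return sorted(totals.items(), key=lambda item: item[1])


# Hypothetical p-values from three unnamed differential expression methods.
pvalues_by_method = {
    "method_1": {"g1": 0.001, "g2": 0.04, "g3": 0.20},
    "method_2": {"g1": 0.01, "g2": 0.005, "g3": 0.5},
    "method_3": {"g1": 0.002, "g2": 0.03, "g3": 0.03},
}
combined = rank_sum(pvalues_by_method)
# combined == [("g1", 4.0), ("g2", 5.5), ("g3", 8.5)]
```

Summing ranks rather than p-values keeps any single method's scale from dominating the combined ordering.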


2017 ◽  
Author(s):  
Henry Han

Abstract RNA-seq data challenge existing omics data analytics with their volume and complexity. Although quite a few computational models have been proposed from different standpoints to conduct differential expression (D.E.) analysis, almost none of these methods provides rigorous feature selection for high-dimensional RNA-seq count data. Instead, most or even all genes are included in differential calls regardless of whether they contribute to data variation. This inevitably affects the robustness of D.E. analysis and increases false positive ratios. In this study, we present a novel feature selection method, nonnegative singular value approximation (NSVA), to enhance RNA-seq differential expression analysis by taking advantage of the non-negativity of RNA-seq count data. As a variance-based feature selection method, it selects genes according to their contribution to the first singular value direction of the input data in a data-driven approach. It demonstrates robustness to depth bias and gene length bias in feature selection in comparison with five peer methods. Combined with state-of-the-art RNA-seq differential expression analysis, it enhances differential expression analysis by lowering false discovery rates caused by these biases. Furthermore, we demonstrate the effectiveness of the proposed feature selection by proposing a data-driven differential expression analysis, NSVA-seq, in addition to conducting network marker discovery.


2018 ◽  
Author(s):  
Marina Suhorutshenko ◽  
Viktorija Kukushkina ◽  
Agne Velthut-Meikas ◽  
Signe Altmäe ◽  
Maire Peters ◽  
...  

Abstract STUDY QUESTION: Does cellular composition of the endometrial biopsy affect the gene expression profile of endometrial whole-tissue samples? SUMMARY ANSWER: The differences in epithelial and stromal cell proportions in endometrial biopsies modify whole-tissue gene expression profiles, and also affect the results of differential expression analysis. WHAT IS ALREADY KNOWN: Each cell type has its unique gene expression profile. The proportions of epithelial and stromal cells vary in endometrial tissue during the menstrual cycle, along with individual and technical variation due to the way and tools used to obtain the tissue biopsy. STUDY DESIGN, SIZE, DURATION: Using cell-population specific transcriptome data and computational deconvolution approach, we estimated the epithelial and stromal cell proportions in whole-tissue biopsies taken during early secretory and mid-secretory phases. The estimated cellular proportions were used as covariates in whole-tissue differential gene expression analysis. Endometrial transcriptomes before and after deconvolution were compared and analysed in biological context. PARTICIPANTS/MATERIAL, SETTING, METHODS: Paired early- and mid-secretory endometrial biopsies were obtained from thirty-five healthy, regularly cycling, fertile volunteers, aged 23 to 36 years, and analysed by RNA sequencing. Differential gene expression analysis was performed using two approaches. In one of them, computational deconvolution was applied as an intermediate step to adjust for epithelial and stromal cells’ proportions in endometrial biopsy. The results were then compared to conventional differential expression analysis. MAIN RESULTS AND THE ROLE OF CHANCE: The estimated average proportions of stromal and epithelial cells in early secretory phase were 65% and 35%, and during mid-secretory phase 46% and 54%, respectively, that correlated well with the results of histological evaluation (r=0.88, p=1.1×10⁻⁶).
Endometrial tissue transcriptomic analysis showed that approximately 26% of transcripts (n=946) differentially expressed in receptive endometrium in cell-type unadjusted analysis also remain differentially expressed after adjustment for biopsy cellular composition. However, the other 74% (n=2,645) become statistically non-significant after adjustment for biopsy cellular composition, underlining the impact of tissue heterogeneity on differential expression analysis. The results suggest new mechanisms involved in endometrial maturation involving genes like LINC01320, SLC8A1 and GGTA1P, described for the first time in context of endometrial receptivity. LIMITATIONS, REASONS FOR CAUTION: Only dominant endometrial cell types were considered in gene expression profile deconvolution; however, other less frequent endometrial cell types also contribute to the whole-tissue gene expression profile. WIDER IMPLICATIONS OF THE FINDINGS: The better understanding of molecular processes during transition from pre-receptive to receptive endometrium serves to improve the effectiveness and personalization of assisted reproduction protocols. Biopsy cellular composition should be taken into account in future endometrial ‘omics’ studies, where tissue heterogeneity could potentially influence the results. TRIAL REGISTRATION NO: N/A


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Matthew Chung ◽  
Vincent M. Bruno ◽  
David A. Rasko ◽  
Christina A. Cuomo ◽  
José F. Muñoz ◽  
...  

Abstract Advances in transcriptome sequencing allow for simultaneous interrogation of differentially expressed genes from multiple species originating from a single RNA sample, termed dual or multi-species transcriptomics. Compared to single-species differential expression analysis, the design of multi-species differential expression experiments must account for the relative abundances of each organism of interest within the sample, often requiring enrichment methods and yielding differences in total read counts across samples. The analysis of multi-species transcriptomics datasets requires modifications to the alignment, quantification, and downstream analysis steps compared to the single-species analysis pipelines. We describe best practices for multi-species transcriptomics and differential gene expression.


2016 ◽  
Vol 36 (suppl_1) ◽  
Author(s):  
Elisa C Maruko ◽  
Hao Xu ◽  
Sushma Kaul ◽  
Brian J Capaldo ◽  
Nathalie Pamir ◽  
...  

Atherosclerosis is a disease of both lipids and inflammatory immune cells. More specifically, elevated plasma levels of low-density lipoproteins (LDL) lead to migration of circulating monocytes into the artery wall. Lipid-loaded monocytes subsequently proliferate in the arterial walls, becoming macrophage foam cells, a hallmark of atherosclerotic lesions. A proposed mechanism of the protective effects of high-density lipoprotein (HDL) is apolipoprotein A-I (apo A-I) acting as a mediator of cholesterol efflux and subsequent foam cell regression. To better understand the biological changes stimulated by apo A-I treatment, differential expression analysis of microarray data was performed on spleen cells from apo A-I treated mice. LDL receptor null (LDLr -/- ) and LDL receptor and apo A-I null (LDLr -/- , apoA-I -/- ) mice were fed a western diet consisting of 0.2% cholesterol and 42% of calories as fat for 12 weeks. After 6 weeks of diet, a subset of mice for each genotype was subcutaneously injected with 200 micrograms of apo A-I 3 times a week for the remaining 6 weeks. The control group mice were subcutaneously injected with 200 micrograms of saline or BSA. Spleen cell RNA was isolated, purified, and analyzed for differential expression using Illumina BeadArray Microarray Technology Analysis. Individual gene expression analysis for LDLr -/- , apoA-I -/- apo A-I treated mice showed 281 significantly differentially expressed genes compared to BSA treated mice. LDLr -/- A-I treated mice had 1502. Of the significant genes, 189 intersected across both genotypes. LDLr -/- , apoA-I -/- A-I mice showed 73 up-regulated and 116 down-regulated genes. Similarly, LDLr -/- A-I mice had 71 up-regulated and 118 down-regulated. One-directional Gene Set Enrichment Analysis (GSEA) of LDLr -/- , apoA-I -/- A-I mice revealed 49 significant pathways while a total of 63 were found for LDLr -/- .
Of these pathways, 21 were up-regulated and 13 were down-regulated in both genotypes. Eight of the top 10 most significant up-regulated pathways in both genotypes were immune cell related. Their functions involve receptor, adhesion, and chemokine signaling. Overall, preliminary analysis suggests A-I treatment induces similar gene expression changes across different genotypes.


2015 ◽  
Vol 9s3 ◽  
pp. BBI.S29470 ◽  
Author(s):  
Mikhail G. Dozmorov ◽  
Nicolas Dominguez ◽  
Krista Bean ◽  
Susan R. Macwana ◽  
Virginia Roberts ◽  
...  

Systemic lupus erythematosus (SLE) is an autoimmune disease characterized by complex interplay among immune cell types. SLE activity is experimentally assessed by several blood tests, including gene expression profiling of heterogeneous populations of cells in peripheral blood. To better understand the contribution of different cell types in SLE pathogenesis, we applied the two methods in cell-type-specific differential expression analysis, csSAM and DSection, to identify cell-type-specific gene expression differences in heterogeneous gene expression measures obtained using RNA-seq technology. We identified B-cell-, monocyte-, and neutrophil-specific gene expression differences. Immunoglobulin-coding gene expression was altered in B-cells, while a ribosomal signature was prominent in monocytes. On the contrary, genes differentially expressed in the heterogeneous mixture of cells did not show any functional enrichment. Our results identify antigen binding and structural constituents of ribosomes as functions altered by B-cell- and monocyte-specific gene expression differences, respectively. Finally, these results position both csSAM and DSection methods as viable techniques for cell-type-specific differential expression analysis, which may help uncover pathogenic, cell-type-specific processes in SLE.


F1000Research ◽  
2017 ◽  
Vol 6 ◽  
pp. 2010 ◽  
Author(s):  
Monther Alhamdoosh ◽  
Charity W. Law ◽  
Luyi Tian ◽  
Julie M. Sheridan ◽  
Milica Ng ◽  
...  

Gene set enrichment analysis is a popular approach for prioritising the biological processes perturbed in genomic datasets. The Bioconductor project hosts over 80 software packages capable of gene set analysis. Most of these packages search for enriched signatures amongst differentially regulated genes to reveal higher level biological themes that may be missed when focusing only on evidence from individual genes. With so many different methods on offer, choosing the best algorithm and visualization approach can be challenging. The EGSEA package solves this problem by combining results from up to 12 prominent gene set testing algorithms to obtain a consensus ranking of biologically relevant results. This workflow demonstrates how EGSEA can extend limma-based differential expression analyses for RNA-seq and microarray data using experiments that profile 3 distinct cell populations important for studying the origins of breast cancer. Following data normalization and set-up of an appropriate linear model for differential expression analysis, EGSEA builds gene signature specific indexes that link a wide range of mouse or human gene set collections obtained from MSigDB, GeneSetDB and KEGG to the gene expression data being investigated. EGSEA is then configured and the ensemble enrichment analysis run, returning an object that can be queried using several S4 methods for ranking gene sets and visualizing results via heatmaps, KEGG pathway views, GO graphs, scatter plots and bar plots. Finally, an HTML report that combines these displays can fast-track the sharing of results with collaborators, and thus expedite downstream biological validation. EGSEA is simple to use and can be easily integrated with existing gene expression analysis pipelines for both human and mouse data.


F1000Research ◽  
2020 ◽  
Vol 9 ◽  
pp. 1444
Author(s):  
Charity W. Law ◽  
Kathleen Zeglinski ◽  
Xueyi Dong ◽  
Monther Alhamdoosh ◽  
Gordon K. Smyth ◽  
...  

Differential expression analysis of genomic data types, such as RNA-sequencing experiments, uses linear models to determine the size and direction of the changes in gene expression. For RNA-sequencing, there are several established software packages for this purpose, accompanied by analysis pipelines that are well described. However, there are two crucial steps in the analysis process that can be a stumbling block for many: the set-up of an appropriate model via design matrices and the set-up of comparisons of interest via contrast matrices. These steps are particularly troublesome because an extensive catalogue of design and contrast matrices does not currently exist. One would usually search for example case studies across different platforms and mix and match the advice from those sources to suit the dataset at hand. This article guides the reader through the basics of how to set up design and contrast matrices. We take a practical approach by providing code and a graphical representation of each case study, starting with simpler examples (e.g. models with a single explanatory variable) and moving on to more complex ones (e.g. interaction models, mixed effects models, higher order time series and cyclical models). Although our work has been written specifically with a limma-style pipeline in mind, most of it is also applicable to other software packages for differential expression analysis, and the ideas covered can be adapted to data analysis of other high-throughput technologies. Where appropriate, we explain the interpretation and differences between models to aid readers in their own model choices. Unnecessary jargon and theory are omitted where possible so that our work is accessible to a wide audience of readers, from beginners to those with experience in genomics data analysis.
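
For the simplest case the abstract mentions, a model with a single explanatory variable, the roles of the design and contrast matrices can be sketched in plain Python. This is an illustration only and does not reproduce the article's own code: the group labels, expression values, and function names are made up. It assumes a means model (one indicator column per group, no intercept), for which the least-squares coefficients are simply the group means.

```python
def design_matrix(groups, levels):
    """One indicator column per group level (a means model, no intercept)."""
    return [[1.0 if g == level else 0.0 for level in levels] for g in groups]


def fit_means_model(design, expression):
    """Least-squares coefficients for an indicator design: the per-group means."""
    coefs = []
    for col in range(len(design[0])):
        values = [y for row, y in zip(design, expression) if row[col] == 1.0]
        coefs.append(sum(values) / len(values))
    return coefs


def apply_contrast(coefs, contrast):
    """Linear combination of coefficients defined by one contrast vector."""
    return sum(c * b for c, b in zip(contrast, coefs))


groups = ["control", "control", "control", "treated", "treated", "treated"]
levels = ["control", "treated"]
design = design_matrix(groups, levels)

# Made-up log2 expression values for a single gene, one per sample.
expression = [5.0, 5.5, 4.5, 7.0, 7.5, 6.5]
coefs = fit_means_model(design, expression)  # [5.0, 7.0]

# The contrast (-1, 1) asks for "treated minus control".
log_fold_change = apply_contrast(coefs, [-1.0, 1.0])  # 2.0
```

The design matrix fixes what each coefficient means, and the contrast then picks out the comparison of interest; more complex designs change the columns but not this division of labour.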

