ECFS-DEA: an ensemble classifier-based feature selection for differential expression analysis on expression profiles

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Xudong Zhao ◽  
Qing Jiao ◽  
Hangyu Li ◽  
Yiming Wu ◽  
Hanxu Wang ◽  
...  
2017 ◽  
Author(s):  
Henry Han

Abstract RNA-seq data challenge existing omics data analytics because of their volume and complexity. Although quite a few computational models have been proposed from different standpoints to conduct differential expression (D.E.) analysis, almost none of these methods provides rigorous feature selection for high-dimensional RNA-seq count data. Instead, most or even all genes enter the differential calls whether or not they make a real contribution to data variation. This inevitably affects the robustness of D.E. analysis and increases false positive ratios. In this study, we present a novel feature selection method, nonnegative singular value approximation (NSVA), which enhances RNA-seq differential expression analysis by taking advantage of the non-negativity of RNA-seq count data. As a variance-based feature selection method, it selects genes according to their contribution to the first singular value direction of the input data in a data-driven approach. It demonstrates robustness to depth bias and gene length bias in feature selection in comparison with five peer methods. Combined with state-of-the-art RNA-seq differential expression analysis, it enhances differential expression analysis by lowering the false discovery rates caused by these biases. Furthermore, we demonstrate the effectiveness of the proposed feature selection by proposing a data-driven differential expression analysis, NSVA-seq, and by conducting network marker discovery.
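The idea of ranking genes by their contribution to the first singular value direction can be illustrated with a minimal sketch. This is not the authors' NSVA implementation: the function names, the genes-by-samples layout, the use of power iteration, and the choice to score each gene by its entry in the first left singular vector are all assumptions made for illustration. Because the count matrix is non-negative and the iteration starts from a positive vector, every entry of the resulting vector is non-negative, so the entries can be compared directly as scores.

```python
import math


def first_left_singular_vector(X, iters=200, tol=1e-12):
    """Approximate the first left singular vector of X by power iteration.

    X is a genes-by-samples matrix given as a list of rows (assumed layout).
    """
    n_genes = len(X)
    n_samples = len(X[0])
    # Start from a positive unit vector so non-negativity is preserved.
    u = [1.0 / math.sqrt(n_genes)] * n_genes
    for _ in range(iters):
        # v = X^T u (one value per sample)
        v = [sum(X[g][s] * u[g] for g in range(n_genes)) for s in range(n_samples)]
        # w = X v (one value per gene), i.e. one step of power iteration on X X^T
        w = [sum(X[g][s] * v[s] for s in range(n_samples)) for g in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return u  # all-zero matrix: nothing to rank
        w = [x / norm for x in w]
        delta = max(abs(a - b) for a, b in zip(u, w))
        u = w
        if delta < tol:
            break
    return u


def top_genes(X, names, k):
    """Return the k gene names with the largest loading on the first direction."""
    u = first_left_singular_vector(X)
    ranked = sorted(zip(names, u), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical toy counts: 4 genes (rows) across 4 samples (columns).
counts = [
    [100, 120, 90, 110],
    [5, 4, 6, 5],
    [50, 10, 60, 8],
    [0, 1, 0, 2],
]
print(top_genes(counts, ["A", "B", "C", "D"], 2))
```

On uncentered counts this score largely tracks overall magnitude, so a real analysis would need the normalization and bias handling that the abstract attributes to NSVA.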


2015 ◽  
Vol 11 (5) ◽  
pp. 1235-1240 ◽  
Author(s):  
Xi Wang ◽  
Erin J. Gardiner ◽  
Murray J. Cairns

Reference gene-based normalization of expression profiles secures consistent differential expression analysis between samples of different phenotypes or biological conditions, and facilitates comparison between experimental batches.


2019 ◽  
Author(s):  
Liangqun Lu ◽  
Kevin A. Townsend ◽  
Bernie J. Daigle

Abstract Background Differential expression and feature selection analyses are essential steps for the development of accurate diagnostic/prognostic classifiers of complicated human diseases using transcriptomics data. These steps are particularly challenging due to the curse of dimensionality and the presence of technical and biological noise. A promising strategy for overcoming these challenges is the incorporation of pre-existing transcriptomics data in the identification of differentially expressed (DE) genes. This approach has the potential to improve the quality of selected genes, increase classification performance, and enhance biological interpretability. While a number of methods have been developed that use pre-existing data for differential expression analysis, existing methods do not leverage the identities of experimental conditions to create a robust metric for identifying DE genes. Results In this study, we propose a novel differential expression and feature selection method, GEOlimma, which combines pre-existing microarray data from the Gene Expression Omnibus (GEO) with the widely-applied Limma method for differential expression analysis. We first quantify differential gene expression across 2481 pairwise comparisons from 602 curated GEO Datasets, and we convert differential expression frequencies to DE prior probabilities. Genes with high DE prior probabilities show enrichment in cell growth and death, signal transduction, and cancer-related biological pathways, while genes with low prior probabilities were enriched in sensory system pathways. We then applied GEOlimma to four differential expression comparisons within two human disease datasets and performed differential expression, feature selection, and supervised classification analyses. Our results suggest that use of GEOlimma provides greater experimental power to detect DE genes compared to Limma, due to its increased effective sample size. Furthermore, in a supervised classification analysis using GEOlimma as a feature selection method, we observed similar or better classification performance than Limma given small, noisy subsets of an asthma dataset. Conclusions Our results demonstrate that GEOlimma is a more effective method for differential gene expression and feature selection analyses compared to the standard Limma method. Due to its focus on gene-level differential expression, GEOlimma also has the potential to be applied to other high-throughput biological datasets.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Liangqun Lu ◽  
Kevin A. Townsend ◽  
Bernie J. Daigle

Abstract Background Differential expression and feature selection analyses are essential steps for the development of accurate diagnostic/prognostic classifiers of complicated human diseases using transcriptomics data. These steps are particularly challenging due to the curse of dimensionality and the presence of technical and biological noise. A promising strategy for overcoming these challenges is the incorporation of pre-existing transcriptomics data in the identification of differentially expressed (DE) genes. This approach has the potential to improve the quality of selected genes, increase classification performance, and enhance biological interpretability. While a number of methods have been developed that use pre-existing data for differential expression analysis, existing methods do not leverage the identities of experimental conditions to create a robust metric for identifying DE genes. Results In this study, we propose a novel differential expression and feature selection method—GEOlimma—which combines pre-existing microarray data from the Gene Expression Omnibus (GEO) with the widely-applied Limma method for differential expression analysis. We first quantify differential gene expression across 2481 pairwise comparisons from 602 curated GEO Datasets, and we convert differential expression frequencies to DE prior probabilities. Genes with high DE prior probabilities show enrichment in cell growth and death, signal transduction, and cancer-related biological pathways, while genes with low prior probabilities were enriched in sensory system pathways. We then applied GEOlimma to four differential expression comparisons within two human disease datasets and performed differential expression, feature selection, and supervised classification analyses. Our results suggest that use of GEOlimma provides greater experimental power to detect DE genes compared to Limma, due to its increased effective sample size. 
Furthermore, in a supervised classification analysis using GEOlimma as a feature selection method, we observed similar or better classification performance than Limma given small, noisy subsets of an asthma dataset. Conclusions Our results demonstrate that GEOlimma is a more effective method for differential gene expression and feature selection analyses compared to the standard Limma method. Due to its focus on gene-level differential expression, GEOlimma also has the potential to be applied to other high-throughput biological datasets.
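The step of converting differential expression frequencies into DE prior probabilities can be sketched roughly as follows. This is an illustration under stated assumptions, not the GEOlimma code: the function names, the pseudocount smoothing, and the way the prior is combined with a per-gene log Bayes factor are hypothetical choices, and the abstract does not specify how the priors enter the Limma statistics.

```python
import math


def de_prior_probabilities(de_calls, pseudocount=1.0):
    """Turn per-gene DE call histories into smoothed prior probabilities.

    de_calls maps a gene name to a list of 0/1 flags, one per past
    comparison (1 = called DE). The pseudocount keeps priors strictly
    between 0 and 1; it is an assumption, not taken from the abstract.
    """
    priors = {}
    for gene, flags in de_calls.items():
        n = len(flags)
        k = sum(flags)
        priors[gene] = (k + pseudocount) / (n + 2.0 * pseudocount)
    return priors


def posterior_log_odds(log_bayes_factor, prior):
    """Combine evidence from the current data with a gene-specific prior.

    Posterior log-odds = log Bayes factor + prior log-odds (Bayes' rule).
    """
    return log_bayes_factor + math.log(prior / (1.0 - prior))


# Hypothetical call histories across four past comparisons.
history = {"g1": [1, 1, 1, 0], "g2": [0, 0, 0, 0]}
priors = de_prior_probabilities(history)
for gene in history:
    # Same evidence (log Bayes factor 0.5) gives different posterior odds.
    print(gene, round(priors[gene], 3), round(posterior_log_odds(0.5, priors[gene]), 3))
```

With identical evidence from the current experiment, the gene that was frequently DE in past comparisons ends up with higher posterior odds, which is the intuition behind using pre-existing data as a prior.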


2021 ◽  
Author(s):  
Eloi Schmauch ◽  
Pia Laitinen ◽  
Tiia A Turunen ◽  
Mari-Anna Vaananen ◽  
Tarja Malm ◽  
...  

MicroRNAs (miRNAs) are small RNA molecules that act as regulators of gene expression through targeted mRNA degradation. They are involved in many biological and pathophysiological processes and are widely studied as potential biomarkers and therapeutic agents for human diseases, including cardiovascular disorders. Recently discovered isoforms of miRNAs (isomiRs) exist in high quantities and are very diverse. Despite differing only slightly from their corresponding reference miRNAs, they display specific functions and expression profiles across tissues and conditions. However, they are still overlooked and understudied, as we lack a comprehensive view of their condition-specific regulation and impact on differential expression analysis. Here, we show that isomiRs can have major effects on differential expression analysis results, as their expression is independent of their host miRNA genes or reference sequences. We present two miRNA-seq datasets from human umbilical vein endothelial cells, and assess isomiR expression in response to senescence and compartment-specificity (nuclear/cytosolic) under hypoxia. We compare three different methods for miRNA analysis, including isomiR-specific analysis, and show that ignoring isomiRs induces major biases in differential expression. Moreover, isomiR analysis permits higher resolution of complex signal dissection, such as the impact of hypoxia on compartment localization, and differential isomiR type enrichments between compartments. Finally, we show important distribution differences across conditions, independently of global miRNA expression signals. Our results raise concerns over the quasi-exclusive use of miRNA reference sequences in miRNA-seq processing and experimental assays. We hope that our work will guide future isomiR expression studies, which will correct some biases introduced by gold-standard analysis, improving the resolution of such assays and the biological significance of their downstream studies.


Blood ◽  
2016 ◽  
Vol 128 (22) ◽  
pp. 3922-3922
Author(s):  
Moosa Qureshi ◽  
Wajid Jawaid ◽  
Fernando J Calero-Nieto ◽  
Rebecca Hannah ◽  
Sarah J Kinston ◽  
...  

Abstract Background C/EBPα plays a pivotal role in myeloid differentiation at the CMP to GMP transition point, where it interacts with other transcription factors (TFs) implicated in haematopoiesis. CEBPA mutations are common in acute myeloid leukaemia (AML), predominantly in patients with M1 and M2 French-American-British (FAB) morphological classifications, but relatively little is understood about the pre-leukaemic alterations caused by mutated CEBPA. Murine models have established N321D as a particularly potent CEBPA mutation which causes AML with high mortality (Togami et al, Experimental Hematology, 2015). We aimed to develop an inducible expression system for CEBPA N321D in a cellular model which replicates early haematopoietic progenitors, to study the effects of this mutation on gene expression profiles relevant for malignant haematopoiesis. Methods We constructed a Piggy-bac Tet-on inducible expression system which has a 2A peptide mechanism enabling simultaneous expression of both N321D and mCherry fluorescent protein from the same transcript (Fig. 1). We also constructed a control with inducible expression of mCherry. These two plasmids were then transfected into the mouse progenitor cell line Hoxb8-FL (Redecke et al, Nature Methods, 2013), which is conditionally immortalized and models multipotent myelo-lymphoid progenitors. Single cell clones were established and selected for analysis on the basis of cell growth and mCherry fluorescence on induction. RNA was collected post-induction and without induction at 24, 48 and 72 hours in two replicates each from the N321D clone and from the empty control vector. RNA-seq data were aligned to the mouse genome using STAR aligner and processed to generate high throughput sequencing counts, and finally differential expression analysis was performed between N321D and the control. Results Differential expression analysis identified 172 downregulated and 60 upregulated genes after N321D induction. Further analysis of the 172 downregulated genes against online published datasets of gene expression (Gene Expression Commons, https://gexc.stanford.edu) revealed that 19 of these genes are normally upregulated at the CMP to GMP transition. These include genes such as Hck, Met, Hdac8 and Kdm7a, which have been previously implicated in haematological malignancy and which may provide novel insights into the leukaemic process fostered by the CEBPA N321D mutation. To further validate our data, we performed unsupervised hierarchical clustering of previously published microarray data from a large collection of over 400 AML expression profiles (Verhaak et al, Haematologica, 2009) using the genes identified in our study, and found that patient samples with predominantly M1 and M2 FAB classifications clustered together (Fig. 2A,B), as would be expected in CEBPA-mutated AML. Conclusions Our inducible expression system has the potential to provide novel insights into altered gene expression caused by induction of mutated CEBPA. In particular, our cellular model replicates an early stage of haematopoiesis, and implicates genes which were not previously known to interact with CEBPA. The importance of these genes in CEBPA N321D-mediated re-configuration of the myeloid transcriptional regulatory network requires further analysis. Disclosures No relevant conflicts of interest to declare.


2020 ◽  
Author(s):  
Diana Lobo ◽  
Raquel Godinho ◽  
John Archer

Abstract Background In recent decades, the evolution of RNA-Seq has yielded archived datasets that possess the potential for providing unprecedented inter-study insight into transcriptome evolution, once background noise has been reduced. Here we present a method to quantify intra-condition variation and to remove reference-based transcripts associated with highly variable read counts, prior to differential expression analysis. The method utilizes variation within pairwise distances between normalized read counts for each transcript across all included samples of a given condition. As a case study, we demonstrate our approach at an inter- and intra-study level using RNA-seq data from brain samples of dogs, wolves, and two strains of fox (aggressive and tame) prior to performing differential expression analysis to identify common genes associated with tame behaviour. Results By applying our method, the distribution of the gene-wise dispersion estimates improved and the number of outliers detected in differential expression analysis decreased. Several genes that initially were differentially expressed in the non-filtered datasets were removed due to high intra-condition variation. Additionally, by optimizing the detection of differentially expressed transcripts, the overall number increased between dogs vs wolves and tame vs aggressive foxes when compared to the non-filtered datasets. Using these filtered sets, we found common overexpressed genes in dogs and tame foxes, including those involved in brain development, neurotransmission and immunity, factors known to be involved in domestication. Conclusions We presented a method to quantify and remove intra-condition variation from RNA-seq count data and demonstrated its use in improving the distribution of gene-wise dispersion estimates and, ultimately, reducing the number of false positives in differential gene expression analysis. We provide the method as a freely available tool, to aid studies using RNA-seq in calculating and characterizing the variation present within data prior to performing differential expression analysis. Additionally, we identify candidate genes involved in selection for tameness, which seems to have played a crucial role in canine domestication.
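A filter based on variation within pairwise distances between normalized read counts might look like the sketch below. It is not the authors' tool: the function names, the specific statistic (standard deviation of the pairwise absolute differences divided by the mean count), the threshold value, and the data layout are all assumptions chosen to make the idea concrete.

```python
from itertools import combinations
from statistics import mean, pstdev


def pairwise_distance_spread(values):
    """Spread of pairwise distances among one transcript's normalized counts.

    Returns the population standard deviation of all pairwise absolute
    differences, scaled by the mean count (assumed statistic).
    """
    if len(values) < 2:
        return 0.0  # no pairs to compare
    centre = mean(values)
    if centre == 0:
        return 0.0  # transcript not expressed in this condition
    distances = [abs(a - b) for a, b in combinations(values, 2)]
    return pstdev(distances) / centre


def filter_transcripts(norm_counts, threshold):
    """Keep transcripts whose spread is at most `threshold` in every condition.

    norm_counts maps transcript -> condition -> list of normalized counts.
    """
    kept = {}
    for transcript, by_condition in norm_counts.items():
        if all(pairwise_distance_spread(v) <= threshold for v in by_condition.values()):
            kept[transcript] = by_condition
    return kept


# Hypothetical normalized counts for two transcripts in two conditions.
data = {
    "tx_stable": {"tame": [10, 11, 9], "aggressive": [20, 21, 19]},
    "tx_noisy": {"tame": [10, 10, 100], "aggressive": [20, 21, 19]},
}
print(sorted(filter_transcripts(data, threshold=0.5)))
```

Filtering per condition, rather than across all samples, means that a genuine difference between conditions does not count against a transcript; only inconsistency among replicates of the same condition does.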

