The mouse Gene Expression Database (GXD): 2021 update

Abstract The Gene Expression Database (GXD; www.informatics.jax.org/expression.shtml) is an extensive and well-curated community resource of mouse developmental gene expression information. For many years, GXD has collected and integrated data from RNA in situ hybridization, immunohistochemistry, RT-PCR, northern blot, and western blot experiments through curation of the scientific literature and by collaborations with large-scale expression projects. Since our last report in 2019, we have continued to acquire these classical types of expression data; developed a searchable index of RNA-Seq and microarray experiments that allows users to quickly and reliably find specific mouse expression studies in ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) and GEO (https://www.ncbi.nlm.nih.gov/geo/); and expanded GXD to include RNA-Seq data. Uniformly processed RNA-Seq data are imported from the EBI Expression Atlas and then integrated with the other types of expression data in GXD, and with the genetic, functional, phenotypic and disease-related information in Mouse Genome Informatics (MGI). This integration has made the RNA-Seq data accessible via GXD’s enhanced searching and filtering capabilities. Further, we have embedded the Morpheus heat map utility into the GXD user interface to provide additional tools for display and analysis of RNA-Seq data, including heat map visualization, sorting, filtering, hierarchical clustering, nearest neighbors analysis and visual enrichment.


QUBIC2: a novel and robust biclustering algorithm for analyses and interpretation of large-scale RNA-Seq data

Bioinformatics ◽ 10.1093/bioinformatics/btz692 ◽ 2019 ◽ Vol 36 (4) ◽ pp. 1143-1149 ◽ Cited By ~ 9

Author(s): Juan Xie ◽ Anjun Ma ◽ Yu Zhang ◽ Bingqiang Liu ◽ Sha Cao ◽ ...

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Large Scale ◽ Gaussian Model ◽ Functional Gene ◽ Superior Performance ◽ Supplementary Information ◽ Expression Data ◽ RNA-Seq ◽ Gene Modules

Abstract Motivation: The biclustering of large-scale gene expression data holds promising potential for detecting condition-specific functional gene modules (i.e. biclusters). However, existing methods do not comprehensively detect all significant bicluster structures and have limited power when applied to expression data generated by RNA-Sequencing (RNA-Seq), especially single-cell RNA-Seq (scRNA-Seq) data, where massive numbers of zero and low expression values are observed. Results: We present a new biclustering algorithm, QUalitative BIClustering algorithm Version 2 (QUBIC2), which is empowered by: (i) a novel left-truncated mixture of Gaussian model for an accurate assessment of multimodality in zero-enriched expression data, (ii) a fast and efficient dropouts-saving expansion strategy for functional gene modules optimization using information divergency and (iii) a rigorous statistical test for the significance of all the identified biclusters in any organism, including those without substantial functional annotations. QUBIC2 demonstrated considerably improved performance in detecting biclusters compared to five other widely used algorithms on various benchmark datasets from E. coli, human and simulated data. QUBIC2 also showed robust and superior performance on gene expression data generated by microarray, bulk RNA-Seq and scRNA-Seq. Availability and implementation: The source code of QUBIC2 is freely available at https://github.com/OSU-BMBL/QUBIC2. Supplementary information: Supplementary data are available at Bioinformatics online.


LSTrAP-Kingdom: an automated pipeline to generate annotated gene expression atlases for kingdoms of life

Bioinformatics ◽ 10.1093/bioinformatics/btab168 ◽ 2021

Author(s): William Goh ◽ Marek Mutwil

Keyword(s): Gene Expression ◽ Large Scale ◽ Supplementary Information ◽ Expression Data ◽ Supplementary Data ◽ RNA-Seq ◽ Analysis Pipeline ◽ Study Gene Expression ◽ Automated Pipeline ◽ Bacteria And Fungi

Abstract Motivation: There are now more than two million RNA sequencing experiments for plants, animals, bacteria and fungi publicly available, allowing us to study gene expression within and across species and kingdoms. However, tools that can download, quality-control and annotate these data for more than one species at a time are currently missing. Results: To remedy this, we present the Large-Scale Transcriptomic Analysis Pipeline in Kingdom of Life (LSTrAP-Kingdom) pipeline, which we used to process 134,521 RNA-seq samples, achieving ∼12,000 processed samples per day. Our pipeline generated quality-controlled, annotated gene expression matrices that rival the manually curated gene expression data in identifying functionally-related genes. Availability: LSTrAP-Kingdom is available from https://github.com/wirriamm/plants-pipeline and is fully implemented in Python and Bash. Supplementary information: Supplementary data are available at Bioinformatics online.


An integrated approach in gene-expression landscape profiling to identify housekeeping and tissue-specific genes in cattle

Animal Production Science ◽ 10.1071/an20689 ◽ 2021 ◽ Vol 61 (16) ◽ pp. 1643

Author(s): Peng Li ◽ Yun Zhu ◽ Xiaolong Kang ◽ Xingang Dan ◽ Yun Ma ◽ ...

Keyword(s): Gene Expression ◽ Expression Analysis ◽ Large Scale ◽ Gene Expression Analysis ◽ Housekeeping Genes ◽ Analysis Tool ◽ Expression Data ◽ RNA-Seq ◽ Web Based ◽ Tissue Specific

Context: High-throughput transcriptome sequencing (RNA-Seq) has been widely applied in cattle studies. Public databases such as the National Center for Biotechnology Information (NCBI) contain large collections of gene expression data from various cattle tissues that can be used in gene expression analysis research. Aims: This study was conducted to investigate patterns of transcriptome variation across tissues of cattle through large-scale identification of housekeeping genes (i.e. those crucial to maintaining basic cellular activity) and tissue-specific genes in cattle tissues. Methods: Using data available in the NCBI Sequence Read Archive database, we analysed 1377 transcriptome data sequences from 60 bovine tissue types, identified tissue-specific and housekeeping genes, and set up a web-based bovine gene expression analysis tool. Key results: We found 101 genes widely expressed in almost all tissues and screened out five housekeeping genes: RPL35A, eIF4A2, GAPDH, IPO5 and PAK2. Focusing on 12 major organs, we found 861 genes specifically expressed in these tissues. Furthermore, 187 significantly differentially expressed genes were found among six types of muscle tissues. All expression data were made available at our new website http://cattleExp.org, which can be freely accessed for future gene expression analyses. Conclusions: The housekeeping genes and tissue-specific genes identified will provide more information for researchers studying gene expression in cattle. Implications: The web-based cattle gene expression analysis tool will make it easy for researchers to access large public datasets. Users can easily access all publicly available RNA data and upload their own RNA-Seq data.
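As a rough illustration of how broadly expressed and tissue-specific genes might be separated in a gene-by-tissue expression matrix, the sketch below combines an expression-breadth filter with the tau specificity index. The abstract does not state which criteria the authors applied, so the tau index, the cut-offs (`min_expr`, `breadth_cutoff`, `tau_cutoff`), the function names and the toy values are all assumptions made for this example.

```python
from typing import Dict, List


def expression_breadth(values: List[float], min_expr: float = 1.0) -> float:
    """Fraction of tissues in which the gene reaches at least min_expr."""
    return sum(v >= min_expr for v in values) / len(values)


def tau_index(values: List[float]) -> float:
    """Tau specificity index: 0 for uniform expression, 1 for a single tissue."""
    peak = max(values)
    if peak == 0 or len(values) < 2:
        return 0.0
    return sum(1 - v / peak for v in values) / (len(values) - 1)


def classify(matrix: Dict[str, List[float]],
             min_expr: float = 1.0,
             breadth_cutoff: float = 0.95,
             tau_cutoff: float = 0.85) -> Dict[str, str]:
    """Label each gene using hypothetical breadth and tau cut-offs."""
    labels = {}
    for gene, values in matrix.items():
        tau = tau_index(values)
        if expression_breadth(values, min_expr) >= breadth_cutoff and tau < tau_cutoff:
            labels[gene] = "broadly expressed"
        elif tau >= tau_cutoff:
            labels[gene] = "tissue-specific"
        else:
            labels[gene] = "intermediate"
    return labels


# Toy matrix: one row per gene, one column per tissue (invented values).
toy = {
    "GENE_A": [50.0, 48.0, 52.0, 47.0],  # similar level everywhere
    "GENE_B": [0.0, 0.0, 90.0, 0.5],     # dominated by one tissue
    "GENE_C": [5.0, 0.2, 7.0, 0.1],      # expressed in half of the tissues
}
labels = classify(toy)
print(labels)
```

A broadly expressed gene is only a housekeeping candidate; a further screen, for example on expression stability across samples, would be needed before treating it as a reference gene.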


LSTrAP-Kingdom: an automated pipeline to generate annotated gene expression atlases for kingdoms of life

10.1101/2021.01.23.427930 ◽ 2021

Author(s): William Goh ◽ Marek Mutwil

Keyword(s): Gene Expression ◽ Quality Control ◽ Gene Expression Data ◽ Large Scale ◽ Expression Data ◽ RNA-Seq ◽ Analysis Pipeline ◽ Study Gene Expression ◽ Automated Pipeline ◽ Bacteria And Fungi

Abstract Summary: There are now more than two million RNA sequencing experiments for plants, animals, bacteria and fungi publicly available, allowing us to study gene expression within and across species and kingdoms. However, the tools allowing the download, quality control and annotation of this data for more than one species at a time are currently missing. To remedy this, we present the Large-Scale Transcriptomic Analysis Pipeline in Kingdom of Life (LSTrAP-Kingdom) pipeline, which we used to process 134,521 RNA-seq samples, achieving ~12,000 processed samples per day. Our pipeline generated quality-controlled, annotated gene expression matrices that rival the manually curated gene expression data in identifying functionally-related genes. Availability and implementation: LSTrAP-Kingdom is available from https://github.com/wirriamm/plants-pipeline and is fully implemented in Python and Bash.


Graph Convolutional Network for Drug Response Prediction Using Gene Expression Data

Mathematics ◽ 10.3390/math9070772 ◽ 2021 ◽ Vol 9 (7) ◽ pp. 772

Author(s): Seonghun Kim ◽ Seockhun Bae ◽ Yinhua Piao ◽ Kyuri Jo

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Large Scale ◽ Drug Response ◽ Response Prediction ◽ Biological Data ◽ Expression Data ◽ Convolutional Network ◽ Essential Information ◽ Protein Protein Interaction

Genomic profiles of cancer patients such as gene expression have become a major source to predict responses to drugs in the era of personalized medicine. As large-scale drug screening data with cancer cell lines are available, a number of computational methods have been developed for drug response prediction. However, few methods incorporate both gene expression data and the biological network, which can harbor essential information about the underlying process of the drug response. We proposed an analysis framework called DrugGCN for prediction of Drug response using a Graph Convolutional Network (GCN). DrugGCN first generates a gene graph by combining a Protein-Protein Interaction (PPI) network and gene expression data with feature selection of drug-related genes, and the GCN model detects the local features such as subnetworks of genes that contribute to the drug response by localized filtering. We demonstrated the effectiveness of DrugGCN using biological data showing its high prediction accuracy among the competing methods.
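To make the idea of localized filtering on a gene graph more concrete, here is a minimal sketch of a single graph-convolution step. It uses the common first-order propagation rule (symmetrically normalized adjacency with self-loops, followed by a linear map and ReLU), which is a simplification chosen for this example and not necessarily the filter used in DrugGCN. The three-gene graph, the feature values and the weights are invented.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Add self-loops, then scale each entry by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One step: mix each gene with its graph neighbours, transform, apply ReLU."""
    propagated = matmul(normalized_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in transformed]


# Hypothetical path graph gene0 - gene1 - gene2 (e.g. two PPI edges).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]  # only gene0 carries signal
weights = [[1.0]]                 # identity-like weight, not learned here

hidden = gcn_layer(adj, features, weights)
print(hidden)
```

After one step the signal from gene0 reaches its direct neighbour gene1 but not gene2, which is two edges away; this is the sense in which a single layer acts as a local filter on the graph.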


GENE DISCOVERY METHODS FROM LARGE-SCALE GENE EXPRESSION DATA

Quantum Bio-Informatics III ◽ 10.1142/9789814304061_0040 ◽ 2010

Author(s): AKIFUMI SHIMIZU ◽ KENTARO YANO

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Large Scale ◽ Gene Discovery ◽ Expression Data


Comparing RNA-Seq and microarray gene expression data in two zones of the Arabidopsis root apex relevant to spaceflight

Applications in Plant Sciences ◽ 10.1002/aps3.1197 ◽ 2018 ◽ Vol 6 (11) ◽ pp. e01197 ◽ Cited By ~ 3

Author(s): Aparna Krishnamurthy ◽ Robert J. Ferl ◽ Anna-Lisa Paul

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Root Apex ◽ Microarray Gene Expression Data ◽ Expression Data ◽ RNA-Seq ◽ Microarray Gene Expression ◽ Arabidopsis Root ◽ Microarray Gene


Gene Expression Imputation with Generative Adversarial Imputation Nets

10.1101/2020.06.09.141689 ◽ 2020

Author(s): Ramon Viñas ◽ Tiago Azevedo ◽ Eric R. Gamazon ◽ Pietro Liò

Keyword(s): Gene Expression ◽ Large Scale ◽ Biological Significance ◽ Predictive Performance ◽ Cost Effective ◽ RNA-Seq ◽ Comprehensive Collection ◽ Genomic Studies ◽ Biological Discovery ◽ Cancer Types

Abstract A question of fundamental biological significance is to what extent the expression of a subset of genes can be used to recover the full transcriptome, with important implications for biological discovery and clinical application. To address this challenge, we present GAIN-GTEx, a method for gene expression imputation based on Generative Adversarial Imputation Networks. In order to increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We compare our model to several standard and state-of-the-art imputation methods and show that GAIN-GTEx is significantly superior in terms of predictive performance and runtime. Furthermore, our results indicate strong generalisation on RNA-Seq data from 3 cancer types across varying levels of missingness. Our work can facilitate a cost-effective integration of large-scale RNA biorepositories into genomic studies of disease, with high applicability across diverse tissue types.


LSTrAP-Crowd: Prediction of novel components of bacterial ribosomes with crowd-sourced analysis of RNA sequencing data

10.1101/2020.04.20.005249 ◽ 2020

Author(s): Benedict Hew ◽ Qiao Wen Tan ◽ William Goh ◽ Jonathan Wei Xiong Ng ◽ Kenny Koh ◽ ...

Keyword(s): Gene Expression ◽ Protein Synthesis ◽ RNA Sequencing ◽ Gene Expression Data ◽ Large Scale ◽ Bacterial Resistance ◽ Expression Data ◽ Sequencing Data ◽ Novel Proteins ◽ Novel Antibiotics

Abstract Bacterial resistance to antibiotics is a growing problem that is projected to cause more deaths than cancer in 2050. Consequently, novel antibiotics are urgently needed. Since more than half of the available antibiotics target the bacterial ribosomes, proteins that are involved in protein synthesis are prime targets for the development of novel antibiotics. However, experimental identification of these potential antibiotic target proteins can be labor-intensive and challenging, as these proteins are likely to be poorly characterized and specific to few bacteria. In order to identify these novel proteins, we established a Large-Scale Transcriptomic Analysis Pipeline in Crowd (LSTrAP-Crowd), where 285 individuals processed 26 terabytes of RNA-sequencing data of the 17 most notorious bacterial pathogens. In total, the crowd processed 26,269 RNA-seq experiments and used the data to construct gene co-expression networks, which were used to identify more than a hundred uncharacterized genes that were transcriptionally associated with protein synthesis. We provide the identity of these genes together with the processed gene expression data. The data can be used to identify other vulnerabilities of bacteria, while our approach demonstrates how the processing of gene expression data can be easily crowdsourced.
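The co-expression step described above can be sketched as follows: genes whose expression profiles correlate strongly across experiments are linked, and uncharacterized genes linked to known protein-synthesis genes become candidates. The abstract does not specify the similarity measure or threshold, so Pearson correlation, the `min_r` cut-off, the gene names and the expression values below are all assumptions for illustration.

```python
import math
from itertools import combinations
from typing import Dict, List, Set, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sx * sy)


def coexpression_edges(expr: Dict[str, List[float]],
                       min_r: float = 0.8) -> List[Tuple[str, str, float]]:
    """Link every gene pair whose correlation is at least min_r."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if r >= min_r:
            edges.append((g1, g2, r))
    return edges


def candidate_genes(edges: List[Tuple[str, str, float]],
                    bait_genes: Set[str]) -> Set[str]:
    """Non-bait genes that share an edge with at least one bait gene."""
    found = set()
    for g1, g2, _ in edges:
        if g1 in bait_genes and g2 not in bait_genes:
            found.add(g2)
        elif g2 in bait_genes and g1 not in bait_genes:
            found.add(g1)
    return found


# Toy profiles over four experiments (hypothetical genes and values).
expr = {
    "ribo_known_1": [1.0, 2.0, 3.0, 4.0],
    "ribo_known_2": [2.0, 4.0, 6.0, 8.0],
    "uncharacterized_1": [1.1, 2.1, 2.9, 4.2],
    "unrelated_1": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(expr)
candidates = candidate_genes(edges, {"ribo_known_1", "ribo_known_2"})
print(candidates)
```

On real data the pairwise loop grows quadratically with the number of genes, so a vectorized correlation matrix would be the more practical choice for full bacterial genomes.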


SOMDE: A scalable method for identifying spatially variable genes with self-organizing map

10.1101/2020.12.10.419549 ◽ 2020

Author(s): Minsheng Hao ◽ Kui Hua ◽ Xuegong Zhang

Keyword(s): Gene Expression ◽ Large Scale ◽ Expression Patterns ◽ Self Organizing Map ◽ Expression Data ◽ Spatial Expression ◽ Variable Expression ◽ Sequencing Technologies ◽ Physical Context ◽ Variable Genes

Abstract Recent developments of spatial transcriptomic sequencing technologies provide powerful tools for understanding cells in the physical context of tissue micro-environments. A fundamental task in spatial gene expression analysis is to identify genes with spatially variable expression patterns, or spatially variable genes (SVgenes). Several computational methods have been developed for this task, but their high computational complexity limits their scalability to the latest and future large-scale spatial expression data. We present SOMDE, an efficient method for identifying SVgenes in large-scale spatial expression data. SOMDE uses a self-organizing map (SOM) to cluster neighboring cells into nodes, and then uses a Gaussian Process to fit the node-level spatial gene expression to identify SVgenes. Experiments show that SOMDE is about 5-50 times faster than existing methods with comparable results. The adjustable resolution of SOMDE makes it the only method that can give results in ~5 minutes on large datasets of more than 20,000 sequencing sites. SOMDE is available as a Python package on PyPI at https://pypi.org/project/somde.
