scholarly journals LSTrAP-Kingdom: an automated pipeline to generate annotated gene expression atlases for kingdoms of life

Author(s):  
William Goh1 ◽  
Marek Mutwil1

Abstract Motivation There are now more than two million RNA sequencing experiments for plants, animals, bacteria and fungi publicly available, allowing us to study gene expression within and across species and kingdoms. However, the tools allowing the download, quality control and annotation of this data for more than one species at a time are currently missing. Results To remedy this, we present the Large-Scale Transcriptomic Analysis Pipeline in Kingdom of Life (LSTrAP-Kingdom) pipeline, which we used to process 134,521 RNA-seq samples, achieving ∼12,000 processed samples per day. Our pipeline generated quality-controlled, annotated gene expression matrices that rival the manually curated gene expression data in identifying functionally-related genes. Availability LSTrAP-Kingdom is available from: https://github.com/wirriamm/plants-pipeline and is fully implemented in Python and Bash. Supplementary information Supplementary data are available at Bioinformatics online.

2021 ◽  
Author(s):  
William Goh ◽  
Marek Mutwil

AbstractSummaryThere are now more than two million RNA sequencing experiments for plants, animals, bacteria and fungi publicly available, allowing us to study gene expression within and across species and kingdoms. However, the tools allowing the download, quality control and annotation of this data for more than one species at a time are currently missing. To remedy this, we present the Large-Scale Transcriptomic Analysis Pipeline in Kingdom of Life (LSTrAP-Kingdom) pipeline, which we used to process 134,521 RNA-seq samples, achieving ~12,000 processed samples per day. Our pipeline generated quality-controlled, annotated gene expression matrices that rival the manually curated gene expression data in identifying functionally-related genes.Availability and implementationLSTrAP-Kingdom is available from: https://github.com/wirriamm/plants-pipeline and is fully implemented in Python and Bash.


2019 ◽  
Vol 36 (4) ◽  
pp. 1143-1149 ◽  
Author(s):  
Juan Xie ◽  
Anjun Ma ◽  
Yu Zhang ◽  
Bingqiang Liu ◽  
Sha Cao ◽  
...  

Abstract Motivation The biclustering of large-scale gene expression data holds promising potential for detecting condition-specific functional gene modules (i.e. biclusters). However, existing methods do not adequately address a comprehensive detection of all significant bicluster structures and have limited power when applied to expression data generated by RNA-Sequencing (RNA-Seq), especially single-cell RNA-Seq (scRNA-Seq) data, where massive zero and low expression values are observed. Results We present a new biclustering algorithm, QUalitative BIClustering algorithm Version 2 (QUBIC2), which is empowered by: (i) a novel left-truncated mixture of Gaussian model for an accurate assessment of multimodality in zero-enriched expression data, (ii) a fast and efficient dropouts-saving expansion strategy for functional gene modules optimization using information divergency and (iii) a rigorous statistical test for the significance of all the identified biclusters in any organism, including those without substantial functional annotations. QUBIC2 demonstrated considerably improved performance in detecting biclusters compared to other five widely used algorithms on various benchmark datasets from E.coli, Human and simulated data. QUBIC2 also showcased robust and superior performance on gene expression data generated by microarray, bulk RNA-Seq and scRNA-Seq. Availability and implementation The source code of QUBIC2 is freely available at https://github.com/OSU-BMBL/QUBIC2. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Cynthia Z Ma ◽  
Michael R Brent

Abstract Motivation The activity of a transcription factor (TF) in a sample of cells is the extent to which it is exerting its regulatory potential. Many methods of inferring TF activity from gene expression data have been described, but due to the lack of appropriate large-scale datasets, systematic and objective validation has not been possible until now. Results We systematically evaluate and optimize the approach to TF activity inference in which a gene expression matrix is factored into a condition-independent matrix of control strengths and a condition-dependent matrix of TF activity levels. We find that expression data in which the activities of individual TFs have been perturbed are both necessary and sufficient for obtaining good performance. To a considerable extent, control strengths inferred using expression data from one growth condition carry over to other conditions, so the control strength matrices derived here can be used by others. Finally, we apply these methods to gain insight into the upstream factors that regulate the activities of yeast TFs Gcr2, Gln3, Gcn4 and Msn2. Availability and implementation Evaluation code and data are available at https://doi.org/10.5281/zenodo.4050573. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (12) ◽  
pp. 3907-3909 ◽  
Author(s):  
Ruijia Wang ◽  
Bin Tian

Abstract Summary Most eukaryotic genes produce alternative polyadenylation (APA) isoforms. APA is dynamically regulated under different growth and differentiation conditions. Here, we present a bioinformatics package, named APAlyzer, for examining 3′UTR APA, intronic APA and gene expression changes using RNA-seq data and annotated polyadenylation sites in the PolyA_DB database. Using APAlyzer and data from the GTEx database, we present APA profiles across human tissues. Availability and implementation APAlyzer is freely available at https://bioconductor.org/packages/release/bioc/html/APAlyzer.html as an R/Bioconductor package. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 49 (D1) ◽  
pp. D924-D931 ◽  
Author(s):  
Richard M Baldarelli ◽  
Constance M Smith ◽  
Jacqueline H Finger ◽  
Terry F Hayamizu ◽  
Ingeborg J McCright ◽  
...  

Abstract The Gene Expression Database (GXD; www.informatics.jax.org/expression.shtml) is an extensive and well-curated community resource of mouse developmental gene expression information. For many years, GXD has collected and integrated data from RNA in situ hybridization, immunohistochemistry, RT-PCR, northern blot, and western blot experiments through curation of the scientific literature and by collaborations with large-scale expression projects. Since our last report in 2019, we have continued to acquire these classical types of expression data; developed a searchable index of RNA-Seq and microarray experiments that allows users to quickly and reliably find specific mouse expression studies in ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) and GEO (https://www.ncbi.nlm.nih.gov/geo/); and expanded GXD to include RNA-Seq data. Uniformly processed RNA-Seq data are imported from the EBI Expression Atlas and then integrated with the other types of expression data in GXD, and with the genetic, functional, phenotypic and disease-related information in Mouse Genome Informatics (MGI). This integration has made the RNA-Seq data accessible via GXD’s enhanced searching and filtering capabilities. Further, we have embedded the Morpheus heat map utility into the GXD user interface to provide additional tools for display and analysis of RNA-Seq data, including heat map visualization, sorting, filtering, hierarchical clustering, nearest neighbors analysis and visual enrichment.


2018 ◽  
Author(s):  
Brandon Monier ◽  
Adam McDermaid ◽  
Jing Zhao ◽  
Anne Fennell ◽  
Qin Ma

AbstractMotivationNext-Generation Sequencing has made available much more large-scale genomic and transcriptomic data. Studies with RNA-sequencing (RNA-seq) data typically involve generation of gene expression profiles that can be further analyzed, many times involving differential gene expression (DGE). This process enables comparison across samples of two or more factor levels. A recurring issue with DGE analyses is the complicated nature of the comparisons to be made, in which a variety of factor combinations, pairwise comparisons, and main or blocked main effects need to be tested.ResultsHere we present a tool called IRIS-DGE, which is a server-based DGE analysis tool developed using Shiny. It provides a straightforward, user-friendly platform for performing comprehensive DGE analysis, and crucial analyses that help design hypotheses and to determine key genomic features. IRIS-DGE integrates the three most commonly used R-based DGE tools to determine differentially expressed genes (DEGs) and includes numerous methods for performing preliminary analysis on user-provided gene expression information. Additionally, this tool integrates a variety of visualizations, in a highly interactive manner, for improved interpretation of preliminary and DGE analyses.AvailabilityIRIS-DGE is freely available at http://bmbl.sdstate.edu/IRIS/[email protected] informationSupplementary data are available at Bioinformatics online.


2021 ◽  
Vol 61 (16) ◽  
pp. 1643
Author(s):  
Peng Li ◽  
Yun Zhu ◽  
Xiaolong Kang ◽  
Xingang Dan ◽  
Yun Ma ◽  
...  

Context High-throughput transcriptome sequencing (RNA-Seq) has been widely applied in cattle studies. Public databases such as the National Center for Biotechnology Information (NCBI) contain large collections of gene expression data from various cattle tissues that can be used in gene expression analysis research Aims This study was conducted to investigate patterns of transcriptome variation across tissues of cattle through large-scale identification of housekeeping genes (i.e. those crucial to maintaining basic cellular activity) and tissue-specific genes in cattle tissues. Methods Using data available in the NCBI Sequence Read Archive database, we analysed 1377 transcriptome data sequences from 60 bovine tissue types, identified tissue-specific and housekeeping genes, and set up a web-based bovine gene expression analysis tool. Key results We found 101 genes widely expressed in almost all tissue and screened out five housekeeping genes: RPL35A, eIF4A2, GAPDH, IPO5 and PAK2. Focusing on 12 major organs, we found 861 genes specifically expressing in these tissues. Furthermore, 187 significantly differentially expressed genes were found among six types of muscle tissues. All expression data were made available at our new website http://cattleExp.org, which can be freely accessed for future gene expression analyses. Conclusions The housekeeping genes and tissue-specific genes identified will provide more information for researchers studying gene expression in cattle. Implications The web-based cattle gene expression analysis tool will make it easy for researchers to access large public datasets. Users can easily access all publicly available RNA data and upload their own RNA-Seq data.


2019 ◽  
Author(s):  
Debajyoti Sinha ◽  
Pradyumn Sinha ◽  
Ritwik Saha ◽  
Sanghamitra Bandyopadhyay ◽  
Debarka Sengupta

Abstract Summary DropClust leverages Locality Sensitive Hashing (LSH) to speed up clustering of large scale single cell expression data. Here we present the improved dropClust, a complete R package that is, fast, interoperable and minimally resource intensive. The new dropClust features a novel batch effect removal algorithm that allows integrative analysis of single cell RNA-seq (scRNA-seq) datasets. Availability and implementation dropClust is freely available at https://github.com/debsin/dropClust as an R package. A lightweight online version of the dropClust is available at https://debsinha.shinyapps.io/dropClust/. Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Patryk Orzechowski ◽  
Artur Pańszczyk ◽  
Xiuzhen Huang ◽  
Jason H. Moore

AbstractMotivationBiclustering (called also co-clustering) is an unsupervised technique of simultaneous analysis of rows and columns of input matrix. From the first application to gene expression data, multiple algorithms have been proposed. Only a handful of them were able to provide accurate results and were fast enough to be suitable for large-scale genomic datasets.ResultsIn this paper we introduce a Bioconductor package with parallel version of UniBic biclustering algorithm: one of the most accurate biclustering methods that have been developed so far. For the convenience of usage, we have wrapped the algorithm in an R package called runibic. The package includes: (1) a couple of times faster parallel version of the original sequential algorithm,(2) muchmore efficient memory management, (3) modularity which allows to build new methods on top of the provided one, (4) integration with the modern Bioconductor packages such as SummarizedExperiment, ExpressionSetand biclust.AvailabilityThe package is implemented in R (3.4) and will be available in the new release of Bioconductor (3.6). Currently it could be downloaded from the following URL: http://github.com/athril/runibic/[email protected], [email protected] informationSupplementary informations are available in vignette of the package.


Author(s):  
Krzysztof Polański ◽  
Matthew D Young ◽  
Zhichao Miao ◽  
Kerstin B Meyer ◽  
Sarah A Teichmann ◽  
...  

Abstract Motivation Increasing numbers of large scale single cell RNA-Seq projects are leading to a data explosion, which can only be fully exploited through data integration. A number of methods have been developed to combine diverse datasets by removing technical batch effects, but most are computationally intensive. To overcome the challenge of enormous datasets, we have developed BBKNN, an extremely fast graph-based data integration algorithm. We illustrate the power of BBKNN on large scale mouse atlasing data, and favourably benchmark its run time against a number of competing methods. Availability and implementation BBKNN is available at https://github.com/Teichlab/bbknn, along with documentation and multiple example notebooks, and can be installed from pip. Supplementary information Supplementary data are available at Bioinformatics online.


Sign in / Sign up

Export Citation Format

Share Document