Massive Mining of Publicly Available RNA-seq Data from Human and Mouse

2017 ◽  
Author(s):  
Alexander Lachmann ◽  
Denis Torre ◽  
Alexandra B. Keenan ◽  
Kathleen M. Jagodnik ◽  
Hyojin J. Lee ◽  
...  

RNA-sequencing (RNA-seq) is currently the leading technology for genome-wide transcript quantification. While the volume of RNA-seq data is rapidly increasing, the currently publicly available RNA-seq data is provided mostly in raw form, with small portions processed non-uniformly. This is mainly because the computational demand, particularly for the alignment step, is a significant barrier for global and integrative retrospective analyses. To address this challenge, we developed all RNA-seq and ChIP-seq sample and signature search (ARCHS4), a web resource that makes the majority of previously published RNA-seq data from human and mouse freely available at the gene count level. Such uniformly processed data enables easy integration for downstream analyses. For developing the ARCHS4 resource, all available FASTQ files from RNA-seq experiments were retrieved from the Gene Expression Omnibus (GEO) and aligned using a cloud-based infrastructure. In total, 137,792 samples are accessible through ARCHS4, with 72,363 mouse and 65,429 human samples. Through efficient use of cloud resources and dockerized deployment of the sequencing pipeline, the alignment cost per sample is reduced to less than one cent. ARCHS4 is updated automatically by adding newly published samples to the database as they become available. Additionally, the ARCHS4 web interface provides intuitive exploration of the processed data through querying tools, interactive visualization, and gene landing pages that provide average expression across cell lines and tissues, top co-expressed genes, and predicted biological functions and protein-protein interactions for each gene based on prior knowledge combined with co-expression. In benchmarks of the quality of these predictions, co-expression correlation data created from ARCHS4 outperforms co-expression data created from other major gene expression data repositories such as GTEx and CCLE. ARCHS4 is freely accessible at: http://amp.pharm.mssm.edu/archs4
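
The "top co-expressed genes" idea mentioned in the abstract above can be illustrated with a small sketch. This is not the ARCHS4 pipeline: it assumes a log2(count + 1) transform and plain Pearson correlation across samples, and the gene names, counts, and function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A gene with constant expression carries no co-expression signal.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def top_coexpressed(counts, gene, k=2):
    """Rank the other genes by correlation with `gene` across samples.

    `counts` maps gene name -> list of per-sample read counts.
    The log2(count + 1) transform is an assumption of this sketch.
    """
    logged = {g: [math.log2(c + 1) for c in row] for g, row in counts.items()}
    target = logged[gene]
    scores = [(other, pearson(target, values))
              for other, values in logged.items() if other != gene]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Hypothetical gene-level counts for four samples.
counts = {
    "GENE_A": [10, 200, 35, 400],
    "GENE_B": [12, 180, 40, 390],
    "GENE_C": [300, 15, 250, 8],
    "GENE_D": [50, 50, 50, 50],
}

for name, r in top_coexpressed(counts, "GENE_A", k=3):
    print(f"{name}\t{r:.3f}")
```

With uniformly processed counts, the same ranking can be computed across many studies at once, which is the kind of integration that non-uniform processing makes difficult.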

2019 ◽  
Author(s):  
Hidemasa Bono

Abstract Gene expression data have been archived as microarray and RNA-seq datasets in two public databases, Gene Expression Omnibus (GEO) and ArrayExpress (AE). In 2018, the DNA DataBank of Japan started a similar repository called the Genomic Expression Archive (GEA). These databases are useful resources for the functional interpretation of genes, but have been separately maintained and may lack RNA-seq data, while the original sequence data are available in the Sequence Read Archive (SRA). We constructed an index for those gene expression data repositories, called All Of gene Expression (AOE), to integrate publicly available gene expression data. The web interface of AOE can graphically query data in addition to the application programming interface. By collecting gene expression data from RNA-seq in the SRA, AOE also includes data not included in GEO and AE. AOE is accessible as a search tool from the GEA website and is freely available at https://aoe.dbcls.jp/.


2019 ◽  
Author(s):  
Brian B. Nadel ◽  
David Lopez ◽  
Dennis J. Montoya ◽  
Feiyang Ma ◽  
Hannah Waddel ◽  
...  

Abstract The cell type composition of heterogeneous tissue samples can be a critical variable in both clinical and laboratory settings. However, current experimental methods of cell type quantification (e.g. cell flow cytometry) are costly, time consuming, and can introduce bias. Computational approaches that infer cell type abundance from expression data offer an alternative solution. While these methods have gained popularity, most are limited to predicting hematopoietic cell types and do not produce accurate predictions for stromal cell types. Many of these methods are also limited to particular platforms, whether RNA-seq or specific microarrays. We present the Gene Expression Deconvolution Interactive Tool (GEDIT), a tool that overcomes these limitations, compares favorably with existing methods, and provides superior versatility. Using both simulated and experimental data, we extensively evaluate the performance of GEDIT and demonstrate that it returns robust results under a wide variety of conditions. These conditions include a variety of platforms (microarray and RNA-seq), tissue types (blood and stromal), and species (human and mouse). Finally, we provide reference data from eight sources spanning a wide variety of stromal and hematopoietic types in both human and mouse. This reference database allows the user to obtain estimates for a wide variety of tissue samples without having to provide their own data. GEDIT also accepts user-submitted reference data, thus allowing the estimation of any cell type or subtype, provided that reference data is available.
Author Summary The Gene Expression Deconvolution Interactive Tool (GEDIT) is a robust and accurate tool that uses gene expression data to estimate cell type abundances. Extensive testing on a variety of tissue types and technological platforms demonstrates that GEDIT provides greater versatility than other cell type deconvolution tools. GEDIT utilizes reference data describing the expression profile of purified cell types, and we provide in the software package a library of reference matrices from various sources. GEDIT is also flexible and allows the user to supply custom reference matrices. A graphical interface for GEDIT is available at http://webtools.mcdb.ucla.edu/, and source code and reference matrices are available at https://github.com/BNadel/GEDIT.


NAR Cancer ◽  
2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Zachary V Thomas ◽  
Zhenjia Wang ◽  
Chongzhi Zang

Abstract Dysregulation of gene expression plays an important role in cancer development. Identifying transcriptional regulators, including transcription factors and chromatin regulators, that drive the oncogenic gene expression program is a critical task in cancer research. Genomic profiles of active transcriptional regulators from primary cancer samples are limited in the public domain. Here we present BART Cancer (bartcancer.org), an interactive web resource database to display the putative transcriptional regulators that are responsible for differentially regulated genes in 15 different cancer types in The Cancer Genome Atlas (TCGA). BART Cancer integrates over 10000 gene expression profiling RNA-seq datasets from TCGA with over 7000 ChIP-seq datasets from the Cistrome Data Browser database and the Gene Expression Omnibus (GEO). BART Cancer uses Binding Analysis for Regulation of Transcription (BART) for predicting the transcriptional regulators from the differentially expressed genes in cancer samples compared to normal samples. BART Cancer also displays the activities of over 900 transcriptional regulators across cancer types, by integrating computational prediction results from BART and the Cistrome Cancer database. Focusing on transcriptional regulator activities in human cancers, BART Cancer can provide unique insights into epigenetics and transcriptional regulation in cancer, and is a useful data resource for genomics and cancer research communities.


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Bojun Xu ◽  
Lei Wang ◽  
Huakui Zhan ◽  
Liangbin Zhao ◽  
Yuehan Wang ◽  
...  

Objectives. Diabetic nephropathy (DN) is a major cause of end-stage renal disease (ESRD) throughout the world, and the identification of novel biomarkers via bioinformatics analysis could provide a research foundation for future experimental verification and large cohort studies in DN models and patients. Methods. GSE30528, GSE47183, and GSE104948 were downloaded from the Gene Expression Omnibus (GEO) database to find differentially expressed genes (DEGs). Differences in gene expression between normal renal tissues and DN renal tissues were first screened with GEO2R. Then, the protein-protein interactions (PPIs) of the DEGs were analyzed with the STRING database, the results were integrated and visualized with Cytoscape software, and the hub genes in this PPI network were selected by MCODE and topological analysis. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were carried out to determine the molecular mechanisms of DEGs involved in the progression of DN. Finally, the Nephroseq v5 online platform was used to explore the correlation between hub genes and clinical features of DN. Results. A total of 64 DEGs and 32 hub genes were identified. The hub genes were enriched in several functions and pathways, such as complement binding, extracellular matrix structural constituent, complement cascade-related pathways, and ECM proteoglycans. The correlation analysis and subgroup analysis of 7 complement cascade-related hub genes and the clinical characteristics of DN showed that C1QA, C1QB, C3, CFB, ITGB2, VSIG4, and CLU may participate in the development of DN. Conclusions. We confirmed that the complement cascade-related hub genes may be novel biomarkers for early diagnosis and targeted treatment of DN.
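
The topological part of hub-gene selection described above can be sketched as a ranking by node degree in the PPI network. This is a simplified stand-in, not MCODE and not the authors' exact procedure; the edge list, the placeholder gene names, and the function name are hypothetical.

```python
from collections import defaultdict

def hub_genes_by_degree(edges, top_n=2):
    """Rank genes by the number of distinct interaction partners.

    `edges` is an iterable of (gene_a, gene_b) pairs from a PPI network.
    Self-loops and duplicate edges are ignored. Ties are broken by name.
    """
    seen = set()
    degree = defaultdict(int)
    for a, b in edges:
        if a == b:
            continue
        key = tuple(sorted((a, b)))
        if key in seen:
            continue
        seen.add(key)
        degree[a] += 1
        degree[b] += 1
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

# Hypothetical interactions among placeholder genes G1..G5.
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
         ("G2", "G3"), ("G4", "G5"), ("G2", "G1")]

print(hub_genes_by_degree(edges, top_n=2))
```

Degree is only one of several centrality measures used for this purpose; cluster-based methods such as MCODE look at densely connected regions instead of single nodes.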


2018 ◽  
Vol 7 ◽  
pp. e1279
Author(s):  
Mona Zamanian Azodi ◽  
Mostafa Rezaei-Tavirani ◽  
Mohammad Rostami-Nejad ◽  
Majid Rezaei-Tavirani

Background: Bladder cancer (BC) remains one of the most challenging issues in medicine. The aim of this study was to perform a differential network analysis of stages 2 and 4 of BC to better understand the molecular pathology of these states. Materials and Methods: We chose the gene expression data of GSE52519 from the Gene Expression Omnibus (GEO) database and analyzed them with the GEO2R online tool. Cytoscape version 3.6.1 and its algorithms were applied for network construction and investigation of differentially expressed genes (DEGs) in these states. Results: Our results revealed that the analysis of DEGs provides useful information about a common molecular feature of stages 2 and 4 of BC. Conclusion: The network findings indicate that more investigation of stage 2 is required to achieve an effective therapeutic protocol to block the transition from stage 2 to stage 4. [GMJ.2018;7:e1279]


2019 ◽  
Vol 15 (2) ◽  
pp. e1006792 ◽  
Author(s):  
Brandon Monier ◽  
Adam McDermaid ◽  
Cankun Wang ◽  
Jing Zhao ◽  
Allison Miller ◽  
...  

BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Paulo Rapazote-Flores ◽  
Micha Bayer ◽  
Linda Milne ◽  
Claus-Dieter Mayer ◽  
John Fuller ◽  
...  

Abstract Background: The time required to analyse RNA-seq data varies considerably, due to discrete steps for computational assembly, quantification of gene expression and splicing analysis. Recent fast non-alignment tools such as Kallisto and Salmon overcome these problems, but these tools require a high-quality, comprehensive reference transcripts dataset (RTD), which is rarely available in plants. Results: A high-quality, non-redundant barley gene RTD and database (Barley Reference Transcripts – BaRTv1.0) has been generated. BaRTv1.0 was constructed from a range of tissues, cultivars and abiotic treatments, and transcripts were assembled and aligned to the barley cv. Morex reference genome (Mascher et al. Nature; 544: 427–433, 2017). Full-length cDNAs from the barley variety Haruna nijo (Matsumoto et al. Plant Physiol; 156: 20–28, 2011) determined transcript coverage, and high-resolution RT-PCR validated alternatively spliced (AS) transcripts of 86 genes in five different organs and tissues. These methods were used as benchmarks to select an optimal barley RTD. BaRTv1.0-Quantification of Alternatively Spliced Isoforms (QUASI) was also made to overcome inaccurate quantification due to variation in 5′ and 3′ UTR ends of transcripts. BaRTv1.0-QUASI was used for accurate transcript quantification of RNA-seq data of five barley organs/tissues. This analysis identified 20,972 significant differentially expressed genes, 2791 differentially alternatively spliced genes and 2768 transcripts with differential transcript usage. Conclusion: A high-confidence barley reference transcript dataset consisting of 60,444 genes with 177,240 transcripts has been generated. Compared to current barley transcripts, BaRTv1.0 transcripts are generally longer, have less fragmentation and improved gene models that are well supported by splice junction reads. Precise transcript quantification using BaRTv1.0 allows routine analysis of gene expression and AS.


Author(s):  
D Fumagalli ◽  
B Haibe-Kains ◽  
S Michiels ◽  
DN Brown ◽  
D Gacquer ◽  
...  

2019 ◽  
Vol 2019 ◽  
pp. 1-12
Author(s):  
Shan Lin ◽  
Zhicheng Zou ◽  
Cuibing Zhou ◽  
Hancheng Zhang ◽  
Zhiming Cai

Caterpillar fungus is a well-known fungal Chinese medicine. To reveal molecular changes during early and late stages of adenosine biosynthesis, transcriptome analysis was performed with the anamorph strain of caterpillar fungus. A total of 2,764 differentially expressed genes (DEGs) were identified (p≤0.05, |log2 Ratio| ≥ 1), of which 1,737 were up-regulated and 1,027 were down-regulated. Gene expression profiling on 4–10 d revealed a distinct shift in expression of the purine metabolism pathway. Differential expression of 17 selected DEGs involved in purine metabolism (map00230) was validated by qPCR, and the expression trends were consistent with the RNA-Seq results. Subsequently, the predicted adenosine biosynthesis pathway combined with qPCR and RNA-Seq gene expression data indicated that the increased adenosine accumulation is a result of down-regulation of the ndk, ADK, and APRT genes combined with up-regulation of the AK gene. This study will be valuable for understanding the molecular mechanisms of adenosine biosynthesis in caterpillar fungus.
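
The DEG criteria quoted above (p≤0.05, |log2 Ratio| ≥ 1) amount to a simple two-condition filter, sketched below. The input format, gene identifiers, values, and function name are hypothetical, and the sketch makes no claim about which statistical test produced the p-values in the study.

```python
def classify_degs(results, p_cutoff=0.05, log2_ratio_cutoff=1.0):
    """Split genes into up- and down-regulated DEGs.

    `results` maps gene id -> (log2_ratio, p_value).
    A gene is a DEG when p_value <= p_cutoff and
    |log2_ratio| >= log2_ratio_cutoff, as in the thresholds quoted above.
    """
    up, down = [], []
    for gene, (log2_ratio, p_value) in results.items():
        if p_value <= p_cutoff and abs(log2_ratio) >= log2_ratio_cutoff:
            if log2_ratio > 0:
                up.append(gene)
            else:
                down.append(gene)
    return up, down

# Hypothetical per-gene results: (log2 ratio, p-value).
results = {
    "g1": (1.0, 0.05),    # passes both thresholds exactly: up-regulated
    "g2": (-2.3, 0.001),  # down-regulated
    "g3": (3.1, 0.20),    # large change but not significant
    "g4": (0.4, 0.001),   # significant but change too small
}

up, down = classify_degs(results)
print("up:", up)
print("down:", down)
```

A |log2 Ratio| of 1 corresponds to a two-fold change in either direction, so the filter keeps only genes that are both statistically significant and at least doubled or halved in expression.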

