Importance of experimental information (metadata) for archived sequence data: case of specific gene bias due to lag time between sample harvest and RNA protection in RNA sequencing

PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11875
Author(s):  
Tomoko Matsuda

Large volumes of high-throughput sequencing data have been submitted to the Sequence Read Archive (SRA). The lack of experimental metadata associated with the data makes it very difficult to reuse the data and to assess their quality. In the case of RNA sequencing (RNA-Seq), which reveals the presence and quantity of RNA in a biological sample at a given moment, it is necessary to consider that gene expression responds over a short time interval (several seconds to a few minutes) in many organisms. Therefore, to isolate RNA that accurately reflects the transcriptome at the point of harvest, raw biological samples should be processed as soon as possible by freezing in liquid nitrogen, immersing in RNA stabilization reagent, or lysing and homogenizing in RNA lysis buffer containing guanidine thiocyanate. As the number of samples handled simultaneously increases, the time until the RNA is protected can increase. Here, to evaluate the effect of different lag times in RNA protection on RNA-Seq data, we harvested CHO-S cells after 3, 5, 6, and 7 days of cultivation, added RNA lysis buffer in a time course of 15, 30, 45, and 60 min after harvest, and conducted RNA-Seq. These RNA samples showed high RNA integrity number (RIN) values, indicating non-degraded RNA, and sequence data from libraries prepared with these RNA samples were of high quality according to FastQC. We observed that, on the same cultivation day, global trends of gene expression were similar across the time course of addition of RNA lysis buffer; however, the expression of some genes differed significantly between the time-course samples of the same cultivation day, and most of these differentially expressed genes were related to apoptosis. We conclude that the time lag between sample harvest and RNA protection influences the expression of specific genes. It is, therefore, necessary to know not only the RIN values of the RNA and the quality of the sequence data but also how the experiment was performed when acquiring RNA-Seq data from the database.

2019 ◽  
Author(s):  
Meichen Dong ◽  
Aatish Thennavan ◽  
Eugene Urrutia ◽  
Yun Li ◽  
Charles M. Perou ◽  
...  

Abstract Recent advances in single-cell RNA sequencing (scRNA-seq) enable characterization of transcriptomic profiles with single-cell resolution and circumvent averaging artifacts associated with traditional bulk RNA sequencing (RNA-seq) data. Here, we propose SCDC, a deconvolution method for bulk RNA-seq that leverages cell-type specific gene expression profiles from multiple scRNA-seq reference datasets. SCDC adopts an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets that are produced in different laboratories and at different times, implicitly addressing the problem of batch-effect confounding. SCDC is benchmarked against existing methods using both in silico generated pseudo-bulk samples and experimentally mixed cell lines, whose known cell-type compositions serve as ground truths. We show that SCDC outperforms existing methods with improved accuracy of cell-type decomposition under both settings. To illustrate how the ENSEMBLE framework performs in complex tissues under different scenarios, we further apply our method to a human pancreatic islet dataset and a mouse mammary gland dataset. SCDC returns results that are more consistent with experimental designs and that reproduce more significant associations between cell-type proportions and measured phenotypes.
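To make the idea of bulk deconvolution concrete, the sketch below estimates cell-type proportions by plain non-negative least squares, solved with projected gradient descent. It is an illustration only and is not SCDC's algorithm: it omits any gene weighting and the ENSEMBLE step, and the function name `deconvolve`, the signature matrix, and the bulk vector are all hypothetical.

```python
def deconvolve(signature, bulk, n_iter=2000):
    """Estimate non-negative cell-type proportions p with signature @ p ~ bulk.

    signature: one row per gene, one mean expression value per cell type
               (a hypothetical reference derived from scRNA-seq data).
    bulk:      one bulk expression value per gene.
    This is plain NNLS via projected gradient descent, not SCDC itself.
    """
    n_genes, n_types = len(signature), len(signature[0])
    # 1 / (squared Frobenius norm) never exceeds 1 / (largest eigenvalue
    # of S^T S), so this step size keeps gradient descent stable.
    step = 1.0 / sum(v * v for row in signature for v in row)
    p = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        # Residual of the current fit, one value per gene.
        residual = [
            sum(row[k] * p[k] for k in range(n_types)) - b
            for row, b in zip(signature, bulk)
        ]
        # Gradient of 0.5 * ||S p - b||^2 with respect to p.
        grad = [
            sum(signature[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto p >= 0.
        p = [max(0.0, pk - step * gk) for pk, gk in zip(p, grad)]
    total = sum(p)
    # Rescale so the estimates sum to one and read as proportions.
    return [pk / total for pk in p] if total > 0 else p


# Toy example: two cell types, three genes, true proportions 0.3 and 0.7.
signature = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0]]
bulk = [3.7, 7.3, 5.0]
proportions = deconvolve(signature, bulk)
```

With several reference datasets, a function like this could be run once per reference and the resulting proportion vectors combined; how SCDC chooses the combination weights is not described in the abstract and is not reproduced here.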


Cells ◽  
2019 ◽  
Vol 8 (10) ◽  
pp. 1161 ◽  
Author(s):  
Xifang Sun ◽  
Shiquan Sun ◽  
Sheng Yang

Estimating cell type compositions for complex diseases is an important step in investigating cellular heterogeneity, understanding disease etiology, and potentially facilitating early disease diagnosis and prevention. Here, we developed a computational statistical method, referred to as Multi-Omics Matrix Factorization (MOMF), to estimate the cell-type compositions of bulk RNA sequencing (RNA-seq) data by leveraging cell type-specific gene expression levels from single-cell RNA sequencing (scRNA-seq) data. MOMF not only directly models the count nature of gene expression data, but also effectively accounts for the uncertainty of cell type-specific mean gene expression levels. We demonstrate the benefits of MOMF through three real data applications, i.e., glioblastoma (GBM), colorectal cancer (CRC) and type II diabetes (T2D) studies. MOMF is able to accurately estimate disease-related cell type proportions, i.e., oligodendrocyte progenitor cells and macrophage cells, which are strongly associated with the survival of GBM and CRC, respectively.


2020 ◽  
Author(s):  
Kwangbom Choi ◽  
Hao He ◽  
Daniel M. Gatti ◽  
Vivek M. Philip ◽  
Narayanan Raghupathy ◽  
...  

Abstract Multi-parent populations (MPPs), genetically segregating model systems derived from two or more inbred founder strains, are widely used in biomedical and agricultural research. Gene expression profiling by direct RNA sequencing (RNA-Seq) is commonly applied to MPPs to investigate gene expression regulation and to identify candidate genes. In genetically diverse populations, including most MPPs, quantification of gene expression is improved when the RNA-Seq reads are aligned to individualized transcriptomes that incorporate known polymorphic loci. However, the process of constructing and analyzing individual genomes can be computationally demanding and error prone. We propose a new approach, genome reconstruction by RNA-Seq (GBRS), that relies on simultaneous alignment of RNA-Seq reads to the founder strain transcriptomes. GBRS can reconstruct the diploid genome of each individual and quantify both total and allele-specific gene expression. We demonstrate that GBRS performs as well as methods that rely on high-density genotyping arrays to reconstruct the founder haplotype mosaic of MPP individuals. Using GBRS in addition to other genotyping methods provides quality control for detecting sample mix-ups and improves power to detect expression quantitative trait loci. GBRS software is freely available at https://github.com/churchill-lab/gbrs.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Verônica R. de Melo Costa ◽  
Julianus Pfeuffer ◽  
Annita Louloupi ◽  
Ulf A. V. Ørom ◽  
Rosario M. Piro

Abstract Background: Introns are generally removed from primary transcripts to form mature RNA molecules in a post-transcriptional process called splicing. Efficient splicing of primary transcripts is an essential step in gene expression, and its misregulation is related to numerous human diseases. Thus, to better understand the dynamics of this process and the perturbations that might be caused by aberrant transcript processing, it is important to quantify splicing efficiency. Results: Here, we introduce SPLICE-q, a fast and user-friendly Python tool for genome-wide SPLICing Efficiency quantification. It supports studies focusing on the implications of splicing efficiency in transcript processing dynamics. SPLICE-q uses aligned reads from strand-specific RNA-seq to quantify splicing efficiency for each intron individually and allows the user to select different levels of restrictiveness concerning the introns’ overlap with other genomic elements such as exons of other genes. We applied SPLICE-q to globally assess the dynamics of intron excision in yeast and human nascent RNA-seq. We also show its application using total RNA-seq from a patient-matched prostate cancer sample. Conclusions: Our analyses illustrate that SPLICE-q is suitable for detecting a progressive increase in splicing efficiency throughout a time course of nascent RNA-seq, and it might be useful for understanding cancer progression beyond mere gene expression levels. SPLICE-q is available at: https://github.com/vrmelo/SPLICE-q
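As a rough illustration of per-intron splicing efficiency, the sketch below computes the fraction of junction-covering reads that are split (spliced) at an intron's 5' and 3' splice junctions. The formula, the function name `splicing_efficiency`, and the example counts are assumptions made for illustration; the abstract does not state SPLICE-q's exact definition, so this should not be read as the tool's implementation.

```python
def splicing_efficiency(split_5p, unsplit_5p, split_3p, unsplit_3p):
    """Fraction of reads at an intron's two splice junctions that are split.

    split_*:   reads spanning the junction with the intron removed.
    unsplit_*: reads running continuously across the exon-intron boundary.
    Returns a value in [0, 1], or None when no read covers either junction.
    Assumed formulation for illustration, not necessarily SPLICE-q's.
    """
    split = split_5p + split_3p
    total = split + unsplit_5p + unsplit_3p
    if total == 0:
        return None  # no coverage, efficiency is undefined
    return split / total


# Hypothetical counts for one intron: mostly spliced transcripts.
efficiency = splicing_efficiency(split_5p=90, unsplit_5p=10,
                                 split_3p=80, unsplit_3p=20)
```

Returning `None` for uncovered introns keeps them distinguishable from introns that are covered but entirely unspliced, which would score 0.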


2021 ◽  
Vol 36 (Supplement_1) ◽  
Author(s):  
Floranne Boulogne ◽  
Laura Claus ◽  
Henry Wiersma ◽  
Roy Oelen ◽  
Floor Schukking ◽  
...  

Abstract Background and Aims: Genetic testing in patients with suspected hereditary kidney disease does not always reveal the genetic cause of the patient's disorder. Potentially pathogenic variants can reside in genes that are not known to be involved in kidney disease, which makes it difficult to prioritize and interpret the relevance of these variants. As such, there is a clear need for methods that predict the phenotypic consequences of gene expression in a way that is as unbiased as possible. To help identify candidate genes we have developed KidneyNetwork, in which tissue-specific expression is utilized to predict kidney-specific gene functions. Method: We combined gene co-expression in 878 publicly available kidney RNA-sequencing samples with the co-expression of a multi-tissue RNA-sequencing dataset of 31,499 samples to build KidneyNetwork. The expression patterns were used to predict which genes have a kidney-related function, and which (disease) phenotypes might be caused when these genes are mutated. By integrating the information from the HPO database, in which known phenotypic consequences of disease genes are annotated, with the gene co-expression network we obtained prediction scores for each gene per HPO term. As proof of principle, we applied KidneyNetwork to prioritize variants in exome-sequencing data from 13 kidney disease patients without a genetic diagnosis. Results: We assessed the prediction performance of KidneyNetwork by comparing it to GeneNetwork, a multi-tissue co-expression network we previously developed. In KidneyNetwork, we observe a significantly improved prediction accuracy of kidney-related HPO terms, as well as an increase in the total number of significantly predicted kidney-related HPO terms (figure 1). To examine its clinical utility, we applied KidneyNetwork to 13 patients with a suspected hereditary kidney disease without a genetic diagnosis. Based on the HPO terms “Renal cyst” and “Hepatic cysts”, combined with a list of potentially damaging variants in one of the undiagnosed patients with mild ADPKD/PCLD, we identified ALG6 as a new candidate gene. ALG6 bears a high resemblance to other genes implicated in this phenotype in recent years. Through the 100,000 Genomes Project and collaborators we identified three additional patients with kidney and/or liver cysts carrying a suspected deleterious variant in ALG6. Conclusion: We present KidneyNetwork, a kidney-specific co-expression network that accurately predicts which genes have kidney-specific functions and may result in kidney disease. Gene-phenotype associations of genes not previously linked to kidney-related phenotypes can be predicted by KidneyNetwork. We show the added value of KidneyNetwork by applying it to exome sequencing data of kidney disease patients without a molecular diagnosis, and consequently we propose ALG6 as a promising candidate gene. KidneyNetwork can be applied to clinically unsolved kidney disease cases, but it can also be used by researchers to gain insight into individual genes to better understand kidney physiology and pathophysiology. Acknowledgments: This research was made possible through access to the data and findings generated by the 100,000 Genomes Project; http://www.genomicsengland.co.uk.
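The general idea of scoring a gene for a phenotype term from co-expression can be sketched as follows: a candidate gene is scored by its mean Pearson correlation with genes already annotated to the term. This is a deliberately simplified stand-in, not KidneyNetwork's actual prediction procedure, and the gene names, expression values, and the function `term_score` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def term_score(candidate, annotated_genes, expression):
    """Mean correlation between a candidate and genes annotated to a term.

    expression maps gene name -> list of expression values across samples.
    Simplified guilt-by-association score, assumed for illustration only.
    """
    correlations = [
        pearson(expression[candidate], expression[gene])
        for gene in annotated_genes
    ]
    return sum(correlations) / len(correlations)


# Hypothetical expression profiles across four samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
    "CANDIDATE": [1.5, 2.5, 3.5, 4.5],
}
score_similar = term_score("CANDIDATE", ["GENE_A", "GENE_B"], expression)
score_opposite = term_score("CANDIDATE", ["GENE_C"], expression)
```

A high score suggests the candidate is expressed in step with genes known for the term; a score near zero or below gives no such support.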


2021 ◽  
Author(s):  
Dennis A Sun ◽  
Nipam H Patel

Abstract Emerging research organisms enable the study of biology that cannot be addressed using classical “model” organisms. The development of novel data resources can accelerate research in such animals. Here, we present new functional genomic resources for the amphipod crustacean Parhyale hawaiensis, facilitating the exploration of gene regulatory evolution using this emerging research organism. We use Omni-ATAC-Seq, an improved form of the Assay for Transposase-Accessible Chromatin coupled with next-generation sequencing (ATAC-Seq), to identify accessible chromatin genome-wide across a broad time course of Parhyale embryonic development. This time course encompasses many major morphological events, including segmentation, body regionalization, gut morphogenesis, and limb development. In addition, we use short- and long-read RNA-Seq to generate an improved Parhyale genome annotation, enabling deeper classification of identified regulatory elements. We leverage a variety of bioinformatic tools to discover differential accessibility, predict nucleosome positioning, infer transcription factor binding, cluster peaks based on accessibility dynamics, classify biological functions, and correlate gene expression with accessibility. Using a Minos transposase reporter system, we demonstrate the potential to identify novel regulatory elements using this approach, including distal regulatory elements. This work provides a platform for the identification of novel developmental regulatory elements in Parhyale, and offers a framework for performing such experiments in other emerging research organisms.

Primary Findings
- Omni-ATAC-Seq identifies cis-regulatory elements genome-wide during crustacean embryogenesis
- Combined short- and long-read RNA-Seq improves the Parhyale genome annotation
- ImpulseDE2 analysis identifies dynamically regulated candidate regulatory elements
- NucleoATAC and HINT-ATAC enable inference of nucleosome occupancy and transcription factor binding
- Fuzzy clustering reveals peaks with distinct accessibility and chromatin dynamics
- Integration of accessibility and gene expression reveals possible enhancers and repressors
- Omni-ATAC can identify known and novel regulatory elements


2019 ◽  
Vol 97 (Supplement_3) ◽  
pp. 135-135
Author(s):  
Shengfa F Liao ◽  
Shamimul Hasan ◽  
Jean M Feugang

Abstract Animal life is essentially a set of gene expression processes. A thorough understanding of how these processes are driven by dietary nutrients and other environmental factors can be regarded as the foundation of modern animal nutrition research aimed at improving animal growth, development, health, production, and reproductive performance. Nutrigenomics, a genome-wide approach using the knowledge and techniques of genomics (including transcriptomics) and molecular biology, studies the effects of dietary nutrients on cellular gene expression, cellular metabolic responses and, ultimately, the phenotypic changes of a living organism. Transcriptomics can be applied to investigate the animal tissue transcriptome at a defined physiological or nutritional state, which provides a holistic view of the intracellular expression of RNA, especially mRNA. As a novel, promising transcriptomics approach, RNA sequencing (RNA-Seq) technology can monitor the expression of all genes simultaneously in response to dietary intervention. The principle and history of RNA-Seq technology will be briefly reviewed, and the three principal steps of this methodology, namely the laboratory analysis of tissue samples, the bioinformatics analysis of the generated sequence data, and the subsequent biological interpretation of the data, will be described. The application of RNA-Seq technology in different areas of animal nutrition research, which include maternal nutrition, feeding strategy and gut microbiota, will be summarized. Lastly, the application of RNA-Seq technology in swine science and nutrition research will also be discussed. In short, RNA-Seq technology holds great potential for yielding new insights into nutrient-gene interactions in agricultural animals and thereby further improving animal feeding or production efficiency, and the application of this technology in animal nutrition research is expected to continue to grow in the foreseeable future. This research was supported in part by a USDA-NIFA Multistate Project (No. 1007691).

