scholarly journals Strategies for detecting and identifying biological signals amidst the variation commonly found in RNA sequencing data

BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
William W. Wilfinger ◽  
Robert Miller ◽  
Hamid R. Eghbalnia ◽  
Karol Mackey ◽  
Piotr Chomczynski

Abstract Background RNA sequencing analysis focus on the detection of differential gene expression changes that meet a two-fold minimum change between groups. The variability present in RNA sequencing data may obscure the detection of valuable information when specific genes within certain samples display large expression variability. This paper develops methods that apply variance and dispersion estimates to intra-group data to identify genes with expression values that diverge from the group envelope. STRING database analysis of the identified genes characterize gene affiliations involved in physiological regulatory networks that contribute to biological variability. Individuals with divergent gene groupings within network pathways can thereby be identified and judiciously evaluated prior to standard differential analysis. Results A three-step process is presented for evaluating biological variability within a group in RNA sequencing data in which gene counts were: (1) scaled to minimize heteroscedasticity; (2) rank-ordered to detect potentially divergent “trendlines” for every gene in the data set; and (3) tested with the STRING database to identify statistically significant pathway associations among the genes displaying marked trendline variability and dispersion. This approach was used to identify the “trendline” profile of every gene in three test data sets. Control data from an in-house data set and two archived samples revealed that 65–70% of the sequenced genes displayed trendlines with minimal variation and dispersion across the sample group after rank-ordering the samples; this is referred to as a linear trendline. Smaller subsets of genes within the three data sets displayed markedly skewed trendlines, wide dispersion and variability. STRING database analysis of these genes identified interferon-mediated response networks in 11–20% of the individuals sampled at the time of blood collection. For example, in the three control data sets, 14 to 26 genes in the defense response to virus pathway were identified in 7 individuals at false discovery rates ≤1.92 E-15. Conclusions This analysis provides a rationale for identifying and characterizing notable gene expression variability within a study group. The identification of highly variable genes and their network associations within specific individuals empowers more judicious inspection of the sample group prior to differential gene expression analysis.

2020 ◽  
Vol 9 (1) ◽  
Author(s):  
Greg Tram ◽  
William P. Klare ◽  
Joel A. Cain ◽  
Basem Mourad ◽  
Stuart J. Cordwell ◽  
...  

Campylobacter jejuni is a foodborne pathogen and an important contributor to gastroenteritis in humans. C. jejuni readily forms biofilms which may play a role in the transmission of the pathogen from animals to humans. Herein, we present RNA sequencing data investigating differential gene expression in biofilm and planktonic C. jejuni. These data provide insight into pathways which may be important to biofilm formation in this organism.


Circulation ◽  
2020 ◽  
Vol 142 (14) ◽  
pp. 1374-1388
Author(s):  
Yanming Li ◽  
Pingping Ren ◽  
Ashley Dawson ◽  
Hernan G. Vasquez ◽  
Waleed Ageedi ◽  
...  

Background: Ascending thoracic aortic aneurysm (ATAA) is caused by the progressive weakening and dilatation of the aortic wall and can lead to aortic dissection, rupture, and other life-threatening complications. To improve our understanding of ATAA pathogenesis, we aimed to comprehensively characterize the cellular composition of the ascending aortic wall and to identify molecular alterations in each cell population of human ATAA tissues. Methods: We performed single-cell RNA sequencing analysis of ascending aortic tissues from 11 study participants, including 8 patients with ATAA (4 women and 4 men) and 3 control subjects (2 women and 1 man). Cells extracted from aortic tissue were analyzed and categorized with single-cell RNA sequencing data to perform cluster identification. ATAA-related changes were then examined by comparing the proportions of each cell type and the gene expression profiles between ATAA and control tissues. We also examined which genes may be critical for ATAA by performing the integrative analysis of our single-cell RNA sequencing data with publicly available data from genome-wide association studies. Results: We identified 11 major cell types in human ascending aortic tissue; the high-resolution reclustering of these cells further divided them into 40 subtypes. Multiple subtypes were observed for smooth muscle cells, macrophages, and T lymphocytes, suggesting that these cells have multiple functional populations in the aortic wall. In general, ATAA tissues had fewer nonimmune cells and more immune cells, especially T lymphocytes, than control tissues did. Differential gene expression data suggested the presence of extensive mitochondrial dysfunction in ATAA tissues. In addition, integrative analysis of our single-cell RNA sequencing data with public genome-wide association study data and promoter capture Hi-C data suggested that the erythroblast transformation-specific related gene( ERG ) exerts an important role in maintaining normal aortic wall function. Conclusions: Our study provides a comprehensive evaluation of the cellular composition of the ascending aortic wall and reveals how the gene expression landscape is altered in human ATAA tissue. The information from this study makes important contributions to our understanding of ATAA formation and progression.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Kyu-Sang Lim ◽  
Qian Dong ◽  
Pamela Moll ◽  
Jana Vitkovska ◽  
Gregor Wiktorin ◽  
...  

Abstract Background Gene expression profiling in blood is a potential source of biomarkers to evaluate or predict phenotypic differences between pigs but is expensive and inefficient because of the high abundance of globin mRNA in porcine blood. These limitations can be overcome by the use of QuantSeq 3’mRNA sequencing (QuantSeq) combined with a method to deplete or block the processing of globin mRNA prior to or during library construction. Here, we validated the effectiveness of QuantSeq using a novel specific globin blocker (GB) that is included in the library preparation step of QuantSeq. Results In data set 1, four concentrations of the GB were applied to RNA samples from two pigs. The GB significantly reduced the proportion of globin reads compared to non-GB (NGB) samples (P = 0.005) and increased the number of detectable non-globin genes. The highest evaluated concentration (C1) of the GB resulted in the largest reduction of globin reads compared to the NGB (from 56.4 to 10.1%). The second highest concentration C2, which showed very similar globin depletion rates (12%) as C1 but a better correlation of the expression of non-globin genes between NGB and GB (r = 0.98), allowed the expression of an additional 1295 non-globin genes to be detected, although 40 genes that were detected in the NGB sample (at a low level) were not present in the GB library. Concentration C2 was applied in the rest of the study. In data set 2, the distribution of the percentage of globin reads for NGB (n = 184) and GB (n = 189) samples clearly showed the effects of the GB on reducing globin reads, in particular for HBB, similar to results from data set 1. Data set 3 (n = 84) revealed that the proportion of globin reads that remained in GB samples was significantly and positively correlated with the reticulocyte count in the original blood sample (P < 0.001). Conclusions The effect of the GB on reducing the proportion of globin reads in porcine blood QuantSeq was demonstrated in three data sets. In addition to increasing the efficiency of sequencing non-globin mRNA, the GB for QuantSeq has an advantage that it does not require an additional step prior to or during library creation. Therefore, the GB is a useful tool in the quantification of whole gene expression profiles in porcine blood.


2015 ◽  
Author(s):  
Benjamin K Johnson ◽  
Matthew B Scholz ◽  
Tracy K Teal ◽  
Robert B Abramovitch

Summary: SPARTA is a reference-based bacterial RNA-seq analysis workflow application for single-end Illumina reads. SPARTA is turnkey software that simplifies the process of analyzing RNA-seq data sets, making bacterial RNA-seq analysis a routine process that can be undertaken on a personal computer or in the classroom. The easy-to-install, complete workflow processes whole transcriptome shotgun sequencing data files by trimming reads and removing adapters, mapping reads to a reference, counting gene features, calculating differential gene expression, and, importantly, checking for potential batch effects within the data set. SPARTA outputs quality analysis reports, gene feature counts and differential gene expression tables and scatterplots. The workflow is implemented in Python for file management and sequential execution of each analysis step and is available for Mac OS X, Microsoft Windows, and Linux. To promote the use of SPARTA as a teaching platform, a web-based tutorial is available explaining how RNA-seq data are processed and analyzed by the software. Availability and Implementation: Tutorial and workflow can be found at sparta.readthedocs.org. Teaching materials are located at sparta-teaching.readthedocs.org. Source code can be downloaded at www.github.com/abramovitchMSU/, implemented in Python and supported on Mac OS X, Linux, and MS Windows. Contact: Robert B. Abramovitch ([email protected]) Supplemental Information: Supplementary data are available online


2020 ◽  
Author(s):  
Rian Pratama ◽  
Jae Joon Hwang ◽  
Ji Hye Lee ◽  
Giltae Song ◽  
Hae Ryoun Park

Abstract Background: Recently, the possibility of tumour classification based on genetic data has been investigated. However, genetic datasets are difficult to handle because of their massive size and complexity of manipulation. In the present study, we examined the diagnostic performance of machine learning applications using imaging-based classifications of oral squamous cell carcinoma (OSCC) gene sets.Methods: RNA sequencing data from SCC tissues from various sites, including oral, non-oral head and neck, oesophageal, and cervical regions, were downloaded from The Cancer Genome Atlas (TCGA). The feature genes were extracted through a convolutional neural network (CNN) and machine learning, and the performance of each analysis was compared.Results: The ability of the machine learning analysis to classify OSCC tumours was excellent. However, the tool exhibited poorer performance in discriminating histopathologically dissimilar cancers derived from the same type of tissue than in differentiating cancers of the same histopathologic type with different tissue origins, revealing that the differential gene expression pattern is a more important factor than the histopathologic features for differentiating cancer types.Conclusion: The CNN-based diagnostic model and the visualisation methods using RNA sequencing data were useful for correctly categorising OSCC. The analysis showed differentially expressed genes in multiwise comparisons of various types of SCCs, such as KCNA10, FOSL2, and PRDM16, and extracted leader genes from pairwise comparisons were FGF20, DLC1, and ZNF705D.


2022 ◽  
Author(s):  
Sofya Lipnitskaya ◽  
Yang Shen ◽  
Stefan Legewie ◽  
Holger Klein ◽  
Kolja Becker

Abstract Background: Recent studies in the area of transcriptomics performed on single-cell and population levels reveal noticeable variability in gene expression measurements provided by different RNA sequencing technologies. Due to increased noise and complexity of single-cell RNA-Seq (scRNA-Seq) data over the bulk experiment, there is a substantial number of variably-expressed genes and so-called dropouts, challenging the subsequent computational analysis and potentially leading to false positive discoveries. In order to investigate factors affecting technical variability between RNA sequencing experiments of different technologies, we performed a systematic assessment of single-cell and bulk RNA-Seq data, which have undergone the same pre-processing and sample preparation procedures. Results: Our analysis indicates that variability between gene expression measurements as well as dropout events are not exclusively caused by biological variability, low expression levels, or random variation. Furthermore, we propose FAVSeq, a machine learning-assisted pipeline for detection of factors contributing to gene expression variability in matched RNA-Seq data provided by two technologies. Based on the analysis of the matched bulk and single-cell dataset, we found the 3'-UTR and transcript lengths as the most relevant effectors of the observed variation between RNA-Seq experiments, while the same factors together with cellular compartments were shown to be associated with dropouts. Conclusions: Here, we investigated the sources of variation in RNA-Seq profiles of matched single-cell and bulk experiments. In addition, we proposed the FAVSeq pipeline for analyzing multimodal RNA sequencing data, which allowed to identify factors affecting quantitative difference in gene expression measurements as well as the presence of dropouts. Hereby, the derived knowledge can be employed further in order to improve the interpretation of RNA-Seq data and identify genes that can be affected by assay-based deviations. Source code is available under the MIT license at https://github.com/slipnitskaya/FAVSeq.


2018 ◽  
Author(s):  
Felix Brechtmann ◽  
Agnė Matusevičiūtė ◽  
Christian Mertes ◽  
Vicente A Yépez ◽  
Žiga Avsec ◽  
...  

AbstractRNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. A powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (OUTlier in RNA-seq fInDER), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read count expectations according to the co-variation among genes resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best correction of artificially corrupted data. Precision–recall analyses using simulated outlier read counts demonstrated the importance of combining correction for co-variation and significance-based thresholds. OUTRIDER is open source and includes functions for filtering out genes not expressed in a data set, for identifying outlier samples with too many aberrantly expressed genes, and for the P-value-based detection of aberrant gene expression, with false discovery rate adjustment. Overall, OUTRIDER provides a computationally fast and scalable end-to-end solution for identifying aberrantly expressed genes, suitable for use by rare disease diagnostic platforms.


Sign in / Sign up

Export Citation Format

Share Document