Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data

Abstract RNA-seq methods are widely used for transcriptomic profiling of biological samples. However, this technology has known caveats that can skew gene expression estimates. Specifically, if the library preparation protocol does not retain RNA strand information, some genes can be quantified erroneously. Although strand-specific protocols have been established, a significant portion of RNA-seq data is generated in a non-strand-specific manner. We used a comprehensive stranded RNA-seq dataset of 15 blood cell types to identify genes whose expression would be erroneously estimated if strand information were not available. We found that about 10% of all genes and 2.5% of protein-coding genes have a two-fold or higher difference in estimated expression when the strand information of the reads is ignored. We used parameters of read alignments of these genes to construct a machine learning model that can identify which genes in an unstranded dataset might have incorrect expression estimates and which do not. We also show that differential expression analysis of genes with biased expression estimates in unstranded read data can be recovered by limiting the reads considered to those that span exonic boundaries. The resulting approach is implemented as a package available at https://github.com/mikpom/uslcount.
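The two-fold criterion mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes per-gene counts are already available from the same library counted once with and once without strand information. The function name `flag_strand_biased`, the pseudocount, and the input dictionaries are hypothetical, and this is not the uslcount API or its machine learning model.

```python
import math


def flag_strand_biased(stranded, unstranded, min_fold=2.0, pseudocount=1.0):
    """Return sorted gene IDs whose unstranded estimate differs from the
    stranded estimate by at least `min_fold` in either direction.

    `stranded` and `unstranded` map gene ID -> count (hypothetical inputs).
    A pseudocount (an assumption of this sketch) avoids division by zero.
    """
    threshold = math.log2(min_fold)
    flagged = []
    for gene, s in stranded.items():
        u = unstranded.get(gene, 0.0)
        log2_ratio = math.log2((u + pseudocount) / (s + pseudocount))
        if abs(log2_ratio) >= threshold:
            flagged.append(gene)
    return sorted(flagged)


if __name__ == "__main__":
    stranded = {"geneA": 100, "geneB": 10, "geneC": 50}
    unstranded = {"geneA": 105, "geneB": 40, "geneC": 50}
    print(flag_strand_biased(stranded, unstranded))
```

In this toy input only `geneB` is flagged, because reads from an overlapping gene on the opposite strand would inflate its unstranded count roughly four-fold.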


intePareto: an R package for integrative analyses of RNA-Seq and ChIP-Seq data

BMC Genomics ◽

10.1186/s12864-020-07205-6 ◽

2020 ◽

Vol 21 (S11) ◽

Author(s):

Yingying Cao ◽

Simo Kitanovski ◽

Daniel Hoffmann

Keyword(s):

Gene Expression ◽

Expression Analysis ◽

High Throughput Sequencing ◽

Differential Expression Analysis ◽

Cell Types ◽

R Package ◽

Integrative Approach ◽

Integrated Analysis ◽

Data Sets ◽

Rna Seq

Abstract Background: RNA-Seq, the high-throughput sequencing (HT-Seq) of mRNAs, has become an essential tool for characterizing gene expression differences between cell types and conditions. Gene expression is regulated by several mechanisms, including epigenetic regulation by post-translational histone modifications, which can be assessed by ChIP-Seq (Chromatin Immuno-Precipitation Sequencing). As more and more biological samples are analyzed by the combination of ChIP-Seq and RNA-Seq, the integrated analysis of the corresponding data sets becomes, theoretically, a unique option to study gene regulation. However, technically such analyses are still in their infancy. Results: Here we introduce intePareto, a computational tool for the integrative analysis of RNA-Seq and ChIP-Seq data. With intePareto we match RNA-Seq and ChIP-Seq data at the level of genes, perform differential expression analysis between biological conditions, and prioritize genes with consistent changes in RNA-Seq and ChIP-Seq data using Pareto optimization. Conclusion: intePareto facilitates comprehensive understanding of high-dimensional transcriptomic and epigenomic data. Its superiority to a naive differential gene expression analysis with RNA-Seq and to an available integrative approach is demonstrated by analyzing a public dataset.
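Prioritizing genes by Pareto optimization can be sketched in Python as sorting genes into successive non-dominated fronts. The sketch assumes each gene already has one score per objective, with higher meaning more consistent change between RNA-Seq and a ChIP-Seq mark. The score definition and the function names are hypothetical; this is not the intePareto R API.

```python
def dominates(a, b):
    """True if score tuple `a` is at least as good as `b` in every
    objective and strictly better in at least one (higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_fronts(scores):
    """Group genes into successive Pareto fronts.

    `scores` maps gene ID -> tuple of objective values (hypothetical,
    e.g. one consistency score per histone mark). Front 0 holds the
    genes that no other gene dominates; removing it and repeating
    yields front 1, and so on.
    """
    remaining = dict(scores)
    fronts = []
    while remaining:
        front = [
            gene for gene, s in remaining.items()
            if not any(dominates(t, s)
                       for other, t in remaining.items() if other != gene)
        ]
        fronts.append(sorted(front))
        for gene in front:
            del remaining[gene]
    return fronts


if __name__ == "__main__":
    scores = {
        "g1": (3.0, 2.0),
        "g2": (2.0, 3.0),
        "g3": (1.0, 1.0),
        "g4": (2.5, 1.5),
    }
    for rank, front in enumerate(pareto_fronts(scores)):
        print(rank, front)
```

Ranking by front rather than by a single weighted sum avoids having to choose weights for the individual objectives, which is the usual motivation for a Pareto approach.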


SDImpute: A statistical block imputation method based on cell-level and gene-level information for dropouts in single-cell RNA-seq data

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009118 ◽

2021 ◽

Vol 17 (6) ◽

pp. e1009118

Author(s):

Jing Qi ◽

Yang Zhou ◽

Zicen Zhao ◽

Shuilin Jin

Keyword(s):

Gene Expression ◽

Single Cell ◽

Differential Expression Analysis ◽

Cell Types ◽

Rna Seq ◽

Cell Level ◽

Gene Level ◽

Level Information ◽

Downstream Analysis ◽

Gene Expression Levels

Single-cell RNA sequencing (scRNA-seq) technologies measure gene expression at single-cell resolution and provide a tool for exploring cell heterogeneity and cell types. Because of the low number of mRNA copies extracted per cell, scRNA-seq data exhibit a large number of dropouts, which hinders downstream analysis of the data. We propose a statistical method, SDImpute (Single-cell RNA-seq Dropout Imputation), to implement block imputation for dropout events in scRNA-seq data. SDImpute automatically identifies dropout events based on gene expression levels and the variation of gene expression across similar cells and similar genes, and it implements block imputation for dropouts by using gene expression values from similar cells that are unaffected by dropouts. In the experiments, results on simulated and real datasets suggest that SDImpute is an effective tool to recover the data and preserve the heterogeneity of gene expression across cells. Compared with state-of-the-art imputation methods, SDImpute improves the accuracy of downstream analyses, including clustering, visualization, and differential expression analysis.


Identification of sheep lncRNAs related to the immune response to vaccines and aluminium adjuvants

BMC Genomics ◽

10.1186/s12864-021-08086-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Martin Bilbao-Arribas ◽

Endika Varela-Martínez ◽

Naiara Abendaño ◽

Damián de Andrés ◽

Lluís Luján ◽

...

Keyword(s):

Immune Response ◽

Expression Analysis ◽

Expression Patterns ◽

Differential Expression Analysis ◽

Rna Seq ◽

Protein Coding ◽

Protein Coding Genes ◽

Livestock Species ◽

Non Coding Rnas ◽

Aluminium Adjuvant

Abstract Background: Long non-coding RNAs (lncRNAs) are involved in several immune processes, including the immune response to vaccination, but most of them remain uncharacterised in livestock species. The mechanism of action of aluminium adjuvants as vaccine components is also not fully understood. Results: We built a transcriptome from sheep PBMC RNA-seq data in order to identify unannotated lncRNAs and analysed their expression patterns alongside protein-coding genes. We found 2284 novel lncRNAs and assessed their conservation in terms of sequence and synteny. Differential expression analysis between animals inoculated with commercial vaccines or aluminium adjuvant alone, together with co-expression analysis, revealed lncRNAs related to the immune response to vaccines and adjuvants. A group of co-expressed genes enriched in cytokine signalling and production highlighted the differences between treatments. A number of differentially expressed lncRNAs were correlated with a divergently located protein-coding gene, such as the OSM cytokine. Other lncRNAs were predicted to act as sponges of miRNAs involved in immune response regulation. Conclusions: This work enlarges the lncRNA catalogue in sheep and highlights their involvement in the immune response to repetitive vaccination, providing a basis for further characterisation of the non-coding sheep transcriptome within different immune cells.


High temporal resolution RNA-seq time course data reveals mammalian lncRNA activation mirrors neighbouring protein-coding genes

10.1101/2021.08.25.457323 ◽

2021 ◽

Author(s):

Walter Muskovic ◽

Eva Slavich ◽

Ben Maslen ◽

Dominik Kaczorowski ◽

Joseph Cursons ◽

...

Keyword(s):

Gene Expression ◽

Temporal Resolution ◽

Time Course ◽

Human Cells ◽

Careful Examination ◽

High Temporal Resolution ◽

Rna Seq ◽

Protein Coding ◽

Lncrna Expression ◽

Protein Coding Genes

Background: The advent of next-generation sequencing revealed extensive transcription beyond protein-coding genes, identifying tens of thousands of long non-coding RNAs (lncRNAs). Selected functional examples raised the possibility that lncRNAs, as a class, may maintain broad regulatory roles. Compellingly, lncRNA expression is strongly linked with adjacent protein-coding gene expression, suggesting a potential cis-regulatory function. Evidence for these regulatory roles may be obtained through careful examination of the precise timing of lncRNA expression relative to adjacent protein-coding genes. Results: Where causal cis-regulatory relationships exist, lncRNA activation is expected to precede changes in adjacent target gene expression. Using an RNA-seq time course of uniquely high temporal resolution, we profiled the expression dynamics of several thousand lncRNAs and protein-coding genes in synchronized, transitioning human cells. Our findings reveal lncRNAs are expressed synchronously with adjacent protein-coding genes. Analysis of lipopolysaccharide-activated mouse dendritic cells revealed the same temporal relationship observed in transitioning human cells. Conclusion: Our findings suggest broad-scale cis-regulatory roles for lncRNAs are not common. The strong association between lncRNAs and adjacent genes may instead indicate an origin as transcriptional by-products from active protein-coding gene promoters and enhancers.
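The timing argument above (a cis-regulatory lncRNA should lead its target gene, whereas a transcriptional by-product should change synchronously) can be illustrated with a small lagged-correlation sketch in Python. This is a generic illustration, not the authors' analysis: the function names, the `max_lag` parameter, and the toy time series are assumptions.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0


def best_lag(lnc, gene, max_lag=3):
    """Return the lag (in time points) at which `gene` correlates best
    with `lnc`. A positive lag means the lncRNA profile leads the gene,
    zero means the two change synchronously."""
    best_r, best = float("-inf"), 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = lnc[:len(lnc) - lag], gene[lag:]
        else:
            x, y = lnc[-lag:], gene[:len(gene) + lag]
        r = pearson(x, y)
        if r > best_r:
            best_r, best = r, lag
    return best


if __name__ == "__main__":
    lnc = [0, 0, 1, 5, 9, 10, 10, 10]
    delayed_gene = [0, 0, 0, 0, 1, 5, 9, 10]   # same shape, two points later
    print(best_lag(lnc, delayed_gene))  # lncRNA leads
    print(best_lag(lnc, lnc))           # synchronous
```

With sparse sampling, a short true delay would be indistinguishable from a lag of zero, which is why the temporal resolution of the time course matters for this kind of comparison.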


Genetic Dissection of Hypertrophic Cardiomyopathy with Myocardial RNA-Seq

International Journal of Molecular Sciences ◽

10.3390/ijms21093040 ◽

2020 ◽

Vol 21 (9) ◽

pp. 3040 ◽

Cited By ~ 2

Author(s):

Jun Gao ◽

John Collyer ◽

Maochun Wang ◽

Fengping Sun ◽

Fuyi Xu

Keyword(s):

Hypertrophic Cardiomyopathy ◽

Heart Development ◽

Gene Networks ◽

Complex Disease ◽

Differential Expression Analysis ◽

Genetic Modifiers ◽

Rna Seq ◽

Protein Coding ◽

Network Analyses ◽

Protein Coding Genes

Hypertrophic cardiomyopathy (HCM) is an inherited disorder of the myocardium, and pathogenic mutations in the sarcomere genes myosin heavy chain 7 (MYH7) and myosin-binding protein C (MYBPC3) explain 60%–70% of observed clinical cases. The heterogeneity of phenotypes observed in HCM patients, however, suggests that novel causative genes or genetic modifiers likely exist. Here, we systematically evaluated RNA-seq data from 28 HCM patients and 9 healthy controls with pathogenic variant identification, differential expression analysis, and gene co-expression and protein–protein interaction network analyses. We identified 43 potential pathogenic variants in 19 genes in 24 HCM patients. Genes with more than one variant included MYBPC3, TTN, MYH7, PSEN2, and LDB3. A total of 2538 protein-coding genes, six microRNAs (miRNAs), and 1617 long noncoding RNAs (lncRNAs) were identified as differentially expressed between the groups, including several well-characterized cardiomyopathy-related genes (ANKRD1, FHL2, TGFB3, miR-30d, and miR-154). Gene enrichment analysis revealed that those genes are significantly involved in heart development and physiology. Furthermore, we highlighted four subnetworks (mtDNA-subnetwork, DSP-subnetwork, MYH7-subnetwork, and MYBPC3-subnetwork) that could play significant roles in the progression of HCM. Our findings further illustrate that HCM is a complex disease that results from mutations in multiple protein-coding genes, modulation by non-coding RNAs, and perturbations in gene networks.


AtRTD2: A Reference Transcript Dataset for accurate quantification of alternative splicing and expression changes in Arabidopsis thaliana RNA-seq data

10.1101/051938 ◽

2016 ◽

Cited By ~ 4

Author(s):

Runxuan Zhang ◽

Cristiane P. G. Calixto ◽

Yamile Marquez ◽

Peter Venhuizen ◽

Nikoleta A. Tzioutziou ◽

...

Keyword(s):

Gene Expression ◽

Experimental Data ◽

Alternative Splicing ◽

Rna Seq ◽

Protein Coding ◽

Transcript Isoforms ◽

Transcript Quantification ◽

Protein Coding Genes ◽

Genome Wide ◽

Reference Transcript

Abstract Background: Alternative splicing is the major post-transcriptional mechanism by which gene expression is regulated and affects a wide range of processes and responses in most eukaryotic organisms. RNA-sequencing (RNA-seq) can generate genome-wide quantification of individual transcript isoforms to identify changes in expression and alternative splicing. RNA-seq is an essential modern tool but its ability to accurately quantify transcript isoforms depends on the diversity, completeness and quality of the transcript information. Results: We have developed a new Reference Transcript Dataset for Arabidopsis (AtRTD2) for RNA-seq analysis containing over 82k non-redundant transcripts, whereby 74,194 transcripts originate from 27,667 protein-coding genes. A total of 13,524 protein-coding genes have at least one alternatively spliced transcript in AtRTD2 such that about 60% of the 22,453 protein-coding, intron-containing genes in Arabidopsis undergo alternative splicing. More than 600 putative U12 introns were identified in more than 2,000 transcripts. AtRTD2 was generated from transcript assemblies of ca. 8.5 billion pairs of reads from 285 RNA-seq data sets obtained from 129 RNA-seq libraries and merged along with the previous version, AtRTD, and Araport11 transcript assemblies. AtRTD2 increases the diversity of transcripts and through application of stringent filters represents the most extensive and accurate transcript collection for Arabidopsis to date. We have demonstrated a generally good correlation of alternative splicing ratios from RNA-seq data analysed by Salmon and experimental data from high resolution RT-PCR. However, we have observed inaccurate quantification of transcript isoforms for genes with multiple transcripts which have variation in the lengths of their UTRs. This variation is not effectively corrected in RNA-seq analysis programmes and will therefore impact RNA-seq analyses generally. To address this, we have tested different genome-wide modifications of AtRTD2 to improve transcript quantification and alternative splicing analysis. As a result, we release AtRTD2-QUASI specifically for use in Quantification of Alternatively Spliced Isoforms and demonstrate that it outperforms other available transcriptomes for RNA-seq analysis. Conclusions: We have generated a new transcriptome resource for RNA-seq analyses in Arabidopsis (AtRTD2) designed to address quantification of different isoforms and alternative splicing in gene expression studies. Experimental validation of alternative splicing changes identified inaccuracies in transcript quantification due to UTR length variation. To solve this problem, we also release a modified reference transcriptome, AtRTD2-QUASI, for quantification of transcript isoforms, which shows high correlation with experimental data.
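As a quick arithmetic check on the figures quoted in this abstract, the "about 60%" follows directly from the two gene counts, and the transcript and gene counts imply the average number of transcripts per protein-coding gene. The second ratio is derived here for illustration and is not stated in the abstract.

```latex
\[
\frac{13{,}524}{22{,}453} \approx 0.602 \quad (\text{about } 60\%),
\qquad
\frac{74{,}194}{27{,}667} \approx 2.68 \ \text{transcripts per protein-coding gene}.
\]
```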


Integrated modeling of protein-coding genes in the Manduca sexta genome using RNA-seq data from the biochemical model insect

10.1603/ice.2016.110841 ◽

2016 ◽

Cited By ~ 1

Author(s):

Xiaolong Cao

Keyword(s):

Integrated Modeling ◽

Rna Seq ◽

Protein Coding ◽

Protein Coding Genes ◽

Biochemical Model


GC-AG introns features in long non-coding and protein-coding genes suggest their role in gene expression regulation

10.26226/morressier.5ebd45acffea6f735881b039 ◽

2020 ◽

Author(s):

Monah Abou Alezz

Keyword(s):

Gene Expression ◽

Gene Expression Regulation ◽

Expression Regulation ◽

Protein Coding ◽

Protein Coding Genes


Gene Expression Profile and Co-Expression Network of Pearl Gentian Grouper under Cold Stress by Integrating Illumina and PacBio Sequences

Animals ◽

10.3390/ani11061745 ◽

2021 ◽

Vol 11 (6) ◽

pp. 1745

Author(s):

Ben-Ben Miao ◽

Su-Fang Niu ◽

Ren-Xie Wu ◽

Zhen-Bang Liang ◽

Bao-Gui Tang ◽

...

Keyword(s):

Gene Expression ◽

Cold Stress ◽

Differential Expression Analysis ◽

Endocrine System ◽

Control Group ◽

Regulation Mechanism ◽

Rna Seq ◽

Cold Tolerant ◽

Stress Signals ◽

Membrane Changes

Pearl gentian grouper (Epinephelus fuscoguttatus ♀ × Epinephelus lanceolatus ♂) is a fish of high commercial value in the aquaculture industry in Asia. However, this hybrid fish is not cold-tolerant, and the molecular regulation mechanism underlying its response to cold stress remains largely elusive. This study thus investigated the liver transcriptomic responses of pearl gentian grouper by comparing the gene expression of cold stress groups (20, 15, 12, and 12 °C for 6 h) with that of the control group (25 °C) using PacBio SMRT-Seq and Illumina RNA-Seq technologies. In SMRT-Seq analysis, a total of 11,033 full-length transcripts were generated and used as reference sequences for further RNA-Seq analysis. In RNA-Seq analysis, 3271 differentially expressed genes (DEGs), two low-temperature specific modules (tan and blue modules), and two significantly expressed gene sets (profiles 0 and 19) were screened by differential expression analysis, weighted gene co-expression network analysis (WGCNA), and short time-series expression miner (STEM), respectively. The intersection of the above analyses further revealed some key genes, such as PCK, ALDOB, FBP, G6pC, CPT1A, PPARα, SOCS3, PPP1CC, CYP2J, HMGCR, CDKN1B, and GADD45Bc. These genes were significantly enriched in carbohydrate metabolism, lipid metabolism, signal transduction, and endocrine system pathways. All these pathways were linked to biological functions relevant to cold adaptation, such as energy metabolism, stress-induced cell membrane changes, and transduction of stress signals. Taken together, our study explores an overall and complex regulation network of the functional genes in the liver of pearl gentian grouper, which could benefit the species in preventing damage caused by cold stress.


RNA-Seq Data-Mining Allows the Discovery of Two Long Non-Coding RNA Biomarkers of Viral Infection in Humans

International Journal of Molecular Sciences ◽

10.3390/ijms21082748 ◽

2020 ◽

Vol 21 (8) ◽

pp. 2748 ◽

Cited By ~ 1

Author(s):

Ruth Barral-Arca ◽

Alberto Gómez-Carballa ◽

Miriam Cebey-López ◽

María José Currás-Tuala ◽

Sara Pischedda ◽

...

Keyword(s):

Gene Expression ◽

Viral Infections ◽

Umbilical Vein ◽

Cell Types ◽

Dermal Fibroblasts ◽

Learning Approaches ◽

Rna Seq ◽

Wide Range ◽

Healthy Control ◽

Umbilical Vein Endothelial Cells

There is growing interest in unraveling the gene expression mechanisms that lead to viral host invasion and infection progression. Current findings reveal that long non-coding RNAs (lncRNAs) are implicated in the regulation of the immune system by influencing gene expression through a wide range of mechanisms. By mining whole-transcriptome shotgun sequencing (RNA-seq) data using machine learning approaches, we detected two lncRNAs (ENSG00000254680 and ENSG00000273149) that are downregulated in a wide range of viral infections and different cell types, including blood mononuclear cells, umbilical vein endothelial cells, and dermal fibroblasts. The efficiency of these two lncRNAs was positively validated in different viral phenotypic scenarios. These two lncRNAs showed a strong downregulation in virus-infected patients when compared to healthy control transcriptomes, indicating that these biomarkers are promising targets for infection diagnosis. To the best of our knowledge, this is the very first study using host lncRNA biomarkers for the diagnosis of human viral infections.
