Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias

Single cell RNA sequencing (scRNA-seq) has rapidly gained popularity for profiling transcriptomes of hundreds to thousands of single cells. This technology has led to the discovery of novel cell types and revealed insights into the development of complex tissues. However, many technical challenges need to be overcome during data generation. Due to minute amounts of starting material, samples undergo extensive amplification, increasing technical variability. A solution for mitigating amplification biases is to include Unique Molecular Identifiers (UMIs), which tag individual molecules. Transcript abundances are then estimated from the number of unique UMIs aligning to a specific gene, and PCR duplicates resulting in copies of the UMI are not included in expression estimates. Here we investigate the effect of gene length bias in scRNA-seq across a variety of datasets differing in terms of capture technology, library preparation, cell types and species. We find that scRNA-seq datasets that have been sequenced using a full-length transcript protocol exhibit gene length bias akin to bulk RNA-seq data. Specifically, shorter genes tend to have lower counts and a higher rate of dropout. In contrast, protocols that include UMIs do not exhibit gene length bias, and have a mostly uniform rate of dropout across genes of varying length. Across four different scRNA-seq datasets profiling mouse embryonic stem cells (mESCs), we found that the subset of genes detected only in the UMI datasets tended to be shorter, while the subset of genes detected only in the full-length datasets tended to be longer. We briefly discuss the role of these genes in the context of differential expression testing and GO analysis. In addition, despite clear differences between UMI and full-length transcript data, we illustrate that full-length and UMI data can be combined to reveal underlying biology influencing expression of mESCs.
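The UMI counting idea described in this abstract can be illustrated with a minimal sketch. It assumes reads have already been aligned and reduced to (cell, gene, UMI) triples; the function name `umi_counts` and the toy reads are hypothetical and are not taken from the paper. Real pipelines typically also correct UMI sequencing errors, which this sketch does not attempt.

```python
from collections import defaultdict


def umi_counts(reads):
    """Count distinct UMIs per (cell, gene) pair.

    `reads` is an iterable of (cell, gene, umi) tuples. Reads that share
    the same cell, gene and UMI are treated as PCR duplicates and are
    counted once.
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Toy input (illustrative only, not data from the study).
reads = [
    ("cell1", "GeneA", "AACG"),
    ("cell1", "GeneA", "AACG"),  # PCR duplicate of the read above
    ("cell1", "GeneA", "TTGC"),
    ("cell1", "GeneB", "AACG"),
]
counts = umi_counts(reads)
```

Because each molecule is counted once regardless of how many reads it produced, the estimate does not grow with the number of fragments a long transcript yields, which is consistent with the absence of gene length bias reported for UMI protocols.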

Gene length and detection bias in single cell RNA sequencing protocols

F1000Research ◽ 10.12688/f1000research.11290.1 ◽ 2017 ◽ Vol 6 ◽ pp. 595 ◽ Cited By ~ 33

Author(s): Belinda Phipson ◽ Luke Zappia ◽ Alicia Oshlack

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Cell Types ◽ Full Length ◽ Specific Gene ◽ Gene Length ◽ Length Bias ◽ Single Cell RNA Sequencing ◽ Full Length Transcript ◽ Gene Length Bias

Background: Single cell RNA sequencing (scRNA-seq) has rapidly gained popularity for profiling transcriptomes of hundreds to thousands of single cells. This technology has led to the discovery of novel cell types and revealed insights into the development of complex tissues. However, many technical challenges need to be overcome during data generation. Due to minute amounts of starting material, samples undergo extensive amplification, increasing technical variability. A solution for mitigating amplification biases is to include unique molecular identifiers (UMIs), which tag individual molecules. Transcript abundances are then estimated from the number of unique UMIs aligning to a specific gene, with PCR duplicates resulting in copies of the UMI not included in expression estimates. Methods: Here we investigate the effect of gene length bias in scRNA-seq across a variety of datasets that differ in terms of capture technology, library preparation, cell types and species. Results: We find that scRNA-seq datasets that have been sequenced using a full-length transcript protocol exhibit gene length bias akin to bulk RNA-seq data. Specifically, shorter genes tend to have lower counts and a higher rate of dropout. In contrast, protocols that include UMIs do not exhibit gene length bias, with a mostly uniform rate of dropout across genes of varying length. Across four different scRNA-seq datasets profiling mouse embryonic stem cells (mESCs), we found that the subset of genes detected only in the UMI datasets tended to be shorter, while the subset of genes detected only in the full-length datasets tended to be longer. Conclusions: We find that the choice of scRNA-seq protocol influences the detection rate of genes, and that full-length datasets exhibit gene length bias. In addition, despite clear differences between UMI and full-length transcript data, we illustrate that full-length and UMI data can be combined to reveal the underlying biology influencing expression of mESCs.
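The relationship between gene length and dropout described in the Results can be checked with a short sketch. It assumes a per-gene list of counts across cells and a per-gene length in base pairs; the function name `dropout_by_length_bin`, the equal-sized binning and the toy values are all hypothetical choices for illustration, not the authors' method or data.

```python
def dropout_by_length_bin(counts, lengths, n_bins=2):
    """Mean dropout rate per gene-length bin, shortest genes first.

    `counts` maps gene -> list of per-cell counts.
    `lengths` maps gene -> gene length in base pairs.
    Dropout for a gene is the fraction of cells with a zero count.
    Genes are sorted by length and split into up to `n_bins` chunks.
    """
    genes = sorted(counts, key=lambda g: lengths[g])
    size = -(-len(genes) // n_bins)  # ceiling division
    rates = []
    for start in range(0, len(genes), size):
        chunk = genes[start:start + size]
        per_gene = [
            sum(1 for c in counts[g] if c == 0) / len(counts[g])
            for g in chunk
        ]
        rates.append(sum(per_gene) / len(per_gene))
    return rates


# Toy values (illustrative only, not results from the study).
counts = {
    "g1": [0, 0, 1, 0],
    "g2": [0, 2, 0, 1],
    "g3": [3, 1, 0, 4],
    "g4": [5, 2, 6, 1],
}
lengths = {"g1": 500, "g2": 900, "g3": 4000, "g4": 12000}
rates = dropout_by_length_bin(counts, lengths, n_bins=2)
```

In a full-length dataset with the bias described above, the first (short-gene) bins would be expected to show higher dropout than the last; in a UMI dataset the rates would be expected to be roughly flat across bins.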

NKL Homeobox Gene VENTX Is Part of a Regulatory Network in Human Conventional Dendritic Cells

International Journal of Molecular Sciences ◽ 10.3390/ijms22115902 ◽ 2021 ◽ Vol 22 (11) ◽ pp. 5902

Author(s): Stefan Nagel ◽ Claudia Pommerenke ◽ Corinna Meyer ◽ Hans G. Drexler

Keyword(s): Dendritic Cells ◽ Transcription Factors ◽ Cell Lines ◽ Expression Profiling ◽ Regulatory Network ◽ Homeobox Gene ◽ Leukemia Cell Line ◽ Specific Gene ◽ RNA Seq ◽ And Function

Recently, we documented a hematopoietic NKL-code mapping physiological expression patterns of NKL homeobox genes in human myelopoiesis, including monocytes and their derived dendritic cells (DCs). Here, we enlarge this map to include normal NKL homeobox gene expression in progenitor-derived DCs. Analysis of public gene expression profiling and RNA-seq datasets containing plasmacytoid and conventional dendritic cells (pDC and cDC) demonstrated HHEX activity in both entities, while cDCs additionally expressed VENTX. The consequent aim of our study was to examine the regulation and function of VENTX in DCs. We compared profiling data of VENTX-positive cDC and monocytes with VENTX-negative pDC and common myeloid progenitor entities and revealed several differentially expressed genes encoding transcription factors and pathway components, representing potential VENTX regulators. Screening of RNA-seq data for 100 leukemia/lymphoma cell lines identified prominent VENTX expression in an acute myelomonocytic leukemia cell line, MUTZ-3, which contains inv(3)(q21q26) and t(12;22)(p13;q11) and represents a model for DC differentiation studies. Furthermore, extended gene analyses indicated that MUTZ-3 is associated with the subtype cDC2. In addition to analysis of public chromatin immunoprecipitation data, subsequent knockdown experiments and modulations of signaling pathways in MUTZ-3 and control cell lines confirmed the identified candidate transcription factors CEBPB, ETV6, EVI1, GATA2, IRF2, MN1, SPIB, and SPI1 and the CSF-, NOTCH-, and TNFa-pathways as VENTX regulators. Live-cell imaging analyses of MUTZ-3 cells treated for VENTX knockdown excluded impacts on apoptosis or induced alteration of differentiation-associated cell morphology. In contrast, target gene analysis performed by expression profiling of knockdown-treated MUTZ-3 cells revealed VENTX-mediated activation of several cDC-specific genes including CSFR1, EGR2, and MIR10A and inhibition of pDC-specific genes like RUNX2. Taken together, we added NKL homeobox gene activities for progenitor-derived DCs to the NKL-code, showing that VENTX is expressed in cDCs but not in pDCs and forms part of a cDC-specific gene regulatory network operating in DC differentiation and function.

A tumor microenvironment-specific gene expression signature predicts chemotherapy resistance in colorectal cancer patients

npj Precision Oncology ◽ 10.1038/s41698-021-00142-x ◽ 2021 ◽ Vol 5 (1)

Author(s): Xiaoqiang Zhu ◽ Xianglong Tian ◽ Linhua Ji ◽ Xinyu Zhang ◽ Yingying Cao ◽ ...

Keyword(s): Colorectal Cancer ◽ Tumor Microenvironment ◽ Drug Sensitivity ◽ Chemotherapy Resistance ◽ Gene Expression Signature ◽ Prediction Algorithm ◽ Specific Gene ◽ Chemotherapy Response ◽ RNA Seq ◽ Mesenchymal Transition

Studies have shown that the tumor microenvironment (TME) might affect drug sensitivity and the classification of colorectal cancer (CRC). Using a TME-specific gene signature to identify CRC subtypes with distinctive clinical relevance has not yet been tested. A total of 18 “bulk” RNA-seq datasets (total n = 2269) and four single-cell RNA-seq datasets were included in this study. We constructed a “Signature associated with FOLFIRI resistant and Microenvironment” (SFM) that could discriminate both TME and drug sensitivity. Further, SFM subtypes were identified using K-means clustering and verified in three independent cohorts. The nearest template prediction algorithm was used to predict drug response. TME estimation was performed by the CIBERSORT and microenvironment cell populations-counter (MCP-counter) methods. We identified six SFM subtypes based on the SFM signature that discriminated both TME and drug sensitivity. The SFM subtypes were associated with distinct clinicopathological, molecular and phenotypic characteristics, specific enrichments of gene signatures, signaling pathways, prognosis, gut microbiome patterns, and tumor lymphocyte infiltration. Among them, SFM-C and -F were immune suppressive. SFM-F had a higher stromal fraction with an epithelial-to-mesenchymal transition phenotype, while SFM-C was characterized by a microsatellite instability phenotype that was responsive to immunotherapy. SFM-D, -E, and -F were sensitive to FOLFIRI and FOLFOX, while SFM-A, -B, and -C were responsive to EGFR inhibitors. Finally, SFM subtypes had strong prognostic value, with SFM-E and -F showing worse survival than other subtypes. SFM subtypes enable the stratification of CRC by potential chemotherapy response, thereby providing more precise therapeutic options for these patients.

Promoter G-quadruplexes and transcription factors cooperate to shape the cell type-specific transcriptome

Nature Communications ◽ 10.1038/s41467-021-24198-2 ◽ 2021 ◽ Vol 12 (1)

Author(s): Sara Lago ◽ Matteo Nadai ◽ Filippo M. Cernilogar ◽ Maryam Kazerani ◽ Helena Domíniguez Moreno ◽ ...

Keyword(s): Transcription Factors ◽ Binding Sites ◽ Specific Gene ◽ Open Chromatin ◽ RNA Seq ◽ Transcript Levels ◽ Promoter Sequences ◽ G Quadruplex ◽ Cell Type Specific ◽ Folding State

Cell identity is maintained by activation of cell-specific gene programs, regulated by epigenetic marks, transcription factors and chromatin organization. DNA G-quadruplex (G4)-folded regions in cells were reported to be associated with either increased or decreased transcriptional activity. By G4-ChIP-seq/RNA-seq analysis on liposarcoma cells we confirmed that G4s in promoters are invariably associated with high transcription levels in open chromatin. Comparing G4 presence, location and transcript levels in liposarcoma cells to available data on keratinocytes, we showed that the same promoter sequences of the same genes in the two cell lines had different G4-folding states: high transcript levels were consistently associated with G4-folding. Transcription factors AP-1 and SP1, whose binding sites were the most significantly represented in G4-folded sequences, coimmunoprecipitated with their G4-folded promoters. Thus, G4s and their associated transcription factors cooperate to determine cell-specific transcriptional programs, making G4s emerge strongly as new epigenetic regulators of the transcription machinery.

Neuromuscular junction-specific genes screening by deep RNA-seq analysis

Cell & Bioscience ◽ 10.1186/s13578-021-00590-9 ◽ 2021 ◽ Vol 11 (1)

Author(s): Tiankun Hui ◽ Hongyang Jing ◽ Xinsheng Lai

Keyword(s): Ion Channel ◽ Neuromuscular Junctions ◽ Gene Set Enrichment Analysis ◽ Differentially Expressed ◽ Muscle Membrane ◽ Receptor Interaction ◽ Specific Gene ◽ RNA Seq ◽ Channel Activity ◽ PPI Networks

Background: Neuromuscular junctions (NMJs) are chemical synapses formed between motor neurons and skeletal muscle fibers and are essential for controlling muscle contraction. NMJ dysfunction causes motor disorders, muscle wasting, and even breathing difficulties. Increasing evidence suggests that many NMJ disorders are closely related to alterations in specific gene products that are highly concentrated in the synaptic region of the muscle. However, many of these proteins are still undiscovered. Thus, screening for NMJ-specific proteins is essential for studying NMJ and the pathogenesis of NMJ diseases. Results: In this study, synaptic regions (SRs) and nonsynaptic regions (NSRs) of diaphragm samples from newborn (P0) and adult (3-month-old) mice were used for RNA-seq. A total of 92 and 182 genes were identified as differentially expressed between the SR and NSR in newborn and adult mice, respectively. Meanwhile, a total of 1563 genes were identified as differentially expressed between the newborn SR and adult SR. Gene Ontology (GO) enrichment analyses, Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis and gene set enrichment analysis (GSEA) of the differentially expressed genes (DEGs) were performed. Protein–protein interaction (PPI) networks were constructed using STRING and Cytoscape. Further analysis identified some novel proteins and pathways that may be important for NMJ development, maintenance and maturation. Specifically, Sv2b, Ptgir, Gabrb3, P2rx3, Dlgap1 and Rims1 may play roles in NMJ development. Hcn1 may localize to the muscle membrane to regulate NMJ maintenance. Trim63, Fbxo32 and several Asb family proteins may regulate muscle development-related processes. Conclusion: Here, we present a complete dataset describing the spatiotemporal transcriptome changes in synaptic genes and important synaptic pathways. The neuronal projection-related pathway, ion channel activity and neuroactive ligand-receptor interaction pathway are important for NMJ development. The myelination and voltage-gated ion channel activity pathway may be important for NMJ maintenance. These data will facilitate the understanding of the molecular mechanisms underlying the development and maintenance of NMJ and the pathogenesis of NMJ disorders.

Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network

Nature Communications ◽ 10.1038/s41467-021-23143-7 ◽ 2021 ◽ Vol 12 (1)

Author(s): Mathys Grapotte ◽ Manu Saraswat ◽ Chloé Bessière ◽ Christophe Menichelli ◽ Jordan A. Ramilowski ◽ ...

Keyword(s): Transcription Initiation ◽ Tandem Repeats ◽ Specific Gene ◽ RNA Seq ◽ Transcription Start Sites ◽ Long Read ◽ Cap Analysis ◽ DNA Tandem Repeats ◽ Short Tandem ◽ STR Polymorphism

Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long-read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.

Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network

10.1101/2020.07.10.195636 ◽ 2020

Author(s): Mathys Grapotte ◽ Manu Saraswat ◽ Chloé Bessière ◽ Christophe Menichelli ◽ Jordan A. Ramilowski ◽ ...

Keyword(s): Transcription Initiation ◽ Tandem Repeats ◽ Specific Gene ◽ RNA Seq ◽ Transcription Start Sites ◽ DNA Motif ◽ Long Reads ◽ Cap Analysis ◽ DNA Tandem Repeats ◽ Short Tandem

Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of Transcription Start Sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probed these unassigned TSSs and showed that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we developed Cap Trap RNA-seq, a technology which combines cap trapping and long-read MinION sequencing. We trained sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveiled the importance of STR surrounding sequences not only to distinguish STR classes, as defined by the repeated DNA motif, from one another, but also to predict their transcription. Excitingly, our models predicted that genetic variants linked to human diseases affect STR-associated transcription and correspond precisely to the key positions identified by our models to predict transcription. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.

Importance of experimental information (metadata) for archived sequence data: case of specific gene bias due to lag time between sample harvest and RNA protection in RNA sequencing

PeerJ ◽ 10.7717/peerj.11875 ◽ 2021 ◽ Vol 9 ◽ pp. e11875

Author(s): Tomoko Matsuda

Keyword(s): Gene Expression ◽ RNA Sequencing ◽ Time Course ◽ Sequence Data ◽ Specific Gene ◽ Time Interval ◽ Short Time Interval ◽ RNA Seq ◽ Lysis Buffer ◽ RNA Protection

Large volumes of high-throughput sequencing data have been submitted to the Sequence Read Archive (SRA). The lack of experimental metadata associated with the data makes it very difficult to reuse the data and to understand their quality. In the case of RNA sequencing (RNA-Seq), which reveals the presence and quantity of RNA in a biological sample at a given moment, it is necessary to consider that gene expression responds over a short time interval (several seconds to a few minutes) in many organisms. Therefore, to isolate RNA that accurately reflects the transcriptome at the point of harvest, raw biological samples should be processed as soon as possible by freezing in liquid nitrogen, immersing in RNA stabilization reagent, or lysing and homogenizing in RNA lysis buffer containing guanidine thiocyanate. As the number of samples handled simultaneously increases, the time until the RNA is protected can increase. Here, to evaluate the effect of different lag times in RNA protection on RNA-Seq data, we harvested CHO-S cells after 3, 5, 6, and 7 days of cultivation, added RNA lysis buffer in a time course of 15, 30, 45, and 60 min after harvest, and conducted RNA-Seq. These RNA samples showed high RNA integrity number (RIN) values indicating non-degraded RNA, and sequence data from libraries prepared with these RNA samples were of high quality according to FastQC. We observed that, on the same cultivation day, global trends of gene expression were similar across the time course of addition of RNA lysis buffer; however, the expression of some genes differed significantly between the time-course samples of the same cultivation day, and most of these differentially expressed genes were related to apoptosis. We conclude that the time lag between sample harvest and RNA protection influences the expression of specific genes. It is, therefore, necessary to know not only the RIN values of the RNA and the quality of the sequence data but also how the experiment was performed when acquiring RNA-Seq data from the database.
