Alevin-fry unlocks rapid, accurate, and memory-frugal quantification of single-cell RNA-seq data

2021 ◽  
Author(s):  
Dongze He ◽  
Mohsen Zakeri ◽  
Hirak Sarkar ◽  
Charlotte Soneson ◽  
Avi Srivastava ◽  
...  

The rapid growth of high-throughput single-cell and single-nucleus RNA sequencing technologies has produced a wealth of data over the past few years. The available technologies continue to evolve and experiments continue to increase in both number and scale. The size, volume, and distinctive characteristics of these data necessitate the development of new software and associated computational methods to accurately and efficiently quantify single-cell and single-nucleus RNA-seq data into count matrices that constitute the input to downstream analyses. We introduce the alevin-fry framework for quantifying single-cell and single-nucleus RNA-seq data. Despite being faster and more memory frugal than other accurate and scalable quantification approaches, alevin-fry does not suffer from the false positive expression or memory scalability issues that are exhibited by other lightweight tools. We demonstrate how alevin-fry can be effectively used to quantify single-cell and single-nucleus RNA-seq data, and also how the spliced and unspliced molecule quantification required as input for RNA velocity analyses can be seamlessly extracted from the same preprocessed data used to generate regular gene expression count matrices.

2019 ◽  
Author(s):  
Marcus Alvarez ◽  
Elior Rahmani ◽  
Brandon Jew ◽  
Kristina M. Garske ◽  
Zong Miao ◽  
...  

Abstract Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. Contrary to single-cell RNA-seq (scRNA-seq), we observe that snRNA-seq is commonly subject to contamination by high amounts of extranuclear background RNA, which can lead to identification of spurious cell types in downstream clustering analyses if overlooked. We present a novel approach to remove debris-contaminated droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distribution of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: 1) human differentiating preadipocytes in vitro, 2) fresh mouse brain tissue, and 3) human frozen adipose tissue (AT) from six individuals. All three data sets showed various degrees of extranuclear RNA contamination. We observed that existing methods fail to account for contaminated droplets and lead to spurious cell types. When compared to filtering using these state-of-the-art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq data, we also successfully applied DIEM to single-cell data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data fast and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem.
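
The idea of estimating debris and cell-type expression distributions with EM can be illustrated with a minimal sketch. This is not the DIEM implementation: it assumes a plain multinomial mixture, and the function `em_multinomial_mixture`, the simulated profiles, and the count-based initialization are all hypothetical choices made only for this example.

```python
import math
import random


def em_multinomial_mixture(counts, resp, n_iter=25, pseudocount=0.01):
    """EM for a K-component multinomial mixture (illustrative assumption).

    counts: per-droplet gene counts, a list of n_droplets lists of n_genes ints.
    resp:   initial soft assignments (n_droplets x K), each row summing to 1.
    Returns (resp, profiles, weights) after n_iter iterations.
    """
    n, g, k = len(counts), len(counts[0]), len(resp[0])
    for _ in range(n_iter):
        # M-step: mixing weights and per-component gene profiles.
        weights = [sum(r[c] for r in resp) / n for c in range(k)]
        profiles = []
        for c in range(k):
            totals = [
                pseudocount + sum(resp[i][c] * counts[i][j] for i in range(n))
                for j in range(g)
            ]
            norm = sum(totals)
            profiles.append([t / norm for t in totals])
        # E-step: posterior probability of each component for each droplet,
        # computed in log space and shifted by the maximum for stability.
        new_resp = []
        for row in counts:
            logp = [
                math.log(weights[c] + 1e-300)
                + sum(x * math.log(profiles[c][j]) for j, x in enumerate(row))
                for c in range(k)
            ]
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            z = sum(unnorm)
            new_resp.append([u / z for u in unnorm])
        resp = new_resp
    return resp, profiles, weights


# Simulated toy data: component 0 plays the role of debris, component 1 of nuclei.
rng = random.Random(0)
n_genes = 20
debris_profile = [5.0] * 5 + [1.0] * 15   # hypothetical background signature
nucleus_profile = [1.0] * 15 + [5.0] * 5  # hypothetical nuclear signature


def simulate(profile, depth):
    """Draw `depth` reads from a gene profile and tabulate them per gene."""
    row = [0] * n_genes
    for j in rng.choices(range(n_genes), weights=profile, k=depth):
        row[j] += 1
    return row


counts = [simulate(debris_profile, 60) for _ in range(30)]
counts += [simulate(nucleus_profile, 300) for _ in range(30)]
truth = [0] * 30 + [1] * 30

# Assumed initialization: droplets with low total counts start as likely debris.
cutoff = sorted(sum(r) for r in counts)[len(counts) // 2]
init = [[0.9, 0.1] if sum(r) < cutoff else [0.1, 0.9] for r in counts]

resp, profiles, weights = em_multinomial_mixture(counts, init)
labels = [0 if r[0] > r[1] else 1 for r in resp]
```

In this toy setting, droplets whose posterior favors component 0 would be the ones flagged for removal; real data would need more components and a more careful initialization than the simple count cutoff assumed here.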


2018 ◽  
Author(s):  
Kedar Nath Natarajan ◽  
Zhichao Miao ◽  
Miaomiao Jiang ◽  
Xiaoyun Huang ◽  
Hongpo Zhou ◽  
...  

Abstract All single-cell RNA-seq protocols and technologies require library preparation prior to sequencing on a platform such as Illumina. Here, we present the first report to utilize the BGISEQ-500 platform for scRNA-seq, and compare the sensitivity and accuracy to Illumina sequencing. We generate a scRNA-seq resource of 468 unique single cells and 1,297 matched single cDNA samples, performing SMARTer and Smart-seq2 protocols on mESCs and K562 cells with RNA spike-ins. We sequence these libraries on both BGISEQ-500 and Illumina HiSeq platforms using single- and paired-end reads. The two platforms have comparable sensitivity and accuracy in terms of quantification of gene expression, and low technical variability. Our study provides a standardised scRNA-seq resource to benchmark new scRNA-seq library preparation protocols and sequencing platforms.


2016 ◽  
Author(s):  
Bo Wang ◽  
Junjie Zhu ◽  
Emma Pierson ◽  
Daniele Ramazzotti ◽  
Serafim Batzoglou

Abstract Single-cell RNA-seq technologies enable high-throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical to identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. Here, we propose a novel similarity-learning framework, SIMLR (single-cell interpretation via multi-kernel learning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization applications. Benchmarking against state-of-the-art methods for these applications, we used SIMLR to re-analyse seven representative single-cell data sets, including high-throughput droplet-based data sets with tens of thousands of cells. We show that SIMLR greatly improves clustering sensitivity and accuracy, as well as the visualization and interpretability of the data.


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. 4633-4633
Author(s):  
William L. Hwang ◽  
Karthik Jagadeesh ◽  
Jimmy Guo ◽  
Hannah I. Hoffman ◽  
Orr Ashenberg ◽  
...  

4633 Background: Pancreatic ductal adenocarcinoma (PDAC) remains a treatment-refractory disease and existing molecular subtypes do not inform clinical decisions. Previously identified bulk transcriptomic subtypes of PDAC were often unintentionally driven by “contaminating” stroma. RNA extraction from pancreatic tissue is difficult and prior single-cell RNA-seq efforts have been limited by suboptimal dissociation/RNA quality and poor performance in the setting of neoadjuvant treatment. We developed a robust single-nucleus RNA-seq (sNuc-seq) technique compatible with frozen archival PDAC specimens. Methods: Single nuclei suspensions were extracted from frozen primary PDAC specimens (n = 27) derived from patients with (borderline)-resectable PDAC who underwent surgical resection with or without neoadjuvant chemoradiotherapy (CRT). Approximately 170,000 nuclei were processed with the 10x Genomics Single Cell 3’ v3 pipeline and gene expression libraries were sequenced (Illumina HiSeq X). Results: Distinct nuclei clusters with gene expression profiles/inferred copy number variation analysis consistent with neoplastic, acinar, ductal, fibroblast, endothelial, endocrine, lymphocyte, and myeloid populations were identified with proportions similar to corresponding multiplexed ion beam imaging. Non-negative matrix factorization revealed intra-tumoral heterogeneity shared across patients. Neoplastic cells featured eight distinct transcriptional topics characterized by developmental (epithelial, mesenchymal, endoderm progenitor, neural progenitor) and environmental (anabolic, catabolic, cycling, hypoxic) programs. CAFs exhibited four different transcriptional topics (activated/desmoplastic, myofibroblast, neurogenic, osteochondral). Differential gene expression and gene set enrichment analyses demonstrated that CRT was associated with an enrichment in myogenic programs in CAFs, activation pathways in immune cells, and type I/II interferons in malignant cells. CRT was also associated with a depletion in developmental programs within malignant cells. Conclusions: We uncovered significant intratumoral heterogeneity and treatment-associated differences in the malignant, fibroblast, and immune compartments of PDAC using sNuc-seq. Deconvolution of clinically-annotated bulk RNA-seq cohorts and characterization of intercellular interactions with receptor-ligand analysis and spatial transcriptomics are ongoing.


2015 ◽  
Author(s):  
Stephanie C Hicks ◽  
F. William Townes ◽  
Mingxiang Teng ◽  
Rafael A Irizarry

Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used such technology, and numerous publications are based on data produced with it. However, RNA-Seq and scRNA-seq data are markedly different. In particular, unlike RNA-Seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a sufficient level to be detected by sequencing technology). Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch effects and confounded experiments can intensify the problem.
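
The quantity at the center of this abstract, the proportion of zero counts per cell, is simple to compute from a count matrix. The sketch below is only an illustration: the helper names and the toy matrix (rows as cells, columns as genes) are assumptions for this example, not data or code from the study.

```python
def zero_fraction_per_cell(counts):
    """Fraction of genes with a zero count in each cell (rows are cells)."""
    return [sum(1 for x in row if x == 0) / len(row) for row in counts]


def zero_fraction_per_gene(counts):
    """Fraction of cells with a zero count for each gene (columns are genes)."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    return [
        sum(1 for row in counts if row[j] == 0) / n_cells
        for j in range(n_genes)
    ]


# Hypothetical toy count matrix: 3 cells x 4 genes.
toy_counts = [
    [0, 3, 0, 1],
    [0, 0, 0, 7],
    [2, 5, 0, 4],
]

per_cell = zero_fraction_per_cell(toy_counts)
per_gene = zero_fraction_per_gene(toy_counts)
```

Comparing `per_cell` across batches, or `per_gene` against mean expression, is one plain way to look for the cell-to-cell and expression-dependent patterns the abstract describes.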


2018 ◽  
Author(s):  
Daniel Alpern ◽  
Vincent Gardeux ◽  
Julie Russeil ◽  
Bart Deplancke

Abstract Genome-wide gene expression analyses by RNA sequencing (RNA-seq) have quickly become a standard in molecular biology because of the widespread availability of high throughput sequencing technologies. While powerful, RNA-seq still has several limitations, including the time and cost of library preparation, which makes it difficult to profile many samples simultaneously. To deal with these constraints, the single-cell transcriptomics field has implemented the early multiplexing principle, making the library preparation of hundreds of samples (cells) markedly more affordable. However, the current standard methods for bulk transcriptomics (such as TruSeq Stranded mRNA) remain expensive, and relatively little effort has been invested to develop cheaper, but equally robust methods. Here, we present a novel approach, Bulk RNA Barcoding and sequencing (BRB-seq), that combines the multiplexing-driven cost-effectiveness of a single-cell RNA-seq workflow with the performance of a bulk RNA-seq procedure. BRB-seq produces 3’ enriched cDNA libraries that exhibit similar gene expression quantification to TruSeq and that maintain this quality, also in terms of number of detected differentially expressed genes, even with low quality RNA samples. We show that BRB-seq is about 25 times less expensive than TruSeq, enabling the generation of ready to sequence libraries for up to 192 samples in a day with only 2 hours of hands-on time. We conclude that BRB-seq constitutes a powerful alternative to TruSeq as a standard bulk RNA-seq approach. Moreover, we anticipate that this novel method will eventually replace RT-qPCR-based gene expression screens given its capacity to generate genome-wide transcriptomic data at a cost that is comparable to profiling 4 genes using RT-qPCR.

Software We developed a suite of open source tools (BRB-seqTools) to aid with processing BRB-seq data and generating count matrices that are used for further analyses. This suite can perform demultiplexing, generate count/UMI matrices and trim BRB-seq constructs and is freely available at http://github.com/DeplanckeLab/BRB-seqTools

Highlights
Rapid (~2h hands on time) and low-cost approach to perform transcriptomics on hundreds of RNA samples
Strand specificity preserved
Performance: number of detected genes is equal to Illumina TruSeq Stranded mRNA at same sequencing depth
High capacity: low cost allows increasing the number of biological replicates
Produces reliable data even with low quality RNA samples (down to RIN value = 2)
Complete user-friendly sequencing data pre-processing and analysis pipeline allowing result acquisition in a day


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Matthieu Dos Santos ◽  
Stéphanie Backer ◽  
Benjamin Saintpierre ◽  
Brigitte Izac ◽  
Muriel Andrieu ◽  
...  

Abstract Skeletal muscle fibers are large syncytia but it is currently unknown whether gene expression is coordinately regulated in their numerous nuclei. Here we show by snRNA-seq and snATAC-seq that slow, fast, myotendinous and neuromuscular junction myonuclei each have different transcriptional programs, associated with distinct chromatin states and combinations of transcription factors. In adult mice, identified myofiber types predominantly express either a slow or one of the three fast isoforms of Myosin heavy chain (MYH) proteins, while a small number of hybrid fibers can express more than one MYH. By snRNA-seq and FISH, we show that the majority of myonuclei within a myofiber are synchronized, coordinately expressing only one fast Myh isoform with a preferential panel of muscle-specific genes. Importantly, this coordination of expression occurs early during post-natal development and depends on innervation. These findings highlight a previously undefined mechanism of coordination of gene expression in a syncytium.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Tracy M. Yamawaki ◽  
Daniel R. Lu ◽  
Daniel C. Ellwanger ◽  
Dev Bhatt ◽  
Paolo Manzanillo ◽  
...  

Abstract Background: Elucidation of immune populations with single-cell RNA-seq has greatly benefited the field of immunology by deepening the characterization of immune heterogeneity and leading to the discovery of new subtypes. However, single-cell methods inherently suffer from limitations in the recovery of complete transcriptomes due to the prevalence of cellular and transcriptional dropout events. This issue is often compounded by limited sample availability and limited prior knowledge of heterogeneity, which can confound data interpretation. Results: Here, we systematically benchmarked seven high-throughput single-cell RNA-seq methods. We prepared 21 libraries under identical conditions of a defined mixture of two human and two murine lymphocyte cell lines, simulating heterogeneity across immune-cell types and cell sizes. We evaluated methods by their cell recovery rate, library efficiency, sensitivity, and ability to recover expression signatures for each cell type. We observed higher mRNA detection sensitivity with the 10x Genomics 5′ v1 and 3′ v3 methods. We demonstrate that these methods have fewer dropout events, which facilitates the identification of differentially-expressed genes and improves the concordance of single-cell profiles to immune bulk RNA-seq signatures. Conclusion: Overall, our characterization of immune cell mixtures provides useful metrics, which can guide selection of a high-throughput single-cell RNA-seq method for profiling more complex immune-cell heterogeneity usually found in vivo.


2017 ◽  
Vol 37 (17) ◽  
pp. 12-13
Author(s):  
Jennifer Chew ◽  
Adam Bemis ◽  
Ronald Lebofsky ◽  
Anna Quinlan ◽  
Kelly Kaihara