Red panda: a novel method for detecting variants in single-cell RNA sequencing

BMC Genomics ◽  
2020 ◽  
Vol 21 (S11) ◽  
Author(s):  
Adam Cornish ◽  
Shrabasti Roychoudhury ◽  
Krishna Sarma ◽  
Suravi Pramanik ◽  
Kishor Bhakat ◽  
...  

Abstract Background: Single-cell sequencing enables us to better understand genetic diseases, such as cancer or autoimmune disorders, which are often affected by changes in rare cells. Currently, no existing software is aimed at identifying single nucleotide variations or micro (1-50 bp) insertions and deletions in single-cell RNA sequencing (scRNA-seq) data. Generating high-quality variant data is vital to the study of the aforementioned diseases, among others. Results: In this study, we report the design and implementation of Red Panda, a novel method to accurately identify variants in scRNA-seq data. Variants were called on scRNA-seq data from human articular chondrocytes, mouse embryonic fibroblasts (MEFs), and simulated data stemming from the MEF alignments. Red Panda had the highest Positive Predictive Value at 45.0%, while the other tools (FreeBayes, GATK HaplotypeCaller, GATK UnifiedGenotyper, Monovar, and Platypus) ranged from 5.8% to 41.53%. On the simulated data, Red Panda had the highest sensitivity at 72.44%. Conclusions: We show that our method provides a novel and improved mechanism to identify variants in scRNA-seq data compared with existing software. However, methods for identifying genomic variants using scRNA-seq data can still be improved.
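The two figures of merit quoted in this abstract, Positive Predictive Value and sensitivity, follow from counts of true-positive, false-positive, and false-negative calls. The sketch below shows the standard definitions only; the function names and the example counts are hypothetical and are not taken from the Red Panda study.

```python
def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of called variants that are real: TP / (TP + FP)."""
    called = tp + fp
    return tp / called if called else 0.0


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real variants that were called: TP / (TP + FN)."""
    real = tp + fn
    return tp / real if real else 0.0


# Hypothetical counts chosen for illustration, not data from the paper.
example_ppv = positive_predictive_value(tp=45, fp=55)
example_sensitivity = sensitivity(tp=45, fn=15)
```

A caller that reports more variants tends to gain sensitivity and lose Positive Predictive Value, which is why the two numbers are reported side by side.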

Author(s):  
Adam Cornish ◽  
Shrabasti Roychoudhury ◽  
Krishna Sarma ◽  
Suravi Pramanik ◽  
Kishor Bhakat ◽  
...  

Abstract: Single-cell sequencing enables us to better understand genetic diseases, such as cancer or autoimmune disorders, which are often affected by changes in rare cells. Currently, no existing software is aimed at identifying single nucleotide variations or micro (1-50 bp) insertions and deletions in single-cell RNA sequencing (scRNA-seq) data. Generating high-quality variant data is vital to the study of the aforementioned diseases, among others. In this study, we report the design and implementation of Red Panda, a novel method to accurately identify variants in scRNA-seq data. Variants were called on scRNA-seq data from human articular chondrocytes, mouse embryonic fibroblasts (MEFs), and simulated data stemming from the MEF alignments. Red Panda had the highest Positive Predictive Value at 45.0%, while the other tools (FreeBayes, GATK HaplotypeCaller, GATK UnifiedGenotyper, Monovar, and Platypus) ranged from 5.8% to 41.53%. On the simulated data, Red Panda had the highest sensitivity at 72.44%. We show that our method provides a novel and improved mechanism to identify variants in scRNA-seq data compared with existing software. Availability: Source code is freely available under the MIT License at https://github.com/adambioi/red_panda and is supported on Linux.


2021 ◽  
Author(s):  
Alex Rogozhnikov ◽  
Pavan Ramkumar ◽  
Saul Kato ◽  
Sean Escola

Demultiplexing methods have facilitated the widespread use of single-cell RNA sequencing (scRNAseq) experiments by lowering costs and reducing technical variations. Here, we present demuxalot: a method for probabilistic genotype inference from aligned reads, with no assumptions about allele ratios and efficient incorporation of prior genotype information from historical experiments in a multi-batch setting. Our method efficiently incorporates additional information across reads originating from the same transcript, enabling up to 3x more calls per read relative to naive approaches. We also propose a novel and highly performant tradeoff between methods that rely on reference genotypes and methods that learn variants from the data, by selecting a small number of highly informative variants that maximize the marginal information with respect to reference single nucleotide variants (SNVs). Our resulting improved SNV-based demultiplexing method is up to 3x faster, 3x more data efficient, and achieves significantly more accurate doublet discrimination than previously published methods. This approach renders scRNAseq feasible for the kind of large multi-batch, multi-donor studies that are required to prosecute diseases with heterogeneous genetic backgrounds.


Genes ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 240 ◽  
Author(s):  
Prashant N. M. ◽  
Hongyu Liu ◽  
Pavlos Bousounis ◽  
Liam Spurr ◽  
Nawaf Alomran ◽  
...  

With the recent advances in single-cell RNA-sequencing (scRNA-seq) technologies, the estimation of allele expression from single cells is becoming increasingly reliable. Allele expression is both quantitative and dynamic and is an essential component of the genomic interactome. Here, we systematically estimate the allele expression from heterozygous single nucleotide variant (SNV) loci using scRNA-seq data generated on the 10× Genomics Chromium platform. We analyzed 26,640 human adipose-derived mesenchymal stem cells (from three healthy donors), sequenced to an average of 150K sequencing reads per cell (more than 4 billion scRNA-seq reads in total). High-quality SNV calls assessed in our study contained approximately 15% exonic and >50% intronic loci. To analyze the allele expression, we estimated the expressed variant allele fraction (VAF_RNA) from SNV-aware alignments and analyzed its variance and distribution (mono- and bi-allelic) at different minimum sequencing read thresholds. Our analysis shows that when assessing positions covered by a minimum of three unique sequencing reads, over 50% of the heterozygous SNVs show bi-allelic expression, while at a threshold of 10 reads, nearly 90% of the SNVs are bi-allelic. In addition, our analysis demonstrates the feasibility of scVAF_RNA estimation from current scRNA-seq datasets and shows that the 3′-based library generation protocol of 10× Genomics scRNA-seq data can be informative in SNV-based studies, including analyses of transcriptional kinetics.
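The expressed variant allele fraction described above can be illustrated with a small sketch. It assumes the fraction is the number of variant-supporting reads divided by all reads covering the locus, and that a locus counts as bi-allelic when both alleles are observed; the function names, the default threshold, and the bi-allelic rule are assumptions made for illustration, not the authors' pipeline.

```python
from typing import Optional


def vaf_rna(ref_reads: int, alt_reads: int, min_reads: int = 3) -> Optional[float]:
    """Variant allele fraction at one locus in one cell, or None if coverage is too low."""
    total = ref_reads + alt_reads
    if total < min_reads:
        return None  # locus not assessed at this read threshold
    return alt_reads / total


def is_biallelic(vaf: Optional[float]) -> bool:
    """Assumed rule: bi-allelic when both reference and variant reads are seen."""
    return vaf is not None and 0.0 < vaf < 1.0


def chance_of_one_allele_only(n_reads: int) -> float:
    """Probability that n reads from a balanced heterozygous locus all show the same allele."""
    return 2 * 0.5 ** n_reads
```

Under this simple balanced-sampling assumption, three reads miss one of the two alleles a quarter of the time, while ten reads do so in about 0.2% of cases, which is one plausible reason the bi-allelic share rises with the read threshold.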


2020 ◽  
Author(s):  
Weimiao Wu ◽  
Qile Dai ◽  
Yunqing Liu ◽  
Xiting Yan ◽  
Zuoheng Wang

Abstract: Single-cell RNA sequencing provides an opportunity to study gene expression at single-cell resolution. However, prevalent dropout events result in high data sparsity and noise that may obscure downstream analyses. We propose a novel method, G2S3, that imputes dropouts by borrowing information from adjacent genes in a sparse gene graph learned from gene expression profiles across cells. We applied G2S3 and other existing methods to seven single-cell datasets to compare their performance. Our results demonstrated that G2S3 is superior in recovering true expression levels, identifying cell subtypes, improving differential expression analyses, and recovering gene regulatory relationships, especially for mildly expressed genes.


2020 ◽  
Vol 48 (1) ◽  
pp. 327-336 ◽  
Author(s):  
L.E. Zaragosi ◽  
M. Deprez ◽  
P. Barbry

The respiratory tract is lined by a pseudo-stratified epithelium from the nose to the terminal bronchioles. This first line of defense of the lung against external stress includes five main cell types: basal, suprabasal, club, goblet and multiciliated cells, as well as rare cells such as ionocytes, neuroendocrine and tuft/brush cells. At homeostasis, this epithelium self-renews at a low rate but is capable of rapid regeneration upon damage. Airway epithelial cell lineages during regeneration have been investigated in the mouse by genetic labeling, mainly after injuring the epithelium with noxious agents. From these approaches, basal cells have been identified as progenitors of club, goblet and multiciliated cells, but also of ionocytes and neuroendocrine cells. Single-cell RNA sequencing, coupled to lineage inference algorithms, has independently allowed the establishment of comprehensive pictures of cell lineage relationships in both mouse and human. In line with genetic tracing experiments in mouse trachea, studies using single-cell RNA sequencing (RNAseq) have shown that basal cells first differentiate into club cells, which in turn mature into goblet cells or differentiate into multiciliated cells. In the human airway epithelium, single-cell RNAseq has identified novel intermediate populations such as deuterosomal cells, ‘hybrid’ mucous-multiciliated cells and progenitors of rare cells. Novel differentiation dynamics, such as a transition from goblet to multiciliated cells, have also been discovered. The future of cell lineage relationships in the respiratory tract now resides in the combination of genetic labeling approaches with single-cell RNAseq to establish, in a definitive manner, the hallmarks of cellular lineages in normal and pathological situations.


2020 ◽  
Vol 31 (9) ◽  
pp. 1977-1986 ◽  
Author(s):  
Andrew F. Malone ◽  
Haojia Wu ◽  
Catrina Fronick ◽  
Robert Fulton ◽  
Joseph P. Gaut ◽  
...  

Background: In solid organ transplantation, donor-derived immune cells are assumed to decline with time after surgery. Whether donor leukocytes persist within kidney transplants or play any role in rejection is unknown, however, in part because of limited techniques for distinguishing recipient from donor cells. Methods: Whole-exome sequencing of donor and recipient DNA and single-cell RNA sequencing (scRNA-seq) of five human kidney transplant biopsy cores distinguished immune cell contributions from both participants. DNA-sequence comparisons used single nucleotide variants (SNVs) identified in the exome sequences across all samples. Results: Analysis of expressed SNVs in the scRNA-seq data set distinguished recipient versus donor origin for all 81,139 cells examined. The leukocyte donor/recipient ratio varied with rejection status for macrophages and with time post-transplant for lymphocytes. Recipient macrophages displayed inflammatory activation whereas donor macrophages demonstrated antigen presentation and complement signaling. Recipient-origin T cells expressed cytotoxic and proinflammatory genes consistent with an effector cell phenotype, whereas donor-origin T cells appeared quiescent, expressing oxidative phosphorylation genes. Finally, both donor and recipient T cell clones within the rejecting kidney suggested lymphoid aggregation. The results indicate that donor-origin macrophages and T cells have distinct transcriptional profiles compared with their recipient counterparts, and that donor macrophages can persist for years post-transplantation. Conclusions: Analysis of single nucleotide variants and their expression in single cells provides a powerful novel approach to accurately define leukocyte chimerism in a complex organ such as a transplanted kidney, coupled with the ability to examine transcriptional profiles at single-cell resolution. Podcast: This article contains a podcast at https://www.asn-online.org/media/podcast/JASN/2020_08_07_JASN2020030326.mp3
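The idea of assigning each cell to donor or recipient from its expressed SNVs can be sketched as a simple vote over informative loci. This is a deliberately simplified illustration, not the method used in the study: it assumes each individual carries a single known allele per locus, and every name in it (assign_origin, the allele dictionaries, the min_informative parameter) is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def assign_origin(
    observations: Iterable[Tuple[str, str]],
    donor_alleles: Dict[str, str],
    recipient_alleles: Dict[str, str],
    min_informative: int = 1,
) -> str:
    """Label one cell from (snv_id, observed_base) pairs seen in its reads."""
    votes: Counter = Counter()
    for snv_id, base in observations:
        donor = donor_alleles.get(snv_id)
        recipient = recipient_alleles.get(snv_id)
        # A locus is informative only if both genotypes are known and differ.
        if donor is None or recipient is None or donor == recipient:
            continue
        if base == donor:
            votes["donor"] += 1
        elif base == recipient:
            votes["recipient"] += 1
    informative = votes["donor"] + votes["recipient"]
    if informative < min_informative or votes["donor"] == votes["recipient"]:
        return "unassigned"
    return "donor" if votes["donor"] > votes["recipient"] else "recipient"
```

A real analysis would also have to handle heterozygous loci, sequencing errors, and ambient RNA, which this vote ignores.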


2021 ◽  
Author(s):  
Xiaowen Cao ◽  
Li Xing ◽  
Elham Majd ◽  
Hua He ◽  
Junhua Gu ◽  
...  

Abstract Background: Single-cell RNA sequencing (scRNA-seq) yields valuable insights about gene expression and gives critical information about complex tissue cellular composition. In the analysis of single-cell RNA sequencing, the annotations of cell subtypes are often done manually, which is time-consuming and irreproducible. Garnett is a cell-type annotation software based on the elastic net method. Besides cell-type annotation, supervised machine learning methods can also be applied to predict other cell phenotypes from genomic data. Despite the popularity of such applications, no existing study has systematically investigated the performance of those supervised algorithms on scRNA-seq data sets of various sizes. Methods and Results: This study evaluates 13 popular supervised machine learning algorithms to classify cell phenotypes, using published real and simulated data sets with diverse cell sizes. The benchmark contained two parts. In the first part, we used real data sets to assess the popular supervised algorithms’ computing speed and cell phenotype classification performance. The classification performances were evaluated using AUC statistics, F1-score, precision, recall, and false-positive rate. In the second part, we evaluated gene selection performance using published simulated data sets with a known list of real genes. Conclusion: The study outcomes showed that ElasticNet with interactions performed best in small and medium data sets. NB was another appropriate method for medium data sets. In large data sets, XGB performed excellently. Ensemble algorithms were not significantly superior to individual machine learning methods. Adding interactions to ElasticNet can help, and the improvement was significant in small data sets.


2021 ◽  
Author(s):  
Helena L Crowell ◽  
Sarah X Morillo Leonardo ◽  
Charlotte Soneson ◽  
Mark D Robinson

With the emergence of hundreds of single-cell RNA-sequencing (scRNA-seq) datasets, the number of computational tools to analyse aspects of the generated data has grown rapidly. As a result, there is a recurring need to demonstrate whether newly developed methods are truly performant - on their own as well as in comparison to existing tools. Benchmark studies aim to consolidate the space of available methods for a given task, and often use simulated data that provide a ground truth for evaluations. Thus, demanding a high quality standard for synthetically generated data is critical to make simulation study results credible and transferable to real data. Here, we evaluated methods for synthetic scRNA-seq data generation in their ability to mimic experimental data. Besides comparing gene- and cell-level quality control summaries in both one- and two-dimensional settings, we further quantified these at the batch- and cluster-level. Secondly, we investigate the effect of simulators on clustering and batch correction method comparisons, and, thirdly, which and to what extent quality control summaries can capture reference-simulation similarity. Our results suggest that most simulators are unable to accommodate complex designs without introducing artificial effects; they yield over-optimistic performance of integration, and potentially unreliable ranking of clustering methods; and, it is generally unknown which summaries are important to ensure effective simulation-based method comparisons.


BMC Genomics ◽  
2020 ◽  
Vol 21 (S9) ◽  
Author(s):  
Siamak Zamani Dadaneh ◽  
Paul de Figueiredo ◽  
Sing-Hoi Sze ◽  
Mingyuan Zhou ◽  
Xiaoning Qian

Abstract Background: Single-cell RNA sequencing (scRNA-seq) is a powerful profiling technique at the single-cell resolution. Appropriate analysis of scRNA-seq data can characterize molecular heterogeneity and shed light on the underlying cellular process to better understand development and disease mechanisms. The unique analytic challenge is to appropriately model highly over-dispersed scRNA-seq count data with prevalent dropouts (zero counts), making zero-inflated dimensionality reduction techniques popular for scRNA-seq data analyses. Employing zero-inflated distributions, however, may place extra emphasis on zero counts, leading to potential bias when identifying the latent structure of the data. Results: In this paper, we propose a fully generative hierarchical gamma-negative binomial (hGNB) model of scRNA-seq data, obviating the need for explicitly modeling zero inflation. At the same time, hGNB can naturally account for covariate effects at both the gene and cell levels to identify complex latent representations of scRNA-seq data, without the need for commonly adopted pre-processing steps such as normalization. Efficient Bayesian model inference is derived by exploiting conditional conjugacy via novel data augmentation techniques. Conclusion: Experimental results on both simulated data and several real-world scRNA-seq datasets suggest that hGNB is a powerful tool for cell cluster discovery as well as cell lineage inference.
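One way to see why a negative binomial model might not need a separate zero-inflation component is to compare how many zeros it produces against a Poisson model with the same mean. The sketch below uses the standard zero-probability formulas for the two distributions; it is not the hGNB model, and the parameter values are arbitrary illustrations.

```python
import math


def poisson_zero_prob(mean: float) -> float:
    """P(count = 0) under a Poisson distribution with the given mean."""
    return math.exp(-mean)


def negative_binomial_zero_prob(mean: float, dispersion: float) -> float:
    """P(count = 0) under a negative binomial with mean `mean` and
    dispersion (shape) parameter `dispersion`; variance is mean + mean**2 / dispersion.
    """
    return (dispersion / (dispersion + mean)) ** dispersion


# Arbitrary illustrative values: same mean, strong over-dispersion.
mean_count = 2.0
zeros_poisson = poisson_zero_prob(mean_count)
zeros_negative_binomial = negative_binomial_zero_prob(mean_count, dispersion=0.5)
```

With a mean of 2 and a dispersion of 0.5, the negative binomial places roughly 45% of its mass on zero, against roughly 14% for the Poisson, so heavy over-dispersion alone can account for many zero counts.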


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Fenglin Liu ◽  
Yuanyuan Zhang ◽  
Lei Zhang ◽  
Ziyi Li ◽  
Qiao Fang ◽  
...  

Abstract Background: Systematic interrogation of single-nucleotide variants (SNVs) is one of the most promising approaches to delineate the cellular heterogeneity and phylogenetic relationships at the single-cell level. While SNV detection from abundant single-cell RNA sequencing (scRNA-seq) data is applicable and cost-effective in identifying expressed variants, inferring sub-clones, and deciphering genotype-phenotype linkages, there is a lack of computational methods specifically developed for SNV calling in scRNA-seq. Although variant callers for bulk RNA-seq have been sporadically used in scRNA-seq, the performances of different tools have not been assessed. Results: Here, we perform a systematic comparison of seven tools including SAMtools, the GATK pipeline, CTAT, FreeBayes, MuTect2, Strelka2, and VarScan2, using both simulation and scRNA-seq datasets, and identify multiple elements influencing their performance. While the specificities are generally high, with sensitivities exceeding 90% for most tools when calling homozygous SNVs in high-confidence coding regions with sufficient read depths, such sensitivities dramatically decrease when calling SNVs with low read depths, low variant allele frequencies, or in specific genomic contexts. SAMtools shows the highest sensitivity in most cases, especially with low supporting reads, despite the relatively low specificity in introns or high-identity regions. Strelka2 shows consistently good performance when sufficient supporting reads are provided, while FreeBayes shows good performance in the cases of high variant allele frequencies. Conclusions: We recommend SAMtools, Strelka2, FreeBayes, or CTAT, depending on the specific conditions of usage. Our study provides the first benchmarking to evaluate the performances of different SNV detection tools for scRNA-seq data.

