Estimating the Allele-Specific Expression of SNVs From 10× Genomics Single-Cell RNA-Sequencing Data

Genes ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 240 ◽  
Author(s):  
Prashant N. M. ◽  
Hongyu Liu ◽  
Pavlos Bousounis ◽  
Liam Spurr ◽  
Nawaf Alomran ◽  
...  

With the recent advances in single-cell RNA-sequencing (scRNA-seq) technologies, the estimation of allele expression from single cells is becoming increasingly reliable. Allele expression is both quantitative and dynamic and is an essential component of the genomic interactome. Here, we systematically estimate the allele expression from heterozygous single nucleotide variant (SNV) loci using scRNA-seq data generated on the 10×Genomics Chromium platform. We analyzed 26,640 human adipose-derived mesenchymal stem cells (from three healthy donors), sequenced to an average of 150K sequencing reads per cell (more than 4 billion scRNA-seq reads in total). High-quality SNV calls assessed in our study contained approximately 15% exonic and >50% intronic loci. To analyze the allele expression, we estimated the expressed variant allele fraction (VAFRNA) from SNV-aware alignments and analyzed its variance and distribution (mono- and bi-allelic) at different minimum sequencing read thresholds. Our analysis shows that when assessing positions covered by a minimum of three unique sequencing reads, over 50% of the heterozygous SNVs show bi-allelic expression, while at a threshold of 10 reads, nearly 90% of the SNVs are bi-allelic. In addition, our analysis demonstrates the feasibility of scVAFRNA estimation from current scRNA-seq datasets and shows that the 3′-based library generation protocol of 10×Genomics scRNA-seq data can be informative in SNV-based studies, including analyses of transcriptional kinetics.
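As a minimal sketch of the quantity described above, the expressed variant allele fraction at a locus can be computed as the number of variant-supporting reads divided by the sum of variant and reference reads, and evaluated only when coverage meets a minimum read threshold. The function names, the example read counts, and the rule that a locus counts as bi-allelic whenever both alleles have at least one read are illustrative assumptions, not the authors' pipeline.

```python
from typing import Optional


def vaf_rna(n_var: int, n_ref: int, min_reads: int = 3) -> Optional[float]:
    """Return n_var / (n_var + n_ref), or None when coverage is too low."""
    total = n_var + n_ref
    if total == 0 or total < min_reads:
        return None  # locus not assessed at this threshold
    return n_var / total


def allelic_status(n_var: int, n_ref: int, min_reads: int = 3) -> str:
    """Label a locus in one cell (assumed rule: both alleles seen = bi-allelic)."""
    vaf = vaf_rna(n_var, n_ref, min_reads)
    if vaf is None:
        return "not assessed"
    return "bi-allelic" if 0.0 < vaf < 1.0 else "mono-allelic"


# Hypothetical (variant, reference) read counts for one SNV in four cells
counts = [(2, 1), (0, 4), (6, 5), (1, 0)]
for threshold in (3, 10):
    print(threshold, [allelic_status(v, r, threshold) for v, r in counts])
```

Raising the threshold from 3 to 10 reads leaves fewer loci assessed in this toy input, which illustrates why the reported bi-allelic fraction depends on the chosen minimum coverage.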

Author(s):  
N M Prashant ◽  
Hongyu Liu ◽  
Pavlos Bousounis ◽  
Liam Spurr ◽  
Nawaf Alomran ◽  
...  

Abstract With the recent advances in single-cell RNA-sequencing (scRNA-seq) technologies, estimation of allele expression from single cells is becoming increasingly reliable. Allele expression is both quantitative and dynamic and is an essential component of the genomic interactome. Here, we systematically estimate allele expression from heterozygous single nucleotide variant (SNV) loci using scRNA-seq data generated on the 10x Genomics platform. We include in the analysis 26,640 human adipose-derived mesenchymal stem cells (from three healthy donors), sequenced to an average of more than 120K reads per cell (more than 4 billion scRNA-seq reads in total). High-quality SNV calls assessed in our study contained approximately 15% exonic and >50% intronic loci. To analyze the allele expression, we estimate the expressed Variant Allele Fraction (VAFRNA) from SNV-aware alignments and analyze its variance and distribution (mono- and bi-allelic) at different cutoffs for the minimum required number of sequencing reads. Our analysis shows that when assessing SNV loci covered by a minimum of 3 unique sequencing reads, over 50% of the heterozygous SNVs show bi-allelic expression, while at a minimum of 10 reads, nearly 90% of the SNVs are bi-allelic. Consistent with single-cell studies on RNA velocity and models of transcriptional burst kinetics, we observe a substantially higher rate of monoallelic expression among intronic SNVs, signifying the usefulness of scVAFRNA to assess dynamic cellular processes. Our analysis demonstrates the feasibility of scVAFRNA estimation from current scRNA-seq datasets and shows that the 3′-based library generation protocol of 10x Genomics scRNA-seq data can be highly informative in SNV-based analyses.


2019 ◽  
Author(s):  
Imad Abugessaisa ◽  
Shuhei Noguchi ◽  
Melissa Cardon ◽  
Akira Hasegawa ◽  
Kazuhide Watanabe ◽  
...  

Abstract Analysis and interpretation of single-cell RNA-sequencing (scRNA-seq) experiments are compromised by the presence of poor-quality cells. For meaningful analyses, such poor-quality cells should be excluded to avoid biases and large variation. However, no clear guidelines exist. We introduce SkewC, a novel quality-assessment method to identify poor-quality single cells in scRNA-seq experiments. The method is based on the assessment of gene coverage for each single cell and its skewness as a quality measure. To validate the method, we investigated the impact of poor-quality cells on downstream analyses and compared biological differences between typical and poor-quality cells. Moreover, we measured the ratio of intergenic expression, suggesting genomic contamination, and foreign organism contamination of single-cell samples. SkewC was tested on 37,993 single cells generated by 15 scRNA-seq protocols. We envision SkewC as an indispensable QC method to be incorporated into scRNA-seq experiments to preclude the possibility of scRNA-seq data misinterpretation.
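The idea of using the skewness of a cell's gene coverage as a quality measure can be illustrated with a short sketch. It assumes the coverage profile is a list of read coverage values at evenly spaced positions along the gene body (5′ to 3′) and treats it as a weighted distribution over positions; this weighting scheme and the function name are assumptions made for illustration and are not taken from SkewC itself.

```python
def coverage_skewness(profile):
    """Moment coefficient of skewness of positions, weighted by coverage.

    profile[i] is the (non-negative) coverage at the i-th gene-body position.
    A symmetric profile gives 0; coverage piled up at the 3' end gives a
    negative value because the tail points towards the 5' end.
    """
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must contain some coverage")
    positions = range(len(profile))
    mean = sum(w * x for x, w in zip(positions, profile)) / total
    m2 = sum(w * (x - mean) ** 2 for x, w in zip(positions, profile)) / total
    m3 = sum(w * (x - mean) ** 3 for x, w in zip(positions, profile)) / total
    if m2 == 0:
        return 0.0  # all coverage at a single position
    return m3 / m2 ** 1.5


# Hypothetical profiles: one even cell and one with strong 3' bias
even_cell = [1, 2, 3, 2, 1]
biased_cell = [1, 1, 1, 1, 10]
print(coverage_skewness(even_cell), coverage_skewness(biased_cell))
```

Under these assumptions, cells whose skewness deviates strongly from that of the bulk of the population would be the candidates for exclusion.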


2019 ◽  
Vol 28 (21) ◽  
pp. 3569-3583 ◽  
Author(s):  
Patricia M Schnepp ◽  
Mengjie Chen ◽  
Evan T Keller ◽  
Xiang Zhou

Abstract Integrating single-cell RNA sequencing (scRNA-seq) data with genotypes obtained from DNA sequencing studies facilitates the detection of functional genetic variants underlying cell type-specific gene expression variation. Unfortunately, most existing scRNA-seq studies do not come with DNA sequencing data; thus, being able to call single nucleotide variants (SNVs) from scRNA-seq data alone can provide crucial and complementary information for the detection of functional SNVs, maximizing the potential of existing scRNA-seq studies. Here, we perform extensive analyses to evaluate the utility of two SNV calling pipelines (GATK and Monovar), originally designed for SNV calling in either bulk or single-cell DNA sequencing data. In both pipelines, we examined various parameter settings to determine the accuracy of the final SNV call set and provide practical recommendations for applied analysts. We found that combining all reads from the single cells and following GATK Best Practices resulted in the highest number of SNVs identified with a high concordance. In individual single cells, Monovar resulted in better quality SNVs, even though neither pipeline is capable of calling a reasonable number of SNVs with high accuracy. In addition, we found that SNV calling quality varies across different functional genomic regions. Our results open doors for novel ways to leverage the use of scRNA-seq for the future investigation of SNV function.


Author(s):  
Abha S Bais ◽  
Dennis Kostka

Abstract Motivation Single-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed. Results With scds, we propose two new approaches for in silico doublet identification: Co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparably little computational cost. We observe appreciable differences between methods and across datasets and that no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds. Availability and implementation scds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds). Supplementary information Supplementary data are available at Bioinformatics online.
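The classification-based approach described above relies on artificial doublets to serve as positive examples. A rough sketch of how such a training set might be assembled is shown below; it assumes an artificial doublet is formed by summing the count vectors of two randomly chosen cells, and the function names are hypothetical rather than part of the scds package. The classifier itself is omitted.

```python
import random


def make_artificial_doublets(cells, n_doublets, seed=0):
    """Sum the gene counts of random pairs of distinct cells (assumed scheme)."""
    rng = random.Random(seed)
    doublets = []
    for _ in range(n_doublets):
        a, b = rng.sample(range(len(cells)), 2)  # two different cell indices
        doublets.append([x + y for x, y in zip(cells[a], cells[b])])
    return doublets


def build_training_set(cells, n_doublets, seed=0):
    """Return (features, labels): label 0 = original cell, 1 = artificial doublet."""
    doublets = make_artificial_doublets(cells, n_doublets, seed)
    features = [list(c) for c in cells] + doublets
    labels = [0] * len(cells) + [1] * len(doublets)
    return features, labels


# Hypothetical count matrix: three cells (rows) by three genes (columns)
cells = [[1, 0, 2], [0, 3, 0], [4, 0, 0]]
features, labels = build_training_set(cells, n_doublets=2)
print(labels)
```

A binary classifier trained on these features and labels could then score each original cell, with high scores suggesting a doublet.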


2020 ◽  
Author(s):  
Li Lin ◽  
Minfang Song ◽  
Yong Jiang ◽  
Xiaojing Zhao ◽  
Haopeng Wang ◽  
...  

Abstract Normalization with respect to sequencing depth is a crucial step in single-cell RNA sequencing preprocessing. Most methods normalize data using the whole transcriptome, based on the assumption that the majority of the transcriptome remains constant, and are therefore unable to detect drastic changes of the transcriptome. Here, we develop an algorithm that uses a small fraction of constantly expressed genes as internal spike-ins to normalize single-cell RNA sequencing data. We demonstrate in several case study datasets that the transcriptome of single cells may undergo drastic changes, and that accounting for such heterogeneity with ISnorm improves the performance of downstream analyses.
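To make the idea of internal spike-ins concrete, the sketch below scales each cell by the total count of a given set of reference genes instead of by the whole transcriptome. It assumes the constantly expressed genes have already been chosen; the selection procedure, the function names, and the toy counts are illustrative assumptions and do not reproduce ISnorm.

```python
def size_factors(counts, reference_genes):
    """Per-cell size factor from reference-gene totals, scaled to mean 1."""
    ref_totals = [sum(cell[g] for g in reference_genes) for cell in counts]
    if any(t == 0 for t in ref_totals):
        raise ValueError("a cell has no counts in the reference genes")
    mean_total = sum(ref_totals) / len(ref_totals)
    return [t / mean_total for t in ref_totals]


def normalize(counts, reference_genes):
    """Divide every count in a cell by that cell's size factor."""
    factors = size_factors(counts, reference_genes)
    return [[c / f for c in cell] for cell, f in zip(counts, factors)]


# Hypothetical counts: two cells (rows) by three genes (columns).
# Genes 0 and 1 are assumed to be constantly expressed; gene 2 is not.
counts = [[10, 20, 5], [20, 40, 50]]
normalized = normalize(counts, reference_genes=[0, 1])
print(normalized)
```

In this toy input the reference genes become equal across the two cells after scaling, while the large difference in gene 2 is preserved rather than being absorbed into the depth correction.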


2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Xin Zhao ◽  
Shouguo Gao ◽  
Sachiko Kajigaya ◽  
Qingguo Liu ◽  
Zhijie Wu ◽  
...  

Abstract Objective Single cell methodology enables detection and quantification of transcriptional changes and unravelling dynamic aspects of the transcriptional heterogeneity not accessible using bulk sequencing approaches. We have applied single-cell RNA-sequencing (scRNA-seq) to fresh human bone marrow CD34+ cells and profiled 391 single hematopoietic stem/progenitor cells (HSPCs) from healthy donors to characterize lineage- and stage-specific transcription during hematopoiesis. Results Cells clustered into six distinct groups, which could be assigned to known HSPC subpopulations based on lineage specific genes. Reconstruction of differentiation trajectories in single cells revealed four committed lineages derived from HSCs, as well as dynamic expression changes underlying cell fate during early erythroid-megakaryocytic, lymphoid, and granulocyte-monocyte differentiation. A similar non-hierarchical pattern of hematopoiesis could be derived from analysis of published single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq), consistent with a sequential relationship between chromatin dynamics and regulation of gene expression during lineage commitment (first, altered chromatin conformation, then mRNA transcription). Computationally, we have reconstructed molecular trajectories connecting HSCs directly to four hematopoietic lineages. Integration of long noncoding RNA (lncRNA) expression from the same cells demonstrated mRNA transcriptome, lncRNA, and the epigenome were highly homologous in their pattern of gene activation and suppression during hematopoietic cell differentiation.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Fenglin Liu ◽  
Yuanyuan Zhang ◽  
Lei Zhang ◽  
Ziyi Li ◽  
Qiao Fang ◽  
...  

Abstract Background Systematic interrogation of single-nucleotide variants (SNVs) is one of the most promising approaches to delineate the cellular heterogeneity and phylogenetic relationships at the single-cell level. While SNV detection from abundant single-cell RNA sequencing (scRNA-seq) data is applicable and cost-effective in identifying expressed variants, inferring sub-clones, and deciphering genotype-phenotype linkages, there is a lack of computational methods specifically developed for SNV calling in scRNA-seq. Although variant callers for bulk RNA-seq have been sporadically used in scRNA-seq, the performances of different tools have not been assessed. Results Here, we perform a systematic comparison of seven tools including SAMtools, the GATK pipeline, CTAT, FreeBayes, MuTect2, Strelka2, and VarScan2, using both simulation and scRNA-seq datasets, and identify multiple elements influencing their performance. While the specificities are generally high, with sensitivities exceeding 90% for most tools when calling homozygous SNVs in high-confident coding regions with sufficient read depths, such sensitivities dramatically decrease when calling SNVs with low read depths, low variant allele frequencies, or in specific genomic contexts. SAMtools shows the highest sensitivity in most cases especially with low supporting reads, despite the relatively low specificity in introns or high-identity regions. Strelka2 shows consistently good performance when sufficient supporting reads are provided, while FreeBayes shows good performance in the cases of high variant allele frequencies. Conclusions We recommend SAMtools, Strelka2, FreeBayes, or CTAT, depending on the specific conditions of usage. Our study provides the first benchmarking to evaluate the performances of different SNV detection tools for scRNA-seq data.


Author(s):  
Cornelia Fuetterer ◽  
Thomas Augustin ◽  
Christiane Fuchs

Abstract The analysis of single-cell RNA sequencing data is of great importance in health research. It challenges data scientists, but has enormous potential in the context of personalized medicine. The clustering of single cells aims to detect different subgroups of cell populations within a patient in a data-driven manner. Some comparison studies identify single-cell consensus clustering (SC3), proposed by Kiselev et al. (Nat Methods 14(5):483–486, 2017), as the best method for classifying single-cell RNA sequencing data. SC3 includes Laplacian eigenmaps and a principal component analysis (PCA). Our proposal, unsupervised adapted single-cell consensus clustering (adaSC3), replaces the linear PCA with diffusion maps, a non-linear method that takes the transition of single cells into account. We investigate the performance of adaSC3 in terms of accuracy on the data sets of the original source of SC3 as well as in a simulation study. A comparison of adaSC3 with SC3 as well as with related algorithms based on further alternative dimension reduction techniques shows a quite convincing behavior of adaSC3.


Blood ◽  
2018 ◽  
Vol 132 (Supplement 1) ◽  
pp. 995-995
Author(s):  
Vincent-Philippe Lavallee ◽  
Elham Azizi ◽  
Vaidotas Kiseliovas ◽  
Ignas Masilionis ◽  
Linas Mazutis ◽  
...  

Abstract Introduction: Acute myeloid leukemia (AML) evolution is a multistep process in which cells evolve from hematopoietic stem and progenitor cells (HSPCs) that acquire genetic anomalies, such as chromosomal rearrangements and mutations, which define distinct subgroups. Mutations in Nucleophosmin 1 (NPM1), which occur in ~30% of patients, are the most frequent subgroup-defining mutations in AML and appear to be a late driver event in this disease. Bulk RNA-sequencing studies have identified differentially expressed genes between AML subgroups, but they are uninformative of the composition of cell types populating each sample. Large-scale single-cell RNA sequencing (scRNA-seq) technologies now enable a detailed characterization of intratumoral heterogeneity, and could help to better understand the stepwise evolution from normal to malignant cells. Methods: Twelve primary human AML specimens from MSKCC and Quebec Leukemia Cell Bank, including 8 with NPM1 mutations, were included in this cohort. Cells were subjected to scRNA-seq using 10X Genomics Chromium Single Cell 3' protocols and libraries were sequenced on Illumina HiSeq or NovaSeq platforms. FASTQ files were processed using the SEQC pipeline (Azizi E et al, Cell 2018), resulting in a carefully filtered count matrix of > 100,000 single cells (4877 to 11532 cells per sample). Results: Using Euclidean distance metrics and t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization, we explored the phenotypic overlap between samples and showed that leukemia cells from different patients were mostly dissimilar, suggesting inter-sample heterogeneity. However, samples with similar morphology and similar NPM1 mutational status were phenotypically closer (Fig A), as anticipated from bulk RNA-sequencing data (TCGA, NEJM 2013). 
We partitioned cells into distinct clusters using Phenograph (Levine J et al, Cell 2015) (Fig B) and measured the diversity of samples per cluster using Shannon's entropy metric, revealing that mature cell types (B/plasma cells, T/NK and erythroid cells, Fig C), presumably excluded from the tumor bulk, are transcriptionally similar across samples. Most notably, the next most diverse cluster (C36), comprising 438 cells from 11/12 samples, contains cells with a HSPC-like phenotype, as suggested by i) highest correlation of the centroid of this cluster with HSC1 (lin-/CD133+/CD34dim) population from sorted bulk RNA-sequencing data (Novershtern N et al, Cell 2011), and ii) marked GSEA enrichment for stem cell signatures (top enrichment: Jaatinen_hematopoeitic_stem_cell_up, NES = 9.04, FDR q-val = 0). To study the extent to which NPM1 or other mutations drive heterogeneity in leukemia populations, we interrogated 3'-derived single-cell sequences for all recurrent mutations in AML and found that NPM1 gene has unique features (e.g. relatively high single-cell expression and 3' localization) that allow specific identification of mutations in 5 to 34% of cells per mutated sample. To control for the high frequency of false negatives caused by dropouts in scRNA-seq data, we normalized the abundance of mutated vs wild-type cells to provide an estimation of mutation frequency in different cell types (Fig D). As expected, NPM1 mutations were rare in B and T/NK lymphoid cells (also observed using RT-qPCR in sorted populations by Dvorakova D et al, Leuk Lymphoma 2013) and were found in the majority of leukemia and myeloid cells. Interestingly, these mutations were detected at various frequencies in erythroid cells, suggesting that NPM1 mutations are acquired in cells with different lineage commitment in different patients. 
Most notably, the HSPC-like cluster C36 also contained a subpopulation of cells that have acquired NPM1 mutations and are transcriptionally different from wild-type cells. Conclusion: This study presents a first comprehensive single-cell map of primary AML, and the first 3'-based interrogation of mutations in single cells. It led to the identification of phenotypically distinct cells presenting an HSPC-like expression profile which were sub-clonally harboring NPM1 mutations, providing the means to identify deregulated genes in these important leukemia subpopulations. Disclosures Levine: Epizyme: Patents & Royalties; Celgene: Consultancy, Research Funding; Janssen: Consultancy, Honoraria; Isoplexis: Equity Ownership; C4 Therapeutics: Equity Ownership; Prelude: Research Funding; Gilead: Honoraria; Imago: Equity Ownership; Novartis: Consultancy; Roche: Consultancy, Research Funding; Loxo: Consultancy, Equity Ownership; Qiagen: Equity Ownership, Membership on an entity's Board of Directors or advisory committees.
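The abstract above measures the diversity of samples within each cluster using Shannon's entropy. A minimal sketch of that calculation follows; it assumes each cell carries a sample label and a cluster label, uses a base-2 logarithm (the abstract does not state the base), and uses made-up labels purely for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def shannon_entropy(sample_labels):
    """Entropy (in bits) of the sample composition of one cluster."""
    n = len(sample_labels)
    if n == 0:
        return 0.0
    counts = Counter(sample_labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def entropy_per_cluster(cluster_labels, sample_labels):
    """Group cells by cluster, then compute the entropy of their samples."""
    by_cluster = defaultdict(list)
    for cluster, sample in zip(cluster_labels, sample_labels):
        by_cluster[cluster].append(sample)
    return {c: shannon_entropy(s) for c, s in by_cluster.items()}


# Hypothetical labels for six cells: cluster "A" mixes two samples evenly,
# cluster "B" comes from a single sample.
clusters = ["A", "A", "A", "A", "B", "B"]
samples = ["s1", "s2", "s1", "s2", "s3", "s3"]
print(entropy_per_cluster(clusters, samples))
```

Clusters drawing cells evenly from many samples score high, which matches the abstract's use of the metric to flag cell types that are transcriptionally similar across patients.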


2020 ◽  
Author(s):  
Jared Brown ◽  
Zijian Ni ◽  
Chitrasen Mohanty ◽  
Rhonda Bacher ◽  
Christina Kendziorski

Abstract Motivation: Normalization to remove technical or experimental artifacts is critical in the analysis of single-cell RNA-sequencing experiments, even those for which unique molecular identifiers (UMIs) are available. The majority of methods for normalizing single-cell RNA-sequencing data adjust average expression in sequencing depth, but allow the variance and other properties of the gene-specific expression distribution to be non-constant in depth, which often results in reduced power and increased false discoveries in downstream analyses. This problem is exacerbated by the high proportion of zeros present in most datasets. Results: To address this, we present Dino, a normalization method based on a flexible negative-binomial mixture model of gene expression. As demonstrated in both simulated and case study datasets, by normalizing the entire gene expression distribution, Dino is robust to shallow sequencing depth, sample heterogeneity, and varying zero proportions, leading to improved performance in downstream analyses in a number of settings. Availability and implementation: The R package, Dino, is available on GitHub at https://github.com/JBrownBiostat/[email protected], [email protected]

