Normalizing single-cell RNA sequencing data with internal spike-in-like genes

2020 ◽  
Author(s):  
Li Lin ◽  
Minfang Song ◽  
Yong Jiang ◽  
Xiaojing Zhao ◽  
Haopeng Wang ◽  
...  

Abstract Normalization with respect to sequencing depth is a crucial step in single-cell RNA sequencing preprocessing. Most methods normalize data using the whole transcriptome, based on the assumption that the majority of the transcriptome remains constant, and are unable to detect drastic changes of the transcriptome. Here, we develop an algorithm based on a small fraction of constantly expressed genes as internal spike-ins to normalize single-cell RNA sequencing data. We demonstrate that the transcriptome of single cells may undergo drastic changes in several case study datasets and accounting for such heterogeneity by ISnorm improves the performance of downstream analyses.

2020 ◽  
Vol 2 (3) ◽  
Author(s):  
Li Lin ◽  
Minfang Song ◽  
Yong Jiang ◽  
Xiaojing Zhao ◽  
Haopeng Wang ◽  
...  

Abstract Normalization with respect to sequencing depth is a crucial step in single-cell RNA sequencing preprocessing. Most methods normalize data using the whole transcriptome, based on the assumption that the majority of the transcriptome remains constant, and are unable to detect drastic changes of the transcriptome. Here, we develop an algorithm based on a small fraction of constantly expressed genes as internal spike-ins to normalize single-cell RNA sequencing data. We demonstrate that the transcriptome of single cells may undergo drastic changes in several case study datasets and accounting for such heterogeneity by ISnorm (Internal Spike-in-like-genes normalization) improves the performance of downstream analyses.
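The idea of scaling each cell by a small set of constantly expressed genes, rather than by the whole transcriptome, can be illustrated with a minimal sketch. This is not the ISnorm algorithm itself: the abstract does not say how the reference genes are selected, so the sketch assumes they are already known. The function names, the gene names, and the choice to rescale the size factors to an average of 1 are all assumptions made for illustration.

```python
from statistics import mean


def reference_gene_size_factors(counts, reference_genes):
    """Return one size factor per cell from a given set of reference genes.

    counts: dict mapping gene name -> list of per-cell counts.
    reference_genes: genes assumed (hypothetically) to be constantly expressed.
    """
    n_cells = len(next(iter(counts.values())))
    # Total count of the reference genes in each cell.
    ref_totals = [sum(counts[g][c] for g in reference_genes) for c in range(n_cells)]
    if any(t == 0 for t in ref_totals):
        raise ValueError("a cell has zero counts across all reference genes")
    # Rescale so the size factors average to 1 (an illustrative convention).
    avg = mean(ref_totals)
    return [t / avg for t in ref_totals]


def normalize(counts, size_factors):
    """Divide every gene's counts by the per-cell size factor."""
    return {g: [x / s for x, s in zip(row, size_factors)] for g, row in counts.items()}


# Toy data: the third cell carries a large increase in geneA.
counts = {
    "ref1": [10, 20, 30],
    "ref2": [10, 20, 30],
    "geneA": [5, 10, 90],
}
size_factors = reference_gene_size_factors(counts, ["ref1", "ref2"])
normalized = normalize(counts, size_factors)
print(size_factors)  # [0.5, 1.0, 1.5]
```

In this toy case the reference genes come out flat after normalization, while the increase of geneA in the third cell is preserved. A whole-transcriptome scaling factor would partly absorb that increase.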


2019 ◽  
Author(s):  
Imad Abugessaisa ◽  
Shuhei Noguchi ◽  
Melissa Cardon ◽  
Akira Hasegawa ◽  
Kazuhide Watanabe ◽  
...  

Abstract Analysis and interpretation of single-cell RNA-sequencing (scRNA-seq) experiments are compromised by the presence of poor quality cells. For meaningful analyses, such poor quality cells should be excluded to avoid biases and large variation. However, no clear guidelines exist. We introduce SkewC, a novel quality-assessment method to identify poor quality single cells in scRNA-seq experiments. The method is based on the assessment of gene coverage for each single cell and its skewness as a quality measure. To validate the method, we investigated the impact of poor quality cells on downstream analyses and compared biological differences between typical and poor quality cells. Moreover, we measured the ratio of intergenic expression, suggesting genomic contamination, and foreign organism contamination of single-cell samples. SkewC is tested in 37,993 single cells generated by 15 scRNA-seq protocols. We envision SkewC as an indispensable QC method to be incorporated into scRNA-seq experiments to preclude the possibility of scRNA-seq data misinterpretation.
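As a rough illustration of using the skewness of gene coverage as a per-cell quality measure, the sketch below computes a weighted moment skewness of a coverage profile binned from the 5' to the 3' end of the gene body. The abstract does not give SkewC's actual computation, so the function name, the binning, and the choice of the standardized third moment are assumptions.

```python
def coverage_skewness(coverage):
    """Skewness of read coverage along the gene body.

    coverage: read depth at evenly spaced positions from 5' to 3'
              (a hypothetical input format).
    Returns the standardized third moment of position, weighted by depth.
    """
    total = sum(coverage)
    if total <= 0:
        raise ValueError("coverage profile has no reads")
    positions = range(len(coverage))
    # Depth-weighted mean position.
    mean_pos = sum(p * w for p, w in zip(positions, coverage)) / total
    # Depth-weighted variance and third central moment.
    var = sum(w * (p - mean_pos) ** 2 for p, w in zip(positions, coverage)) / total
    if var == 0:
        return 0.0
    third = sum(w * (p - mean_pos) ** 3 for p, w in zip(positions, coverage)) / total
    return third / var ** 1.5


uniform_like = coverage_skewness([1, 2, 3, 2, 1])      # symmetric profile
three_prime_biased = coverage_skewness([1, 1, 1, 5, 10])  # reads piled at the 3' end
```

A symmetric profile gives a skewness of zero, and a profile concentrated at the 3' end gives a negative value. Cells whose skewness departs strongly from the bulk of the dataset would be the candidates for closer inspection.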


Genes ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 240 ◽  
Author(s):  
Prashant N. M. ◽  
Hongyu Liu ◽  
Pavlos Bousounis ◽  
Liam Spurr ◽  
Nawaf Alomran ◽  
...  

With the recent advances in single-cell RNA-sequencing (scRNA-seq) technologies, the estimation of allele expression from single cells is becoming increasingly reliable. Allele expression is both quantitative and dynamic and is an essential component of the genomic interactome. Here, we systematically estimate the allele expression from heterozygous single nucleotide variant (SNV) loci using scRNA-seq data generated on the 10×Genomics Chromium platform. We analyzed 26,640 human adipose-derived mesenchymal stem cells (from three healthy donors), sequenced to an average of 150K sequencing reads per cell (more than 4 billion scRNA-seq reads in total). High-quality SNV calls assessed in our study contained approximately 15% exonic and >50% intronic loci. To analyze the allele expression, we estimated the expressed variant allele fraction (VAF_RNA) from SNV-aware alignments and analyzed its variance and distribution (mono- and bi-allelic) at different minimum sequencing read thresholds. Our analysis shows that when assessing positions covered by a minimum of three unique sequencing reads, over 50% of the heterozygous SNVs show bi-allelic expression, while at a threshold of 10 reads, nearly 90% of the SNVs are bi-allelic. In addition, our analysis demonstrates the feasibility of scVAF_RNA estimation from current scRNA-seq datasets and shows that the 3′-based library generation protocol of 10×Genomics scRNA-seq data can be informative in SNV-based studies, including analyses of transcriptional kinetics.
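The effect of the minimum read threshold can be made concrete with a small sketch. It assumes the usual definition of the expressed variant allele fraction, variant reads divided by all reads covering the position. The function names and the rule that any fraction strictly between 0 and 1 counts as bi-allelic are illustrative assumptions, not the authors' pipeline.

```python
def vaf_rna(ref_reads, alt_reads, min_reads=3):
    """Expressed variant allele fraction at one SNV position in one cell.

    Returns None when the position is covered by fewer than min_reads reads.
    """
    total = ref_reads + alt_reads
    if total < min_reads:
        return None
    return alt_reads / total


def classify(vaf):
    """Label a position as mono- or bi-allelic (illustrative cutoffs)."""
    if vaf is None:
        return "not assessed"
    if vaf == 0.0 or vaf == 1.0:
        return "mono-allelic"
    return "bi-allelic"


# The same locus can be callable at one threshold and skipped at another.
low_cov = classify(vaf_rna(ref_reads=2, alt_reads=1, min_reads=3))
skipped = classify(vaf_rna(ref_reads=2, alt_reads=1, min_reads=10))
deep = classify(vaf_rna(ref_reads=6, alt_reads=4, min_reads=10))
```

With few reads, a heterozygous site can look mono-allelic simply because only one allele was sampled. That is consistent with the higher bi-allelic fraction reported at the 10-read threshold.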


2019 ◽  
Vol 28 (21) ◽  
pp. 3569-3583 ◽  
Author(s):  
Patricia M Schnepp ◽  
Mengjie Chen ◽  
Evan T Keller ◽  
Xiang Zhou

Abstract Integrating single-cell RNA sequencing (scRNA-seq) data with genotypes obtained from DNA sequencing studies facilitates the detection of functional genetic variants underlying cell type-specific gene expression variation. Unfortunately, most existing scRNA-seq studies do not come with DNA sequencing data; thus, being able to call single nucleotide variants (SNVs) from scRNA-seq data alone can provide crucial and complementary information for the detection of functional SNVs, maximizing the potential of existing scRNA-seq studies. Here, we perform extensive analyses to evaluate the utility of two SNV calling pipelines (GATK and Monovar), originally designed for SNV calling in either bulk or single-cell DNA sequencing data. In both pipelines, we examined various parameter settings to determine the accuracy of the final SNV call set and provide practical recommendations for applied analysts. We found that combining all reads from the single cells and following GATK Best Practices resulted in the highest number of SNVs identified with a high concordance. In individual single cells, Monovar resulted in better quality SNVs even though none of the pipelines analyzed is capable of calling a reasonable number of SNVs with high accuracy. In addition, we found that SNV calling quality varies across different functional genomic regions. Our results open doors for novel ways to leverage the use of scRNA-seq for the future investigation of SNV function.


Author(s):  
Abha S Bais ◽  
Dennis Kostka

Abstract Motivation Single-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed. Results With scds, we propose two new approaches for in silico doublet identification: Co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparably little computational cost. We observe appreciable differences between methods and across datasets and that no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds. Availability and implementation scds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds). Supplementary information Supplementary data are available at Bioinformatics online.
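The classification-based approach depends on a labelled training set in which artificial doublets are contrasted with the original cells. The sketch below builds such a set by summing the count vectors of randomly chosen cell pairs. Summing counts, the function names, and the fixed seed are assumptions for illustration. The abstract does not specify how scds constructs its artificial doublets, and the classifier itself is omitted.

```python
import random


def make_artificial_doublets(cells, n_doublets, seed=0):
    """Create artificial doublets by adding the counts of two distinct cells.

    cells: list of per-cell count vectors (all the same length).
    """
    if len(cells) < 2:
        raise ValueError("need at least two cells to form a doublet")
    rng = random.Random(seed)
    doublets = []
    for _ in range(n_doublets):
        # Pick two different cells without replacement.
        a, b = rng.sample(range(len(cells)), 2)
        doublets.append([x + y for x, y in zip(cells[a], cells[b])])
    return doublets


def labelled_training_set(cells, n_doublets, seed=0):
    """Return (features, labels) with 0 = original cell, 1 = artificial doublet."""
    doublets = make_artificial_doublets(cells, n_doublets, seed)
    features = cells + doublets
    labels = [0] * len(cells) + [1] * len(doublets)
    return features, labels


cells = [[1, 0, 2], [0, 3, 0], [4, 0, 0]]
features, labels = labelled_training_set(cells, n_doublets=2)
```

A binary classifier trained on `features` and `labels` can then score each original cell. Original cells that the classifier scores as doublet-like are the candidates for annotation.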


Blood ◽  
2016 ◽  
Vol 128 (22) ◽  
pp. 800-800
Author(s):  
Jens G Lohr ◽  
Sora Kim ◽  
Joshua Gould ◽  
Birgit Knoechel ◽  
Yotam Drier ◽  
...  

Abstract Continuous genomic evolution has been a major limitation to curative treatment of multiple myeloma (MM). Frequent monitoring of the genetic heterogeneity in MM from blood, rather than serial bone marrow (BM) biopsies, would therefore be desirable. We hypothesized that genomic characterization of circulating MM cells (CMMCs) recapitulates the genetics of MM in BM biopsies, enables MM classification, and is feasible in the majority of MM patients with active disease. Methods: To test these hypotheses, we developed a method to enrich, purify and isolate single CMMCs with a sensitivity of at least 1:10^5. We then performed DNA- and RNA-sequencing of single CMMCs and compared them to single BM-derived MM cells. We determined CMMC numbers in 24 randomly selected MM patient samples and compared them to numbers of circulating MM cells obtained by flow cytometry. We performed single-cell whole genome amplification of single cells from 10 MM patients, and targeted sequencing of the 35 most recurrently mutated loci in MM. A total of 568 single primary cells representing CMMCs, BM MM cells, CD19+ B lymphocytes, CD45+CD138- WBC from these patients were subjected to DNA-sequencing. By processing 80 single cells from four MM cell lines with known mutations we determined the mean sensitivity of mutation detection in single cells to be 93 ± 9%. In addition to DNA-sequencing we also isolated 57 single MM cells from the BM and peripheral blood of two MM patients and performed whole transcriptome single cell RNA-sequencing. Results: In 24 randomly selected MM patient samples we detected >12 CMMCs per 1 ml of blood in all 24 patients. In comparison, by flow cytometry, we detected ≥10 CMMCs per 10^5 white blood cells in 10/24 cases (42%), ≥1 CMMC but ≤ 10 CMMCs in 13/24 cases (54%), and < 1 CTCs in 1/24 patients (4%).
Mutational analysis of 35 recurrently mutated loci in 335 high quality single MM cells from the blood and BM of 10 patients, including one MGUS patient, revealed the presence of a total of 12 mutations (in KRAS, NRAS, BRAF, IRF4 and TP53). All targeted mutations that were detected by clinical-grade genotyping of bulk BM were also detected in single cell analysis of CMMCs. While in most patients, the fraction of mutated single cells was similar between blood and BM, in three patients, the proportion of MM cells harboring TP53 R273C, BRAF G469A and NRAS G13D mutations was significantly higher in the blood than in the BM, suggesting a different clonal composition. We developed an analytical model to predict whether a genetic locus underwent loss of heterozygosity, using the distribution of known allelic fractions of previously described mutations in MM cell lines as a benchmark. In two patients who simultaneously harbored two mutations, we predicted a BRAF G469E and a KRAS G12C mutation to be heterozygous, whereas the loci harboring a TP53 R273C and a TP53 R280T mutation were predicted to be associated with LOH with high statistical confidence. Whole transcriptome single cell RNA-sequencing of 57 MM cells from the BM and peripheral blood of two patients showed >3,700 transcripts per cell. Single-cell RNA-sequencing allowed for a clear distinction between normal plasma cells and MM cells, either based on analysis of CD45, CD27, and CD56 alone, or by unsupervised hierarchical clustering of detected transcripts in single cells. In addition, single cell CMMC expression analysis could be used to infer the existence of key MM chromosomal translocations. For example, CCND1 and CCND3 were highly upregulated in single MM cells from the blood and BM of two patients, whose MM was found by FISH analysis to harbor a t(11;14) and a t(6;14) translocation, respectively. 
Conclusion: We demonstrate that extensive genomic characterization of MM is feasible from very small numbers of CMMCs with single cell resolution. Interrogation of single CMMCs faithfully reproduces the pattern of somatic mutations present in MM in the BM, identifies actionable oncogenes, and reveals if somatic mutated loci underwent loss of heterozygosity. Single CMMCs also reveal mutations that are not detectable in the BM either by single cell sequencing or clinical grade bulk sequencing. Single cell RNA-sequencing of CMMCs provides robust transcriptomic profiling, allowing for class-differentiation and inference of translocations in MM patients. Disclosures Raje: Amgen: Consultancy, Membership on an entity's Board of Directors or advisory committees; Celgene: Consultancy, Membership on an entity's Board of Directors or advisory committees; Takeda: Consultancy, Membership on an entity's Board of Directors or advisory committees; Merck: Membership on an entity's Board of Directors or advisory committees; Novartis: Consultancy, Membership on an entity's Board of Directors or advisory committees; Roche: Consultancy, Membership on an entity's Board of Directors or advisory committees; BMS: Consultancy, Membership on an entity's Board of Directors or advisory committees; AstraZeneca: Research Funding; Eli Lilly: Research Funding.


2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Xin Zhao ◽  
Shouguo Gao ◽  
Sachiko Kajigaya ◽  
Qingguo Liu ◽  
Zhijie Wu ◽  
...  

Abstract Objective Single cell methodology enables detection and quantification of transcriptional changes and unravelling dynamic aspects of the transcriptional heterogeneity not accessible using bulk sequencing approaches. We have applied single-cell RNA-sequencing (scRNA-seq) to fresh human bone marrow CD34+ cells and profiled 391 single hematopoietic stem/progenitor cells (HSPCs) from healthy donors to characterize lineage- and stage-specific transcription during hematopoiesis. Results Cells clustered into six distinct groups, which could be assigned to known HSPC subpopulations based on lineage specific genes. Reconstruction of differentiation trajectories in single cells revealed four committed lineages derived from HSCs, as well as dynamic expression changes underlying cell fate during early erythroid-megakaryocytic, lymphoid, and granulocyte-monocyte differentiation. A similar non-hierarchical pattern of hematopoiesis could be derived from analysis of published single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq), consistent with a sequential relationship between chromatin dynamics and regulation of gene expression during lineage commitment (first, altered chromatin conformation, then mRNA transcription). Computationally, we have reconstructed molecular trajectories connecting HSCs directly to four hematopoietic lineages. Integration of long noncoding RNA (lncRNA) expression from the same cells demonstrated mRNA transcriptome, lncRNA, and the epigenome were highly homologous in their pattern of gene activation and suppression during hematopoietic cell differentiation.


2017 ◽  
Vol 7 (1) ◽  
Author(s):  
Simone Rizzetto ◽  
Auda A. Eltahla ◽  
Peijie Lin ◽  
Rowena Bull ◽  
Andrew R. Lloyd ◽  
...  

Author(s):  
Cornelia Fuetterer ◽  
Thomas Augustin ◽  
Christiane Fuchs

Abstract The analysis of single-cell RNA sequencing data is of great importance in health research. It challenges data scientists, but has enormous potential in the context of personalized medicine. The clustering of single cells aims to detect different subgroups of cell populations within a patient in a data-driven manner. Some comparison studies denote single-cell consensus clustering (SC3), proposed by Kiselev et al. (Nat Methods 14(5):483–486, 2017), as the best method for classifying single-cell RNA sequencing data. SC3 includes Laplacian eigenmaps and a principal component analysis (PCA). Our proposal of unsupervised adapted single-cell consensus clustering (adaSC3) suggests replacing the linear PCA with diffusion maps, a non-linear method that takes the transition of single cells into account. We investigate the performance of adaSC3 in terms of accuracy on the data sets of the original source of SC3 as well as in a simulation study. A comparison of adaSC3 with SC3 as well as with related algorithms based on further alternative dimension reduction techniques shows a quite convincing behavior of adaSC3.


Blood ◽  
2018 ◽  
Vol 132 (Supplement 1) ◽  
pp. 995-995
Author(s):  
Vincent-Philippe Lavallee ◽  
Elham Azizi ◽  
Vaidotas Kiseliovas ◽  
Ignas Masilionis ◽  
Linas Mazutis ◽  
...  

Abstract Introduction: Acute myeloid leukemia (AML) evolution is a multistep process in which cells evolve from hematopoietic stem and progenitor cells (HSPCs) that acquire genetic anomalies, such as chromosomal rearrangements and mutations, which define distinct subgroups. Mutations in Nucleophosmin 1 (NPM1), which occur in ~30% of patients, are the most frequent subgroup-defining mutations in AML and appear to be a late driver event in this disease. Bulk RNA-sequencing studies have identified differentially expressed genes between AML subgroups, but they are uninformative of the composition of cell types populating each sample. Large-scale single-cell RNA sequencing (scRNA-seq) technologies now enable a detailed characterization of intratumoral heterogeneity, and could help to better understand the stepwise evolution from normal to malignant cells. Methods: Twelve primary human AML specimens from MSKCC and Quebec Leukemia Cell Bank, including 8 with NPM1 mutations, were included in this cohort. Cells were subjected to scRNA-seq using 10X Genomics Chromium Single Cell 3' protocols and libraries were sequenced on Illumina HiSeq or NovaSeq platforms. FASTQ files were processed using SEQC pipeline (Azizi E et al, Cell 2018), resulting in a carefully filtered count matrix of > 100,000 single cells (4877 to 11532 cells per sample). Results: Using Euclidean distance metrics and t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization, we explored the phenotypic overlap between samples and showed that leukemia cells from different patients were mostly dissimilar, suggesting inter-sample heterogeneity. However, samples with similar morphology and similar NPM1 mutational status were phenotypically closer (Fig A), as anticipated from bulk RNA-sequencing data (TCGA, NEJM 2013).
We partitioned cells into distinct clusters using Phenograph (Levine J et al, Cell 2015) (Fig B) and measured the diversity of samples per cluster using Shannon's entropy metric, revealing that mature cell types (B/plasma cells, T/NK and erythroid cells, Fig C), presumably excluded from the tumor bulk, are transcriptionally similar across samples. Most notably, the next most diverse cluster (C36), comprising 438 cells from 11/12 samples, contains cells with a HSPC-like phenotype, as suggested by i) highest correlation of the centroid of this cluster with HSC1 (lin-/CD133+/CD34dim) population from sorted bulk RNA-sequencing data (Novershtern N et al, Cell 2011), and ii) marked GSEA enrichment for stem cell signatures (top enrichment: Jaatinen_hematopoeitic_stem_cell_up, NES = 9.04, FDR q-val = 0). To study the extent to which NPM1 or other mutations drive heterogeneity in leukemia populations, we interrogated 3'-derived single-cell sequences for all recurrent mutations in AML and found that NPM1 gene has unique features (e.g. relatively high single-cell expression and 3' localization) that allow specific identification of mutations in 5 to 34% of cells per mutated sample. To control for the high frequency of false negatives caused by dropouts in scRNA-seq data, we normalized the abundance of mutated vs wild-type cells to provide an estimation of mutation frequency in different cell types (Fig D). As expected, NPM1 mutations were rare in B and T/NK lymphoid cells (also observed using RT-qPCR in sorted populations by Dvorakova D et al, Leuk Lymphoma 2013) and were found in the majority of leukemia and myeloid cells. Interestingly, these mutations were detected at various frequencies in erythroid cells, suggesting that NPM1 mutations are acquired in cells with different lineage commitment in different patients. 
Most notably, the HSPC-like cluster C36 also contained a subpopulation of cells that have acquired NPM1 mutations and are transcriptionally different from wild-type cells. Conclusion: This study presents a first comprehensive single-cell map of primary AML, and the first 3'-based interrogation of mutations in single cells. It led to the identification of phenotypically distinct cells presenting a HSPC-like expression profile which were sub-clonally harboring NPM1 mutations, providing the means to identify deregulated genes in these important leukemia subpopulations. Disclosures Levine: Epizyme: Patents & Royalties; Celgene: Consultancy, Research Funding; Janssen: Consultancy, Honoraria; Isoplexis: Equity Ownership; C4 Therapeutics: Equity Ownership; Prelude: Research Funding; Gilead: Honoraria; Imago: Equity Ownership; Novartis: Consultancy; Roche: Consultancy, Research Funding; Loxo: Consultancy, Equity Ownership; Qiagen: Equity Ownership, Membership on an entity's Board of Directors or advisory committees.
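The per-cluster sample diversity described in the abstract above can be illustrated with Shannon's entropy over the sample of origin of the cells in a cluster. The sketch is a generic version of the metric: the base-2 logarithm, the absence of any normalization for cluster or sample size, and the function name are assumptions, since the abstract does not give those details.

```python
import math
from collections import Counter


def shannon_entropy(sample_labels):
    """Shannon entropy (in bits) of the sample-of-origin labels in one cluster."""
    if not sample_labels:
        raise ValueError("cluster is empty")
    n = len(sample_labels)
    counts = Counter(sample_labels)
    # H = -sum(p * log2(p)) over the samples represented in the cluster.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# A cluster drawn from a single patient versus one shared evenly by two patients.
patient_specific = shannon_entropy(["s1", "s1", "s1", "s1"])
shared = shannon_entropy(["s1", "s2", "s1", "s2"])
```

A cluster made up of cells from one sample has zero entropy. Clusters that mix cells from many samples, such as the mature cell types and the HSPC-like cluster mentioned in the abstract, score higher.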

