Quasi-universality in single-cell sequencing data

Mapping Intimacies ◽

10.1101/426239 ◽

2018 ◽

Cited By ~ 2

Author(s):

Luis Aparicio ◽

Mykola Bordyuh ◽

Andrew J. Blumberg ◽

Raul Rabadan

Keyword(s):

Single Cell ◽

Matrix Theory ◽

Biological Information ◽

Sequencing Data ◽

Data Set ◽

Single Cell Sequencing ◽

Marked Cell ◽

Eigenvector Localization ◽

Cell Data ◽

Epigenetic Processes

Abstract: The development of single-cell technologies provides the opportunity to identify new cellular states and reconstruct novel cell-to-cell relationships. Applications range from understanding the transcriptional and epigenetic processes involved in metazoan development to characterizing distinct cell types in heterogeneous populations like cancers or immune cells. However, analysis of the data is impeded by its unknown intrinsic biological and technical variability together with its sparseness; these factors complicate the identification of true biological signals amidst artifact and noise. Here we show that, across technologies, roughly 95% of the eigenvalues derived from each single-cell data set can be described by universal distributions predicted by Random Matrix Theory. Interestingly, 5% of the spectrum shows deviations from these distributions and presents a phenomenon known as eigenvector localization, where information tightly concentrates in groups of cells. Some of the localized eigenvectors reflect underlying biological signal, and some are simply a consequence of the sparsity of single-cell data; roughly 3% is artifactual. Based on the universal distributions and a technique for detecting sparsity-induced localization, we present a strategy to identify the residual 2% of directions that encode biological information and thereby denoise single-cell data. We demonstrate the effectiveness of this approach by comparing with standard single-cell data analysis techniques in a variety of examples with marked cell populations.
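As a rough illustration of how a spectrum can be compared against a universal distribution, the sketch below uses the Marchenko-Pastur law, a standard Random Matrix Theory result for the eigenvalues of a sample covariance matrix built from independent noise. It is an assumption that this is a suitable null model here; the abstract does not name the distributions used, and the function names, the unit-variance default, and the example eigenvalues are all hypothetical. Eigenvalues outside the predicted bulk are only candidates for signal, since the abstract notes that some deviations come from sparsity alone.

```python
import math

def marchenko_pastur_edges(n_samples, n_features, sigma2=1.0):
    """Return (lower, upper) edges of the Marchenko-Pastur bulk.

    Assumes a sample covariance matrix X^T X / n_samples built from an
    n_samples x n_features matrix of i.i.d. noise with variance sigma2.
    """
    gamma = n_features / n_samples
    lower = sigma2 * (1.0 - math.sqrt(gamma)) ** 2
    upper = sigma2 * (1.0 + math.sqrt(gamma)) ** 2
    return lower, upper

def split_spectrum(eigenvalues, n_samples, n_features, sigma2=1.0):
    """Separate eigenvalues inside the predicted noise bulk from outliers."""
    lower, upper = marchenko_pastur_edges(n_samples, n_features, sigma2)
    bulk = [ev for ev in eigenvalues if lower <= ev <= upper]
    outliers = [ev for ev in eigenvalues if ev < lower or ev > upper]
    return bulk, outliers

# Hypothetical eigenvalues for a 400 x 100 data matrix (gamma = 0.25).
bulk, outliers = split_spectrum([0.3, 1.0, 2.0, 5.0], n_samples=400, n_features=100)
print(len(bulk), outliers)
```

With 400 samples and 100 features the predicted bulk is [0.25, 2.25], so only the eigenvalue 5.0 is flagged. A full analysis would also need a test for sparsity-induced localization before treating an outlier as biological signal.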

484 Bioturing browser: interactively explore public single cell sequencing data

Journal for ImmunoTherapy of Cancer ◽

10.1136/jitc-2020-sitc2020.0484 ◽

2020 ◽

Vol 8 (Suppl 3) ◽

pp. A520-A520

Author(s):

Son Pham ◽

Tri Le ◽

Tan Phan ◽

Minh Pham ◽

Huy Nguyen ◽

...

Keyword(s):

Single Cell ◽

Immune Cell ◽

Expression Profiles ◽

Meta Analysis ◽

Cell Types ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Data Formats ◽

Cancer Types ◽

Cell Data

Background: Single-cell sequencing technology has opened an unprecedented ability to interrogate cancer. It reveals significant insights into intratumoral heterogeneity, metastasis, and therapeutic resistance, which facilitates target discovery and validation in cancer treatment. With rapid advancements in throughput and strategies, a particular immuno-oncology study can produce multi-omics profiles for several thousands of individual cells. This overflow of single-cell data poses formidable challenges, including standardizing data formats across studies and performing reanalysis of individual datasets as well as meta-analysis. Methods: N/A. Results: We present BioTuring Browser, an interactive platform for accessing and reanalyzing published single-cell omics data. The platform currently hosts a curated database of more than 10 million cells from 247 projects, covering more than 120 immune cell types and subtypes, and 15 different cancer types. All data are processed and annotated with standardized labels of cell types, diseases, therapeutic responses, etc., so that they can be instantly accessed and explored in a uniform visualization and analytics interface. Based on this massive curated database, BioTuring Browser supports searching for similar expression profiles, querying a target across datasets, and automatic cell type annotation. The platform supports single-cell RNA-seq, CITE-seq and TCR-seq data. BioTuring Browser is now available for download at www.bioturing.com. Conclusions: N/A.

Cellsnp-lite: an efficient tool for genotyping single cells

10.1101/2020.12.31.424913 ◽

2021 ◽

Author(s):

Xianjie Huang ◽

Yuanhua Huang

Keyword(s):

Single Cell ◽

Single Cells ◽

Basic Research ◽

Substantial Improvement ◽

Data Sets ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Memory Efficiency ◽

Computational Speed ◽

Cell Data

Abstract Summary: Single-cell sequencing is an increasingly used technology with promising applications in basic research and clinical translation. However, genotyping methods developed for bulk sequencing data have not been well adapted for single-cell data, in terms of both computational parallelization and a simplified user interface. Here we introduce cellsnp-lite, a software tool implemented in C/C++ and based on the well-supported package htslib, for genotyping in single-cell sequencing data from both droplet-based and well-based platforms. On various experimental data sets, it shows substantial improvement in computational speed and memory efficiency while retaining highly concordant results compared to existing methods. Cellsnp-lite therefore lightens the genetic analysis of increasingly large single-cell data. Availability: The source code is freely available at https://github.com/single-cell-genetics/[email protected]

SCeQTL: an R package for identifying eQTL from single-cell parallel sequencing data

10.1101/499863 ◽

2018 ◽

Cited By ~ 3

Author(s):

Yue Hu ◽

Xuegong Zhang

Keyword(s):

Gene Expression ◽

Single Cell ◽

Negative Binomial ◽

Negative Binomial Regression ◽

R Package ◽

Eqtl Analysis ◽

Sequencing Data ◽

Parallel Sequencing ◽

Single Cell Sequencing ◽

Cell Data

With the development of single-cell sequencing technologies, parallel sequencing of the transcriptome and genome is becoming available and will bring the opportunity to uncover associations between genotype and phenotype at the single-cell level. Due to the special characteristics of single-cell sequencing data, new methods are needed to identify eQTL from single-cell data. We developed an R package, SCeQTL, that uses zero-inflated negative binomial regression to perform eQTL analysis on single-cell data. It can distinguish two types of gene-expression differences among different genotype groups. It can also be used for finding gene expression variations associated with other grouping factors, such as cell lineages or cell types.
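To make the zero-inflated negative binomial model concrete, here is a minimal Python sketch of its probability mass function under a common mean/size parameterization. It is not SCeQTL's code or API (the package itself is written in R); the function names and parameter values are hypothetical. The model mixes a point mass at zero (probability `pi_zero`) with an ordinary negative binomial count, which is one way to represent the excess zeros typical of single-cell expression data.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, with mean mu > 0 and
    dispersion parameter size > 0 (variance = mu + mu**2 / size)."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)

def zinb_pmf(k, mu, size, pi_zero):
    """Zero-inflated negative binomial: with probability pi_zero the count
    is a structural zero, otherwise it is drawn from nb_pmf."""
    base = (1.0 - pi_zero) * nb_pmf(k, mu, size)
    return pi_zero + base if k == 0 else base

# Hypothetical parameters: zero inflation raises P(count = 0).
print(nb_pmf(0, mu=2.0, size=1.5), zinb_pmf(0, mu=2.0, size=1.5, pi_zero=0.3))
```

In a regression setting, `mu` and `pi_zero` would each depend on genotype through a link function, so a genotype effect can show up in the zero fraction, in the non-zero expression level, or in both.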

Phenotype-guided subpopulation identification from single-cell sequencing data

10.1101/2020.06.05.137240 ◽

2020 ◽

Author(s):

Duanchen Sun ◽

Xiangnan Guan ◽

Amy E. Moran ◽

David Z. Qian ◽

Pepper Schedin ◽

...

Keyword(s):

Lung Cancer ◽

Single Cell ◽

Clinical Information ◽

Single Step ◽

Cell Subpopulation ◽

Clustering Methods ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Cell Subpopulations ◽

Cell Data

Abstract: Single-cell sequencing yields novel discoveries by distinguishing cell types, states and lineages within the context of heterogeneous tissues. However, interpreting complex single-cell data from highly heterogeneous cell populations remains challenging. Currently, most existing single-cell data analyses focus on cell type clusters defined by unsupervised clustering methods, which cannot directly link cell clusters with specific biological and clinical phenotypes. Here we present Scissor, a novel approach that utilizes disease phenotypes to identify cell subpopulations from single-cell data that most highly correlate with a given phenotype. This “phenotype-to-cell within a single step” strategy enables the utilization of a large amount of clinical information that has been collected for bulk assays to identify the most highly phenotype-associated cell subpopulations. When applied to a lung cancer single-cell RNA-seq (scRNA-seq) dataset, Scissor identified a subset of cells exhibiting high hypoxia activities, which predicted worse survival outcomes in lung cancer patients. Furthermore, in a melanoma scRNA-seq dataset, Scissor discerned a T cell subpopulation with low PDCD1/CTLA4 and high TCF7 expressions, which is associated with a favorable immunotherapy response. Thus, Scissor provides a novel framework to identify the biologically and clinically relevant cell subpopulations from single-cell assays by leveraging the wealth of phenotypes and bulk-omics datasets.

Measuring the Information Obtained from a Single-Cell Sequencing Experiment

10.1101/2020.10.01.322255 ◽

2020 ◽

Author(s):

Michael J. Casey ◽

Rubén J. Sánchez-García ◽

Ben D. MacArthur

Keyword(s):

Single Cell ◽

Expression Patterns ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Sequencing Data ◽

Data Set ◽

Single Cell Sequencing ◽

Amount Of Information ◽

Formal Framework

Abstract: Single-cell sequencing (sc-Seq) experiments are producing increasingly large data sets. However, large data sets do not necessarily contain large amounts of information. Here, we introduce a formal framework for assessing the amount of information obtained from a sc-Seq experiment, which can be used throughout the sc-Seq analysis pipeline, including for quality control, feature selection and cluster evaluation. We illustrate this framework with some simple examples, including using it to quantify the amount of information in a single-cell sequencing data set that is explained by a proposed clustering, and thereby to determine cluster quality. Our information-theoretic framework provides a formal way to assess the quality of data obtained from sc-Seq experiments and the effectiveness of analyses performed, with wide implications for our understanding of variability in gene expression patterns within heterogeneous cell populations.

Conifer: clonal tree inference for tumor heterogeneity with single-cell and bulk sequencing data

BMC Bioinformatics ◽

10.1186/s12859-021-04338-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Leila Baghaarabani ◽

Sama Goliaei ◽

Mohammad-Hadi Foroughmand-Araabi ◽

Seyed Peyman Shariatpanahi ◽

Bahram Goliaei

Keyword(s):

Single Cell ◽

Tumor Heterogeneity ◽

Temporal Order ◽

Variant Allele ◽

Evolutionary Relationships ◽

Sequencing Data ◽

Variant Allele Frequency ◽

Single Cell Sequencing ◽

Tree Inference ◽

Cell Data

Abstract Background: The genetic heterogeneity that a tumor develops during clonal evolution is one of the reasons for cancer treatment failure, because it increases the chance of drug resistance. Clones are cell populations with different genotypes, resulting from differences in somatic mutations that occur and accumulate during cancer development. An appropriate approach for identifying clones is determining the variant allele frequency of mutations that occurred in the tumor. Although bulk sequencing data can be used to provide that information, the frequencies are not informative enough for identifying different clones with the same prevalence and their evolutionary relationships. On the other hand, single-cell sequencing data provides valuable information about branching events in the evolution of a cancerous tumor. However, the temporal order of mutations may be determined only with ambiguities when using single-cell data alone, while variant allele frequencies from bulk sequencing data can help infer the temporal order of mutations with fewer ambiguities. Results: In this study, a new method called Conifer (ClONal tree Inference For hEterogeneity of tumoR) is proposed, which combines aggregated variant allele frequency from bulk sequencing data with branching event information from single-cell sequencing data to more accurately identify clones and their evolutionary relationships. On various sets of simulated data, Conifer is shown to increase the accuracy of clone identification and clonal tree inference compared to other existing methods. In addition, we discuss how the evolutionary tree provided by Conifer on real cancer data sets is highly consistent with information in both bulk and single-cell data. Conclusions: In this study, we have provided an accurate and robust method to identify clones of tumor heterogeneity and their evolutionary history by combining single-cell and bulk sequencing data.
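The variant allele frequency mentioned above is the fraction of reads at a site that carry the variant. The Python sketch below computes it and groups mutations with similar frequencies, which illustrates why bulk data alone cannot separate clones of equal prevalence. This is not Conifer's algorithm; the function names, the tolerance, and the read counts are hypothetical.

```python
def variant_allele_frequency(alt_reads, ref_reads):
    """Fraction of reads covering a site that support the variant allele."""
    depth = alt_reads + ref_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / depth

def group_by_vaf(mutations, tolerance=0.05):
    """Greedily group mutations whose VAF lies within `tolerance` of the
    first (lowest-VAF) member of the current group.

    mutations: dict mapping mutation name -> (alt_reads, ref_reads).
    """
    vafs = sorted(
        (variant_allele_frequency(alt, ref), name)
        for name, (alt, ref) in mutations.items()
    )
    groups, anchor = [], None
    for vaf, name in vafs:
        if anchor is None or vaf - anchor > tolerance:
            groups.append([name])  # start a new group at this VAF
            anchor = vaf
        else:
            groups[-1].append(name)
    return groups

# Hypothetical read counts: m1 and m2 are indistinguishable by VAF alone.
print(group_by_vaf({"m1": (30, 70), "m2": (31, 69), "m3": (60, 40)}))
```

Mutations m1 and m2 fall into one group even if they belong to different clones, which is the ambiguity that branching information from single-cell data is meant to resolve.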

Reference-free inference of tumor phylogenies from single-cell sequencing data

2014 IEEE 4th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS) ◽

10.1109/iccabs.2014.6863944 ◽

2014 ◽

Author(s):

Ayshwarya Subramanian ◽

Russell Schwartz

Keyword(s):

Single Cell ◽

Sequencing Data ◽

Single Cell Sequencing

Effective clustering for single cell sequencing cancer data

10.1101/586545 ◽

2019 ◽

Cited By ~ 2

Author(s):

Simone Ciccolella ◽

Murray Patterson ◽

Paola Bonizzoni ◽

Gianluca Della Vedova

Keyword(s):

Single Cell ◽

Categorical Data ◽

Euclidean Distance ◽

Missing Values ◽

False Negative ◽

Ground Truth ◽

Sequencing Data ◽

Large Space ◽

Cancer Data ◽

Single Cell Sequencing

Abstract Background: Single cell sequencing (SCS) technologies provide a level of resolution that makes them indispensable for inferring, from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes some approaches and tools infeasible. While this has not inhibited the development of methods for inferring phylogenies from SCS data, the continuing increase in size and resolution of these data begins to put a strain on such methods. One possible solution is to reduce the size of an SCS instance (usually represented as a matrix of presence, absence and missing values of the mutations found in the different sequenced cells) and infer the tree from this reduced-size instance. Previous approaches have used k-means to this end, clustering groups of mutations and/or cells, and using these means as the reduced instance. Such an approach typically uses the Euclidean distance for computing means. However, since the values in these matrices are of a categorical nature (having the three categories: present, absent and missing), we explore applying techniques for clustering categorical data, commonly used in data mining and machine learning, to SCS data, with this goal in mind. Results: In this work, we present celluloid, a new procedure for clustering categorical vector or matrix data, here representing SCS instances. We demonstrate that celluloid clusters mutations with high precision, never pairing too many mutations that are unrelated in the ground truth, and also obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by this method. Finally, we demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in the runtime, raising considerably the upper bound on the size of SCS instances which can be solved in practice. Availability: Our approach, celluloid: clustering single cell sequencing data around centroids, is available at https://github.com/AlgoLab/celluloid/ under an MIT license.
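The contrast between Euclidean means and categorical clustering can be sketched with the two building blocks of a k-modes-style method: a mismatch count in place of Euclidean distance, and a per-column mode in place of a mean. This Python sketch is illustrative only and is not celluloid's actual dissimilarity or code; the integer encoding of present, absent and missing, and the function names, are assumptions.

```python
from collections import Counter

# Hypothetical encoding of the three categories in an SCS matrix.
ABSENT, PRESENT, MISSING = 0, 1, 2

def mismatch_distance(a, b):
    """Number of positions where two categorical vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def mode_centroid(rows):
    """Column-wise most frequent category; the categorical analogue of a mean.

    Ties go to whichever category appears first in the column.
    """
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]

cluster = [
    [PRESENT, ABSENT, MISSING],
    [PRESENT, ABSENT, ABSENT],
    [PRESENT, PRESENT, ABSENT],
]
centroid = mode_centroid(cluster)
print(centroid, [mismatch_distance(row, centroid) for row in cluster])
```

A Euclidean mean of these rows would produce fractional values such as 0.67 that correspond to no category, whereas the mode stays within the three allowed values.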

K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

10.1101/723833 ◽

2019 ◽

Author(s):

Christina Huan Shi ◽

Kevin Y. Yip

Keyword(s):

Single Cell ◽

State Of The Art ◽

Rna Seq ◽

Sequencing Data ◽

Memory Consumption ◽

Analysis Pipeline ◽

Cell Clusters ◽

Single Cell Sequencing ◽

Sequencing Errors ◽

Full Analysis

Abstract: K-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second best method, but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used for a preview of cell clusters for an early detection of potential data problems, before running a much more time-consuming full analysis pipeline.
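For readers unfamiliar with k-mer counting, the Python sketch below counts every length-k substring of a set of reads and then drops k-mers seen fewer than a threshold number of times, on the assumption that very rare k-mers are more likely to come from sequencing errors. It is a plain dictionary-based illustration, not the counting quotient filter or the dynamic removal scheme of CQF-deNoise; the function names, the threshold, and the reads are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count all overlapping substrings of length k across the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def drop_rare_kmers(counts, min_count=2):
    """Keep only k-mers observed at least min_count times."""
    return Counter({kmer: c for kmer, c in counts.items() if c >= min_count})

# Two hypothetical reads that differ at one base.
counts = count_kmers(["ACGTAC", "ACGTTC"], k=3)
print(drop_rare_kmers(counts))
```

Filtering after counting, as done here, still stores every false k-mer at peak; removing them during counting is what would reduce the memory footprint.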

scTree: An R package to generate antibody-compatible classifiers from single-cell sequencing data

The Journal of Open Source Software ◽

10.21105/joss.02061 ◽

2020 ◽

Vol 5 (48) ◽

pp. 2061

Author(s):

J. Paez ◽

Michael Wendt ◽

Nadia Lanman

Keyword(s):

Single Cell ◽

R Package ◽

Sequencing Data ◽

Single Cell Sequencing