RobustClone: A robust PCA method of tumor clone and evolution inference from single-cell sequencing data

Abstract Single-cell sequencing (SCS) data provide unprecedented insights into intratumoral heterogeneity. With SCS, we can better characterize clonal genotypes and build phylogenetic relationships of tumor cells/clones. However, high technical error rates introduce substantial noise into the genetic data, limiting the application of evolutionary tools to this large data reservoir. To recover the low-dimensional subspace of tumor subpopulations from error-prone SCS data in the presence of corrupted and/or missing elements, we developed an efficient computational framework, termed RobustClone, to recover the true genotypes of subclones based on the low-rank matrix factorization method of extended robust principal component analysis (RPCA) and reconstruct the subclonal evolutionary tree. RobustClone is a model-free method, fast and scalable to large-scale datasets. We conducted a set of systematic evaluations on simulated datasets and demonstrated that RobustClone outperforms state-of-the-art methods, both in accuracy and efficiency. We further validated RobustClone on 2 single-cell SNV and 2 single-cell CNV datasets and demonstrated that RobustClone could recover the genotype matrix and infer the subclonal evolution tree accurately under various scenarios. In particular, RobustClone revealed the spatial progression patterns of subclonal evolution on the large-scale 10X Genomics scCNV breast cancer dataset. RobustClone software is available at https://github.com/ucasdp/RobustClone.

RobustClone: a robust PCA method for tumor clone and evolution inference from single-cell sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa172 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3299-3306

Author(s):

Ziwei Chen ◽

Fuzhou Gong ◽

Lin Wan ◽

Liang Ma

Keyword(s):

Single Cell ◽

Large Scale ◽

Clonal Evolution ◽

Low Rank ◽

Supplementary Information ◽

Breast Cancer Dataset ◽

Sequencing Data ◽

Cancer Dataset ◽

Single Cell Sequencing ◽

Model Free

Abstract Motivation Single-cell sequencing (SCS) data provide unprecedented insights into intratumoral heterogeneity. With SCS, we can better characterize clonal genotypes and reconstruct phylogenetic relationships of tumor cells/clones. However, SCS data are often error-prone, making their computational analysis challenging. Results To infer the clonal evolution in tumors from the error-prone SCS data, we developed an efficient computational framework, termed RobustClone. It recovers the true genotypes of subclones based on the extended robust principal component analysis, a low-rank matrix decomposition method, and reconstructs the subclonal evolutionary tree. RobustClone is a model-free method, which can be applied to both single-cell single nucleotide variation (scSNV) and single-cell copy-number variation (scCNV) data. It is efficient and scalable to large-scale datasets. We conducted a set of systematic evaluations on simulated datasets and demonstrated that RobustClone outperforms state-of-the-art methods in large-scale data both in accuracy and efficiency. We further validated RobustClone on two scSNV and two scCNV datasets and demonstrated that RobustClone could recover the genotype matrix and infer the subclonal evolution tree accurately under various scenarios. In particular, RobustClone revealed the spatial progression patterns of subclonal evolution on the large-scale 10X Genomics scCNV breast cancer dataset. Availability and implementation RobustClone software is available at https://github.com/ucasdp/RobustClone. Supplementary information Supplementary data are available at Bioinformatics online.
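
The low-rank decomposition idea named in the abstract can be illustrated with a generic robust PCA sketch. This is not the RobustClone implementation: it is the standard principal component pursuit iteration (an augmented Lagrangian scheme), it ignores missing entries, which the extended RPCA in the paper is said to handle, and the names `rpca`, `soft_threshold`, `singular_value_threshold`, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def soft_threshold(X, tau):
    """Shrink every entry toward zero by tau (proximal map of the L1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Shrink the singular values by tau (proximal map of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def rpca(M, max_iter=1000, tol=1e-7):
    """Split M into a low-rank part L and a sparse part S with M ~ L + S."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))         # common default weight on the sparse term
    mu = m * n / (4.0 * np.abs(M).sum())   # common default penalty parameter
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                   # Lagrange multipliers
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S

# Toy data (hypothetical): 60 cells drawn from 3 clones over 40 sites,
# with 5% of the entries flipped to mimic genotyping errors.
rng = np.random.default_rng(0)
clones = rng.integers(0, 2, size=(3, 40))
labels = rng.integers(0, 3, size=60)
truth = clones[labels].astype(float)
flips = rng.random(truth.shape) < 0.05
noisy = np.where(flips, 1.0 - truth, truth)

L, S = rpca(noisy)
```

Here `L` plays the role of the clonal structure shared by cells of the same clone and `S` absorbs isolated errors; rounding `L` and grouping identical rows would be one possible, untested way to read off clone genotypes.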

A statistical test on single-cell data reveals widespread recurrent mutations in tumor evolution

10.1101/094722 ◽

2016 ◽

Cited By ~ 3

Author(s):

Jack Kuipers ◽

Katharina Jahn ◽

Benjamin J. Raphael ◽

Niko Beerenwinkel

Keyword(s):

Single Cell ◽

Large Scale ◽

Tumor Evolution ◽

Sequencing Data ◽

General Validity ◽

Genomic Deletions ◽

Single Cell Sequencing ◽

Statistical Framework ◽

Recurrent Mutations ◽

Complex Models

The infinite sites assumption, which states that every genomic position mutates at most once over the lifetime of a tumor, is central to current approaches for reconstructing mutation histories of tumors, but has never been tested explicitly. We developed a rigorous statistical framework to test the assumption with single-cell sequencing data. The framework accounts for the high noise and contamination present in such data. We found strong evidence for recurrent mutations at the same site in 8 out of 9 single-cell sequencing datasets from human tumors. Six cases involved the loss of earlier mutations, five of which occurred at sites unaffected by large scale genomic deletions. Two cases exhibited parallel mutation, including the dataset with the strongest evidence of recurrence. Our results refute the general validity of the infinite sites assumption and indicate that more complex models are needed to adequately quantify intra-tumor heterogeneity.
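
For intuition about what a violation of the infinite sites assumption looks like in data, the following minimal sketch applies the classical three-gamete condition to an error-free binary cell-by-site matrix with an unmutated root. It is not the statistical test described above, which accounts for noise and contamination; the function name `infinite_sites_conflicts` and the example matrices are hypothetical.

```python
from itertools import combinations

def infinite_sites_conflicts(genotypes):
    """Return site pairs that cannot both have mutated exactly once.

    genotypes: list of rows (cells), each a list of 0/1 mutation calls.
    Assuming an all-zero root and no errors, a pair of sites is incompatible
    with a single-origin, no-loss history when the cells show all three
    patterns (0, 1), (1, 0) and (1, 1) at those two sites.
    """
    n_sites = len(genotypes[0])
    conflicts = []
    for i, j in combinations(range(n_sites), 2):
        seen = {(row[i], row[j]) for row in genotypes}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            conflicts.append((i, j))
    return conflicts

# Four cells, three sites: consistent with every site mutating once.
clean = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
]

# Adding a cell that carries both site 1 and site 2 creates a conflict:
# one of the two mutations must have arisen twice or been lost.
recurrent = clean + [[1, 1, 1]]

print(infinite_sites_conflicts(clean))      # []
print(infinite_sites_conflicts(recurrent))  # [(1, 2)]
```

On real single-cell data such conflicts also arise from false positives, false negatives, and doublets, which is why a noise-aware statistical framework is needed rather than this direct check.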

Single-cell data clustering based on sparse optimization and low-rank matrix factorization

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab098 ◽

2021 ◽

Author(s):

Yinlei Hu ◽

Bin Li ◽

Falai Chen ◽

Kun Qu

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Matrix Factorization ◽

Data Clustering ◽

Cell Types ◽

Low Rank ◽

Sequencing Data ◽

Rank Matrix ◽

Single Cell RNA Sequencing ◽

Low Rank Matrix

Abstract Unsupervised clustering is a fundamental step of single-cell RNA sequencing data analysis. This issue has inspired several clustering methods to classify cells in single-cell RNA sequencing data. However, accurate prediction of the cell clusters remains a substantial challenge. In this study, we propose a new algorithm for single-cell RNA sequencing data clustering based on Sparse Optimization and low-rank matrix factorization (scSO). We applied our scSO algorithm to analyze multiple benchmark datasets and showed that the cluster number predicted by scSO was close to the number of reference cell types and that most cells were correctly classified. Our scSO algorithm is available at https://github.com/QuKunLab/scSO. Overall, this study demonstrates a potent cell clustering approach that can help researchers distinguish cell types in single-cell RNA sequencing data.

484 Bioturing browser: interactively explore public single cell sequencing data

Journal for ImmunoTherapy of Cancer ◽

10.1136/jitc-2020-sitc2020.0484 ◽

2020 ◽

Vol 8 (Suppl 3) ◽

pp. A520-A520

Author(s):

Son Pham ◽

Tri Le ◽

Tan Phan ◽

Minh Pham ◽

Huy Nguyen ◽

...

Keyword(s):

Single Cell ◽

Immune Cell ◽

Expression Profiles ◽

Meta Analysis ◽

Cell Types ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Data Formats ◽

Cancer Types ◽

Cell Data

Background Single-cell sequencing technology has opened an unprecedented ability to interrogate cancer. It reveals significant insights into the intratumoral heterogeneity, metastasis, therapeutic resistance, which facilitates target discovery and validation in cancer treatment. With rapid advancements in throughput and strategies, a particular immuno-oncology study can produce multi-omics profiles for several thousands of individual cells. This overflow of single-cell data poses formidable challenges, including standardizing data formats across studies, performing reanalysis for individual datasets and meta-analysis. Methods N/A Results We present BioTuring Browser, an interactive platform for accessing and reanalyzing published single-cell omics data. The platform is currently hosting a curated database of more than 10 million cells from 247 projects, covering more than 120 immune cell types and subtypes, and 15 different cancer types. All data are processed and annotated with standardized labels of cell types, diseases, therapeutic responses, etc. to be instantly accessed and explored in a uniform visualization and analytics interface. Based on this massive curated database, BioTuring Browser supports searching similar expression profiles, querying a target across datasets and automatic cell type annotation. The platform supports single-cell RNA-seq, CITE-seq and TCR-seq data. BioTuring Browser is now available for download at www.bioturing.com. Conclusions N/A

Reference-free inference of tumor phylogenies from single-cell sequencing data

2014 IEEE 4th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS) ◽

10.1109/iccabs.2014.6863944 ◽

2014 ◽

Author(s):

Ayshwarya Subramanian ◽

Russell Schwartz

Keyword(s):

Single Cell ◽

Sequencing Data ◽

Single Cell Sequencing

m6A Regulator-Associated Methylation Modification Patterns Shape Tumor Microenvironment Characteristics in Hepatocellular Carcinoma

10.21203/rs.3.rs-778749/v1 ◽

2021 ◽

Author(s):

Bobin Ning ◽

Yonggan Xue ◽

Hongyi Liu ◽

Hongyu Sun ◽

Baoqing Jia

Keyword(s):

Hepatocellular Carcinoma ◽

Tumor Microenvironment ◽

Single Cell ◽

Signaling Pathways ◽

Principal Component ◽

Unsupervised Clustering ◽

Survival Prediction ◽

Multiple Perspectives ◽

Single Cell Sequencing ◽

Survival Expectations

Abstract Although substantial achievements in the tumor microenvironment (TME) of hepatocellular carcinoma (HCC) have led to fundamental improvements both in basic research and clinical management, the potential mechanisms and regulatory relationships between m6A regulators and the TME are still unknown. We first conducted unsupervised clustering on the samples according to the core m6A expression, then compared the signaling pathways, differentially expressed genes (DEGs), and TME between the m6A phenotypes, and re-validated the relationship between m6A regulators and TME by single-cell sequencing. Then, the geneClusters were obtained by another unsupervised clustering of the DEGs, and the clinical as well as TME traits were evaluated among the geneClusters. Finally, the m6A scores of individual patients were calculated by principal component analysis (PCA) to verify the correlation from multiple perspectives, including survival, clinical characteristics, mutations, TME, immunotherapy, and chemotherapy. Through a comprehensive analysis of 729 samples, we classified HCC patients into three m6A clusters and three geneClusters. Each group exhibited remarkable variations in terms of signaling pathways, clinical traits, and survival expectations. Notably, the m6A phenotypes corresponded to three different types of TME, namely immune-inflamed, immune-excluded, and immune-desert, respectively. In addition, the m6A regulators can accurately reflect the individualized microenvironment in HCC, and present the highest expression levels in the stromal microenvironment. Moreover, the m6A score system is able to make accurate predictions not only in terms of clinical traits, survival prediction, and TME mentioned above, but also in the sensitivity of HCC patients to immunotherapy and chemotherapy. This study revealed the uniqueness and pluripotency of m6A regulators in the TME of HCC by combining single-cell sequencing and bulk sequencing. The quantified m6A modification indices were able to accurately predict patient survival expectations, clinical traits, TME, and sensitivity to immunotherapy and chemotherapy.

Effective clustering for single cell sequencing cancer data

10.1101/586545 ◽

2019 ◽

Cited By ~ 2

Author(s):

Simone Ciccolella ◽

Murray Patterson ◽

Paola Bonizzoni ◽

Gianluca Della Vedova

Keyword(s):

Single Cell ◽

Categorical Data ◽

Euclidean Distance ◽

Missing Values ◽

False Negative ◽

Ground Truth ◽

Sequencing Data ◽

Large Space ◽

Cancer Data ◽

Single Cell Sequencing

Abstract Background Single cell sequencing (SCS) technologies provide a level of resolution that makes them indispensable for inferring, from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes some approaches and tools infeasible. While this has not inhibited the development of methods for inferring phylogenies from SCS data, the continuing increase in size and resolution of these data begins to put a strain on such methods. One possible solution is to reduce the size of an SCS instance (usually represented as a matrix of presence, absence and missing values of the mutations found in the different sequenced cells) and infer the tree from this reduced-size instance. Previous approaches have used k-means to this end, clustering groups of mutations and/or cells, and using these means as the reduced instance. Such an approach typically uses the Euclidean distance for computing means. However, since the values in these matrices are of a categorical nature (having the three categories: present, absent and missing), we explore applying techniques for clustering categorical data, commonly used in data mining and machine learning, to SCS data, with this goal in mind. Results In this work, we present a new clustering procedure, called celluloid, aimed at clustering categorical vector or matrix data, here representing SCS instances. We demonstrate that celluloid clusters mutations with high precision, never pairing too many mutations that are unrelated in the ground truth, and also obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by this method. Finally, we demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in the runtime and considerably raising the upper bound on the size of SCS instances which can be solved in practice. Availability Our approach, celluloid: clustering single cell sequencing data around centroids, is available at https://github.com/AlgoLab/celluloid/ under an MIT license.
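
To make the contrast with Euclidean means concrete, the sketch below shows one way to compare and summarize categorical mutation profiles that contain missing values. It is an illustration only and is not taken from celluloid: the encoding (`MISSING = 2`), the mismatch-count dissimilarity, and the per-position mode are assumptions in the spirit of mode-based clustering of categorical data.

```python
from collections import Counter

MISSING = 2  # assumed encoding: 0 = absent, 1 = present, 2 = missing

def mismatch_distance(a, b):
    """Count positions where both profiles are observed and disagree."""
    return sum(
        1 for x, y in zip(a, b)
        if x != MISSING and y != MISSING and x != y
    )

def mode_centroid(profiles):
    """Summarize a cluster by the most common observed value per position.

    Unlike an arithmetic mean, the result stays within the categories;
    a position with no observed value remains MISSING.
    """
    centroid = []
    for column in zip(*profiles):
        observed = [v for v in column if v != MISSING]
        if observed:
            centroid.append(Counter(observed).most_common(1)[0][0])
        else:
            centroid.append(MISSING)
    return centroid

# Two mutation profiles across four cells: only the second cell is a
# genuine disagreement, the missing calls are not counted against them.
print(mismatch_distance([1, 0, 2, 1], [1, 1, 0, 2]))       # 1

# A cluster of three profiles across three cells.
print(mode_centroid([[1, 0, 2], [1, 2, 2], [0, 0, 2]]))    # [1, 0, 2]
```

A mean of the encoded values would treat "missing" as a number larger than "present", which is the kind of artifact a categorical dissimilarity avoids.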

K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

10.1101/723833 ◽

2019 ◽

Author(s):

Christina Huan Shi ◽

Kevin Y. Yip

Keyword(s):

Single Cell ◽

State Of The Art ◽

RNA Seq ◽

Sequencing Data ◽

Memory Consumption ◽

Analysis Pipeline ◽

Cell Clusters ◽

Single Cell Sequencing ◽

Sequencing Errors ◽

Full Analysis

Abstract K-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second best method, but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used for a preview of cell clusters for an early detection of potential data problems, before running a much more time-consuming full analysis pipeline.
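
The problem that false k-mers pose can be seen in a few lines of plain Python. This sketch is not CQF-deNoise: it uses an ordinary in-memory counter rather than a compact filter, and it removes low-count k-mers only after counting rather than dynamically. The function names, the `min_count` threshold, and the toy reads are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def drop_rare(counts, min_count=2):
    """Keep only k-mers seen at least min_count times.

    K-mers that overlap a sequencing error tend to occur once,
    so a simple count threshold removes most of them.
    """
    return {kmer: c for kmer, c in counts.items() if c >= min_count}

# Two identical reads and one read with a single-base error (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, k=4)
kept = drop_rare(counts, min_count=2)

print(len(counts))   # 6 distinct k-mers, 3 of them caused by the error
print(sorted(kept))  # ['ACGT', 'CGTA', 'GTAC']
```

In this toy case one erroneous base doubles the number of distinct k-mers that must be stored, which illustrates why removing false k-mers during counting can reduce memory use.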
