gpps: an ILP-based approach for inferring cancer progression with mutation losses from single cell data

Abstract. Background: Cancer progression reconstruction is an important development stemming from the phylogenetics field. In this context, the reconstruction of the phylogeny representing the evolutionary history presents some peculiar aspects that depend on the technology used to obtain the data to analyze: Single Cell DNA Sequencing data have great specificity, but are affected by moderate false negative and missing value rates. Moreover, there has been some recent evidence of back mutations in cancer: this phenomenon is currently widely ignored. Results: We present a new tool, gpps, that reconstructs a tumor phylogeny from Single Cell Sequencing data, allowing each mutation to be lost at most a fixed number of times. The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gpps. Conclusions: gpps provides new insights to the analysis of intra-tumor heterogeneity by proposing a new progression model to the field of cancer phylogeny reconstruction on Single Cell data.

gpps: An ILP-based approach for inferring cancer progression with mutation losses from single cell data

10.1101/365635 ◽

2018 ◽

Cited By ~ 1

Author(s):

Simone Ciccolella ◽

Mauricio Soto Gomez ◽

Murray Patterson ◽

Gianluca Della Vedova ◽

Iman Hajirasouliha ◽

...

Keyword(s):

Single Cell ◽

Open Source ◽

Computational Methods ◽

Cancer Progression ◽

Fixed Number ◽

Inference Problem ◽

Fundamental Feature ◽

Single Cell Sequencing ◽

Cell Data ◽

Tumor Phylogeny

Abstract. Motivation: In recent years, the well-known Infinite Sites Assumption (ISA) has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progression where mutations are accumulated through histories. However, some recent studies leveraging Single Cell Sequencing (SCS) techniques have shown evidence of mutation losses in several tumor samples [19], making the inference problem harder. Results: We present a new tool, gpps, that reconstructs a tumor phylogeny from single cell data, allowing each mutation to be lost at most a fixed number of times. Availability: The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gppf.
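To make the role of the ISA concrete, here is a minimal sketch of the classical pairwise conflict test on a binary cell-by-mutation matrix. With error-free data and an all-zero root, a perfect phylogeny exists exactly when no two mutations show all three patterns (1,1), (1,0) and (0,1) across cells; a conflict therefore signals either noise or an ISA violation such as a mutation loss. The function name and the toy matrix are illustrative assumptions and are not taken from gpps.

```python
from itertools import combinations


def conflicting_pairs(matrix):
    """Return pairs of mutation (column) indices that break the three-gamete rule.

    matrix: list of rows (one per cell), each a list of 0/1 values
    (one per mutation). Missing values are not handled in this sketch.
    """
    n_mut = len(matrix[0]) if matrix else 0
    conflicts = []
    for a, b in combinations(range(n_mut), 2):
        # Collect the distinct (value_a, value_b) patterns seen across cells.
        seen = {(row[a], row[b]) for row in matrix}
        if {(1, 1), (1, 0), (0, 1)} <= seen:
            conflicts.append((a, b))
    return conflicts


# Toy matrix (hypothetical data): 4 cells x 3 mutations.
cells = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
]
result = conflicting_pairs(cells)
print(result)
```

In this toy input only mutations 0 and 1 conflict. A model that allows losses could explain such a pair without treating any entry as an error, which is the situation the abstract describes.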

Profiles of expressed mutations in single cells reveal patterns of tumor evolution and therapeutic impact of intratumor heterogeneity

10.1101/2021.03.26.437185 ◽

2021 ◽

Author(s):

Farid Rashidi Mehrabadi ◽

Kerrie L. Marie ◽

Eva Perez-Guijarro ◽

Salem Malikic ◽

Erfan Sadeqi Azer ◽

...

Keyword(s):

Single Cell ◽

Expression Profiles ◽

Single Cells ◽

Phylogeny Reconstruction ◽

Intratumor Heterogeneity ◽

Tumor Evolution ◽

Sequencing Data ◽

Genomic Alterations ◽

Reconstruction Methods ◽

Tumor Phylogeny

Advances in single cell RNA sequencing (scRNAseq) technologies uncovered an unexpected complexity in solid tumors, underlining the relevance of intratumor heterogeneity for cancer progression and therapeutic resistance. Heterogeneity in the mutational composition of cancer cells is well captured by tumor phylogenies, which demonstrate how distinct cell populations evolve, and, e.g. develop metastatic potential or resistance to specific treatments. Unfortunately, because of their low read coverage per cell, mutation calls that can be made from scRNAseq data are very sparse and noisy. Additionally, available tumor phylogeny reconstruction methods cannot computationally handle a large number of cells and mutations present in typical scRNAseq datasets. Finally, there are no principled methods to assess distinct subclones observed in inferred tumor phylogenies and the genomic alterations that seed them. Here we present Trisicell, a computational toolkit for scalable tumor phylogeny reconstruction and evaluation from scRNAseq as well as single cell genome or exome sequencing data. Trisicell allows the identification of reliable subtrees of a tumor phylogeny, offering the ability to focus on the most important subclones and the genomic alterations that are associated with subclonal proliferation. We comprehensively assessed Trisicell on a melanoma model by comparing the phylogeny it builds using scRNAseq data, to those using matching bulk whole exome (bWES) and transcriptome (bWTS) sequencing data from clonal sublines derived from single cells. Our results demonstrate that tumor phylogenies based on mutation calls from scRNAseq data can be robustly inferred and evaluated by Trisicell. We also applied Trisicell to reconstruct and evaluate the phylogeny it builds using scRNAseq data from melanomas of the same mouse model after treatment with immune checkpoint blockade (ICB). 
After integratively analyzing our cell-specific mutation calls with their expression profiles, we observed that each subclone with a distinct set of novel somatic mutations is strongly associated with a distinct developmental status. Moreover, each subclone had developed a specific ICB-resistance mechanism. These results demonstrate that Trisicell can robustly utilize scRNAseq data to delineate intratumoral heterogeneity and tumor evolution.

PhISCS - A Combinatorial Approach for Sub-perfect Tumor Phylogeny Reconstruction via Integrative use of Single Cell and Bulk Sequencing Data

10.1101/376996 ◽

2018 ◽

Cited By ~ 9

Author(s):

Salem Malikic ◽

Simone Ciccolella ◽

Farid Rashidi Mehrabadi ◽

Camir Ricketts ◽

Khaledur Rahman ◽

...

Keyword(s):

Single Cell ◽

Phylogeny Reconstruction ◽

Tumor Evolution ◽

Sequencing Data ◽

Perfect Phylogeny ◽

Sequence Coverage ◽

Allele Dropout ◽

Single Cell Sequencing ◽

First Time ◽

Tumor Phylogeny

Abstract. Recent technological advances in single cell sequencing (SCS) provide high resolution data for studying intra-tumor heterogeneity and tumor evolution. Available computational methods for tumor phylogeny inference via SCS typically aim to identify the most likely perfect phylogeny tree satisfying the infinite sites assumption (ISA). However, limitations of SCS technologies, such as frequent allele dropout or highly variable sequence coverage, commonly result in mutational call errors and prohibit a perfect phylogeny. In addition, ISA violations are commonly observed in tumor phylogenies due to the loss of heterozygosity, deletions and convergent evolution. In order to address such limitations, we, for the first time, introduce a new combinatorial formulation that integrates single cell sequencing data with matching bulk sequencing data, with the objective of minimizing a linear combination of (i) potential false negatives (due to e.g. allele dropout or variance in sequence coverage) and (ii) potential false positives (due to e.g. read errors) among mutation calls, as well as (iii) the number of mutations that violate ISA, to define the optimal sub-perfect phylogeny. Our formulation ensures that several lineage constraints imposed by the use of variant allele frequencies (VAFs, derived from bulk sequence data) are satisfied. We express our formulation both in the form of an integer linear program (ILP) and, for the first time in the context of tumor phylogeny reconstruction, a boolean constraint satisfaction problem (CSP), and solve them by leveraging state-of-the-art ILP/CSP solvers. The resulting method, which we name PhISCS, is the first to integrate SCS and bulk sequencing data under the finite sites model. Using several simulated and real SCS data sets, we demonstrate that PhISCS is not only more general but also more accurate than the alternative tumor phylogeny inference tools. PhISCS is very fast, especially when its CSP-based variant is used, and returns the optimal solution, except in rare instances for which it provides an optimality gap. PhISCS is available at https://github.com/haghshenas/PhISCS.
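The three-term objective described above can be illustrated by scoring one candidate solution. The sketch below only evaluates a weighted sum of false-negative corrections (0 to 1 flips), false-positive corrections (1 to 0 flips) and eliminated mutations; it does not solve the ILP or CSP, and it ignores the VAF constraints. The function name, the weights and the toy matrices are assumptions made for illustration, not values from PhISCS.

```python
def candidate_cost(observed, corrected, n_eliminated=0,
                   w_fn=1.0, w_fp=2.0, w_isa=3.0):
    """Score a candidate corrected matrix against the observed calls.

    observed, corrected: equally sized lists of 0/1 rows (cells x mutations).
    n_eliminated: number of mutations set aside as ISA-violating.
    w_fn, w_fp, w_isa: hypothetical weights of the three terms.
    Returns (false_negative_flips, false_positive_flips, total_cost).
    """
    fn = fp = 0
    for obs_row, cor_row in zip(observed, corrected):
        for o, c in zip(obs_row, cor_row):
            if o == 0 and c == 1:
                fn += 1  # observed absence treated as a missed call
            elif o == 1 and c == 0:
                fp += 1  # observed presence treated as a spurious call
    total = w_fn * fn + w_fp * fp + w_isa * n_eliminated
    return fn, fp, total


# Toy example (hypothetical data): one 0 -> 1 correction, nothing eliminated.
observed = [[1, 0], [1, 1], [0, 1]]
corrected = [[1, 0], [1, 1], [1, 1]]
score = candidate_cost(observed, corrected)
print(score)
```

A solver would search over all corrected matrices that admit a valid phylogeny and return one minimizing this kind of cost; the relative weights decide whether an inconsistency is explained as noise or as an ISA violation.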

PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data

Genome Research ◽

10.1101/gr.234435.118 ◽

2019 ◽

Vol 29 (11) ◽

pp. 1860-1877 ◽

Cited By ~ 20

Author(s):

Salem Malikic ◽

Farid Rashidi Mehrabadi ◽

Simone Ciccolella ◽

Md. Khaledur Rahman ◽

Camir Ricketts ◽

...

Keyword(s):

Single Cell ◽

Phylogeny Reconstruction ◽

Combinatorial Approach ◽

Sequencing Data ◽

Tumor Phylogeny

484 Bioturing browser: interactively explore public single cell sequencing data

Journal for ImmunoTherapy of Cancer ◽

10.1136/jitc-2020-sitc2020.0484 ◽

2020 ◽

Vol 8 (Suppl 3) ◽

pp. A520-A520

Author(s):

Son Pham ◽

Tri Le ◽

Tan Phan ◽

Minh Pham ◽

Huy Nguyen ◽

...

Keyword(s):

Single Cell ◽

Immune Cell ◽

Expression Profiles ◽

Meta Analysis ◽

Cell Types ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Data Formats ◽

Cancer Types ◽

Cell Data

Background: Single-cell sequencing technology has opened an unprecedented ability to interrogate cancer. It reveals significant insights into intratumoral heterogeneity, metastasis, and therapeutic resistance, which facilitate target discovery and validation in cancer treatment. With rapid advancements in throughput and strategies, a particular immuno-oncology study can produce multi-omics profiles for several thousands of individual cells. This overflow of single-cell data poses formidable challenges, including standardizing data formats across studies, performing reanalysis for individual datasets, and meta-analysis. Methods: N/A. Results: We present BioTuring Browser, an interactive platform for accessing and reanalyzing published single-cell omics data. The platform is currently hosting a curated database of more than 10 million cells from 247 projects, covering more than 120 immune cell types and subtypes, and 15 different cancer types. All data are processed and annotated with standardized labels of cell types, diseases, therapeutic responses, etc. to be instantly accessed and explored in a uniform visualization and analytics interface. Based on this massive curated database, BioTuring Browser supports searching similar expression profiles, querying a target across datasets and automatic cell type annotation. The platform supports single-cell RNA-seq, CITE-seq and TCR-seq data. BioTuring Browser is now available for download at www.bioturing.com. Conclusions: N/A.

Effective clustering for single cell sequencing cancer data

10.1101/586545 ◽

2019 ◽

Cited By ~ 2

Author(s):

Simone Ciccolella ◽

Murray Patterson ◽

Paola Bonizzoni ◽

Gianluca Della Vedova

Keyword(s):

Single Cell ◽

Categorical Data ◽

Euclidean Distance ◽

Missing Values ◽

False Negative ◽

Ground Truth ◽

Sequencing Data ◽

Large Space ◽

Cancer Data ◽

Single Cell Sequencing

Abstract. Background: Single cell sequencing (SCS) technologies provide a level of resolution that makes them indispensable for inferring, from a sequenced tumor, the evolutionary trees or phylogenies that represent an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes some approaches and tools infeasible. While this has not inhibited the development of methods for inferring phylogenies from SCS data, the continuing increase in size and resolution of these data begins to put a strain on such methods. One possible solution is to reduce the size of an SCS instance (usually represented as a matrix of presence, absence and missing values of the mutations found in the different sequenced cells) and infer the tree from this reduced-size instance. Previous approaches have used k-means to this end, clustering groups of mutations and/or cells, and using these means as the reduced instance. Such an approach typically uses the Euclidean distance for computing means. However, since the values in these matrices are of a categorical nature (having the three categories: present, absent and missing), we explore applying techniques for clustering categorical data, commonly used in data mining and machine learning, to SCS data, with this goal in mind. Results: In this work, we present a new clustering procedure, called celluloid, aimed at clustering categorical vector or matrix data, here representing SCS instances. We demonstrate that celluloid clusters mutations with high precision, never pairing too many mutations that are unrelated in the ground truth, and that it also obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by this method. Finally, we demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in the runtime and raising considerably the upper bound on the size of SCS instances which can be solved in practice. Availability: Our approach, celluloid: clustering single cell sequencing data around centroids, is available at https://github.com/AlgoLab/celluloid/ under an MIT license.
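The contrast between Euclidean means and categorical clustering can be sketched with the two building blocks of a k-modes-style procedure: a simple matching dissimilarity and a per-position mode used as the cluster representative. This is a generic illustration, assuming an encoding of 0 = absent, 1 = present, 2 = missing; the dissimilarity that celluloid actually uses may differ, and the names below are hypothetical.

```python
from collections import Counter

MISSING = 2  # assumed encoding: 0 = absent, 1 = present, 2 = missing


def mismatch_distance(u, v):
    """Simple matching dissimilarity: number of positions where u and v differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def cluster_mode(vectors):
    """Per-position most frequent category; ties go to the smallest value."""
    mode = []
    for column in zip(*vectors):
        counts = Counter(column)
        best = max(counts.values())
        mode.append(min(val for val, cnt in counts.items() if cnt == best))
    return mode


# Toy cluster (hypothetical data): three mutation profiles over three cells.
profiles = [
    [1, 0, MISSING],
    [1, 1, MISSING],
    [0, 1, MISSING],
]
representative = cluster_mode(profiles)
distance = mismatch_distance(profiles[0], profiles[1])
print(representative, distance)
```

Unlike an arithmetic mean, the mode is always a valid category, so the reduced instance stays a matrix of present, absent and missing values that a downstream inference tool can consume directly.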

Algorithmic methods to infer the evolutionary trajectories in cancer progression

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1520213113 ◽

2016 ◽

Vol 113 (28) ◽

pp. E4025-E4034 ◽

Cited By ~ 38

Author(s):

Giulio Caravagna ◽

Alex Graudenzi ◽

Daniele Ramazzotti ◽

Rebeca Sanz-Pamplona ◽

Luca De Sano ◽

...

Keyword(s):

Cancer Progression ◽

Current Knowledge ◽

Population Level ◽

Selective Advantage ◽

Explanatory Models ◽

Next Generation Sequencing Data ◽

Driver Mutations ◽

Sequencing Data ◽

Cross Sectional ◽

Progression Model

The genomic evolution inherent to cancer relates directly to a renewed focus on the voluminous next-generation sequencing data and machine learning for the inference of explanatory models of how the (epi)genomic events are choreographed in cancer initiation and development. However, despite the increasing availability of multiple additional -omics data, this quest has been frustrated by various theoretical and technical hurdles, mostly stemming from the dramatic heterogeneity of the disease. In this paper, we build on our recent work on the “selective advantage” relation among driver mutations in cancer progression and investigate its applicability to the modeling problem at the population level. Here, we introduce PiCnIc (Pipeline for Cancer Inference), a versatile, modular, and customizable pipeline to extract ensemble-level progression models from cross-sectional sequenced cancer genomes. The pipeline has many translational implications because it combines state-of-the-art techniques for sample stratification, driver selection, identification of fitness-equivalent exclusive alterations, and progression model inference. We demonstrate PiCnIc’s ability to reproduce much of the current knowledge on colorectal cancer progression as well as to suggest novel experimentally verifiable hypotheses.

PhyDOSE: Design of Follow-up Single-cell Sequencing Experiments of Tumors

10.1101/2020.03.30.016410 ◽

2020 ◽

Author(s):

Leah Weber ◽

Nuraini Aguse ◽

Nicholas Chia ◽

Mohammed El-Kebir

Keyword(s):

Single Cell ◽

Retrospective Analysis ◽

High Fidelity ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Bulk Data ◽

Sequencing Experiment ◽

Tumor Phylogeny ◽

Number Of Cells

Abstract. The combination of bulk and single-cell DNA sequencing data of the same tumor enables the inference of high-fidelity phylogenies that form the input to many important downstream analyses in cancer genomics. While many studies simultaneously perform bulk and single-cell sequencing, some studies have analyzed initial bulk data to identify which mutations to target in a follow-up single-cell sequencing experiment, thereby decreasing cost. Bulk data provide an additional untapped source of valuable information, composed of candidate phylogenies and associated clonal prevalence. Here, we introduce PhyDOSE, a method that uses this information to strategically optimize the design of follow-up single cell experiments. Underpinning our method is the observation that only a small number of clones uniquely distinguish one candidate tree from all other trees. We incorporate distinguishing features into a probabilistic model that infers the number of cells to sequence so as to confidently reconstruct the phylogeny of the tumor. We validate PhyDOSE using simulations and a retrospective analysis of a leukemia patient, concluding that PhyDOSE’s computed number of cells resolves tree ambiguity even in the presence of typical single-cell sequencing errors. We also conduct a retrospective analysis on an acute myeloid leukemia cohort, demonstrating the potential to achieve similar results with a significant reduction in the number of cells sequenced. In a prospective analysis, we demonstrate that only a small number of cells suffice to disambiguate the solution space of trees in a recent lung cancer cohort. In summary, PhyDOSE proposes cost-efficient single-cell sequencing experiments that yield high-fidelity phylogenies, which will improve downstream analyses aimed at deepening our understanding of cancer biology. Author summary: Cancer development in a patient can be explained using a phylogeny, a tree that describes the evolutionary history of a tumor and has therapeutic implications. A tumor phylogeny is constructed from sequencing data, commonly obtained using either bulk or single-cell DNA sequencing technology. The accuracy of tumor phylogeny inference increases when both types of data are used, but single-cell sequencing may become prohibitively costly with increasing number of cells. Here, we propose a method that uses bulk sequencing data to guide the design of a follow-up single-cell sequencing experiment. Our results suggest that PhyDOSE provides a significant decrease in the number of cells to sequence compared to the number of cells sequenced in existing studies. The ability to make informed decisions based on prior data can help reduce the cost of follow-up single cell sequencing experiments of tumors, improving accuracy of tumor phylogeny inference and ultimately getting us closer to understanding and treating cancer.
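The idea of choosing the number of cells from clonal prevalence can be illustrated with a deliberately simplified calculation, which is not PhyDOSE's actual model. Assuming cells are sampled independently, a single distinguishing clone has known prevalence f, and there are no sequencing errors, the probability of observing at least one cell of that clone among n cells is 1 - (1 - f)^n. The sketch below returns the smallest n that reaches a target confidence; the function name and parameters are hypothetical.

```python
def min_cells_to_observe(prevalence, confidence):
    """Smallest n with 1 - (1 - prevalence)**n >= confidence.

    prevalence: assumed fraction of tumor cells in the distinguishing clone.
    confidence: target probability of sampling at least one such cell.
    Simplification: one clone, independent sampling, no sequencing errors.
    """
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    if not 0.0 <= confidence < 1.0:
        raise ValueError("confidence must be in [0, 1)")
    n = 0
    miss = 1.0  # probability that none of the n cells belongs to the clone
    while 1.0 - miss < confidence:
        n += 1
        miss *= 1.0 - prevalence
    return n


cells_needed = min_cells_to_observe(prevalence=0.1, confidence=0.95)
print(cells_needed)
```

The full method described in the abstract has to reason about several clones at once and about typical single-cell sequencing errors, so its answers will generally differ from this single-clone bound.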

DoubletFinder: Doublet detection in single-cell RNA sequencing data using artificial nearest neighbors

10.1101/352484 ◽

2018 ◽

Cited By ~ 17

Author(s):

Christopher S. McGinnis ◽

Lyndsay M. Murrow ◽

Zev J. Gartner

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

False Negative ◽

Droplet Microfluidics ◽

Cell Capture ◽

Sequencing Data ◽

Putative Gene ◽

Detection Tool ◽

Single Cell Rna Sequencing

Summary: Single-cell RNA sequencing (scRNA-seq) using droplet microfluidics occasionally produces transcriptome data representing more than one cell. These technical artifacts are caused by cell doublets formed during cell capture and occur at a frequency proportional to the total number of sequenced cells. The presence of doublets can lead to spurious biological conclusions, which justifies the practice of sequencing fewer cells to limit doublet formation rates. Here, we present a computational doublet detection tool, DoubletFinder, that identifies doublets based solely on gene expression features. DoubletFinder infers the putative gene expression profile of real doublets by generating artificial doublets from existing scRNA-seq data. Neighborhood detection in gene expression space then identifies sequenced cells with increased probability of being doublets based on their proximity to artificial doublets. DoubletFinder robustly identifies doublets across scRNA-seq datasets with variable numbers of cells and sequencing depth, and predicts false-negative and false-positive doublets defined using conventional barcoding approaches. We anticipate that DoubletFinder will aid in scRNA-seq data analysis and will increase the throughput and accuracy of scRNA-seq experiments.
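The two steps named in the summary, generating artificial doublets and scoring real cells by their proximity to them, can be sketched with NumPy alone. This is a minimal illustration and not the DoubletFinder implementation: it averages random pairs of expression profiles, uses plain Euclidean distance on the raw matrix without normalization or dimensionality reduction, and reports for each real cell the fraction of its k nearest neighbors that are artificial. The function name, parameters and simulated data are assumptions.

```python
import numpy as np


def artificial_neighbor_fraction(expr, n_artificial, k, seed=0):
    """Fraction of each real cell's k nearest neighbors that are artificial doublets.

    expr: (n_cells, n_genes) array of expression values.
    n_artificial: number of artificial doublets to simulate.
    k: neighborhood size (must be smaller than n_cells + n_artificial).
    """
    rng = np.random.default_rng(seed)
    n_cells = expr.shape[0]
    # Simulate doublets by averaging two randomly chosen cells
    # (the two indices may occasionally coincide in this sketch).
    first = rng.integers(0, n_cells, size=n_artificial)
    second = rng.integers(0, n_cells, size=n_artificial)
    artificial = (expr[first] + expr[second]) / 2.0
    merged = np.vstack([expr, artificial])
    is_artificial = np.arange(merged.shape[0]) >= n_cells
    # Euclidean distance from every real cell to every real or artificial cell.
    diff = expr[:, None, :] - merged[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # A cell must not count as its own neighbor.
    dist[np.arange(n_cells), np.arange(n_cells)] = np.inf
    nearest = np.argsort(dist, axis=1)[:, :k]
    return is_artificial[nearest].mean(axis=1)


# Simulated counts (hypothetical data): 20 cells x 5 genes.
toy_expr = np.random.default_rng(1).poisson(5.0, size=(20, 5)).astype(float)
scores = artificial_neighbor_fraction(toy_expr, n_artificial=10, k=5)
print(scores.shape)
```

Cells with a high score sit in regions of expression space that are densely populated by simulated doublets, which is the signal the summary describes; a threshold on this score would then be needed to call doublets.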

Cellsnp-lite: an efficient tool for genotyping single cells

10.1101/2020.12.31.424913 ◽

2021 ◽

Author(s):

Xianjie Huang ◽

Yuanhua Huang

Keyword(s):

Single Cell ◽

Single Cells ◽

Basic Research ◽

Substantial Improvement ◽

Data Sets ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Memory Efficiency ◽

Computational Speed ◽

Cell Data

Abstract. Summary: Single-cell sequencing is an increasingly used technology and has promising applications in basic research and clinical translations. However, genotyping methods developed for bulk sequencing data have not been well adapted for single-cell data, in terms of both computational parallelization and simplified user interface. Here we introduce cellsnp-lite, a software tool implemented in C/C++ and based on the well-supported package htslib, for genotyping single-cell sequencing data from both droplet-based and well-based platforms. On various experimental data sets, it shows substantial improvement in computational speed and memory efficiency while retaining highly concordant results compared to existing methods. Cellsnp-lite therefore lightens the genetic analysis for increasingly large single-cell data. Availability: The source code is freely available at https://github.com/single-cell-genetics/[email protected]
