Accurate and efficient cell lineage tree inference from noisy single cell data: the maximum likelihood perfect phylogeny approach

Bioinformatics ◽

10.1093/bioinformatics/btz676 ◽

2019 ◽

Author(s):

Yufeng Wu

Keyword(s):

Single Cell ◽

Cell Lineage ◽

Large Data ◽

Genomic Variation ◽

Supplementary Information ◽

Perfect Phylogeny ◽

Tree Inference ◽

Lineage Tree ◽

Infinite Sites Model ◽

Cell Data

Abstract Motivation Cells in an organism share a common evolutionary history, called cell lineage tree. Cell lineage tree can be inferred from single cell genotypes at genomic variation sites. Cell lineage tree inference from noisy single cell data is a challenging computational problem. Most existing methods for cell lineage tree inference assume uniform uncertainty in genotypes. A key missing aspect is that real single cell data usually has non-uniform uncertainty in individual genotypes. Moreover, existing methods are often sampling based and can be very slow for large data. Results In this article, we propose a new method called ScisTree, which infers cell lineage tree and calls genotypes from noisy single cell genotype data. Different from most existing approaches, ScisTree works with genotype probabilities of individual genotypes (which can be computed by existing single cell genotype callers). ScisTree assumes the infinite sites model. Given uncertain genotypes with individualized probabilities, ScisTree implements a fast heuristic for inferring cell lineage tree and calling the genotypes that allow the so-called perfect phylogeny and maximize the likelihood of the genotypes. Through simulation, we show that ScisTree performs well on the accuracy of inferred trees, and is much more efficient than existing methods. The efficiency of ScisTree enables new applications including imputation of the so-called doublets. Availability and implementation The program ScisTree is available for download at: https://github.com/yufengwudcs/ScisTree. Supplementary information Supplementary data are available at Bioinformatics online.
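
As a rough illustration of the two ingredients the abstract names (perfect phylogeny and genotype likelihood), the following Python sketch checks whether a binary genotype matrix is compatible with a rooted perfect phylogeny and scores called genotypes against per-entry probabilities. It is not ScisTree's heuristic. The function names, the 0 = wild type / 1 = mutated encoding, and the independence assumption across entries are all illustrative assumptions.

```python
import math
from itertools import combinations


def admits_perfect_phylogeny(genotypes):
    """Return True if a binary cell-by-site matrix fits a rooted perfect phylogeny.

    Assumes 0 = wild type (root state) and 1 = mutated. Under the infinite
    sites model, the sets of cells carrying any two mutations must be
    either disjoint or nested.
    """
    n_sites = len(genotypes[0]) if genotypes else 0
    carriers = [
        {cell for cell, row in enumerate(genotypes) if row[site] == 1}
        for site in range(n_sites)
    ]
    for a, b in combinations(carriers, 2):
        overlap = a & b
        if overlap and overlap != a and overlap != b:
            return False  # partial overlap: the two sites conflict
    return True


def log_likelihood(genotypes, prob_zero):
    """Log-probability of called genotypes, treating entries as independent.

    prob_zero[i][j] is the (assumed given) probability that cell i is
    wild type at site j.
    """
    total = 0.0
    for row, probs in zip(genotypes, prob_zero):
        for g, p0 in zip(row, probs):
            total += math.log(p0 if g == 0 else 1.0 - p0)
    return total
```

In this framing, a method of the kind described would search for the genotype matrix that passes the first check while making the second value as large as possible.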

Accurate and Efficient Cell Lineage Tree Inference from Noisy Single Cell Data: the Maximum Likelihood Perfect Phylogeny Approach

10.1101/742395 ◽

2019 ◽

Author(s):

Yufeng Wu

Keyword(s):

Single Cell ◽

Cell Lineage ◽

Large Data ◽

Genomic Variation ◽

Perfect Phylogeny ◽

Tree Inference ◽

Lineage Tree ◽

Infinite Sites Model ◽

New Applications ◽

Cell Data

Abstract Cells in an organism share a common evolutionary history, called cell lineage tree. Cell lineage tree can be inferred from single cell genotypes at genomic variation sites. Cell lineage tree inference from noisy single cell data is a challenging computational problem. Most existing methods for cell lineage tree inference assume uniform uncertainty in genotypes. A key missing aspect is that real single cell data usually has non-uniform uncertainty in individual genotypes. Moreover, existing methods are often sampling-based and can be very slow for large data. In this paper, we propose a new method called ScisTree, which infers cell lineage tree and calls genotypes from noisy single cell genotype data. Different from most existing approaches, ScisTree works with genotype probabilities of individual genotypes (which can be computed by existing single cell genotype callers). ScisTree assumes the infinite sites model. Given uncertain genotypes with individualized probabilities, ScisTree implements a fast heuristic for inferring cell lineage tree and calling the genotypes that allow the so-called perfect phylogeny and maximize the likelihood of the genotypes. Through simulation, we show that ScisTree performs well on the accuracy of inferred trees, and is much more efficient than existing methods. The efficiency of ScisTree enables new applications including imputation of the so-called doublets. Availability The program ScisTree is available for download at: https://github.com/yufengwudcs/[email protected]

Scelestial: fast and accurate single-cell lineage tree inference based on a Steiner tree approximation algorithm

10.1101/2021.05.24.445405 ◽

2021 ◽

Author(s):

Mohammad-Hadi Foroughmand-Araabi ◽

Sama Goliaei ◽

Alice Carolyn McHardy

Keyword(s):

Approximation Algorithm ◽

Single Cell ◽

Steiner Tree ◽

Missing Values ◽

Cell Lineage ◽

Error Rates ◽

Steiner Tree Problem ◽

Tree Reconstruction ◽

Tree Inference ◽

Lineage Tree

Single-cell genome sequencing provides a highly granular view of biological systems but is affected by high error rates, allelic amplification bias, and uneven genome coverage. This creates a need for data-specific computational methods, for purposes such as cell lineage tree inference. The objective of cell lineage tree reconstruction is to infer the evolutionary process that generated a set of observed cell genomes. Lineage trees may enable a better understanding of tumor formation and growth, as well as of organ development for healthy body cells. We describe a method, Scelestial, for lineage tree reconstruction from single-cell data, which is based on an approximation algorithm for the Steiner tree problem and is a generalization of the neighbor-joining method. We adapt the algorithm to efficiently select a limited subset of potential sequences as internal nodes, in the presence of missing values, and to minimize cost by lineage tree-based missing value imputation. In a comparison against seven state-of-the-art single-cell lineage tree reconstruction algorithms - BitPhylogeny, OncoNEM, SCITE, SiFit, SASC, SCIPhI, and SiCloneFit - on simulated and real single-cell tumor samples, Scelestial performed best at reconstructing trees in terms of accuracy and run time. Scelestial has been implemented in C++. It is also available as an R package named RScelestial.
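
To make the handling of missing values concrete, here is a small Python sketch of an edge cost that skips unobserved sites, plus a neighbor-based imputation step. It is only an illustration of the general idea, not Scelestial's actual cost function or algorithm. The `MISSING` marker and both function names are hypothetical.

```python
MISSING = "?"  # hypothetical marker for a site with no observed value


def pairwise_cost(seq_a, seq_b):
    """Count mismatching sites, skipping sites missing in either sequence."""
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != MISSING and b != MISSING and a != b
    )


def impute_from_neighbor(seq, neighbor):
    """Fill missing sites in seq with the neighbor's value at the same site.

    A site stays missing if the neighbor has no value there either.
    """
    return [b if a == MISSING else a for a, b in zip(seq, neighbor)]
```

Imputing from a tree neighbor never increases the cost of the edge between the two sequences, which is one intuition for why tree-based imputation can be paired with cost minimization.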

Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo

Science ◽

10.1126/science.aar4362 ◽

2018 ◽

Vol 360 (6392) ◽

pp. 981-987 ◽

Cited By ~ 278

Author(s):

Daniel E. Wagner ◽

Caleb Weinreb ◽

Zach M. Collins ◽

James A. Briggs ◽

Sean G. Megason ◽

...

Keyword(s):

Single Cell ◽

Zebrafish Embryo ◽

Cell Lineage ◽

Vertebrate Development ◽

Cell Mapping ◽

Cell Fates ◽

Web Based ◽

A Cell ◽

Cell Data ◽

Germ Layer Formation

High-throughput mapping of cellular differentiation hierarchies from single-cell data promises to empower systematic interrogations of vertebrate development and disease. Here we applied single-cell RNA sequencing to >92,000 cells from zebrafish embryos during the first day of development. Using a graph-based approach, we mapped a cell-state landscape that describes axis patterning, germ layer formation, and organogenesis. We tested how clonally related cells traverse this landscape by developing a transposon-based barcoding approach (TracerSeq) for reconstructing single-cell lineage histories. Clonally related cells were often restricted by the state landscape, including a case in which two independent lineages converge on similar fates. Cell fates remained restricted to this landscape in embryos lacking the chordin gene. We provide web-based resources for further analysis of the single-cell data.

Ensemble learning for classifying single-cell data and projection across reference atlases

Bioinformatics ◽

10.1093/bioinformatics/btaa137 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3585-3587

Author(s):

Lin Wang ◽

Francisca Catalan ◽

Karin Shamardani ◽

Husam Babikir ◽

Aaron Diaz

Keyword(s):

Single Cell ◽

Cell Types ◽

Status Quo ◽

Supplementary Information ◽

Published Data ◽

Supplementary Data ◽

Cell Type ◽

Low Sensitivity ◽

Project Data ◽

Cell Data

Abstract Summary Single-cell data are being generated at an accelerating pace. How best to project data across single-cell atlases is an open problem. We developed a boosted learner that overcomes the greatest challenge with status quo classifiers: low sensitivity, especially when dealing with rare cell types. By comparing novel and published data from distinct scRNA-seq modalities that were acquired from the same tissues, we show that this approach preserves cell-type labels when mapping across diverse platforms. Availability and implementation https://github.com/diazlab/ELSA Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

GPseudoClust: deconvolution of shared pseudo-profiles at single-cell resolution

10.1101/567115 ◽

2019 ◽

Author(s):

Magdalena E Strauss ◽

Paul DW Kirk ◽

John E Reid ◽

Lorenz Wernisch

Keyword(s):

Single Cell ◽

Time Course ◽

Gene Clusters ◽

Supplementary Information ◽

Clustering Methods ◽

Link Type ◽

Novel Approach ◽

Broad Array ◽

Recent Method ◽

Cell Data

Abstract Motivation Many methods have been developed to cluster genes on the basis of their changes in mRNA expression over time, using bulk RNA-seq or microarray data. However, single-cell data may present a particular challenge for these algorithms, since the temporal ordering of cells is not directly observed. One way to address this is to first use pseudotime methods to order the cells, and then apply clustering techniques for time course data. However, pseudotime estimates are subject to high levels of uncertainty, and failing to account for this uncertainty is liable to lead to erroneous and/or over-confident gene clusters. Results The proposed method, GPseudoClust, is a novel approach that jointly infers pseudotemporal ordering and gene clusters, and quantifies the uncertainty in both. GPseudoClust combines a recent method for pseudotime inference with nonparametric Bayesian clustering methods, efficient MCMC sampling, and novel subsampling strategies which aid computation. We consider a broad array of simulated and experimental datasets to demonstrate the effectiveness of GPseudoClust in a range of settings. Availability An implementation is available on GitHub: https://github.com/magStra/nonparametricSummaryPSM and https://github.com/magStra/[email protected] information Supplementary materials are available.

Single cell network analysis with a mixture of Nested Effects Models

10.1101/258202 ◽

2018 ◽

Author(s):

Martin Pirkl ◽

Niko Beerenwinkel

Keyword(s):

Single Cell ◽

New Technologies ◽

Single Cells ◽

R Package ◽

Supplementary Information ◽

Data Sets ◽

Cell Network ◽

A Cell ◽

Supplementary Material ◽

Cell Data

Abstract Motivation New technologies allow for the elaborate measurement of different traits of single cells. These data promise to elucidate intra-cellular networks in unprecedented detail and further help to improve treatment of diseases like cancer. However, cell populations can be very heterogeneous. Results We developed a mixture of Nested Effects Models (M&NEM) for single-cell data to simultaneously identify different cellular sub-populations and their corresponding causal networks to explain the heterogeneity in a cell population. For inference, we assign each cell to a network with a certain probability and iteratively update the optimal networks and cell probabilities in an Expectation Maximization scheme. We validate our method in the controlled setting of a simulation study and apply it to three data sets of pooled CRISPR screens generated previously by two novel experimental techniques, namely Crop-Seq and Perturb-Seq. Availability The mixture Nested Effects Model (M&NEM) is available as the R-package mnem at https://github.com/cbgethz/mnem/[email protected], [email protected] information Supplementary data are available online.
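
The soft assignment of cells to networks follows the usual Expectation Maximization pattern for mixtures. The Python sketch below shows only the generic part: computing per-cell assignment probabilities from component log-likelihoods and updating the mixture weights. It does not implement Nested Effects Models or the network update step. The function names and the assumption that per-component log-likelihoods are already available are illustrative.

```python
import math


def responsibilities(log_liks, weights):
    """E-step: posterior probability of each component for one cell.

    log_liks[k] is the log-likelihood of the cell's data under component k;
    weights[k] is the current mixture weight of component k.
    """
    scores = [math.log(w) + ll for w, ll in zip(weights, log_liks)]
    top = max(scores)  # subtract the max for numerical stability
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def update_weights(all_resp):
    """M-step for the mixture weights: average responsibility per component."""
    n_cells = len(all_resp)
    n_comp = len(all_resp[0])
    return [sum(r[k] for r in all_resp) / n_cells for k in range(n_comp)]
```

A full scheme would alternate these two steps with a re-estimation of each component's network until the likelihood stops improving.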

Robust Lineage Reconstruction from High-Dimensional Single-Cell Data

10.1101/036533 ◽

2016 ◽

Author(s):

Gregory Giecold ◽

Eugenio Marco ◽

Lorenzo Trippa ◽

Guo-Cheng Yuan

Keyword(s):

Gene Expression ◽

Single Cell ◽

Gene Expression Data ◽

Quantitative Estimate ◽

Cell Lineage ◽

Computational Method ◽

Expression Data ◽

Cell Gene Expression ◽

Cell Data ◽

Cell Gene

Single-cell gene expression data provide invaluable resources for systematic characterization of cellular hierarchy in multi-cellular organisms. However, cell lineage reconstruction is still often associated with significant uncertainty due to technological constraints. Such uncertainties have not been taken into account in current methods. We present ECLAIR, a novel computational method for the statistical inference of cell lineage relationships from single-cell gene expression data. ECLAIR uses an ensemble approach to improve the robustness of lineage predictions, and provides a quantitative estimate of the uncertainty of lineage branchings. We show that the application of ECLAIR to published datasets successfully reconstructs known lineage relationships and significantly improves the robustness of predictions. In conclusion, ECLAIR is a powerful bioinformatics tool for single-cell data analysis. It can be used for robust lineage reconstruction with quantitative estimate of prediction accuracy.
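
One common way to combine an ensemble of clusterings, sketched here in Python, is a co-association matrix that records how often each pair of cells lands in the same cluster across runs. This is a generic illustration of the ensemble idea and is not taken from ECLAIR's implementation. The function name and the input layout (one label list per run) are assumptions.

```python
def co_association(labelings):
    """Fraction of runs in which each pair of cells shares a cluster label.

    labelings is a list of runs; each run is a list with one label per cell.
    """
    n_runs = len(labelings)
    n_cells = len(labelings[0])
    matrix = [[0.0] * n_cells for _ in range(n_cells)]
    for labels in labelings:
        for i in range(n_cells):
            for j in range(n_cells):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0 / n_runs
    return matrix
```

Entries close to 0 or 1 indicate pairs whose relationship is stable across runs, while intermediate values flag uncertain groupings.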

Decision tree models and cell fate choice

10.1101/2020.12.19.423629 ◽

2020 ◽

Author(s):

Ivan Croydon Veleslavov ◽

Michael P.H. Stumpf

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Fate ◽

Cell Types ◽

Lineage Tree ◽

Tree Models ◽

Fate Decision ◽

Average Gene ◽

Lineage Trees ◽

Cell Data

Abstract Single cell transcriptomics has laid bare the heterogeneity of apparently identical cells at the level of gene expression. For many cell-types we now know that there is variability in the abundance of many transcripts, and that average transcript abundance or average gene expression can be an unhelpful concept. A range of clustering and other classification methods have been proposed which use the signal in single cell data to classify, that is assign cell types, to cells based on their transcriptomic states. In many cases, however, we would like to have not just a classifier, but also a set of interpretable rules by which this classification occurs. Here we develop and demonstrate the interpretive power of one such approach, which sets out to establish a biologically interpretable classification scheme. In particular we are interested in capturing the chain of regulatory events that drive cell-fate decision making across a lineage tree or lineage sequence. We find that suitably defined decision trees can help to resolve gene regulatory programs involved in shaping lineage trees. Our approach combines predictive power with interpretability and can extract logical rules from single cell data.

Conifer: clonal tree inference for tumor heterogeneity with single-cell and bulk sequencing data

BMC Bioinformatics ◽

10.1186/s12859-021-04338-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Leila Baghaarabani ◽

Sama Goliaei ◽

Mohammad-Hadi Foroughmand-Araabi ◽

Seyed Peyman Shariatpanahi ◽

Bahram Goliaei

Keyword(s):

Single Cell ◽

Tumor Heterogeneity ◽

Temporal Order ◽

Variant Allele ◽

Evolutionary Relationships ◽

Sequencing Data ◽

Variant Allele Frequency ◽

Single Cell Sequencing ◽

Tree Inference ◽

Cell Data

Abstract Background Genetic heterogeneity of a cancer tumor that develops during clonal evolution is one of the reasons for cancer treatment failure, by increasing the chance of drug resistance. Clones are cell populations with different genotypes, resulting from differences in somatic mutations that occur and accumulate during cancer development. An appropriate approach for identifying clones is determining the variant allele frequency of mutations that occurred in the tumor. Although bulk sequencing data can be used to provide that information, the frequencies are not informative enough for identifying different clones with the same prevalence and their evolutionary relationships. On the other hand, single-cell sequencing data provides valuable information about branching events in the evolution of a cancerous tumor. However, the temporal order of mutations may be determined with ambiguities using only single-cell data, while variant allele frequencies from bulk sequencing data can provide beneficial information for inferring the temporal order of mutations with fewer ambiguities. Result In this study, a new method called Conifer (ClONal tree Inference For hEterogeneity of tumoR) is proposed which combines aggregated variant allele frequency from bulk sequencing data with branching event information from single-cell sequencing data to more accurately identify clones and their evolutionary relationships. It is proven that the accuracy of clone identification and clonal tree inference is increased by using Conifer compared to other existing methods on various sets of simulated data. In addition, it is discussed that the evolutionary tree provided by Conifer on real cancer data sets is highly consistent with information in both bulk and single-cell data. Conclusions In this study, we have provided an accurate and robust method to identify clones of tumor heterogeneity and their evolutionary history by combining single-cell and bulk sequencing data.
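
For readers unfamiliar with the quantity, the variant allele frequency of a mutation is the fraction of sequencing reads at its locus that carry the variant. The Python sketch below computes it from read counts. The function name and the example counts are hypothetical, and nothing here reflects how Conifer itself processes its input.

```python
def variant_allele_frequency(alt_reads, total_reads):
    """Fraction of reads covering a locus that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must lie between 0 and total_reads")
    return alt_reads / total_reads


# Two hypothetical mutations with identical bulk frequencies: the bulk data
# alone cannot tell whether they belong to the same clone or to two clones
# of equal prevalence.
vaf_m1 = variant_allele_frequency(30, 120)
vaf_m2 = variant_allele_frequency(25, 100)
```

Both example mutations have the same frequency, which is the situation the abstract describes as uninformative without the branching information from single-cell data.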

PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells

Bioinformatics ◽

10.1093/bioinformatics/btaa042 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2778-2786 ◽

Cited By ~ 5

Author(s):

Shobana V Stassen ◽

Dickson M D Siu ◽

Kelvin C M Lee ◽

Joshua W K Ho ◽

Hayden K H So ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Clustering Algorithm ◽

Single Cells ◽

Clustering Algorithms ◽

Cellular Heterogeneity ◽

Supplementary Information ◽

Phenotypic Data ◽

Scalable Algorithm ◽

Cell Data

Abstract Motivation New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results We introduce a highly scalable graph-based clustering algorithm PARC—Phenotyping by Accelerated Refined Community-partitioning—for large-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell flow and mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without subsampling of cells, including Phenograph, FlowSOM and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell dataset of 1.1 million cells within 13 min, compared with >2 h for the next fastest graph-clustering algorithm. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis. Availability and implementation https://github.com/ShobiStassen/PARC. Supplementary information Supplementary data are available at Bioinformatics online.
