Melissa: Bayesian clustering and imputation of single cell methylomes

10.1101/312025 ◽

2018 ◽

Author(s):

Chantriolnt-Andreas Kapourani ◽

Guido Sanguinetti

Keyword(s):

Single Cell ◽

Single Cell Analysis ◽

Real Data ◽

Data Sets ◽

Sequencing Data ◽

Control Of Gene Expression ◽

Bayesian Hierarchical ◽

Cpg Sites ◽

Bisulfite Sequencing Data ◽

Bayesian Hierarchical Method

Abstract Measurements of DNA methylation at the single cell level promise to revolutionise our understanding of epigenetic control of gene expression. Yet, intrinsic limitations of the technology result in very sparse coverage of CpG sites (around 5% to 20% coverage), effectively limiting the analysis repertoire to a semi-quantitative level. Here we introduce Melissa (MEthyLation Inference for Single cell Analysis), a Bayesian hierarchical method to quantify spatially-varying methylation profiles across genomic regions from single-cell bisulfite sequencing data (scBS-seq). Melissa clusters individual cells based on local methylation patterns, enabling the discovery of epigenetic differences and similarities among individual cells. The clustering also acts as an effective regularisation method for imputation of methylation on unassayed CpG sites, enabling transfer of information between individual cells. We show on both simulated and real data sets that Melissa provides accurate and biologically meaningful clusterings, and state-of-the-art imputation performance. An R implementation of Melissa is publicly available at https://github.com/andreaskapou/Melissa.

A multiresolution framework to characterize single-cell state landscapes

Nature Communications ◽

10.1038/s41467-020-18416-6 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 1

Author(s):

Shahin Mohammadi ◽

Jose Davila-Velderrain ◽

Manolis Kellis

Keyword(s):

Single Cell ◽

Single Cell Analysis ◽

Real Data ◽

Cell Types ◽

Cellular Heterogeneity ◽

Superior Performance ◽

Data Sets ◽

Structural Representation ◽

Archetypal Analysis ◽

Cell State

Abstract Dissecting the cellular heterogeneity embedded in single-cell transcriptomic data is challenging. Although many methods and approaches exist, identifying cell states and their underlying topology is still a major challenge. Here, we introduce the concept of multiresolution cell-state decomposition as a practical approach to simultaneously capture both fine- and coarse-grain patterns of variability. We implement this concept in ACTIONet, a comprehensive framework that combines archetypal analysis and manifold learning to provide a ready-to-use analytical approach for multiresolution single-cell state characterization. ACTIONet provides a robust, reproducible, and highly interpretable single-cell analysis platform that couples dominant pattern discovery with a corresponding structural representation of the cell state landscape. Using multiple synthetic and real data sets, we demonstrate ACTIONet’s superior performance relative to existing alternatives. We use ACTIONet to integrate and annotate cells across three human cortex data sets. Through integrative comparative analysis, we define a consensus vocabulary and a consistent set of gene signatures discriminating among the transcriptomic cell types and subtypes of the human prefrontal cortex.

SCSIM: Jointly simulating correlated single-cell and bulk next-generation DNA sequencing data

10.1101/2020.02.03.930354 ◽

2020 ◽

Author(s):

Collin Giguere ◽

Harsh Vardhan Dubey ◽

Vishal Kumar Sarsani ◽

Hachem Saddiki ◽

Shai He ◽

...

Keyword(s):

Dna Sequencing ◽

Single Cell ◽

Real Data ◽

Data Sets ◽

Next Generation ◽

Sequencing Data ◽

Next Generation Dna Sequencing ◽

Accuracy And Precision ◽

Downstream Analysis ◽

Multiple Samples

Abstract Background Recently, it has become possible to collect next-generation DNA sequencing data sets that are composed of multiple samples from multiple biological units, where each of these samples may be from a single cell or bulk tissue. However, no tool yet exists for simulating DNA sequencing data from such a nested sampling arrangement with single-cell and bulk samples, so that developers of analysis methods can assess accuracy and precision. Results We have developed a tool that simulates DNA sequencing data from hierarchically grouped (correlated) samples where each sample is designated bulk or single-cell. Our tool uses a simple configuration file to define the experimental arrangement and can be integrated into software pipelines for testing of variant callers or other genomic tools. Conclusions The DNA sequencing data generated by our simulator are representative of real data and integrate seamlessly with standard downstream analysis tools.

Distinguishing linear and branched evolution given single-cell DNA sequencing data of tumors

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00194-5 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Leah L. Weber ◽

Mohammed El-Kebir

Keyword(s):

Dna Sequencing ◽

Single Cell ◽

Evolutionary Process ◽

Treatment Decision ◽

Real Data ◽

Current Data ◽

Fast Method ◽

Sequencing Data ◽

Evolutionary Trajectory ◽

Cancer Types

Abstract Background Cancer arises from an evolutionary process where somatic mutations give rise to clonal expansions. Reconstructing this evolutionary process is useful for treatment decision-making as well as understanding evolutionary patterns across patients and cancer types. In particular, classifying a tumor’s evolutionary process as either linear or branched and understanding what cancer types and which patients have each of these trajectories could provide useful insights for both clinicians and researchers. While comprehensive cancer phylogeny inference from single-cell DNA sequencing data is challenging due to limitations with current sequencing technology and the complexity of the resulting problem, current data might provide sufficient signal to accurately classify a tumor’s evolutionary history as either linear or branched. Results We introduce the Linear Perfect Phylogeny Flipping (LPPF) problem as a means of testing two alternative hypotheses for the pattern of evolution, which we prove to be NP-hard. We develop Phyolin, which uses constraint programming to solve the LPPF problem. Through both in silico experiments and real data application, we demonstrate the performance of our method, outperforming a competing machine learning approach. Conclusion Phyolin is an accurate, easy to use and fast method for classifying an evolutionary trajectory as linear or branched given a tumor’s single-cell DNA sequencing data.
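
The linear-versus-branched distinction in the abstract above can be illustrated for the idealised, error-free case. The sketch below is not Phyolin and does not solve the LPPF problem (which handles noisy data by flipping entries); it only assumes a binary cells-by-mutations matrix and checks whether the sets of cells carrying each mutation are nested, which is what a single chain of clones would produce. The function name `is_linear` and the toy matrices are hypothetical.

```python
from itertools import combinations


def is_linear(matrix):
    """Return True if a noise-free binary matrix is consistent with linear evolution.

    matrix[i][j] == 1 means cell i carries mutation j. Under linear evolution
    the clones form a chain, so for any two mutations the set of cells
    carrying one must contain the set of cells carrying the other.
    """
    n_mutations = len(matrix[0]) if matrix else 0
    # For each mutation (column), collect the indices of cells that carry it.
    cell_sets = [
        frozenset(i for i, row in enumerate(matrix) if row[j])
        for j in range(n_mutations)
    ]
    # Linear means every pair of cell sets is comparable by inclusion.
    return all(a <= b or b <= a for a, b in combinations(cell_sets, 2))


# Three cells on one chain: each cell adds a mutation to the previous clone.
linear_example = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]

# Two cells share mutation 0 and then diverge with private mutations 1 and 2.
branched_example = [
    [1, 1, 0],
    [1, 0, 1],
]

linear_result = is_linear(linear_example)
branched_result = is_linear(branched_example)
```

Real single-cell data contain false positives and false negatives, so a plain containment check like this would usually report "branched"; that gap is what the hypothesis-testing formulation in the abstract is meant to address.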

Software Benchmark—Classification Tree Algorithms for Cell Atlases Annotation Using Single-Cell RNA-Sequencing Data

Microbiology Research ◽

10.3390/microbiolres12020022 ◽

2021 ◽

Vol 12 (2) ◽

pp. 317-334

Author(s):

Omar Alaqeeli ◽

Li Xing ◽

Xuekui Zhang

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Classification Tree ◽

Area Under The Curve ◽

Data Sets ◽

Sequencing Data ◽

Single Cell Rna Sequencing ◽

Tree Algorithms ◽

R Packages

The classification tree is a widely used machine learning method. It has multiple implementations as R packages: rpart, ctree, evtree, tree and C5.0. The details of these implementations differ, and hence their performance varies from one application to another. We are interested in their performance in the classification of cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compared the packages’ prediction performance based on their Precision, Recall, F1-score and Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score and AUC; C5.0 prefers more complex trees; and tree is consistently much faster than the others, although its complexity is often higher.
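
As a reminder of how the Precision, Recall and F1-score named above are defined, here is a minimal Python sketch for a single cell type treated as the positive class. It is not the benchmark's code (the study used R packages); the function name `per_class_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for one label treated as the positive class."""
    pairs = Counter(zip(y_true, y_pred))
    # True positives: truly positive and predicted positive.
    tp = sum(n for (t, p), n in pairs.items() if t == positive and p == positive)
    # False positives: predicted positive but truly another class.
    fp = sum(n for (t, p), n in pairs.items() if t != positive and p == positive)
    # False negatives: truly positive but predicted as another class.
    fn = sum(n for (t, p), n in pairs.items() if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels for six cells of two types ("T" and "B").
y_true = ["T", "T", "B", "B", "T", "B"]
y_pred = ["T", "B", "B", "B", "T", "T"]
precision, recall, f1 = per_class_metrics(y_true, y_pred, positive="T")
```

With these toy labels there are two true positives, one false positive and one false negative for class "T", so all three metrics equal 2/3.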

CAMISIM: Simulating metagenomes and microbial communities

10.1101/300970 ◽

2018 ◽

Cited By ~ 4

Author(s):

Adrian Fritz ◽

Peter Hofmann ◽

Stephan Majda ◽

Eik Dahms ◽

Johannes Dröge ◽

...

Keyword(s):

Microbial Communities ◽

De Novo ◽

Real Data ◽

Small Data ◽

Data Sets ◽

Sequencing Data ◽

Taxonomic Profiling ◽

Benchmark Data ◽

Sequencing Technologies ◽

Wide Range

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets is required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM

How to Get Started with Single Cell RNA Sequencing Data Analysis

Journal of the American Society of Nephrology ◽

10.1681/asn.2020121742 ◽

2021 ◽

pp. ASN.2020121742 ◽

Cited By ~ 1

Author(s):

Michael S. Balzer ◽

Ziyuan Ma ◽

Jianfu Zhou ◽

Amin Abedini ◽

Katalin Susztak

Keyword(s):

Single Cell ◽

Single Cell Analysis ◽

Cell Analysis ◽

Epigenetic Changes ◽

Sequencing Data ◽

Single Experiment ◽

Gene And Protein Expression ◽

Single Cell Rna Sequencing ◽

Analytical Tools ◽

Sequencing Data Analysis

Over the last 5 years, single cell methods have enabled the monitoring of gene and protein expression, genetic, and epigenetic changes in thousands of individual cells in a single experiment. With the improved measurement and the decreasing cost of the reactions and sequencing, the size of these datasets is increasing rapidly. The critical bottleneck remains the analysis of the wealth of information generated by single cell experiments. In this review, we give a simplified overview of the analysis pipelines, as they are typically used in the field today. We aim to enable researchers starting out in single cell analysis to gain an overview of challenges and the most commonly used analytical tools. In addition, we hope to empower others to gain an understanding of how typical readouts from single cell datasets are presented in the published literature.

Evaluation of single-cell classifiers for single-cell RNA sequencing data sets

Briefings in Bioinformatics ◽

10.1093/bib/bbz096 ◽

2019 ◽

Vol 21 (5) ◽

pp. 1581-1595 ◽

Cited By ~ 6

Author(s):

Xinlei Zhao ◽

Shuang Wu ◽

Nan Fang ◽

Xiao Sun ◽

Jue Fan

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Reference Data ◽

Predictive Accuracy ◽

Cell Types ◽

Superior Performance ◽

Marker Genes ◽

Data Sets ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Abstract Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
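
The "simple ensemble voting" mentioned above can be sketched as a per-cell majority vote over the labels returned by several tools. The abstract does not specify the voting rule, so the strict-majority threshold, the `min_fraction` parameter, the function name `majority_vote` and the toy predictions are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions, min_fraction=0.5, unassigned="unassigned"):
    """Combine per-tool cell-type labels into one consensus label per cell.

    predictions: list of label lists, one list per tool, all of equal length.
    A cell keeps the most common label only if more than `min_fraction`
    of the tools agree on it; otherwise it is marked as unassigned.
    """
    n_tools = len(predictions)
    consensus = []
    # zip(*predictions) yields, for each cell, the labels from every tool.
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        consensus.append(label if count / n_tools > min_fraction else unassigned)
    return consensus


# Hypothetical output of three tools for the same three cells.
tool_a = ["T cell", "B cell", "NK"]
tool_b = ["T cell", "B cell", "T cell"]
tool_c = ["T cell", "monocyte", "B cell"]

consensus = majority_vote([tool_a, tool_b, tool_c])
```

Here the first cell is labelled unanimously, the second by two of three tools, and the third receives three different labels and is therefore left unassigned.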

MethylStar: A fast and robust pre-processing pipeline for bulk or single-cell whole-genome bisulfite sequencing data

BMC Genomics ◽

10.1186/s12864-020-06886-3 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Yadollah Shahryary ◽

Rashmi R. Hazarika ◽

Frank Johannes

Keyword(s):

Single Cell ◽

Bisulfite Sequencing ◽

Whole Genome ◽

Sequencing Data ◽

Whole Genome Bisulfite Sequencing ◽

Processing Pipeline ◽

Genome Bisulfite Sequencing ◽

Bisulfite Sequencing Data

CoGAPS 3: Bayesian non-negative matrix factorization for single-cell analysis with asynchronous updates and sparse data structures

10.1101/699041 ◽

2019 ◽

Cited By ~ 3

Author(s):

Thomas D. Sherman ◽

Tiger Gao ◽

Elana J. Fertig

Keyword(s):

Single Cell ◽

Data Structures ◽

Computational Efficiency ◽

Matrix Factorization ◽

Single Cell Analysis ◽

Sparse Data ◽

Data Sets ◽

Cell Analysis ◽

Gradient Based ◽

Cell Data

Abstract Motivation Bayesian factorization methods, including Coordinated Gene Activity in Pattern Sets (CoGAPS), are emerging as powerful analysis tools for single cell data. However, these methods have greater computational costs than their gradient-based counterparts. These costs are often prohibitive for analysis of large single-cell datasets. Many such methods can be run in parallel which enables this limitation to be overcome by running on more powerful hardware. However, the constraints imposed by the prior distributions in CoGAPS limit the applicability of parallelization methods to enhance computational efficiency for single-cell analysis. Results We upgraded CoGAPS in Version 3 to overcome the computational limitations of Bayesian matrix factorization for single cell data analysis. This software includes a new parallelization framework that is designed around the sequential updating steps of the algorithm to enhance computational efficiency. These algorithmic advances were coupled with new software architecture and sparse data structures to reduce the memory overhead for single-cell data. Altogether, these updates to CoGAPS enhance the efficiency of the algorithm so that it can analyze 1000 times more cells, enabling factorization of large single-cell data sets. Availability CoGAPS is available as a Bioconductor package and the source code is provided at github.com/FertigLab/CoGAPS. All efficiency updates to enable single-cell analysis available as of version [email protected]

DrivAER: Identification of driving transcriptional programs in single-cell RNA sequencing data

10.1101/864165 ◽

2019 ◽

Author(s):

Lukas M. Simon ◽

Fangfang Yan ◽

Zhongming Zhao

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Disease Status ◽

Data Sets ◽

Sequencing Data ◽

Functional Interpretation ◽

Recent Success ◽

Gene Sets ◽

Single Cell Rna Sequencing ◽

Cellular Maps

Abstract Single cell RNA sequencing (scRNA-seq) unfolds complex transcriptomic data sets into detailed cellular maps. Despite recent success, there is a pressing need for specialized methods tailored towards the functional interpretation of these cellular maps. Here, we present DrivAER, a machine learning approach that scores annotated gene sets based on their relevance to user-specified outcomes such as pseudotemporal ordering or disease status. We demonstrate that DrivAER extracts the key driving pathways and transcription factors that regulate complex biological processes from scRNA-seq data.