PsiNorm: a scalable normalization for single-cell RNA-seq data

Author(s):  
Matteo Borella ◽  
Graziano Martello ◽  
Davide Risso ◽  
Chiara Romualdi

Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) enables transcriptome-wide gene expression measurements at single-cell resolution, providing a comprehensive view of the composition and dynamics of tissue and organism development. The evolution of scRNA-seq protocols has led to a dramatic increase in cell throughput, exacerbating many of the computational and statistical issues that previously arose for bulk sequencing. In particular, with scRNA-seq data all analysis steps, including normalization, have become computationally intensive, both in terms of memory usage and computational time. From this perspective, new accurate methods able to scale efficiently are desirable.
Results: Here, we propose PsiNorm, a between-sample normalization method based on the shape-parameter estimate of the power-law Pareto distribution. We show that the Pareto distribution closely resembles scRNA-seq data, especially data coming from platforms that use unique molecular identifiers. Motivated by this result, we implement PsiNorm, a simple and highly scalable normalization method. We benchmark PsiNorm against seven other methods in terms of cluster identification, concordance and computational resources required. We demonstrate that PsiNorm is among the top-performing methods, showing a good trade-off between accuracy and scalability. Moreover, PsiNorm does not need a reference, a characteristic that makes it useful in supervised classification settings, in which new out-of-sample data need to be normalized.
Availability and implementation: PsiNorm is implemented in the scone Bioconductor package and available at https://bioconductor.org/packages/scone/.
Supplementary information: Supplementary data are available at Bioinformatics online.
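The Pareto shape parameter behind PsiNorm has a cheap closed-form estimate, which is what makes the method so scalable. The sketch below is only an illustration of that idea, not the scone implementation; the function name and the choice of returning scaled log-counts are ours. With minimum value fixed at 1, the per-cell maximum-likelihood shape estimate is alpha_j = n_genes / sum_i log(counts_ij + 1):

```python
import numpy as np

def psinorm_sketch(counts):
    """Pareto-shape normalization sketch for a genes x cells count matrix.

    alpha_j = n_genes / sum_i log(counts_ij + 1) is the Pareto shape MLE
    for cell j (with x_min = 1). Scaling each cell's log1p counts by
    alpha_j equalizes cells: every normalized cell then sums to n_genes.
    """
    log_counts = np.log1p(np.asarray(counts, dtype=float))
    alpha = log_counts.shape[0] / log_counts.sum(axis=0)  # one shape per cell
    return log_counts * alpha  # broadcast alpha over each column (cell)
```

A single pass over the matrix, with no reference sample needed, which is why the estimate also applies directly to new out-of-sample cells.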


Author(s):  
Krzysztof Polański ◽  
Matthew D Young ◽  
Zhichao Miao ◽  
Kerstin B Meyer ◽  
Sarah A Teichmann ◽  
...  

Abstract
Motivation: Increasing numbers of large-scale single-cell RNA-seq projects are leading to a data explosion, which can only be fully exploited through data integration. A number of methods have been developed to combine diverse datasets by removing technical batch effects, but most are computationally intensive. To overcome the challenge of enormous datasets, we have developed BBKNN, an extremely fast graph-based data integration algorithm.
Results: We illustrate the power of BBKNN on large-scale mouse atlasing data, and favourably benchmark its run time against a number of competing methods.
Availability and implementation: BBKNN is available at https://github.com/Teichlab/bbknn, along with documentation and multiple example notebooks, and can be installed from pip.
Supplementary information: Supplementary data are available at Bioinformatics online.
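BBKNN's central trick, batch-balanced neighbour search, can be stated in a few lines: instead of taking each cell's k nearest neighbours from the pooled data, take k from every batch separately, so no batch can dominate a cell's neighbourhood. The brute-force sketch below illustrates only that idea (the actual package uses approximate neighbour search and feeds the graph into the Scanpy/UMAP pipeline; names here are ours):

```python
import numpy as np

def batch_balanced_neighbours(X, batch, k=2):
    """For each cell, return its k nearest neighbours within every batch.

    X: cells x features embedding (e.g. PCA coordinates).
    batch: per-cell batch labels.
    Brute-force O(n^2) distance computation, for illustration only.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    out = {}
    for i in range(X.shape[0]):
        dist = np.linalg.norm(X - X[i], axis=1)
        neigh = []
        for b in np.unique(batch):
            # candidates: cells of batch b, excluding the query cell itself
            members = np.flatnonzero((batch == b) & (np.arange(len(batch)) != i))
            neigh.extend(members[np.argsort(dist[members])][:k])
        out[i] = sorted(int(j) for j in neigh)
    return out
```

Because every batch contributes k edges per cell, cells of the same type end up connected across batches even when a batch effect separates them in the embedding.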


2021 ◽  
Author(s):  
Agus Salim ◽  
Ramyar Molania ◽  
Jianan Wang ◽  
Alysha De Livera ◽  
Rachel Thijssen ◽  
...  

Motivation: Despite numerous methodological advances, the normalization of single-cell RNA-seq (scRNA-seq) data remains a challenging task. The performance of different methods can vary greatly across datasets. Part of the reason for this is the different kinds of unwanted variation, including library size, batch and cell-cycle effects, and the association of these with the biology embodied in the cells. A normalization method that does not explicitly take cell biology into account risks removing some of the signal of interest. Here we propose RUV-III-NB, a statistical method that can be used to adjust counts for library size and batch effects. The method uses the concept of pseudo-replicates to ensure that relevant features of the unwanted variation are inferred only from cells with the same biology, and it returns adjusted sequencing counts as output.
Results: Using five publicly available datasets that encompass different technological platforms, kinds of biology and levels of association between biology and unwanted variation, we show that RUV-III-NB manages to remove library size and batch effects, strengthen biological signals, improve differential expression analyses and lead to results exhibiting greater concordance with independent datasets of the same kind. The performance of RUV-III-NB is consistent across the five datasets and is not sensitive to the number of factors assumed to contribute to the unwanted variation. It also shows promise for removing other kinds of unwanted variation, such as platform effects.
Availability: The method is implemented as a publicly available R package available from https://github.com/limfuxing/ruvIIInb.
Contact: [email protected], [email protected]
Supplementary information: Online Supplementary Methods
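The pseudo-replicate idea admits a simple least-squares caricature: after centring cells within each pseudo-replicate group (cells assumed to share biology), the residuals contain only unwanted variation, so its leading factors can be estimated there and regressed out of the data. This is only a loose sketch of the concept on log-scale data; RUV-III-NB itself works on counts via a negative binomial model, and all names below are ours:

```python
import numpy as np

def pseudo_replicate_adjust(Y, groups, k=1):
    """Y: cells x genes (log scale); groups: pseudo-replicate labels.

    1. Centre cells within each group: residuals R carry only
       unwanted variation, never biology.
    2. Estimate k unwanted factors W from the leading singular
       vectors of R.
    3. Regress W out of Y.
    """
    Y = np.asarray(Y, dtype=float)
    groups = np.asarray(groups)
    R = Y.copy()
    for g in np.unique(groups):
        idx = groups == g
        R[idx] -= R[idx].mean(axis=0)          # within-group centring
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    W = U[:, :k] * s[:k]                       # estimated unwanted factors
    beta = np.linalg.lstsq(W, Y, rcond=None)[0]
    return Y - W @ beta                        # adjusted expression
```

Because W is estimated only from within-group residuals, biological differences between groups survive the adjustment, which is the point of pseudo-replication.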


Author(s):  
Irzam Sarfraz ◽  
Muhammad Asif ◽  
Joshua D Campbell

Abstract
Motivation: R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated by technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to subset the data matrix before applying downstream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples, or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of the assay from the original object. However, this approach is inefficient, as it requires the creation of an additional object containing a copy of the original assay, and it leads to challenges with data provenance.
Results: To overcome these challenges, we developed an R package called ExperimentSubset, a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, ExperimentSubset is a flexible container for the efficient management of subsets.
Availability and implementation: The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset.
Supplementary information: Supplementary data are available at Bioinformatics online.
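The storage idea is that a subset need only record index vectors and its parent, never a copy of the matrix. ExperimentSubset itself is an R/Bioconductor S4 class; the Python sketch below (all names ours) mirrors the mechanism, including the parent chain that gives provenance for free:

```python
import numpy as np

class AssaySubsets:
    """One parent assay plus named subsets stored as index vectors.

    No subset copies the matrix; retrieval resolves indices through
    the parent chain, so provenance is recorded by construction.
    """
    def __init__(self, assay):
        self.assay = np.asarray(assay)
        self.subsets = {}  # name -> (row indices, col indices, parent name)

    def create_subset(self, name, rows=None, cols=None, parent=None):
        base = self.assay if parent is None else self.get_subset(parent)
        rows = np.arange(base.shape[0]) if rows is None else np.asarray(rows)
        cols = np.arange(base.shape[1]) if cols is None else np.asarray(cols)
        self.subsets[name] = (rows, cols, parent)

    def get_subset(self, name):
        # indices are relative to the parent, so resolve recursively
        rows, cols, parent = self.subsets[name]
        base = self.assay if parent is None else self.get_subset(parent)
        return base[np.ix_(rows, cols)]
```

A QC subset and a highly-variable-gene subset of that QC subset then cost two small index vectors each, rather than two matrix copies.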


Author(s):  
Givanna H Putri ◽  
Irena Koprinska ◽  
Thomas M Ashhurst ◽  
Nicholas J C King ◽  
Mark N Read

Abstract
Motivation: Many ‘automated gating’ algorithms now exist to cluster cytometry and single-cell sequencing data into discrete populations. Comparative algorithm evaluations on benchmark datasets rely either on a single performance metric, or on a few metrics considered independently of one another. However, single metrics emphasize different aspects of clustering performance and do not rank clustering solutions in the same order. This underlies the lack of consensus between comparative studies regarding optimal clustering algorithms and undermines the translatability of results onto other non-benchmark datasets.
Results: We propose the Pareto fronts framework as an integrative evaluation protocol, wherein individual metrics are instead leveraged as complementary perspectives. Judged superior are algorithms that provide the best trade-off between the multiple metrics considered simultaneously. This yields a more comprehensive and complete view of clustering performance. Moreover, by broadly and systematically sampling algorithm parameter values using the Latin Hypercube sampling method, our evaluation protocol minimizes (un)fortunate parameter value selections as confounding factors. Furthermore, it reveals how meticulously each algorithm must be tuned in order to obtain good results, vital knowledge for users with novel data. We exemplify the protocol by conducting a comparative study between three clustering algorithms (ChronoClust, FlowSOM and Phenograph) using four common performance metrics applied across four cytometry benchmark datasets. To our knowledge, this is the first time Pareto fronts have been used to evaluate the performance of clustering algorithms in any application domain.
Availability and implementation: Implementation of our Pareto front methodology and all scripts and datasets to reproduce this article are available at https://github.com/ghar1821/ParetoBench.
Supplementary information: Supplementary data are available at Bioinformatics online.
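The framework's central computation, finding the non-dominated set of clustering solutions, is compact enough to state directly. A minimal sketch, assuming every metric has been oriented so larger is better (function names ours, not the ParetoBench API):

```python
def dominates(q, p):
    """q dominates p: at least as good on every metric,
    strictly better on at least one."""
    return all(a >= b for a, b in zip(q, p)) and any(a > b for a, b in zip(q, p))

def pareto_front(solutions):
    """Return the non-dominated subset of metric tuples."""
    return [p for p in solutions if not any(dominates(q, p) for q in solutions)]
```

Every solution on the front represents a defensible trade-off: improving it on one metric necessarily worsens another, which is exactly the multi-metric view the protocol argues for.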


2019 ◽  
Vol 35 (14) ◽  
pp. i417-i426 ◽  
Author(s):  
Erin K Molloy ◽  
Tandy Warnow

Abstract
Motivation: At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n^5) running time for datasets with n species.
Results: Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree, and it has dramatically faster running time within the same divide-and-conquer framework, requiring only O(n^2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets on which they would otherwise fail, when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All.
Availability and implementation: TreeMerge is publicly available on GitHub (http://github.com/ekmolloy/treemerge).
Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Yixuan Qiu ◽  
Jiebiao Wang ◽  
Jing Lei ◽  
Kathryn Roeder

Abstract
Motivation: Marker genes, defined as genes that are expressed primarily in a single cell type, can be identified from the single-cell transcriptome; however, such data are not always available for the many uses of marker genes, such as deconvolution of bulk tissue. Marker genes for a cell type, however, are highly correlated in bulk data, because their expression levels depend primarily on the proportion of that cell type in the samples. Therefore, when many tissue samples are analyzed, it is possible to identify these marker genes from the correlation pattern.
Results: To capitalize on this pattern, we develop a new algorithm to detect marker genes by combining published information about likely marker genes with bulk transcriptome data in the form of a semi-supervised algorithm. The algorithm then exploits the correlation structure of the bulk data to refine the published marker genes by adding or removing genes from the list.
Availability and implementation: We implement this method as an R package markerpen, hosted on CRAN (https://CRAN.R-project.org/package=markerpen).
Supplementary information: Supplementary data are available at Bioinformatics online.
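The correlation intuition behind the refinement step can be sketched in a few lines: genes whose bulk profiles co-vary strongly with the current marker set are candidates for addition, and published markers that do not co-vary with the rest are candidates for removal. This is only an assumed caricature of the idea, not the markerpen algorithm itself (which is a more principled semi-supervised, penalized method); the thresholds and names below are ours:

```python
import numpy as np

def refine_markers(expr, seeds, add_thr=0.8, drop_thr=0.3):
    """expr: samples x genes bulk matrix; seeds: published marker indices.

    Markers of one cell type rise and fall together with its proportion,
    so they are mutually correlated in bulk. Score each gene by its mean
    correlation with the seed set, then add or drop accordingly.
    """
    corr = np.corrcoef(np.asarray(expr, dtype=float), rowvar=False)
    seeds = list(seeds)
    refined = []
    for g in range(corr.shape[0]):
        others = [s for s in seeds if s != g]
        score = np.mean([corr[g, s] for s in others])
        keep_seed = g in seeds and score >= drop_thr
        add_gene = g not in seeds and score >= add_thr
        if keep_seed or add_gene:
            refined.append(g)
    return refined
```

With many tissue samples, the marker-marker correlations are sharp enough that even this crude scoring separates proportion-driven genes from unrelated ones.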


Author(s):  
Yang Xu ◽  
Priyojit Das ◽  
Rachel Patton McCord

Abstract
Motivation: Deep learning approaches have empowered single-cell omics data analysis in many ways and generated new insights from complex cellular systems. As there is an increasing need for single-cell omics data to be integrated across sources, types and features of data, the challenges of integrating single-cell omics data are rising. Here, we present SMILE (Single-cell Mutual Information Learning), an unsupervised deep learning algorithm that learns discriminative representations for single-cell data by maximizing mutual information.
Results: Using a unique cell-pairing design, SMILE successfully integrates multi-source single-cell transcriptome data, removing batch effects and projecting similar cell types, even from different tissues, into the shared space. SMILE can also integrate data from two or more modalities, such as joint profiling technologies using single-cell ATAC-seq, RNA-seq, DNA methylation, Hi-C and ChIP data. When paired cells are known, SMILE can integrate data with unmatched features, such as genes for RNA-seq and genome-wide peaks for ATAC-seq. Integrated representations learned from joint profiling technologies can then be used as a framework for comparing independent single-source data.
Supplementary information: Supplementary data are available at Bioinformatics online. The source code of SMILE, including analyses of key results in the study, can be found at https://github.com/rpmccordlab/SMILE.
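Mutual-information maximization over paired cells is typically realized as a contrastive (InfoNCE-style) objective: each cell's embedding under one view should identify its pair under the other view among all candidates. The numpy caricature below illustrates that loss only; it is our simplification, not SMILE's actual training code, which uses neural-network encoders:

```python
import numpy as np

def info_nce_loss(z1, z2, tau=0.1):
    """z1[i] and z2[i] embed the same cell under two views/modalities.

    For each row, the paired cell is the positive and all other cells
    are negatives; the loss is low when pairs are mutual nearest
    neighbours in the shared embedding space.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                     # cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # cross-entropy on the diagonal
```

Minimizing this loss pulls paired cells together and pushes unpaired cells apart, which is how unmatched feature spaces (genes vs. peaks) end up in one shared space.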


Author(s):  
Antonio Pinto ◽  
Riccardo Broglia ◽  
Elena Ciappi ◽  
Andrea Di Mascio ◽  
Emilio F. Campana ◽  
...  

Vortex-Induced Vibration (VIV) is one of the most demanding areas in the offshore industry, and detailed investigation of the fluid-structure interaction is becoming fundamental for designing new structures able to reduce the VIV phenomenon. To carry out such an analysis, and to get reliable results in terms of global coefficients, the correct modelling of turbulence, the boundary layer and separated flows is required. Nonetheless, the more accurate the simulation, the more costly the computation. Unsteady RANS simulations provide a good trade-off between numerical accuracy and computational time. This paper presents the analysis of the flow past a cylinder with several three-dimensional helical fins at high Reynolds number. The flow field, vortical structures and response frequency patterns are analysed. Spectral analysis of the data is performed to identify carrier frequencies, deemed to be critical due to the induced vibration of the whole structure. Finally, the efficiency of helical strakes in reducing riser vibrations is also addressed, through direct consideration of the carrier shedding frequency.
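Identifying a carrier (shedding) frequency from a force or displacement time history is, at its core, an FFT-peak computation. The sketch below shows that generic step only; it is an assumed illustration, not the authors' actual post-processing chain:

```python
import numpy as np

def carrier_frequency(signal, fs):
    """Dominant frequency (Hz) of a sampled time history.

    signal: e.g. lift-coefficient samples; fs: sampling rate in Hz.
    The mean is removed so the zero-frequency bin does not mask the peak.
    """
    sig = np.asarray(signal, dtype=float) - np.mean(signal)
    spectrum = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]
```

Comparing this peak between the bare and finned cylinders is one direct way to quantify how the helical strakes disrupt coherent shedding.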


2020 ◽  
Author(s):  
Aristidis G. Vrahatis ◽  
Sotiris Tasoulis ◽  
Spiros Georgakopoulos ◽  
Vassilis Plagianakos

Abstract
Nowadays, biomedical data are generated at an exponential rate, creating datasets for analysis with ultra-high dimensionality and complexity. This revolution, driven by recent advances in biotechnologies, has led to big-data and data-driven computational approaches. An indicative example is the emerging single-cell RNA-sequencing (scRNA-seq) technology, which isolates and measures individual cells. Although scRNA-seq has revolutionized the biotechnology domain, the computational analysis of such data is a major challenge because of their ultra-high dimensionality and complexity. Following this direction, in this work we study the properties, effectiveness and generalization of the recently proposed MRPV algorithm for single-cell RNA-seq data. MRPV is an ensemble classification technique utilizing multiple ultra-low-dimensional randomly projected spaces. A given classifier determines the class of each sample in every independent space, and a majority voting scheme defines the predominant class. We show that random projection ensembles offer a platform not only for low computational time but also for enhanced classification performance. The developed methodologies were applied to four real high-dimensional biomedical datasets from single-cell RNA-seq studies and compared against well-known and similar classification tools. Experimental results show that, with simple building blocks, we can create a computationally fast, simple, yet effective approach for single-cell RNA-seq data with ultra-high dimensionality.
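The ensemble scheme described above, classify in several independent random projections, then take a majority vote, can be sketched compactly. This is our illustration of the general technique with a 1-nearest-neighbour base classifier, not the authors' MRPV code:

```python
import numpy as np

def rp_ensemble_predict(Xtr, ytr, Xte, n_spaces=5, dim=3, seed=0):
    """Majority vote of 1-NN classifiers, each fitted in an independent
    ultra-low-dimensional Gaussian random projection of the data.

    Xtr: train samples x features; ytr: integer labels; Xte: test samples.
    """
    rng = np.random.default_rng(seed)
    Xtr, Xte = np.asarray(Xtr, dtype=float), np.asarray(Xte, dtype=float)
    ytr = np.asarray(ytr)
    votes = []
    for _ in range(n_spaces):
        P = rng.normal(size=(Xtr.shape[1], dim)) / np.sqrt(dim)  # random projection
        A, B = Xtr @ P, Xte @ P
        d = ((B[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)  # pairwise sq. dist.
        votes.append(ytr[np.argmin(d, axis=1)])                  # 1-NN per space
    votes = np.asarray(votes)
    # majority vote across the ensemble, one column per test sample
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

Each projection costs only O(features x dim), which is where the speed-up on ultra-high-dimensional scRNA-seq matrices comes from.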

