SpatialDE2: Fast and localized variance component analysis of spatial transcriptomics

2021 ◽  
Author(s):  
Ilia Kats ◽  
Roser Vento-Tormo ◽  
Oliver Stegle

Spatial transcriptomics is now a mature technology that allows gene expression changes to be assayed in the histological context of complex tissues. A canonical analysis workflow starts with the identification of tissue zones that share similar expression profiles, followed by the detection of highly variable or spatially variable genes. Rapid increases in the scale and complexity of spatial transcriptomic datasets demand that these analysis steps be conducted in a consistent and integrated manner, a requirement that is not met by current methods. To address this, we present SpatialDE2, which unifies the mapping of tissue zones and the detection of spatially variable genes in an integrated software framework, while also advancing current algorithms for both of these steps. Formulated in a Bayesian framework, the model accounts for Poisson count noise while offering superior computational speed compared to previous methods. We validate SpatialDE2 using simulated data and illustrate its utility in two real-world applications to spatial transcriptomics profiles of the mouse brain and human endometrium.

Author(s):  
Darren J. Croton

Abstract. The Hubble constant, H0, or its dimensionless equivalent, “little h”, is a fundamental cosmological property that is now known to an accuracy better than a few per cent. Despite its cosmological nature, little h commonly appears in the measured properties of individual galaxies. This can pose unique challenges for users of such data, particularly with survey data. In this paper we show how little h arises in the measurement of galaxies, how to compare like-properties from different datasets that have assumed different little h cosmologies, and how to fairly compare theoretical data with observed data, where little h can manifest in vastly different ways. This last point is particularly important when observations are used to calibrate galaxy formation models, as calibrating with the wrong (or no) little h can lead to disastrous results when the model is later converted to the correct h cosmology. We argue that in this modern age little h is an anachronism, being one of the least uncertain parameters in astrophysics, and we propose that observers and theorists instead treat this uncertainty like any other. We conclude with a ‘cheat sheet’ of nine points that should be followed when dealing with little h in data analysis.
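As a rough illustration of comparing like-properties across datasets that assumed different values of little h, the sketch below rescales a quantity that varies as a power of h. The function name `rescale_little_h` and the example exponents (-1 for a distance, -2 for a luminosity-derived mass) are assumptions chosen for illustration, not values taken from the abstract; the correct exponent depends on how each property was measured.

```python
def rescale_little_h(value, h_exponent, h_old, h_new):
    """Rescale a quantity that varies as h**h_exponent from h_old to h_new.

    Hypothetical helper: the caller must supply the correct exponent
    for the property in question.
    """
    if h_old <= 0 or h_new <= 0:
        raise ValueError("h must be positive")
    return value * (h_new / h_old) ** h_exponent


# Assumed example: a distance of 100 Mpc computed with h = 1.0,
# treated as scaling with h**-1.
distance_h07 = rescale_little_h(100.0, -1, h_old=1.0, h_new=0.7)

# Assumed example: a luminosity-derived mass of 1e10 solar masses computed
# with h = 1.0, treated as scaling with h**-2.
mass_h07 = rescale_little_h(1.0e10, -2, h_old=1.0, h_new=0.7)
```

Keeping the exponent explicit makes the assumed h dependence visible at every call site, which is where mismatches between datasets tend to hide.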


2018 ◽  
Vol 34 (1) ◽  
pp. 121-148
Author(s):  
Jonathan Lisic ◽  
Hejian Sang ◽  
Zhengyuan Zhu ◽  
Stephanie Zimmer

Abstract. A computational approach to optimal multivariate designs with respect to stratification and allocation is investigated under the assumptions of fixed total allocation, known number of strata, and the availability of administrative data correlated with the variables of interest under coefficient-of-variation constraints. This approach uses a penalized objective function that is optimized by simulated annealing through exchanging sampling units and sample allocations among strata. Computational speed is improved through the use of a computationally efficient machine learning method such as K-means to create an initial stratification close to the optimal stratification. The numeric stability of the algorithm has been investigated and parallel processing has been employed where appropriate. Results are presented for both simulated data and USDA’s June Agricultural Survey. An R package has also been made available for evaluation.


2017 ◽  
Author(s):  
Valentine Svensson ◽  
Sarah A Teichmann ◽  
Oliver Stegle

Technological advances have enabled low-input RNA-sequencing, paving the way for assaying transcriptome variation in spatial contexts, including in tissues. While the generation of spatially resolved transcriptome maps is increasingly feasible, computational methods for analysing the resulting data are not established. Existing analysis strategies either ignore the spatial component of gene expression variation or require discretization of the cells into coarse-grained groups. To address this, we have developed SpatialDE, a computational framework for identifying and characterizing spatially variable genes. Our method generalizes variable gene selection, as used in population- and single-cell studies, to spatial expression profiles. To illustrate the broad utility of our approach, we apply SpatialDE to spatial transcriptomics data, and to data from single-cell methods based on multiplexed in situ hybridisation (SeqFISH and MERFISH). SpatialDE enables the statistically robust identification of spatially variable genes, thereby identifying genes with known disease implications, several of which are missed by conventional variable gene selection. Additionally, to enable gene-expression-based histology, SpatialDE implements a spatial gene clustering model which we call “automatic expression histology,” which classifies genes into groups with distinct spatial patterns.


2016 ◽  
Author(s):  
Aaron T. L. Lun ◽  
John C. Marioni

Abstract. An increasing number of studies are using single-cell RNA-sequencing (scRNA-seq) to characterize the gene expression profiles of individual cells. One common analysis applied to scRNA-seq data involves detecting differentially expressed (DE) genes between cells in different biological groups. However, many experiments are designed such that the cells to be compared are processed in separate plates or chips, meaning that the groupings are confounded with systematic plate effects. This confounding aspect is frequently ignored in DE analyses of scRNA-seq data. In this article, we demonstrate that failing to consider plate effects in the statistical model results in loss of type I error control. A solution is proposed whereby counts are summed from all cells in each plate and the count sums for all plates are used in the DE analysis. This restores type I error control in the presence of plate effects without compromising detection power in simulated data. Summation is also robust to varying numbers and library sizes of cells on each plate. Similar results are observed in DE analyses of real data where the use of count sums instead of single-cell counts improves specificity and the ranking of relevant genes. This suggests that summation can assist in maintaining statistical rigour in DE analyses of scRNA-seq data with plate effects.
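The summation step described in this abstract can be sketched in a few lines. This is a minimal illustration under assumed inputs (a list of per-cell count vectors and a parallel list of plate labels); the function name `sum_counts_by_plate` is hypothetical, and the downstream DE analysis on the plate-level sums is not shown.

```python
def sum_counts_by_plate(counts, plates):
    """Sum per-cell gene counts within each plate.

    counts: list of per-cell lists, one count per gene.
    plates: list of plate labels, one per cell.
    Returns a dict mapping plate label to a list of summed counts.
    """
    if len(counts) != len(plates):
        raise ValueError("need exactly one plate label per cell")
    sums = {}
    for cell_counts, plate in zip(counts, plates):
        if plate not in sums:
            sums[plate] = [0] * len(cell_counts)
        if len(cell_counts) != len(sums[plate]):
            raise ValueError("all cells must cover the same genes")
        for gene_index, count in enumerate(cell_counts):
            sums[plate][gene_index] += count
    return sums


# Toy data: four cells, three genes, two plates.
cell_counts = [[1, 0, 3], [2, 1, 0], [0, 5, 1], [4, 0, 0]]
plate_labels = ["A", "A", "B", "B"]
plate_sums = sum_counts_by_plate(cell_counts, plate_labels)
```

Each plate then contributes a single count vector, so the plate, rather than the cell, becomes the unit of replication in the DE model.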


2016 ◽  
Author(s):  
Bahman Afsari ◽  
Theresa Guo ◽  
Michael Considine ◽  
Liliana Florea ◽  
Luciane T. Kagohara ◽  
...  

Abstract. Motivation: Current bioinformatics methods to detect changes in gene isoform usage in distinct phenotypes compare the relative expected isoform usage in phenotypes. These statistics model differences in isoform usage in normal tissues, which have stable regulation of gene splicing. Pathological conditions, such as cancer, can have broken regulation of splicing that increases the heterogeneity of the expression of splice variants. Inferring events with such differential heterogeneity in gene isoform usage requires new statistical approaches. Results: We introduce Splice Expression Variability Analysis (SEVA) to model increased heterogeneity of splice variant usage between conditions (e.g., tumor and normal samples). SEVA uses a rank-based multivariate statistic that compares the variability of junction expression profiles within one condition to the variability within another. Simulated data show that SEVA is unique in modeling heterogeneity of gene isoform usage, and we benchmark SEVA’s performance against EBSeq, DiffSplice, and rMATS, which model differential isoform usage instead of heterogeneity. We confirm the accuracy of SEVA in identifying known splice variants in head and neck cancer and perform cross-study validation of novel splice variants. A novel comparison of splice variant heterogeneity between subtypes of head and neck cancer demonstrated unanticipated similarity between the heterogeneity of gene isoform usage in HPV-positive and HPV-negative subtypes and anticipated increased heterogeneity among HPV-negative samples with mutations in genes that regulate the splice variant machinery. Conclusion: These results show that SEVA accurately models differential heterogeneity of gene isoform usage from RNA-seq data. Availability: SEVA is implemented as an R/Bioconductor package.


2021 ◽  
Author(s):  
Axel Peytavin ◽  
Bruno Sainte-Rose ◽  
Gael Forget ◽  
Jean-Michel Campin

Abstract. A numerical scheme to perform data assimilation of concentration measurements in Lagrangian models is presented, along with its first implementation called Ocean Plastic Assimilator, which aims at improving predictions of plastics distributions over the oceans. This scheme uses an ensemble method over a set of particle dispersion simulations. At each step, concentration observations are assimilated across the ensemble members by switching back and forth between Eulerian and Lagrangian representations. We design two experiments to assess the scheme efficacy and efficiency when assimilating simulated data in a simple double gyre model. Analysis convergence is observed with higher accuracy when lowering observation variance or using a more suitable circulation model. Results show that the distribution of plastic mass in an area can effectively be approached with this simple assimilation scheme. Thus, this method is considered a suitable candidate for creating a tool to assimilate plastic concentration observations in real-world applications to forecast plastic distributions in the oceans. Finally, several improvements that could further enhance the method efficiency are identified.


2017 ◽  
Author(s):  
Florian Wagner ◽  
Yun Yan ◽  
Itai Yanai

High-throughput single-cell RNA-Seq (scRNA-Seq) is a powerful approach for studying heterogeneous tissues and dynamic cellular processes. However, compared to bulk RNA-Seq, single-cell expression profiles are extremely noisy, as they only capture a fraction of the transcripts present in the cell. Here, we propose the k-nearest neighbor smoothing (kNN-smoothing) algorithm, designed to reduce noise by aggregating information from similar cells (neighbors) in a computationally efficient and statistically tractable manner. The algorithm is based on the observation that across protocols, the technical noise exhibited by UMI-filtered scRNA-Seq data closely follows Poisson statistics. Smoothing is performed by first identifying the nearest neighbors of each cell in a step-wise fashion, based on partially smoothed and variance-stabilized expression profiles, and then aggregating their transcript counts. We show that kNN-smoothing greatly improves the detection of clusters of cells and co-expressed genes, and clearly outperforms other smoothing methods on simulated data. To accurately perform smoothing for datasets containing highly similar cell populations, we propose the kNN-smoothing 2 algorithm, in which neighbors are determined after projecting the partially smoothed data onto the first few principal components. We show that unlike its predecessor, kNN-smoothing 2 can accurately distinguish between cells from different T cell subsets, and enables their identification in peripheral blood using unsupervised methods. Our work facilitates the analysis of scRNA-Seq data across a broad range of applications, including the identification of cell populations in heterogeneous tissues and the characterization of dynamic processes such as cellular differentiation. Reference implementations of our algorithms can be found at https://github.com/yanailab/knn-smoothing.
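The core idea of aggregating transcript counts over similar cells can be sketched as a single smoothing pass. This is a simplified illustration, not the authors' step-wise algorithm (see the reference implementation linked in the abstract): the function name `knn_smooth_once` is hypothetical, and the choices of scaling to the mean library size and using the Freeman-Tukey transform for variance stabilization are assumptions made for the sketch.

```python
import math


def knn_smooth_once(counts, k):
    """One simplified smoothing pass over a list of per-cell count vectors.

    Each cell is replaced by the sum of raw counts over its k nearest
    cells (itself included), with distances computed on normalized,
    variance-stabilized profiles.
    """
    n_cells = len(counts)
    if not 1 <= k <= n_cells:
        raise ValueError("k must be between 1 and the number of cells")
    totals = [sum(cell) for cell in counts]
    if min(totals) <= 0:
        raise ValueError("every cell needs at least one count")
    mean_total = sum(totals) / n_cells

    # Scale to a common library size, then apply sqrt(x) + sqrt(x + 1).
    profiles = []
    for cell, total in zip(counts, totals):
        scaled = [x * mean_total / total for x in cell]
        profiles.append([math.sqrt(x) + math.sqrt(x + 1.0) for x in scaled])

    smoothed = []
    for i in range(n_cells):
        # Rank all cells by Euclidean distance to cell i.
        by_distance = sorted(
            range(n_cells), key=lambda j: math.dist(profiles[i], profiles[j])
        )
        neighbors = by_distance[:k]
        n_genes = len(counts[i])
        # Aggregate the raw (untransformed) counts of the neighbors.
        smoothed.append(
            [sum(counts[j][g] for j in neighbors) for g in range(n_genes)]
        )
    return smoothed


# Toy data: two cells dominated by gene 0 and two dominated by gene 1.
toy_counts = [[10, 0], [8, 1], [0, 9], [1, 12]]
toy_smoothed = knn_smooth_once(toy_counts, k=2)
```

Summing raw counts rather than averaging transformed values keeps the output on the count scale, which is consistent with the Poisson noise argument made in the abstract.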


2019 ◽  
Author(s):  
Teng Fei ◽  
Tianwei Yu

Abstract. Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. We present scBatch, a numerical algorithm that conducts batch effect correction on the count matrix of RNA sequencing (RNA-seq) data. Unlike traditional methods, scBatch starts by establishing an ideal correction of the sample distance matrix that effectively reflects the underlying biological subgroups, without considering the actual correction of the raw count matrix itself. It then seeks an optimal linear transformation of the count matrix to approximate the established sample pattern. The benefit of such an approach is that the final result is not restricted by assumptions on the mechanism of the batch effect. As a result, the method yields good clustering and gene differential expression (DE) results. We compared the new method, scBatch, with leading batch effect removal methods ComBat and mnnCorrect on simulated data, real bulk RNA-seq data, and real single-cell RNA-seq data. The comparisons demonstrated that scBatch achieved better sample clustering and DE gene detection results.


Genes ◽  
2020 ◽  
Vol 11 (4) ◽  
pp. 377 ◽  
Author(s):  
Maryam Zand ◽  
Jianhua Ruan

Single-cell RNA sequencing is a powerful technology for obtaining transcriptomes at single-cell resolutions. However, it suffers from dropout events (i.e., excess zero counts) since only a small fraction of transcripts get sequenced in each cell during the sequencing process. This inherent sparsity of expression profiles hinders further characterizations at cell/gene-level such as cell type identification and downstream analysis. To alleviate this dropout issue we introduce a network-based method, netImpute, by leveraging the hidden information in gene co-expression networks to recover real signals. netImpute employs Random Walk with Restart (RWR) to adjust the gene expression level in a given cell by borrowing information from its neighbors in a gene co-expression network. Performance evaluation and comparison with existing tools on simulated data and seven real datasets show that netImpute substantially enhances clustering accuracy and data visualization clarity, thanks to its effective treatment of dropouts. While the idea of netImpute is general and can be applied with other types of networks such as cell co-expression network or protein–protein interaction (PPI) network, evaluation results show that gene co-expression network is consistently more beneficial, presumably because PPI network usually lacks cell type context, while cell co-expression network can cause information loss for rare cell types. Evaluation results on several biological datasets show that netImpute can more effectively recover missing transcripts in scRNA-seq data and enhance the identification and visualization of heterogeneous cell types than existing methods.
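A generic Random Walk with Restart over a small gene network can be sketched as below. This is the textbook iteration p ← (1 − r)·W·p + r·p₀ with a column-normalized adjacency matrix W; how netImpute builds its co-expression network, chooses the restart probability, or post-processes the result is not stated in the abstract, so the function name, the default `restart=0.5`, and the toy network are assumptions for illustration only.

```python
def random_walk_with_restart(adjacency, start, restart=0.5,
                             tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W @ p + restart * start until stable.

    adjacency: square list-of-lists of non-negative edge weights.
    start: initial vector, e.g. one cell's expression across genes.
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    # Column-normalize; an isolated node keeps its own value via a self-loop.
    transition = [
        [
            adjacency[i][j] / col_sums[j] if col_sums[j] > 0
            else (1.0 if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    p = list(start)
    for _ in range(max_iter):
        nxt = [
            (1.0 - restart) * sum(transition[i][j] * p[j] for j in range(n))
            + restart * start[i]
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy network: gene 0 is co-expressed with genes 1 and 2.
gene_network = [[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]]
# Gene 1 reads zero in this cell, possibly because of dropout.
cell_expression = [4.0, 0.0, 2.0]
imputed = random_walk_with_restart(gene_network, cell_expression)
```

In this toy case the zero entry for gene 1 receives signal from its neighbor gene 0, while the restart term keeps every gene anchored to its observed value.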


Author(s):  
R.O. Aliev ◽  
N.M. Borisov

As early as 2002, the need was declared for a public repository of experimental results for gene expression profiling. Since that time, several storage hubs for gene expression profiling data have been created to enable profile analysis and comparison. Gene expression profiling is usually performed using either mRNA microarray hybridization or next-generation sequencing. However, these data may be heterogeneous, even if they were obtained for the same type of normal or pathologically altered organs and tissues and were investigated using the same experimental platform. In the current work, we propose a new method for analyzing the homogeneity of expression data based on the Student test. Using computational experiments, we show the advantage of our method in terms of computational speed for large datasets, and we develop an approach to interpreting the results of applying the Student test. Using this new method of data analysis, we suggest a scheme for visualizing the overall picture of gene expression and comparing expression profiles across different diseases and/or different stages of the same disease.

