Utilizing Scp for the analysis and replication of single-cell proteomics data

Author(s):  
Christophe Vanderaa ◽  
Laurent Gatto

Abstract
Introduction: Mass spectrometry-based proteomics is actively embracing quantitative, single-cell-level analyses. Indeed, recent advances in sample preparation and mass spectrometry (MS) have enabled the emergence of quantitative MS-based single-cell proteomics (SCP). While exciting and promising, SCP still has many rough edges. Current analysis workflows are custom and built from scratch. The field is therefore craving standardized software that promotes principled and reproducible SCP data analyses.
Areas covered: This special report represents a first step toward the formalization of standard SCP data analysis. Scp, the software that accompanies this work, successfully reproduces one of the landmark datasets in the field of SCP. We created a repository containing the reproduction workflow with comprehensive documentation in order to favor further dissemination and improvement of SCP data analyses.
Expert opinion: Reproducing SCP data analyses uncovers important challenges in SCP data analysis. We describe two such challenges in detail: batch correction and data missingness. We present the current state of the art and illustrate the associated limitations. We also highlight the intimate dependence that exists between batch effects and data missingness and provide future tracks for dealing with these exciting challenges.
Article highlights:
- Single-cell proteomics (SCP) is emerging thanks to several recent technological advances, but further progress is hampered by the lack of principled and systematic data analysis.
- This work offers a standardized solution for the processing of SCP data, demonstrated by the reproduction of a landmark SCP analysis.
- Two important challenges remain: batch effects and data missingness. Furthermore, these challenges are not independent and therefore need to be modeled simultaneously.
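The entanglement of batch effects and missingness noted in the expert opinion can be made concrete with a toy example. Scp itself is an R/Bioconductor package; the following minimal Python/NumPy sketch (all names and sizes are illustrative, not part of Scp) simulates a TMT-style design in which each acquisition batch observes only its own subset of peptides, so a naive per-batch correction has no information about the peptides a batch never measured.

    import numpy as np

    rng = np.random.default_rng(0)
    n_peptides, n_cells, n_batches = 200, 60, 3
    batch = np.repeat(np.arange(n_batches), n_cells // n_batches)

    # True signal plus an additive, peptide-specific batch shift.
    signal = rng.normal(10, 1, size=(n_peptides, n_cells))
    shift = rng.normal(0, 2, size=(n_peptides, n_batches))
    X = signal + shift[:, batch]

    # In SCP/TMT designs each acquisition (batch) identifies its own subset
    # of peptides, so missingness is structured by batch, not random.
    for b in range(n_batches):
        missing = rng.choice(n_peptides, size=n_peptides // 2, replace=False)
        X[np.ix_(missing, np.where(batch == b)[0])] = np.nan

    # Naive per-batch centering on observed values only: a peptide entirely
    # missing from a batch yields a NaN median (and a warning), because the
    # batch shift there is simply not identifiable -- the correction problem
    # and the missingness problem are entangled.
    X_corrected = X.copy()
    for b in range(n_batches):
        cols = batch == b
        X_corrected[:, cols] -= np.nanmedian(X[:, cols], axis=1, keepdims=True)

    print("fraction missing:", np.isnan(X).mean())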

2020 ◽  
Author(s):  
Ruben Chazarra-Gil ◽  
Stijn van Dongen ◽  
Vladimir Yu Kiselev ◽  
Martin Hemberg

Abstract
As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets are now available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although several computational methods are available that can remove batch effects, evaluating which method performs best is not straightforward. Here we present BatchBench (https://github.com/cellgeni/batchbench), a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. We apply BatchBench to eight methods, highlighting their methodological differences and assessing their performance and computational requirements through a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.
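Among the metrics such benchmarks typically report are entropy-based measures of batch mixing. The sketch below is a generic Python implementation of a kNN batch-mixing entropy for illustration only (function and parameter names are my own, not BatchBench's code): higher normalized entropy means each cell's neighbourhood contains a more even mixture of batches.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def batch_mixing_entropy(embedding, batch_labels, k=30):
        """Mean Shannon entropy of batch labels among each cell's k nearest
        neighbours, normalised to [0, 1]; higher means better mixing."""
        batches = np.unique(batch_labels)
        nn = NearestNeighbors(n_neighbors=k).fit(embedding)
        _, idx = nn.kneighbors(embedding)
        entropies = []
        for neighbours in idx:
            labels = batch_labels[neighbours]
            p = np.array([(labels == b).mean() for b in batches])
            p = p[p > 0]
            entropies.append(-(p * np.log(p)).sum())
        return float(np.mean(entropies)) / np.log(len(batches))

    # Toy usage: two perfectly mixed batches in a 2-D embedding.
    rng = np.random.default_rng(1)
    emb = rng.normal(size=(400, 2))
    batches = rng.integers(0, 2, size=400)
    print(batch_mixing_entropy(emb, batches))  # close to 1 for good mixing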


2015 ◽  
Vol 112 (21) ◽  
pp. 6545-6550 ◽  
Author(s):  
Rosemary M. Onjiko ◽  
Sally A. Moody ◽  
Peter Nemes

Spatial and temporal changes in molecular expression are essential to embryonic development, and their characterization is critical to understand mechanisms by which cells acquire different phenotypes. Although technological advances have made it possible to quantify expression of large molecules during embryogenesis, little information is available on metabolites, the ultimate indicator of physiological activity of the cell. Here, we demonstrate that single-cell capillary electrophoresis-electrospray ionization mass spectrometry is able to test whether differential expression of the genome translates to the domain of metabolites between single embryonic cells. Dissection of three different cell types with distinct tissue fates from 16-cell embryos of the South African clawed frog (Xenopus laevis) and microextraction of their metabolomes enabled the identification of 40 metabolites that anchored interconnected central metabolic networks. Relative quantitation revealed that several metabolites were differentially active between the cell types in the wild-type, unperturbed embryos. Altering postfertilization cytoplasmic movements that perturb dorsal development confirmed that these three cells have characteristic small-molecular activity already at cleavage stages as a result of cell type and not differences in pigmentation, yolk content, cell size, or position in the embryo. Changing the metabolite concentration caused changes in cell movements at gastrulation that also altered the tissue fates of these cells, demonstrating that the metabolome affects cell phenotypes in the embryo.


2021 ◽  
Vol 12 ◽  
Author(s):  
Bin Zou ◽  
Tongda Zhang ◽  
Ruilong Zhou ◽  
Xiaosen Jiang ◽  
Huanming Yang ◽  
...  

It is well recognized that batch effects in single-cell RNA sequencing (scRNA-seq) data remain a big challenge when integrating different datasets. Here, we propose deepMNN, a novel deep learning-based method to correct batch effects in scRNA-seq data. We first search for mutual nearest neighbor (MNN) pairs across different batches in a principal component analysis (PCA) subspace. Subsequently, a batch correction network is constructed by stacking two residual blocks and applied to remove the batch effects. The loss function of deepMNN is defined as the sum of a batch loss and a weighted regularization loss. The batch loss computes the distance between cells in MNN pairs in the PCA subspace, while the regularization loss keeps the output of the network similar to its input. The experimental results showed that deepMNN can successfully remove batch effects across datasets with identical cell types, datasets with non-identical cell types, datasets with multiple batches, and large-scale datasets. We compared the performance of deepMNN with state-of-the-art batch correction methods, including the widely used Harmony, Scanorama, and Seurat V4 as well as the recently developed deep learning-based methods MMD-ResNet and scGen. The results demonstrated that deepMNN achieved better or comparable performance in terms of both qualitative analysis using uniform manifold approximation and projection (UMAP) plots and quantitative metrics such as batch and cell entropies, ARI F1 score, and ASW F1 score under various scenarios. Additionally, deepMNN allows integrating scRNA-seq datasets with multiple batches in one step. Furthermore, deepMNN ran much faster than the other methods on large-scale datasets. These characteristics make deepMNN a promising new choice for large-scale single-cell gene expression data analysis.
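The abstract describes deepMNN's ingredients concretely enough to sketch them: MNN pairs found in a PCA subspace, a correction network of stacked residual blocks, and a loss summing a batch term over MNN pairs with a weighted regularization term. The following PyTorch sketch is a schematic re-implementation of those ideas under stated assumptions (layer sizes, k, the weight lam and the toy data are illustrative; this is not deepMNN's actual code).

    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.decomposition import PCA
    from sklearn.neighbors import NearestNeighbors

    def mnn_pairs(a, b, k=20):
        """Mutual nearest neighbour pairs between two batches (row indices)."""
        ab = NearestNeighbors(n_neighbors=k).fit(b).kneighbors(a, return_distance=False)
        ba = NearestNeighbors(n_neighbors=k).fit(a).kneighbors(b, return_distance=False)
        return np.array([(i, j) for i in range(len(a)) for j in ab[i] if i in ba[j]])

    class ResidualBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        def forward(self, x):
            return x + self.net(x)

    # Toy data: two batches of the same cells separated by an additive shift.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(300, 50))
    x1, x2 = base, base + 2.0

    pca = PCA(n_components=20).fit(np.vstack([x1, x2]))
    p1, p2 = pca.transform(x1), pca.transform(x2)
    pairs = mnn_pairs(p1, p2)

    net = nn.Sequential(ResidualBlock(20), ResidualBlock(20))  # two stacked blocks
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    t1 = torch.tensor(p1, dtype=torch.float32)  # reference batch, held fixed
    t2 = torch.tensor(p2, dtype=torch.float32)
    lam = 0.1  # weight of the regularization term

    for _ in range(200):
        out2 = net(t2)  # correct batch 2 toward batch 1
        batch_loss = ((t1[pairs[:, 0]] - out2[pairs[:, 1]]) ** 2).sum(1).mean()
        reg_loss = ((out2 - t2) ** 2).sum(1).mean()  # keep output near input
        loss = batch_loss + lam * reg_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(float(loss))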


2020 ◽  
Author(s):  
Wanqiu Chen ◽  
Yongmei Zhao ◽  
Xin Chen ◽  
Xiaojiang Xu ◽  
Zhaowei Yang ◽  
...  

Abstract
Single-cell RNA sequencing (scRNA-seq) has become a very powerful technology for biomedical research and is becoming much more affordable as methods continue to evolve, but it is unknown how reproducible different platforms are when using different bioinformatics pipelines, particularly the recently developed scRNA-seq batch correction algorithms. We carried out a comprehensive multi-center cross-platform comparison of different scRNA-seq platforms using standard reference samples. We compared six pre-processing pipelines, seven bioinformatics normalization procedures, and seven batch effect correction methods, including CCA, MNN, Scanorama, BBKNN, Harmony, limma and ComBat, to evaluate the performance and reproducibility of 20 scRNA-seq datasets derived from four different platforms and centers. We benchmarked scRNA-seq performance across platforms and testing sites using global gene expression profiles as well as cell-type-specific marker genes. We showed that there were large batch effects, and that the reproducibility of scRNA-seq across platforms was dictated both by the expression level of the genes selected and by the batch correction methods used. We found that CCA, MNN, and BBKNN all corrected the batch variations fairly well for scRNA-seq data derived from biologically similar samples across platforms/sites. However, for scRNA-seq data derived from or consisting of biologically distinct samples, limma and ComBat failed to correct batch effects, whereas CCA over-corrected the batch effect and misclassified the cell types and samples. In contrast, MNN, Harmony and BBKNN separated biologically different samples/cell types into correspondingly distinct dimensional subspaces; consistent with its algorithmic logic, however, MNN required that the evaluated samples each contain a shared portion of highly similar cells. In summary, we found great cross-platform consistency in separating two distinct samples when an appropriate batch correction method was used. We hope this large cross-platform/site scRNA-seq dataset will provide a valuable resource, and that our findings will offer useful advice for the single-cell sequencing community.
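Several of the compared methods are exposed through the scanpy ecosystem. A minimal sketch of running three of them on one object, assuming an AnnData with a "batch" column in .obs (the stand-in dataset and fake batch labels below are for illustration only; Harmony and BBKNN require the harmonypy and bbknn packages respectively):

    import scanpy as sc

    # Stand-in data: in practice `adata` would hold your own normalised,
    # log-transformed counts with a real batch annotation.
    adata = sc.datasets.pbmc3k_processed()
    adata.obs["batch"] = ["a" if i % 2 else "b" for i in range(adata.n_obs)]

    # ComBat regresses the batch covariate out of the expression matrix itself.
    sc.pp.combat(adata, key="batch")
    sc.pp.pca(adata)

    # Harmony corrects the PCA embedding (requires `harmonypy`);
    # the result is stored in adata.obsm["X_pca_harmony"].
    sc.external.pp.harmony_integrate(adata, key="batch")

    # BBKNN corrects the neighbour graph rather than the data (requires `bbknn`).
    sc.external.pp.bbknn(adata, batch_key="batch")

    sc.tl.umap(adata)
    sc.pl.umap(adata, color="batch")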


2019 ◽  
Author(s):  
Yiliang Zhang ◽  
Kexuan Liang ◽  
Molei Liu ◽  
Yue Li ◽  
Hao Ge ◽  
...  

Abstract
Single-cell RNA sequencing technologies have been widely used in recent years as a powerful tool for observing gene expression at the resolution of single cells. Two of the major challenges in scRNA-seq data analysis are dropout events and batch effects. The inflation of zeros (the dropout rate) varies substantially across single cells. Evidence has shown that technical noise, including batch effects, explains a notable proportion of this cell-to-cell variation. To capture biological variation, it is necessary to quantify and remove technical variation. Here, we introduce SCRIBE (Single-Cell Recovery Imputation with Batch Effects), a principled framework that imputes dropout events and corrects batch effects simultaneously. We demonstrate, through real examples, that SCRIBE outperforms existing scRNA-seq data analysis tools in recovering cell-specific gene expression patterns, removing batch effects and retaining biological variation across cells. Our software is freely available online at https://github.com/YiliangTracyZhang/SCRIBE.
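SCRIBE's actual statistical model is not given in this abstract. Purely to illustrate why imputation and batch correction interact and may benefit from joint treatment, here is a toy NumPy sketch (not SCRIBE's algorithm; all quantities are simulated) that alternates between estimating per-gene batch shifts from observed values and imputing dropouts from batch-adjusted gene means:

    import numpy as np

    rng = np.random.default_rng(0)
    n_genes, n_cells = 100, 80
    batch = np.repeat([0, 1], n_cells // 2)

    truth = rng.normal(5, 1, size=(n_genes, n_cells))
    X = truth + np.where(batch == 1, 1.5, 0.0)   # additive batch effect
    X[rng.random(X.shape) < 0.3] = np.nan        # dropout events

    # Alternate between (1) estimating per-gene batch shifts from observed
    # values and (2) imputing missing entries from batch-adjusted gene means.
    shift = np.zeros((n_genes, 2))
    for _ in range(10):
        gene_mean = np.nanmean(X - shift[:, batch], axis=1, keepdims=True)
        for b in (0, 1):
            cols = batch == b
            shift[:, b] = np.nanmean(X[:, cols] - gene_mean, axis=1)
        shift -= shift[:, :1]                    # anchor batch 0 at zero
    imputed = np.where(np.isnan(X), gene_mean + shift[:, batch], X)
    corrected = imputed - shift[:, batch]

    print("recovered mean shift:", shift[:, 1].mean().round(2))  # close to 1.5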


2020 ◽  
Author(s):  
Miroslav Kratochvíl ◽  
Oliver Hunewald ◽  
Laurent Heirendt ◽  
Vasco Verissimo ◽  
Jiří Vondrášek ◽  
...  

Abstract
Background: The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances make it easy to generate data with hundreds of millions of single-cell data points and more than 40 parameters, originating from thousands of individual samples. Analyzing that amount of high-dimensional data is demanding on both the hardware and software of high-performance computational resources. Current software tools often do not scale to datasets of this size; users are thus forced to down-sample the data to bearable sizes, in turn losing accuracy and the ability to detect many underlying complex phenomena.
Results: We present GigaSOM.jl, a fast and scalable implementation of clustering and dimensionality reduction for flow and mass cytometry data. Implementing GigaSOM.jl in the high-level, high-performance programming language Julia makes it accessible to the scientific community and allows for efficient handling and processing of datasets with billions of data points using distributed computing infrastructures. We describe the design of GigaSOM.jl, measure its performance and horizontal scaling capability, and showcase the functionality on a large dataset from a recent study.
Conclusions: GigaSOM.jl facilitates the use of commonly available high-performance computing resources to process the largest available datasets within minutes, while producing results of the same quality as the current state-of-the-art software. Measurements indicate that the performance scales to much larger datasets. The example use on data from a massive mouse phenotyping effort confirms the applicability of GigaSOM.jl to huge-scale studies.
Key points:
- GigaSOM.jl improves the applicability of FlowSOM-style single-cell cytometry data analysis by increasing the acceptable dataset size to billions of single cells.
- Significant speedup over current methods is achieved by distributed processing and the use of efficient algorithms.
- The GigaSOM.jl package includes support for fast visualization of multidimensional data.
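GigaSOM.jl is written in Julia; to make the underlying FlowSOM-style computation concrete, the following is a minimal NumPy sketch of batch-style self-organizing map training (the grid size, epoch count and Gaussian neighbourhood kernel are illustrative simplifications, not GigaSOM.jl's implementation):

    import numpy as np

    def train_som(data, grid=(10, 10), epochs=10, seed=0):
        """Minimal batch-style self-organizing map, for illustration only."""
        rng = np.random.default_rng(seed)
        n_nodes = grid[0] * grid[1]
        codebook = data[rng.choice(len(data), n_nodes, replace=False)].copy()
        gx, gy = np.meshgrid(np.arange(grid[0]), np.arange(grid[1]))
        coords = np.c_[gx.ravel(), gy.ravel()].astype(float)
        grid_d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)

        for epoch in range(epochs):
            radius = max(grid) / 2 * (1 - epoch / epochs) + 0.5  # shrinking radius
            # Best matching unit per cell, via expanded Euclidean distance.
            d2 = ((data ** 2).sum(1)[:, None]
                  - 2 * data @ codebook.T
                  + (codebook ** 2).sum(1)[None, :])
            bmu = d2.argmin(1)
            # Batch update: each node becomes a neighbourhood-weighted mean.
            h = np.exp(-grid_d2 / (2 * radius ** 2))  # neighbourhood kernel
            w = h[:, bmu]                             # (n_nodes, n_cells)
            codebook = (w @ data) / w.sum(1, keepdims=True)
        return codebook, bmu

    cells = np.random.default_rng(1).normal(size=(5000, 40))  # toy cytometry-like data
    codebook, bmu = train_som(cells)
    print(codebook.shape, np.bincount(bmu, minlength=100).min())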


2021 ◽  
Author(s):  
Claudia Ctortecka ◽  
Gabriela Krššáková ◽  
Karel Stejskal ◽  
Josef M. Penninger ◽  
Sasha Mendjan ◽  
...  

Abstract
Single-cell transcriptomics has revolutionized our understanding of basic biology and disease. Since transcript levels often do not correlate with protein expression, it is crucial to complement transcriptomics approaches with proteome analyses at single-cell resolution. Despite continuous technological improvements in sensitivity, mass spectrometry-based single-cell proteomics ultimately faces the challenge of reproducibly comparing the protein expression profiles of thousands of individual cells. Here, we combine two hitherto opposing analytical strategies, DIA and Tandem Mass Tag (TMT) multiplexing, to generate highly reproducible, quantitative proteome signatures from ultra-low input samples. While conventional data-dependent shotgun proteomics (DDA) of ultra-low input samples critically suffers from the accumulation of missing values as sample-cohort size increases, data-independent acquisition (DIA) strategies usually do not take full advantage of isotope-encoded sample multiplexing. We developed a novel, identification-independent proteomics data-analysis pipeline that allows quantitative comparison of DIA-TMT proteome signatures across hundreds of samples independent of their biological origin, and identification of cell types and single protein knockouts. We validate our approach using integrative data analysis of different human cell lines and standard database searches for knockouts of defined proteins. These data establish a novel and reproducible approach to markedly expand the number of proteins detected from ultra-low input samples, such as single cells.
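Because the pipeline is identification-independent, samples are compared directly on their quantitative signatures rather than on identified peptides. A toy Python sketch of that general idea, assuming a features-by-samples matrix with missing values (the data and thresholds below are fabricated for illustration; this is not the authors' pipeline): correlate samples on jointly observed features, then cluster the correlation structure.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    rng = np.random.default_rng(0)
    # Fabricated "signatures": 500 features x 12 samples, two cell types,
    # with a few missing values, mimicking quantitative DIA-TMT profiles.
    type_a = rng.normal(10, 2, size=(500, 1)) + rng.normal(0, 0.3, size=(500, 6))
    type_b = rng.normal(10, 2, size=(500, 1)) + rng.normal(0, 0.3, size=(500, 6))
    signatures = np.hstack([type_a, type_b])
    signatures[rng.random(signatures.shape) < 0.05] = np.nan

    # Identification-free comparison: correlate each pair of samples on
    # jointly observed features only.
    n = signatures.shape[1]
    corr = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            ok = ~np.isnan(signatures[:, i]) & ~np.isnan(signatures[:, j])
            r = np.corrcoef(signatures[ok, i], signatures[ok, j])[0, 1]
            corr[i, j] = corr[j, i] = r

    dist = squareform(1 - corr, checks=False)
    labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
    print(labels)  # the first six and last six samples should separate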


2020 ◽  
Author(s):  
Soroor Hediyeh-zadeh ◽  
Andrew I. Webb ◽  
Melissa J. Davis

Abstract
Recent developments in mass spectrometry (MS) instruments and data acquisition modes have aided multiplexed, fast, reproducible and quantitative analysis of proteome profiles, yet missing values remain a formidable challenge for proteomics data analysis. The stochastic nature of sampling in Data Dependent Acquisition (DDA), suboptimal preprocessing of Data Independent Acquisition (DIA) runs and the dynamic range limitation of MS instruments impede the reproducibility and accuracy of peptide quantification and can introduce systematic patterns of missingness that impact downstream analyses. Imputation of missing values thus becomes an important element of data analysis. We introduce msImpute, an imputation method based on low-rank approximation, and compare it to six alternative imputation methods using public DDA and DIA datasets. We evaluate the performance of methods by determining the error of imputed values and the accuracy of detection of differential expression. We also measure the post-imputation preservation of structures in the data at different levels of granularity. We develop a visual diagnostic to determine the nature of missingness in datasets based on peptides with high biological dropout rates, and introduce a method to identify such peptides. Our findings demonstrate that msImpute performs well when data are missing at random, and highlight the importance of prior knowledge about the nature of missing values in a dataset when selecting an imputation technique.
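msImpute is based on low-rank approximation; here is a generic, soft-impute-style NumPy sketch of that family of methods (an illustration of the principle only, not msImpute's implementation, which is an R/Bioconductor package): fill missing entries, project onto a truncated SVD, and iterate, overwriting only the missing cells.

    import numpy as np

    def lowrank_impute(X, rank=2, n_iter=50):
        """Iterative low-rank (truncated SVD) imputation of missing entries."""
        mask = np.isnan(X)
        filled = np.where(mask, np.nanmean(X, axis=1, keepdims=True), X)
        for _ in range(n_iter):
            U, s, Vt = np.linalg.svd(filled, full_matrices=False)
            approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
            filled = np.where(mask, approx, X)  # only overwrite missing cells
        return filled

    rng = np.random.default_rng(0)
    true = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 20))  # rank-2 matrix
    X = true.copy()
    X[rng.random(X.shape) < 0.2] = np.nan                        # MAR missingness
    imputed = lowrank_impute(X)
    err = np.abs(imputed - true)[np.isnan(X)].mean()
    print(f"mean absolute imputation error: {err:.3f}")

When data are missing at random, the low-rank structure of the observed entries carries over to the missing ones, which is the regime in which the abstract reports msImpute performing well.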


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Yoric Gagnebin ◽  
David A. Jaques ◽  
Serge Rudaz ◽  
Sophie de Seigneux ◽  
Julien Boccard ◽  
...  

Abstract
Chronic kidney disease (CKD) is characterized by the retention of uremic solutes. Compared to patients with non-dialysis-dependent CKD, those requiring haemodialysis (HD) have increased morbidity and mortality. We wished to characterise metabolic patterns in CKD compared to HD patients using metabolomics. Prevalent non-HD CKD KDIGO stage 3b–4 and stage 5 HD outpatients were screened at a single tertiary hospital. Various liquid chromatography approaches hyphenated with mass spectrometry were used to identify 278 metabolites. Unsupervised and supervised data analyses were conducted to characterize metabolic patterns. In total, 69 patients were included in the CKD group and 35 in the HD group. Unsupervised data analysis showed clear clustering of CKD, pre-dialysis (preHD) and post-dialysis (postHD) patients. Supervised data analysis revealed qualitative as well as quantitative differences in individual metabolite profiles between the CKD, preHD and postHD states. An original metabolomics framework could discriminate between CKD stages and highlight the effect of HD based on 278 identified metabolites. Significant differences in metabolic patterns between CKD and HD patients were found overall as well as for specific metabolites. These findings could explain clinical discrepancies between patients requiring HD and those with an earlier stage of CKD.
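As a rough illustration of the unsupervised step, here is a short sklearn sketch: PCA of a scaled samples-by-metabolites matrix with three group labels (all numbers are simulated stand-ins for the CKD, preHD and postHD profiles; the study's actual data are not reproduced here).

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Toy metabolite intensity matrix: samples x 278 metabolites, with
    # group-specific offsets standing in for the three clinical states.
    groups = np.repeat(["CKD", "preHD", "postHD"], [69, 35, 35])
    offsets = {"CKD": 0.0, "preHD": 1.0, "postHD": -1.0}
    X = np.vstack([rng.normal(offsets[g], 1, size=278) for g in groups])

    # Unsupervised step: scale, then project onto principal components.
    scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
    for g in ("CKD", "preHD", "postHD"):
        centroid = scores[groups == g].mean(axis=0)
        print(g, np.round(centroid, 2))  # distinct centroids ~ clear clustering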

