Clustering trees: a visualisation for evaluating clusterings at multiple resolutions

2018 ◽  
Author(s):  
Luke Zappia ◽  
Alicia Oshlack

Abstract: Clustering techniques are widely used in the analysis of large data sets to group together samples with similar properties. For example, clustering is often used in single-cell RNA-sequencing to identify the different cell types present in a tissue sample. There are many algorithms for performing clustering and the results can vary substantially. In particular, the number of groups present in a data set is often unknown and the number of clusters identified by an algorithm can change based on the parameters used. To explore and examine the impact of varying clustering resolution we present clustering trees. This visualisation shows the relationships between clusters at multiple resolutions, allowing researchers to see how samples move as the number of clusters increases. In addition, meta-information can be overlaid on the tree to inform the choice of resolution and guide the identification of clusters. We illustrate the features of clustering trees using a series of simulations as well as two real examples, the classical iris dataset and a complex single-cell RNA-sequencing dataset. Clustering trees can be produced using the clustree R package available from CRAN (https://CRAN.R-project.org/package=clustree) and developed on GitHub (https://github.com/lazappi/clustree).
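To make the idea of samples "moving" between resolutions concrete, here is a minimal Python sketch of the bookkeeping behind a single layer of such a tree. It is not the clustree implementation (which is an R package). The function name `clustering_tree_edges` and the `in_prop` statistic (the share of a higher-resolution cluster that arrives along a given edge) are assumptions chosen for illustration.

```python
from collections import Counter


def clustering_tree_edges(low_res, high_res):
    """Summarise how samples flow between two adjacent clustering resolutions.

    low_res and high_res are equal-length sequences: element i is the cluster
    label of sample i at the lower and at the higher resolution respectively.
    Returns one dict per (low cluster, high cluster) pair that shares samples.
    """
    if len(low_res) != len(high_res):
        raise ValueError("both clusterings must label the same samples")

    # Number of samples travelling along each edge of the tree.
    edge_counts = Counter(zip(low_res, high_res))
    # Size of each cluster at the higher resolution.
    high_sizes = Counter(high_res)

    edges = []
    for (src, dst), count in sorted(edge_counts.items()):
        edges.append({
            "from": src,
            "to": dst,
            "count": count,
            # Hypothetical edge statistic: fraction of the target cluster
            # that comes from this source cluster.
            "in_prop": count / high_sizes[dst],
        })
    return edges


# Toy labels for six samples clustered with k = 2 and then k = 3.
k2 = ["A", "A", "A", "B", "B", "B"]
k3 = ["a", "a", "c", "b", "b", "c"]
for edge in clustering_tree_edges(k2, k3):
    print(edge)
```

In this toy case cluster `c` draws half of its samples from each parent, the kind of pattern that may indicate an unstable split. Repeating the calculation for every pair of adjacent resolutions would give the full set of edges for a tree.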

2021 ◽  
Author(s):  
Elnaz Mirzaei Mehrabad ◽  
Aditya Bhaskara ◽  
Benjamin T. Spike

Abstract Motivation: Single-cell RNA sequencing (scRNA-seq) is a powerful gene expression profiling technique that is presently revolutionizing the study of complex cellular systems in the biological sciences. Existing single-cell RNA-sequencing methods suffer from sub-optimal target recovery leading to inaccurate measurements including many false negatives. The resulting ‘zero-inflated’ data may confound data interpretation and visualization. Results: Since cells have coherent phenotypes defined by conserved molecular circuitries (i.e. multiple gene products working together) and since similar cells utilize similar circuits, information about each expression value or ‘node’ in a multi-cell, multi-gene scRNA-seq data set is expected to also be predictable from other nodes in the data set. Based on this logic, several approaches have been proposed to impute missing values by extracting information from non-zero measurements in a data set. In this study, we applied non-negative matrix factorization approaches to a selection of published scRNA-seq data sets to recommend new values where original measurements are likely to be inaccurate and where ‘zero’ measurements are predicted to be false negatives. The resulting imputed data model predicts novel cell type markers and expression patterns more closely matching gene expression values from orthogonal measurements and/or predicted literature than the values obtained from other previously published imputation methods. Contact: [email protected]. Availability and implementation: FIESTA is written in R and is available at https://github.com/elnazmirzaei/FIESTA and https://github.com/TheSpikeLab/FIESTA.


2019 ◽  
Author(s):  
Imad Abugessaisa ◽  
Shuhei Noguchi ◽  
Melissa Cardon ◽  
Akira Hasegawa ◽  
Kazuhide Watanabe ◽  
...  

Abstract: Analysis and interpretation of single-cell RNA-sequencing (scRNA-seq) experiments are compromised by the presence of poor-quality cells. For meaningful analyses, such poor-quality cells should be excluded to avoid biases and large variation. However, no clear guidelines exist. We introduce SkewC, a novel quality-assessment method to identify poor-quality single cells in scRNA-seq experiments. The method is based on the assessment of gene coverage for each single cell and its skewness as a quality measure. To validate the method, we investigated the impact of poor-quality cells on downstream analyses and compared biological differences between typical and poor-quality cells. Moreover, we measured the ratio of intergenic expression, suggesting genomic contamination, and foreign organism contamination of single-cell samples. SkewC was tested on 37,993 single cells generated by 15 scRNA-seq protocols. We envision SkewC as an indispensable QC method to be incorporated into scRNA-seq experiments to preclude the possibility of scRNA-seq data misinterpretation.


2019 ◽  
Author(s):  
Katelyn Donahue ◽  
Yaqing Zhang ◽  
Veerin Sirihorachai ◽  
Stephanie The ◽  
Arvind Rao ◽  
...  

2019 ◽  
Author(s):  
Daniel Osorio ◽  
Xue Yu ◽  
Peng Yu ◽  
Erchin Serpedin ◽  
James J. Cai

Abstract: In biomedical research, lymphoblastoid cell lines (LCLs), often established by in vitro infection of resting B cells with Epstein-Barr virus, are commonly used as surrogates for peripheral blood lymphocytes. Genomic and transcriptomic information on LCLs has been used to study the impact of genetic variation on gene expression in humans. Here we present single-cell RNA sequencing (scRNA-seq) data on GM12878 and GM18502, two LCLs derived from the blood of female donors of European and African ancestry, respectively. Cells from three samples (the two LCLs and a 1:1 mixture of the two) were prepared separately using a 10X Genomics Chromium Controller and deeply sequenced. The final dataset contained 7,045 cells from GM12878, 5,189 from GM18502, and 5,820 from the mixture, offering valuable information on single-cell gene expression in highly homogenous cell populations. This dataset is a suitable reference for population differentiation in gene expression at the single-cell level. Data from the mixture provide additional valuable information facilitating the development of statistical methods for data normalization and batch effect correction.


2021 ◽  
Author(s):  
Ariel A. Hippen ◽  
Matias M. Falco ◽  
Lukas M. Weber ◽  
Erdogan Pekcan Erkan ◽  
Kaiyang Zhang ◽  
...  

Abstract Motivation: Single-cell RNA-sequencing (scRNA-seq) has made it possible to profile gene expression in tissues at high resolution. An important preprocessing step prior to performing downstream analyses is to identify and remove cells with poor or degraded sample quality using quality control (QC) metrics. Two widely used QC metrics to identify a ‘low-quality’ cell are (i) if the cell includes a high proportion of reads that map to mitochondrial DNA (mtDNA) encoded genes and (ii) if a small number of genes are detected. Current best practices use these QC metrics independently with either arbitrary, uniform thresholds (e.g. 5%) or biological context-dependent (e.g. species) thresholds, and fail to jointly model these metrics in a data-driven manner. Current practices are often overly stringent and especially untenable on lower-quality tissues, such as archived tumor tissues. Results: We propose a data-driven QC metric (miQC) that jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to predict the low-quality cells in a given dataset. We demonstrate how our QC metric easily adapts to different types of single-cell datasets to remove low-quality cells while preserving high-quality cells that can be used for downstream analyses. Availability: Software available at https://github.com/greenelab/miQC. The code used to download datasets, perform the analyses, and reproduce the figures is available at https://github.com/greenelab/mito-filtering. Contact: Stephanie C. Hicks ([email protected]) and Anna Vähärautio ([email protected])
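For context, the conventional practice this abstract argues against can be sketched in a few lines of Python: each metric is checked independently against a fixed cutoff. This is a baseline illustration only, not miQC's mixture model. The function name `flag_low_quality`, the dictionary keys, and the 200-gene cutoff are assumptions; only the 5% mitochondrial threshold is taken from the text.

```python
def flag_low_quality(cells, max_mito_fraction=0.05, min_genes=200):
    """Flag cells using two independent, uniform QC thresholds.

    cells is a list of dicts with the (hypothetical) keys:
      'mito_fraction' - fraction of reads mapping to mtDNA-encoded genes
      'n_genes'       - number of genes detected in the cell
    Returns a list of booleans, True meaning the cell would be removed.
    """
    flags = []
    for cell in cells:
        # Metric (i): too many mitochondrial reads (5% cutoff from the text).
        high_mito = cell["mito_fraction"] > max_mito_fraction
        # Metric (ii): too few detected genes (cutoff value is an assumption).
        few_genes = cell["n_genes"] < min_genes
        # The metrics are applied independently: failing either removes the cell.
        flags.append(high_mito or few_genes)
    return flags


# Three toy cells: healthy, high mitochondrial fraction, few detected genes.
example_cells = [
    {"mito_fraction": 0.02, "n_genes": 1500},
    {"mito_fraction": 0.12, "n_genes": 1800},
    {"mito_fraction": 0.03, "n_genes": 90},
]
print(flag_low_quality(example_cells))
```

The second toy cell is discarded on mitochondrial fraction alone even though many genes were detected in it. A joint, probabilistic model of the two metrics, as the abstract proposes, is meant to handle such cases less bluntly.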


Author(s):  
Wesley T Abplanalp ◽  
David John ◽  
Sebastian Cremer ◽  
Birgit Assmus ◽  
Lena Dorsheimer ◽  
...  

Abstract Aims: Identification of signatures of immune cells at single-cell level may provide novel insights into changes of immune-related disorders. Therefore, we used single-cell RNA-sequencing to determine the impact of heart failure on circulating immune cells. Methods and results: We demonstrate a significant change in monocyte to T-cell ratio in patients with heart failure, compared to healthy subjects, which was validated by flow cytometry analysis. Subclustering of monocytes and stratification of the clusters according to relative CD14 and FCGR3A (CD16) expression allowed annotation of classical, intermediate, and non-classical monocytes. Heart failure had a specific impact on the gene expression patterns in these subpopulations. Metabolically active genes such as FABP5 were highly enriched in classical monocytes of heart failure patients, whereas β-catenin expression was significantly higher in intermediate monocytes. The selective regulation of signatures in the monocyte subpopulations was validated by classical and multifactor dimensionality reduction flow cytometry analyses. Conclusion: Together, this study shows that circulating cells derived from patients with heart failure have altered phenotypes. These data provide a rich source for identification of signatures of immune cells in heart failure compared to healthy subjects. The observed increase in FABP5 and signatures of Wnt signalling may contribute to enhanced monocyte activation.


2020 ◽  
Author(s):  
Wanqiu Chen ◽  
Yongmei Zhao ◽  
Xin Chen ◽  
Xiaojiang Xu ◽  
Zhaowei Yang ◽  
...  

Abstract: Single-cell RNA sequencing (scRNA-seq) has become a very powerful technology for biomedical research and is becoming much more affordable as methods continue to evolve, but it is unknown how reproducible different platforms are using different bioinformatics pipelines, particularly the recently developed scRNA-seq batch correction algorithms. We carried out a comprehensive multi-center cross-platform comparison on different scRNA-seq platforms using standard reference samples. We compared six pre-processing pipelines, seven bioinformatics normalization procedures, and seven batch effect correction methods (CCA, MNN, Scanorama, BBKNN, Harmony, limma and ComBat) to evaluate the performance and reproducibility of 20 scRNA-seq data sets derived from four different platforms and centers. We benchmarked scRNA-seq performance across different platforms and testing sites using global gene expression profiles as well as some cell-type specific marker genes. We showed that there were large batch effects, and that the reproducibility of scRNA-seq across platforms was dictated both by the expression level of genes selected and the batch correction methods used. We found that CCA, MNN, and BBKNN all corrected the batch variations fairly well for the scRNA-seq data derived from biologically similar samples across platforms/sites. However, for the scRNA-seq data derived from or consisting of biologically distinct samples, limma and ComBat failed to correct batch effects, whereas CCA over-corrected the batch effect and misclassified the cell types and samples. In contrast, MNN, Harmony and BBKNN separated biologically different samples/cell types into correspondingly distinct dimensional subspaces; however, consistent with this algorithm’s logic, MNN required that the samples evaluated each contain a shared portion of highly similar cells. In summary, we found high cross-platform consistency in separating two distinct samples when an appropriate batch correction method was used. We hope this large cross-platform/site scRNA-seq data set will provide a valuable resource, and that our findings will offer useful advice for the single-cell sequencing community.


2021 ◽  
Author(s):  
Michael E Nelson ◽  
Simone G Riva ◽  
Ann Cvejic

Spatial transcriptomics is revolutionising the study of single-cell RNA and tissue-wide cell heterogeneity, but few robust methods exist for connecting spatially resolved cells to the so-called marker genes from single-cell RNA sequencing, which generate much of the insight gleaned from spatial methods. Here we present SMaSH, a general computational framework for extracting key marker genes from single-cell RNA sequencing data for spatial transcriptomics approaches. SMaSH extracts robust and biologically well-motivated marker genes, which characterise the given data-set better than existing and limited computational approaches for global marker gene calculation.


2019 ◽  
Author(s):  
Haruka Ozaki ◽  
Tetsutaro Hayashi ◽  
Mana Umeda ◽  
Itoshi Nikaido

Abstract Background: Read coverage of RNA sequencing data reflects gene expression and RNA processing events. Single-cell RNA sequencing (scRNA-seq) methods, particularly “full-length” ones, provide read coverage of many individual cells and have the potential to reveal cellular heterogeneity in RNA transcription and processing. However, visualization tools suited to highlighting cell-to-cell heterogeneity in read coverage are still lacking. Results: Here, we have developed Millefy, a tool for visualizing read coverage of scRNA-seq data in genomic contexts. Millefy is designed to show read coverage of all individual cells at once in genomic contexts and to highlight cell-to-cell heterogeneity in read coverage. By visualizing read coverage of all cells as a heat map and dynamically reordering cells based on diffusion maps, Millefy facilitates discovery of “local” region-specific, cell-to-cell heterogeneity in read coverage, including variability of transcribed regions. Conclusions: Millefy simplifies the examination of cellular heterogeneity in RNA transcription and processing events using scRNA-seq data. Millefy is available as an R package (https://github.com/yuifu/millefy) and a Docker image to help use Millefy on the Jupyter notebook (https://hub.docker.com/r/yuifu/datascience-notebook-millefy).


Author(s):  
Wenhui Xie ◽  
Yilang Ke ◽  
Qinyi You ◽  
Jing Li ◽  
Lu Chen ◽  
...  

Objective: The impact of vascular aging on cardiovascular diseases has been extensively studied; however, little is known regarding the cellular and molecular mechanisms underlying age-related vascular aging in aortic cellular subpopulations. Approach and Results: Transcriptomes and transposase-accessible chromatin profiles from the aortas of 4-, 26-, and 86-week-old C57/BL6J mice were analyzed using single-cell RNA sequencing and assay for transposase-accessible chromatin sequencing. By integrating the heterogeneous transcriptome and chromatin accessibility data, we identified cell-specific TF (transcription factor) regulatory networks and open chromatin states. We also determined that aortic aging affects cell interactions, inflammation, cell type composition, dysregulation of transcriptional control, and chromatin accessibility. Endothelial cells 1 have higher gene set activity related to cellular senescence and aging than do endothelial cells 2. Moreover, construction of senescence trajectories shows that endothelial cell 1 and fibroblast senescence is associated with distinct TF open chromatin states and an mRNA expression model. Conclusions: Our data provide a system-wide model for transcriptional and epigenetic regulation during aortic aging at single-cell resolution.

