ELeFHAnt: A supervised machine learning approach for label harmonization and annotation of single cell RNA-seq data

2021 ◽  
Author(s):  
Konrad Thorner ◽  
Aaron M. Zorn ◽  
Praneet Chaturvedi

Abstract Annotation of single cells has become an important step in the single cell analysis framework. With advances in sequencing technology, thousands to millions of cells can be processed to understand the intricacies of the biological system in question. Annotation through manual curation of markers based on a priori knowledge is cumbersome given this exponential growth. There are currently ~200 computational tools available to help researchers automatically annotate single cells using supervised/unsupervised machine learning, cell type markers, or tissue-based markers from bulk RNA-seq. But with the expansion of publicly available data, there is also a need for a tool that can help integrate multiple references into a unified atlas and show how annotations compare between datasets. Here we present ELeFHAnt: Ensemble learning for harmonization and annotation of single cells. ELeFHAnt is an easy-to-use R package that employs support vector machine and random forest algorithms together to perform three main functions: 1) CelltypeAnnotation, 2) LabelHarmonization and 3) DeduceRelationship. CelltypeAnnotation annotates cells in a query Seurat object using a reference Seurat object with annotated cell types. LabelHarmonization integrates multiple cell atlases (references) into a unified cellular atlas with harmonized cell types. Finally, DeduceRelationship compares cell types between two scRNA-seq datasets. ELeFHAnt can be accessed from GitHub at https://github.com/praneet1988/ELeFHAnt.
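The abstract says that a support vector machine and a random forest are used together, but it does not say how their outputs are combined. The sketch below assumes soft voting (a weighted average of per-class probabilities), which is one common ensemble rule and not necessarily what ELeFHAnt does. The function name `soft_vote`, the weights and the example probabilities are all hypothetical.

```python
def soft_vote(prob_a, prob_b, weight_a=0.5, weight_b=0.5):
    """Combine two classifiers' class probabilities for one cell.

    prob_a, prob_b: dicts mapping cell type label -> predicted probability.
    Assumption: a weighted average is used; the source does not state the rule.
    """
    labels = set(prob_a) | set(prob_b)
    combined = {
        label: weight_a * prob_a.get(label, 0.0) + weight_b * prob_b.get(label, 0.0)
        for label in labels
    }
    # Sorting first makes ties resolve to the alphabetically first label.
    best = max(sorted(combined), key=combined.get)
    return best, combined


# Hypothetical outputs for a single query cell.
svm_probs = {"T cell": 0.6, "B cell": 0.4}
rf_probs = {"T cell": 0.3, "B cell": 0.7}
label, scores = soft_vote(svm_probs, rf_probs)
```

With equal weights the two classifiers disagree here and the averaged score decides; unequal weights (for example, derived from each classifier's held-out accuracy) would let the more reliable model dominate.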

2018 ◽  
Author(s):  
Changlin Wan ◽  
Wennan Chang ◽  
Yu Zhang ◽  
Fenil Shah ◽  
Xiaoyu Lu ◽  
...  

Abstract A key challenge in modeling single-cell RNA-seq (scRNA-seq) data is to capture the diverse gene expression states regulated by different transcriptional regulatory inputs across single cells, which is further complicated by a large number of observed zero and low expression values. We developed a left truncated mixture Gaussian (LTMG) model that stems from the kinetic relationships between the transcriptional regulatory inputs and metabolism of mRNA and gene expression abundance in a cell. LTMG infers the expression multi-modalities across single cells, representing a gene’s diverse expression states; meanwhile, the dropouts and low expressions are treated as left truncated, specifically representing an expression state that is under suppression. We demonstrated that LTMG has significantly better goodness of fit on an extensive number of single-cell data sets, compared with three other state-of-the-art models. In addition, our systems kinetic approach of handling the low and zero expressions and the correctness of the identified multimodality are validated on several independent experimental data sets. Application to data from complex tissues demonstrated the capability of LTMG in extracting varied expression states specific to cell types or cell functions. Based on LTMG, a differential gene expression test and a co-regulation module identification method, namely LTMG-DGE and LTMG-GCR, are further developed. We experimentally validated that LTMG-DGE has higher sensitivity and specificity in detecting differentially expressed genes, compared with five other popular methods, and that LTMG-GCR is capable of retrieving the gene co-regulation modules corresponding to perturbed transcriptional regulations. A user-friendly R package with all the analysis power is available at https://github.com/zy26/LTMGSCA.
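To make the idea of treating low and zero values as left truncated concrete, the following sketch evaluates a log-likelihood for a Gaussian mixture in which every value at or below a cutoff contributes only the probability mass below that cutoff. This is a generic left-censored formulation written under that assumption; the abstract does not give the exact likelihood used by LTMG, and the function and parameter names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def left_truncated_mixture_loglik(values, cutoff, weights, means, sigmas):
    """Log-likelihood of (log-scale) expression values under a Gaussian mixture.

    Values at or below `cutoff` are not trusted individually: each one
    contributes the mixture's total mass below the cutoff (assumed form).
    Values above the cutoff contribute the usual mixture density.
    """
    params = list(zip(weights, means, sigmas))
    total = 0.0
    for x in values:
        if x <= cutoff:
            lik = sum(w * normal_cdf(cutoff, m, s) for w, m, s in params)
        else:
            lik = sum(w * normal_pdf(x, m, s) for w, m, s in params)
        total += math.log(lik)
    return total


# Hypothetical two-state gene: a suppressed state near -1 and an active state near 3.
loglik = left_truncated_mixture_loglik(
    values=[-2.0, -2.0, 2.5, 3.1, 3.4],
    cutoff=0.0,
    weights=[0.4, 0.6],
    means=[-1.0, 3.0],
    sigmas=[1.0, 0.5],
)
```

Because censored values enter only through the cutoff, their exact recorded magnitude does not affect the fit, which is the practical point of treating them as truncated.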


2021 ◽  
Author(s):  
Diogo M Ribeiro ◽  
Chaymae Ziyani ◽  
Olivier Delaneau

Most human genes are co-expressed with a nearby gene. Yet, previous studies only reported this extensive local gene co-expression using bulk RNA-seq. Here, we leverage single cell datasets in >85 individuals to identify gene co-expression across cells, unbiased by cell type heterogeneity and benefiting from the co-occurrence of transcription events in single cells. We discover thousands of co-expressed genes in two cell types and (i) compare single cell to bulk RNA-seq in identifying local gene co-expression, (ii) show that many co-expressed genes – but not the majority – are composed of functionally-related genes and (iii) provide evidence that these genes are transcribed synchronously and their co-expression is maintained up to the protein level. Finally, we identify gene-enhancer associations using multimodal single cell data, which reveal that >95% of co-expressed gene pairs share regulatory elements. Our in-depth view of local gene co-expression and regulatory element co-activity advances our understanding of the shared regulatory architecture between genes.


2019 ◽  
Vol 47 (18) ◽  
pp. e111-e111 ◽  
Author(s):  
Changlin Wan ◽  
Wennan Chang ◽  
Yu Zhang ◽  
Fenil Shah ◽  
Xiaoyu Lu ◽  
...  

Abstract A key challenge in modeling single-cell RNA-seq data is to capture the diversity of gene expression states regulated by different transcriptional regulatory inputs across individual cells, which is further complicated by the large number of observed zero and low expression values. We developed a left truncated mixture Gaussian (LTMG) model from the kinetic relationships of the transcriptional regulatory inputs, mRNA metabolism and abundance in single cells. LTMG infers the expression multi-modalities across single cells; meanwhile, the dropouts and low expressions are treated as left truncated. We demonstrated that LTMG has significantly better goodness of fit on an extensive number of scRNA-seq data sets, compared with three other state-of-the-art models. Our biological assumption about the low non-zero expressions, the rationality of the multimodality setting, and the capability of LTMG in extracting expression states specific to cell types or functions are validated on independent experimental data sets. A differential gene expression test and a co-regulation module identification method are further developed. We experimentally validated that our differential expression test has higher sensitivity and specificity, compared with five other popular methods. The co-regulation analysis is capable of retrieving gene co-regulation modules corresponding to perturbed transcriptional regulations. A user-friendly R package with all the analysis power is available at https://github.com/zy26/LTMGSCA.


2019 ◽  
Author(s):  
Marcus Alvarez ◽  
Elior Rahmani ◽  
Brandon Jew ◽  
Kristina M. Garske ◽  
Zong Miao ◽  
...  

Abstract Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. In contrast to single-cell RNA-seq (scRNA-seq), we observe that snRNA-seq is commonly subject to contamination by high amounts of extranuclear background RNA, which can lead to identification of spurious cell types in downstream clustering analyses if overlooked. We present a novel approach to remove debris-contaminated droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distributions of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: 1) human differentiating preadipocytes in vitro, 2) fresh mouse brain tissue, and 3) human frozen adipose tissue (AT) from six individuals. All three data sets showed various degrees of extranuclear RNA contamination. We observed that existing methods failed to account for contaminated droplets and led to spurious cell types. When compared to filtering using these state-of-the-art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq data, we also successfully applied DIEM to single-cell data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data quickly and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem.
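As an illustration of the likelihood-based idea, the sketch below computes the E-step of a mixture model for one droplet: given a gene-probability profile for each component (debris or a cell type) and mixing proportions, it returns the posterior probability that the droplet belongs to each component. It assumes multinomial components with strictly positive gene probabilities; the abstract does not specify DIEM's distributions or update rules, so the model form, the function names and the numbers are hypothetical.

```python
import math


def multinomial_log_lik(counts, gene_probs):
    """Log-likelihood of a droplet's gene counts under one component.

    The multinomial coefficient is omitted because it is the same for every
    component and cancels in the posterior. Assumes every gene_probs entry > 0.
    """
    return sum(c * math.log(p) for c, p in zip(counts, gene_probs) if c > 0)


def e_step_posteriors(counts, components, mixing):
    """Posterior membership probabilities for one droplet (E-step)."""
    log_joint = [
        math.log(w) + multinomial_log_lik(counts, probs)
        for w, probs in zip(mixing, components)
    ]
    top = max(log_joint)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical two-gene example: component 0 is "debris", component 1 a cell type.
debris_profile = [0.5, 0.5]
cell_profile = [0.9, 0.1]
posteriors = e_step_posteriors([9, 1], [debris_profile, cell_profile], [0.5, 0.5])
```

In a full EM loop these posteriors would be used to re-estimate the profiles and mixing proportions, and droplets with a high debris posterior would be candidates for removal.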


2020 ◽  
Author(s):  
Etienne Becht ◽  
Daniel Tolstrup ◽  
Charles-Antoine Dutertre ◽  
Florent Ginhoux ◽  
Evan W. Newell ◽  
...  

Abstract Modern immunologic research increasingly requires high-dimensional analyses in order to understand the complex milieu of cell types that comprise the tissue microenvironments of disease. To achieve this, we developed Infinity Flow, which combines hundreds of overlapping flow cytometry panels using machine learning to enable the simultaneous analysis of the co-expression patterns of hundreds of surface-expressed proteins across millions of individual cells. In this study, we demonstrate that this approach allows the comprehensive analysis of the cellular constituency of the steady-state murine lung and the identification of novel cellular heterogeneity in the lungs of melanoma metastasis-bearing mice. We show that by using supervised machine learning, Infinity Flow enhances the accuracy and depth of clustering or dimensionality reduction algorithms. Infinity Flow is a highly scalable, low-cost and accessible solution to single cell proteomics in complex tissues.


2017 ◽  
Author(s):  
Zhun Miao ◽  
Ke Deng ◽  
Xiaowo Wang ◽  
Xuegong Zhang

Abstract Summary The excess zeros in single-cell RNA-seq data include “real” zeros due to the on-off nature of gene transcription in single cells and “dropout” zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package, DEsingle, which employs a zero-inflated negative binomial model to estimate the proportion of real and dropout zeros and to define and detect 3 types of DE genes in single-cell RNA-seq data with higher accuracy. Availability and implementation The R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor’s consideration [email protected] Supplementary information Supplementary data are available at bioRxiv online.
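For readers unfamiliar with the model family, here is a minimal sketch of a zero-inflated negative binomial (ZINB) probability mass function, using the mean and size (dispersion) parameterization of the negative binomial. It also reports how much of the expected zero mass comes from the zero-inflation component rather than from the negative binomial itself. The parameter names are hypothetical, and how DEsingle maps these two sources onto "real" and "dropout" zeros is not reproduced here.

```python
import math


def nb_pmf(k, size, mean):
    """Negative binomial pmf with dispersion `size` > 0 and `mean` > 0."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )
    return math.exp(log_p)


def zinb_pmf(k, theta, size, mean):
    """ZINB pmf: with probability theta the count is a structural zero,
    otherwise it is drawn from the negative binomial."""
    base = (1.0 - theta) * nb_pmf(k, size, mean)
    return theta + base if k == 0 else base


def zero_inflation_share(theta, size, mean):
    """Fraction of the total zero probability due to the inflation component."""
    return theta / zinb_pmf(0, theta, size, mean)


# Hypothetical gene: 30% structural zeros, NB mean 5 with dispersion 2.
p_zero = zinb_pmf(0, theta=0.3, size=2.0, mean=5.0)
share = zero_inflation_share(theta=0.3, size=2.0, mean=5.0)
```

Comparing the fitted `theta`, `size` and `mean` between two groups of cells is what allows a ZINB-based test to distinguish a change in the proportion of zeros from a change in expression level among expressing cells.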


2020 ◽  
Vol 52 (10) ◽  
pp. 468-477
Author(s):  
Alexander C. Zambon ◽  
Tom Hsu ◽  
Seunghee Erin Kim ◽  
Miranda Klinck ◽  
Jennifer Stowe ◽  
...  

Much of our understanding of the regulatory mechanisms governing the cell cycle in mammals has relied heavily on methods that measure the aggregate state of a population of cells. While instrumental in shaping our current understanding of cell proliferation, these approaches mask the genetic signatures of rare subpopulations such as quiescent (G0) and very slowly dividing (SD) cells. Results described in this study and those of others using single-cell analysis reveal that even in clonally derived immortalized cancer cells, ∼1–5% of cells can exhibit G0 and SD phenotypes. Therefore, to enable the study of these rare cell phenotypes, we established an integrated molecular, computational, and imaging approach to track, isolate, and genetically perturb single cells as they proliferate. A genetically encoded cell-cycle reporter (K67p-FUCCI) was used to track single cells as they traversed the cell cycle. A set of R scripts was written to quantify K67p-FUCCI over time. To enable further study of G0 and SD phenotypes, we retrofitted a live cell imaging system with a micromanipulator to enable single-cell targeting for functional validation studies. Single-cell analysis revealed that HT1080 and MCF7 cells had doubling times of ∼24 and ∼48 h, respectively, with high duration variability in the G1 and G2 phases. Direct single-cell microinjection of mRNA encoding green fluorescent protein (GFP) achieved detectable GFP fluorescence within ∼5 h in both cell types. These findings, coupled with the possibility of targeting several hundred single cells, improve throughput and sensitivity over conventional methods for studying rare cell subpopulations.


Author(s):  
Massimo Andreatta ◽  
Santiago J Carmona

Abstract Summary STACAS is a computational method for the identification of integration anchors in the Seurat environment, optimized for the integration of single-cell (sc) RNA-seq datasets that share only a subset of cell types. We demonstrate that by (i) correcting batch effects while preserving relevant biological variability across datasets, (ii) filtering aberrant integration anchors with a quantitative distance measure and (iii) constructing optimal guide trees for integration, STACAS can accurately align scRNA-seq datasets composed of only partially overlapping cell populations. Availability and implementation Source code and R package available at https://github.com/carmonalab/STACAS; Docker image available at https://hub.docker.com/repository/docker/mandrea1/stacas_demo.


2020 ◽  
Author(s):  
Siamak Yousefi ◽  
Hao Chen ◽  
Jesse F. Ingels ◽  
Melinda S. McCarty ◽  
Arthur G. Centeno ◽  
...  

Summary Single cell RNA sequencing has enabled quantification of single cells and identification of different cell types and subtypes as well as cell functions in different tissues. Single cell RNA sequence analyses assume that acquired RNAs correspond to cells; however, RNAs from contamination within the input data are also captured by these assays. The sequencing of background contamination, as well as unwanted cells making their way into the final assay, can confound the correct biological interpretation of single cell transcriptomic data. Here we demonstrate two approaches to deal with background contamination as well as profiling of unwanted cells in the assays. We use three real-life datasets of whole-cell capture and nucleotide single-cell captures generated by Fluidigm and 10x technologies and show that these methods reduce the effect of contamination, strengthen clustering of cells and improve biological interpretation.


2020 ◽  
Author(s):  
Jeremy Lombardo ◽  
Marzieh Aliaghaei ◽  
Quy Nguyen ◽  
Kai Kessenbrock ◽  
Jered Haun

Abstract Tissues are composed of highly heterogeneous mixtures of cell subtypes, and this diversity is increasingly being characterized using high-throughput single cell analysis methods. However, these efforts are hindered by the fact that tissues must first be dissociated into single cell suspensions that are viable and still accurately represent phenotypes from the original tissue. Current methods for breaking down tissues are inefficient, labor-intensive, subject to high variability, and potentially biased towards cell subtypes that are easier to release. Here, we present a microfluidic platform consisting of three different tissue processing technologies that can perform the complete tissue to single cell workflow, including digestion, disaggregation, and filtration. First, we developed a new microfluidic digestion device that can be loaded with minced tissue specimens quickly and easily, and that then uses the combination of proteolytic enzyme activity and fluid shear forces to accelerate tissue breakdown. Next, we integrated dissociation and filter technologies into a single device, which enhanced single cell numbers and fully prepared the sample for single cell analysis. The final multi-device platform was then evaluated using a diverse array of tissue types that exhibited a wide range of properties. For murine kidney and mammary tumor, we found that microfluidic processing produced 2.5-fold more single, viable cells. Single cell RNA sequencing (scRNA-seq) further revealed that device processing enriched for endothelial cells, fibroblasts, and basal epithelium, and did not increase stress responses. For murine liver and heart, which are softer tissues containing fragile cell types, processing time could be reduced to 15 min, and even as short as 1 min. We also demonstrated that periodic recovery at defined time intervals produced substantially more hepatocytes and cardiomyocytes than continuous operation, most likely by preventing damage to fragile cell types. In future work, we will seek to integrate additional operations such as upstream tissue preparation and downstream microfluidic cell sorting and detection to create powerful point-of-care single cell diagnostic platforms.

