coupleCoC+: An information-theoretic co-clustering-based transfer learning framework for the integrative analysis of single-cell genomic data

2021 ◽  
Vol 17 (6) ◽  
pp. e1009064
Author(s):  
Pengcheng Zeng ◽  
Zhixiang Lin

Technological advances have enabled us to profile multiple molecular layers at unprecedented single-cell resolution and the available datasets from multiple samples or domains are growing. These datasets, including scRNA-seq data, scATAC-seq data and sc-methylation data, usually have different powers in identifying the unknown cell types through clustering. So, methods that integrate multiple datasets can potentially lead to a better clustering performance. Here we propose coupleCoC+ for the integrative analysis of single-cell genomic data. coupleCoC+ is a transfer learning method based on the information-theoretic co-clustering framework. In coupleCoC+, we utilize the information in one dataset, the source data, to facilitate the analysis of another dataset, the target data. coupleCoC+ uses the linked features in the two datasets for effective knowledge transfer, and it also uses the information of the features in the target data that are unlinked with the source data. In addition, coupleCoC+ matches similar cell types across the source data and the target data. By applying coupleCoC+ to the integrative clustering of mouse cortex scATAC-seq data and scRNA-seq data, mouse and human scRNA-seq data, mouse cortex sc-methylation and scRNA-seq data, and human blood dendritic cells scRNA-seq data from two batches, we demonstrate that coupleCoC+ improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. coupleCoC+ has fast convergence and it is computationally efficient. The software is available at https://github.com/cuhklinlab/coupleCoC_plus.
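The information-theoretic co-clustering framework that coupleCoC+ builds on scores a pair of clusterings (one over cells, one over features) by how much mutual information between rows and columns is lost when the data matrix is collapsed to cluster level. The Python sketch below computes only that generic, single-dataset loss on a toy count matrix. It is not the coupleCoC+ objective, which additionally involves the source data and the unlinked features; the function names and the toy data are illustrative assumptions.

```python
import math

def normalize(counts):
    """Turn a non-negative count matrix into a joint distribution."""
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]

def mutual_information(joint):
    """Mutual information (in nats) between the rows and columns of a joint distribution."""
    row_marg = [sum(row) for row in joint]
    col_marg = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (row_marg[i] * col_marg[j]))
    return mi

def aggregate(joint, row_labels, col_labels):
    """Sum the joint distribution within each (row cluster, column cluster) block."""
    n_r = max(row_labels) + 1
    n_c = max(col_labels) + 1
    out = [[0.0] * n_c for _ in range(n_r)]
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            out[row_labels[i]][col_labels[j]] += p
    return out

def coclustering_loss(counts, row_labels, col_labels):
    """Loss in mutual information caused by a co-clustering (smaller is better)."""
    joint = normalize(counts)
    clustered = aggregate(joint, row_labels, col_labels)
    return mutual_information(joint) - mutual_information(clustered)

# Toy data: 4 cells (rows) x 4 features (columns) with two obvious blocks.
counts = [
    [5, 5, 0, 0],
    [4, 6, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 6, 4],
]
good = coclustering_loss(counts, [0, 0, 1, 1], [0, 0, 1, 1])
bad = coclustering_loss(counts, [0, 1, 0, 1], [0, 1, 0, 1])
print(good < bad)
```

A co-clustering that respects the block structure keeps almost all of the mutual information, whereas one that mixes the blocks loses nearly all of it.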

2021 ◽  
Author(s):  
Pengcheng Zeng ◽  
Zhixiang Lin

Abstract Technological advances have enabled us to profile multiple molecular layers at unprecedented single-cell resolution and the available datasets from multiple samples or domains are growing. These datasets, including scRNA-seq data, scATAC-seq data and sc-methylation data, usually have different powers in identifying the unknown cell types through clustering. So, methods that integrate multiple datasets can potentially lead to a better clustering performance. Here we propose coupleCoC+ for the integrative analysis of single-cell genomic data. coupleCoC+ is a transfer learning method based on the information-theoretic co-clustering framework. In coupleCoC+, we utilize the information in one dataset, the source data, to facilitate the analysis of another dataset, the target data. coupleCoC+ uses the linked features in the two datasets for effective knowledge transfer, and it also uses the information of the features in the target data that are unlinked with the source data. In addition, coupleCoC+ matches similar cell types across the source data and the target data. By applying coupleCoC+ to the integrative clustering of mouse cortex scATAC-seq data and scRNA-seq data, mouse and human scRNA-seq data, and mouse cortex sc-methylation and scRNA-seq data, we demonstrate that coupleCoC+ improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. coupleCoC+ has fast convergence and it is computationally efficient. The software is available at https://github.com/cuhklinlab/coupleCoC_plus.


Author(s):  
Pengcheng Zeng ◽  
Jiaxuan Wangwu ◽  
Zhixiang Lin

Abstract Unsupervised methods, such as clustering methods, are essential to the analysis of single-cell genomic data. Most current clustering methods are designed for one data type only, such as single-cell RNA sequencing (scRNA-seq), single-cell ATAC sequencing (scATAC-seq) or sc-methylation data alone, and only a few are developed for the integrative analysis of multiple data types. The integrative analysis of multimodal single-cell genomic data sets leverages the power in multiple data sets and can deepen the biological insight. In this paper, we propose a coupled co-clustering-based unsupervised transfer learning algorithm (coupleCoC) for the integrative analysis of multimodal single-cell data. Our proposed coupleCoC builds upon the information-theoretic co-clustering framework. In co-clustering, both the cells and the genomic features are simultaneously clustered. Clustering similar genomic features reduces the noise in single-cell data and facilitates transfer of knowledge across single-cell datasets. We applied coupleCoC to the integrative analysis of scATAC-seq and scRNA-seq data, sc-methylation and scRNA-seq data, and scRNA-seq data from mouse and human. We demonstrate that coupleCoC improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. Our method coupleCoC is also computationally efficient and can scale up to large datasets. Availability: The software and datasets are available at https://github.com/cuhklinlab/coupleCoC.


2020 ◽  
Author(s):  
Pengcheng Zeng ◽  
Jiaxuan WangWu ◽  
Zhixiang Lin

Abstract Unsupervised methods, such as clustering methods, are essential to the analysis of single-cell genomic data. Most current clustering methods are designed for one data type only, such as scRNA-seq, scATAC-seq or sc-methylation data alone, and a few are developed for the integrative analysis of multiple data types. Integrative analysis of multimodal single-cell genomic data sets leverages the power in multiple data sets and can deepen the biological insight. We propose a coupled co-clustering-based unsupervised transfer learning algorithm (coupleCoC) for the integrative analysis of multimodal single-cell data. Our proposed coupleCoC builds upon the information theoretic co-clustering framework. We applied coupleCoC for the integrative analysis of scATAC-seq and scRNA-seq data, sc-methylation and scRNA-seq data, and scRNA-seq data from mouse and human. We demonstrate that coupleCoC improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic data sets. The software and data sets are available at https://github.com/cuhklinlab/coupleCoC.


2020 ◽  
Vol 3 (1) ◽  
Author(s):  
Patrick S. Stumpf ◽  
Xin Du ◽  
Haruka Imanishi ◽  
Yuya Kunisaki ◽  
Yuichiro Semba ◽  
...  

Abstract Biomedical research often involves conducting experiments on model organisms in the anticipation that the biology learnt will transfer to humans. Previous comparative studies of mouse and human tissues were limited by the use of bulk-cell material. Here we show that transfer learning (the branch of machine learning that concerns passing information from one domain to another) can be used to efficiently map bone marrow biology between species, using data obtained from single-cell RNA sequencing. We first trained a multiclass logistic regression model to recognize different cell types in mouse bone marrow, achieving equivalent performance to more complex artificial neural networks. Furthermore, the model was able to identify individual human bone marrow cells with 83% overall accuracy. However, some human cell types were not easily identified, indicating important differences in biology. When re-training the mouse classifier using data from human, fewer than 10 human cells of a given type were needed to accurately learn its representation. In some cases, human cell identities could be inferred directly from the mouse classifier via zero-shot learning. These results show how simple machine learning models can be used to reconstruct complex biology from limited data, with broad implications for biomedical research.


Author(s):  
Dengyu Xiao ◽  
Yixiang Huang ◽  
Chengjin Qin ◽  
Zhiyu Liu ◽  
Yanming Li ◽  
...  

Data-driven machinery fault diagnosis has gained much attention from academia and industry as a way to ensure machinery reliability. Traditional fault diagnosis frameworks commonly rest on a default assumption: the training and test samples share a similar distribution. However, this assumption almost never holds in real industrial applications, where the operating condition changes over time and the quantity of same-distribution samples is often insufficient to build a qualified diagnostic model. Therefore, transfer learning, which can leverage the knowledge learnt from massive source data to establish a diagnosis model for similar but small target data, has shown potential value in machine fault diagnosis with small sample sizes. In this paper, we propose a novel fault diagnosis framework for small amounts of target data based on transfer learning, using a modified TrAdaBoost algorithm and convolutional neural networks. First, the massive source data with different distributions is added to the target data as the training data. Then, a convolutional neural network is selected as the base learner and the modified TrAdaBoost algorithm is employed for the weight update of each training sample to form a stronger diagnostic model. The proposed framework is experimentally demonstrated and discussed through tests of six three-phase induction motors under different operating conditions and fault types. Results show that, compared with other methods, the proposed framework achieves the highest fault diagnostic accuracy with inadequate target data.
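The abstract does not say how its TrAdaBoost variant is modified, so the Python sketch below shows only the weight update of the original TrAdaBoost algorithm for binary labels, as a point of reference. Misclassified source samples are down-weighted by a fixed factor, while misclassified target samples are up-weighted according to the weighted target error. The function name and the toy inputs are illustrative assumptions, and the base learner (a convolutional neural network in the abstract) is represented only by its per-sample mistakes.

```python
import math

def tradaboost_update(src_w, tgt_w, src_wrong, tgt_wrong, n_rounds):
    """One round of the original TrAdaBoost weight update (not the paper's modified variant).

    src_w, tgt_w         -- current weights of source and target samples
    src_wrong, tgt_wrong -- booleans, True where the base learner misclassified
    n_rounds             -- total number of boosting rounds planned
    """
    n_src = len(src_w)
    # Fixed down-weighting factor for misclassified source samples.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_src) / n_rounds))

    # Weighted error of the base learner, measured on the target samples only.
    eps = sum(w for w, wrong in zip(tgt_w, tgt_wrong) if wrong) / sum(tgt_w)
    if not 0.0 < eps < 0.5:
        raise ValueError("target error must lie strictly between 0 and 0.5")
    beta_tgt = eps / (1.0 - eps)

    # Source samples the learner got wrong are assumed less relevant: shrink them.
    new_src = [w * beta_src if wrong else w for w, wrong in zip(src_w, src_wrong)]
    # Target samples the learner got wrong need more attention: grow them.
    new_tgt = [w / beta_tgt if wrong else w for w, wrong in zip(tgt_w, tgt_wrong)]
    return new_src, new_tgt

new_src, new_tgt = tradaboost_update(
    src_w=[1.0, 1.0, 1.0, 1.0],
    tgt_w=[1.0, 1.0, 1.0, 1.0],
    src_wrong=[True, False, False, True],
    tgt_wrong=[True, False, False, False],
    n_rounds=10,
)
print(new_src, new_tgt)
```

In a full training loop the weights would be renormalized and passed to the base learner at the start of each round.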


2020 ◽  
Author(s):  
Tao Yang ◽  
Nicole Alessandri-Haber ◽  
Wen Fury ◽  
Michael Schaner ◽  
Robert Breese ◽  
...  

Abstract RNA sequencing technology promises an unprecedented opportunity in learning disease mechanisms and discovering new treatment targets. Recent spatial transcriptomics methods further enable transcriptome profiling at spatially resolved spots in a tissue section. In controlled experiments, it is often of immense importance to know the cell composition in different samples. Understanding the cell type content in each tissue spot is also crucial to the interpretation of spatial transcriptome data. Though single cell RNA-seq has the power to reveal cell type composition and expression heterogeneity in different cells, it remains costly and sometimes infeasible when live cells cannot be obtained or sufficiently dissociated. To computationally resolve the cell composition in RNA-seq data of mixed cells, we present AdRoit, an accurate and robust method to infer transcriptome composition. The method estimates the proportions of each cell type in the compound RNA-seq data using known single cell data of relevant cell types. It uniquely uses an adaptive learning approach to correct gene-wise bias due to the difference in sequencing techniques. AdRoit also utilizes cell type specific genes while controlling their cross-sample variability. Our systematic benchmarking, spanning from simple to complex tissues, shows that AdRoit has superior sensitivity and specificity compared to other existing methods. Its performance holds for multiple single cell and compound RNA-seq platforms. In addition, AdRoit is computationally efficient and runs one to two orders of magnitude faster than some of the state-of-the-art methods.
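Estimating cell type proportions from compound RNA-seq data can be framed, in its simplest form, as a non-negative least-squares problem: find non-negative weights such that the weighted sum of reference cell type profiles best matches the mixed profile. The Python sketch below solves that plain formulation by projected gradient descent. It is not AdRoit's algorithm, which adds adaptive gene-wise bias correction and gene weighting; the function name and the toy reference matrix are illustrative assumptions.

```python
def estimate_proportions(reference, mixture, n_iter=500):
    """Estimate cell type proportions by non-negative least squares.

    reference -- genes x cell types matrix of reference expression profiles
    mixture   -- expression of the same genes in the mixed sample
    """
    n_genes = len(reference)
    n_types = len(reference[0])
    # The squared Frobenius norm bounds the largest eigenvalue of A^T A,
    # so its reciprocal is a safe gradient step size.
    step = 1.0 / sum(v * v for row in reference for v in row)
    x = [0.0] * n_types
    for _ in range(n_iter):
        # Residual of the current fit: A x - b.
        residual = [
            sum(a * xi for a, xi in zip(row, x)) - b
            for row, b in zip(reference, mixture)
        ]
        # Gradient of 0.5 * ||A x - b||^2 is A^T (A x - b).
        grad = [
            sum(reference[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, xi - step * gk) for xi, gk in zip(x, grad)]
    total = sum(x)
    return [xi / total for xi in x] if total > 0 else x

# Toy reference: 4 marker genes x 2 cell types.
reference = [
    [10.0, 0.0],
    [8.0, 1.0],
    [1.0, 9.0],
    [0.0, 10.0],
]
# Mixture built from 30% of type 0 and 70% of type 1.
mixture = [0.3 * row[0] + 0.7 * row[1] for row in reference]
proportions = estimate_proportions(reference, mixture)
print(proportions)
```

On noise-free toy data the estimate recovers the mixing weights; real data would also require normalization across platforms, which is the kind of bias the abstract says AdRoit corrects.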


2019 ◽  
Author(s):  
Umang Varma ◽  
Justin Colacino ◽  
Anna Gilbert

Abstract Single cell RNA-sequencing (scRNA-seq) technologies have generated an expansive amount of new biological information, revealing new cellular populations and hierarchical relationships. A number of technologies complementary to scRNA-seq rely on the selection of a smaller number of marker genes (or features) to accurately differentiate cell types within a complex mixture of cells. In this paper, we benchmark differential expression methods against information-theoretic feature selection methods to evaluate the ability of these algorithms to identify small and efficient sets of genes that are informative about cell types. Unlike differential expression methods, which are strictly binary and univariate, information-theoretic methods can be used in any combination of binary or multiclass and univariate or multivariate. We show that, for some datasets, information-theoretic methods can reveal genes that are both distinct from those selected by traditional algorithms and as informative, if not more so, about the class labels. We also present detailed and principled theoretical analyses of these algorithms. All information-theoretic methods in this paper are implemented in our PicturedRocks Python package, which is compatible with the widely used scanpy package.
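The simplest information-theoretic feature selection criterion is univariate: score each gene by the mutual information between its discretized expression and the cell type labels, then keep the top-scoring genes. The Python sketch below illustrates that criterion on toy data using only the standard library. It does not use or reproduce the PicturedRocks API, and the function names, gene names and binarized values are illustrative assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

def rank_genes(expression, labels):
    """Rank genes by mutual information with the labels, most informative first."""
    scores = {gene: mutual_information(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: binarized expression of two hypothetical genes in six cells.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "geneY": [1, 0, 1, 1, 0, 1],  # same pattern in both cell types
    "geneX": [1, 1, 1, 0, 0, 0],  # on in type A, off in type B
}
ranking = rank_genes(expression, labels)
print(ranking)
```

Multivariate criteria extend this idea by scoring a candidate gene jointly with the genes already selected, which lets them avoid picking redundant markers.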


2020 ◽  
Author(s):  
John N. Weinstein ◽  
Mary A. Rohrdanz ◽  
Mark Stucky ◽  
James Melott ◽  
Jun Ma ◽  
...  

Abstract OmicPioneer-sc is an open-source data visualization/analysis package that integrates dimensionality-reduction plots (DRPs) such as t-SNE and UMAP with Next-Generation Clustered Heat Maps (NGCHMs) and Pathway Visualization Modules (PVMs) in a seamless, highly interactive exploratory environment. It includes fluent zooming and navigation, a statistical toolkit, dozens of link-outs to external public bioinformatic resources, high-resolution graphics that meet the requirements of all major journals, and the ability to store all metadata needed to reproduce the visualizations at a later time. A user-friendly, multi-panel graphical interface enables non-informaticians to interact with the system without programming, asking and answering questions that require navigation among the three types of modules or extension from them to the Gene Ontology or information on therapies. The visual integration can be useful for detective work to identify and annotate cell-types for color-coding of the DRPs, and multiple NGCHMs can be layered on top of each other (with toggling among them) as an aid to multi-omic analysis. The tools are available in containerized form with APIs to facilitate incorporation as a plug-in to other bioinformatic environments. The capabilities of OmicPioneer-sc are illustrated here through application to a single-cell RNA-seq airway dataset pertinent to the biology of both cancer and COVID-19. [Supplemental material is available for this article.]

