Coupled Co-clustering-based Unsupervised Transfer Learning for the Integrative Analysis of Single-Cell Genomic Data

Abstract: Unsupervised methods, such as clustering methods, are essential to the analysis of single-cell genomic data. Most current clustering methods are designed for one data type only, such as scRNA-seq, scATAC-seq or sc-methylation data alone, and only a few have been developed for the integrative analysis of multiple data types. Integrative analysis of multimodal single-cell genomic data sets leverages the power in multiple data sets and can deepen the biological insight. We propose a coupled co-clustering-based unsupervised transfer learning algorithm (coupleCoC) for the integrative analysis of multimodal single-cell data. Our proposed coupleCoC builds upon the information-theoretic co-clustering framework. We applied coupleCoC to the integrative analysis of scATAC-seq and scRNA-seq data, sc-methylation and scRNA-seq data, and scRNA-seq data from mouse and human. We demonstrate that coupleCoC improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic data sets. The software and data sets are available at https://github.com/cuhklinlab/coupleCoC.

Coupled co-clustering-based unsupervised transfer learning for the integrative analysis of single-cell genomic data

Briefings in Bioinformatics ◽ 10.1093/bib/bbaa347 ◽ 2020

Author(s): Pengcheng Zeng ◽ Jiaxuan Wangwu ◽ Zhixiang Lin

Keyword(s): Single Cell ◽ Transfer Learning ◽ Genomic Data ◽ Integrative Analysis ◽ Data Sets ◽ Clustering Methods ◽ Genomic Features ◽ Multiple Data ◽ Multiple Data Sets ◽ Cell Data

Abstract: Unsupervised methods, such as clustering methods, are essential to the analysis of single-cell genomic data. Most current clustering methods are designed for one data type only, such as single-cell RNA sequencing (scRNA-seq), single-cell ATAC sequencing (scATAC-seq) or sc-methylation data alone, and only a few have been developed for the integrative analysis of multiple data types. The integrative analysis of multimodal single-cell genomic data sets leverages the power in multiple data sets and can deepen the biological insight. In this paper, we propose a coupled co-clustering-based unsupervised transfer learning algorithm (coupleCoC) for the integrative analysis of multimodal single-cell data. Our proposed coupleCoC builds upon the information-theoretic co-clustering framework. In co-clustering, the cells and the genomic features are clustered simultaneously. Clustering similar genomic features reduces the noise in single-cell data and facilitates the transfer of knowledge across single-cell datasets. We applied coupleCoC to the integrative analysis of scATAC-seq and scRNA-seq data, sc-methylation and scRNA-seq data, and scRNA-seq data from mouse and human. We demonstrate that coupleCoC improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. coupleCoC is also computationally efficient and can scale up to large datasets. Availability: The software and datasets are available at https://github.com/cuhklinlab/coupleCoC.
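The co-clustering idea in this abstract can be illustrated with a small sketch. It assumes the general information-theoretic co-clustering objective, in which a cell-by-feature count matrix is normalized to a joint distribution and a co-clustering is scored by the mutual information it loses, I(X;Y) - I(X̂;Ŷ). This is not the coupleCoC implementation; the function names and the toy matrix below are hypothetical and chosen only for illustration.

```python
import numpy as np

def mutual_information(p):
    """Mutual information (in nats) of a joint distribution given as a 2-D array."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                      # normalize counts to a joint distribution
    px = p.sum(axis=1, keepdims=True)    # row (cell) marginal, shape (n, 1)
    py = p.sum(axis=0, keepdims=True)    # column (feature) marginal, shape (1, m)
    mask = p > 0                         # skip zero cells, since 0 * log 0 = 0
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

def coclustered_joint(p, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Aggregate the joint distribution over row clusters and column clusters."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    rows = np.eye(n_row_clusters)[np.asarray(row_labels)]  # one-hot row membership
    cols = np.eye(n_col_clusters)[np.asarray(col_labels)]  # one-hot column membership
    return rows.T @ p @ cols

def information_loss(p, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Mutual information lost by replacing cells and features with their clusters."""
    reduced = coclustered_joint(p, row_labels, col_labels, n_row_clusters, n_col_clusters)
    return mutual_information(p) - mutual_information(reduced)

# Hypothetical toy data: 4 cells x 4 features with two clear blocks.
counts = np.array([[4, 4, 0, 0],
                   [4, 4, 0, 0],
                   [0, 0, 4, 4],
                   [0, 0, 4, 4]])
good = information_loss(counts, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = information_loss(counts, [0, 1, 0, 1], [0, 0, 1, 1], 2, 2)
print(good, bad)
```

A co-clustering that follows the block structure loses less mutual information than one that mixes the blocks, which is the quantity an information-theoretic co-clustering algorithm tries to minimize.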

Common and distinct variation in data fusion of designed experimental data

Metabolomics ◽ 10.1007/s11306-019-1622-2 ◽ 2019 ◽ Vol 16 (1) ◽ Cited By ~ 3

Author(s): Masoumeh Alinaghi ◽ Hanne Christine Bertram ◽ Anders Brunse ◽ Age K. Smilde ◽ Johan A. Westerhuis

Keyword(s): Experimental Design ◽ Data Fusion ◽ Component Analysis ◽ Biological Data ◽ Integrative Analysis ◽ Data Sets ◽ Metabolomics Data ◽ Simultaneous Component Analysis ◽ Multiple Data ◽ Multiple Data Sets

Abstract: Introduction: Integrative analysis of multiple data sets can provide complementary information about the studied biological system. However, data fusion of multiple biological data sets can be complicated, as the data sets might contain different sources of variation due to underlying experimental factors. Therefore, taking the experimental design of the data sets into account could be important in data fusion. Objectives: In the present work, we aim to incorporate experimental design information in the integrative analysis of multiple designed data sets. Methods: Here we describe penalized exponential ANOVA simultaneous component analysis (PE-ASCA), a new method for the integrative analysis of data sets from multiple compartments or analytical platforms with the same underlying experimental design. Results: Using two simulated cases, the results of simultaneous component analysis (SCA), penalized exponential simultaneous component analysis (P-ESCA) and ANOVA-simultaneous component analysis (ASCA) are compared with those of the proposed method. Furthermore, real metabolomics data obtained from NMR analysis of two different brain tissues (hypothalamus and midbrain) from the same piglets with an underlying experimental design are investigated by PE-ASCA. Conclusions: This method provides an improved understanding of the common and distinct variation in response to different experimental factors.

Integrated analysis of multimodal single-cell data

10.1101/2020.10.12.335331 ◽ 2020 ◽ Cited By ~ 3

Author(s): Yuhan Hao ◽ Stephanie Hao ◽ Erica Andersen-Nissen ◽ William M. Mauck ◽ Shiwei Zheng ◽ ...

Keyword(s): Single Cell ◽ Nearest Neighbor ◽ White Blood Cells ◽ Integrative Analysis ◽ Integrated Analysis ◽ Data Types ◽ Multiple Modalities ◽ Multimodal Analysis ◽ Multiple Data ◽ Definition Of

Abstract: The simultaneous measurement of multiple modalities, known as multimodal analysis, represents an exciting frontier for single-cell genomics and necessitates new computational methods that can define cellular states based on multiple data types. Here, we introduce ‘weighted-nearest neighbor’ analysis, an unsupervised framework to learn the relative utility of each data type in each cell, enabling an integrative analysis of multiple modalities. We apply our procedure to a CITE-seq dataset of hundreds of thousands of human white blood cells alongside a panel of 228 antibodies to construct a multimodal reference atlas of the circulating immune system. We demonstrate that integrative analysis substantially improves our ability to resolve cell states and validate the presence of previously unreported lymphoid subpopulations. Moreover, we demonstrate how to leverage this reference to rapidly map new datasets, and to interpret immune responses to vaccination and COVID-19. Our approach represents a broadly applicable strategy to analyze single-cell multimodal datasets, including paired measurements of RNA and chromatin state, and to look beyond the transcriptome towards a unified and multimodal definition of cellular identity. Availability: Installation instructions, documentation, tutorials, and CITE-seq datasets are available at http://www.satijalab.org/seurat

coupleCoC+: an information-theoretic co-clustering-based transfer learning framework for the integrative analysis of single-cell genomic data

10.1101/2021.02.17.431728 ◽ 2021

Author(s): Pengcheng Zeng ◽ Zhixiang Lin

Keyword(s): Single Cell ◽ Transfer Learning ◽ Genomic Data ◽ Cell Types ◽ Integrative Analysis ◽ Computationally Efficient ◽ Information Theoretic ◽ Mouse Cortex ◽ Source Data ◽ Target Data

Abstract: Technological advances have enabled us to profile multiple molecular layers at unprecedented single-cell resolution and the available datasets from multiple samples or domains are growing. These datasets, including scRNA-seq data, scATAC-seq data and sc-methylation data, usually have different powers in identifying the unknown cell types through clustering. So, methods that integrate multiple datasets can potentially lead to a better clustering performance. Here we propose coupleCoC+ for the integrative analysis of single-cell genomic data. coupleCoC+ is a transfer learning method based on the information-theoretic co-clustering framework. In coupleCoC+, we utilize the information in one dataset, the source data, to facilitate the analysis of another dataset, the target data. coupleCoC+ uses the linked features in the two datasets for effective knowledge transfer, and it also uses the information of the features in the target data that are unlinked with the source data. In addition, coupleCoC+ matches similar cell types across the source data and the target data. By applying coupleCoC+ to the integrative clustering of mouse cortex scATAC-seq data and scRNA-seq data, mouse and human scRNA-seq data, and mouse cortex sc-methylation and scRNA-seq data, we demonstrate that coupleCoC+ improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. coupleCoC+ has fast convergence and it is computationally efficient. The software is available at https://github.com/cuhklinlab/coupleCoC_plus.

Integrative Analysis of Gene Networks and Their Application to Lung Adenocarcinoma Studies

Cancer Informatics ◽ 10.1177/1176935117690778 ◽ 2017 ◽ Vol 16 ◽ pp. 117693511769077

Author(s): Sangin Lee ◽ Faming Liang ◽ Ling Cai ◽ Guanghua Xiao

Keyword(s): Lung Adenocarcinoma ◽ Graphical Models ◽ Gene Networks ◽ Integrative Analysis ◽ Biological Knowledge ◽ Data Sets ◽ Gene Expressions ◽ Data Set ◽ Multiple Data ◽ Multiple Data Sets

The construction of gene regulatory networks (GRNs) is an essential component of biomedical research to determine disease mechanisms and identify treatment targets. Gaussian graphical models (GGMs) have been widely used for constructing GRNs by inferring conditional dependence among a set of gene expressions. In practice, GRNs obtained by the analysis of a single data set may not be reliable due to sample limitations. Therefore, it is important to integrate multiple data sets from comparable studies to improve the construction of a GRN. In this article, we introduce an equivalent measure of partial correlation coefficients in GGMs and then extend the method to construct a GRN by combining the equivalent measures from different sources. Furthermore, we develop a method for multiple data sets with a natural missing mechanism to accommodate the differences among different platforms in multiple sources of data. Simulation results show that this integrative analysis outperforms the standard methods and can detect hub genes in the true network. The proposed integrative method was applied to 12 lung adenocarcinoma data sets collected from different studies. The constructed network is consistent with the current biological knowledge and reveals new insights about lung adenocarcinoma.
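As background for the conditional-dependence idea in this abstract, the sketch below computes standard partial correlations from the inverse covariance (precision) matrix, using the identity ρ_ij = -Ω_ij / sqrt(Ω_ii Ω_jj). This is the textbook quantity for Gaussian graphical models, not the equivalent measure or the integration method the authors propose; the function name and the three-gene chain example are hypothetical.

```python
import numpy as np

def partial_correlations(cov):
    """Partial correlation matrix implied by a covariance matrix.

    Entry (i, j) is the correlation between variables i and j after
    conditioning on all remaining variables.
    """
    precision = np.linalg.inv(np.asarray(cov, dtype=float))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)          # each variable is perfectly tied to itself
    return pcor

# Hypothetical chain gene1 - gene2 - gene3: gene1 and gene3 are linked only via gene2.
precision = np.array([[2.0, -1.0, 0.0],
                      [-1.0, 2.0, -1.0],
                      [0.0, -1.0, 2.0]])
cov = np.linalg.inv(precision)
print(cov[0, 2])                          # marginal covariance of gene1 and gene3
print(partial_correlations(cov)[0, 2])    # their partial correlation
```

In a Gaussian graphical model, an edge between two genes is drawn only when their partial correlation is non-zero, so in this toy chain gene1 and gene3 are marginally correlated but not directly connected.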

Variational autoencoders for cancer data integration: design principles and computational practice

10.1101/719542 ◽ 2019 ◽ Cited By ~ 2

Author(s): Nikola Simidjievski ◽ Cristian Bodnar ◽ Ifrah Tariq ◽ Paul Scherer ◽ Helena Andres-Terre ◽ ...

Keyword(s): Breast Cancer ◽ Clinical Data ◽ Molecular Taxonomy ◽ Patient Data ◽ Data Sets ◽ Data Types ◽ Cancer Data ◽ Learning Framework ◽ Multiple Data ◽ Multiple Data Sets

Abstract: International initiatives such as the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) are collecting multiple data sets at different genome-scales with the aim to identify novel cancer bio-markers and predict patient survival. To analyse such data, several machine learning, bioinformatics and statistical methods have been applied, among them neural networks such as autoencoders. Although these models provide a good statistical learning framework to analyse multi-omic and/or clinical data, there is a distinct lack of work on how to integrate diverse patient data and identify the optimal design best suited to the available data. In this paper, we investigate several autoencoder architectures that integrate a variety of cancer patient data types (e.g., multi-omics and clinical data). We perform extensive analyses of these approaches and provide a clear methodological and computational framework for designing systems that enable clinicians to investigate cancer traits and translate the results into clinical applications. We demonstrate how these networks can be designed, built and, in particular, applied to tasks of integrative analyses of heterogeneous breast cancer data. The results show that these approaches yield relevant data representations that, in turn, lead to accurate and stable diagnosis.

coupleCoC+: An information-theoretic co-clustering-based transfer learning framework for the integrative analysis of single-cell genomic data

PLoS Computational Biology ◽ 10.1371/journal.pcbi.1009064 ◽ 2021 ◽ Vol 17 (6) ◽ pp. e1009064

Author(s): Pengcheng Zeng ◽ Zhixiang Lin

Keyword(s): Single Cell ◽ Transfer Learning ◽ Genomic Data ◽ Cell Types ◽ Integrative Analysis ◽ Computationally Efficient ◽ Information Theoretic ◽ Mouse Cortex ◽ Source Data ◽ Target Data

Technological advances have enabled us to profile multiple molecular layers at unprecedented single-cell resolution and the available datasets from multiple samples or domains are growing. These datasets, including scRNA-seq data, scATAC-seq data and sc-methylation data, usually have different powers in identifying the unknown cell types through clustering. So, methods that integrate multiple datasets can potentially lead to a better clustering performance. Here we propose coupleCoC+ for the integrative analysis of single-cell genomic data. coupleCoC+ is a transfer learning method based on the information-theoretic co-clustering framework. In coupleCoC+, we utilize the information in one dataset, the source data, to facilitate the analysis of another dataset, the target data. coupleCoC+ uses the linked features in the two datasets for effective knowledge transfer, and it also uses the information of the features in the target data that are unlinked with the source data. In addition, coupleCoC+ matches similar cell types across the source data and the target data. By applying coupleCoC+ to the integrative clustering of mouse cortex scATAC-seq data and scRNA-seq data, mouse and human scRNA-seq data, mouse cortex sc-methylation and scRNA-seq data, and human blood dendritic cells scRNA-seq data from two batches, we demonstrate that coupleCoC+ improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. coupleCoC+ has fast convergence and it is computationally efficient. The software is available at https://github.com/cuhklinlab/coupleCoC_plus.

Heterogeneous Effects of the De Jure and De Facto Business Environment: Findings from Multiple Data Sets on the Business Environment

10.1596/1813-9450-9115 ◽ 2020

Author(s): Christine Zhenwei Qiang ◽ He Wang ◽ L. Colin Xu

Keyword(s): Business Environment ◽ Data Sets ◽ Multiple Data ◽ Heterogeneous Effects ◽ Multiple Data Sets

Establishment of a prognostic model of four genes in gastric cancer based on multiple data sets

Cancer Medicine ◽ 10.1002/cam4.3654 ◽ 2021

Author(s): Liqiang Zhou ◽ Shi H. Li ◽ You Wu ◽ Lin Xin

Keyword(s): Gastric Cancer ◽ Prognostic Model ◽ Data Sets ◽ Multiple Data ◽ Multiple Data Sets

The development of nurses’ foundational values

Nursing Ethics ◽ 10.1177/09697330211003222 ◽ 2021 ◽ pp. 096973302110032

Author(s): Sastrawan Sastrawan ◽ Jennifer Weller-Newton ◽ Gabrielle Brand ◽ Gulzar Malik

Keyword(s): Professional Training ◽ Reference Points ◽ Constructivist Grounded Theory ◽ Professional Values ◽ Data Sets ◽ Value System ◽ Institutional Values ◽ Multiple Data ◽ Human Ethics ◽ Multiple Data Sets

Background: In the ever-changing and complex healthcare environment, nurses encounter challenging situations that may involve a clash between their personal and professional values resulting in a profound impact on their practice. Nevertheless, there is a dearth of literature on how nurses develop their personal–professional values. Aim: The aim of this study was to understand how nurses develop their foundational values as the base for their value system. Research design: A constructivist grounded theory methodology was employed to collect multiple data sets, including face-to-face focus group and individual interviews, along with anecdote and reflective stories. Participants and research context: Fifty-four nurses working across various nursing settings in Indonesia were recruited to participate. Ethical considerations: Ethics approval was obtained from the Monash University Human Ethics Committee, project approval number 1553. Findings: Foundational values acquisition was achieved through family upbringing, professional nurse education and organisational/institutional values reinforcement. These values are framed through three reference points: religious lens, humanity perspective and professionalism. This framing results in a unique combination of personal–professional values that comprise nurses’ values system. Values are transferred to other nurses either in a formal or informal way as part of one’s professional responsibility and customary social interaction via telling and sharing in person or through social media. Discussion: Values and ethics are inherently interwoven during nursing practice. Ethical and moral values are part of professional training, but other values are often buried in a hidden curriculum, and attained and activated through interactions during nurses’ training. Conclusion: Developing a value system is a complex undertaking that involves basic social processes of attaining, enacting and socialising values. These processes encompass several intertwined entities such as the sources of values, the pool of foundational values, value perspectives and framings, initial value structures, and methods of value transference.
