Biologically Supervised Hierarchical Clustering Algorithms for Gene Expression Data

Author(s):  
Grzegorz M Boratyn ◽  
Susmita Datta ◽  
Somnath Datta
2019 ◽  
Author(s):  
Kyungmin Ahn ◽  
Hironobu Fujiwara

Abstract Background: In single-cell RNA-sequencing (scRNA-seq) data analysis, a number of statistical tools from multivariate data analysis (MDA) have been developed to help analyze gene expression data. The MDA approach typically examines genes as discrete units and ignores the dependency between the data components. In this paper, we propose a functional data analysis (FDA) approach to scRNA-seq data in which we consider each cell as a single function. To avoid a large number of dropouts (zero or near-zero values) and to reduce the high dimensionality of the data, we first perform a principal component analysis (PCA) and assign the PCs to be the amplitude of the function. We then use the index of the PCs directly from PCA for the phase components. This approach allows us to apply FDA clustering methods to scRNA-seq data analysis. Results: To demonstrate the robustness of our method, we apply several existing FDA clustering algorithms to gene expression data and compare their accuracy in classifying cell types against conventional MDA clustering methods. The FDA clustering algorithms achieve superior accuracy on simulated data as well as on real human and mouse scRNA-seq data. Conclusions: This new statistical technique enhances classification performance and ultimately improves the understanding of stochastic biological processes. The framework provides an essentially different analytical approach to scRNA-seq data, which can complement conventional MDA methods. It can be particularly effective when current MDA methods cannot detect or uncover the hidden functional nature of gene expression dynamics.
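
The idea of treating each cell as a function over the PC index can be illustrated with a minimal sketch. It assumes the PC scores have already been computed elsewhere, and it uses a trapezoidal-rule L2 distance between curves as one plausible functional dissimilarity; the abstract does not say which distance or clustering algorithm the authors use, so the function name `l2_distance` and the unit spacing of the PC index are assumptions made here for illustration only.

```python
def l2_distance(curve_a, curve_b):
    """Approximate L2 distance between two cells viewed as curves.

    Each curve is a list of PC scores; the PC index (1, 2, ..., p) plays the
    role of the function argument, with unit spacing (an assumption).
    The integral of the squared difference is approximated with the
    trapezoidal rule.
    """
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must have the same number of PCs")
    sq = [(a - b) ** 2 for a, b in zip(curve_a, curve_b)]
    integral = sum((sq[i] + sq[i + 1]) / 2.0 for i in range(len(sq) - 1))
    return integral ** 0.5


def nearest_curve(query, references):
    """Index of the reference curve closest to `query` in L2 distance."""
    return min(range(len(references)), key=lambda i: l2_distance(query, references[i]))
```

A distance of this kind could feed any distance-based clustering routine; the FDA methods referred to in the abstract may instead work with smoothed curves or separate amplitude and phase, which this sketch does not attempt.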


2021 ◽  
Author(s):  
Dan Li ◽  
Hong Gu ◽  
Qiaozhen Chang ◽  
Jia Wang ◽  
Pan Qin

Abstract Clustering algorithms have been successfully applied to identify co-expressed gene groups from gene expression data. Missing values often occur in gene expression data, which presents a challenge for gene clustering. When partitioning incomplete gene expression data into co-expressed gene groups, missing value imputation and clustering are generally performed as two separate processes. These two-stage methods are likely to produce imputed values that are unsuitable for the clustering task, and therefore unsatisfactory clustering performance. This paper proposes a multi-objective joint optimization framework for clustering incomplete gene expression data that addresses this problem. The proposed framework imputes the missing expression values under the guidance of clustering, so that imputation and clustering improve together. In addition, gene expression similarity and gene semantic similarity extracted from the Gene Ontology are combined, in the form of a functional neighbor interval for each missing expression value, to provide reasonable constraints for the joint optimization framework. Experiments on several benchmark data sets confirm the effectiveness of the proposed framework.
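
The contrast between two-stage and cluster-guided imputation can be sketched as a simple alternating loop. This is not the multi-objective framework of the paper, and it ignores the Gene Ontology intervals entirely. It is a k-means-style illustration under stated assumptions: missing values are marked with `None`, the first `k` rows seed the centroids, and the function name `impute_and_cluster` is hypothetical.

```python
import math


def impute_and_cluster(rows, k, n_iter=20):
    """Alternate nearest-centroid assignment and centroid-based imputation.

    rows: list of equal-length lists of floats, with None for missing values.
    Returns (filled_rows, labels).
    """
    n_cols = len(rows[0])
    # Initial fill: column mean over the observed values of each column.
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(sum(observed) / len(observed))
    filled = [[col_means[j] if v is None else v for j, v in enumerate(r)]
              for r in rows]

    # Deterministic initialisation (an assumption): first k rows as centroids.
    centroids = [list(filled[i]) for i in range(k)]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: nearest centroid in Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(x, centroids[c]))
                  for x in filled]
        # Update step: recompute each non-empty centroid from its members.
        for c in range(k):
            members = [filled[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Imputation step: only originally missing entries are overwritten,
        # using the value of the centroid the row is currently assigned to.
        for i, r in enumerate(rows):
            for j, v in enumerate(r):
                if v is None:
                    filled[i][j] = centroids[labels[i]][j]
    return filled, labels
```

With a plain column-mean fill, a missing value in a low-expression gene is pulled toward the global mean; in the loop above it drifts toward the values of the genes it is clustered with, which is the intuition behind letting clustering guide imputation.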


Author(s):  
Erliang Zeng ◽  
Chengyong Yang ◽  
Tao Li ◽  
Giri Narasimhan

Clustering of gene expression data is a standard exploratory technique used to identify closely related genes. Many other sources of data are also likely to be of great assistance in the analysis of gene expression data. These data provide a means to begin elucidating the large-scale modular organization of the cell. The authors consider the challenging task of developing exploratory analytical techniques to deal with multiple complete and incomplete information sources. The Multi-Source Clustering (MSC) algorithm they developed performs clustering with multiple, but complete, sources of data. To deal with incomplete data sources, the authors adopted the MPCK-means clustering algorithm to perform exploratory analysis on one complete source together with other, potentially incomplete, sources provided in the form of constraints. This paper presents a new clustering algorithm, MSC, to perform exploratory analysis using two or more diverse but complete data sources; studies the effectiveness of constraint sets and the robustness of the constrained clustering algorithm using multiple sources of incomplete biological data; and incorporates such incomplete data into the constrained clustering algorithm in the form of constraint sets.
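
How a secondary, incomplete source can enter clustering as constraints is sketched below. The sketch uses hard must-link and cannot-link pairs in the style of COP-KMeans, which is a simpler technique than the MPCK-means algorithm named in the abstract (MPCK-means also learns a metric and penalizes violations softly). The function names and the choice of the first `k` points as initial centroids are assumptions for illustration.

```python
import math


def violates(i, c, labels, must_link, cannot_link):
    """True if placing point i in cluster c breaks a constraint against
    a point that has already been labeled in the current pass."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        if other is not None and labels[other] is not None and labels[other] != c:
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and labels[other] == c:
            return True
    return False


def constrained_kmeans(points, k, must_link=(), cannot_link=(), n_iter=10):
    """k-means with hard pairwise constraints (COP-KMeans-style sketch)."""
    centroids = [list(points[i]) for i in range(k)]
    labels = [None] * len(points)
    for _ in range(n_iter):
        labels = [None] * len(points)
        for i, x in enumerate(points):
            # Try clusters from nearest to farthest; take the first feasible one.
            order = sorted(range(k), key=lambda c: math.dist(x, centroids[c]))
            for c in order:
                if not violates(i, c, labels, must_link, cannot_link):
                    labels[i] = c
                    break
            if labels[i] is None:
                raise ValueError("constraints cannot be satisfied for point %d" % i)
        # Recompute each non-empty centroid from its members.
        for c in range(k):
            members = [points[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Constraints only need to cover the gene pairs for which the secondary source has information, which is why this form suits incomplete sources: genes without annotations are simply clustered on expression alone.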


Blood ◽  
2011 ◽  
Vol 118 (21) ◽  
pp. 3465-3465
Author(s):  
Daphne R. Friedman ◽  
Joseph R. Nevins

Abstract 3465 Background: Chronic lymphocytic leukemia (CLL), aggressive B-cell non-Hodgkin lymphomas (NHL), and multiple myeloma (MM) are B-cell malignancies that display biological and clinical heterogeneity. Current investigations into the genetics and biology of these related disorders are using next generation whole genome or exome sequencing. The relatively high cost of these techniques has driven an experimental design in which a small group of samples is initially studied, specific genetic lesions are identified, and then larger cohorts are evaluated for those specific aberrations. Given the biological heterogeneity that is found in each of these disorders, such an approach could skew the direction of research towards results found in a small subset of patients. To determine the extent of genomic heterogeneity within and similarities between CLL, NHL, and MM, and their biologic and clinical relevance, we evaluated publicly available gene expression and single nucleotide polymorphism (SNP) array data from the NCBI Gene Expression Omnibus. Methods: We analyzed 893, 881, and 1744 unique gene expression data files that represent CLL, NHL, and MM, respectively. The gene expression data files represented 15, 11, and 10 distinct data sets, respectively. Prognostic, clinical outcome, and copy number variation data were available for a subset of the samples from each malignancy. Gene expression data were initially normalized using RMA and MAS5 algorithms and batch effect was eliminated using Bayesian Factor Regression Modeling. SNP array data were normalized using Chromosome Copy Number Analysis Tool and amplifications and deletions were identified with circular binary segmentation. Analyses were carried out using Bioconductor packages and the statistical environment R.
Results: After elimination of batch effect, we evaluated the data using random subsampling and unsupervised hierarchical clustering to determine the lowest number of samples required to capture genomic heterogeneity. For CLL and NHL, there was no plateau reached for the number of groups defined by hierarchical clustering up through the total number of samples, indicating that a larger number of samples than available in this study are needed to fully document biological and genomic variability. For MM, there was a plateau reached at approximately 1200 samples. We then used unsupervised hierarchical clustering of the entire dataset for each malignancy to define groups of CLL, NHL, and MM based on their raw gene expression data. To evaluate the biological meaning of the groups defined by this process, we used tools such as Gene Set Enrichment Analysis (GSEA) and oncogenic pathway predictions (ScoreSignatures). Groups within each malignancy that were defined using raw gene expression data had differences in biological pathways involving receptor signaling, cell cycle, and stem cell properties. Notably, similarities in biological annotation were seen between groups that represented the different malignancies. Although prognostic data were not available for all the datasets, there appeared to be no differences in clinical prognostic markers between the genomic-defined groups. However, there were statistically significant differences in molecular prognostic data between these groups. In addition, specific regions of DNA copy number variation were enriched within the different genomic-defined groups. Together, these data highlight the biologic distinctions between groups that are defined by raw gene expression data. For datasets in which clinical outcome data were available, we found that genomic-defined groups had different outcomes such as time to first therapy or overall survival.
However, the groups did not appear to predict response to chemotherapy or chemo-immunotherapy. Conclusions: CLL, NHL, and MM are heterogeneous malignancies, and very large numbers of patients must be studied to fully capture the genomic and biologic diversity that is present. Despite this limitation, evaluation of existing data reveals that subgroups of these disorders are defined by their underlying biology, demonstrate overlap in biological processes, and are clinically relevant. These results have implications for future “omics”-related research. Disclosures: No relevant conflicts of interest to declare.
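
The subsampling analysis described in the Results (does the number of groups stop growing as more samples are added?) can be sketched as follows. The abstract does not state the linkage, the distance, or how the number of groups was read off the dendrogram, and the authors worked in R with Bioconductor. The sketch below is a Python illustration that assumes single linkage, Euclidean distance, and a fixed cut height; the function names are hypothetical.

```python
import math
import random


def count_groups(points, cut):
    """Number of single-linkage clusters when the dendrogram is cut at `cut`.

    Equivalent to counting connected components of the graph that joins
    any two points whose distance is at most `cut`.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cut:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})


def group_count_curve(points, sizes, cut, n_rep=20, seed=0):
    """Mean number of groups over `n_rep` random subsamples of each size."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        counts = [count_groups(rng.sample(points, n), cut) for _ in range(n_rep)]
        curve[n] = sum(counts) / n_rep
    return curve
```

Plotting the returned curve against sample size gives the kind of diagnostic the abstract describes: a curve that flattens suggests the cohort is large enough to capture the groups present, while a curve still rising at the full sample size suggests it is not.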

