A survey on Optimisation-based Semi-supervised Clustering Methods

Cluster analysis seeks to partition a data set into homogeneous subgroups. It is useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised: there is no outcome variable, and nothing is known about the relationships between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features [2]. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as "semi-supervised clustering" methods) that can be applied in these situations [3]. The majority of these methods are modifications of the popular k-means clustering method, and several of them are described in detail. A brief description of some other semi-supervised clustering algorithms is also provided.
Clustering can be divided into three types: supervised, unsupervised and semi-supervised. This paper reviews traditional and state-of-the-art clustering methods [1]. The approaches covered include clustering based on active learning, the ensemble clustering-means algorithm, data streams with flock, fuzzy clustering for shape annotations, incremental semi-supervised clustering, weakly supervised clustering with minimal labeled data, and self-organizing methods based on neural networks. The incremental semi-supervised clustering ensemble framework (ISSCE) combines the random subspace technique, the constraint propagation approach, the proposed incremental ensemble member selection process, and the normalized cut algorithm to cluster high-dimensional data [4].
Semi-supervised clustering employs limited supervision, in the form of labeled instances or pairwise instance constraints, to aid unsupervised clustering, and it often significantly improves clustering performance. Despite the considerable expert effort devoted to this problem, most existing work is not designed for handling high-dimensional sparse data.
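As an illustration of how labeled instances can guide a k-means-style method, the following is a minimal sketch of a seeded, constrained k-means variant. It is not taken from the surveyed papers: the function name `seeded_kmeans`, the convention of marking unlabeled rows with -1, and the toy data are all assumptions made for this example.

```python
import numpy as np

def seeded_kmeans(X, seed_labels, n_clusters, n_iter=50):
    """Seeded k-means sketch (hypothetical helper, not a library API).

    seed_labels holds a cluster index for labeled rows and -1 for unlabeled rows.
    Assumes every cluster has at least one labeled seed.
    """
    X = np.asarray(X, dtype=float)
    seed_labels = np.asarray(seed_labels)
    known = seed_labels >= 0
    # Initialise each centroid as the mean of its labeled seeds.
    centroids = np.stack([X[seed_labels == k].mean(axis=0) for k in range(n_clusters)])
    labels = seed_labels.copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Supervision step: labeled points always keep their known cluster.
        new_labels[known] = seed_labels[known]
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute centroids; no cluster is empty because its seeds stay in it.
        centroids = np.stack([X[labels == k].mean(axis=0) for k in range(n_clusters)])
    return labels, centroids

# Toy data: two well-separated groups, one labeled seed per group.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
seeds = np.array([0, -1, -1, 1, -1, -1])
labels, centroids = seeded_kmeans(X, seeds, n_clusters=2)
print(labels)
```

Fixing the labeled points to their known clusters is one simple way to use supervision; pairwise must-link and cannot-link constraints would require a different assignment step.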

Comparison of supervised clustering methods to discriminate genotoxic from non-genotoxic carcinogens by gene expression profiling

Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 2005, Vol 575 (1-2), pp. 17-33

DOI: 10.1016/j.mrfmmm.2005.02.006

Cited By ~ 51

Author(s): J.H.M. van Delft, E. van Agen, S.G.J. van Breda, M.H. Herwijnen, Y.C.M. Staal, ...

Keyword(s): Gene Expression, Gene Expression Profiling, Expression Profiling, Clustering Methods, Supervised Clustering, Genotoxic Carcinogens

Robust clustering and interpretation of scRNA-seq data using reference component analysis

DOI: 10.1101/2021.02.16.431527

2021

Author(s): Florian Schmidt, Bobby Ranjan, Quy Xiao Xuan Lin, Vaidehi Krishnan, Ignasius Joanito, ...

Keyword(s): Single Cell, De Novo, Clustering Algorithms, Cell Types, Unsupervised Clustering, Data Sets, Clustering Methods, Robust Clustering, Supervised Clustering, Downstream Analysis

Motivation: The transcriptomic diversity of the hundreds of cell types in the human body can be analysed in unprecedented detail using single cell (SC) technologies. Though clustering of cellular transcriptomes is the default technique for defining cell types and subtypes, single cell clustering can be strongly influenced by technical variation. In fact, the prevalent unsupervised clustering algorithms can cluster cells by technical, rather than biological, variation.
Results: Compared to de novo (unsupervised) clustering methods, we demonstrate using multiple benchmarks that supervised clustering, which uses reference transcriptomes as a guide, is robust to batch effects. To leverage the advantages of supervised clustering, we present RCA2, a new, scalable, and broadly applicable version of our RCA algorithm. RCA2 provides a user-friendly framework for supervised clustering and downstream analysis of large scRNA-seq data sets. RCA2 can be seamlessly incorporated into existing algorithmic pipelines. It incorporates various new reference panels for human and mouse, supports generation of custom panels and uses efficient graph-based clustering and sparse data structures to ensure scalability. We demonstrate the applicability of RCA2 on SC data from human bone marrow, healthy PBMCs and PBMCs from COVID-19 patients. Importantly, RCA2 facilitates cell-type-specific QC, which we show is essential for accurate clustering of SC data from heterogeneous tissues. In the era of cohort-scale SC analysis, supervised clustering methods such as RCA2 will facilitate unified analysis of diverse SC datasets.
Availability: RCA2 is implemented in R and is available at github.com/prabhakarlab/RCAv2
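The general idea of using reference transcriptomes as a guide can be sketched as follows. This is not the RCA2 implementation (which is in R) and does not reproduce its algorithm: the function name `assign_to_reference`, the use of plain Pearson correlation, and the toy expression values are assumptions chosen only to illustrate reference-guided grouping.

```python
import numpy as np

def assign_to_reference(cells, reference):
    """Assign each cell to its best-correlated reference profile (illustrative sketch).

    cells:     array of shape (n_cells, n_genes)
    reference: array of shape (n_types, n_genes), same gene order as cells
    Assumes no profile is constant across genes (that would give zero variance).
    """
    cells = np.asarray(cells, dtype=float)
    reference = np.asarray(reference, dtype=float)

    def standardise(m):
        # Centre each row and scale it to unit length, so that a dot product
        # between two rows equals their Pearson correlation.
        m = m - m.mean(axis=1, keepdims=True)
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    corr = standardise(cells) @ standardise(reference).T
    return corr.argmax(axis=1), corr

# Toy example: two hypothetical reference cell types over four genes.
reference = np.array([[10.0, 0.0, 0.0, 1.0],
                      [0.0, 8.0, 9.0, 1.0]])
cells = np.array([[9.0, 1.0, 0.0, 2.0],
                  [1.0, 7.0, 10.0, 0.0],
                  [20.0, 1.0, 1.0, 3.0]])
assigned, corr = assign_to_reference(cells, reference)
print(assigned)
```

Because correlation ignores the overall scale of each profile, the third cell, with roughly double the counts of the first, still maps to the same reference type. This gives some intuition for why reference-guided approaches can be less sensitive to certain technical effects.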

Supervised Regression Clustering

International Journal of Business Analytics, 2016, Vol 3 (4), pp. 21-40

DOI: 10.4018/ijban.2016100102

Cited By ~ 1

Author(s): Ali Fallah Tehrani, Diane Ahrens

Keyword(s): Supervised Learning, Data Analytics, Apparel Industry, Clustering Methods, Specific Behavior, Clustering Techniques, Real Dataset, Supervised Clustering, Fashion Products

Clustering techniques typically group similar instances based on their individual attributes, on the assumption that similar instances have similar attribute characteristics. In contrast, clustering similar instances with respect to a specific behavior is framed as a supervised learning problem: for instance, identifying which fashion products behave similarly in terms of sales. Unfortunately, conventional clustering methods cannot tackle this case, since they treat all attributes in the same manner. In fact, conventional clustering approaches do not consider any response, and moreover they assume that all attributes are equally important. However, clustering instances with respect to responses leads to better data analytics. In this research, the authors introduce an approach for supervised clustering and show its advantage in terms of data analytics as well as prediction. To verify the feasibility and the performance of this approach, the authors conducted several experiments on a real dataset from the apparel industry.

Electricity consumption pattern analysis beyond traditional clustering methods: A novel self-adapting semi-supervised clustering method and application case study

Applied Energy, 2022, Vol 308, pp. 118335

DOI: 10.1016/j.apenergy.2021.118335

Author(s): Xiaohai Zhang, José Luis Ramírez-Mendiola, Mingtao Li, Liejin Guo

Keyword(s): Pattern Analysis, Electricity Consumption, Consumption Pattern, Clustering Methods, Clustering Method, Supervised Clustering

Comparison of Supervised Clustering Methods for the Analysis of DNA Microarray Expression Data

Agricultural Sciences in China, 2008, Vol 7 (2), pp. 129-139

DOI: 10.1016/s1671-2927(08)60032-2

Author(s): Jing XIAO, Xue-feng WANG, Ze-feng YANG, Chen-wu XU

Keyword(s): DNA Microarray, Expression Data, Clustering Methods, Supervised Clustering, Microarray Expression Data, Microarray Expression

Multi-scale semi-supervised clustering of brain images: deriving disease subtypes

DOI: 10.1101/2021.04.19.440501

2021

Author(s): Junhao WEN, Erdem Varol, Aristeidis Sotiras, Zhijian Yang, Ganesh B. Chand, ...

Keyword(s): Control Sample, Optimization Procedure, Brain Diseases, Imaging Data, Clustering Methods, Disease Heterogeneity, Supervised Clustering, Multi Scale, Clustering Model, Disease Subtypes

Disease heterogeneity is a significant obstacle to understanding pathological processes and delivering precision diagnostics and treatment. Clustering methods have gained popularity in stratifying patients into subpopulations (i.e., subtypes) of brain diseases using imaging data. However, unsupervised clustering approaches are often confounded by anatomical and functional variations not related to a disease or pathology of interest. Semi-supervised clustering techniques have been proposed to overcome this and, therefore, capture disease-specific patterns more effectively. An additional limitation of both unsupervised and semi-supervised conventional machine learning methods is that they typically model, learn and infer from data on the basis of feature sets pre-defined at a fixed scale or scales (e.g., atlas-based regions of interest). Herein we propose a novel method, Multi-scAle heteroGeneity analysIs and Clustering (MAGIC), to depict the multi-scale presentation of disease heterogeneity, which builds on a previously proposed semi-supervised clustering method, HYDRA. It derives multi-scale and clinically interpretable feature representations and exploits a double-cyclic optimization procedure to drive inter-scale-consistent disease subtypes or neuroanatomical dimensions effectively. More importantly, to fill the gap in understanding under what conditions the clustering model can estimate true heterogeneity related to diseases, we conducted extensive and systematic semi-simulated experiments to evaluate the proposed method on a sizeable healthy control sample from the UK Biobank (N=4403). We then applied MAGIC to real imaging data of Alzheimer's disease (ADNI, N=1728) to demonstrate its potential and challenges in dissecting the neuroanatomical heterogeneity of brain diseases. Taken together, we aim to provide guidelines on when such analyses can succeed or should be taken with caution.
The code of the proposed method is publicly available at https://github.com/anbai106/MAGIC.
