A Lagrangian-based score for assessing the quality of pairwise constraints in semi-supervised clustering

Author(s):  
Rodrigo Randel ◽  
Daniel Aloise ◽  
Simon J. Blanchard ◽  
Alain Hertz
2020 ◽  
Vol 34 (10) ◽  
pp. 13863-13864
Author(s):  
Ting-En Lin ◽  
Hua Xu ◽  
Hanlei Zhang

Discovering new user intents is an emerging task in dialogue systems. In this paper, we propose a self-supervised clustering method that can naturally incorporate pairwise constraints as prior knowledge to guide the clustering process and does not require intensive feature engineering. Extensive experiments on three benchmark datasets show that our method can yield significant improvements over strong baselines.


Author(s):  
Yukihiro Hamasuna ◽  
Yasunori Endo ◽  
Sadaaki Miyamoto ◽  

This paper presents a semi-supervised agglomerative hierarchical clustering algorithm using clusterwise tolerance based pairwise constraints. In semi-supervised clustering, pairwise constraints, that is, must-link and cannot-link, are frequently used in order to improve clustering properties. From that viewpoint, we propose another approach, named clusterwise tolerance based pairwise constraints, to handle must-link and cannot-link constraints in L2-space. In addition, we propose a semi-supervised agglomerative hierarchical clustering algorithm based on it. Moreover, we show the effectiveness of the proposed method through numerical examples.


2019 ◽  
Vol 123 ◽  
pp. 101715
Author(s):  
Md Abdul Masud ◽  
Joshua Zhexue Huang ◽  
Ming Zhong ◽  
Xianghua Fu

Author(s):  
Walid Atwa ◽  
Abdulwahab Ali Almazroi

Semi-supervised clustering algorithms aim to improve clustering performance by using pairwise constraints. However, selecting these constraints randomly or improperly can degrade clustering performance in certain situations and applications. In this paper, we select the most informative constraints to improve semi-supervised clustering algorithms. We present an active selection of constraints, including active must-link (AML) and active cannot-link (ACL) constraints. Based on a radial basis function, we compute lower and upper bounds between data points to select the constraints that improve performance. We test the proposed algorithm against baseline methods and show that our proposed active pairwise constraints outperform other algorithms.


2019 ◽  
Vol 35 (4) ◽  
pp. 373-384
Author(s):  
Cuong Le ◽  
Viet Vu Vu ◽  
Le Thi Kieu Oanh ◽  
Nguyen Thi Hai Yen

Though clustering algorithms have a long history, clustering still attracts a lot of attention because of the need for efficient data analysis tools in many applications such as social networks, electronic commerce, GIS, etc. Recently, semi-supervised clustering, for example semi-supervised K-Means, semi-supervised DBSCAN, and semi-supervised graph-based clustering (SSGC), which uses side information, has received a great deal of attention. Generally, there are two forms of side information: seed form (labeled data) and constraint form (must-link, cannot-link). By integrating information provided by the user or a domain expert, semi-supervised clustering can produce the expected results. In fact, clustering results usually depend on the side information provided, so different side information will produce different clustering results. In some cases, clustering performance may decrease if the side information is not carefully chosen. This paper addresses the problem of efficiently collecting seeds for semi-supervised clustering, especially for graph-based clustering by seeding (SSGC). Properly collected seeds can boost the quality of clustering and minimize the number of queries solicited from the user. For this purpose, we have developed an active learning algorithm (called SKMMM) for the seed collection task, which identifies candidates to present to users by using the K-Means and min-max algorithms. Experiments conducted on real data sets from UCI and a real collected document data set show the effectiveness of our approach compared with other methods.
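The min-max step mentioned in the abstract above can be illustrated with a farthest-first traversal, which repeatedly picks the point whose distance to its nearest already-selected point is largest. The sketch below is only an illustration of that general idea, not the SKMMM algorithm: the abstract does not say how min-max is combined with K-Means, so the function name `min_max_select`, its parameters, and the choice to run directly on raw points are all assumptions.

```python
import math


def min_max_select(points, n_queries, start=0):
    """Farthest-first (min-max) traversal over a list of coordinate tuples.

    Hypothetical helper: returns the indices of the points to query,
    beginning with `start`.
    """
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < min(n_queries, len(points)):
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        if nearest[nxt] == 0.0:
            break  # only duplicates of already-selected points remain
        selected.append(nxt)
        nearest = [min(d, math.dist(p, points[nxt]))
                   for d, p in zip(nearest, points)]
    return selected


# Two tight groups plus one isolated point (made-up toy data).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 5.0)]
queries = min_max_select(points, 3)
print(queries)
```

On this toy data the three selected indices fall in three different regions, which is the property that makes min-max selection attractive when each query to the user is costly.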


Author(s):  
Yukihiro Hamasuna ◽  
Yasunori Endo ◽  

This paper presents a new semi-supervised agglomerative hierarchical clustering algorithm with the Ward method using clusterwise tolerance. Semi-supervised clustering has recently been noted and studied in many research fields. Must-link and cannot-link, called pairwise constraints, are frequently used in order to improve clustering properties in semi-supervised clustering. First, clusterwise tolerance based pairwise constraints are introduced in order to handle must-link and cannot-link constraints. Next, a new semi-supervised hierarchical clustering algorithm with the Ward method is constructed based on the above discussions. Moreover, the effectiveness of the proposed algorithms is verified through numerical examples.


Author(s):  
Bojun Yan

As a recently emerging technique, semi-supervised clustering has attracted significant research interest. Compared to traditional clustering algorithms, which only use unlabeled data, semi-supervised clustering employs both unlabeled and supervised data to obtain a partitioning that conforms more closely to the user’s preferences. Several recent papers have discussed this problem (Cohn, Caruana, & McCallum, 2003; Bar-Hillel, Hertz, Shental, & Weinshall, 2003; Xing, Ng, Jordan, & Russell, 2003; Basu, Bilenko, & Mooney, 2004; Kulis, Dhillon, & Mooney, 2005). In semi-supervised clustering, limited supervision is provided as input. The supervision can have the form of labeled data or pairwise constraints. In many applications it is natural to assume that pairwise constraints are available (Bar-Hillel, Hertz, Shental, & Weinshall, 2003; Wagstaff, Cardie, Rogers, & Schroedl, 2001). For example, in protein interaction and gene expression data (Segal, Wang, & Koller, 2003), pairwise constraints can be derived from the background domain knowledge. Similarly, in information and image retrieval, it is easy for the user to provide feedback concerning a qualitative measure of similarity or dissimilarity between pairs of objects. Thus, in these cases, although class labels may be unknown, a user can still specify whether pairs of points belong to the same cluster (Must-Link) or to different ones (Cannot-Link). Furthermore, a set of classified points implies an equivalent set of pairwise constraints, but not vice versa. Recently, a kernel method for semi-supervised clustering has been introduced (Kulis, Dhillon, & Mooney, 2005). This technique extends semi-supervised clustering to a kernel space, thus enabling the discovery of clusters with non-linear boundaries in input space. While a powerful technique, the applicability of a kernel-based semi-supervised clustering approach is limited in practice, due to the critical settings of the kernel’s parameters. In fact, the chosen parameter values can largely affect the quality of the results. While solutions have been proposed in supervised learning to estimate the optimal kernel parameters, the problem presents open challenges when no labeled data are provided, and all we have available is a set of pairwise constraints.
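The observation above that classified points imply a set of pairwise constraints can be made concrete with a small sketch. It is an illustration only: the function name `constraints_from_labels` and the dictionary input format are assumptions, not something taken from the cited work.

```python
from itertools import combinations


def constraints_from_labels(labels):
    """Derive pairwise constraints from labeled points.

    `labels` maps a point id to its class label (hypothetical format).
    Returns (must_link, cannot_link) as lists of id pairs.
    """
    must_link, cannot_link = [], []
    for (i, label_i), (j, label_j) in combinations(sorted(labels.items()), 2):
        if label_i == label_j:
            must_link.append((i, j))    # same class: Must-Link
        else:
            cannot_link.append((i, j))  # different classes: Cannot-Link
    return must_link, cannot_link


labels = {0: "a", 1: "a", 2: "b"}
must_link, cannot_link = constraints_from_labels(labels)
print(must_link)    # [(0, 1)]
print(cannot_link)  # [(0, 2), (1, 2)]
```

The reverse direction fails because the constraints only record whether two points share a cluster; the class names "a" and "b" cannot be recovered from the pairs alone.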

