A New semi-supervised clustering for incomplete data

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189744 ◽

2021 ◽

pp. 1-13

Author(s):

Sonia Goel ◽

Meena Tushir

Keyword(s):

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Complete Data ◽

Unlabeled Data ◽

Misclassification Rate ◽

Data Sets ◽

Clustering Methods ◽

Data Set ◽

Supervised Clustering

Semi-supervised clustering technique partitions the unlabeled data based on prior knowledge of labeled data. Most of the semi-supervised clustering algorithms exist only for the clustering of complete data, i.e., the data sets with no missing features. In this paper, an effort has been made to check the effectiveness of semi-supervised clustering when applied to incomplete data sets. The novelty of this approach is that it considers the missing features along with available knowledge (labels) of the data set. The linear interpolation imputation technique initially imputes the missing features of the data set, thus completing the data set. A semi-supervised clustering is now employed on this complete data set, and missing features are regularly updated within the clustering process. In the proposed work, the labeled percentage range used is 30, 40, 50, and 60% of the total data. Data is further altered by arbitrarily eliminating certain features of its components, which makes the data incomplete with partial labeling. The proposed algorithm utilizes both labeled and unlabeled data, along with certain missing values in the data. The proposed algorithm is evaluated using three performance indices, namely the misclassification rate, random index metric, and error rate. Despite the additional missing features, the proposed algorithm has been successfully implemented on real data sets and showed better/competing results than well-known standard semi-supervised clustering methods.
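The initial imputation step described above can be sketched as follows. This is only an illustration: the abstract does not say along which axis the interpolation runs, so the sketch assumes each feature column is interpolated over the sample order, with missing entries at the edges copied from the nearest observed value. The names `interpolate_column` and `impute` are hypothetical.

```python
from typing import List, Optional


def interpolate_column(col: List[Optional[float]]) -> List[float]:
    """Fill None entries by linear interpolation between the nearest observed neighbours."""
    known = [i for i, v in enumerate(col) if v is not None]
    if not known:
        raise ValueError("column has no observed values")
    out = list(col)
    for i, v in enumerate(col):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:          # missing at the start: copy the first observed value
            out[i] = col[right]
        elif right is None:       # missing at the end: copy the last observed value
            out[i] = col[left]
        else:                     # interior gap: interpolate on the sample index
            t = (i - left) / (right - left)
            out[i] = col[left] + t * (col[right] - col[left])
    return out


def impute(rows: List[List[Optional[float]]]) -> List[List[float]]:
    """Apply column-wise interpolation to a row-major data set (assumed layout)."""
    cols = [interpolate_column(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

The completed data set could then be passed to any semi-supervised clustering routine; updating the imputed values inside the clustering loop, as the abstract describes, is not shown here.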

Robust K-Median and K-Means Clustering Algorithms for Incomplete Data

Mathematical Problems in Engineering ◽

10.1155/2016/4321928 ◽

2016 ◽

Vol 2016 ◽

pp. 1-8 ◽

Cited By ~ 6

Author(s):

Jinhua Li ◽

Shiji Song ◽

Yuli Zhang ◽

Zhen Zhou

Keyword(s):

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Interval Data ◽

Accurate Estimation ◽

Data Sets ◽

Clustering Methods ◽

Estimation Errors ◽

Feature Values ◽

Time And Space Complexity

Incomplete data with missing feature values are prevalent in clustering problems. Traditional clustering methods first estimate the missing values by imputation and then apply the classical clustering algorithms for complete data, such as K-median and K-means. However, in practice, it is often hard to obtain accurate estimation of the missing values, which deteriorates the performance of clustering. To enhance the robustness of clustering algorithms, this paper represents the missing values by interval data and introduces the concept of robust cluster objective function. A minimax robust optimization (RO) formulation is presented to provide clustering results, which are insensitive to estimation errors. To solve the proposed RO problem, we propose robust K-median and K-means clustering algorithms with low time and space complexity. Comparisons and analysis of experimental results on both artificially generated and real-world incomplete data sets validate the robustness and effectiveness of the proposed algorithms.
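To make the interval representation concrete, the sketch below computes the worst-case squared distance between a fixed cluster center and a point whose features are given as intervals (an observed feature is an interval with equal endpoints). It is not the authors' algorithm: it only illustrates the inner maximization of a minimax objective, using the fact that a squared difference over an interval is largest at one of the endpoints. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound) for one feature


def worst_case_sq_dist(intervals: Sequence[Interval], center: Sequence[float]) -> float:
    """Largest squared Euclidean distance to `center` over all points inside the intervals."""
    return sum(max((lo - c) ** 2, (hi - c) ** 2) for (lo, hi), c in zip(intervals, center))


def robust_assign(intervals: Sequence[Interval], centers: List[Sequence[float]]) -> int:
    """Index of the center whose worst-case distance to the interval point is smallest."""
    costs = [worst_case_sq_dist(intervals, c) for c in centers]
    return min(range(len(centers)), key=costs.__getitem__)
```

For example, a point with first feature observed as 1.0 and second feature known only to lie in [0.0, 4.0] would be written `[(1.0, 1.0), (0.0, 4.0)]`.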

Comparison of Algorithms for Clustering Incomplete Data

Foundations of Computing and Decision Sciences ◽

10.2478/fcds-2014-0007 ◽

2014 ◽

Vol 39 (2) ◽

pp. 107-127 ◽

Cited By ~ 6

Author(s):

Artur Matyja ◽

Krzysztof Siminski

Keyword(s):

Data Analysis ◽

Incomplete Data ◽

Missing Values ◽

Real Data ◽

Complete Data ◽

The Other ◽

Data Sets ◽

Missing Value ◽

Comparison Of Algorithms ◽

New Algorithms

Missing values are not uncommon in real data sets. The algorithms and methods used for the analysis of complete data sets cannot always be applied to data with missing values. To use the existing methods for complete data, data sets with missing values are preprocessed. The other solution to this problem is the creation of new algorithms dedicated to data sets with missing values. The objective of our research is to compare the preprocessing techniques and the specialised algorithms and to find their most advantageous usage.

GRAPH BASED CLUSTERING WITH CONSTRAINTS AND ACTIVE LEARNING

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/37/1/15773 ◽

2021 ◽

Vol 37 (1) ◽

pp. 71-89

Author(s):

Vu-Tuan Dang ◽

Viet-Vu Vu ◽

Hong-Quan Do ◽

Thi Kieu Oanh Le

Keyword(s):

Active Learning ◽

Clustering Algorithm ◽

Side Information ◽

Clustering Algorithms ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Supervised Clustering ◽

Class Labels ◽

Graph Based Clustering

During the past few years, semi-supervised clustering has emerged as a new and interesting direction in machine learning research. In a semi-supervised clustering algorithm, the clustering results can be significantly improved by using side information, which is available or collected from users. There are two main kinds of side information that can be used in semi-supervised clustering algorithms: class labels (called seeds) and pairwise constraints. The first semi-supervised clustering algorithm was introduced in 2000, and since then many algorithms have been presented in the literature. However, it is not easy to use both types of side information in the same algorithm. To address this problem, this paper proposes a semi-supervised graph-based clustering algorithm, called MCSSGC, that uses both seeds and constraints in the clustering process. Moreover, we introduce a simple but efficient active learning method, named KMMFFQS, to collect the constraints that can boost the performance of MCSSGC. To verify the effectiveness of the proposed algorithm, we conducted a series of experiments not only on real data sets from UCI but also on a document data set used in an information extraction system for Vietnamese documents. The results show that the proposed algorithm can significantly improve the clustering process compared to some recent algorithms.
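As a rough illustration of pairwise constraints as side information, the sketch below checks a given cluster assignment against must-link and cannot-link pairs. It is not MCSSGC or KMMFFQS; the function name and the index-pair representation of constraints are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]  # indices of two data points


def violated_constraints(
    assign: Sequence[int],
    must_link: Sequence[Pair],
    cannot_link: Sequence[Pair],
) -> List[Tuple[str, int, int]]:
    """Return every pairwise constraint that the cluster assignment breaks."""
    # A must-link pair is broken when its two points land in different clusters.
    bad = [("must-link", i, j) for i, j in must_link if assign[i] != assign[j]]
    # A cannot-link pair is broken when its two points share a cluster.
    bad += [("cannot-link", i, j) for i, j in cannot_link if assign[i] == assign[j]]
    return bad
```

A constrained clustering algorithm would typically try to keep this list empty, or as short as possible, while forming clusters.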

MODIFIED POSSIBILISTIC FUZZY C-MEANS ALGORITHM FOR CLUSTERING INCOMPLETE DATA SETS

Acta Polytechnica ◽

10.14311/ap.2021.61.0364 ◽

2021 ◽

Vol 61 (2) ◽

pp. 364-377

Author(s):

Rustam ◽

Koredianto Usman ◽

Mudyawati Kamaruddin ◽

Dina Chamidah ◽

Nopendri ◽

...

Keyword(s):

Experimental Data ◽

Incomplete Data ◽

Missing Values ◽

Complete Data ◽

Noise Sensitivity ◽

Data Sets ◽

Fuzzy C Means ◽

Number Of Iterations ◽

Fuzzy C Means Algorithm

A possibilistic fuzzy c-means (PFCM) algorithm is a reliable algorithm proposed to deal with the weaknesses associated with handling noise sensitivity and coincidence clusters in fuzzy c-means (FCM) and possibilistic c-means (PCM). However, the PFCM algorithm is only applicable to complete data sets. Therefore, this research modified the PFCM for clustering incomplete data sets to OCSPFCM and NPSPFCM with the performance evaluated based on three aspects, 1) accuracy percentage, 2) the number of iterations, and 3) centroid errors. The results showed that the NPSPFCM outperforms the OCSPFCM with missing values ranging from 5% − 30% for all experimental data sets. Furthermore, both algorithms provide average accuracies between 97.75%−78.98% and 98.86%−92.49%, respectively.

Missing Values Compensation in Duplicates Detection Using Hot Deck

10.21203/rs.3.rs-390519/v1 ◽

2021 ◽

Author(s):

Abdulrazzak Ali ◽

Nurul A. Emran ◽

Siti A. Asmai

Keyword(s):

Performance Improvement ◽

Incomplete Data ◽

Missing Values ◽

Detection Method ◽

Data Sets ◽

Data Set ◽

Removal Process ◽

Matching Process ◽

A Performance

Duplicate records are a known problem within data sets, especially within databases of huge volume. The accuracy of duplicates detection determines the efficiency of the duplicates removal process. Unfortunately, detecting duplicates becomes more challenging when missing values are present within the records. This is because, during the clustering and matching process, missing values can cause records that are similar to be assigned to the wrong group, leaving the duplicates undetected. In this paper, we present how duplicates detection can be improved even though missing values are present within a data set, using our Duplicates Detection within the Incomplete Data set (DDID) method. We hypothetically add the missing values to the key attributes of two data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets. We analyze the results to evaluate the performance of duplicates detection using the Hot Deck method to compensate for the missing values in the key attributes. We hypothesize that by using Hot Deck, there is a performance improvement in duplicates detection. The performance of DDID is compared with an early duplicates detection method (called DuDe) in terms of accuracy and speed. The findings of the experiment show that, even though the data sets are incomplete, DDID is able to offer better accuracy and faster duplicates detection than the benchmark method DuDe. The results of this study contribute to duplicates detection under the constraint of incomplete data sets.
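The Hot Deck idea mentioned above, filling a missing value with the observed value of a similar "donor" record from the same data set, can be sketched as follows. The abstract does not state how donors are selected in DDID, so this example assumes the donor is the record that agrees on the largest number of other attributes; the names `similarity` and `hot_deck` are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def similarity(a: Record, b: Record, skip: str) -> int:
    """Count attributes (other than `skip`) on which two records agree."""
    return sum(1 for k in a if k != skip and a[k] is not None and a[k] == b.get(k))


def hot_deck(records: List[Record], attr: str) -> List[Record]:
    """Return copies of the records with missing `attr` values taken from the closest donor."""
    donors = [r for r in records if r[attr] is not None]
    filled = []
    for r in records:
        r = dict(r)  # copy so the input records stay unchanged
        if r[attr] is None and donors:
            best = max(donors, key=lambda d: similarity(r, d, attr))
            r[attr] = best[attr]
        filled.append(r)
    return filled
```

After this compensation step, records that differ only by a missing key attribute become comparable again in a later matching stage.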

Missing values compensation in duplicates detection using hot deck method

Journal Of Big Data ◽

10.1186/s40537-021-00502-1 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Abdulrazzak Ali ◽

Nurul A. Emran ◽

Siti A. Asmai

Keyword(s):

Incomplete Data ◽

Missing Values ◽

Detection Method ◽

Detection Performance ◽

Data Sets ◽

Data Set ◽

Duplicate Detection ◽

Removal Process ◽

Matching Process ◽

Study Offer

Duplicate records are a common problem within data sets, especially in huge-volume databases. The accuracy of duplicate detection determines the efficiency of the duplicate removal process. However, duplicate detection has become more challenging due to the presence of missing values within the records: during the clustering and matching process, missing values can cause records deemed similar to be inserted into the wrong group, leading to undetected duplicates. In this paper, an improvement in duplicate detection despite the presence of missing values within a data set was proposed through the Duplicate Detection within the Incomplete Data set (DDID) method. The missing values were hypothetically added to the key attributes of three data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets. The results were analyzed, and the performance of duplicate detection was then evaluated by using the Hot Deck method to compensate for the missing values in the key attributes. It was hypothesized that by using Hot Deck, duplicate detection performance would be improved. Furthermore, the DDID performance was compared to an early duplicate detection method, namely DuDe, in terms of accuracy and speed. The findings showed that even though the data sets were incomplete, DDID was able to offer better accuracy and faster duplicate detection than DuDe. The results of this study offer insights into the constraints of duplicate detection within incomplete data sets.

A Review article on Semi-Supervised Clustering Framework for High Dimensional Data

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit195410 ◽

2019 ◽

pp. 102-108

Author(s):

M. Pavithra ◽

R. M. S. Parvathi

Keyword(s):

Expert Knowledge ◽

Cluster Formation ◽

Clustering Algorithms ◽

Ensemble Member ◽

Unsupervised Clustering ◽

Outcome Variable ◽

High Dimensional ◽

Clustering Methods ◽

Data Set ◽

Supervised Clustering

Cluster analysis methods seek to partition a data set into homogeneous subgroups. They are useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable and nothing is known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features [2]. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as "semi-supervised clustering" methods) that can be applied in these situations [3]. The majority of these methods are modifications of the popular k-means clustering method, and several of them will be described in detail. A brief description of some other semi-supervised clustering algorithms is also provided. Cluster formation is of three types: supervised, unsupervised, and semi-supervised clustering. This paper reviews traditional and state-of-the-art methods of clustering [1]. The clustering algorithms covered are based on active learning, ensemble clustering-means algorithms, data streams with flock, fuzzy clustering for shape annotations, incremental semi-supervised clustering, weakly supervised clustering with minimum labeled data, and self-organizing neural networks. The incremental semi-supervised clustering ensemble framework (ISSCE) makes use of the arbitrary subspace method, the limitation spread approach, the proposed incremental ensemble member choice process, and the normalized cut algorithm to perform high-dimensional data clustering [4].
Semi-supervised clustering employs limited supervision in the form of labeled instances or pairwise instance constraints to aid unsupervised clustering and often significantly improves the clustering performance. Despite the vast amount of expert knowledge spent on this problem, most existing work is not designed for handling high-dimensional sparse data.
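One well-known way of modifying k-means with labeled instances is seeding: the initial centroids are computed from the labeled points instead of being chosen at random, and ordinary k-means iterations follow. The sketch below illustrates that idea only; it is not taken from the review, and the function names, the dictionary layout of `seeds`, and the fixed iteration count are assumptions.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Point = Tuple[float, ...]


def mean(points: Sequence[Point]) -> Point:
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(
    points: List[Point],
    seeds: Dict[Hashable, List[Point]],
    iters: int = 20,
) -> Tuple[list, Dict[Hashable, Point]]:
    """k-means whose initial centroids are the means of the labeled seed points."""
    labels = sorted(seeds)
    centers = {c: mean(seeds[c]) for c in labels}
    assign: list = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest current centroid.
        assign = [min(labels, key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: recompute each centroid from its members (kept if empty).
        for c in labels:
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = mean(members)
    return assign, centers
```

In this variant the seed labels only guide initialization; stricter variants keep the labeled points fixed in their clusters throughout the iterations.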

Robust clustering and interpretation of scRNA-seq data using reference component analysis

10.1101/2021.02.16.431527 ◽

2021 ◽

Author(s):

Florian Schmidt ◽

Bobby Ranjan ◽

Quy Xiao Xuan Lin ◽

Vaidehi Krishnan ◽

Ignasius Joanito ◽

...

Keyword(s):

Single Cell ◽

De Novo ◽

Clustering Algorithms ◽

Cell Types ◽

Unsupervised Clustering ◽

Data Sets ◽

Clustering Methods ◽

Robust Clustering ◽

Supervised Clustering ◽

Downstream Analysis

Motivation: The transcriptomic diversity of the hundreds of cell types in the human body can be analysed in unprecedented detail using single cell (SC) technologies. Though clustering of cellular transcriptomes is the default technique for defining cell types and subtypes, single cell clustering can be strongly influenced by technical variation. In fact, the prevalent unsupervised clustering algorithms can cluster cells by technical, rather than biological, variation. Results: Compared to de novo (unsupervised) clustering methods, we demonstrate using multiple benchmarks that supervised clustering, which uses reference transcriptomes as a guide, is robust to batch effects. To leverage the advantages of supervised clustering, we present RCA2, a new, scalable, and broadly applicable version of our RCA algorithm. RCA2 provides a user-friendly framework for supervised clustering and downstream analysis of large scRNA-seq data sets. RCA2 can be seamlessly incorporated into existing algorithmic pipelines. It incorporates various new reference panels for human and mouse, supports generation of custom panels and uses efficient graph-based clustering and sparse data structures to ensure scalability. We demonstrate the applicability of RCA2 on SC data from human bone marrow, healthy PBMCs and PBMCs from COVID-19 patients. Importantly, RCA2 facilitates cell-type-specific QC, which we show is essential for accurate clustering of SC data from heterogeneous tissues. In the era of cohort-scale SC analysis, supervised clustering methods such as RCA2 will facilitate unified analysis of diverse SC datasets. Availability: RCA2 is implemented in R and is available at github.com/prabhakarlab/RCAv2

A COMPARATIVE ANALYSIS OF K-MEANS AND HIERARCHICAL CLUSTERING

EPRA International Journal of Multidisciplinary Research (IJMR) ◽

10.36713/epra8308 ◽

2021 ◽

pp. 412-418

Author(s):

Aastha Gupta ◽

Himanshu Sharma ◽

Anas Akhtar

Keyword(s):

Data Mining ◽

Hierarchical Clustering ◽

Clustering Algorithms ◽

Analytical Techniques ◽

Data Sets ◽

Clustering Methods ◽

Data Set ◽

Advantages And Disadvantages ◽

The Many ◽

Data Elements

Clustering is the process of arranging comparable data elements into groups. Clustering analysis is one of the most frequently used analytical techniques in data mining, and the strategy of the clustering algorithm has a direct influence on the clustering results. This study examines several types of algorithms, such as k-means clustering algorithms, and compares and contrasts their advantages and disadvantages. This paper also highlights concerns with clustering algorithms, such as time complexity and accuracy, in order to give better outcomes in a variety of environments. The outcomes are described in terms of big data sets. The focus of this study is on clustering algorithms with the WEKA data mining tool. Clustering is the process of dividing a big data set into small groups or clusters. It is an unsupervised approach that may be used to analyze big data sets with many characteristics, and a data-modeling technique that provides a clear image of the data. Two clustering methods, k-means and hierarchical clustering, are explained in this survey and analyzed using the WEKA tool on different data sets. KEYWORDS: data clustering, WEKA, k-means, hierarchical clustering

CHOOSING SEEDS FOR SEMI-SUPERVISED GRAPH BASED CLUSTERING

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/35/4/14123 ◽

2019 ◽

Vol 35 (4) ◽

pp. 373-384

Author(s):

Cuong Le ◽

Viet Vu Vu ◽

Le Thi Kieu Oanh ◽

Nguyen Thi Hai Yen

Keyword(s):

Learning Algorithm ◽

Side Information ◽

Clustering Algorithms ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Supervised Clustering ◽

Efficient Data ◽

Graph Based Clustering

Though clustering algorithms have a long history, clustering still attracts a lot of attention because of the need for efficient data analysis tools in many applications such as social networks, electronic commerce, GIS, etc. Recently, semi-supervised clustering, which uses side information, has received a great deal of attention; examples include semi-supervised K-Means, semi-supervised DBSCAN, and semi-supervised graph-based clustering (SSGC). Generally, there are two forms of side information: the seed form (labeled data) and the constraint form (must-link, cannot-link). By integrating information provided by the user or a domain expert, semi-supervised clustering can produce the expected results. In fact, clustering results usually depend on the side information provided, so different side information will produce different clustering results. In some cases, the performance of clustering may decrease if the side information is not carefully chosen. This paper addresses the problem of efficiently collecting seeds for semi-supervised clustering, especially for graph-based clustering by seeding (SSGC). Properly collected seeds can boost the quality of clustering and minimize the number of queries solicited from the user. For this purpose, we have developed an active learning algorithm (called SKMMM) for the seed collection task, which identifies candidates to present to users by using the K-Means and min-max algorithms. Experiments conducted on real data sets from UCI and a real collected document data set show the effectiveness of our approach compared with other methods.
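The min-max step named above is commonly realized as a farthest-first traversal: each new candidate is the point whose minimum distance to the already chosen points is largest, which spreads the queries across the data. The sketch below shows only this generic selection rule; how SKMMM combines it with K-Means is not described in the abstract, and the function names and the `start` parameter are assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points: Sequence[Point], k: int, start: int = 0) -> List[int]:
    """Pick k point indices; each new pick maximizes its minimum distance to earlier picks."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # min_d[i] holds the distance from point i to its nearest chosen point.
    min_d = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_d.__getitem__)
        chosen.append(nxt)
        min_d = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return chosen
```

The returned indices would be the candidates whose labels are requested from the user and then used as seeds.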
