Evaluating the Quality of Clustering Algorithms Using Cluster Path Lengths

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for the clustering, analysis, and visualization of mixed data (continuous/binary). The weights and prototypes are learned simultaneously, ensuring an optimized data clustering. The higher a variable's weight, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process for the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set, and three other mixed data sets. The results show a good quality of topological ordering and homogeneous clustering.

Maintainability Evaluation of Object-Oriented Software System Using Clustering Techniques

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v5i2.3535 ◽

2013 ◽

Vol 5 (2) ◽

pp. 136-143 ◽

Cited By ~ 1

Author(s):

Astha Mehra ◽

Sanjay Kumar Dubey

Keyword(s):

Clustering Algorithms ◽

Daily Basis ◽

Computer Assisted ◽

Program Execution ◽

Huge Data ◽

Large Databases ◽

Input Dataset ◽

Data Objects

In today's world, data is produced every day at a phenomenal rate, and we are required to store this ever-growing data on an almost daily basis. Although our ability to store this huge volume of data has grown, the problem arises when users expect sophisticated information from it. This can be achieved by uncovering the hidden information in the raw data, which is the purpose of data mining. Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting meaning from them. The raw, unlabeled data present in large databases can be classified initially in an unsupervised manner by means of cluster analysis. Cluster analysis is the process of finding groups of objects such that the objects in a group are similar to one another and dissimilar from the objects in other groups. These groups are known as clusters. In other words, clustering is the process of organizing data objects into groups whose members are similar in some way. Applications of clustering include marketing (finding groups of customers with similar behavior), biology (classifying plants and animals by their features), data analysis, earthquake studies (observing earthquake epicenters to identify dangerous zones), and the WWW (document classification). The results and efficiency of the clustering process are generally determined through various clustering algorithms. The aim of this research paper is to compare two important clustering algorithms, namely centroid-based K-means and X-means. The performance of the algorithms is evaluated over different program executions on the same input dataset, and is analyzed and compared on the basis of the quality of the clustering output, the number of iterations, and cut-off factors.
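As a rough illustration of the kind of measurement the abstract describes, the following minimal sketch runs plain Lloyd-style K-means and reports how many iterations it needed to converge. It is not the authors' code: the function names `sq_dist` and `kmeans`, the random initialization, and the `max_iter` and `seed` parameters are all assumptions made for this example, and X-means is not shown.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd-style k-means (illustrative sketch only).

    Returns (centroids, labels, iterations_used).
    """
    rng = random.Random(seed)
    # Assumption: initial centroids are k data points drawn at random.
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            # No assignment changed, so the algorithm has converged.
            return centroids, labels, iteration
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels, max_iter


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels, iterations = kmeans(points, k=2)
print(labels, iterations)
```

Running the same function several times with different `seed` values on one dataset gives a simple way to compare iteration counts across program executions.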

An Approach to Artificial Concept Learning Based on Human Concept Learning by Using Artificial Neural Networks

Advancing Artificial Intelligence through Biological Process Applications ◽

10.4018/978-1-59904-996-0.ch008 ◽

2011 ◽

pp. 130-145

Author(s):

Enrique Mérida-Casermeiro ◽

Domingo López-Rodríguez ◽

J.M. Ortiz-de-Lazcano-Lobato

Keyword(s):

Neural Networks ◽

Artificial Neural Networks ◽

Associative Memory ◽

Concept Learning ◽

Energy Function ◽

Hebbian Learning ◽

Clustering Algorithms ◽

Network Capacity ◽

New Model

In this chapter, two important issues concerning associative memory in neural networks are studied: a new model of Hebbian learning, and the effect of network capacity when retrieving patterns and performing clustering tasks. In particular, an explanation is given of the behavior of the energy function when the capacity is exceeded: the limitation in pattern storage implies that similar patterns are going to be identified by the network, therefore forming different clusters. This ability can be interpreted as unsupervised learning of pattern clusters, with one major advantage over most clustering algorithms: the number of data classes is learned automatically, as confirmed by the experiments. Two methods to reinforce learning are proposed to improve the quality of the clustering by enhancing the learning of pattern relationships. As a related issue, a study of the network capacity, depending on the number of neurons and possible outputs, is presented, and some interesting conclusions are discussed.

Multi-View Clustering in Latent Embedding Space

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5756 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3513-3520 ◽

Cited By ~ 2

Author(s):

Man-Sheng Chen ◽

Ling Huang ◽

Chang-Dong Wang ◽

Dong Huang

Keyword(s):

Structure Learning ◽

Clustering Algorithms ◽

Feature Space ◽

Global Structure ◽

Optimization Scheme ◽

Optimization Framework ◽

Novel Approach ◽

Indicator Matrix ◽

Original Feature

Previous multi-view clustering algorithms mostly partition the multi-view data in their original feature space, the efficacy of which heavily and implicitly relies on the quality of the original feature presentation. In light of this, this paper proposes a novel approach termed Multi-view Clustering in Latent Embedding Space (MCLES), which is able to cluster the multi-view data in a learned latent embedding space while simultaneously learning the global structure and the cluster indicator matrix in a unified optimization framework. Specifically, in our framework, a latent embedding representation that can effectively exploit the complementary information from different views is first discovered. The global structure learning is then performed based on the learned latent embedding representation. Further, the cluster indicator matrix can be acquired directly from the learned global structure. An alternating optimization scheme is introduced to solve the optimization problem. Extensive experiments conducted on several real-world multi-view datasets have demonstrated the superiority of our approach.

Enriching Documents by Linking Salient Entities and Lexical-Semantic Expansion

Journal of Intelligent Systems ◽

10.1515/jisys-2018-0098 ◽

2018 ◽

Vol 29 (1) ◽

pp. 1109-1121

Author(s):

Mohsen Pourvali ◽

Salvatore Orlando

Keyword(s):

Clustering Algorithms ◽

Ensemble Clustering ◽

British Broadcasting Corporation ◽

Text Documents ◽

Classical Text ◽

Text Corpora ◽

Clustering Quality ◽

Semantic Expansion ◽

Document Representations

This paper explores a multi-strategy technique that aims at enriching text documents for improving clustering quality. We use a combination of entity linking and document summarization in order to determine the identity of the most salient entities mentioned in texts. To effectively enrich documents without introducing noise, we limit ourselves to the text fragments mentioning the salient entities, in turn, belonging to a knowledge base like Wikipedia, while the actual enrichment of text fragments is carried out using WordNet. To feed clustering algorithms, we investigate different document representations obtained using several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering, by combining multiple clustering results obtained using different document representations. Our experiments indicate that our novel enriching strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora like The British Broadcasting Corporation (BBC) NEWS.

Clustering Algorithms for Direct Current Track Coded Signals

2019 Joint Rail Conference ◽

10.1115/jrc2019-1300 ◽

2019 ◽

Author(s):

Song Qin ◽

Nenad Mijatovic ◽

Jeffrey Fries ◽

James Kiss

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

High Availability ◽

Digital Analysis ◽

Signaling Systems ◽

Track Circuits ◽

Service Conditions ◽

Track Circuit ◽

Fail Safe

Designed for detecting train presence on tracks, track circuits must maintain a high level of availability for railway signaling systems. Due to the fail-safe nature of these critical devices, any failure will result in a declaration of occupancy in a section of track, which restricts train movements. It is possible to automatically diagnose and, in some cases, predict the failures of track circuits by performing analytics on the track signals. In order to perform these analytics, we need to study the coded signals transmitted to and received from the track. However, these signals consist of heterogeneous pulses that are noisy for data analysis. Thus, we need techniques that automatically group similar pulses into homogeneous groups. In this paper, we present data cleansing techniques that cluster pulses based on digital analysis and machine learning. We report the results of our evaluation of clustering algorithms that improve the quality of analytic data. The data were captured under revenue service conditions operated by Alstom. As the clustering algorithm, we used k-means to cluster the heterogeneous pulses. By tailoring the parameters for this algorithm, we can control the pulses of the cluster, allowing for further analysis of the track circuit signals in order to gain insight regarding their performance.

Understanding and Enhancement of Internal Clustering Validation Indexes for Categorical Data

Algorithms ◽

10.3390/a11110177 ◽

2018 ◽

Vol 11 (11) ◽

pp. 177 ◽

Cited By ~ 2

Author(s):

Xuedong Gao ◽

Minghan Yang

Keyword(s):

Machine Learning ◽

Categorical Data ◽

Data Clustering ◽

Information Gain ◽

Clustering Algorithms ◽

Number Of Clusters ◽

Cluster Compactness ◽

Clustering Validation ◽

Categorical Data Clustering

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) are used to measure the quality of several clustered partitions to determine the local optimal clustering results in an unsupervised manner, and can act as the objective function of clustering algorithms. In this paper, we first studied several well-known internal CVIs for categorical data clustering, and proved the ineffectiveness of evaluating the partitions of different numbers of clusters without any inter-cluster separation measures or assumptions; the accurateness of separation, along with its coordination with the intra-cluster compactness measures, can notably affect performance. Then, aiming to enhance the internal clustering validation measurement, we proposed a new internal CVI—clustering utility based on the averaged information gain of isolating each cluster (CUBAGE)—which measures both the compactness and the separation of the partition. The experimental results supported our findings with regard to the existing internal CVIs, and showed that the proposed CUBAGE outperforms other internal CVIs with or without a pre-known number of clusters.
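To make the two ingredients named above concrete, the sketch below computes a simple compactness value (mean within-cluster distance) and a simple separation value (mean between-cluster distance) for categorical rows. This is not CUBAGE and not any index from the paper: the choice of Hamming distance, the pairwise averaging, and the function names `hamming` and `compactness_and_separation` are assumptions made purely for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of attributes on which two categorical rows differ."""
    return sum(x != y for x, y in zip(a, b))


def compactness_and_separation(rows, labels):
    """Mean within-cluster and mean between-cluster Hamming distance.

    Lower compactness and higher separation both suggest a better
    partition under this (assumed, simplified) measure.
    """
    within, between = [], []
    for (r1, l1), (r2, l2) in combinations(zip(rows, labels), 2):
        (within if l1 == l2 else between).append(hamming(r1, r2))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(within), mean(between)


rows = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "small")]
labels = [0, 0, 1, 1]
print(compactness_and_separation(rows, labels))
```

A compactness-only score keeps improving as clusters are split further, which is one way to see why the abstract argues that partitions with different numbers of clusters cannot be compared without a separation term.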

Fine-Tuning an Algorithm for Semantic Document Clustering Using a Similarity Graph

International Journal of Semantic Computing ◽

10.1142/s1793351x16400195 ◽

2016 ◽

Vol 10 (04) ◽

pp. 527-555

Author(s):

Lubomir Stanchev

Keyword(s):

English Language ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Document Clustering ◽

Fine Tuning ◽

Human Judgment ◽

Multiple Parameters ◽

Similarity Graph ◽

Multiple Metrics

In this article, we examine an algorithm for document clustering using a similarity graph. The graph stores words and common phrases from the English language as nodes and it can be used to compute the degree of semantic similarity between any two phrases. One application of the similarity graph is semantic document clustering, that is, grouping documents based on the meaning of the words in them. Since our algorithm for semantic document clustering relies on multiple parameters, we examine how fine-tuning these values affects the quality of the result. Specifically, we use the Reuters-21578 benchmark, which contains [Formula: see text] newswire stories that are grouped into 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents using a similarity metric that is based on keyword matching and one that uses the similarity graph. We evaluate the results of the clustering algorithms using multiple metrics, such as precision, recall, F-score, entropy, and purity.
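Two of the metrics listed above, purity and entropy, can be sketched in a few lines using their standard textbook definitions. The function names and the toy labels below are assumptions for illustration, and the exact variants used in the article may differ.

```python
import math
from collections import Counter


def _class_counts_per_cluster(cluster_labels, class_labels):
    """Map each cluster id to a Counter of the true classes it contains."""
    counts = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        counts.setdefault(cluster, Counter())[cls] += 1
    return counts


def purity(cluster_labels, class_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    counts = _class_counts_per_cluster(cluster_labels, class_labels)
    return sum(max(c.values()) for c in counts.values()) / len(cluster_labels)


def entropy(cluster_labels, class_labels):
    """Size-weighted average of the class entropy (in bits) of each cluster."""
    counts = _class_counts_per_cluster(cluster_labels, class_labels)
    n = len(cluster_labels)
    total = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((v / size) * math.log2(v / size) for v in c.values())
        total += (size / n) * h
    return total


clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, classes), entropy(clusters, classes))
```

Higher purity and lower entropy both indicate clusters that agree more closely with the human-assigned categories.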

A SCALABLE CLUSTERING METHOD FOR CATEGORICAL SEQUENCE DATA

International Journal of Computational Methods ◽

10.1142/s0219876205000417 ◽

2005 ◽

Vol 02 (02) ◽

pp. 167-180

Author(s):

SEUNG-JOON OH ◽

JAE-YEARN KIM

Keyword(s):

Nearest Neighbor ◽

Sequence Data ◽

Clustering Algorithms ◽

K Nearest Neighbor ◽

Clustering Method ◽

Scalable Clustering ◽

Log Files ◽

Web Access ◽

Better Than

Clustering of sequences is relatively less explored but it is becoming increasingly important in data mining applications such as web usage mining and bioinformatics. The web user segmentation problem uses web access log files to partition a set of users into clusters such that users within one cluster are more similar to one another than to the users in other clusters. Similarly, grouping protein sequences that share a similar structure can help to identify sequences with similar functions. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. Due to the high computational complexity of hierarchical clustering algorithms for clustering large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a splice dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.
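The abstract does not spell out how sampling and the k-nearest-neighbor step are combined, so the following is only one plausible reading, not the authors' method: cluster a small sample by any means, then assign every remaining sequence to the cluster that wins a majority vote among its k nearest sampled sequences. The use of edit distance as the sequence dissimilarity, and the names `edit_distance` and `assign_by_knn`, are assumptions made for this sketch.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def assign_by_knn(sequence, sampled, sampled_labels, k=3):
    """Assign a sequence to the majority cluster of its k nearest samples."""
    nearest = sorted(
        range(len(sampled)),
        key=lambda i: edit_distance(sequence, sampled[i]),
    )[:k]
    votes = Counter(sampled_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical sample that has already been clustered into two groups.
sampled = ["AAAA", "AAAT", "AATA", "GGGG", "GGGC", "GGCG"]
sampled_labels = [0, 0, 0, 1, 1, 1]
print(assign_by_knn("AAAC", sampled, sampled_labels))
print(assign_by_knn("GGGT", sampled, sampled_labels))
```

The appeal of this arrangement is that the expensive clustering step only touches the sample, while each remaining sequence costs one pass over the sampled sequences.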

A New Length-Based Algebraic Multigrid Clustering Algorithm

VLSI Design ◽

10.1155/2012/395260 ◽

2012 ◽

Vol 2012 ◽

pp. 1-14

Author(s):

L. Rakai ◽

A. Farshidi ◽

L. Behjat ◽

D. Westwick

Keyword(s):

Clustering Algorithm ◽

A Priori ◽

Clustering Algorithms ◽

Algebraic Multigrid ◽

Estimation Technique ◽

Wire Length ◽

Clustering Technique ◽

Length Estimation ◽

Made In

Clustering algorithms have been used to improve the speed and quality of placement. Traditionally, clustering focuses on the local connections between cells. In this paper, a new clustering algorithm based on the estimated lengths of circuit interconnects and on connectivity is proposed. In the proposed algorithm, an a priori length estimation technique is first used to estimate the lengths of nets. Then, the estimated lengths are used in a clustering framework to modify a clustering technique based on algebraic multigrid (AMG), which finds the cells with the highest connectivity. Finally, clusters are formed based on the results of the AMG-based process. In addition, a new physical unclustering technique is proposed. The results show that a significant improvement, with reductions of up to 40% in wire length, can be achieved when using the proposed technique with three academic placers on industry-based circuits. Moreover, the runtime is not significantly degraded and can even be improved.
