An Ordinal Data Clustering Algorithm with Automated Distance Learning

Clustering ordinal data is a common task in data mining and machine learning. As a major type of categorical data, ordinal data is composed of attributes whose possible values (also called categories in this paper) are naturally ordered. However, for lack of a dedicated distance metric, ordinal categories are usually either treated as nominal ones or coded as consecutive integers and treated as numerical ones. Both approaches define the distances between ordinal categories only roughly: the former ignores the order relationship, and the latter assigns identical distances to all pairs of adjacent categories, which may intrinsically be unequally spaced. As a result, they may produce unsatisfactory clustering results. This paper therefore proposes a novel ordinal data clustering algorithm that iteratively learns 1) the partition of the ordinal dataset and 2) the inter-category distances. To the best of our knowledge, this is the first attempt to dynamically adjust inter-category distances during the clustering process to search for a better partition of ordinal data. The proposed algorithm features superior clustering accuracy, low time complexity, and fast convergence, and is parameter-free. Extensive experiments show its efficacy.
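The contrast between the two common codings and a distance with unequal adjacent gaps can be sketched as follows. This is only an illustration, not the algorithm proposed in the paper: the function names and the hand-picked `gaps` values are hypothetical, and the abstract does not say how the inter-category distances are actually learned.

```python
import math

def nominal_distance(a, b):
    """Nominal treatment: any two different categories are equally far apart."""
    return 0.0 if a == b else 1.0

def integer_distance(a, b, n_categories):
    """Integer coding: every pair of adjacent categories is equally spaced."""
    return abs(a - b) / (n_categories - 1)

def gap_distance(a, b, gaps):
    """Order-aware distance: sum the (possibly unequal) gaps between a and b.

    gaps[i] is the assumed distance between category i and category i + 1.
    """
    lo, hi = sorted((a, b))
    return sum(gaps[lo:hi])

# Four ordered categories, e.g. 0=poor, 1=fair, 2=good, 3=excellent.
# Hypothetical gaps: "fair" and "good" are far apart, the others are close.
gaps = [0.1, 0.6, 0.3]

print(nominal_distance(0, 1), nominal_distance(0, 3))        # order is ignored
print(integer_distance(0, 1, 4), integer_distance(1, 2, 4))  # equal spacing
print(gap_distance(0, 1, gaps), gap_distance(1, 2, gaps))    # unequal spacing
```

With the nominal treatment the pairs (0, 1) and (0, 3) are indistinguishable, and with integer coding the pairs (0, 1) and (1, 2) are. Only the gap-based form can express both the order and unequal spacing.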


Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness (MMR) algorithm, a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome this shortcoming, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.


Analysis of Fuzzy Clustering for the Adoption in Data Mining

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.989-994.2047 ◽

2014 ◽

Vol 989-994 ◽

pp. 2047-2050

Author(s):

Ying Jie Wang

Keyword(s):

Machine Learning ◽

Data Mining ◽

Mathematical Method ◽

Big Data ◽

Fuzzy Clustering ◽

Clustering Analysis ◽

Data Clustering ◽

Clustering Algorithm ◽

Clustering Methods ◽

Fuzzy Clustering Methods

Data mining is the general methodology for retrieving useful information from big data. Clustering analysis is a mathematical method of unsupervised classification in machine learning. It can be adopted for data classification in data mining. This paper combines the clustering process with a fuzzy approach and then derives a special clustering algorithm based on the fast fuzzy c-means (FFCM) method. In summary, the paper illustrates the adoption of a series of fuzzy clustering methods in data mining. These methods improve the computational efficiency of learning, as they converge quickly. The methodology of this paper is meaningful for information retrieval from big data.
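For readers unfamiliar with fuzzy clustering, a minimal sketch of the standard fuzzy c-means update on one-dimensional data is given here. It is the textbook algorithm, not the FFCM variant of the paper (whose details the abstract does not give); the fuzzifier `m = 2`, the fixed iteration count, and the sample data are assumptions made for illustration.

```python
def fcm_memberships(points, centers, m=2.0):
    """Membership of each point in each cluster (each row sums to 1)."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # The point sits exactly on a center: give it full membership there.
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(row)
            row = [r / total for r in row]
        else:
            row = [1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
                   for d_j in dists]
        memberships.append(row)
    return memberships

def fcm_centers(points, memberships, m=2.0):
    """Each center is the mean of the points weighted by membership ** m."""
    centers = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** m for row in memberships]
        centers.append(sum(w * x for w, x in zip(weights, points)) / sum(weights))
    return centers

def fuzzy_c_means(points, centers, m=2.0, iterations=20):
    """Alternate the two updates for a fixed number of iterations."""
    for _ in range(iterations):
        memberships = fcm_memberships(points, centers, m)
        centers = fcm_centers(points, memberships, m)
    return centers, fcm_memberships(points, centers, m)

# Hypothetical data: two well-separated groups around 1 and 5.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
final_centers, final_memberships = fuzzy_c_means(data, centers=[0.0, 6.0])
print(final_centers)
```

Unlike hard k-means, every point keeps a graded membership in every cluster, which is what lets fuzzy methods represent points that lie between groups.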


A Novel Cosine Similarity Like Data Clustering Method for Effective Data Classification in Data Mining

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.h6417.069820 ◽

2020 ◽

Vol 9 (8) ◽

pp. 340-346

Keyword(s):

Data Mining ◽

Similarity Measure ◽

Categorical Data ◽

Data Clustering ◽

Similarity Measures ◽

Numerical Data ◽

Data Classification ◽

Fundamental Goal ◽

Learning Technique ◽

Categorical Data Clustering

In data mining, many techniques use distance-based measures for data clustering. Improving clustering performance is the fundamental goal in clustering-related tasks. Many techniques are available for clustering numerical data as well as categorical data. Clustering is an unsupervised learning technique in which objects are grouped based on their similarity. A new cluster similarity measure, the cosine-like cluster similarity measure (CLCSM), is proposed in this paper. The proposed cluster similarity measure is used for data classification. Extensive experiments are conducted on UCI machine learning datasets. The experimental results show that the proposed cosine-like cluster similarity measure is superior to many existing cluster similarity measures for data classification.
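As background, the ordinary cosine similarity on which a "cosine-like" measure is presumably modeled can be sketched as below. This is not CLCSM itself, whose definition the abstract does not give; comparing clusters through their centroids is an assumption chosen purely for illustration, and the sample clusters are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]

# Hypothetical clusters of 2-D points.
cluster_a = [[1.0, 2.0], [2.0, 4.0]]
cluster_b = [[3.0, 6.1], [4.0, 7.9]]
cluster_c = [[5.0, 0.1], [6.0, 0.2]]

print(cosine_similarity(centroid(cluster_a), centroid(cluster_b)))
print(cosine_similarity(centroid(cluster_a), centroid(cluster_c)))
```

Cosine similarity depends only on direction, not magnitude, so clusters A and B score as similar even though B lies farther from the origin.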


Improved Bidirectional CABOSFV Based on Multi-Adjustment Clustering and Simulated Annealing

Cybernetics and Information Technologies ◽

10.1515/cait-2016-0075 ◽

2016 ◽

Vol 16 (6) ◽

pp. 27-42 ◽

Cited By ~ 1

Author(s):

Minghan Yang ◽

Xuedong Gao ◽

Ling Li

Keyword(s):

Simulated Annealing ◽

Data Clustering ◽

Time Complexity ◽

Clustering Algorithm ◽

Feature Vector ◽

Parameter Determination ◽

Data Sets ◽

Parameter Vector ◽

Clustering Validity

Although the Clustering Algorithm Based on Sparse Feature Vector (CABOSFV) and its related algorithms are efficient for clustering high-dimensional sparse data, they have several imperfections. Imperfections such as subjective parameter designation and the order sensitivity of the clustering process eventually worsen the time complexity and degrade the quality of the algorithm. This paper proposes a parameter adjustment method for Bidirectional CABOSFV for optimization purposes. By optimizing the Parameter Vector (PV) and Parameter Selection Vector (PSV) with clustering validity as the objective function, an improved Bidirectional CABOSFV algorithm using simulated annealing is proposed, which circumvents the requirement of initial parameter determination. The experiments on UCI data sets show that the proposed algorithm, which can perform multi-adjustment clustering, achieves higher accuracy than single-adjustment clustering, along with a decreased time complexity through iterations.
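The general idea of tuning a clustering parameter by simulated annealing against a validity score can be sketched as follows. This is a generic annealing loop, not the paper's PV/PSV procedure; the toy `toy_validity` function, the single scalar parameter, the cooling schedule, and all names are assumptions made for illustration.

```python
import math
import random

def simulated_annealing(score, initial, neighbor, steps=500,
                        start_temp=1.0, cooling=0.99, seed=0):
    """Maximize score(x); returns the best candidate seen and its score."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    temp = start_temp
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temp *= cooling
    return best, best_score

def toy_validity(threshold):
    """Stand-in for a clustering validity index, peaked at threshold 0.7."""
    return -(threshold - 0.7) ** 2

def perturb(threshold, rng):
    """Propose a nearby threshold, clipped to the interval [0, 1]."""
    return min(1.0, max(0.0, threshold + rng.uniform(-0.1, 0.1)))

best_threshold, best_validity = simulated_annealing(toy_validity, 0.1, perturb)
print(best_threshold, best_validity)
```

In a real setting `toy_validity` would run the clustering algorithm with the candidate parameters and return the validity index of the resulting partition, which is why the number of annealing steps dominates the cost.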


Fuzzy Clustering of Web User Profiles for Analyzing their Behavior and Interests

Fuzzy Methods for Customer Relationship Management and Marketing ◽

10.4018/978-1-4666-0095-9.ch005 ◽

2012 ◽

pp. 91-114

Author(s):

Stanislav Kreuzer ◽

Natascha Hoebel

Keyword(s):

Data Mining ◽

Consumer Behavior ◽

Fuzzy Clustering ◽

Data Clustering ◽

Ordinal Data ◽

Customer Relationships ◽

User Profiles ◽

Data Mining Techniques

One of the keys to building effective e-customer relationships is an understanding of consumer behavior online. However, analyzing the behavior of customers online is not necessarily an indicator of their interests. Therefore, building profiles of registered users of a website is of importance if it goes beyond collecting obvious information the user is willing to give at the time of the registration. These user profiles can contribute to the analysis of the users’ interests. Important tools for the analysis are data-mining techniques, for example, the clustering of collected user information. This chapter addresses the problem of how to define, calculate, and visualize fuzzy clusters of Web visitors with respect to their behavior and supposed interests. This chapter shows how to cluster Web users based on their profile and by their similar interests in several topics using the fuzzy and hybrid CORD (Clustering of Ordinal Data) clustering system, which is part of the Gugubarra Framework.


Understanding and Enhancement of Internal Clustering Validation Indexes for Categorical Data

Algorithms ◽

10.3390/a11110177 ◽

2018 ◽

Vol 11 (11) ◽

pp. 177 ◽

Cited By ~ 2

Author(s):

Xuedong Gao ◽

Minghan Yang

Keyword(s):

Machine Learning ◽

Categorical Data ◽

Data Clustering ◽

Information Gain ◽

Clustering Algorithms ◽

Number Of Clusters ◽

Cluster Compactness ◽

Clustering Validation ◽

Categorical Data Clustering

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) are used to measure the quality of several clustered partitions in order to determine the locally optimal clustering results in an unsupervised manner, and can act as the objective function of clustering algorithms. In this paper, we first studied several well-known internal CVIs for categorical data clustering, and proved the ineffectiveness of evaluating partitions with different numbers of clusters without any inter-cluster separation measures or assumptions; the accuracy of the separation measure, along with its coordination with the intra-cluster compactness measures, can notably affect performance. Then, aiming to enhance the internal clustering validation measurement, we proposed a new internal CVI, clustering utility based on the averaged information gain of isolating each cluster (CUBAGE), which measures both the compactness and the separation of the partition. The experimental results supported our findings with regard to the existing internal CVIs, and showed that the proposed CUBAGE outperforms other internal CVIs with or without a pre-known number of clusters.
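To make "information gain of isolating a cluster" concrete, the sketch below computes the ordinary information gain of splitting one categorical attribute into "this cluster" versus "everything else". It is the textbook quantity, not the CUBAGE index, whose exact formula the abstract does not state; the function names and sample values are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    if total == 0:
        return 0.0
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(values).values())

def isolation_gain(values, labels, cluster):
    """Information gain from separating `cluster` from the remaining objects."""
    inside = [v for v, label in zip(values, labels) if label == cluster]
    outside = [v for v, label in zip(values, labels) if label != cluster]
    n = len(values)
    remainder = (len(inside) / n) * entropy(inside) \
        + (len(outside) / n) * entropy(outside)
    return entropy(values) - remainder

# One categorical attribute observed on four objects.
colors = ["red", "red", "blue", "blue"]

print(isolation_gain(colors, [0, 0, 1, 1], cluster=0))  # clean partition
print(isolation_gain(colors, [0, 1, 0, 1], cluster=0))  # mixed partition
```

A partition whose clusters are internally pure and mutually distinct yields a high gain, which matches the abstract's point that a useful index must reflect both compactness and separation.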


Hebbian Self-Organizing Integrate-and-Fire Networks for Data Clustering

Neural Computation ◽

10.1162/neco.2009.12-08-926 ◽

2010 ◽

Vol 22 (1) ◽

pp. 273-288 ◽

Cited By ~ 16

Author(s):

Florian Landis ◽

Thomas Ott ◽

Ruedi Stoop

Keyword(s):

Data Clustering ◽

Time Complexity ◽

Arbitrary Shape ◽

Clustering Algorithm ◽

Hebbian Learning ◽

Spiking Neurons ◽

Integrate And Fire ◽

Background Data ◽

Homogeneous Regions ◽

Noisy Background

We propose a Hebbian learning-based data clustering algorithm using spiking neurons. The algorithm is capable of distinguishing between clusters and noisy background data and finds an arbitrary number of clusters of arbitrary shape. These properties render the approach particularly useful for visual scene segmentation into arbitrarily shaped homogeneous regions. We present several application examples, and in order to highlight the advantages and the weaknesses of our method, we systematically compare the results with those from standard methods such as k-means and Ward's linkage clustering. The analysis demonstrates that not only is the clustering ability of the proposed algorithm more powerful than that of the two competing methods, but the time complexity of the method is also more modest than that of its generally used strongest competitor.


Dbscan Assisted by Hybrid Genetic K Means Algorithm

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f8061.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 1973-1979

Keyword(s):

Data Mining ◽

Time Complexity ◽

Clustering Algorithm ◽

Hybrid Approach ◽

Research Direction ◽

Computational Time ◽

Main Concern ◽

Complex Data ◽

Data Mining Algorithms ◽

Mining Algorithms

The performance of data mining algorithms becomes the main concern as the volume of data grows. Clustering analysis is an active and contested research direction in data mining for complex data samples. DBSCAN is a density-based clustering algorithm with several advantages in numerous applications. However, DBSCAN has quadratic time complexity, which makes it difficult to use in realistic applications, particularly with huge, complex data samples. Therefore, this paper proposes a hybrid approach that reduces the time complexity by exploring the core properties of DBSCAN in the initial stage using a genetic K-means partition algorithm. The experiments show that the proposed hybrid approach obtains competitive results compared with the usual approach and drastically reduces the computational time.
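The quadratic cost mentioned above is easiest to see in a plain DBSCAN without a spatial index, sketched below. This is the basic algorithm only, not the hybrid genetic K-means approach of the paper; the parameter values and sample points are assumptions chosen for illustration.

```python
import math
from collections import deque

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of point i (a full scan: O(n))."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep growing
    return labels

# Hypothetical data: two dense groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(sample, eps=1.5, min_pts=3))
```

Each point triggers one `region_query`, and each query scans all n points, giving the O(n^2) behavior. Reducing the number or the scope of these queries, for example by pre-partitioning the data, is the usual route to a speed-up.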


Improvement of Support Vector Machine Algorithm in Big Data Background

Mathematical Problems in Engineering ◽

10.1155/2021/5594899 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Babacar Gaye ◽

Dezheng Zhang ◽

Aziguli Wulamu

Keyword(s):

Machine Learning ◽

Data Mining ◽

Big Data ◽

Time Complexity ◽

Dual Problem ◽

Learning Algorithm ◽

Rapid Development ◽

Machine Learning Algorithms ◽

Support Vector ◽

Original Space

With the rapid development of the Internet and of big data analysis technology, data mining has played a positive role in promoting industry and academia. Classification is an important problem in data mining. This paper explores the background and theory of support vector machines (SVM) in data mining classification algorithms and analyzes and summarizes the research status of various improved SVM methods. According to the scale and characteristics of the data, different solution spaces are selected, and the solution of the dual problem is transformed into the classification surface of the original space to improve the algorithm speed. Research Process. Incorporating fuzzy membership into multicore learning, it is found that the time complexity of the original problem is determined by the dimension and the time complexity of the dual problem is determined by the quantity; since dimension and quantity together constitute the scale of the data, different solution spaces can be chosen based on the scale and characteristics of the data. The algorithm speed can be improved by transforming the solution of the dual problem into the classification surface of the original space. Conclusion. By improving the calculation rate of traditional machine learning algorithms, it is concluded that the accuracy of the fitting prediction between the predicted data and the actual value is as high as 98%, which can make the traditional machine learning algorithm meet the requirements of the big data era. It can be widely used in the context of big data.
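The remark that the original (primal) problem scales with the dimension while the dual scales with the quantity of data can be read off the standard soft-margin linear SVM, shown here as background. The abstract does not give the paper's own formulation, so the notation is an assumption: $n$ training pairs $(x_i, y_i)$ with $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$, and penalty $C > 0$.

```latex
% Primal: the weight vector w has d components (one per feature).
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n

% Dual: one multiplier \alpha_i per training sample, n in total.
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
 - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,x_i^{\top}x_j
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\qquad \sum_{i=1}^{n}\alpha_i y_i = 0

% Mapping a dual solution back to the separating surface in the original space.
w = \sum_{i=1}^{n}\alpha_i\,y_i\,x_i
```

The primal optimizes over the $d$-dimensional $w$, whereas the dual optimizes over $n$ multipliers, which is why the choice between them depends on whether $d$ or $n$ dominates the scale of the data.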


Ensemble Hybrid K- Means and DBSCAN Clustering Algorithm – HDKA for Cancer Dataset

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d8257.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 6036-6040

Keyword(s):

Machine Learning ◽

Data Mining ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Cancer Dataset ◽

Dbscan Clustering ◽

Selection Of

Data mining is a vital area of analysis and is used pragmatically in many different domains. It has become a highly demanding field because huge amounts of data have been collected in various applications. A database can be clustered in many ways depending on the clustering algorithm used, the parameter settings, and other factors. Multiple clustering algorithms can be combined to obtain a final partitioning of the data that provides better clustering results. In this paper, an ensemble hybrid K-Means and DBSCAN algorithm (HDKA) is proposed to overcome the drawbacks of the DBSCAN and K-Means clustering algorithms. The proposed algorithm improves the selection of centroid points through a centroid selection strategy. For the experiments, we used two datasets, Colon and Leukemia, from the UCI machine learning repository.
