scholarly journals An Ordinal Data Clustering Algorithm with Automated Distance Learning

2020 ◽  
Vol 34 (04) ◽  
pp. 6869-6876
Author(s):  
Yiqun Zhang ◽  
Yiu-ming Cheung

Clustering ordinal data is a common task in data mining and machine learning fields. As a major type of categorical data, ordinal data is composed of attributes with naturally ordered possible values (also called categories interchangeably in this paper). However, due to the lack of dedicated distance metric, ordinal categories are usually treated as nominal ones, or coded as consecutive integers and treated as numerical ones. Both these two common ways will roughly define the distances between ordinal categories because the former way ignores the order relationship and the latter way simply assigns identical distances to different pairs of adjacent categories that may have intrinsically unequal distances. As a result, they may produce unsatisfactory ordinal data clustering results. This paper, therefore, proposes a novel ordinal data clustering algorithm, which iteratively learns: 1) The partition of ordinal dataset, and 2) the inter-category distances. To the best of our knowledge, this is the first attempt to dynamically adjust inter-category distances during the clustering process to search for a better partition of ordinal data. The proposed algorithm features superior clustering accuracy, low time complexity, fast convergence, and is parameter-free. Extensive experiments show its efficacy.

2021 ◽  
Vol 8 (10) ◽  
pp. 43-50
Author(s):  
Truong et al. ◽  

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers are interested in the problem of clustering categorical data and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR) which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.


2014 ◽  
Vol 989-994 ◽  
pp. 2047-2050
Author(s):  
Ying Jie Wang

Data mining is the general methodology for retrieving useful information from big data. Clustering analysis is a mathematical method of classification for unsupervised machine learning. It can be adopted for data classification in Data mining. This paper combines the clustering process by fuzzy way and then deduces a special clustering algorithm with fast fuzzy c-means (FFCM) method. In summary, the paper illustrates the adoption of a series of fuzzy clustering methods in Data Mining. These methods have improved the computational efficiency with learning as the convergence speed is fast. The methodology of this paper presents significantly meaningful for information retrieval of big data.


In data mining ample techniques use distance based measures for data clustering. Improving clustering performance is the fundamental goal in cluster domain related tasks. Many techniques are available for clustering numerical data as well as categorical data. Clustering is an unsupervised learning technique and objects are grouped or clustered based on similarity among the objects. A new cluster similarity finding measure, which is cosine like cluster similarity measure (CLCSM), is proposed in this paper. The proposed cluster similarity measure is used for data classification. Extensive experiments are conducted by taking UCI machine learning datasets. The experimental results have shown that the proposed cosinelike cluster similarity measure is superior to many of the existing cluster similarity measures for data classification.


2016 ◽  
Vol 16 (6) ◽  
pp. 27-42 ◽  
Author(s):  
Minghan Yang ◽  
Xuedong Gao ◽  
Ling Li

Abstract Although Clustering Algorithm Based on Sparse Feature Vector (CABOSFV) and its related algorithms are efficient for high dimensional sparse data clustering, there exist several imperfections. Such imperfections as subjective parameter designation and order sensibility of clustering process would eventually aggravate the time complexity and quality of the algorithm. This paper proposes a parameter adjustment method of Bidirectional CABOSFV for optimization purpose. By optimizing Parameter Vector (PV) and Parameter Selection Vector (PSV) with the objective function of clustering validity, an improved Bidirectional CABOSFV algorithm using simulated annealing is proposed, which circumvents the requirement of initial parameter determination. The experiments on UCI data sets show that the proposed algorithm, which can perform multi-adjustment clustering, has a higher accurateness than single adjustment clustering, along with a decreased time complexity through iterations.


Author(s):  
Stanislav Kreuzer ◽  
Natascha Hoebel

One of the keys to building effective e-customer relationships is an understanding of consumer behavior online. However, analyzing the behavior of customers online is not necessarily an indicator of their interests. Therefore, building profiles of registered users of a website is of importance if it goes beyond collecting obvious information the user is willing to give at the time of the registration. These user profiles can contribute to the analysis of the users’ interests. Important tools for the analysis are data-mining techniques, for example, the clustering of collected user information. This chapter addresses the problem of how to define, calculate, and visualize fuzzy clusters of Web visitors with respect to their behavior and supposed interests. This chapter shows how to cluster Web users based on their profile and by their similar interests in several topics using the fuzzy and hybrid CORD (Clustering of Ordinal Data) clustering system, which is part of the Gugubarra Framework.


Algorithms ◽  
2018 ◽  
Vol 11 (11) ◽  
pp. 177 ◽  
Author(s):  
Xuedong Gao ◽  
Minghan Yang

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) are used to measure the quality of several clustered partitions to determine the local optimal clustering results in an unsupervised manner, and can act as the objective function of clustering algorithms. In this paper, we first studied several well-known internal CVIs for categorical data clustering, and proved the ineffectiveness of evaluating the partitions of different numbers of clusters without any inter-cluster separation measures or assumptions; the accurateness of separation, along with its coordination with the intra-cluster compactness measures, can notably affect performance. Then, aiming to enhance the internal clustering validation measurement, we proposed a new internal CVI—clustering utility based on the averaged information gain of isolating each cluster (CUBAGE)—which measures both the compactness and the separation of the partition. The experimental results supported our findings with regard to the existing internal CVIs, and showed that the proposed CUBAGE outperforms other internal CVIs with or without a pre-known number of clusters.


2010 ◽  
Vol 22 (1) ◽  
pp. 273-288 ◽  
Author(s):  
Florian Landis ◽  
Thomas Ott ◽  
Ruedi Stoop

We propose a Hebbian learning-based data clustering algorithm using spiking neurons. The algorithm is capable of distinguishing between clusters and noisy background data and finds an arbitrary number of clusters of arbitrary shape. These properties render the approach particularly useful for visual scene segmentation into arbitrarily shaped homogeneous regions. We present several application examples, and in order to highlight the advantages and the weaknesses of our method, we systematically compare the results with those from standard methods such as the k-means and Ward's linkage clustering. The analysis demonstrates that not only the clustering ability of the proposed algorithm is more powerful than those of the two concurrent methods, the time complexity of the method is also more modest than that of its generally used strongest competitor.


2020 ◽  
Vol 8 (6) ◽  
pp. 1973-1979

The data mining algorithms functioning is main concern, when the data becomes to a greater extent. Clustering analysis is a active and dispute research direction in the region of data mining for complex data samples. DBSCAN is a density-based clustering algorithm with several advantages in numerous applications. However, DBSCAN has quadratic time complexity i.e. making it complicated for realistic applications particularly with huge complex data samples. Therefore, this paper recommended a hybrid approach to reduce the time complexity by exploring the core properties of the DBSCAN in the initial stage using genetic based K-means partition algorithm. The technological experiments showed that the proposed hybrid approach obtains competitive results when compared with the usual approach and drastically improves the computational time.


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Babacar Gaye ◽  
Dezheng Zhang ◽  
Aziguli Wulamu

With the rapid development of the Internet and the rapid development of big data analysis technology, data mining has played a positive role in promoting industry and academia. Classification is an important problem in data mining. This paper explores the background and theory of support vector machines (SVM) in data mining classification algorithms and analyzes and summarizes the research status of various improved methods of SVM. According to the scale and characteristics of the data, different solution spaces are selected, and the solution of the dual problem is transformed into the classification surface of the original space to improve the algorithm speed. Research Process. Incorporating fuzzy membership into multicore learning, it is found that the time complexity of the original problem is determined by the dimension, and the time complexity of the dual problem is determined by the quantity, and the dimension and quantity constitute the scale of the data, so it can be based on the scale of the data Features Choose different solution spaces. The algorithm speed can be improved by transforming the solution of the dual problem into the classification surface of the original space. Conclusion. By improving the calculation rate of traditional machine learning algorithms, it is concluded that the accuracy of the fitting prediction between the predicted data and the actual value is as high as 98%, which can make the traditional machine learning algorithm meet the requirements of the big data era. It can be widely used in the context of big data.


2019 ◽  
Vol 8 (4) ◽  
pp. 6036-6040

Data Mining is the foremost vital space of analysis and is pragmatically utilized in totally different domains, It becomes a highly demanding field because huge amounts of data have been collected in various applications. The database can be clustered in more number of ways depending on the clustering algorithm used, parameter settings and other factors. Multiple clustering algorithms can be combined to get the final partitioning of data which provides better clustering results. In this paper, Ensemble hybrid KMeans and DBSCAN (HDKA) algorithm has been proposed to overcome the drawbacks of DBSCAN and KMeans clustering algorithms. The performance of the proposed algorithm improves the selection of centroid points through the centroid selection strategy.For experimental results we have used two dataset Colon and Leukemia from UCI machine learning repository.


Sign in / Sign up

Export Citation Format

Share Document