A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets

Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine and biology. Designing an appropriate similarity measurement plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity. However, different attributes may contribute differently because the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to reflect the differing importance of the attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods, with accuracy improvements of 2.13% and 4.28%, respectively, on six mixed datasets from UCI.
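
The pipeline described above (discretize numerical attributes, then weight attributes by entropy) can be sketched roughly as follows. The abstract does not give the paper's categorization technique or weighting formula, so everything here is an assumption for illustration: equal-width binning stands in for the automatic categorization, each attribute's weight is taken to be its Shannon entropy normalized over all attributes, and the function names are hypothetical.

```python
import math
from collections import Counter

def equal_width_bins(values, n_bins=3):
    """Map numeric values to bin indices 0..n_bins-1.
    Assumption: a simple stand-in for the paper's automatic categorization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def shannon_entropy(column):
    """Shannon entropy (in bits) of the value distribution of one attribute."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def entropy_weights(columns):
    """Assumed weighting: each attribute's entropy divided by the total entropy."""
    entropies = [shannon_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant; fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [e / total for e in entropies]

def weighted_similarity(x, y, weights):
    """Sum the weights of the attributes on which two objects agree."""
    return sum(w for a, b, w in zip(x, y, weights) if a == b)
```

Under this assumed scheme, an attribute that takes a single value for every object carries no information and receives zero weight, while attributes with more evenly spread values count for more.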

Weighted k-Prototypes Clustering Algorithm Based on the Hybrid Dissimilarity Coefficient

Mathematical Problems in Engineering ◽

10.1155/2020/5143797 ◽

2020 ◽

Vol 2020 ◽

pp. 1-13

Author(s):

Ziqi Jia ◽

Ling Song

Keyword(s):

Categorical Data ◽

Clustering Algorithm ◽

Numerical Data ◽

Experimental Results ◽

Cluster Center ◽

Real Dataset ◽

Dissimilarity Coefficient ◽

Initial Cluster ◽

Data Objects ◽

Selection Of

The k-prototypes algorithm is a hybrid clustering algorithm that can process both categorical and numerical data. In this study, the method of initial cluster center selection was improved and a new hybrid dissimilarity coefficient was proposed. Based on this coefficient, a weighted k-prototypes clustering algorithm (WKPCA) was proposed. The WKPCA algorithm not only improves the selection of initial cluster centers but also introduces a new method for calculating the dissimilarity between data objects and cluster centers. A real dataset from UCI was used to test the WKPCA algorithm. Experimental results show that the WKPCA algorithm is more efficient and robust than other k-prototypes algorithms.
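
For context, the baseline that WKPCA modifies can be sketched as below. This is the classic k-prototypes dissimilarity (squared Euclidean distance on numerical attributes plus a weighted count of categorical mismatches), not the hybrid dissimilarity coefficient proposed in the paper, whose definition the abstract does not give. The function names and the default value of the trade-off parameter `gamma` are illustrative assumptions.

```python
def k_prototypes_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Classic k-prototypes cost between an object and a cluster prototype.
    gamma balances the categorical part against the numerical part."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part

def nearest_prototype(x_num, x_cat, prototypes, gamma=1.0):
    """Return the index of the closest prototype.
    Each prototype is a (numeric_values, categorical_values) pair."""
    return min(
        range(len(prototypes)),
        key=lambda i: k_prototypes_dissimilarity(
            x_num, x_cat, prototypes[i][0], prototypes[i][1], gamma
        ),
    )
```

With `gamma=0.5`, an object that differs from a prototype by 2.0 on one numerical attribute and by one categorical value has dissimilarity 4.0 + 0.5 = 4.5.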

A Modified Overlapping Partitioning Clustering Algorithm for Categorical Data Clustering

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v7i1.896 ◽

2018 ◽

Vol 7 (1) ◽

pp. 55-62

Author(s):

Mohammad Alaqtash ◽

Moayad A.Fadhil ◽

Ali F. Al-Azzawi

Keyword(s):

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Numerical Data ◽

Data Representation ◽

The Past ◽

Textual Data ◽

Traditional Algorithm ◽

Clustering Problems ◽

Categorical Data Clustering

Clustering enables the grouping of unlabeled data by partitioning data into clusters with similar patterns. Over the past decades, many clustering algorithms have been developed for various clustering problems. An overlapping partitioning clustering (OPC) algorithm can only handle numerical data. Hence, novel clustering algorithms have been studied extensively to overcome this issue. By increasing the number of objects belonging to one cluster and the distance between cluster centers, the study aimed to cluster textual data without losing the main functions of the algorithm. The study used a twenty-newsgroup dataset consisting of approximately 20,000 textual documents. By introducing some modifications to the traditional algorithm, an acceptable level of homogeneity and completeness of clusters was achieved. Modifications were made to the pre-processing phase and data representation, along with a number of methods that influence the primary function of the algorithm. Subsequently, the results were evaluated and compared with those of the k-means algorithm on the training and test datasets. The results indicated that the modified algorithm could successfully handle categorical data and produce satisfactory clusters.

Clustering Categorical Data with k-Modes

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch040 ◽

2011 ◽

pp. 246-250 ◽

Cited By ~ 2

Author(s):

Joshua Zhexue Huang

Keyword(s):

Real World ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Chemical Information ◽

New Techniques ◽

Small Set ◽

Categorical Attributes ◽

Categorical Attribute ◽

Numeric Data

A lot of data in real-world databases are categorical. For example, gender, profession, position, and hobby of customers are usually defined as categorical attributes in the CUSTOMER table. Each categorical attribute is represented with a small set of unique categorical values such as {Female, Male} for the gender attribute. Unlike numeric data, categorical values are discrete and unordered. Therefore, the clustering algorithms for numeric data cannot be used to cluster the categorical data that exist in many real-world applications. In data mining research, much effort has been put into developing new techniques for clustering categorical data (Huang, 1997b; Huang, 1998; Gibson, Kleinberg, & Raghavan, 1998; Ganti, Gehrke, & Ramakrishnan, 1999; Guha, Rastogi, & Shim, 1999; Chaturvedi, Green, Carroll, & Foods, 2001; Barbara, Li, & Couto, 2002; Andritsos, Tsaparas, Miller, & Sevcik, 2003; Li, Ma, & Ogihara, 2004; Chen, & Liu, 2005; Parmar, Wu, & Blackhurst, 2007). The k-modes clustering algorithm (Huang, 1997b; Huang, 1998) is one of the first algorithms for clustering large categorical data. In the past decade, this algorithm has been well studied and widely used in various applications. It is also adopted in commercial software (e.g., Daylight Chemical Information Systems, Inc, http://www.daylight.com/).
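
The two building blocks of k-modes can be sketched briefly: a simple matching dissimilarity that counts the attributes on which two records differ, and a cluster "mode" built from the most frequent value of each attribute. This is a minimal illustration rather than a full implementation; the function names are hypothetical, and ties between equally frequent values are broken arbitrarily here.

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Number of attributes on which two categorical records differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def cluster_mode(records):
    """Per-attribute most frequent value across the records of a cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# Usage: records are tuples of categorical values, one tuple per object.
customers = [("Female", "teacher"), ("Female", "nurse"), ("Male", "teacher")]
mode = cluster_mode(customers)
```

Because the mode is a vector of frequent values rather than an average, it is defined for unordered values such as {Female, Male}, which is why k-modes can replace the means used for numeric data.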

Fast Density Clustering Algorithm for Numerical Data and Categorical Data

Mathematical Problems in Engineering ◽

10.1155/2017/6393652 ◽

2017 ◽

Vol 2017 ◽

pp. 1-15 ◽

Cited By ~ 6

Author(s):

Chen Jinyin ◽

He Huihao ◽

Chen Jungan ◽

Yu Shanqing ◽

Shi Zhaoxia

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Numerical Data ◽

Mixed Data ◽

Cluster Center ◽

Data Object ◽

Numerical Attributes ◽

Clustering Quality ◽

Categorical Attributes ◽

Density Clustering

Data objects with mixed numerical and categorical attributes are common in the real world. Most existing algorithms have limitations such as low clustering quality, difficulty in determining cluster centers, and sensitivity to initial parameters. A fast density clustering algorithm (FDCA) is put forward based on a one-time scan, with cluster centers automatically determined by a center set algorithm (CSA). A novel data similarity metric is designed for clustering data that include both numerical and categorical attributes. CSA is designed to choose cluster centers from the data objects automatically, which overcomes the difficulty of setting cluster centers in most clustering algorithms. The performance of the proposed method is verified through a series of experiments on ten mixed data sets in comparison with several other clustering algorithms in terms of clustering purity, efficiency, and time complexity.

Machine Learning Based Predictive Action on Categorical Non-Sequential Data

Recent Advances in Computer Science and Communications ◽

10.2174/2213275912666190417150421 ◽

2020 ◽

Vol 13 (5) ◽

pp. 1020-1030

Author(s):

Pradeep S. ◽

Jagadish S. Kallimani

Keyword(s):

Machine Learning ◽

Data Analysis ◽

Categorical Data ◽

Numerical Data ◽

Processing Technique ◽

Machine Learning Algorithms ◽

Sequential Data ◽

Industry Standard ◽

Robust Model ◽

Future Work

Background: With the advent of data analysis and machine learning, there is a growing impetus for analyzing historical data and building models on it. The data come in numerous forms and shapes, with an abundance of challenges. The most sought-after form of data for analysis is numerical data. With the plethora of algorithms and tools available, such data are quite manageable. Another form of data is categorical, which is subdivided into ordinal (order wise) and nominal (name wise). These data can be broadly classified as sequential and non-sequential. Sequential data are easier to preprocess using algorithms. Objective: This paper addresses the challenge of applying machine learning algorithms to categorical data of a non-sequential nature. Methods: Implementing several data analysis algorithms on such data yields a biased result, which makes it impossible to generate a reliable predictive model. In this paper, we address this problem by walking through a handful of techniques that, during our research, helped us deal with large categorical data of a non-sequential nature. In subsequent sections, we discuss the possible implementable solutions and the shortfalls of these techniques. Results: The methods are applied to sample datasets available in the public domain, and the results with respect to classification accuracy are satisfactory. Conclusion: The best pre-processing technique we observed in our research is one-hot encoding, which breaks the categorical features down into binary features and feeds them into an algorithm to predict the outcome. The example that we took is not abstract; it is a real-time production services dataset, which had many complex variations of categorical features. Our future work includes creating a robust model on such data and deploying it in industry-standard applications.
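
One-hot encoding, the pre-processing technique favoured in the conclusion above, can be illustrated with a small standard-library sketch. The function name and the choice to order categories alphabetically are assumptions made for this example; the abstract does not say which tooling the authors used.

```python
def one_hot_encode(values):
    """Encode a list of categorical values as binary indicator rows.
    Returns the category order used and one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one indicator is set per row
        rows.append(row)
    return categories, rows

categories, encoded = one_hot_encode(["red", "green", "red"])
```

Each category becomes its own binary column, so no artificial ordering is imposed on nominal values, at the cost of one extra column per distinct value.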

Mobile phones of paediatric hospital staff are never cleaned and commonly used in toilets with implications for healthcare nosocomial diseases

Scientific Reports ◽

10.1038/s41598-021-92360-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Matthew Olsen ◽

Anna Lohning ◽

Mariana Campos ◽

Peter Jones ◽

Simon McKirdy ◽

...

Keyword(s):

Mobile Phones ◽

Categorical Data ◽

Healthcare Workers ◽

Numerical Data ◽

Hospital Staff ◽

Paediatric Hospital ◽

Rapid Spread ◽

Chi Squared ◽

Handling Practices ◽

Micro Organisms

An ever-increasing number of medical staff use mobile phones as a work aid, yet this may pose a risk of nosocomial disease. This study aimed to assess and report, via a survey, the handling practices and phone use of healthcare workers in paediatric wards. 165 paediatric healthcare workers and staff filled in a questionnaire consisting of 14 questions (including categorical, ordinal and numerical data). Analysis of categorical data used non-parametric techniques such as the Chi-squared test. Although 98% of respondents (165 in total) report that their phones may be contaminated, 56% have never cleaned their devices. Of the respondents that clean their devices, 10% (17/165) had done so with alcohol swabs or disinfectant within that day or week, and an additional 12% of respondents (20/165) within that month. Of concern, 52% (86/165) of the respondents use their phones in the bathroom, emphasising the unhygienic environments in which mobile phones/smartphones are constantly used. Disinfecting phones is a practice that only a minority of healthcare workers undertake appropriately. Mobile phones, present in billions globally, are therefore Trojan Horses if contaminated with microbes, potentially contributing to the spread and propagation of micro-organisms, as seen in the rapid spread of the SARS-CoV-2 virus in the world.

An Optimal and Stable Algorithm for Clustering Numerical Data

Algorithms ◽

10.3390/a14070197 ◽

2021 ◽

Vol 14 (7) ◽

pp. 197

Author(s):

Ali Seman ◽

Azizian Mohd Sapawi

Keyword(s):

Standard Deviation ◽

Real World ◽

Clustering Algorithm ◽

Numerical Data ◽

Zero Point ◽

The Other ◽

Suitable Alternative ◽

Stable Algorithm ◽

Real World Applications

In the conventional k-means framework, seeding is the first step toward optimization before the objects are clustered. In random seeding, two main issues arise: the clustering results may be less than optimal, and different clustering results may be obtained for every run. In real-world applications, optimal and stable clustering is highly desirable. This report introduces a new clustering algorithm called the zero k-approximate modal haplotype (Zk-AMH) algorithm that uses a simple and novel seeding mechanism known as zero-point multidimensional spaces. The Zk-AMH provides cluster optimality and stability, therefore resolving the aforementioned issues. Notably, the Zk-AMH algorithm yielded mean scores identical to its maximum and minimum scores over 100 runs, producing a zero standard deviation that demonstrates its stability. Additionally, when the Zk-AMH algorithm was applied to eight datasets, it achieved the highest mean scores for four datasets, produced an approximately equal score for one dataset, and yielded marginally lower scores for the other three datasets. With its optimality and stability, the Zk-AMH algorithm could be a suitable alternative for developing future clustering tools.

A Support Based Initialization Algorithm for Categorical Data Clustering

Journal of Information Technology Research ◽

10.4018/jitr.2018040104 ◽

2018 ◽

Vol 11 (2) ◽

pp. 53-67

Author(s):

Ajay Kumar ◽

Shishir Kumar

Keyword(s):

Categorical Data ◽

Selection Process ◽

Numerical Data ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Data Object ◽

Data Points ◽

Wu Method ◽

Selection Algorithms

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to a categorical data set. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to get the support of every row. Further, the data object with the largest support is chosen as an initial center, followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
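
The steps described above can be sketched as follows. The abstract leaves several details open, so this is only one possible reading: support is taken to be a value's relative frequency within its attribute, row support is the sum of those frequencies, and distance from the first center is measured by the number of mismatching attributes. All function names are hypothetical.

```python
from collections import Counter

def row_supports(records):
    """Sum, for each row, the relative frequency of each of its attribute values."""
    n = len(records)
    column_counts = [Counter(col) for col in zip(*records)]
    return [
        sum(column_counts[j][value] / n for j, value in enumerate(row))
        for row in records
    ]

def hamming(x, y):
    """Number of attributes on which two records differ (assumed distance)."""
    return sum(1 for a, b in zip(x, y) if a != b)

def support_based_centers(records, k):
    """Pick the highest-support row first, then the distinct rows farthest from it.
    Records must be tuples so that duplicates can be removed with a set."""
    supports = row_supports(records)
    first = records[max(range(len(records)), key=supports.__getitem__)]
    # Sort the remaining distinct rows by decreasing distance from the first
    # center; the row itself is a secondary key so the order is deterministic.
    candidates = sorted(set(records) - {first}, key=lambda r: (-hamming(r, first), r))
    return [first] + candidates[: k - 1]

data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z")]
centers = support_based_centers(data, 2)
```

In this small example the row ("a", "x") has the largest support (0.75 + 0.5), and ("b", "z") differs from it on both attributes, so those two rows are returned as centers.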

A Novel Cosine Similarity Like Data Clustering Method for Effective Data Classification in Data Mining

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.h6417.069820 ◽

2020 ◽

Vol 9 (8) ◽

pp. 340-346

Keyword(s):

Data Mining ◽

Similarity Measure ◽

Categorical Data ◽

Data Clustering ◽

Similarity Measures ◽

Numerical Data ◽

Data Classification ◽

Fundamental Goal ◽

Learning Technique ◽

Categorical Data Clustering

In data mining, many techniques use distance-based measures for data clustering. Improving clustering performance is the fundamental goal in cluster-related tasks. Many techniques are available for clustering numerical data as well as categorical data. Clustering is an unsupervised learning technique in which objects are grouped, or clustered, based on the similarity among them. A new cluster similarity measure, the cosine-like cluster similarity measure (CLCSM), is proposed in this paper. The proposed cluster similarity measure is used for data classification. Extensive experiments are conducted on UCI machine learning datasets. The experimental results show that the proposed cosine-like cluster similarity measure is superior to many of the existing cluster similarity measures for data classification.
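
The abstract does not define CLCSM itself, so the sketch below shows only the ordinary cosine similarity between two numeric vectors, the standard measure that a "cosine-like" measure presumably resembles. It is offered as background rather than as the paper's method; the function name and the convention of returning 0.0 for a zero vector are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: similarity with a zero vector is 0
    return dot / (norm_u * norm_v)
```

Cosine similarity depends only on direction, so vectors such as [1, 2] and [2, 4] score approximately 1, while orthogonal vectors such as [1, 0] and [0, 1] score 0.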

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness (MMR) algorithm, a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome this shortcoming, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.
