Differentially Private Distance Learning in Categorical Data

Author(s):  
Elena Battaglia ◽  
Simone Celano ◽  
Ruggero G. Pensa

Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the complexity of the relationships among the different values of a categorical attribute. Although distance learning algorithms exist for categorical data, they may disclose private information about individual records if applied to a secret dataset. To address this problem, we introduce a differentially private family of algorithms for learning distances between any pair of values of a categorical attribute according to the way they are co-distributed with the values of other categorical attributes forming the so-called context. We define different variants of our algorithm and show empirically that our approach consumes little privacy budget while providing accurate distances, making it suitable for distance-based applications such as clustering and classification.
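The paper's actual algorithms are not reproduced here; as a rough illustration of the general idea, one can perturb the co-occurrence counts that define a value's context profile with Laplace noise before deriving value-to-value distances. All names below, and the choice of L1 distance between noisy conditional profiles, are illustrative assumptions:

```python
import random
from collections import Counter

def laplace_noise(scale):
    # Laplace(0, scale) sampled as the difference of two exponentials.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_value_distances(records, target, context, epsilon):
    """Noisy distances between values of `target`, based on how each value
    co-occurs with the values of the `context` attribute. Adding or removing
    one record changes each count by at most 1, so the Laplace scale is
    1 / epsilon."""
    counts = Counter((r[target], r[context]) for r in records)
    values = sorted({r[target] for r in records})
    ctx_vals = sorted({r[context] for r in records})
    profiles = {}
    for v in values:
        # Clip noisy counts at zero, then normalize into a profile.
        noisy = [max(counts[(v, c)] + laplace_noise(1 / epsilon), 0.0)
                 for c in ctx_vals]
        total = sum(noisy) or 1.0
        profiles[v] = [x / total for x in noisy]
    # Distance between two values = L1 distance between their profiles.
    return {(a, b): sum(abs(x - y) for x, y in zip(profiles[a], profiles[b]))
            for i, a in enumerate(values) for b in values[i + 1:]}
```

With a large epsilon the noise is small and the distances approach the non-private profile distances; with a small epsilon they become dominated by noise, which is the privacy/accuracy trade-off the paper's budget analysis addresses.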

Author(s):  
Joshua Zhexue Huang

A lot of data in real-world databases are categorical. For example, gender, profession, position, and hobby of customers are usually defined as categorical attributes in the CUSTOMER table. Each categorical attribute is represented with a small set of unique categorical values, such as {Female, Male} for the gender attribute. Unlike numeric data, categorical values are discrete and unordered. Therefore, clustering algorithms for numeric data cannot be used to cluster the categorical data that exist in many real-world applications. In data mining research, much effort has been put into developing new techniques for clustering categorical data (Huang, 1997b; Huang, 1998; Gibson, Kleinberg, & Raghavan, 1998; Ganti, Gehrke, & Ramakrishnan, 1999; Guha, Rastogi, & Shim, 1999; Chaturvedi, Green, Carroll, & Foods, 2001; Barbara, Li, & Couto, 2002; Andritsos, Tsaparas, Miller, & Sevcik, 2003; Li, Ma, & Ogihara, 2004; Chen, & Liu, 2005; Parmar, Wu, & Blackhurst, 2007). The k-modes clustering algorithm (Huang, 1997b; Huang, 1998) is one of the first algorithms for clustering large categorical datasets. In the past decade, this algorithm has been well studied and widely used in various applications. It is also adopted in commercial software (e.g., Daylight Chemical Information Systems, Inc., http://www.daylight.com/).
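A minimal sketch of one k-modes iteration, using the simple-matching dissimilarity (count of differing attributes) and the per-attribute mode update that characterize Huang's algorithm:

```python
from collections import Counter

def matching_dissim(a, b):
    # Simple matching dissimilarity: number of attributes that differ.
    return sum(x != y for x, y in zip(a, b))

def kmodes_step(objects, modes):
    """One assignment + mode-update iteration of k-modes.
    Objects and modes are tuples of categorical values."""
    clusters = [[] for _ in modes]
    for obj in objects:
        j = min(range(len(modes)), key=lambda i: matching_dissim(obj, modes[i]))
        clusters[j].append(obj)
    new_modes = []
    for cluster, old in zip(clusters, modes):
        if not cluster:
            new_modes.append(old)  # keep the old mode for an empty cluster
            continue
        # New mode: the most frequent value in each attribute column.
        new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                               for col in zip(*cluster)))
    return clusters, new_modes
```

Iterating `kmodes_step` until the modes stop changing gives the basic algorithm; the full method also specifies initial mode selection, which is omitted here.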


2020 ◽  
Vol 9 (1) ◽  
pp. 1607-1612

A new technique is proposed for splitting categorical data during decision tree learning. The technique is based on class-probability representations and manipulations of the class labels corresponding to the distinct values of categorical attributes. For each categorical attribute, an aggregate similarity in terms of class probabilities is computed; the attribute with the highest aggregate similarity is selected as the best split attribute, and the data in the current node of the decision tree are divided into as many subsets as there are distinct values of that attribute. Many experiments conducted with the proposed method show that it outperforms many competitive methods in terms of efficiency, ease of use, understandability, and output quality, and that it is useful in many modern applications.
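The abstract does not specify the exact aggregate similarity measure. The sketch below assumes, purely for illustration, that it is the mean pairwise cosine similarity between the class-probability vectors of an attribute's distinct values; the function and variable names are likewise hypothetical:

```python
import math
from collections import Counter, defaultdict

def class_probs(rows, attr):
    """Class-probability vector for each distinct value of `attr`.
    Rows are (feature_dict, label) pairs."""
    by_value = defaultdict(Counter)
    for feats, label in rows:
        by_value[feats[attr]][label] += 1
    labels = sorted({label for _, label in rows})
    return {v: [c[l] / sum(c.values()) for l in labels]
            for v, c in by_value.items()}

def aggregate_similarity(rows, attr):
    # Assumed aggregate: mean pairwise cosine similarity of the
    # class-probability vectors of the attribute's distinct values.
    probs = list(class_probs(rows, attr).values())
    if len(probs) < 2:
        return 1.0
    def cos(p, q):
        dot = sum(a * b for a, b in zip(p, q))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
        return dot / norm
    pairs = [(p, q) for i, p in enumerate(probs) for q in probs[i + 1:]]
    return sum(cos(p, q) for p, q in pairs) / len(pairs)

def best_split_attribute(rows, attrs):
    # Per the abstract, the attribute with the highest aggregate
    # similarity is chosen as the split attribute.
    return max(attrs, key=lambda a: aggregate_similarity(rows, a))
```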


Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 184
Author(s):  
Xia Que ◽  
Siyuan Jiang ◽  
Jiaoyun Yang ◽  
Ning An

Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine and biology. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity. However, different attributes may contribute differently, as the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to capture the different importance of the various attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods, with 2.13% and 4.28% improvements in accuracy, respectively, on six mixed datasets from UCI.
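The exact weighting scheme is not given in the abstract. A plausible sketch, assuming that attributes with lower normalized entropy (i.e., more concentrated value distributions) receive more weight, might look like:

```python
import math
from collections import Counter

def shannon_entropy(values):
    # Shannon entropy (in bits) of a column of categorical values.
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_weights(columns):
    """Hypothetical weighting: weight = 1 - normalized entropy,
    then normalized so the weights sum to 1."""
    raw = []
    for col in columns:
        k = len(set(col))
        if k > 1:
            raw.append(1.0 - shannon_entropy(col) / math.log2(k))
        else:
            raw.append(1.0)  # a constant column is maximally concentrated
    total = sum(raw) or 1.0
    return [w / total for w in raw]

def weighted_similarity(a, b, weights):
    # Weighted simple-matching similarity between two categorical objects.
    return sum(w for x, y, w in zip(a, b, weights) if x == y)
```

A uniformly distributed attribute gets weight 0 under this choice, so it no longer contributes to the similarity, which is the qualitative behavior the abstract describes.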


Author(s):  
SUNG-GI LEE ◽  
DEOK-KYUN YUN

In this paper, we present a concept based on the similarity of categorical attribute values considering implicit relationships and propose a new and effective clustering procedure for mixed data. Our procedure obtains similarities between categorical values from careful analysis and maps the values in each categorical attribute into points in two-dimensional coordinate space using multidimensional scaling. These mapped values make it possible to interpret the relationships between attribute values and to directly apply categorical attributes to clustering algorithms using a Euclidean distance. After trivial modifications, our procedure for clustering mixed data uses the k-means algorithm, well known for its efficiency in clustering large data sets. We use the familiar soybean disease and adult data sets to demonstrate the performance of our clustering procedure. The satisfactory results that we have obtained demonstrate the effectiveness of our algorithm in discovering structure in data.
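The paper's similarity analysis is not reproduced here. As one simple stand-in, the sketch below measures the dissimilarity between two values of a categorical attribute as the total variation distance between their conditional distributions over another attribute; the resulting matrix is exactly the kind of input that multidimensional scaling could embed into 2-D points for Euclidean k-means:

```python
from collections import Counter, defaultdict

def value_dissimilarity(rows, attr, context):
    """Dissimilarity between values of categorical `attr`, taken as the
    total variation distance between their conditional distributions over
    a `context` attribute (an illustrative choice, not the paper's exact
    analysis). Rows are dicts mapping attribute names to values."""
    cond = defaultdict(Counter)
    for r in rows:
        cond[r[attr]][r[context]] += 1
    ctx_vals = sorted({r[context] for r in rows})
    values = sorted(cond)
    dist = {}
    for i, a in enumerate(values):
        pa = [cond[a][c] / sum(cond[a].values()) for c in ctx_vals]
        for b in values[i + 1:]:
            pb = [cond[b][c] / sum(cond[b].values()) for c in ctx_vals]
            dist[(a, b)] = 0.5 * sum(abs(x - y) for x, y in zip(pa, pb))
    return dist

# The resulting dissimilarity matrix could then be passed to MDS
# (e.g. sklearn.manifold.MDS with dissimilarity="precomputed") to map
# each categorical value to a 2-D point usable with Euclidean k-means.
```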


Author(s):  
Maria M. Suarez-Alvarez ◽  
Duc-Truong Pham ◽  
Mikhail Y. Prostov ◽  
Yuriy I. Prostov

Normalization of feature vectors of datasets is widely used in a number of fields of data mining, in particular in cluster analysis, where it is used to prevent features with large numerical values from dominating distance-based objective functions. In this study, a unified statistical approach to normalization of all attributes of mixed databases, when different metrics are used for numerical and categorical data, is proposed. After the proposed normalization, the contributions of both numerical and categorical attributes to a specified objective function are statistically the same. Formulae for the statistically normalized Minkowski mixed p-metrics are given in an explicit way. It is shown that the classic z-score standardization and the min-max normalization are particular cases of the statistical normalization, when the objective function is based on the Euclidean or the Tchebycheff (Chebyshev) metric, respectively. Finally, clustering of several benchmark datasets is performed with non-normalized and the introduced normalized mixed metrics, using either the k-prototypes algorithm (for p = 2) or another algorithm (for p ≠ 2).
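The two special cases mentioned above are easy to state directly; the mixed p-metric shown is a simplified sketch (unit weights, 0/1 mismatches on the categorical part) rather than the paper's exact statistically normalized formulation:

```python
def z_score(xs):
    """Classic z-score standardization, the special case associated with
    the Euclidean (p = 2) objective."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

def min_max(xs):
    """Min-max normalization, the special case associated with the
    Tchebycheff (Chebyshev) objective."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def minkowski_mixed(x_num, y_num, x_cat, y_cat, p=2):
    """Simplified mixed Minkowski p-metric: p-th powers of numeric
    differences plus 0/1 categorical mismatches, then the p-th root."""
    num = sum(abs(a - b) ** p for a, b in zip(x_num, y_num))
    cat = sum(float(a != b) for a, b in zip(x_cat, y_cat))
    return (num + cat) ** (1 / p)
```

After z-scoring or min-max scaling the numeric part, the numeric and categorical terms in `minkowski_mixed` operate on comparable scales, which is the intuition behind the statistical normalization.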


Author(s):  
Byoungwook KIM

The k-means algorithm is one of the most popular and widely used clustering algorithms; however, it is limited to numeric data. The k-prototypes algorithm is one of the best-known algorithms for dealing with both numeric and categorical data. However, there have been no studies on accelerating the k-prototypes algorithm. In this paper, we propose a new fast k-prototypes algorithm that gives the same answer as the original k-prototypes. The proposed algorithm avoids unnecessary distance computations using partial distance computation. Our k-prototypes algorithm finds the minimum distance without computing the distances of all attributes between an object and a cluster center, which reduces the time complexity. Partial distance computation uses the fact that the maximum difference between two values of a categorical attribute is 1; if data objects have m categorical attributes, the maximum categorical difference between an object and a cluster center is m. Our algorithm first computes the distance using only the numeric attributes. If the difference between the minimum numeric distance and the second smallest exceeds m, we can find the minimum distance between an object and a cluster center without computing distances on the categorical attributes. Experiments show that the proposed k-prototypes algorithm achieves better computational performance than the original k-prototypes algorithm on our datasets.
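The pruning rule is straightforward to sketch: if the numeric-only gap between the best and second-best centers already exceeds the worst-case categorical contribution (gamma * m, since each categorical mismatch costs at most 1), the categorical comparisons can be skipped entirely. The function below is an illustrative reconstruction; it assumes at least two centers, squared Euclidean numeric distance, and a weight `gamma` on the categorical part:

```python
def partial_distance_assign(obj_num, obj_cat, centers_num, centers_cat, gamma=1.0):
    """Assign an object to its nearest k-prototypes center, skipping the
    categorical comparisons when the numeric part already decides."""
    m = len(obj_cat)
    num_d = [sum((a - b) ** 2 for a, b in zip(obj_num, c)) for c in centers_num]
    order = sorted(range(len(num_d)), key=num_d.__getitem__)
    best, second = order[0], order[1]
    # Full distance of `best` is at most num_d[best] + gamma*m, and every
    # other full distance is at least its numeric distance. So if the
    # numeric gap exceeds gamma*m, no categorical cost can flip the winner.
    if num_d[second] - num_d[best] > gamma * m:
        return best
    full = [num_d[i] + gamma * sum(a != b for a, b in zip(obj_cat, centers_cat[i]))
            for i in range(len(num_d))]
    return min(range(len(full)), key=full.__getitem__)
```

The skip branch is exact, not approximate, which is why the accelerated algorithm can return the same answer as the original.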


2020 ◽  
Vol 4 (3) ◽  
pp. 448-461
Author(s):  
Debora Chrisinta ◽  
I Made Sumertajaya ◽  
Indahwati Indahwati

Most traditional clustering algorithms are designed to focus either on numeric data or on categorical data, while data collected in the real world often contain both numeric and categorical attributes, making it difficult to apply traditional clustering algorithms directly. This paper therefore aims to identify the best method for mixed data among the cluster ensemble and latent class clustering (LCC) approaches. Cluster ensemble is a method that combines the clustering results of two sub-datasets, one containing the categorical variables and one containing the numerical variables; clustering algorithms designed for each type of data are employed to produce the corresponding clusters. Latent class clustering, on the other hand, is a model-based clustering approach applicable to any type of data, in which the number of clusters is based on the estimated probability model. LCC is recommended as the best method, providing higher accuracy and the smallest standard deviation ratio. However, both LCC and the cluster ensemble method produce evaluation values that are not very different when applied to potential village data in Bengkulu Province.
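The abstract does not detail how the two sub-clusterings are combined. One standard ensemble-combination device, shown here purely as an illustration, is the co-association matrix: the fraction of base clusterings in which each pair of objects lands in the same cluster, which can then itself be clustered:

```python
from itertools import combinations

def co_association(labelings):
    """Co-association matrix of an ensemble: entry (i, j) is the fraction
    of labelings in which objects i and j share a cluster. Each labeling
    is a list of cluster ids, one per object; the diagonal is left at 0."""
    n = len(labelings[0])
    m = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                m[i][j] += 1
                m[j][i] += 1
    k = len(labelings)
    return [[v / k for v in row] for row in m]
```

For the two-view case in the paper, the ensemble would contain one labeling from the categorical sub-dataset and one from the numerical sub-dataset.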


Author(s):  
Hong-Jun Jang ◽  
Byoungwook Kim ◽  
Jongwan Kim ◽  
Soon-Young Jung

Data mining plays a critical role in sustainable decision making. The k-prototypes algorithm is one of the best-known algorithms for clustering both numeric and categorical data. Despite this, clustering a large number of spatial objects with mixed numeric and categorical attributes is still inefficient due to the algorithm's high time complexity. In this paper, we propose efficient grid-based k-prototypes algorithms, GK-prototypes, which achieve high performance for clustering spatial objects. The first proposed algorithm utilizes both the maximum and minimum distances between cluster centers and a cell, which removes unnecessary distance calculations. The second proposed algorithm extends the first by exploiting spatial dependence, the tendency of spatial data to be more similar the closer the objects are. Each cell has a bitmap index that stores the categorical values of all objects in the same cell for each attribute; this bitmap index improves performance when the categorical data are skewed. Our evaluation experiments showed that the proposed algorithms achieve better performance than the existing pruning technique in the k-prototypes algorithm.
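A hedged sketch of the first algorithm's pruning idea: compute the minimum and maximum squared distances between an axis-aligned grid cell and each center, then discard any center whose minimum distance to the cell exceeds some other center's maximum distance, since that center can never be nearest for any object in the cell. The function names and the squared-Euclidean choice are illustrative:

```python
def cell_center_bounds(cell_min, cell_max, center):
    """Min and max squared Euclidean distance from any point of an
    axis-aligned cell [cell_min, cell_max] to a cluster center."""
    dmin = dmax = 0.0
    for lo, hi, c in zip(cell_min, cell_max, center):
        near = min(max(c, lo), hi)                  # closest coordinate in [lo, hi]
        far = lo if abs(c - lo) > abs(c - hi) else hi  # farthest corner coordinate
        dmin += (c - near) ** 2
        dmax += (c - far) ** 2
    return dmin, dmax

def prune_centers(cell_min, cell_max, centers):
    """Indices of centers that might still be nearest for some object in
    the cell: those whose minimum distance does not exceed the smallest
    maximum distance over all centers."""
    bounds = [cell_center_bounds(cell_min, cell_max, c) for c in centers]
    best_max = min(b[1] for b in bounds)
    return [i for i, (dmin, _) in enumerate(bounds) if dmin <= best_max]
```

Only the surviving centers need full mixed-attribute distance computations for the objects in that cell, which is where the speedup comes from.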

