Efficient binary embedding of categorical data using BinSketch

Background: With the rise of data analysis and machine learning, there is growing interest in analyzing historical data and building models from it. Such data comes in many forms, each with its own challenges. Numerical data is the most tractable form for analysis, and a wide range of algorithms and tools can handle it. Categorical data is subdivided into ordinal (ordered categories) and nominal (unordered, named categories), and can be broadly classified as sequential or non-sequential. Sequential data is easier to preprocess algorithmically.

Objective: This paper addresses the challenge of applying machine learning algorithms to non-sequential categorical data.

Methods: Applying several data analysis algorithms directly to such data yields biased results, which makes it impossible to build a reliable predictive model. We address this problem by walking through a handful of techniques that helped us handle a large non-sequential categorical dataset during our research, and we discuss the feasible solutions and the shortfalls of each technique.

Results: The methods are applied to publicly available sample datasets, and the resulting classification accuracy is satisfactory.

Conclusion: The best preprocessing technique we observed is one-hot encoding, which breaks each categorical feature down into binary features that are then fed to an algorithm to predict the outcome. The example we used is not abstract; it is a real-time production services dataset with many complex variations of categorical features. Future work includes building a robust model on such data and deploying it in industry-standard applications.
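The one-hot encoding step named in the conclusion can be illustrated with a minimal sketch. This is not the authors' code: the function `one_hot_encode` and the `service_type` feature are hypothetical names chosen for illustration, and the example assumes a single nominal feature held in a plain Python list.

```python
def one_hot_encode(values):
    """Map each categorical value to a binary indicator vector.

    Returns the sorted list of distinct categories and one row per input
    value, with a single 1 in the column of that value's category.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical nominal feature (illustrative values, not from the paper).
service_type = ["email", "storage", "email", "compute"]
categories, encoded = one_hot_encode(service_type)

print(categories)  # ['compute', 'email', 'storage']
for row in encoded:
    print(row)
```

Each distinct category becomes its own binary column, so the encoded width grows with the number of categories in the feature.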

Low Dimensional Representation of Space Structure and Clustering of Categorical Data

2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom) ◽ 10.1109/bdcloud.2018.00161 ◽ 2018

Author(s): Jianjun Cao ◽ Qibin Zheng ◽ Nianfeng Weng ◽ Xingchun Diao

Keyword(s): Categorical Data ◽ Space Structure ◽ Dimensional Representation ◽ Representation Of Space ◽ Low Dimensional

Bayesian Models for Categorical Data

10.1002/0470092394 ◽ 2005 ◽ Cited By ~ 103

Author(s): Peter Congdon

Keyword(s): Categorical Data ◽ Bayesian Models

Categorical Data Analysis for Geographers and Environmental Scientists

Economic Geography ◽ 10.2307/144098 ◽ 1986 ◽ Vol 62 (2) ◽ pp. 192 ◽ Cited By ~ 1

Author(s): Joel L. Horowitz ◽ Neil Wrigley

Keyword(s): Data Analysis ◽ Categorical Data ◽ Categorical Data Analysis

Multivariate analysis of categorical data with applications to road safety research. Accid. Anal. & Prev. 1, 217–221.

Accident Analysis & Prevention ◽ 10.1016/0001-4575(69)90045-1 ◽ 1969 ◽ Vol 1 (3) ◽ pp. 307

Author(s): M.J. Koornstra

Keyword(s): Multivariate Analysis ◽ Categorical Data ◽ Road Safety ◽ Safety Research

Review of "Analyzing Qualitative/Categorical Data: Log-Linear Models and Latent-Structure Analysis, by Leo A. Goodman", Abt Books, 1978

ACM SIGSIM Simulation Digest ◽ 10.1145/1102815.1102830 ◽ 1979 ◽ Vol 10 (4) ◽ pp. 69-69

Keyword(s): Structure Analysis ◽ Categorical Data ◽ Linear Models ◽ Latent Structure ◽ Latent Structure Analysis ◽ Log Linear ◽ Data Log

Deep Neural Networks Classification via Binary Error-Detecting Output Codes

Applied Sciences ◽ 10.3390/app11083563 ◽ 2021 ◽ Vol 11 (8) ◽ pp. 3563

Author(s): Martin Klimo ◽ Peter Lukáč ◽ Peter Tarábek

Keyword(s): Neural Networks ◽ Categorical Data ◽ Coding Theory ◽ Hamming Distance ◽ Activation Function ◽ Block Codes ◽ Ease Of Use ◽ Linear Block Codes ◽ Binary Coding ◽ The One

One-hot encoding is the prevalent method used in neural networks to represent multi-class categorical data. Its success stems from its ease of use and its interpretability as a probability distribution when paired with a softmax activation function. However, one-hot encoding leads to very high-dimensional vector representations when the cardinality of the categorical data is high. From a coding theory perspective, the Hamming distance between one-hot codewords is two, which does not allow error detection or error correction. Binary coding offers more ways to encode categorical data into output codes, which mitigates these limitations of one-hot encoding. We propose a novel method based on Zadeh fuzzy logic to train binary output codes holistically. We study linear block codes because they can separate class information from the checksum part of the codeword, and we show that they can not only detect recognition errors by computing a non-zero syndrome but also evaluate the truth value of the decision. Experimental results show that the proposed approach achieves results similar to one-hot encoding with a softmax function in terms of accuracy, reliability, and out-of-distribution performance. This suggests a good foundation for future applications, mainly classification tasks with a large number of classes.
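Two of the coding theory notions in this abstract can be sketched in a few lines: the pairwise Hamming distance of one-hot codewords, and syndrome computation for a linear block code. The sketch below is an illustration under stated assumptions, not the authors' method. It uses the textbook Hamming(7,4) parity-check matrix as a stand-in linear block code (the abstract does not say which codes were used), and it does not cover the fuzzy-logic training. All function names are hypothetical.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Count the positions where two equal-length binary words differ."""
    return sum(x != y for x, y in zip(a, b))


def one_hot_codes(n):
    """Return the n one-hot codewords of length n."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def min_distance(codes):
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codes, 2))


# Assumption: the textbook Hamming(7,4) code stands in for a linear block code.
# Column j of H (1-based) is the binary representation of j, least
# significant bit in the first row.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def syndrome(word):
    """Compute H * word over GF(2); all zeros means a valid codeword."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]


d_min = min_distance(one_hot_codes(5))  # any two one-hot words differ in 2 bits

valid = [1, 1, 1, 0, 0, 0, 0]           # a codeword of Hamming(7,4)
corrupted = valid.copy()
corrupted[4] ^= 1                        # flip the bit at 1-based position 5

s_valid = syndrome(valid)
s_bad = syndrome(corrupted)
# For this H, the syndrome read as a binary number gives the flipped position.
error_position = sum(bit << i for i, bit in enumerate(s_bad))

print(d_min, s_valid, s_bad, error_position)  # 2 [0, 0, 0] [1, 0, 1] 5
```

A non-zero syndrome flags that the received word is not a codeword, which is the detection mechanism the abstract refers to. Note also that one-hot codes need n output bits for n classes, whereas a binary code needs only about log2(n) bits plus any check bits.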
