Context-Based Geodesic Dissimilarity Measure for Clustering Categorical Data

2021, Vol 11 (18), pp. 8416
Author(s): Changki Lee, Uk Jung

Measuring the dissimilarity between two observations is the basis of many data mining and machine learning algorithms, and its effectiveness has a significant impact on learning outcomes. The dissimilarity or distance computation has been a manageable problem for continuous data because many numerical operations can be successfully applied. However, unlike continuous data, defining a dissimilarity between pairs of observations with categorical variables is not straightforward. This study proposes a new method to measure the dissimilarity between two categorical observations, called a context-based geodesic dissimilarity measure, for the categorical data clustering problem. The proposed method considers the relationships between categorical variables and discovers the implicit topological structures in categorical data. In other words, it can effectively reflect the nonlinear patterns of arbitrarily shaped categorical data clusters. Our experimental results confirm that the proposed measure that considers both nonlinear data patterns and relationships among the categorical variables yields better clustering performance than other distance measures.
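The abstract does not spell out the measure's formula, so the following Python sketch only illustrates the general idea under stated assumptions: start from a plain per-variable mismatch dissimilarity (a stand-in for the authors' context-based measure), connect each observation to its k nearest neighbours, and take shortest-path distances so that the dissimilarity follows the nonlinear shape of the data. The function names and the choice of k are illustrative, not the paper's.

import numpy as np
from scipy.sparse.csgraph import shortest_path

def base_dissimilarity(X):
    # Simple-matching stand-in: fraction of variables on which two rows differ.
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        D[i] = (X != X[i]).mean(axis=1)
    return D

def geodesic_dissimilarity(X, k=10):
    # Connect each observation to its k nearest neighbours, then take
    # shortest-path (geodesic) distances so the measure follows the data's shape.
    X = np.asarray(X, dtype=object)
    D = base_dissimilarity(X)
    n = D.shape[0]
    graph = np.full((n, n), np.inf)                 # inf = no edge
    for i in range(n):
        nn = np.argsort(D[i])[1:k + 1]              # k nearest neighbours of i
        graph[i, nn] = np.maximum(D[i, nn], 1e-9)   # tiny floor so 0 is not read as "no edge"
    graph = np.minimum(graph, graph.T)              # make the graph undirected
    return shortest_path(graph, directed=False)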

2020, Vol 109 (9-10), pp. 1779-1802
Author(s): Morteza Haghir Chehreghani, Mostafa Haghir Chehreghani

Abstract: We propose unsupervised representation learning and feature extraction from dendrograms. The commonly used Minimax distance measures correspond to building a dendrogram with the single-linkage criterion and defining specific forms of a level function and a distance function over it. We therefore extend this method to arbitrary dendrograms. We develop a generalized framework wherein different distance measures and representations can be inferred from different types of dendrograms, level functions and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances, in order to enable many numerical machine learning algorithms to employ such distances. Then, to address the model selection problem, we study the aggregation of different dendrogram-based distances, respectively in solution space and in representation space, in the spirit of deep representations. In the first approach, for example for the clustering problem, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects among different solutions, in the context of ensemble methods. Then, we use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the combination of different distances and features sequentially, in the spirit of multi-layered architectures, to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
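As a concrete anchor for the single-linkage special case the abstract builds on, here is a small Python sketch (an assumed implementation, not the authors' code) that obtains Minimax distances by reading the cophenetic distances off a single-linkage dendrogram; generalizing to other dendrograms amounts to changing the linkage criterion and the level and distance functions.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, cophenet

def minimax_distances(X):
    # Minimax distance between two points = the smallest possible value of the
    # largest edge along any path connecting them; for single linkage this equals
    # the cophenetic distance read off the dendrogram.
    pairwise = pdist(X)                     # base (Euclidean) pairwise distances
    Z = linkage(pairwise, method='single')  # single-linkage dendrogram
    return squareform(cophenet(Z))          # square matrix of Minimax distances

A vector-based representation could then be obtained by, for example, applying classical multidimensional scaling to the resulting matrix.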


Author(s): Gi-Wook Cha, Hyeun Jun Moon, Young-Min Kim, Won-Hwa Hong, Jung-Ha Hwang, ...

Recently, artificial intelligence (AI) technologies have been employed to predict construction and demolition (C&D) waste generation. However, most studies have used machine learning models with continuous input variables, applying algorithms such as artificial neural networks, adaptive neuro-fuzzy inference systems, support vector machines, linear regression analysis, decision trees, and genetic algorithms; such algorithms may therefore not perform as well when applied to categorical data. This article uses machine learning to predict C&D waste generation from a dataset containing both categorical variables (e.g., region, building structure, building use, wall material, and roofing material) and continuous data (in particular, gross floor area), as a way to improve the accuracy of waste management in C&D facilities; a random forest (RF) algorithm was used. Results indicate that RF is an adequate machine learning algorithm for a small dataset consisting of categorical data, and that even with a small dataset an adequate prediction model can be developed. Despite the small dataset, the predictive performance according to the demolition waste (DW) type was R (Pearson's correlation coefficient) = 0.691–0.871 and R2 (coefficient of determination) = 0.554–0.800, showing stable prediction performance. High prediction performance was observed using three (for mortar), five (for other DW types), or six (for concrete) input variables. This study is significant because the proposed RF model can predict DW generation using a small amount of data. Additionally, it demonstrates the possibility of applying AI to multi-purpose DW management.
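A minimal sketch of such a pipeline is shown below, assuming a hypothetical file name, column names, and target variable (the study's dataset is not reproduced here): one-hot encode the categorical inputs, keep gross floor area as a continuous feature, fit a random forest, and evaluate cross-validated R and R2.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score
from scipy.stats import pearsonr

# Hypothetical file and column names standing in for the paper's dataset.
df = pd.read_csv("demolition_projects.csv")
categorical = ["region", "building_structure", "building_use",
               "wall_material", "roofing_material"]
X = pd.get_dummies(df[categorical + ["gross_floor_area"]], columns=categorical)
y = df["concrete_waste"]                      # one demolition-waste type at a time

rf = RandomForestRegressor(n_estimators=500, random_state=0)
pred = cross_val_predict(rf, X, y, cv=5)      # cross-validated predictions
r, _ = pearsonr(y, pred)
print(f"R = {r:.3f}, R2 = {r2_score(y, pred):.3f}")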


Entropy, 2019, Vol 21 (6), pp. 580
Author(s): Albert No

We establish the universality of logarithmic loss over a finite alphabet as a distortion criterion in fixed-length lossy compression. For any fixed-length lossy-compression problem under an arbitrary distortion criterion, we show that there is an equivalent lossy-compression problem under logarithmic loss. The equivalence is strong in the sense that finding good schemes in the corresponding lossy-compression problem under logarithmic loss is essentially equivalent to finding good schemes in the original problem. This equivalence relation also provides an algebraic structure in the reconstruction alphabet, which allows us to use known techniques from the clustering literature. Furthermore, our result naturally suggests a new clustering algorithm for the categorical data-clustering problem.
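For reference, the standard definition of logarithmic loss used in this literature takes the reconstruction alphabet to be the set of probability distributions over the finite source alphabet:

\[
  d_{\log}(x, q) \;=\; \log \frac{1}{q(x)},
  \qquad x \in \mathcal{X}, \quad q \in \mathcal{P}(\mathcal{X}),
\]

where $\mathcal{X}$ is the finite source alphabet and $\mathcal{P}(\mathcal{X})$ is the set of probability distributions on it; reproducing $x$ with a distribution that puts little mass on it incurs a large distortion.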


2020, Vol 13 (5), pp. 1020-1030
Author(s): Pradeep S., Jagadish S. Kallimani

Background: With the advent of data analysis and machine learning, there is a growing impetus toward analyzing and generating models on historical data. The data comes in numerous forms and shapes, with an abundance of challenges. The most sought-after form of data for analysis is numerical data; with the plethora of available algorithms and tools, such data is quite manageable. Another form of data is categorical in nature, subdivided into ordinal (ordered) and nominal (unordered) types. This data can be broadly classified as sequential and non-sequential, and sequential data is easier to preprocess using existing algorithms. Objective: This paper deals with the challenge of applying machine learning algorithms to categorical data of a non-sequential nature. Methods: Upon applying several data analysis algorithms to such data, we end up with biased results, which makes it impossible to generate a reliable predictive model. In this paper, we address this problem by walking through a handful of techniques which, during our research, helped us deal with a large categorical dataset of non-sequential nature. In subsequent sections, we discuss the possible implementable solutions and the shortfalls of these techniques. Results: The methods are applied to sample datasets available in the public domain, and the results with respect to classification accuracy are satisfactory. Conclusion: The best pre-processing technique we observed in our research is one-hot encoding, which breaks the categorical features into binary indicators that can be fed into an algorithm to predict the outcome. The example that we took is not abstract; it is a real-time production-services dataset with many complex variations of categorical features. Our future work includes creating a robust model on such data and deploying it into industry-standard applications.
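A minimal illustration of the one-hot encoding step recommended in the conclusion, with invented toy rows standing in for the production-services dataset (which is not reproduced here):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented toy rows; column names are illustrative only.
df = pd.DataFrame({
    "service_type": ["web", "batch", "web", "stream"],
    "region":       ["eu", "us", "us", "apac"],
    "failed":       [0, 1, 0, 1],
})
X = pd.get_dummies(df[["service_type", "region"]])  # one binary column per category
y = df["failed"]

clf = LogisticRegression().fit(X, y)                # any classifier can now consume X
print(list(X.columns))
print(clf.predict(X))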


2021, Vol ahead-of-print (ahead-of-print)
Author(s): Ripsy Bondia, Pratap C. Biswal, Abinash Panda

Purpose: Can something that drives our initial attention toward a stock have any implications for the final decision to buy it? This paper empirically and statistically tests the association, if any, between factors fostering attention toward a stock and the rationales for buying it.
Design/methodology/approach: This paper uses survey responses of individual investors involving multiple-response categorical data. The association between attention-fostering factors and rationales is tested using a modified first-order corrected Rao-Scott chi-square test statistic (to adjust for within-participant dependence among responses in the case of multiple-response categorical variables). Further, odds ratios and mosaic plots are used to determine the effect size of the association.
Findings: A strong association is seen between attention-fostering factors and the rationales for buying a stock. Further, the strongest associations are seen in cases where both stem from the same underlying influencing factor. Some of the most-cited attention-fostering factors and rationales in this research stem from familiarity bias and expert bias.
Practical implications: What starts as a trivial attention-fostering factor, which may not even be recognized by most investors, can go on to become one of the rationales for buying a stock. This can have substantial financial implications for an individual investor. Investor education agencies and regulatory authorities can make investors cognizant of this association, which can help investors improve and adjust their decision-making accordingly.
Originality/value: The extant literature discusses factors/biases influencing the buying decisions of individual investors. This research takes a step further by distinguishing these factors in terms of whether they play the role of (1) fostering attention toward a stock or (2) reasons for ultimately buying it. Such a dissection of factors/biases has, to the best of the authors' knowledge, not been done previously in any empirical and statistical analysis. The paper uses multiple-response categorical data and applies a modified first-order corrected Rao-Scott chi-square statistic to test association. This test statistic has not previously been applied in the context of individual investor decision-making.
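In general terms, the first-order Rao-Scott adjustment divides the ordinary Pearson statistic by an estimated average design effect before referring it to the usual chi-square distribution; the exact multiple-response variant used by the authors may differ in how the correction factor is estimated:

\[
  X^2_{RS} \;=\; \frac{X^2_{P}}{\hat{\delta}}, \qquad
  \hat{\delta} \;=\; \frac{1}{d}\sum_{i=1}^{d}\hat{\delta}_i ,
\]

where $X^2_P$ is the Pearson chi-square statistic computed on the multiple-response table, the $\hat{\delta}_i$ are estimated generalized design effects capturing the within-participant dependence among responses, and $d$ is the degrees of freedom of the reference chi-square distribution.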


In data mining, many techniques use distance-based measures for data clustering, and improving clustering performance is a fundamental goal in cluster-related tasks. Many techniques are available for clustering numerical as well as categorical data. Clustering is an unsupervised learning technique in which objects are grouped based on the similarity among them. A new cluster similarity measure, the cosine-like cluster similarity measure (CLCSM), is proposed in this paper. The proposed cluster similarity measure is used for data classification. Extensive experiments are conducted on UCI machine learning repository datasets. The experimental results show that the proposed cosine-like cluster similarity measure is superior to many existing cluster similarity measures for data classification.
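The abstract does not define CLCSM, so the snippet below is only an illustrative guess at the flavour of such a measure: represent each cluster by a vector of per-variable category frequencies and compare clusters with ordinary cosine similarity. The representation and all function names are assumptions, not the paper's definition.

import numpy as np

def category_frequency_vector(cluster_rows, categories_per_var):
    # One long vector of per-variable category frequencies for a cluster
    # (illustrative representation only; not the paper's CLCSM).
    parts = []
    for j, cats in enumerate(categories_per_var):
        col = [row[j] for row in cluster_rows]
        parts.extend(col.count(c) / len(col) for c in cats)
    return np.array(parts)

def cosine_like_similarity(u, v):
    # Ordinary cosine similarity between two cluster representations.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cats = [["red", "green"], ["small", "large"]]
cluster_a = [("red", "small"), ("red", "large")]
cluster_b = [("green", "large"), ("red", "large")]
print(cosine_like_similarity(category_frequency_vector(cluster_a, cats),
                             category_frequency_vector(cluster_b, cats)))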

