Numeric Attributes
Recently Published Documents


TOTAL DOCUMENTS: 46 (five years: 1)

H-INDEX: 12 (five years: 0)

Author(s): Kartik Mehta, Ioana Oprea, Nikhil Rasiwasia


Author(s): Marvin Meeng, Arno Knobbe

Abstract: Subgroup discovery (SD) is an exploratory pattern mining paradigm that comes into its own when dealing with large real-world data, which typically involve many attributes of a mixture of data types. Essential is the ability to deal with numeric attributes, whether they concern the target (a regression setting) or the description attributes (by which subgroups are identified). Various specific algorithms have been proposed in the literature for both cases, but a systematic review of the available options has been missing. This paper presents a generic framework that can be instantiated in various ways to create different strategies for dealing with numeric data. The bulk of the paper describes an experimental comparison of a considerable range of numeric strategies in SD, organised along four central dimensions. The experiments are repeated for both the classification task (nominal target) and the regression task (numeric target), and the strategies are compared on the quality of the top subgroup, and on the quality and redundancy of the top-k result set. Results of three search strategies are compared: traditional beam search, complete search, and a variant of diverse subgroup set discovery called cover-based subgroup selection. Although there are various subtleties in the outcomes of the experiments, the following general conclusion can be drawn: it is often best to determine numeric thresholds dynamically (locally), in a fine-grained manner, with binary splits, while considering multiple candidate thresholds per attribute.
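The recommended strategy above (dynamically chosen, fine-grained binary splits with multiple candidate thresholds per attribute) can be illustrated with a minimal sketch. This is not the paper's framework; it simply scans every observed value of one numeric description attribute as a candidate binary-split threshold and scores each resulting subgroup with WRAcc (weighted relative accuracy), a standard subgroup quality measure for a binary target. All function names are hypothetical.

```python
def wracc(n_sub, pos_sub, n_total, pos_total):
    """WRAcc = coverage * (subgroup positive rate - overall positive rate)."""
    if n_sub == 0:
        return 0.0
    return (n_sub / n_total) * (pos_sub / n_sub - pos_total / n_total)

def best_binary_split(values, labels):
    """Dynamically pick the best threshold on one numeric attribute.

    Every distinct observed value is a candidate threshold; both
    directions ('<=' and '>') are scored. Returns (quality, threshold,
    direction) for the highest-WRAcc subgroup."""
    n_total = len(values)
    pos_total = sum(labels)
    best = (0.0, None, None)
    for t in sorted(set(values)):
        # Counts for the 'value <= t' subgroup; the '>' subgroup is its complement.
        n_le = sum(1 for v in values if v <= t)
        pos_le = sum(y for v, y in zip(values, labels) if v <= t)
        q_le = wracc(n_le, pos_le, n_total, pos_total)
        q_gt = wracc(n_total - n_le, pos_total - pos_le, n_total, pos_total)
        if q_le > best[0]:
            best = (q_le, t, '<=')
        if q_gt > best[0]:
            best = (q_gt, t, '>')
    return best
```

In a beam or complete search, this per-attribute scan would be repeated locally at every refinement step, which is what "dynamic" threshold determination means in contrast to discretising each attribute once, globally, before the search.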



2020, Vol 7 (5), pp. 4304-4316
Author(s): Yanhong Li, Rongbo Zhu, Shiwen Mao, Ashiq Anjum




2019, Vol 8 (3), pp. 1555-1561

In machine learning, clustering methods are the main unsupervised methods. Their objective is to partition a set of objects into homogeneous groups. Clustering methods in general, and Hierarchical Ascending Clustering (HAC) techniques in particular, are based on metrics and ultrametrics. Metrics are used to evaluate the similarity between two objects; ultrametrics are used to estimate the similarity between two groups, or between an element and a group. The main characteristic of these metrics and ultrametrics is that they are only suited to numerical variables, or to variables that can be reduced to them. With the advent of data mining and data science, most datasets to be analyzed contain different types of variables: in the same dataset, numeric attributes, qualitative variables, and free-text fields very often appear together. Despite this diversity of variables in the same dataset, existing clustering methods are generally built to use only a single kind of attribute. In this paper, we propose an approach that takes different types of attributes into account within the same clustering method. The proposed method is a variant of HAC that can handle numerical, qualitative, and textual data together. Our approach is based on a metric called Phi-Similarity, which we developed to estimate the proximity of two objects, each described by a vector of attributes of different types. The method is implemented in the scientific computing language R and applied to real survey data. The results are compared with HAC techniques based on classical metrics, using the Ward criterion for aggregation; for the classical algorithms, we restrict ourselves to the variables of the database compatible with them. This comparison highlights the gain in classification precision brought by our method over the classic versions of HAC.
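The abstract does not define Phi-Similarity, so the following is only an illustrative sketch of the general idea it describes: a single per-pair dissimilarity that combines numeric, categorical, and textual attributes. The sketch uses a generic Gower-style average (range-normalised difference for numerics, simple mismatch for categoricals, Jaccard distance on word sets for text); this is a stand-in, not the paper's metric, and all names are hypothetical.

```python
def mixed_dissimilarity(a, b, types, ranges):
    """Average per-attribute dissimilarity between two records.

    types[i] is 'num', 'cat', or 'text'; ranges[i] is the attribute's
    range (max - min over the dataset) for 'num' attributes, else None.
    NOTE: illustrative Gower-style combination, not the paper's
    Phi-Similarity, whose definition is not given in the abstract."""
    parts = []
    for x, y, t, r in zip(a, b, types, ranges):
        if t == 'num':
            # Range-normalised absolute difference, in [0, 1].
            parts.append(abs(x - y) / r if r else 0.0)
        elif t == 'cat':
            # Simple mismatch: 0 if equal, 1 otherwise.
            parts.append(0.0 if x == y else 1.0)
        else:
            # Jaccard distance on the sets of lowercased words.
            sx, sy = set(x.lower().split()), set(y.lower().split())
            union = sx | sy
            parts.append(1.0 - len(sx & sy) / len(union) if union else 0.0)
    return sum(parts) / len(parts)
```

A dissimilarity matrix built this way can be fed directly to a standard agglomerative clustering routine (e.g. `hclust` on a precomputed distance matrix in R, which is the language the paper uses), so that the HAC machinery itself stays unchanged while the metric handles the mixed attribute types.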








