categorical attributes
Recently Published Documents



2021 ◽  
Vol 25 (6) ◽  
pp. 1349-1368
Author(s):  
Chung-Chian Hsu ◽  
Wei-Cyun Tsao ◽  
Arthur Chang ◽  
Chuan-Yu Chang

Most real-world datasets are of mixed type, containing both numeric and categorical attributes. Unlike numbers, categorical values support only limited operations, and the degree of similarity between distinct values cannot be measured directly. Properly analyzing mixed-type data therefore requires dedicated methods for handling categorical values. Most existing methods are limited by the lack of an appropriate numeric representation of categorical values; consequently, some analysis algorithms cannot be applied. In this paper, we address this deficiency by transforming categorical values into numeric representations so as to facilitate various analyses of mixed-type data. In particular, the proposed transformation preserves the semantics of categorical values with respect to the other values in the dataset, resulting in better performance in data analyses including classification and clustering. The proposed method is verified and compared with other methods on an extensive set of real-world datasets.


2021 ◽  
Author(s):  
R. Padmaja ◽  
V. Santhi

Privacy Preserving Data Publishing (PPDP) has recently become a vital research area because of the rapidly increasing volume of data published on the Internet. Many organizations need to publish their data on the Internet for research and analysis purposes, but there is no guarantee that the data will be used only for ethical purposes. Data anonymization therefore plays a vital role in preventing identity disclosure, and it also restricts the amount of data that external users can see or use. Among PPDP techniques, which include data encryption, data anonymization, and data perturbation, anonymization is extensively used. Mondrian is one such data anonymization technique, and it has outperformed many anonymization algorithms because of its fast and scalable nature. However, the algorithm requires categorical values to be encoded as numerical values, and later decoded, in order to generalize the data. To overcome this problem, an extended version of the Mondrian algorithm, called XMondrian, is proposed. The proposed algorithm can handle both numerical and categorical attributes without encoding or decoding the categorical values. The effectiveness of the proposed algorithm has been analysed experimentally, and the results show that XMondrian outperforms the existing Mondrian algorithm in terms of anonymization time and Cavg, one of the metrics used to quantify the utility of data.
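
The abstract does not define Cavg. The following is a minimal sketch, assuming it denotes the normalized average equivalence class size commonly reported for Mondrian-style anonymization, that is, (number of records / number of equivalence classes) / k. The function name, the record layout, and the sample data are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def c_avg(records, quasi_identifiers, k):
    """Assumed definition: (n / number of equivalence classes) / k."""
    if not records:
        raise ValueError("records must not be empty")
    # Records sharing the same generalized quasi-identifier values
    # form one equivalence class.
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return (len(records) / len(classes)) / k


# Hypothetical, already-anonymized table with two equivalence classes.
anonymized = [
    {"age": "20-29", "city": "North"},
    {"age": "20-29", "city": "North"},
    {"age": "30-39", "city": "South"},
    {"age": "30-39", "city": "South"},
]
score = c_avg(anonymized, ["age", "city"], k=2)
print(score)
```

Under this definition a value close to 1 means the equivalence classes are close to the minimum size k, which generally corresponds to less generalization and higher utility.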


2021 ◽  
Vol 12 (4) ◽  
pp. 101-124
Author(s):  
Makhlouf Ledmi ◽  
Hamouma Moumen ◽  
Abderrahim Siam ◽  
Hichem Haouassi ◽  
Nabil Azizi

Association rule mining is a data mining method that aims to discover explicit relations between the different attributes in a large dataset. In practice, however, many datasets contain both numeric and categorical attributes. Recently, many nature-inspired meta-heuristic algorithms have been developed for solving continuous problems. This article proposes a new algorithm, DCSA-QAR, for mining quantitative association rules based on the crow search algorithm (CSA). To accomplish this, new operators are defined to improve exploration of the search space and to ensure the transition from the continuous to the discrete version of CSA. Moreover, a new discretization algorithm is adopted for numerical attributes that takes into account dependencies that may exist between attributes. Finally, to evaluate its performance, DCSA-QAR is compared with particle swarm optimization and with mono- and multi-objective evolutionary approaches for mining association rules. The results obtained on real-world datasets show the outstanding performance of DCSA-QAR in terms of quality measures.
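
The abstract does not list its quality measures. As a rough illustration only, the sketch below computes support and confidence, the two standard measures for an association rule, on a quantitative rule whose antecedent is a numeric interval. It does not implement DCSA-QAR; the function name, the predicate-based rule encoding, and the sample rows are assumptions.

```python
def rule_metrics(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    Each side maps an attribute name to a predicate on its value.
    """
    def matches(row, conditions):
        return all(pred(row[attr]) for attr, pred in conditions.items())

    n_antecedent = sum(1 for r in rows if matches(r, antecedent))
    n_both = sum(
        1 for r in rows if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_both / len(rows)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical mixed-type data: one numeric and one categorical attribute.
rows = [
    {"age": 25, "plan": "basic"},
    {"age": 32, "plan": "premium"},
    {"age": 41, "plan": "premium"},
    {"age": 38, "plan": "basic"},
]
# Rule: age in [30, 45] -> plan = premium
support, confidence = rule_metrics(
    rows,
    antecedent={"age": lambda v: 30 <= v <= 45},
    consequent={"plan": lambda v: v == "premium"},
)
print(support, confidence)
```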


Author(s):  
Hyontai Sug

For the classification task in machine learning, independence between conditional attributes is a precondition for successful data mining. Decision trees are among the most widely used machine learning algorithms because they are easy to understand. Because dependency between conditional attributes can produce more complex trees, it is very important to supply conditional attributes that are independent of each other: for decision trees, as for other machine learning algorithms, conditional attributes should be mutually independent and dependent on the decisional attributes only. The standard statistical method for checking independence between attributes is the Chi-square test, but the test is effective for categorical attributes only. Its applicability is therefore limited, because most data mining datasets mix categorical and numerical attributes. To overcome this problem, a novel method for testing dependency between conditional attributes is suggested; it is based on functional dependencies derived from the data and can be applied to any dataset irrespective of the data types of its attributes. After highly dependent conditional attributes are removed, better decision trees can be generated. Experiments were performed to evaluate the method, and they showed very good results.
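
For reference, the Chi-square test of independence mentioned above can be sketched as follows for two categorical attributes. This illustrates only the baseline test, not the functional-dependency method the abstract proposes; the function name and the toy data are assumptions.

```python
from collections import Counter


def chi_square_statistic(xs, ys):
    """Pearson chi-square statistic and degrees of freedom for two
    categorical attributes given as equal-length sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    stat = 0.0
    for a, row_count in row_totals.items():
        for b, col_count in col_totals.items():
            # Expected count under the independence hypothesis.
            expected = row_count * col_count / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Perfectly dependent attributes versus independent ones (toy data).
dependent = chi_square_statistic(["a", "a", "b", "b"], ["x", "x", "y", "y"])
independent = chi_square_statistic(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(dependent, independent)
```

The statistic is compared with a critical value of the chi-square distribution for the given degrees of freedom (3.841 for one degree of freedom at the 0.05 level). With samples as small as this toy data the approximation is unreliable, so the numbers only show the mechanics.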


Author(s):  
Elena Battaglia ◽  
Simone Celano ◽  
Ruggero G. Pensa

Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the complexity of the relationships among different values of a categorical attribute. Although distance learning algorithms exist for categorical data, they may disclose private information about individual records if applied to a secret dataset. To address this problem, we introduce a differentially private family of algorithms for learning distances between any pair of values of a categorical attribute according to the way they are co-distributed with the values of other categorical attributes forming the so-called context. We define different variants of our algorithm and we show empirically that our approach consumes little privacy budget while providing accurate distances, making it suitable in distance-based applications, such as clustering and classification.


2021 ◽  
Vol 2021 ◽  
pp. 1-17
Author(s):  
Zakariae El Ouazzani ◽  
An Braeken ◽  
Hanan El Bakkali

Most organizations store massive amounts of data in large databases for research, statistics, and mining purposes. In most cases, much of the accumulated data contains sensitive information about individuals, whose disclosure may breach their privacy. Hence, ensuring privacy in big data is considered a very important issue. The concept of privacy aims to protect sensitive information from various attacks that may reveal the identity of individuals. Anonymization techniques are considered the best way to ensure privacy in big data. Various works have already been carried out, taking into account horizontal clustering. L-diversity is one such technique, dealing with sensitive numerical and categorical attributes. However, the majority of anonymization techniques that apply the L-diversity principle to hierarchical data cannot resist the similarity attack and therefore cannot fully ensure privacy. To prevent the similarity attack while preserving data utility, a hybrid technique dealing with categorical attributes is proposed in this paper. Furthermore, we describe all the steps of the proposed algorithm with detailed comments. Moreover, the algorithm is implemented and evaluated according to a well-known information-loss criterion, the Normalized Certainty Penalty (NCP). The obtained results show a good balance between privacy and data utility.
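
The abstract names NCP without giving its formula. The sketch below assumes the usual definition for a categorical attribute with a generalization hierarchy: a value left at a leaf costs 0, and a value generalized to an inner node costs the number of leaves under that node divided by the total number of leaves. The hierarchy, the function names, and the values are hypothetical.

```python
def leaf_count(hierarchy, node):
    """Number of leaf values below (or equal to) node in the hierarchy."""
    children = hierarchy.get(node, [])
    if not children:
        return 1
    return sum(leaf_count(hierarchy, child) for child in children)


def ncp_categorical(hierarchy, root, node):
    """Assumed NCP of generalizing a categorical value to node."""
    if not hierarchy.get(node):
        return 0.0  # an ungeneralized leaf loses no information
    return leaf_count(hierarchy, node) / leaf_count(hierarchy, root)


# Hypothetical generalization hierarchy for a "disease" attribute.
hierarchy = {
    "Any": ["Respiratory", "Digestive"],
    "Respiratory": ["Flu", "Pneumonia"],
    "Digestive": ["Gastritis", "Ulcer", "Colitis"],
}
for value in ("Flu", "Respiratory", "Any"):
    print(value, ncp_categorical(hierarchy, "Any", value))
```

Lower values mean less information loss; generalizing every value to the root gives the maximum penalty of 1.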


Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 184
Author(s):  
Xia Que ◽  
Siyuan Jiang ◽  
Jiaoyun Yang ◽  
Ning An

Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine and biology. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity. However, different attributes may contribute differently, because the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to reflect the differing importance of the attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods with 2.13% and 4.28% improvements, respectively, in terms of accuracy on six mixed datasets from UCI.
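
The abstract does not give the weighting formula. One plausible reading, sketched here purely as an assumption, is to weight each (already categorized) attribute in proportion to its Shannon entropy and to use the weights in a simple mismatch-based dissimilarity. None of the names or data below come from the paper.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())


def entropy_weights(columns):
    """Weights proportional to each attribute's entropy, summing to 1."""
    entropies = {name: shannon_entropy(vals) for name, vals in columns.items()}
    total = sum(entropies.values())
    if total == 0:
        # Every attribute is constant: fall back to equal weights.
        return {name: 1 / len(columns) for name in columns}
    return {name: e / total for name, e in entropies.items()}


def weighted_mismatch(x, y, weights):
    """Sum of the weights of the attributes on which x and y differ."""
    return sum(w for name, w in weights.items() if x[name] != y[name])


# Toy data: "color" varies, "shape" is constant and carries no information.
columns = {"color": ["r", "g", "r", "g"], "shape": ["s", "s", "s", "s"]}
weights = entropy_weights(columns)
distance = weighted_mismatch(
    {"color": "r", "shape": "s"}, {"color": "g", "shape": "s"}, weights
)
print(weights, distance)
```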


2021 ◽  
Vol LIII (1) ◽  
pp. 71-75
Author(s):  
Aleksandr P. Kotsiubinskii ◽  
Yulia V. Isaenko

The article examines the problem that the modern classifications used in psychiatry (ICD, DSM) lack a fundamental scientific basis and primarily serve the goals of statistics and epidemiological research. In this context, researchers are increasingly interested in the relationship between two diagnostic approaches: categorical and dimensional. Each of these approaches is noted to have its own advantages and disadvantages. It is therefore proposed that, when diagnosing schizotypal disorder, not only typological (categorical) attributes but also dimensional characteristics of the patient's mental status be taken into consideration; in the holistic approach, these should be compared with the psychological, social, and functional diagnosis.


Author(s):  
Lincy Mathews ◽  
HariSeetha

When the classes are represented unequally in the data to be mined, the imbalanced two-class data challenge arises. Many health-related datasets comprising categorical data face the class imbalance challenge. This paper aims to address the limitations of imbalanced two-class categorical data and presents a re-sampling solution known as 'Syn_Gen_Min' (SGM) to improve the class imbalance ratio. SGM involves finding the greedy neighbors of a given minority sample. To the best of our knowledge, the accepted approach for a classifier is to find numeric equivalents for categorical attributes, which results in a loss of information. The novelty of this contribution is that the categorical attributes are kept in their raw form. Five distinct categorical similarity measures are employed and tested on six real-world datasets from the healthcare sector. Applying these similarity measures generates different synthetic samples, which significantly improved the performance measures of the classifier. This work further shows that no generic similarity measure fits all datasets.
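
The five similarity measures are not named in the abstract. As an illustration of working on raw categorical values, the sketch below uses the overlap measure (the fraction of matching attributes), which is the simplest such measure, to rank the neighbors of a minority sample. It is not the SGM procedure itself; the function names and records are hypothetical.

```python
def overlap_similarity(a, b):
    """Fraction of attribute positions on which two records agree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def nearest_neighbors(sample, candidates, n):
    """The n candidates most similar to sample under the overlap measure."""
    ranked = sorted(
        candidates, key=lambda c: overlap_similarity(sample, c), reverse=True
    )
    return ranked[:n]


# Hypothetical minority-class records: (smoker, blood_type, region).
minority = [
    ("yes", "A", "north"),
    ("yes", "B", "north"),
    ("no", "O", "south"),
]
sample = ("yes", "A", "south")
neighbors = nearest_neighbors(sample, minority, n=2)
print(neighbors)
```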

