categorical attributes Latest Research Papers

Analyzing mixed-type data by using word embedding for handling categorical features

Most real-world datasets are of mixed type, containing both numeric and categorical attributes. Unlike numbers, categorical values support only limited operations, and the degree of similarity between distinct values cannot be measured directly. Properly analyzing mixed-type data therefore requires dedicated methods for handling the categorical values. The main limitation of most existing methods is the lack of an appropriate numeric representation of categorical values; consequently, some analysis algorithms cannot be applied. In this paper, we address this deficiency by transforming categorical values into numeric representations so as to facilitate various analyses of mixed-type data. In particular, the proposed transformation preserves the semantics of each categorical value with respect to the other values in the dataset, resulting in better performance in data analyses including classification and clustering. The proposed method is verified and compared with other methods on an extensive set of real-world datasets.
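The abstract does not spell out the transformation itself, so the following is only a minimal sketch of the general idea of giving a categorical value a numeric representation that reflects how it relates to other values in the dataset. The function name `cooccurrence_vectors`, the toy rows, and the use of plain co-occurrence frequencies are all assumptions made for illustration; this is not the authors' method.

```python
from collections import Counter


def cooccurrence_vectors(rows, target, context):
    """Map each value of `target` to a vector of relative co-occurrence
    frequencies with the values of `context` (hypothetical helper)."""
    context_values = sorted({row[context] for row in rows})
    counts = {}
    for row in rows:
        counts.setdefault(row[target], Counter())[row[context]] += 1
    vectors = {}
    for value, counter in counts.items():
        total = sum(counter.values())
        # One coordinate per context value, in sorted order.
        vectors[value] = [counter[c] / total for c in context_values]
    return vectors


# Toy data, invented for this example.
rows = [
    {"color": "red", "fruit": "apple"},
    {"color": "red", "fruit": "cherry"},
    {"color": "green", "fruit": "apple"},
    {"color": "green", "fruit": "apple"},
]
vectors = cooccurrence_vectors(rows, target="color", context="fruit")
print(vectors)
```

Once every categorical value has such a vector, distance-based algorithms that expect numeric input can be applied to it, which is the kind of benefit the abstract describes.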

An Extended Mondrian Algorithm – XMondrian to Protect Identity Disclosure

10.3233/apc210088 ◽ 2021

Author(s): R. Padmaja ◽ V. Santhi

Keyword(s): Research Area ◽ Data Encryption ◽ Vital Role ◽ Data Publishing ◽ Extended Version ◽ Data Anonymization ◽ Identity Disclosure ◽ Categorical Attributes ◽ Privacy Preserving Data Publishing ◽ Day By Day

In recent years, Privacy Preserving Data Publishing (PPDP) has become a vital research area because of the rapidly increasing volume of data published on the Internet. Many organizations need to publish their data on the Internet for research and analysis purposes, but there is no guarantee that the data will be used only for ethical purposes. Data anonymization therefore plays a vital role in preventing identity disclosure, and it also restricts the amount of data that external users can see or use. It is an extensively used PPDP technique among data encryption, data anonymization, and data perturbation methods. Mondrian is one such data anonymization technique, and it has outperformed many anonymization algorithms because it is fast and scalable. However, the algorithm requires categorical values to be encoded as numerical values, and decoded afterwards, in order to generalize the data. To overcome this problem, an extended version of the Mondrian algorithm, called XMondrian, is proposed. The proposed algorithm can handle both numerical and categorical attributes without encoding or decoding the categorical values. An experimental study shows that the proposed XMondrian algorithm outperforms the existing Mondrian algorithm in terms of anonymization time and Cavg, a metric used to quantify the utility of data.
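The abstract names Cavg without defining it. The sketch below assumes it denotes the normalized average equivalence class size commonly reported for Mondrian-style anonymization, that is (number of records / number of equivalence classes) / k. The function name `c_avg`, the quasi-identifier names, and the generalized toy records are hypothetical.

```python
from collections import Counter


def c_avg(records, quasi_identifiers, k):
    """Normalized average equivalence class size (assumed definition):
    (total records / number of equivalence classes) / k.
    A value of 1.0 is the ideal; larger values mean coarser groups."""
    classes = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return (len(records) / len(classes)) / k


# Already-anonymized toy table: ages generalized to ranges and
# cities generalized to sets, with no numeric encoding of the cities.
anonymized = [
    {"age": "20-29", "city": "{Paris, Lyon}", "disease": "flu"},
    {"age": "20-29", "city": "{Paris, Lyon}", "disease": "asthma"},
    {"age": "30-39", "city": "{Nice}", "disease": "flu"},
    {"age": "30-39", "city": "{Nice}", "disease": "angina"},
]
score = c_avg(anonymized, quasi_identifiers=["age", "city"], k=2)
print(score)
```

Here four records fall into two equivalence classes of size two, so with k = 2 the score is at its best possible value.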

A Discrete Crow Search Algorithm for Mining Quantitative Association Rules

International Journal of Swarm Intelligence Research ◽ 10.4018/ijsir.2021100106 ◽ 2021 ◽ Vol 12 (4) ◽ pp. 101-124

Author(s): Makhlouf Ledmi ◽ Hamouma Moumen ◽ Abderrahim Siam ◽ Hichem Haouassi ◽ Nabil Azizi

Keyword(s): Association Rules ◽ Heuristic Algorithms ◽ Search Algorithm ◽ Discrete Version ◽ Numerical Attributes ◽ Quantitative Association Rules ◽ Discretization Algorithm ◽ Continuous Problems ◽ Categorical Attributes ◽ Real World Datasets

Association rule mining is a data mining method that aims to discover explicit relations between the attributes of a large dataset. In practice, however, many datasets contain both numeric and categorical attributes. Recently, many nature-inspired meta-heuristic algorithms have been developed for solving continuous problems. This article proposes a new algorithm, DCSA-QAR, for mining quantitative association rules based on the crow search algorithm (CSA). To accomplish this, new operators are defined to increase the ability to explore the search space and to ensure the transition from the continuous to the discrete version of CSA. Moreover, a new discretization algorithm is adopted for numerical attributes that takes into account the dependencies that may exist between attributes. Finally, to evaluate its performance, DCSA-QAR is compared with particle swarm optimization and with mono-objective and multi-objective evolutionary approaches for mining association rules. The results obtained over real-world datasets show the outstanding performance of DCSA-QAR in terms of quality measures.
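To make the notion of a quantitative association rule concrete, here is a small sketch that evaluates one rule mixing a numeric interval with categorical values, using the standard support and confidence measures. It does not implement DCSA-QAR or its discretization; the helper names `matches` and `support_confidence` and the toy rows are assumptions for illustration.

```python
def matches(row, conditions):
    """True if the row satisfies every condition. A tuple (low, high)
    is read as a closed numeric interval; anything else as equality."""
    for attribute, condition in conditions.items():
        value = row[attribute]
        if isinstance(condition, tuple):
            low, high = condition
            if not (low <= value <= high):
                return False
        elif value != condition:
            return False
    return True


def support_confidence(rows, antecedent, consequent):
    """Standard support and confidence of antecedent -> consequent."""
    n_antecedent = sum(matches(r, antecedent) for r in rows)
    n_both = sum(
        matches(r, antecedent) and matches(r, consequent) for r in rows
    )
    support = n_both / len(rows)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Toy data, invented for this example.
rows = [
    {"age": 25, "smoker": "no", "risk": "low"},
    {"age": 52, "smoker": "yes", "risk": "high"},
    {"age": 47, "smoker": "yes", "risk": "high"},
    {"age": 61, "smoker": "yes", "risk": "low"},
]
# Rule: age in [45, 65] and smoker = yes  ->  risk = high
support, confidence = support_confidence(
    rows, {"age": (45, 65), "smoker": "yes"}, {"risk": "high"}
)
print(support, confidence)
```

A search algorithm for quantitative rules would vary the interval bounds and attribute choices and score each candidate with measures like these.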

Making Use of Functional Dependencies Based on Data to Find Better Classification Trees

International Journal of Circuits, Systems and Signal Processing ◽ 10.46300/9106.2021.15.160 ◽ 2021 ◽ Vol 15 ◽ pp. 1475-1485

Author(s): Hyontai Sug

Keyword(s): Machine Learning ◽ Data Mining ◽ Decision Trees ◽ Learning Algorithms ◽ Machine Learning Algorithms ◽ Functional Dependencies ◽ Chi Square ◽ Chi Square Test ◽ Novel Method ◽ Categorical Attributes

For the classification task in machine learning, independence between conditional attributes is a precondition for successful data mining. Decision trees are among the most widely used machine learning algorithms because they are easy to understand. Because dependency between conditional attributes can lead to more complex trees, it is very important to supply conditional attributes that are independent of each other: decision trees, like other machine learning algorithms, require conditional attributes that are mutually independent and dependent on the decisional attributes only. The standard statistical method for checking independence between attributes is the Chi-square test, but the test is effective for categorical attributes only. Its applicability is therefore limited, because most datasets for data mining mix categorical and numerical attributes. To overcome this problem, a novel method for testing dependency between conditional attributes is suggested; it is based on functional dependencies derived from the data and can be applied to any dataset irrespective of the data types of its attributes. After removing highly dependent conditional attributes, better decision trees can be generated. Experiments were performed to evaluate the method, and they showed very good results.
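The abstract does not say how the data-based functional dependency is measured, so the sketch below uses one generic possibility: the fraction of rows that are consistent with A -> B when each value of A is mapped to its most frequent value of B. The name `dependency_strength` and the toy rows are hypothetical, and this should not be read as the paper's measure. Because the check only compares values for equality, it works for categorical and numerical attributes alike.

```python
from collections import Counter, defaultdict


def dependency_strength(rows, determinant, dependent):
    """Fraction of rows consistent with determinant -> dependent.
    1.0 means the dependency holds exactly in the data."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[determinant]][row[dependent]] += 1
    # Within each determinant value, keep the most common dependent value.
    consistent = sum(max(counter.values()) for counter in groups.values())
    return consistent / len(rows)


# Toy data, invented for this example.
rows = [
    {"zip": "10001", "city": "NYC", "income": 40.5},
    {"zip": "10001", "city": "NYC", "income": 52.0},
    {"zip": "94105", "city": "SF", "income": 40.5},
    {"zip": "94105", "city": "SF", "income": 61.3},
]
zip_to_city = dependency_strength(rows, "zip", "city")
income_to_zip = dependency_strength(rows, "income", "zip")
print(zip_to_city, income_to_zip)
```

An attribute such as `city` that is fully determined by another conditional attribute would be a candidate for removal before the tree is built.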

Mitigating sparsity using Bhattacharyya Coefficient and items’ categorical attributes: improving the performance of collaborative filtering based recommendation systems

Applied Intelligence ◽ 10.1007/s10489-021-02462-8 ◽ 2021

Author(s): Pradeep Kumar Singh ◽ Pijush Kanti Dutta Pramanik ◽ Prasenjit Choudhury

Keyword(s): Collaborative Filtering ◽ Recommendation Systems ◽ Bhattacharyya Coefficient ◽ Categorical Attributes

Differentially Private Distance Learning in Categorical Data

Data Mining and Knowledge Discovery ◽ 10.1007/s10618-021-00778-0 ◽ 2021

Author(s): Elena Battaglia ◽ Simone Celano ◽ Ruggero G. Pensa

Keyword(s): Distance Learning ◽ Private Information ◽ Categorical Data ◽ Health Records ◽ Limited Applicability ◽ Clustering And Classification ◽ Categorical Attributes ◽ Categorical Attribute ◽ Numeric Data ◽ Privacy Budget

Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the complexity of the relationships among different values of a categorical attribute. Although distance learning algorithms exist for categorical data, they may disclose private information about individual records if applied to a secret dataset. To address this problem, we introduce a differentially private family of algorithms for learning distances between any pair of values of a categorical attribute according to the way they are co-distributed with the values of other categorical attributes forming the so-called context. We define different variants of our algorithm and we show empirically that our approach consumes little privacy budget while providing accurate distances, making it suitable in distance-based applications, such as clustering and classification.
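As a rough illustration of the idea of learning distances from co-distribution with a context attribute under a privacy budget, the sketch below perturbs a contingency table with Laplace noise and then compares the conditional distributions of two values. It is not the authors' family of algorithms: the function names, the toy counts, the choice of total variation distance, and the simplified privacy accounting (sensitivity 1, one context attribute) are all assumptions.

```python
import random


def noisy_counts(counts, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise with scale sensitivity/epsilon to each cell.
    A Laplace sample is drawn as the difference of two exponentials.
    Cells absent from `counts` would also need noise in a real release."""
    rate = epsilon / sensitivity
    return {
        cell: max(0.0, c + rng.expovariate(rate) - rng.expovariate(rate))
        for cell, c in counts.items()
    }


def context_distance(counts, a, b, context_values):
    """Total variation distance between the context distributions
    observed for values `a` and `b` of the target attribute."""
    def distribution(value):
        row = [counts.get((value, c), 0.0) for c in context_values]
        total = sum(row)
        return [x / total if total else 0.0 for x in row]

    pa, pb = distribution(a), distribution(b)
    return 0.5 * sum(abs(x - y) for x, y in zip(pa, pb))


# Toy contingency table: (diagnosis, symptom) -> number of records.
counts = {
    ("flu", "fever"): 8, ("flu", "rash"): 2,
    ("cold", "fever"): 6, ("cold", "rash"): 4,
    ("measles", "fever"): 1, ("measles", "rash"): 9,
}
context_values = ["fever", "rash"]

exact = context_distance(counts, "flu", "cold", context_values)
private = context_distance(
    noisy_counts(counts, epsilon=1.0, rng=random.Random(0)),
    "flu", "cold", context_values,
)
print(exact, private)
```

Smaller values of `epsilon` add more noise and so spend less privacy budget at the cost of less accurate distances, which is the trade-off the abstract refers to.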

Proximity Measurement for Hierarchical Categorical Attributes in Big Data

Security and Communication Networks ◽ 10.1155/2021/6612923 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-17

Author(s): Zakariae El Ouazzani ◽ An Braeken ◽ Hanan El Bakkali

Keyword(s): Big Data ◽ Information Loss ◽ Sensitive Information ◽ Hybrid Technique ◽ Hierarchical Data ◽ Data Utility ◽ Large Databases ◽ Categorical Attributes ◽ Diversity Principle ◽ Diversity Technique

Most organizations store massive amounts of data in large databases for research, statistics, and mining purposes. In most cases, much of the accumulated data contains sensitive information about individuals, and its release may breach their privacy. Hence, ensuring privacy in big data is considered a very important issue. The concept of privacy aims to protect sensitive information from various attacks that may reveal the identity of individuals. Anonymization techniques are considered the best way to ensure privacy in big data. Various works have already been carried out, taking into account horizontal clustering. The L-diversity technique is one of the techniques that deal with sensitive numerical and categorical attributes. However, the majority of anonymization techniques that use the L-diversity principle for hierarchical data cannot resist the similarity attack and therefore cannot reliably ensure privacy. In order to prevent the similarity attack while preserving data utility, a hybrid technique dealing with categorical attributes is proposed in this paper. Furthermore, we describe all the steps of our proposed algorithm with detailed comments. Moreover, the algorithm is implemented and evaluated according to a well-known information loss-based criterion, the Normalized Certainty Penalty (NCP). The obtained results show a good balance between privacy and data utility.
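The Normalized Certainty Penalty for a categorical attribute with a generalization hierarchy is usually defined as the number of leaves under the generalized node divided by the total number of leaves, and 0 when the value is not generalized. The sketch below assumes that usual definition; the hierarchy, the node names, and the helper names are invented for illustration and are not taken from the paper.

```python
def leaf_count(hierarchy, node):
    """Number of leaf values covered by `node` in the hierarchy."""
    children = hierarchy.get(node, [])
    if not children:
        return 1
    return sum(leaf_count(hierarchy, child) for child in children)


def ncp_categorical(hierarchy, root, node):
    """NCP of publishing `node` in place of an original leaf value:
    0 if the node covers a single leaf, else covered leaves / all leaves."""
    covered = leaf_count(hierarchy, node)
    if covered == 1:
        return 0.0
    return covered / leaf_count(hierarchy, root)


# Hypothetical generalization hierarchy for a diagnosis attribute.
hierarchy = {
    "Any": ["Respiratory", "Cardiac"],
    "Respiratory": ["Flu", "Asthma", "Bronchitis"],
    "Cardiac": ["Arrhythmia", "Angina"],
}
for node in ["Flu", "Respiratory", "Any"]:
    print(node, ncp_categorical(hierarchy, "Any", node))
```

Lower NCP means less information loss, so an anonymization algorithm that generalizes "Flu" only to "Respiratory" preserves more utility than one that generalizes it all the way to "Any".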

A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets

Algorithms ◽ 10.3390/a14060184 ◽ 2021 ◽ Vol 14 (6) ◽ pp. 184

Author(s): Xia Que ◽ Siyuan Jiang ◽ Jiaoyun Yang ◽ Ning An

Keyword(s): Categorical Data ◽ Clustering Algorithm ◽ Numerical Data ◽ Similarity Measurement ◽ Amount Of Information ◽ Automatic Categorization ◽ Categorical Attributes ◽ Weighting Strategy

Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine and biology. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity. However, different attributes may contribute differently, because the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to reflect the different importance of the various attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods with 2.13% and 4.28% improvements, respectively, in terms of accuracy on six mixed datasets from UCI.
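The abstract does not give the weighting formula, so the following is only a sketch of one way an entropy-based weighting could look: compute the Shannon entropy of each (already categorical) attribute and normalize the entropies into weights. The direction of the weighting (more entropy, more weight), the function names, and the toy rows are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a list of categorical values."""
    n = len(values)
    return -sum(
        (count / n) * math.log2(count / n)
        for count in Counter(values).values()
    )


def entropy_weights(rows, attributes):
    """Normalize per-attribute entropies so the weights sum to 1.
    Falls back to equal weights if every attribute is constant."""
    entropies = {a: entropy([row[a] for row in rows]) for a in attributes}
    total = sum(entropies.values())
    if total == 0:
        return {a: 1 / len(attributes) for a in attributes}
    return {a: e / total for a, e in entropies.items()}


# Toy data, invented for this example; numeric attributes are assumed
# to have been categorized beforehand.
rows = [
    {"sex": "M", "ward": "A", "stage": "I"},
    {"sex": "F", "ward": "A", "stage": "II"},
    {"sex": "M", "ward": "A", "stage": "III"},
    {"sex": "F", "ward": "A", "stage": "IV"},
]
weights = entropy_weights(rows, ["sex", "ward", "stage"])
print(weights)
```

In this toy table the constant attribute `ward` carries no information and receives zero weight, while `stage`, with four equally likely values, receives the largest share.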

Diagnosis of schizotypal disorder: reliability of categorial approach or validity of dimensional one?

Neurology Bulletin ◽ 10.17816/nb58706 ◽ 2021 ◽ Vol LIII (1) ◽ pp. 71-75

Author(s): Aleksandr P. Kotsiubinskii ◽ Yulia V. Isaenko

Keyword(s): Mental Status ◽ Holistic Approach ◽ Epidemiological Research ◽ Functional Diagnosis ◽ Advantages And Disadvantages ◽ Diagnostic Approaches ◽ Categorical Attributes ◽ The Relationship ◽ Schizotypal Disorder ◽ Scientific Base

The article examines the problem that the modern classifications used in psychiatry (ICD, DSM) lack a fundamental scientific basis and primarily serve the goals of statistics and epidemiological research. In this context, researchers are increasingly interested in the relationship between the two diagnostic approaches, categorical and dimensional, each of which has its own advantages and disadvantages. It is proposed that a diagnosis of schizotypal disorder should take into consideration not only typological (categorical) attributes but also dimensional characteristics of the patient's mental status, which, in the holistic approach, should be compared with the psychological, social and functional diagnosis.

Efficient Learning From Two-Class Categorical Imbalanced Healthcare Data

International Journal of Healthcare Information Systems and Informatics ◽ 10.4018/ijhisi.2021010105 ◽ 2021 ◽ Vol 16 (1) ◽ pp. 81-100

Author(s): Lincy Mathews ◽ Hari Seetha

Keyword(s): Categorical Data ◽ Class Imbalance ◽ Similarity Measures ◽ Healthcare Sector ◽ Healthcare Data ◽ Data Segment ◽ Health Related ◽ Categorical Attributes ◽ Real World Datasets ◽ Efficient Learning

When one class is represented far less than the other in the data segment to be mined, the result is the imbalanced two-class data challenge. Many health-related datasets comprising categorical data face this class imbalance challenge. This paper aims to address the limitations of imbalanced two-class categorical data and presents a re-sampling solution known as 'Syn_Gen_Min' (SGM) to improve the class imbalance ratio. SGM involves finding the greedy neighbors for a given minority sample. To the best of our knowledge, the accepted approach for a classifier is to find a numeric equivalent for categorical attributes, which results in a loss of information. The novelty of this contribution is that the categorical attributes are kept in their raw form. Five distinct categorical similarity measures are employed and tested against six real-world datasets from the healthcare sector. The application of these similarity measures leads to the generation of different synthetic samples, which has significantly improved the performance measures of the classifier. This work further shows that there is no generic similarity measure that fits all datasets.
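The abstract does not list the five similarity measures or the details of SGM, so the sketch below only illustrates the general step of finding neighbors of a minority sample while keeping categorical attributes in raw form. It uses the simple overlap (matching) similarity as a stand-in; the function names and the toy samples are hypothetical.

```python
def overlap_similarity(a, b):
    """Fraction of attribute positions on which two samples agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def nearest_neighbors(sample, candidates, k):
    """Return the k candidates most similar to `sample`.
    Ties keep their original order because the sort is stable."""
    ranked = sorted(
        candidates,
        key=lambda candidate: overlap_similarity(sample, candidate),
        reverse=True,
    )
    return ranked[:k]


# Toy minority-class samples: (sex, smoking status, residence).
sample = ("female", "smoker", "urban")
minority = [
    ("male", "non-smoker", "rural"),
    ("female", "smoker", "rural"),
    ("female", "non-smoker", "rural"),
]
neighbors = nearest_neighbors(sample, minority, k=2)
print(neighbors)
```

A re-sampling method could then build synthetic minority samples from a sample and its neighbors; swapping in a different similarity function changes which neighbors are chosen, which is consistent with the abstract's observation that no single measure fits all datasets.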

categorical attributes
Recently Published Documents

Analyzing mixed-type data by using word embedding for handling categorical features

An Extended Mondrian Algorithm – XMondrian to Protect Identity Disclosure

A Discrete Crow Search Algorithm for Mining Quantitative Association Rules

Making Use of Functional Dependencies Based on Data to Find Better Classification Trees

Mitigating sparsity using Bhattacharyya Coefficient and items’ categorical attributes: improving the performance of collaborative filtering based recommendation systems

Differentially Private Distance Learning in Categorical Data

Proximity Measurement for Hierarchical Categorical Attributes in Big Data

A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets

Diagnosis of schizotypal disorder: reliability of categorial approach or validity of dimensional one?

Efficient Learning From Two-Class Categorical Imbalanced Healthcare Data
