A Novel Categorical Data Attribute Split Technique in Decision Tree Learning

A new technique is proposed for splitting on categorical attributes during decision tree learning. The technique is based on class probability representations and manipulations of the class labels corresponding to the distinct values of categorical attributes. For each categorical attribute, an aggregate similarity in terms of class probabilities is computed; the attribute with the highest aggregated similarity measure is selected as the best split attribute, and the data in the current node of the decision tree are divided into as many subsets as that attribute has distinct values. Many experiments were conducted using the proposed method, and the results show that it is better than many competing methods in terms of efficiency, ease of use, understandability, and output results, and that it will be useful in many modern applications.
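The two building blocks named in this abstract, per-value class probabilities and a multiway split on distinct values, can be sketched as follows. This is only an illustration under stated assumptions: the function names and the toy `outlook`/`play` data are hypothetical, and the aggregated similarity measure itself is not implemented because the abstract does not define it.

```python
from collections import Counter, defaultdict


def class_probabilities_by_value(rows, attribute, label):
    """Map each distinct attribute value to its class-probability distribution."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[label]] += 1
    return {
        value: {cls: n / sum(c.values()) for cls, n in c.items()}
        for value, c in counts.items()
    }


def multiway_split(rows, attribute):
    """Partition rows into one subset per distinct value of the attribute."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row)
    return dict(subsets)


# Hypothetical toy data, not taken from the paper.
rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain", "play": "yes"},
    {"outlook": "rain", "play": "no"},
    {"outlook": "overcast", "play": "yes"},
]
probs = class_probabilities_by_value(rows, "outlook", "play")
subsets = multiway_split(rows, "outlook")
```

Here `probs` holds one class distribution per distinct value of `outlook`, and `subsets` holds the child-node partitions that a multiway split on that attribute would produce.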


Multilabel Classification of Membrane Protein in Human by Decision Tree (DT) Approach

Biomedical & Pharmacology Journal ◽

10.13005/bpj/1353 ◽

2018 ◽

Vol 11 (1) ◽

pp. 113-121 ◽

Cited By ~ 1

Author(s):

Nijil Raj N ◽

T. Mahalekshmi

Keyword(s):

Membrane Protein ◽

Decision Tree ◽

Protein Function ◽

Cellular Processes ◽

Semantic Scene ◽

Class Labels ◽

Essential Set ◽

Type Classification ◽

Better Than

Multi-label classification methods are important in various fields, such as protein type and protein function classification, semantic scene classification, and music categorization. In multi-label classification, each sample can be associated with a set of class labels. In protein type classification, one of the major types of protein is the membrane protein. Membrane proteins perform different cellular processes and important functions, which depend on the protein type. Each membrane protein can have different roles at the same time. In this study we propose membrane protein type classification using the Decision Tree (DT) classification algorithm. The DT classifies a membrane protein into six types. An essential set of features is extracted from the membrane protein dataset S1 and used for the proposed method, which achieved an accuracy of 69.81%, whereas the existing network-based and shortest-path methods achieved accuracies of 66.78% and 54.97%, respectively. The accuracies of the existing methods were not obtained on the full set of proteins in dataset S1, but after removal of a few unannotated proteins. In terms of both accuracy and complexity, the proposed method seems to be better than the existing methods.


Clustering Categorical Data with k-Modes

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch040 ◽

2011 ◽

pp. 246-250 ◽

Cited By ~ 2

Author(s):

Joshua Zhexue Huang

Keyword(s):

Real World ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Chemical Information ◽

New Techniques ◽

Small Set ◽

Categorical Attributes ◽

Categorical Attribute ◽

Numeric Data

A lot of data in real-world databases are categorical. For example, gender, profession, position, and hobby of customers are usually defined as categorical attributes in the CUSTOMER table. Each categorical attribute is represented with a small set of unique categorical values such as {Female, Male} for the gender attribute. Unlike numeric data, categorical values are discrete and unordered. Therefore, the clustering algorithms for numeric data cannot be used to cluster the categorical data that exist in many real-world applications. In data mining research, much effort has been put into the development of new techniques for clustering categorical data (Huang, 1997b; Huang, 1998; Gibson, Kleinberg, & Raghavan, 1998; Ganti, Gehrke, & Ramakrishnan, 1999; Guha, Rastogi, & Shim, 1999; Chaturvedi, Green, Carroll, & Foods, 2001; Barbara, Li, & Couto, 2002; Andritsos, Tsaparas, Miller, & Sevcik, 2003; Li, Ma, & Ogihara, 2004; Chen, & Liu, 2005; Parmar, Wu, & Blackhurst, 2007). The k-modes clustering algorithm (Huang, 1997b; Huang, 1998) is one of the first algorithms for clustering large categorical data. In the past decade, this algorithm has been well studied and widely used in various applications. It has also been adopted in commercial software (e.g., Daylight Chemical Information Systems, Inc., http://www.daylight.com/).
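A minimal sketch of the k-modes idea as it is commonly described: the distance between two records is the number of attributes on which they differ, and each cluster centre is the per-attribute mode of its members. The function names, the fixed initial modes, and the toy customer records are assumptions made for illustration, not details from the entry above.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(records):
    """Per-attribute most frequent value of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, initial_modes, max_iter=10):
    """Alternate assignment and mode update until the modes stop changing."""
    modes = list(initial_modes)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster with the nearest mode.
        labels = [
            min(range(len(modes)), key=lambda k: matching_dissimilarity(r, modes[k]))
            for r in records
        ]
        # Update step: recompute each mode; keep the old one for empty clusters.
        new_modes = []
        for k in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == k]
            new_modes.append(mode_of(members) if members else modes[k])
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes


# Hypothetical (gender, profession, hobby) records.
records = [
    ("Female", "Engineer", "Chess"),
    ("Female", "Engineer", "Golf"),
    ("Female", "Teacher", "Chess"),
    ("Male", "Chef", "Golf"),
    ("Male", "Chef", "Tennis"),
    ("Male", "Teacher", "Golf"),
]
labels, modes = k_modes(records, initial_modes=[records[0], records[3]])
```

Replacing the mean with the mode and Euclidean distance with simple matching is what lets the k-means style loop operate on unordered values.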


RECOGNITION OF HAND-PRINTED LATIN CHARACTERS BASED ON GENERALIZED HOUGH TRANSFORM AND DECISION TREE LEARNING TECHNIQUES

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001400000246 ◽

2000 ◽

Vol 14 (03) ◽

pp. 369-387 ◽

Cited By ~ 5

Author(s):

ADNAN AMIN

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Hough Transform ◽

Large Degree ◽

New Technique ◽

Decision Tree Learning ◽

Generalized Hough Transform ◽

Learning Techniques ◽

Base Form ◽

A New Technique

This paper presents a new technique for the recognition of hand-printed Latin characters using machine learning. Conventional methods have relied on manually constructed dictionaries which are not only tedious to construct but also difficult to make tolerant to variation in writing styles. The advantages of machine learning are that it can generalize over a large degree of variation between writing styles, and recognition rules can be constructed by example. Characters are scanned into the computer and preprocessing techniques transform the bit-map representation of the characters into a set of primitives which can be represented in an attribute base form. A set of such representations for each character is then input to C4.5 which produces a decision tree for classifying each character.


Differentially Private Distance Learning in Categorical Data

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00778-0 ◽

2021 ◽

Author(s):

Elena Battaglia ◽

Simone Celano ◽

Ruggero G. Pensa

Keyword(s):

Distance Learning ◽

Private Information ◽

Categorical Data ◽

Health Records ◽

Limited Applicability ◽

Clustering And Classification ◽

Categorical Attributes ◽

Categorical Attribute ◽

Numeric Data ◽

Privacy Budget

Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the complexity of the relationships among different values of a categorical attribute. Although distance learning algorithms exist for categorical data, they may disclose private information about individual records if applied to a secret dataset. To address this problem, we introduce a differentially private family of algorithms for learning distances between any pair of values of a categorical attribute according to the way they are co-distributed with the values of other categorical attributes forming the so-called context. We define different variants of our algorithm and we show empirically that our approach consumes little privacy budget while providing accurate distances, making it suitable in distance-based applications, such as clustering and classification.


An Extended C4.5 Classification Algorithm using Mathematical Series

Science & Technology Journal ◽

10.22232/stj.2019.07.02.06 ◽

2019 ◽

Vol 7 (2) ◽

pp. 54-59

Author(s):

R. Raja Aswathi ◽

K. Pazhani Kumar ◽

B. Ramakrishnan

Keyword(s):

Decision Tree ◽

Categorical Data ◽

Missing Values ◽

Classification Algorithm ◽

New Technique ◽

Rule Based ◽

C4.5 Algorithm ◽

A New Technique

C4.5 is an efficient decision-tree-based classification algorithm derived from the ID3 approach. C4.5 is also a rule-based classification algorithm. Its main strengths are that it can deal with categorical data, overfitting, and missing values. The performance of C4.5 is superior to that of ID3 even with an equal number of attributes. EC4.5 (Exponential C4.5) is an extension of the C4.5 algorithm that uses the exponential of the split value to predict the gain of attributes and addresses the setback reported in C4.5. However, EC4.5 misclassifies some data, and a new technique is introduced to avoid this problem. This paper proposes a technique, TMC4.5 (Taylor-Madhava C4.5), to reduce the uncertainty in data classification by integrating the exponential split value of EC4.5 with a sine split value derived from the Madhava series. This technique yields an optimized gain value that reduces uncertainty. The results obtained show that TMC4.5 performs far better than the C4.5 and EC4.5 algorithms.
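For context, the attribute gain that these variants modify can be sketched with the standard gain ratio usually associated with C4.5 (information gain divided by split information). This is not the EC4.5 or TMC4.5 formula, which the abstract does not spell out; the function names and example values are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, normalised by split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


labels = ["x", "x", "y", "y"]
informative = gain_ratio(["a", "a", "b", "b"], labels)    # separates the classes
uninformative = gain_ratio(["a", "b", "a", "b"], labels)  # tells us nothing
```

An attribute that separates the classes perfectly scores highest, while one that is independent of the class scores zero.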


Malicious URLs Detection Using Decision Tree Classifiers and Majority Voting Technique

Cybernetics and Information Technologies ◽

10.2478/cait-2018-0002 ◽

2018 ◽

Vol 18 (1) ◽

pp. 11-29 ◽

Cited By ~ 5

Author(s):

Dharmaraj R. Patil ◽

J. B. Patil

Keyword(s):

Decision Tree ◽

False Positive Rate ◽

False Negative ◽

False Negative Rate ◽

Majority Voting ◽

Detection Accuracy ◽

Decision Tree Learning ◽

The Arts ◽

Positive Rate ◽

Better Than

Researchers all over the world have provided significant and effective solutions to detect malicious URLs. Still, due to the ever-changing nature of cyberattacks, there are many open issues. In this paper, we provide an effective hybrid methodology with new features to deal with this problem. To evaluate our approach, we used state-of-the-art supervised decision tree learning classification models. We performed our experiments on a balanced dataset. The experimental results show that, with the inclusion of the new features, all the decision tree learning classifiers work well on our labeled dataset, achieving 98-99% detection accuracy with very low False Positive Rate (FPR) and False Negative Rate (FNR). We also achieved 99.29% detection accuracy with very low FPR and FNR using the majority voting technique, which is better than well-known anti-virus and anti-malware solutions.
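The majority voting step and the two error rates mentioned here can be sketched as follows. The three prediction lists stand in for the outputs of already trained decision tree classifiers; they, the function names, and the labels are hypothetical and do not reproduce the paper's data or features.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier prediction lists by taking the most common label."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def error_rates(predicted, actual, positive="malicious"):
    """Return (FPR, FNR); assumes both classes occur in `actual`."""
    fp = sum(p == positive and a != positive for p, a in zip(predicted, actual))
    fn = sum(p != positive and a == positive for p, a in zip(predicted, actual))
    negatives = sum(a != positive for a in actual)
    positives = sum(a == positive for a in actual)
    return fp / negatives, fn / positives


# Hypothetical outputs of three classifiers on four URLs.
tree_a = ["malicious", "benign", "malicious", "benign"]
tree_b = ["malicious", "malicious", "benign", "benign"]
tree_c = ["benign", "benign", "malicious", "benign"]
voted = majority_vote([tree_a, tree_b, tree_c])

actual = ["malicious", "benign", "malicious", "malicious"]
fpr, fnr = error_rates(voted, actual)
```

Using an odd number of voters avoids ties in a two-class setting.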


Learning the Past Tense of English Verbs: The Symbolic Pattern Associator vs. Connectionist Models

Journal of Artificial Intelligence Research ◽

10.1613/jair.39 ◽

1994 ◽

Vol 1 ◽

pp. 209-229 ◽

Cited By ~ 14

Author(s):

C. X. Ling

Keyword(s):

Decision Tree ◽

Cognitive Modeling ◽

Learning Algorithm ◽

General Purpose ◽

Past Tense ◽

Decision Tree Learning ◽

Wide Margin ◽

The Past ◽

Ann Models ◽

Better Than

Learning the past tense of English verbs - a seemingly minor aspect of language acquisition - has generated heated debates since 1986, and has become a landmark task for testing the adequacy of cognitive modeling. Several artificial neural networks (ANNs) have been implemented, and a challenge for better symbolic models has been posed. In this paper, we present a general-purpose Symbolic Pattern Associator (SPA) based upon the decision-tree learning algorithm ID3. We conduct extensive head-to-head comparisons on the generalization ability between ANN models and the SPA under different representations. We conclude that the SPA generalizes the past tense of unseen verbs better than ANN models by a wide margin, and we offer insights as to why this should be the case. We also discuss a new default strategy for decision-tree learning algorithms.


DIFFERENTIAL EVOLUTION IN THE DECISION TREE LEARNING ALGORITHM

Siberian Journal of Science and Technology ◽

10.31772/2587-6066-2019-20-3-312-319 ◽

2019 ◽

Vol 20 (3) ◽

pp. 312-319

Author(s):

S. A. Mitrofanov ◽

E. S. Semenkin

Keyword(s):

Decision Tree ◽

Differential Evolution ◽

Learning Algorithm ◽

Decision Tree Learning


Applying design patterns to decision tree learning system

ACM SIGSOFT Software Engineering Notes ◽

10.1145/291252.288279 ◽

1998 ◽

Vol 23 (6) ◽

pp. 111-120 ◽

Cited By ~ 1

Author(s):

Gou Masuda ◽

Norihiro Sakamoto ◽

Kazuo Ushijima

Keyword(s):

Decision Tree ◽

Design Patterns ◽

Learning System ◽

Decision Tree Learning


Deep Neural Networks Classification via Binary Error-Detecting Output Codes

Applied Sciences ◽

10.3390/app11083563 ◽

2021 ◽

Vol 11 (8) ◽

pp. 3563

Author(s):

Martin Klimo ◽

Peter Lukáč ◽

Peter Tarábek

Keyword(s):

Neural Networks ◽

Categorical Data ◽

Coding Theory ◽

Hamming Distance ◽

Activation Function ◽

Block Codes ◽

Ease Of Use ◽

Linear Block Codes ◽

Binary Coding ◽

The One

One-hot encoding is the prevalent method used in neural networks to represent multi-class categorical data. Its success stems from its ease of use and interpretability as a probability distribution when accompanied by a softmax activation function. However, one-hot encoding leads to very high dimensional vector representations when the categorical data’s cardinality is high. From the coding theory perspective, the Hamming distance between any two one-hot codewords is equal to two, which does not allow error detection or error correction. Binary coding provides more possibilities for encoding categorical data into the output codes, which mitigates the limitations of one-hot encoding mentioned above. We propose a novel method based on Zadeh fuzzy logic to train binary output codes holistically. We study linear block codes for their possibility of separating class information from the checksum part of the codeword, showing their ability not only to detect recognition errors by calculating a non-zero syndrome, but also to evaluate the truth-value of the decision. Experimental results show that the proposed approach achieves results similar to one-hot encoding with a softmax function in terms of accuracy, reliability, and out-of-distribution performance. This suggests a good foundation for future applications, mainly classification tasks with a high number of classes.
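The statement about the Hamming distance of one-hot codes can be checked directly. This is a small sketch; the helper names and the choice of five classes are assumptions made for illustration.

```python
from itertools import combinations


def one_hot(index, num_classes):
    """Return the one-hot codeword for class `index`."""
    return [1 if i == index else 0 for i in range(num_classes)]


def hamming(a, b):
    """Number of positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))


num_classes = 5
codes = [one_hot(i, num_classes) for i in range(num_classes)]
# Collect the distance between every pair of distinct codewords.
distances = {hamming(a, b) for a, b in combinations(codes, 2)}
```

Every pair of distinct codewords differs in exactly two positions, and the codeword length grows linearly with the number of classes, which is the dimensionality concern raised in the abstract.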
