A Novel Categorical Data Attribute Split Technique in Decision Tree Learning

2020 ◽  
Vol 9 (1) ◽  
pp. 1607-1612

A new technique is proposed for splitting categorical data during decision tree learning. The technique is based on class probability representations and manipulations of the class labels corresponding to the distinct values of categorical attributes. For each categorical attribute, an aggregate similarity in terms of class probabilities is computed. The attribute with the highest aggregated similarity measure is selected as the best split attribute, and the data in the current node of the decision tree is divided into as many subsets as that attribute has distinct values. Many experiments were conducted using the proposed method, and the results show that it is better than many competing methods in terms of efficiency, ease of use, understandability, and output results, and that it will be useful in many modern applications.
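
The abstract does not define the similarity measure, so the following is only a minimal sketch of the two steps it does describe: building a class-probability distribution for each distinct attribute value, and dividing the node into one subset per distinct value of the chosen attribute. The scoring function `stand_in_score` is a placeholder assumption (a weighted average of the largest class probability per value), not the paper's measure, and all names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def class_probabilities(rows, attr, label):
    """Map each distinct value of `attr` to its class-probability distribution."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attr]][row[label]] += 1
    return {
        value: {cls: n / sum(c.values()) for cls, n in c.items()}
        for value, c in counts.items()
    }


def stand_in_score(rows, attr, label):
    """ASSUMPTION: placeholder score, not the paper's similarity measure.

    Weighted average of the largest class probability over the attribute's values.
    """
    probs = class_probabilities(rows, attr, label)
    sizes = Counter(row[attr] for row in rows)
    return sum(sizes[v] * max(p.values()) for v, p in probs.items()) / len(rows)


def multiway_split(rows, attrs, label, score=stand_in_score):
    """Pick the best-scoring attribute and split into one subset per distinct value."""
    best = max(attrs, key=lambda a: score(rows, a, label))
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[best]].append(row)
    return best, dict(subsets)


# Hypothetical toy node: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
]
best, subsets = multiway_split(rows, ["outlook", "windy"], "play")
```

Passing the score as a parameter keeps the multiway split independent of whichever similarity measure is actually used.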

2018 ◽  
Vol 11 (1) ◽  
pp. 113-121 ◽  
Author(s):  
Nijil Raj N ◽  
T. Mahalekshmi

Multi-label classification methods are important in various fields, such as protein type, protein function, semantic scene classification, and music categorization. In multi-label classification, each sample can be associated with a set of class labels. In protein type classification, one of the major types of protein is membrane protein. Membrane proteins perform different cellular processes and important functions, which depend on the protein type. Each membrane protein can have different roles at the same time. In this study, we propose membrane protein type classification using the Decision Tree (DT) classification algorithm. The DT classifies a membrane protein into six types. An essential set of features is extracted from the membrane protein dataset S1 and used for the proposed method, which achieved an accuracy of 69.81%, whereas the existing network-based and shortest-path methods achieved accuracies of 66.78% and 54.97%, respectively. The accuracies of the existing methods were not obtained on the full set of proteins in dataset S1, but after the removal of a few unannotated proteins. In terms of both accuracy and complexity, the proposed method appears to be better than the existing methods.


Author(s):  
Joshua Zhexue Huang

A lot of data in real-world databases are categorical. For example, gender, profession, position, and hobby of customers are usually defined as categorical attributes in the CUSTOMER table. Each categorical attribute is represented with a small set of unique categorical values such as {Female, Male} for the gender attribute. Unlike numeric data, categorical values are discrete and unordered. Therefore, the clustering algorithms for numeric data cannot be used to cluster the categorical data that exist in many real-world applications. In data mining research, much effort has been put into the development of new techniques for clustering categorical data (Huang, 1997b; Huang, 1998; Gibson, Kleinberg, & Raghavan, 1998; Ganti, Gehrke, & Ramakrishnan, 1999; Guha, Rastogi, & Shim, 1999; Chaturvedi, Green, Carroll, & Foods, 2001; Barbara, Li, & Couto, 2002; Andritsos, Tsaparas, Miller, & Sevcik, 2003; Li, Ma, & Ogihara, 2004; Chen, & Liu, 2005; Parmar, Wu, & Blackhurst, 2007). The k-modes clustering algorithm (Huang, 1997b; Huang, 1998) is one of the first algorithms for clustering large categorical data. In the past decade, this algorithm has been well studied and widely used in various applications. It has also been adopted in commercial software (e.g., Daylight Chemical Information Systems, Inc, http://www.daylight.com/).
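
As a rough illustration of the k-modes idea mentioned above, the sketch below uses the simple matching dissimilarity (the number of attributes on which two records differ) and replaces cluster means with per-attribute modes. The seeding rule (first k distinct records), the function names, and the toy customer records are assumptions made for this example, not details taken from the cited papers.

```python
from collections import Counter


def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(records, k, max_iter=100):
    """Cluster tuples of categorical values around k modes."""
    # ASSUMPTION: seed the modes with the first k distinct records.
    modes = []
    for r in records:
        if r not in modes:
            modes.append(r)
        if len(modes) == k:
            break

    assignment = None
    for _ in range(max_iter):
        # Assign every record to its nearest mode.
        new_assignment = [
            min(range(len(modes)), key=lambda j: mismatch(r, modes[j]))
            for r in records
        ]
        if new_assignment == assignment:
            break  # converged: no record changed cluster
        assignment = new_assignment
        # Update each mode to the most frequent value of every attribute.
        for j in range(len(modes)):
            members = [r for r, a in zip(records, assignment) if a == j]
            if members:
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return modes, assignment


# Hypothetical CUSTOMER records: (gender, profession, hobby).
records = [
    ("F", "teacher", "chess"),
    ("M", "engineer", "golf"),
    ("F", "teacher", "golf"),
    ("M", "engineer", "chess"),
    ("F", "teacher", "chess"),
    ("M", "engineer", "golf"),
]
modes, assignment = k_modes(records, k=2)
```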


Author(s):  
ADNAN AMIN

This paper presents a new technique for the recognition of hand-printed Latin characters using machine learning. Conventional methods have relied on manually constructed dictionaries which are not only tedious to construct but also difficult to make tolerant to variation in writing styles. The advantages of machine learning are that it can generalize over a large degree of variation between writing styles, and recognition rules can be constructed by example. Characters are scanned into the computer and preprocessing techniques transform the bit-map representation of the characters into a set of primitives which can be represented in an attribute base form. A set of such representations for each character is then input to C4.5 which produces a decision tree for classifying each character.


Author(s):  
Elena Battaglia ◽  
Simone Celano ◽  
Ruggero G. Pensa

Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the complexity of the relationships among different values of a categorical attribute. Although distance learning algorithms exist for categorical data, they may disclose private information about individual records if applied to a secret dataset. To address this problem, we introduce a differentially private family of algorithms for learning distances between any pair of values of a categorical attribute according to the way they are co-distributed with the values of other categorical attributes forming the so-called context. We define different variants of our algorithm and we show empirically that our approach consumes little privacy budget while providing accurate distances, making it suitable in distance-based applications, such as clustering and classification.


2019 ◽  
Vol 7 (2) ◽  
pp. 54-59
Author(s):  
R. Raja Aswathi ◽  
K. Pazhani Kumar ◽  
B. Ramakrishnan

The C4.5 algorithm is an efficient decision-tree-based classifier derived from the ID3 approach. C4.5 is also a rule-based classification algorithm. Its main strengths are that it can deal with categorical data, overfitting, and missing values. The performance of C4.5 is superior to that of ID3 even with an equal number of attributes. EC4.5 (Exponential C4.5) is an extension of the C4.5 algorithm that uses the exponential of the split value to predict the gain of attributes and addresses the setback reported in C4.5. However, EC4.5 still misclassifies some data, and a new technique is introduced to avoid this problem. This paper proposes TMC4.5 (Taylor-Madhava C4.5), a technique that reduces uncertainty in the classification of data by integrating the exponential split value of EC4.5 with a sine split value derived from the Madhava series. With this technique, an optimized gain value is obtained that reduces uncertainty. According to the results obtained, TMC4.5 performs far better than the C4.5 and EC4.5 algorithms.
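
For context, the baseline that EC4.5 and TMC4.5 modify is the standard C4.5 gain ratio: information gain divided by split information. The sketch below computes only that baseline for a categorical attribute. It does not attempt the exponential or sine split values, whose formulas are not given in the abstract, and the function names are chosen for illustration.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Standard C4.5 gain ratio of a categorical attribute.

    values: attribute value of each sample; labels: class of each sample.
    """
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after splitting on the attribute.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


labels = ["x", "x", "y", "y"]
informative = gain_ratio(["a", "a", "b", "b"], labels)    # separates the classes
uninformative = gain_ratio(["a", "b", "a", "b"], labels)  # tells us nothing
```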


2018 ◽  
Vol 18 (1) ◽  
pp. 11-29 ◽  
Author(s):  
Dharmaraj R. Patil ◽  
J. B. Patil

Researchers all over the world have provided significant and effective solutions to detect malicious URLs. Still, due to the ever-changing nature of cyberattacks, there are many open issues. In this paper, we provide an effective hybrid methodology with new features to deal with this problem. To evaluate our approach, we used state-of-the-art supervised decision tree learning classification models. We performed our experiments on a balanced dataset. The experimental results show that, with the inclusion of the new features, all the decision tree learning classifiers work well on our labeled dataset, achieving 98-99% detection accuracy with a very low False Positive Rate (FPR) and False Negative Rate (FNR). We also achieved 99.29% detection accuracy with very low FPR and FNR using a majority voting technique, which is better than well-known anti-virus and anti-malware solutions.
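
The majority voting step can be sketched as follows, assuming each classifier has already produced one label per URL. The classifier outputs and label names here are hypothetical, and the tie-breaking behavior (the label seen first among the tied ones wins) is an assumption of this sketch rather than something the abstract specifies; an odd number of voters avoids ties for two classes.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions: list of equal-length label lists, one list per classifier.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical outputs of three decision tree classifiers on three URLs.
tree_a = ["malicious", "benign", "malicious"]
tree_b = ["malicious", "malicious", "benign"]
tree_c = ["benign", "benign", "malicious"]
combined = majority_vote([tree_a, tree_b, tree_c])
```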


1994 ◽  
Vol 1 ◽  
pp. 209-229 ◽  
Author(s):  
C. X. Ling

Learning the past tense of English verbs - a seemingly minor aspect of language acquisition - has generated heated debates since 1986, and has become a landmark task for testing the adequacy of cognitive modeling. Several artificial neural networks (ANNs) have been implemented, and a challenge for better symbolic models has been posed. In this paper, we present a general-purpose Symbolic Pattern Associator (SPA) based upon the decision-tree learning algorithm ID3. We conduct extensive head-to-head comparisons on the generalization ability between ANN models and the SPA under different representations. We conclude that the SPA generalizes the past tense of unseen verbs better than ANN models by a wide margin, and we offer insights as to why this should be the case. We also discuss a new default strategy for decision-tree learning algorithms.


1998 ◽  
Vol 23 (6) ◽  
pp. 111-120 ◽  
Author(s):  
Gou Masuda ◽  
Norihiro Sakamoto ◽  
Kazuo Ushijima

2021 ◽  
Vol 11 (8) ◽  
pp. 3563
Author(s):  
Martin Klimo ◽  
Peter Lukáč ◽  
Peter Tarábek

One-hot encoding is the prevalent method used in neural networks to represent multi-class categorical data. Its success stems from its ease of use and interpretability as a probability distribution when accompanied by a softmax activation function. However, one-hot encoding leads to very high dimensional vector representations when the categorical data’s cardinality is high. From the coding theory perspective, the Hamming distance in one-hot encoding is equal to two, which does not allow error detection or correction. Binary coding provides more possibilities for encoding categorical data into the output codes, which mitigates the limitations of the one-hot encoding mentioned above. We propose a novel method based on Zadeh fuzzy logic to train binary output codes holistically. We study linear block codes for their possibility of separating class information from the checksum part of the codeword, showing their ability not only to detect recognition errors by calculating a non-zero syndrome, but also to evaluate the truth-value of the decision. Experimental results show that the proposed approach achieves results similar to one-hot encoding with a softmax function in terms of accuracy, reliability, and out-of-distribution performance. It suggests a good foundation for future applications, mainly classification tasks with a high number of classes.
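
The two properties of one-hot encoding noted above can be checked directly: every pair of distinct one-hot codewords differs in exactly two positions, and the codeword length grows linearly with the number of classes, whereas a plain binary code needs only a logarithmic number of bits. This is a small illustrative check with an arbitrarily chosen class count, not the paper's fuzzy-logic training method.

```python
from itertools import combinations


def one_hot(index, num_classes):
    """One-hot codeword for class `index`."""
    return [1 if i == index else 0 for i in range(num_classes)]


def hamming(a, b):
    """Number of positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))


num_classes = 10  # arbitrary example size
codes = [one_hot(i, num_classes) for i in range(num_classes)]

# Set of all pairwise Hamming distances between distinct codewords.
distances = {hamming(a, b) for a, b in combinations(codes, 2)}

one_hot_length = len(codes[0])
# Minimum bits needed to index the classes with a plain binary code.
binary_length = (num_classes - 1).bit_length()
```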

