Attribute Selection Based on Information Gain for Automatic Grouping Student System

Author(s):  
Oktariani Nurul Pratiwi ◽  
Budi Rahardjo ◽  
Suhono Harso Supangkat
2018 ◽  
Vol 7 (3.12) ◽  
pp. 344
Author(s):  
Jayesh Deep Dubey ◽  
Deepak Arora ◽  
Pooja Khanna

Analysis of EEG data is one of the most important parts of Brain Computer Interface (BCI) systems, because EEG data contain a substantial amount of information that can be used to study and improve such systems. One problem with EEG analysis is the large amount of data produced, some of which may not be useful. Identifying the relevant data within it is therefore important for better analysis. The objective of this study is to evaluate the performance of a Random Forest classifier on motor movement EEG data and to reduce the number of electrodes considered in EEG recording and analysis, so that less data is produced and only relevant electrodes are analyzed. The dataset used is the Physionet motor movement/imagery data, which consists of EEG recordings obtained using 64 electrodes. These 64 electrodes were ranked by their information gain with respect to the class using the Info Gain attribute selection algorithm. The electrodes were then divided into 4 lists. List 1 consists of the top 18 ranked electrodes, and the number of electrodes was increased by 15 (in ranked order) in each subsequent list. Lists 2, 3 and 4 consist of the top 33, 48 and 64 electrodes respectively. The accuracy of the Random Forest classifier on each list was compared with its accuracy on List 4, which contains all 64 electrodes. The additional electrodes in List 4 were rejected because the accuracy of the classifier was almost the same for List 4 and List 3. With this method we were able to reduce the number of electrodes from 64 to 48 with an average decrease of only 0.9% in classifier accuracy. This reduction can substantially reduce the time and effort required for analysis of EEG data.
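
The ranking and list-building steps described in this abstract can be sketched roughly as follows. This is an illustration only, not the authors' pipeline: it assumes each electrode has already been reduced to one discretized feature value per trial, and the channel names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """H(labels) minus the weighted entropy of labels within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return feature names sorted from highest to lowest information gain."""
    return sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )


def nested_lists(ranked, sizes):
    """Build the top-k prefixes of a ranking, one per requested size."""
    return [ranked[:k] for k in sizes]


# Toy data (hypothetical): three discretized channels and a binary class.
labels = [0, 0, 1, 1]
columns = {
    "ch_a": [0, 0, 1, 1],  # perfectly predicts the class
    "ch_b": [0, 1, 0, 1],  # independent of the class
    "ch_c": [0, 0, 0, 1],  # partially informative
}
ranked = rank_features(columns, labels)
subsets = nested_lists(ranked, sizes=[1, 2, 3])
```

With 64 ranked electrodes, `nested_lists(ranked, [18, 33, 48, 64])` would give the four lists named in the abstract; a classifier could then be trained on each subset and the accuracies compared.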


2019 ◽  
Vol 46 (3) ◽  
pp. 325-339
Author(s):  
Muhammad Shaheen ◽  
Tanveer Zafar ◽  
Sajid Ali Khan

Selecting an attribute for placement at an appropriate position in a decision tree (e.g. the root of the tree) is an important decision. Many attribute selection measures, such as Information Gain, Gini Index and Entropy, have been developed for this purpose. The suitability of an attribute generally depends on the diversity of its values, its relevance and its dependency. Different attribute selection measures use different criteria for measuring the suitability of an attribute. Diversity Index is a classical statistical measure for determining the diversity of values and, to our knowledge, it has never been used as an attribute selection method. In this article, we propose a novel attribute selection method for decision tree classification. In the proposed scheme, the average of Information Gain, Gini Index and Diversity Index is used to assign a weight to each attribute. The attribute with the highest average value is selected for the classification. We have empirically tested our proposed algorithm on the classification of different data sets of scientific journals and conferences. We have developed a web-based application named JC-Rank that makes use of our proposed algorithm. We have also compared the results of our proposed technique with some existing decision tree classification algorithms.
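
One possible reading of the averaging scheme in this abstract is sketched below. The abstract does not give the formulas, so several choices here are assumptions: the Gini Index is used as a gain (reduction in Gini impurity after the split), the Diversity Index is taken to be the Gini-Simpson index of the attribute's own values, and the three terms are averaged without rescaling. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def proportions(values):
    """Relative frequency of each distinct value."""
    total = len(values)
    return [n / total for n in Counter(values).values()]


def entropy(values):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in proportions(values))


def gini(values):
    """Gini impurity, 1 - sum(p^2); identical in form to the Gini-Simpson index."""
    return 1.0 - sum(p * p for p in proportions(values))


def split_gain(impurity, attribute, labels):
    """Impurity of labels minus the weighted impurity after splitting on attribute."""
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    after = sum(len(g) / len(labels) * impurity(g) for g in groups.values())
    return impurity(labels) - after


def averaged_score(attribute, labels):
    """Unweighted mean of the three measures (an assumed combination rule)."""
    info_gain = split_gain(entropy, attribute, labels)
    gini_gain = split_gain(gini, attribute, labels)  # assumption: Gini as a gain
    diversity = gini(attribute)  # assumption: Gini-Simpson index of attribute values
    return (info_gain + gini_gain + diversity) / 3.0


# Toy data (hypothetical): pick the attribute with the highest averaged score.
labels = [0, 0, 1, 1]
attributes = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
best = max(attributes, key=lambda name: averaged_score(attributes[name], labels))
```

Because the three measures live on different scales, a real implementation might normalize them before averaging; the abstract does not say whether the authors do so.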


2019 ◽  
Vol 1196 ◽  
pp. 012021
Author(s):  
Ahmad Fali Oklilas ◽  
Tasmi ◽  
Sri Desy Siswanti ◽  
Mira Afrina ◽  
Herri Setiawan

In the credit card industry, fraud is one of the major issues to handle, as genuine credit card customers may sometimes be misclassified as fraudulent and vice versa. Several detection systems have been developed, but their complexity, along with their accuracy and precision, limits their usefulness in fraud detection applications. In this paper, a new methodology, Support Vector Machine with Information Gain (SVMIG), is proposed to improve the accuracy of identifying fraudulent credit card transactions with a high true positive rate. In SVMIG, min-max normalization is used to normalize the attributes, and the feature set is reduced using information gain based attribute selection. Further, the Apriori algorithm is used to select the frequent attribute set and to reduce the candidate itemset size while detecting fraud. The experimental results suggest that the proposed algorithm achieves 94.102% higher accuracy on the standard dataset compared to the existing Bayesian and random forest based approaches for a large sample size in dealing with legal and fraudulent transactions.
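
The min-max normalization step mentioned in this abstract is the standard rescaling x' = (x - min) / (max - min). A minimal sketch follows; the handling of a constant column and the sample values are assumptions, not details taken from the paper.

```python
def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        # Assumption: avoid division by zero by returning zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical transaction amounts.
amounts = [10.0, 20.0, 40.0]
scaled = min_max_normalize(amounts)
```

Normalizing each attribute this way keeps attributes with large numeric ranges from dominating a distance- or margin-based classifier such as an SVM.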


2011 ◽  
Vol 52-54 ◽  
pp. 168-173
Author(s):  
Mao Ling Pen ◽  
Ai Ming Huang

Many network application technologies need an algorithm for multi-dimensional packet classification, for example network security, load balancing, router policy and QoS. Because multi-attribute packet classification involves many levels and the rule table must be traversed many times to match a classification rule, efficiency is low. This paper puts forward a packet classification algorithm based on a decision tree. Compared with some traditional packet classification matching algorithms, it uses three measures (information gain, information gain ratio and Gini) for attribute selection, and both accuracy and matching efficiency are clearly improved.
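
To illustrate why an algorithm might use information gain ratio alongside plain information gain, the sketch below computes both for two packet fields. The field names, values and actions are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(attribute, labels):
    """Reduction in label entropy after splitting on the attribute."""
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after


def gain_ratio(attribute, labels):
    """Information gain divided by the split information (entropy of the attribute)."""
    split_info = entropy(attribute)
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split the data
    return information_gain(attribute, labels) / split_info


# Hypothetical rule-table sample: the action applied to four packets.
actions = ["permit", "permit", "deny", "deny"]
protocol = ["tcp", "tcp", "udp", "udp"]  # two values, separates the actions
src_port = [1001, 1002, 1003, 1004]      # unique per packet, also "separates" them

ig_protocol = information_gain(protocol, actions)
ig_port = information_gain(src_port, actions)
gr_protocol = gain_ratio(protocol, actions)
gr_port = gain_ratio(src_port, actions)
```

Both fields have the same information gain here, but the gain ratio of the many-valued `src_port` is halved, which is the usual reason to consult gain ratio when attributes have very different numbers of values.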


Customer Relationship Management (CRM) analyzes datasets to find insights that in turn help to frame business strategy for the improvement of enterprises. Analyzing data in CRM requires computationally intensive models. Machine Learning (ML) algorithms help in analyzing such high-dimensional datasets. In most real-world datasets, the strong independence assumption that Naive Bayes (NB) makes between attributes is violated, and other drawbacks such as irrelevant, partially irrelevant and redundant data lead to poor prediction performance. Feature selection is a preprocessing method applied to enhance the prediction of the NB model. Empirical experiments are conducted on NB with and without feature selection. In this paper, an empirical study of attribute selection is conducted for five different filter-based feature selection methods: Relief-F, Pearson correlation (PCC), Symmetrical Uncertainty (SU), Gain Ratio (GR) and Information Gain (IG).
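
Of the filters listed in this abstract, Symmetrical Uncertainty has a compact standard definition, SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), with the mutual information obtained as I(X; Y) = H(X) + H(Y) - H(X, Y). A minimal sketch for discrete attributes follows; the toy data and the zero-entropy convention are assumptions.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of a sequence of hashable values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """Mutual information between x and y, normalized to the range [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # assumption: both variables constant, report no association
    h_xy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_information = h_x + h_y - h_xy
    return 2.0 * mutual_information / (h_x + h_y)


# Hypothetical class labels and two candidate attributes.
churn = [0, 0, 1, 1]
su_relevant = symmetrical_uncertainty([0, 0, 1, 1], churn)    # identical to the class
su_irrelevant = symmetrical_uncertainty([0, 1, 0, 1], churn)  # independent of the class
```

A filter-based selector would compute this score for every attribute against the class and keep the highest-scoring ones before training the Naive Bayes model.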


2019 ◽  
Vol 8 (3) ◽  
pp. 5659-5663

The film business is a billion-dollar business, and an extensive amount of data related to motion pictures is accessible over the web. In this system we analyze a dataset to predict the success of movies. To do this, the historical information of each component that impacts the success or failure of a motion picture, such as actor, actress, director and music, is given a weightage, and then, based on different parameters, we predict whether the movie will be a flop, average or a superhit. In this model we focus on attribute selection for predicting the success of movies. A comparative analysis is performed to find the most accurate results among the algorithms used. A few parameters that are important for predicting the success of a movie are gross, genres, release date, star power of actors, actresses and directors, and budget. The dataset has 28 parameters, and the task is to find the most relevant ones. This is achieved by the feature selection method shown in figure 1. Feature selection methods are available in the “sklearn” library of Python. The feature selection methods used include decision trees, information gain and gain ratio. A heatmap is generated to visualize the success of movies in different regions. Various graphs of time vs. algorithms and accuracy vs. algorithms are generated for analysis.

