Development of the Edible and Poisonous Mushrooms Classification Model by using the Feature Selection and the Decision Tree Techniques

This research aims to develop a classification model for edible and poisonous mushrooms by applying feature selection together with the decision tree technique. Two feature selection methods were applied, Chi-square and Information Gain, and the effectiveness of the model was compared across three tree-based methods: Iterative Dichotomiser 3 (ID3), C4.5 and Random Forest. The data used for classifying the edible and poisonous mushrooms were derived from the Encyclopedia of Thai mushrooms and the book entitled “Diversity of Mushrooms and Macrofungi in Thailand”. The evaluation of the model’s effectiveness revealed that the model using the Information Gain technique together with the Random Forest technique provided the most accurate classification outcomes, at 94.19%; therefore, this model could be applied in future studies.
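
As an illustration of the Information Gain ranking mentioned above, the following is a minimal sketch rather than the authors' implementation. The feature names (`cap_color`, `has_ring`) and the six toy samples are hypothetical and are not taken from the study's data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the weighted label entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data (assumption: not from the study).
cap_color = ["red", "red", "brown", "brown", "white", "white"]
has_ring = ["yes", "no", "yes", "no", "yes", "no"]
label = ["poisonous", "poisonous", "edible", "edible", "edible", "poisonous"]

gains = {
    "cap_color": information_gain(cap_color, label),
    "has_ring": information_gain(has_ring, label),
}
# Features sorted from most to least informative.
ranking = sorted(gains, key=gains.get, reverse=True)
```

In this toy example `cap_color` separates the classes far better than `has_ring`, so it ranks first; a filter-based selector would keep the top-ranked features before training the tree model.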

2020 ◽  
Vol 37 (4) ◽  
pp. 563-569
Dželila Mehanović ◽  
Jasmin Kevrić

Security is one of the most topical issues in the online world. Lists of security threats are constantly updated. One of those threats is phishing websites. In this work, we address the problem of phishing website classification. Three classifiers were used: K-Nearest Neighbor, Decision Tree and Random Forest, together with feature selection methods from Weka. The achieved accuracy was 100%, and the number of features was reduced to seven. Moreover, reducing the number of features also reduced the time needed to build the models. The build time for Random Forest decreased from the initial 2.88 s and 3.05 s, for percentage split and 10-fold cross-validation respectively, to 0.02 s and 0.16 s.

2019 ◽  
pp. 016555151986159 ◽  
Ala’ M Al-Zoubi ◽  
Ja’far Alqatawna ◽  
Hossam Faris ◽  
Mohammad A Hassonah

In online social networks, spam profiles represent one of the most serious security threats over the Internet; if they do not stop producing bad advertisements, they can be exploited by criminals for various purposes. This article addresses the nature and the characteristics of spam profiles in a social network like Twitter to improve spam detection, based on a number of publicly available language-independent features. In order to investigate the effectiveness of these features in spam detection, four datasets are extracted for four different language contexts (i.e. Arabic, English, Korean and Spanish), and a fifth is formed by combining them all. We conduct our experiments using a set of five classification algorithms that are well known in the spam detection field, k-Nearest Neighbours (k-NN), Random Forest (RF), Naive Bayes (NB), Decision Tree (DT) (J48) and Multilayer Perceptron (MLP) classifiers, along with five filter-based feature selection methods, namely, Information Gain, Chi-square, ReliefF, Correlation and Significance. The results show oscillating performance of each classifier across all datasets, but improved classification results with feature selection. In addition, detailed analysis and comparisons are carried out on two different levels: in the first level, we compare the selected features’ importance among the feature selection methods, whereas in the second level, we observe the relations and the importance of the selected features across all datasets. The findings of this article lead to a better understanding of social spam and to improved detection methods by considering the various important features resulting from the different lingual contexts.

2020 ◽  
Vol 10 (1) ◽  
pp. 74-80
Hivi I. Dino ◽  
Maiwan B. Abdulrazzaq

Facial expression recognition (FER) has played a prominent role in research since the 1990s. This paper provides a comparative approach for FER based on three feature selection methods (correlation, gain ratio, and information gain) for determining the most distinguishing features of face images, using four classification algorithms: multilayer perceptron, Naïve Bayes, decision tree, and K-nearest neighbor (KNN). These classifiers are used for the task of expression recognition and for comparing their relative performance. The main aim of the approach is to determine the most effective classifier based on the minimum acceptable number of features by analyzing and comparing their performance. The approach has been applied to the CK+ dataset. The experimental results show that KNN is the best classifier, with 91% accuracy using only 30 features.

2020 ◽  
Vol 184 ◽  
pp. 01011
Sreethi Musunuru ◽  
Mahaalakshmi Mukkamala ◽  
Latha Kunaparaju ◽  
N V Ganapathi Raju

Though banks hold an abundance of data on their customers in general, it is not unusual for them to track the actions of the creditors regularly to improve the services they offer to them and understand why a lot of them choose to exit and shift to other banks. Analyzing customer behavior can be highly beneficial to the banks as they can reach out to their customers on a personal level and develop a business model that will improve the pricing structure, communication, advertising, and benefits for their customers and themselves. Features such as the amount a customer credits every month, the customer's annual salary, and the customer's gender are used to classify customers using machine learning algorithms like K Neighbors Classifier and Random Forest Classifier. By classifying the customers, banks can get an idea of who will stay with them and who will leave in the near future. Our study aims to remove features that are independent but not influential in determining the future status of customers, without loss of accuracy, and to refine the model to see whether this also increases the accuracy of the results.

2020 ◽  
Vol 43 (1) ◽  
pp. 103-125
Yi Zhong ◽  
Jianghua He ◽  
Prabhakar Chalise

With the advent of high-throughput technologies, high-dimensional datasets are increasingly available. This has not only opened up new insight into biological systems but also posed analytical challenges. One important problem is the selection of an informative feature subset and the prediction of future outcomes. It is crucial that models are not overfitted and give accurate results with new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interest in clinical settings. We propose a two-step framework for feature selection and classification model construction, which utilizes a nested and repeated cross-validation method. We evaluated our approach using both simulated data and two publicly available gene expression datasets. The proposed method showed comparatively better predictive accuracy for new cases than the standard cross-validation method.
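
The idea of nested and repeated cross-validation can be sketched as follows. This is a generic illustration of how the index splits are arranged, not the authors' two-step framework; the function names and the fold counts (`outer_k=5`, `inner_k=3`, `repeats=2`) are assumptions chosen for the example.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and split them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_repeated_cv(n_samples, outer_k=5, inner_k=3, repeats=2, seed=0):
    """Yield (repeat, outer_train, outer_test, inner_splits) tuples.

    inner_splits is a list of (inner_train, inner_valid) pairs drawn only
    from outer_train, so the outer test fold is never used for selection.
    """
    rng = random.Random(seed)
    for repeat in range(repeats):
        outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
        for i, outer_test in enumerate(outer_folds):
            outer_train = [idx for j, fold in enumerate(outer_folds)
                           if j != i for idx in fold]
            inner_folds = k_fold_indices(outer_train, inner_k, rng)
            inner_splits = []
            for m, inner_valid in enumerate(inner_folds):
                inner_train = [idx for n, fold in enumerate(inner_folds)
                               if n != m for idx in fold]
                inner_splits.append((inner_train, inner_valid))
            yield repeat, outer_train, outer_test, inner_splits


# 2 repeats x 5 outer folds = 10 outer evaluations on 20 samples.
splits = list(nested_repeated_cv(n_samples=20))
```

Feature selection and any tuning would run inside the inner splits, and the chosen model would then be scored once on the corresponding outer test fold. Keeping selection inside the inner loop is what prevents the outer accuracy estimate from being optimistically biased.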

2018 ◽  
Vol 7 (3.12) ◽  
pp. 344
Jayesh Deep Dubey ◽  
Deepak Arora ◽  
Pooja Khanna

Analysis of EEG data is one of the most important parts of Brain Computer Interface (BCI) systems because EEG data contain a substantial amount of crucial information that can be used to study and improve BCI systems. One of the problems with the analysis of EEG is the large amount of data that is produced, some of which might not be useful for the analysis. Therefore, identifying the relevant data within the large amount of EEG data is important for better analysis. The objective of this study is to evaluate the performance of the Random Forest classifier on motor movement EEG data and to reduce the number of electrodes considered in EEG recording and analysis, so that less data is produced and only relevant electrodes are analyzed. The dataset used in the study is the Physionet motor movement/imagery data, which consists of EEG recordings obtained using 64 electrodes. These 64 electrodes were ranked by their information gain with respect to the class using the Info Gain attribute selection algorithm. The electrodes were then divided into 4 lists. List 1 consists of the top 18 ranked electrodes, and the number of electrodes was increased by 15 (in ranked order) in each subsequent list. Lists 2, 3 and 4 consist of the top 33, 48 and 64 electrodes respectively. The accuracy of the Random Forest classifier for each list was compared with its accuracy for List 4, which consists of all 64 electrodes. The additional electrodes in List 4 were rejected because the accuracy of the classifier was almost the same for List 4 and List 3. Through this method we were able to reduce the number of electrodes from 64 to 48 with an average decrease of only 0.9% in the accuracy of the classifier. This reduction in the number of electrodes can substantially reduce the time and effort required for analysis of EEG data.
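
The construction of the four nested electrode lists can be sketched as below. The electrode names are hypothetical placeholders for the information-gain ranking, and the sketch assumes that the final list simply contains every ranked electrode, since the abstract gives its size as 64 rather than 48 + 15 = 63.

```python
def cumulative_lists(ranked, first=18, step=15, n_lists=4):
    """Return nested top-k lists; the final list holds every ranked item."""
    sizes = [first + step * i for i in range(n_lists - 1)] + [len(ranked)]
    return [ranked[:size] for size in sizes]


# Hypothetical electrode names, already sorted by information gain (assumption).
ranked_electrodes = [f"E{i:02d}" for i in range(1, 65)]

lists = cumulative_lists(ranked_electrodes)
sizes = [len(subset) for subset in lists]  # sizes of List 1 to List 4
```

Each list would then be used to train and score the classifier, and the smallest list whose accuracy stays close to that of the full set would be kept.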

2014 ◽  
Vol 988 ◽  
pp. 511-516 ◽  
Jin Tao Shi ◽  
Hui Liang Liu ◽  
Yuan Xu ◽  
Jun Feng Yan ◽  
Jian Feng Xu

Machine learning is an important solution in research on Chinese text sentiment categorization, and text feature selection is critical to classification performance. However, while classical feature selection methods work well across the global set of categories, they miss many representative feature words of each individual category. This paper presents an improved information gain method that integrates word frequency and the degree of feature word sentiment into traditional information gain methods. Experiments show that a classifier improved by this method achieves better classification performance.

Nida Tariq ◽  
Iqra Ijaz ◽  
Muhammad Kamran Malik ◽  
Zubair Malik ◽  
Faisal Bukhari

Urdu literature has a rich tradition of poetry, with many forms, one of which is the Ghazal. Urdu poetry structures are mainly of Arabic origin. Urdu poetry has a complex sentence structure that differs from everyday language, which makes it hard to classify. Our research is focused on identifying the poet when given ghazals as input. No one has previously done this type of work. Two main factors that help categorize and classify a given text are its content and writing style. Urdu poets like Mirza Ghalib, Mir Taqi Mir, Iqbal and many others each have a different writing style and topics of interest. Our model accounts for these two factors and classifies ghazals using different classification models such as SVM (Support Vector Machines), Decision Tree, Random Forest, Naïve Bayes and KNN (K-Nearest Neighbors). Furthermore, we have also applied feature selection techniques such as the chi-square model and L1-based feature selection. For experimentation, we have prepared a dataset of about 4000 ghazals. We have also compared the accuracy of different classifiers and reported the best results for the collected dataset of ghazals.
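
To make the chi-square feature selection step concrete, here is a minimal sketch of the Pearson chi-square statistic for one categorical feature against the class label. It is not the authors' pipeline; the binary "word present" features and the poet labels `A` and `B` are hypothetical.

```python
def chi_square(feature_values, labels):
    """Pearson chi-square statistic for a categorical feature against class labels."""
    n = len(labels)
    feature_counts, class_counts, joint_counts = {}, {}, {}
    for f, c in zip(feature_values, labels):
        feature_counts[f] = feature_counts.get(f, 0) + 1
        class_counts[c] = class_counts.get(c, 0) + 1
        joint_counts[(f, c)] = joint_counts.get((f, c), 0) + 1
    stat = 0.0
    for f, f_total in feature_counts.items():
        for c, c_total in class_counts.items():
            expected = f_total * c_total / n  # count expected under independence
            observed = joint_counts.get((f, c), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical toy data: whether a given word occurs in each ghazal (assumption).
poet = ["A", "A", "A", "A", "B", "B", "B", "B"]
word_x = [1, 1, 1, 0, 0, 0, 1, 0]  # occurs mostly in poet A's ghazals
word_y = [1, 0, 1, 0, 1, 0, 1, 0]  # occurs equally often for both poets

score_x = chi_square(word_x, poet)
score_y = chi_square(word_y, poet)
```

A higher statistic indicates a stronger association between the word and the poet, so `word_x` would be preferred over `word_y` when keeping the top-scoring features.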

Sensors ◽  
2020 ◽  
Vol 20 (5) ◽  
pp. 1447
Pan Huang ◽  
Yanping Li ◽  
Xiaoyi Lv ◽  
Wen Chen ◽  
Shuxian Liu

Action recognition algorithms are widely used in the fields of medical health and pedestrian dead reckoning (PDR). The classification and recognition of non-normal walking actions and normal walking actions are very important for improving the accuracy of medical health indicators and PDR steps. Existing motion recognition algorithms focus on the recognition of normal walking actions, and the recognition of non-normal walking actions common to daily life is incomplete or inaccurate, resulting in a low overall recognition accuracy. This paper proposes a microelectromechanical system (MEMS) action recognition method based on Relief-F feature selection and relief-bagging-support vector machine (SVM). Feature selection using the Relief-F algorithm reduces the dimensions by 16 and reduces the optimization time by an average of 9.55 s. Experiments show that the improved algorithm for identifying non-normal walking actions has an accuracy of 96.63%; compared with Decision Tree (DT), it increased by 11.63%; compared with k-nearest neighbor (KNN), it increased by 26.62%; and compared with random forest (RF), it increased by 11.63%. The average Area Under Curve (AUC) of the improved algorithm improved by 0.1143 compared to KNN, by 0.0235 compared to DT, and by 0.04 compared to RF.

Lubricant condition monitoring (LCM), one of the condition monitoring techniques under Condition Based Maintenance, monitors the condition and state of the lubricant, which reveal the condition and state of the equipment. LCM has proved to be a key concept driving maintenance decision making, involving a sizeable number of parameter (variable) tests that require classification and interpretation based on the lubricant’s condition. Reducing the variables to a manageable and admissible level and using them for prediction is key to optimizing equipment performance and lubricant condition. This study advances a methodology for feature selection and predictive modelling of in-service oil analysis data to assist in maintenance decision making for critical equipment. The proposed methodology includes data pre-processing involving cleaning, expert assessment and standardization, which is needed because of the different measurement scales. Limits provided by the Original Equipment Manufacturers (OEM) are used by the analysts to manually classify and flag samples with significant lubricant deterioration. In the last part of the methodology, Random Forest (RF) is used as a feature selection tool and a Decision Tree (DT) is used to classify the in-service oil samples. The framework is applied to a case study of a thermal power plant. The selection of admissible variables using Random Forest exposes critical used oil analysis (UOA) variables indicative of lubricant/machine degradation, while the DT model, besides predicting the classification of samples, offers visual interpretability of the impact of each parameter on the classification outcome. The model evaluation returned acceptable predictive performance, while the framework renders speedy classification with insights for maintenance decision making, thus ensuring timely interventions.
Moreover, the framework highlights critical and relevant oil analysis parameters that are indicative of lubricant degradation; hence, by addressing such critical parameters, organizations can better enhance the reliability of their critical operable equipment.
