Some Thoughts at the Interface of Ensemble Methods and Feature Selection

Author(s):  
Gavin Brown
2013 ◽  
Vol 22 (03) ◽  
pp. 1350010 ◽  
Author(s):  
Sabereh Sadeghi ◽  
Hamid Beigy

Dimensionality reduction is a necessary task in data mining when working with high-dimensional data, and feature selection is one form of it. Feature selection based on feature ranking has received much attention from researchers, chiefly for its scalability, ease of use, and fast computation. Feature ranking methods fall into different categories and may use different measures for ranking features. Recently, ensemble methods have entered the field of feature ranking and achieved higher accuracy than individual rankers. Accordingly, this paper proposes a heterogeneous ensemble-based algorithm for feature ranking. The base ranking methods in this ensemble are chosen from different categories, such as information-theoretic, distance-based, and statistical methods. The results of the base rankers are then fused into a final feature subset by means of a genetic algorithm. The diversity of the base methods improves the quality of the genetic algorithm's initial population and thus reduces its convergence time. In most ranking methods, it is the user's task to determine the threshold for choosing an appropriate subset of features, which may force the user to try many different values before finding a good one. The proposed algorithm reduces this difficulty. Its performance is evaluated on four text datasets, and the experimental results show that it outperforms the five other feature ranking methods used for comparison. A further advantage of the proposed method is that it is independent of the classifier used downstream.
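The fusion step described above can be sketched in miniature. The paper fuses base rankings with a genetic algorithm; as a simplified, hypothetical stand-in, the sketch below aggregates three heterogeneous rankings by Borda count, illustrating how diverse base rankers combine into one consensus ordering (the ranker outputs here are invented for illustration).

```python
# Minimal sketch: fusing heterogeneous feature rankings.
# The paper's fusion uses a genetic algorithm; Borda-count rank
# aggregation is a simplified stand-in for illustration only.
def borda_fuse(rankings):
    """rankings: list of lists of feature indices, best first."""
    n = len(rankings[0])
    scores = {}
    for ranking in rankings:
        for pos, feat in enumerate(ranking):
            # a feature at position 0 earns n points, position 1 earns n-1, ...
            scores[feat] = scores.get(feat, 0) + (n - pos)
    return sorted(scores, key=scores.get, reverse=True)

# Three hypothetical base rankers over features 0..3
info_theoretic = [2, 0, 3, 1]
distance_based = [2, 3, 0, 1]
statistical    = [0, 2, 3, 1]
consensus = borda_fuse([info_theoretic, distance_based, statistical])
```

A genetic-algorithm fusion would instead search over candidate feature subsets, with such consensus orderings making natural seeds for its initial population.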


2020 ◽  
Vol 5 (1) ◽  
pp. 28-48
Author(s):  
Rebwar M. Nabi ◽  
Soran Ab. M. Saeed ◽  
Habibollah Harron

The prediction of stock prices has become an exciting area for researchers and academicians alike due to its economic impact and potential business profits. This study proposes a novel multiclass classification ensemble learning approach for predicting stock prices from historical data using feature engineering. The proposed approach comprises four main steps: pre-processing, feature selection, feature engineering, and ensemble methods. We use 11 datasets from Nasdaq and the S&P 500 to ensure the accuracy of the proposed approach. Furthermore, eight feature selection algorithms are studied and implemented. More importantly, feature engineering is applied to construct two new features, which prove very promising for improving classification accuracy; this is, to our knowledge, the first study to use feature engineering for multiclass classification with ensemble methods. Finally, seven ensemble machine learning (ML) algorithms are used and compared to discover the best-performing prediction model, and the best feature selection algorithm is identified. The resulting approach, Gradient Boosting Machine with Feature Engineering (GBM-wFE), uses Principal Component Analysis (PCA) for feature selection. We find that GBM-wFE outperforms previous studies, achieving a MAPE of 0.0406%, the best result among the available studies in the literature.
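The feature-engineering step can be illustrated with a small sketch. The abstract does not specify which two features are constructed, so the daily return and a 3-day moving-average ratio below are purely illustrative stand-ins, computed from a hypothetical closing-price series.

```python
# Hedged sketch of a feature-engineering step for historical prices.
# The two engineered features here (daily return, price / 3-day moving
# average) are illustrative assumptions, not the study's actual features.
def engineer_features(closes):
    rows = []
    for i in range(3, len(closes)):
        ret = (closes[i] - closes[i - 1]) / closes[i - 1]  # daily return
        ma3 = sum(closes[i - 3:i]) / 3.0                   # 3-day moving average
        rows.append((ret, closes[i] / ma3))                # two new features
    return rows

closes = [100.0, 101.0, 99.0, 102.0, 104.0]  # toy closing prices
feats = engineer_features(closes)
```

Rows like these would then feed the feature-selection and ensemble steps alongside the raw historical columns.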


Author(s):  
Ahmad A Saifan ◽  
Lina Abu-wardih

Two primary issues have emerged in the machine learning and data mining community: how to deal with imbalanced data and how to choose appropriate features. These are of particular concern in the software engineering domain, and more specifically in the field of software defect prediction. This research presents a procedure that includes a feature selection technique to single out relevant attributes and an ensemble technique to handle the class-imbalance issue. To determine the advantages of feature selection and ensemble methods, we examine two scenarios: (1) ensemble models constructed from the original datasets, without feature selection; (2) ensemble models constructed from the reduced datasets after feature selection has been applied. Four feature selection techniques are employed: Principal Component Analysis (PCA), Pearson's correlation, Greedy Stepwise Forward selection, and Information Gain (IG). The aim of this research is to assess the effectiveness of feature selection techniques when combined with ensemble techniques. Five datasets, obtained from the PROMISE software repository, are analyzed. Tentative results indicate that ensemble methods can improve model performance even without feature selection: PCA feature selection with bagging based on K-NN performs better than both bagging based on SVM and boosting based on K-NN or SVM, while feature selection with Pearson's correlation, Greedy Stepwise, and IG weakens the ensemble models' performance.
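Scenario (1), an ensemble built directly on the original features, can be sketched as a minimal bootstrap-aggregation (bagging) loop around a nearest-neighbour base learner. The data, seed, and 1-NN learner below are toy stand-ins for the study's K-NN bagging, not its actual setup.

```python
# Minimal bagging sketch: train the base learner on bootstrap resamples
# and predict by majority vote. Toy data; illustrative only.
import random

def bagged_predict(base_fit_predict, X, y, x_test, n_bags=5, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_bags):
        # bootstrap resample of the training set (sampling with replacement)
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        votes.append(base_fit_predict(Xb, yb, x_test))
    return max(set(votes), key=votes.count)  # majority vote

# 1-NN as a tiny base learner (stand-in for the study's K-NN)
def one_nn(Xtr, ytr, x):
    dists = [(sum((a - b) ** 2 for a, b in zip(xi, x)), yi)
             for xi, yi in zip(Xtr, ytr)]
    return min(dists)[1]

X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
y = [0, 0, 1, 1]
pred = bagged_predict(one_nn, X, y, (1.0, 0.9))
```

Scenario (2) would simply project `X` through a feature-selection step (e.g. PCA) before the same bagging loop.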


Author(s):  
Fadare Oluwaseun Gbenga ◽  
Adetunmbi Adebayo Olusola ◽  
Oyinloye Oghenerukevwe Eloho ◽  
Mogaji Stephen Alaba

The proliferation of malware variants is arguably the greatest problem in computer security, and protecting information, including source code, against unauthorized access is a central concern. In recent times, machine learning has been extensively researched for malware detection, and ensemble techniques have been established as highly effective in terms of detection accuracy. This paper proposes a framework that combines Chi-square feature selection with eight ensemble learning classifiers built on five base learners: K-Nearest Neighbors, Naïve Bayes, Support Vector Machine, Decision Trees, and Logistic Regression. Among the base learners, K-Nearest Neighbors achieves the highest accuracy: 95.37% with Chi-square feature selection and 87.89% without it. Among the ensembles, the Extreme Gradient Boosting classifier is the most accurate, at 97.407% with Chi-square feature selection and 91.72% without it. Extreme Gradient Boosting and Random Forest lead across the seven evaluation measures with and without Chi-square feature selection, respectively. The results show that tree-based ensemble models are compelling for malware classification.
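The Chi-square selection step ranks each feature by its chi-square statistic against the class label. A minimal sketch for a binary feature and a binary label follows; the data are toy values, not drawn from the study's malware corpus.

```python
# Chi-square statistic for a binary feature vs. a binary class label,
# computed from the 2x2 contingency table. Toy data for illustration.
def chi_square(feature, labels):
    cells = {(f, c): 0 for f in (0, 1) for c in (0, 1)}
    for f, c in zip(feature, labels):
        cells[(f, c)] += 1
    n = len(labels)
    score = 0.0
    for f in (0, 1):
        for c in (0, 1):
            row = cells[(f, 0)] + cells[(f, 1)]   # marginal count of feature value f
            col = cells[(0, c)] + cells[(1, c)]   # marginal count of class c
            expected = row * col / n              # expected count under independence
            if expected:
                score += (cells[(f, c)] - expected) ** 2 / expected
    return score

# A feature perfectly aligned with the label scores higher than one
# that is independent of it.
aligned = chi_square([1, 1, 0, 0], [1, 1, 0, 0])
independent = chi_square([1, 0, 1, 0], [1, 1, 0, 0])
```

Features would be ranked by this score and the top-scoring subset passed to the ensemble classifiers.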


Author(s):  
Chandralekha Mohan ◽  
Shenbagavadivu Nagarajan

Researchers train and build models to classify the presence or absence of a disease, and the accuracy of such classification models is continuously being improved. Building and training a model depends on the medical data utilized, and various machine learning techniques and tools are used to handle different data across disease types and clinical conditions. Classification is the most widely used technique for disease diagnosis, and the accuracy of a classifier largely depends on its attributes: the choice of attributes strongly affects both the diagnosis and the performance of the classifier. With the growing volume of medical data across different clinical conditions, methods for choosing relevant attributes and features in datasets that target specific diseases are still lacking. This study uses ensemble-based feature selection, combining random trees with a wrapper method, to improve classification. The proposed ensemble learning classification method derives a feature subset using the wrapper method, bagging, and random trees, removing irrelevant features and selecting the optimal features through a probability-weighting criterion. The improved algorithm can distinguish relevant from irrelevant features and improve classification performance. The proposed feature selection method is evaluated with SVM, RF, and NB classifiers, and its performance is compared against the FSNBb, FSSVMb, GASVMb, GANBb, and GARFb methods. The proposed method achieves a mean classification accuracy of 92% and outperforms the other ensemble methods.
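The wrapper component can be sketched as a greedy forward search that repeatedly adds whichever feature most improves an evaluation score. The scoring function below is a hypothetical stand-in for the cross-validated accuracy a real wrapper would compute with SVM, RF, or NB; the feature names and weights are invented for illustration.

```python
# Greedy forward wrapper search: keep adding the single feature whose
# inclusion most improves the evaluation score; stop when no feature helps.
def forward_wrapper(features, score_subset):
    selected = []
    best = score_subset(())
    improved = True
    while improved:
        improved = False
        for f in features:
            if f in selected:
                continue
            s = score_subset(tuple(selected + [f]))
            if s > best:
                best, chosen, improved = s, f, True
        if improved:
            selected.append(chosen)
    return selected, best

# Hypothetical scorer: relevant features raise the score, noisy ones
# lower it (a stand-in for cross-validated classifier accuracy).
relevance = {"age": 0.3, "bmi": 0.25, "noise1": -0.1, "noise2": -0.05}

def score(subset):
    return 0.5 + sum(relevance[f] for f in subset)

selected, accuracy = forward_wrapper(list(relevance), score)
```

Under this toy scorer the search keeps the two informative features and rejects both noise features, mirroring the relevant/irrelevant split the proposed method aims for.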


2019 ◽  
Vol 11 (11-SPECIAL ISSUE) ◽  
pp. 400-411
Author(s):  
Pulugu Dileep ◽  
Kunjam Nageswara Rao ◽  
Prajna Bodapati
