Feature Selection Algorithms for Classification and Clustering

2017 ◽

pp. 143-167

Author(s):

Arvind Kumar Tiwari

Keyword(s):

Machine Learning ◽

Data Mining ◽

Feature Selection ◽

Learning Algorithm ◽

High Dimensional ◽

Filter Method ◽

Selection Methods ◽

Wrapper Method ◽

Embedded Method ◽

Selection Algorithms

Feature selection is an important topic in data mining, especially for high-dimensional datasets. It is a process commonly used in machine learning in which subsets of the available features are selected before a learning algorithm is applied. The best subset contains the fewest dimensions that contribute most to accuracy. Feature selection methods fall into three main classes: filter methods, wrapper methods, and embedded methods. This chapter presents an empirical comparison of feature selection methods and their algorithms. Given the substantial number of existing feature selection algorithms, criteria are needed for deciding which algorithm to use in a given situation. This chapter reviews several fundamental algorithms found in the literature and assesses their performance in a controlled scenario.
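As a rough illustration of the filter class described in this abstract, the sketch below scores each feature independently of any learning algorithm and keeps the top k. The scoring rule (absolute Pearson correlation with the label), the function names, and the toy data are all assumptions made for this example; the chapter itself does not specify them.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(vx * vy)

def filter_top_k(columns, labels, k):
    """Rank features by |correlation with the label| and keep the k best.

    `columns` maps a (hypothetical) feature name to its list of values.
    """
    scores = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Toy data, invented for illustration only.
columns = {
    "f_signal": [0.1, 0.2, 0.8, 0.9, 0.15, 0.85],
    "f_noise": [0.5, 0.4, 0.5, 0.4, 0.6, 0.5],
    "f_weak": [0.3, 0.5, 0.6, 0.4, 0.2, 0.7],
}
labels = [0, 0, 1, 1, 0, 1]
selected = filter_top_k(columns, labels, k=2)
print(selected)
```

A wrapper method would instead score whole subsets by training a model on them, and an embedded method would read the selection off the model's own training (for example, from sparsity-inducing penalties).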


Performance evaluation of random forest with feature selection methods in prediction of diabetes

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i1.pp353-359 ◽

2020 ◽

Vol 10 (1) ◽

pp. 353

Author(s):

Raghavendra S ◽

Santosh Kumar J

Keyword(s):

Machine Learning ◽

Data Mining ◽

Feature Selection ◽

Random Forest ◽

Evaluation Method ◽

Learning Algorithm ◽

Selection Methods ◽

Forward Selection ◽

Data Set ◽

Prediction Of Diabetes

Data mining is the process of viewing data from different angles and compiling it into useful information. Recent improvements in data mining and machine learning have empowered biomedical research to improve general health care. Because wrong classification may lead to poor prediction, better classification is needed to improve the prediction rate on medical datasets. When data mining is applied to medical datasets, classification and prediction are the important and difficult challenges. In this work we evaluate the PIMA Indian Diabetes data set from the UCI repository using a machine learning algorithm, Random Forest, together with feature selection methods such as forward selection and backward elimination based on an entropy evaluation method, using percentage split as the test option. The experiment was conducted on the RStudio platform and achieved a classification accuracy of 84.1%. The results suggest that Random Forest predicts diabetes better than other techniques while using fewer attributes, so that the least important tests for identifying diabetes can be avoided.
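The abstract does not spell out its entropy-based evaluation, so the following is only one plausible reading: a greedy forward selection that, at each step, adds the discrete feature giving the largest drop in the conditional entropy of the label. The function names, the `min_gain` stopping threshold, and the toy data are assumptions for this sketch, and no Random Forest is involved.

```python
import math
from collections import Counter

def conditional_entropy(rows, labels, subset):
    """H(label | features in subset), in bits, for discrete-valued features."""
    n = len(rows)
    groups = {}
    for row, y in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        groups.setdefault(key, []).append(y)
    h = 0.0
    for ys in groups.values():
        weight = len(ys) / n
        for count in Counter(ys).values():
            p = count / len(ys)
            h -= weight * p * math.log2(p)
    return h

def forward_selection(rows, labels, max_features, min_gain=1e-9):
    """Greedily add the feature that lowers H(label | selected) the most."""
    selected = []
    remaining = list(range(len(rows[0])))
    current = conditional_entropy(rows, labels, selected)
    while remaining and len(selected) < max_features:
        best = min(remaining,
                   key=lambda i: conditional_entropy(rows, labels, selected + [i]))
        best_h = conditional_entropy(rows, labels, selected + [best])
        if current - best_h < min_gain:
            break  # no remaining feature reduces the uncertainty further
        selected.append(best)
        remaining.remove(best)
        current = best_h
    return selected

# Toy data: the label is (feature 0 AND feature 2); feature 1 is irrelevant.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a & c for a, b, c in rows]
print(forward_selection(rows, labels, max_features=3))
```

Backward elimination would run the same loop in reverse, starting from all features and removing the one whose removal raises the conditional entropy the least.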


A NOVEL FEATURE SELECTION ALGORITHM WITH SUPERVISED MUTUAL INFORMATION FOR CLASSIFICATION

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213013500279 ◽

2013 ◽

Vol 22 (04) ◽

pp. 1350027

Author(s):

JAGANATHAN PALANICHAMY ◽

KUPPUCHAMY RAMASAMY

Keyword(s):

Machine Learning ◽

Data Mining ◽

Feature Selection ◽

Mutual Information ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Class A ◽

Selection Algorithms ◽

The Relationship ◽

Class Variable

Feature selection is essential in data mining and pattern recognition, especially for database classification. In past years, several feature selection algorithms have been proposed to measure the relevance of various features to each class. A suitable feature selection algorithm normally maximizes the relevancy and minimizes the redundancy of the selected features. The mutual information measure can successfully estimate the dependency of features on the entire sampling space, but it cannot exactly represent the redundancies among features. In this paper, a novel feature selection algorithm is proposed based on the maximum relevance and minimum redundancy criterion. Mutual information is used to measure the relevancy of each feature to the class variable, and redundancy is calculated from the relationship between candidate features, selected features, and class variables. The effectiveness is tested on ten benchmark datasets available in the UCI Machine Learning Repository. The experimental results show better performance than some existing algorithms.
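For orientation, here is a minimal sketch of the classic greedy maximum-relevance, minimum-redundancy criterion that this abstract builds on. It is not the paper's novel variant: redundancy here is simply the mean mutual information with already selected features, features are assumed to be discrete, and the function names and toy data are invented for the example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) in bits for two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def mrmr(features, target, k):
    """Greedy mRMR: maximise relevance minus mean redundancy with chosen features.

    `features` is a list of discrete columns; returns the chosen column indices.
    """
    relevance = [mutual_information(col, target) for col in features]
    selected = [max(range(len(features)), key=lambda i: relevance[i])]
    while len(selected) < k:
        def score(i):
            redundancy = sum(
                mutual_information(features[i], features[j]) for j in selected
            ) / len(selected)
            return relevance[i] - redundancy
        candidates = [i for i in range(len(features)) if i not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Toy data: the class encodes two independent bits a and b.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
target = [2 * x + y for x, y in zip(a, b)]
features = [a, list(a), b]  # column 1 duplicates column 0
print(mrmr(features, target, k=2))
```

On this toy data all three columns are equally relevant, but the duplicate column is fully redundant with the first pick, so the criterion prefers the independent column.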


A Novel Feature Selection Method based on MRMR and Enhanced Flower Pollination Algorithm for High Dimensional Biomedical Data

Current Bioinformatics ◽

10.2174/1574893616666210624130124 ◽

2021 ◽

Vol 16 ◽

Author(s):

Chaokun Yan ◽

Mengyuan Li ◽

Jingjing Ma ◽

Yi Liao ◽

Huimin Luo ◽

...

Keyword(s):

Feature Selection ◽

Feature Selection Method ◽

Selection Method ◽

High Dimensional ◽

Flower Pollination Algorithm ◽

Filter Method ◽

Biomedical Data ◽

Local Optima ◽

Flower Pollination ◽

Wrapper Method

Background: The massive amount of biomedical data accumulated in the past decades can be utilized for diagnosing disease. Objective: However, its high dimensionality, small sample sizes, and irrelevant features often have a negative influence on the accuracy and speed of disease prediction. Some existing machine learning models cannot capture the patterns in these datasets accurately without feature selection. Methods: Filter and wrapper are two prevailing feature selection methods. The filter method is fast but has low prediction accuracy, while the wrapper method can obtain high accuracy but has a formidable computation cost. Given the drawbacks of using filter or wrapper individually, a novel feature selection method, called MRMR-EFPATS, is proposed, which hybridizes the filter method minimum redundancy maximum relevance (MRMR) and a wrapper method based on an improved flower pollination algorithm (FPA). First, MRMR is employed to rank and screen out some important features quickly. These features are then used to build the individuals in the population of the subsequent wrapper method, for faster convergence and less computational time. Then, due to its efficiency and flexibility, FPA is adopted to further discover an optimal feature subset. Result: FPA still has some drawbacks, such as a slow convergence rate, inadequacy in searching for new solutions, and a tendency to be trapped in local optima. In our work, an elite strategy is adopted to improve the convergence speed of the FPA. Tabu search and Adaptive Gaussian Mutation are employed to improve the search capability of FPA and escape from local optima. Here, the KNN classifier with 5-fold CV is utilized to evaluate the classification accuracy. Conclusion: Extensive experimental results on six public high-dimensional biomedical datasets show that the proposed MRMR-EFPATS has achieved superior performance compared with other state-of-the-art methods.
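The wrapper stage in this abstract scores candidate feature subsets with a KNN classifier under 5-fold cross-validation. The sketch below shows what such a fitness function might look like; it does not implement MRMR-EFPATS or the flower pollination search. The interleaved fold assignment, k = 3, the boolean-mask representation, and the toy data are assumptions made for illustration.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cv_accuracy(X, y, mask, k=3, folds=5):
    """Pooled k-NN accuracy over interleaved folds, using only masked-in features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty subset cannot classify anything
    Xs = [[row[j] for j in cols] for row in X]
    correct = 0
    for f in range(folds):
        test_idx = [i for i in range(len(Xs)) if i % folds == f]
        train_idx = [i for i in range(len(Xs)) if i % folds != f]
        tX = [Xs[i] for i in train_idx]
        ty = [y[i] for i in train_idx]
        correct += sum(knn_predict(tX, ty, Xs[i], k) == y[i] for i in test_idx)
    return correct / len(Xs)

# Toy data: column 0 separates the classes, column 1 is unscaled noise.
X = [[0.0, 0], [1.0, 10], [0.1, 20], [0.9, 30], [0.2, 40],
     [1.1, 50], [0.1, 60], [1.0, 70], [0.0, 80], [0.8, 90]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(cv_accuracy(X, y, mask=[True, False]))
print(cv_accuracy(X, y, mask=[True, True]))
```

On this toy data the informative column alone classifies every held-out point correctly, while adding the unscaled noise column degrades accuracy, which is the kind of signal a wrapper search uses to prefer smaller subsets.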


An Empirical Evaluation of Feature Selection Methods

Improving Knowledge Discovery through the Integration of Data Mining Techniques - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-8513-0.ch012 ◽

2015 ◽

pp. 233-258 ◽

Cited By ~ 1

Author(s):

Mohsin Iqbal ◽

Saif Ur Rehman ◽

Saira Gillani ◽

Sohail Asghar

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Classification Accuracy ◽

Information Gain ◽

Learning Algorithm ◽

Empirical Evaluation ◽

Machine Learning Algorithms ◽

Selection Methods ◽

The One ◽

Processing And Storage

The key objective of this chapter is to study classification accuracy when feature selection is used with machine learning algorithms. Feature selection reduces the dimensionality of the data and improves the accuracy of the learning algorithm. We test how integrated feature selection affects the accuracy of three classifiers by applying several feature selection methods. Among the filter methods, Information Gain (IG), Gain Ratio (GR), and Relief-F, and among the wrapper methods, Bagging and Naive Bayes (NB), enabled the classifiers to achieve the largest increase in classification accuracy relative to the average while reducing the number of unnecessary attributes. These conclusions can advise machine learning users on which classifier and feature selection methods to use to optimize classification accuracy. This can be important in risk-sensitive applications of machine learning, and also where one of the aims is to reduce the costs of collecting, processing, and storing unnecessary data.
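Two of the filter scores named in this abstract, Information Gain and Gain Ratio, have standard textbook definitions for discrete attributes, sketched below. The function names and toy data are assumptions for this example, and the chapter's own experimental setup is not reproduced.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """IG = H(labels) - H(labels | feature) for a discrete feature."""
    n = len(labels)
    by_value = {}
    for v, y in zip(feature, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

def gain_ratio(feature, labels):
    """IG normalised by the feature's own entropy (its split information)."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0

# Toy data: a clean binary attribute versus an ID-like attribute.
labels = [1, 1, 0, 0, 1, 0]
binary_attr = [1, 1, 0, 0, 1, 0]
id_attr = ["a", "b", "c", "d", "e", "f"]
for name, col in [("binary", binary_attr), ("id", id_attr)]:
    print(name, information_gain(col, labels), gain_ratio(col, labels))
```

Both attributes reach the same Information Gain here, but Gain Ratio penalises the ID-like attribute for having many distinct values, which is the usual motivation for the normalisation.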


Improving the Intrusion Detection using Discriminative Machine Learning Approach and Improve the Time Complexity by Data Mining Feature Selection Methods

International Journal of Computer Applications ◽

10.5120/13209-0587 ◽

2013 ◽

Vol 76 (1) ◽

pp. 5-11 ◽

Cited By ~ 14

Author(s):

Karan Bajaj ◽

Amit Arora

Keyword(s):

Machine Learning ◽

Data Mining ◽

Feature Selection ◽

Intrusion Detection ◽

Time Complexity ◽

Learning Approach ◽

Selection Methods ◽

Machine Learning Approach


PLncWX: A Machine-Learning Algorithm for Plant lncRNA Identification Based on WOA-XGBoost

Journal of Chemistry ◽

10.1155/2021/6256021 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Fei Guo ◽

Zhixiang Yin ◽

Kai Zhou ◽

Jiasi Li

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Stress Responses ◽

Learning Algorithm ◽

Feature Subset ◽

Selection Methods ◽

Regulate Gene Expression ◽

Model Redundancy ◽

Human And Mouse ◽

Plant Abiotic Stress

Long noncoding RNAs (lncRNAs) are a class of RNAs that are longer than 200 nt and cannot encode proteins. Studies have shown that lncRNAs can regulate gene expression at the epigenetic, transcriptional, and posttranscriptional levels; they are not only closely related to the occurrence, development, and prevention of human diseases but can also regulate plant flowering and participate in plant abiotic stress responses such as drought and salt. Therefore, accurately and efficiently identifying lncRNAs remains an essential task in related research. There are a large number of identification tools based on machine learning and deep learning algorithms, but most use human and mouse gene sequences as training sets, seldom plants, and apply only one feature selection method, or one class of methods, after feature extraction. We developed an identification model covering dicots, monocots, algae, mosses, and ferns. After comparing 20 feature selection methods (seven filter and thirteen wrapper methods), each combined with seven classifiers, and considering the correlation between features and model redundancy at the same time, we found that the WOA-XGBoost-based model performed best, with an accuracy, AUC, and F1 score of 91.55%, 96.78%, and 91.68%, respectively. Meanwhile, the number of elements in the feature subset was reduced to 23, which effectively improved the prediction accuracy and modeling efficiency.
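The three metrics reported in this abstract can be computed as in the sketch below, using their standard binary-classification definitions. The function names, the 0.5 decision threshold, and the toy scores are assumptions for illustration and are unrelated to the paper's data.

```python
def accuracy_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels in {0, 1}, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1

def auc(y_true, scores):
    """AUC as the chance a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores, invented for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 threshold
acc, f1 = accuracy_f1(y_true, y_pred)
print(acc, f1, auc(y_true, scores))
```

Accuracy and F1 depend on the chosen threshold, while AUC depends only on how the scores rank positives against negatives.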


Combination of Feature Selection and CatBoost for Prediction: The First Application to the Estimation of Aboveground Biomass

Forests ◽

10.3390/f12020216 ◽

2021 ◽

Vol 12 (2) ◽

pp. 216

Author(s):

Mi Luo ◽

Yifu Wang ◽

Yunhong Xie ◽

Lai Zhou ◽

Jingjing Qiao ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forests ◽

Aboveground Biomass ◽

Learning Algorithm ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Machine Learning Algorithm ◽

Selection Methods ◽

Estimation Models

Increasing numbers of explanatory variables tend to result in information redundancy and “dimensional disaster” in the quantitative remote sensing of forest aboveground biomass (AGB). Feature selection of model factors is an effective method for improving the accuracy of AGB estimates. Machine learning algorithms are also widely used in AGB estimation, although little research has addressed the use of the categorical boosting algorithm (CatBoost) for AGB estimation. Both feature selection and regression for AGB estimation models are typically performed with the same machine learning algorithm, but there is no evidence to suggest that this is the best method. Therefore, the present study focuses on evaluating the performance of the CatBoost algorithm for AGB estimation and comparing the performance of different combinations of feature selection methods and machine learning algorithms. AGB estimation models of four forest types were developed based on Landsat OLI data using three feature selection methods (recursive feature elimination (RFE), variable selection using random forests (VSURF), and least absolute shrinkage and selection operator (LASSO)) and three machine learning algorithms (random forest regression (RFR), extreme gradient boosting (XGBoost), and categorical boosting (CatBoost)). Feature selection had a significant influence on AGB estimation. RFE preserved the most informative features for AGB estimation and was superior to VSURF and LASSO. In addition, CatBoost improved the accuracy of the AGB estimation models compared with RFR and XGBoost. AGB estimation models using RFE for feature selection and CatBoost as the regression algorithm achieved the highest accuracy, with root mean square errors (RMSEs) of 26.54 Mg/ha for coniferous forest, 24.67 Mg/ha for broad-leaved forest, 22.62 Mg/ha for mixed forests, and 25.77 Mg/ha for all forests. 
The combination of RFE and CatBoost had better performance than the VSURF–RFR combination in which random forests were used for both feature selection and regression, indicating that feature selection and regression performed by a single machine learning algorithm may not always ensure optimal AGB estimation. Extending the application of new machine learning algorithms and feature selection methods is a promising way to improve the accuracy of AGB estimates.
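Recursive feature elimination, the selection method this abstract found most effective, can be sketched as follows. The study's own estimator and data are not reproduced here: as a stand-in, this example ranks features by the absolute coefficients of an ordinary least-squares fit on standardised columns, assumes the columns are non-constant and not collinear, and uses invented function names and toy data.

```python
import math

def standardize(X):
    """Centre each column and scale it to unit standard deviation."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) for j in range(d)]
    return [[(row[j] - means[j]) / sds[j] for j in range(d)] for row in X]

def fit_ols(X, y):
    """Least-squares coefficients via the normal equations and Gauss-Jordan elimination."""
    n, d = len(X), len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(d)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(d):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][d] / A[i][i] for i in range(d)]

def rfe(X, y, n_keep):
    """Repeatedly refit and drop the feature with the smallest |coefficient|."""
    Xs = standardize(X)
    mean_y = sum(y) / len(y)
    yc = [v - mean_y for v in y]
    kept = list(range(len(X[0])))
    while len(kept) > n_keep:
        coefs = fit_ols([[row[j] for j in kept] for row in Xs], yc)
        weakest = min(range(len(kept)), key=lambda i: abs(coefs[i]))
        kept.pop(weakest)
    return kept

# Toy data: y = 3*x0 - 2*x2, while x1 plays no role.
x0 = [1, 2, 3, 4, 5, 6]
x1 = [5, 3, 8, 1, 9, 2]
x2 = [1, 0, 1, 0, 1, 0]
X = [list(row) for row in zip(x0, x1, x2)]
y = [3 * a - 2 * c for a, c in zip(x0, x2)]
print(rfe(X, y, n_keep=2))
```

Refitting after every removal is what distinguishes RFE from a one-shot ranking: the importance of the remaining features is re-estimated once a competitor has been dropped.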


Novel Randomized Feature Selection Algorithms

International Journal of Foundations of Computer Science ◽

10.1142/s0129054115500185 ◽

2015 ◽

Vol 26 (03) ◽

pp. 321-341 ◽

Cited By ~ 3

Author(s):

Subrata Saha ◽

Sanguthevar Rajasekaran ◽

Rampi Ramprasad

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Walk ◽

Current Literature ◽

Randomized Algorithms ◽

Learning Algorithm ◽

Vital Role ◽

Model Construction ◽

Simulation Results ◽

Selection Algorithms

Feature selection is the problem of identifying a subset of the most relevant features in the context of model construction. This problem has been well studied and plays a vital role in machine learning. In this paper we present three randomized algorithms for feature selection. They are generic in nature and can be applied to any learning algorithm. The proposed algorithms can be thought of as a random walk in the space of all possible subsets of the features. We demonstrate the generality of our approaches using three different applications. The simulation results show that our feature selection algorithms outperform some of the best-known algorithms in the current literature.
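The random-walk view of subset search described here can be illustrated with a generic sketch. This is not one of the paper's three algorithms: it is a plain stochastic hill climb in which each step flips one randomly chosen feature and keeps the move if a user-supplied score does not drop. The acceptance rule, step count, seed, and toy score function are all assumptions for this example.

```python
import random

def random_walk_selection(n_features, score, steps=200, seed=0):
    """Flip one randomly chosen feature per step; keep the move if the score does not drop."""
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_score = score(current)
    for _ in range(steps):
        candidate = list(current)
        j = rng.randrange(n_features)
        candidate[j] = not candidate[j]  # move to a neighbouring subset
        candidate_score = score(candidate)
        if candidate_score >= current_score:
            current, current_score = candidate, candidate_score
    return current, current_score

def toy_score(mask):
    """Hypothetical objective: features 0 and 3 help, every feature costs 0.5."""
    return 2 * mask[0] + 2 * mask[3] - 0.5 * sum(mask)

best_mask, best_score = random_walk_selection(5, toy_score)
print(best_mask, best_score)
```

Because `score` is an arbitrary callable, the same walk can wrap any learning algorithm, for example by returning cross-validated accuracy for the masked feature set.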


Machine Learning Based Supervised Feature Selection Algorithm for Data Mining

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.j9483.0881019 ◽

2019 ◽

Vol 8 (10) ◽

pp. 3396-3401 ◽

Cited By ~ 1

Keyword(s):

Machine Learning ◽

Data Mining ◽

Feature Selection ◽

Learning Algorithm ◽

Modern World ◽

Feature Subset ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Minimum Number ◽

Preprocessing Technique

Data scientists focus on high-dimensional data to predict and reveal interesting patterns, as well as the information most useful to the modern world. Feature selection is a preprocessing technique that improves the accuracy and efficiency of mining algorithms. Numerous feature selection algorithms exist, but most fail to give good mining results as the scale increases. This paper considers feature selection for supervised algorithms in data mining and gives an overview of existing machine learning algorithms for supervised feature selection. It introduces an enhanced supervised feature selection algorithm that selects the best feature subset by eliminating irrelevant features using distance correlation and redundant features using symmetric uncertainty. The experimental results show that the proposed algorithm provides better classification accuracy and selects a minimum number of features.
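Distance correlation, which this abstract uses to detect irrelevant features, can be computed for two one-dimensional samples as sketched below, following the standard sample definition based on double-centred distance matrices. The function names and toy data are assumptions; the paper's thresholds and its symmetric-uncertainty step are not shown.

```python
import math

def _centered_distances(xs):
    """Pairwise |xi - xj| matrix, double-centred (row, column and grand means removed)."""
    n = len(xs)
    d = [[abs(a - b) for b in xs] for a in xs]
    row = [sum(r) / n for r in d]  # equals the column means, since d is symmetric
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_correlation(xs, ys):
    """Sample distance correlation of two 1-D numeric sequences, in [0, 1]."""
    n = len(xs)
    A, B = _centered_distances(xs), _centered_distances(ys)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(a * a for r in A for a in r) / n ** 2
    dvar_y = sum(b * b for r in B for b in r) / n ** 2
    if dvar_x == 0 or dvar_y == 0:
        return 0.0  # a constant sequence has no dependence to measure
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))

# Toy data: a linear and a purely nonlinear relationship.
x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]
squared = [v * v for v in x]
print(distance_correlation(x, linear))
print(distance_correlation(x, squared))
```

Unlike Pearson correlation, which is zero for the squared relationship on this symmetric sample, distance correlation stays clearly above zero, which is why it is attractive for flagging features that relate to the target nonlinearly.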
