Cooperative co-evolution for feature selection in Big Data with random feature grouping

2020 ◽  
Vol 7 (1) ◽  
Author(s):  
A. N. M. Bazlur Rashid ◽  
Mohiuddin Ahmed ◽  
Leslie F. Sikos ◽  
Paul Haskell-Dowland

Abstract
A massive amount of data is generated with the evolution of modern technologies. This high-throughput data generation results in Big Data, which consist of many features (attributes). However, irrelevant features may degrade the classification performance of machine learning (ML) algorithms. Feature selection (FS) is a technique used to select a subset of relevant features that represent the dataset. Evolutionary algorithms (EAs) are widely used search strategies in this domain. A variant of EAs, called cooperative co-evolution (CC), which uses a divide-and-conquer approach, is a good choice for optimization problems. Existing solutions perform poorly because of limitations such as not considering feature interactions, handling only an even number of features, and decomposing the dataset statically. In this paper, a novel random feature grouping (RFG) is introduced, with three variants, to decompose Big Data datasets dynamically and to increase the probability of grouping interacting features into the same subcomponent. RFG can be used in CC-based FS processes, hence called Cooperative Co-Evolutionary-Based Feature Selection with Random Feature Grouping (CCFSRFG). Experimental analysis was performed using six widely used ML classifiers on seven datasets from the UCI ML repository and the Princeton University Genomics repository, with and without FS. The experimental results indicate that in most cases [i.e., with naïve Bayes (NB), support vector machine (SVM), k-nearest neighbor (k-NN), J48, and random forest (RF)] the proposed CCFSRFG-1 outperforms an existing CC-based FS solution (CCEAFS), CCFSRFG-2, and classification using all features in terms of accuracy, sensitivity, and specificity.
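To make the decomposition step concrete, the following is a minimal Python sketch of the random feature grouping idea, assuming a permutation is drawn before splitting (so interacting features can land in the same group) and a fresh call per co-evolutionary cycle makes the decomposition dynamic; the function and parameter names are illustrative, not the paper's code.

```python
import numpy as np

def random_feature_grouping(n_features, n_groups, seed=None):
    """Randomly partition feature indices into subcomponents (a sketch).

    Shuffling before splitting gives interacting features a chance to
    land in the same group; re-calling this every co-evolutionary cycle
    makes the decomposition dynamic. np.array_split tolerates uneven
    group sizes, so an even number of features is not required.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_features)      # fresh random order each call
    return np.array_split(indices, n_groups)   # groups may differ in size

# Example: 13 features split into 4 subcomponents for one CC cycle
for group in random_feature_grouping(13, 4, seed=42):
    print(sorted(group.tolist()))
```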

2019 ◽  
Vol 17 (4) ◽  
pp. 340-359 ◽  
Author(s):  
A N M Bazlur Rashid ◽  
Tonmoy Choudhury

The term "big data" characterizes the massive amounts of data generated by advanced technologies in different domains, using the 4Vs (volume, velocity, variety, and veracity) to indicate the amount of data that can only be processed via computationally intensive analysis, the speed of their creation, the different types of data, and their accuracy. High-dimensional financial data, such as time-series and space-time data, contain a large number of features (variables) while having a small number of samples, and are used to measure various real-time business situations for financial organizations. Such datasets are normally noisy, complex correlations may exist between their features, and many domains, including finance, lack the analytic tools needed to mine the data for knowledge discovery because of this high dimensionality. Feature selection is an optimization problem of finding a minimal subset of relevant features that maximizes the classification accuracy and reduces the computational cost. Traditional statistics-based feature selection approaches are not adequate to deal with the curse of dimensionality associated with big data. Cooperative co-evolution, a meta-heuristic algorithm following a divide-and-conquer approach, decomposes high-dimensional problems into smaller sub-problems. Further, MapReduce, a programming model, offers a ready-to-use distributed, scalable, and fault-tolerant infrastructure for parallelizing the developed algorithms. This article presents a knowledge management overview of evolutionary feature selection approaches, state-of-the-art cooperative co-evolution and MapReduce-based feature selection techniques, and future research directions.
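As an illustration of the divide-and-conquer loop described above, the Python sketch below shows a bare-bones cooperative co-evolution round for feature selection, where each subcomponent's slice of a binary mask is searched while the remaining positions stay frozen at the current best context; this is a generic CC skeleton under stated assumptions, not any specific published algorithm.

```python
import numpy as np

def cc_feature_selection(fitness, n_features, n_groups,
                         cycles=10, pop=20, seed=0):
    """Bare-bones cooperative co-evolution for a binary feature mask.

    `fitness` scores a full 0/1 mask (e.g., wrapper accuracy). Each
    subcomponent is optimized while the other positions are fixed at
    the best-so-far context vector. Purely illustrative.
    """
    rng = np.random.default_rng(seed)
    groups = np.array_split(np.arange(n_features), n_groups)
    context = rng.integers(0, 2, n_features)         # best full mask so far
    for _ in range(cycles):
        for g in groups:
            cands = np.tile(context, (pop, 1))       # copies of the context
            cands[:, g] = rng.integers(0, 2, (pop, len(g)))
            scores = [fitness(c) for c in cands]
            context = cands[int(np.argmax(scores))]  # keep best collaborator
    return context
```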


2021 ◽  
Vol 12 (2) ◽  
pp. 85-99
Author(s):  
Nassima Dif ◽  
Zakaria Elberrichi

Hybrid metaheuristics have received a lot of attention lately for solving combinatorial optimization problems. The purpose of hybridization is to create cooperation between metaheuristics for better solutions. Most proposed works have been interested in static hybridization. The objective of this work is to propose a novel dynamic hybridization method (GPBD) that generates the most suitable sequential hybridization of the GA, PSO, BAT, and DE metaheuristics for each problem. The authors test this approach on the feature selection problem in a wrapper fashion, performed on face image recognition datasets, with the k-nearest neighbor (KNN) learning algorithm. A comparative study of the metaheuristics and their hybridization GPBD shows that the proposed approach achieved the best results and is competitive with filter approaches proposed in the literature. It achieved a perfect accuracy score of 100% on the Orl10P, Pix10P, and PIE10P datasets.
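The wrapper evaluation at the heart of such an approach can be sketched in Python with scikit-learn: a candidate feature subset is scored by the cross-validated accuracy of a k-NN classifier restricted to those features. The function name and CV settings are assumptions for illustration; the GA/PSO/BAT/DE hybridization logic itself is omitted.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_fitness(mask, X, y, k=1, cv=5):
    """Score a binary feature mask by k-NN cross-validated accuracy.

    This is the wrapper evaluation any of the four metaheuristics
    would call for each candidate subset; illustrative settings only.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():                 # empty subsets are invalid
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(clf, X[:, mask], y, cv=cv).mean()
```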


2021 ◽  
Vol 12 (2) ◽  
pp. 1-15
Author(s):  
Khadoudja Ghanem ◽  
Abdesslem Layeb

The backtracking search optimization algorithm is a recent stochastic global search algorithm for solving real-valued numerical optimization problems. In this paper, a binary version of the backtracking algorithm is proposed to deal with 0-1 optimization problems such as feature selection and knapsack problems. Feature selection is the process of selecting a subset of relevant features for use in model construction; irrelevant features can negatively impact model performance. The knapsack problem, in turn, is a well-known optimization problem used to assess discrete algorithms. The objective of this research is to evaluate the discrete version of the backtracking algorithm on the two problems and to compare the obtained results with other binary optimization algorithms using four common classifiers: logistic regression, decision tree, random forest, and support vector machine. An empirical study on biological microarray data and experiments on 0-1 knapsack problems show the effectiveness of the binary algorithm and its ability to achieve good-quality solutions for both problems.
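One common way to derive a binary variant of a real-valued algorithm such as backtracking search is an S-shaped transfer function, sketched below in Python together with a penalized 0-1 knapsack objective; the paper's exact binarization scheme is not specified here, so treat this as an assumed, generic construction.

```python
import numpy as np

def binarize(real_vec, seed=None):
    """Map a real-valued trial vector to 0/1 via a sigmoid transfer
    function - a generic binarization, not necessarily the paper's."""
    rng = np.random.default_rng(seed)
    prob = 1.0 / (1.0 + np.exp(-np.asarray(real_vec)))
    return (rng.random(prob.shape) < prob).astype(int)

def knapsack_value(mask, values, weights, capacity):
    """0-1 knapsack objective with a death penalty for overweight packs."""
    total_weight = np.dot(mask, weights)
    return float(np.dot(mask, values)) if total_weight <= capacity else 0.0
```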


2020 ◽  
Author(s):  
Hoda Heidari ◽  
Zahra Einalou ◽  
Mehrdad Dadgostar ◽  
Hamidreza Hosseinzadeh

Abstract
Brain-computer interface (BCI) studies based on electroencephalography have a wide range of applications, and extracting the steady-state visual evoked potential (SSVEP) is regarded as one of the most useful tools in BCI systems. In this study, the whole signal-processing stream was compared across alternatives at each stage: feature extraction using different spectral methods (a bank of filters, narrow-band IIR filters, and wavelet transform magnitude) with statistical descriptors (Shannon entropy, skewness, kurtosis, mean, variance); feature selection performed by various methods (decision tree, principal component analysis (PCA), t-test, Wilcoxon, receiver operating characteristic (ROC)); and classification using k-nearest neighbor (k-NN), perceptron, support vector machines (SVM), Bayesian, and multilayer perceptron (MLP) classifiers. By combining these methods, the study provides an overview of the accuracy achievable with classical methods. In addition, the present study relies on a rather new feature selection approach based on decision trees and PCA for BCI-SSVEP systems. Finally, the obtained accuracies were calculated based on the four recorded frequencies representing four directions: right, left, up, and down.
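As a concrete example of one branch of the compared pipeline, the Python sketch below band-pass filters an EEG channel around a single SSVEP stimulation frequency with a narrow-band IIR (Butterworth) filter and computes the listed statistical features; the filter order, bandwidth, and histogram binning for the entropy are assumptions, and the input is expected to be a 1-D signal long enough for zero-phase filtering.

```python
import numpy as np
from scipy import signal
from scipy.stats import skew, kurtosis, entropy

def narrowband_features(eeg, fs, f0, bw=0.5):
    """Statistical features of one EEG channel around SSVEP frequency f0,
    extracted after a narrow-band IIR (Butterworth) band-pass filter.
    A sketch of one pipeline branch; parameter choices are illustrative.
    """
    nyq = fs / 2.0
    b, a = signal.butter(4, [(f0 - bw) / nyq, (f0 + bw) / nyq], "bandpass")
    x = signal.filtfilt(b, a, eeg)               # zero-phase filtering
    hist, _ = np.histogram(x, bins=32, density=True)
    return {
        "shannon_entropy": entropy(hist + 1e-12),  # entropy of amplitude histogram
        "skewness": skew(x),
        "kurtosis": kurtosis(x),
        "mean": x.mean(),
        "variance": x.var(),
    }
```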


Sensors ◽  
2020 ◽  
Vol 20 (5) ◽  
pp. 1447
Author(s):  
Pan Huang ◽  
Yanping Li ◽  
Xiaoyi Lv ◽  
Wen Chen ◽  
Shuxian Liu

Action recognition algorithms are widely used in the fields of medical health and pedestrian dead reckoning (PDR). The classification and recognition of non-normal and normal walking actions are very important for improving the accuracy of medical health indicators and PDR step counts. Existing motion recognition algorithms focus on recognizing normal walking actions, while the recognition of the non-normal walking actions common in daily life is incomplete or inaccurate, resulting in low overall recognition accuracy. This paper proposes a microelectromechanical system (MEMS) action recognition method based on Relief-F feature selection and a relief-bagging-support vector machine (SVM). Feature selection using the Relief-F algorithm reduces the feature dimensionality by 16 and reduces the optimization time by an average of 9.55 s. Experiments show that the improved algorithm identifies non-normal walking actions with an accuracy of 96.63%, an increase of 11.63% over decision tree (DT), 26.62% over k-nearest neighbor (KNN), and 11.63% over random forest (RF). The average area under the curve (AUC) of the improved algorithm improved by 0.1143 compared to KNN, by 0.0235 compared to DT, and by 0.04 compared to RF.
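A minimal version of the Relief-F weighting that drives the feature selection step might look like the following Python sketch; the neighbor counts, sampling, and normalization are illustrative choices, and the bagging-SVM stage that the paper pairs with it is omitted.

```python
import numpy as np

def relieff_weights(X, y, n_neighbors=10, n_samples=100, seed=0):
    """Minimal Relief-F feature weighting (a sketch).

    Weights rise for features that separate a sample from its nearest
    misses (other class) and fall for features that separate it from
    its nearest hits (same class).
    """
    rng = np.random.default_rng(seed)
    X = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)  # scale to [0, 1]
    w = np.zeros(X.shape[1])
    picks = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    for i in picks:
        diff = np.abs(X - X[i])          # per-feature distances to all rows
        dist = diff.sum(axis=1)
        dist[i] = np.inf                 # exclude the sample itself
        hits = np.argsort(np.where(y == y[i], dist, np.inf))[:n_neighbors]
        misses = np.argsort(np.where(y != y[i], dist, np.inf))[:n_neighbors]
        w += diff[misses].mean(axis=0) - diff[hits].mean(axis=0)
    return w / len(picks)

# rank features and drop the 16 weakest, mirroring the dimension cut above:
# keep = np.argsort(relieff_weights(X, y))[::-1][: X.shape[1] - 16]
```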


2018 ◽  
Vol 30 (06) ◽  
pp. 1850044 ◽  
Author(s):  
Elias Ebrahimzadeh ◽  
Farahnaz Fayaz ◽  
Mehran Nikravan ◽  
Fereshteh Ahmadi ◽  
Mohammadjavad Rahimi Dolatabad

Herniation in the lumbar area is one of the most common diseases resulting in lower back pain (LBP), causing discomfort and inconvenience in patients' daily lives. A computer-aided diagnosis (CAD) system can be of immense benefit, as it generates diagnostic results within a short time while increasing the precision of diagnosis and eliminating human errors. We have proposed a new method for automatic diagnosis of lumbar disc herniation based on clinical MRI data, using T2-W sagittal and myelograph images. The presented method has been applied to 30 clinical cases, each containing 7 discs (210 lumbar discs in total), for herniation diagnosis. We employ the Otsu thresholding method to extract the spinal cord from MR images of the lumbar discs. A third-order polynomial is then fitted to the extracted spinal cords, and by the end of the preprocessing stage, all the T2-W sagittal images have been prepared for specifying disc boundaries and labeling. Having extracted an ROI for each disc, we use intensity and shape features for classification. The extracted features are selected by Local Subset Feature Selection. The results demonstrated 91.90%, 92.38%, and 95.23% accuracy for the artificial neural network, k-nearest neighbor, and support vector machine (SVM) classifiers, respectively, indicating the superiority of the proposed method over those reported in similar studies.
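The preprocessing described above (Otsu segmentation followed by a third-order polynomial fit through the cord) can be sketched in Python with scikit-image and NumPy as follows; fitting the curve through per-row centroids of the mask is an assumed detail for illustration.

```python
import numpy as np
from skimage.filters import threshold_otsu

def spinal_cord_mask_and_curve(t2_sagittal):
    """Segment the bright spinal canal in a T2-W sagittal slice with Otsu
    thresholding, then fit a third-order polynomial through the mask's
    per-row centroids - a sketch of the preprocessing stage above.
    """
    t = threshold_otsu(t2_sagittal)
    mask = t2_sagittal > t                       # bright CSF/cord region
    rows, cols = np.nonzero(mask)
    ys = np.unique(rows)                         # image rows covered by the mask
    xs = np.array([cols[rows == r].mean() for r in ys])  # centroid per row
    coeffs = np.polyfit(ys, xs, deg=3)           # third-order curve
    return mask, coeffs
```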


2019 ◽  
Vol 10 (2) ◽  
pp. 1-20 ◽  
Author(s):  
Sujata Dash ◽  
Ruppa Thulasiram ◽  
Parimala Thulasiraman

Conventional algorithms such as gradient-based optimization methods usually struggle to deal with high-dimensional non-linear problems and often end up in local minima. Recently developed nature-inspired optimization algorithms are among the best approaches for finding global solutions to combinatorial optimization problems such as those posed by microarray datasets. In this article, a novel hybrid swarm-intelligence-based meta-search algorithm is proposed by combining a heuristic method, conditional mutual information maximization, with a chaos-based firefly algorithm. The combined algorithm runs iteratively to boost the sharing of information between fireflies, enhancing the search efficiency of the chaos-based firefly algorithm and reducing the computational complexity of feature selection. The meta-search model is implemented in a wrapper approach using a well-established classifier, the support vector machine, as the modeler. The chaos-based firefly algorithm increases the global search mobility of the fireflies. The efficiency of the model is studied on high-dimensional disease datasets and compared with the standard firefly algorithm, particle swarm optimization, and a genetic algorithm in the same experimental environment to establish its superiority in feature selection over these counterparts.
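To show what "chaos-based" typically means here, the Python sketch below performs one firefly move toward a brighter firefly with the usual uniform random term replaced by a logistic chaotic map; the constants are illustrative, and the CMIM pre-filtering and full wrapper loop are omitted.

```python
import numpy as np

def chaotic_firefly_step(x_i, x_j, chaos, beta0=1.0, gamma=1.0, alpha=0.2):
    """Move firefly x_i toward brighter firefly x_j, with the random
    perturbation driven by a logistic chaotic map - the core idea of a
    chaos-based firefly algorithm. Constants are illustrative.
    """
    r2 = np.sum((x_i - x_j) ** 2)
    beta = beta0 * np.exp(-gamma * r2)      # attractiveness decays with distance
    chaos = 4.0 * chaos * (1.0 - chaos)     # logistic map, fully chaotic at r=4
    step = alpha * (chaos - 0.5)            # chaotic perturbation in [-alpha/2, alpha/2]
    return x_i + beta * (x_j - x_i) + step, chaos
```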


2014 ◽  
Vol 701-702 ◽  
pp. 110-113
Author(s):  
Qi Rui Zhang ◽  
He Xian Wang ◽  
Jiang Wei Qin

This paper reports a comparative study of feature selection algorithms on a hyperlipidemia data set. Three methods of feature selection were evaluated: document frequency (DF), information gain (IG), and the χ² statistic (CHI). The classification systems use a vector to represent a document and use tfidfie (term frequency, inverse document frequency, and inverse entropy) to compute term weights. In order to compare the effectiveness of feature selection, we used three classification methods: naïve Bayes (NB), k-nearest neighbor (kNN), and support vector machines (SVM). The experimental results show that IG and CHI significantly outperform DF, and that SVM and NB are more effective than kNN when the macro-averaged F1 measure is used. DF is suitable for large-scale text classification tasks.
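For readers who want to reproduce the comparison loosely, the scikit-learn sketch below scores terms with the χ² statistic and, as an IG stand-in, mutual information over tf-idf weights; plain tf-idf replaces the paper's tfidfie weighting and the toy corpus is invented, so this mirrors the setup only approximately.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# invented toy corpus; labels mark two document classes
docs = ["low fat diet", "high cholesterol level",
        "fat and cholesterol", "normal level"]
labels = [0, 1, 1, 0]

X = TfidfVectorizer().fit_transform(docs)   # tf-idf term weights (non-negative)
X_chi = SelectKBest(chi2, k=3).fit_transform(X, labels)              # CHI
X_ig = SelectKBest(mutual_info_classif, k=3).fit_transform(X, labels)  # IG stand-in
print(X_chi.shape, X_ig.shape)
```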

