Enhancing Big Data Feature Selection Using a Hybrid Correlation-Based Feature Selection

Electronics ◽  
2021 ◽  
Vol 10 (23) ◽  
pp. 2984
Author(s):  
Masurah Mohamad ◽  
Ali Selamat ◽  
Ondrej Krejcar ◽  
Ruben Gonzalez Crespo ◽  
Enrique Herrera-Viedma ◽  
...  

This study proposes an alternative data extraction method that combines three well-known feature selection methods for handling large and problematic datasets: correlation-based feature selection (CFS), best first search (BFS), and the dominance-based rough set approach (DRSA). The aim is to enhance the classifier’s performance in decision analysis by eliminating uncorrelated and inconsistent data values. The proposed method, named CFS-DRSA, comprises several phases executed in sequence, with the main phases incorporating two crucial feature extraction tasks. The first is data reduction, which implements CFS with a BFS algorithm; the second is data selection, which applies DRSA to generate the optimized dataset. The study thereby aims to reduce computational time complexity and increase classification accuracy. Several datasets with various characteristics and volumes were used in the experiments to evaluate the proposed method’s credibility. Its performance was validated using standard evaluation measures and benchmarked against other established methods such as deep learning (DL). Overall, the proposed method proved that it could help the classifier return a significant result, with an accuracy rate of 82.1% for the neural network (NN) classifier, compared with 66.5% for the support vector machine (SVM) and 49.96% for DL. A one-way analysis of variance (ANOVA) indicates that the proposed method is an alternative extraction tool for those who find big data analysis tools prohibitively expensive and for those who are new to the data analysis field.
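As an illustration of the first phase, the CFS merit function and a subset search can be sketched in a few lines. This is a hedged, simplified stand-in (greedy forward search in place of full best-first search, and Pearson correlation as the correlation measure); the function names are illustrative, not from the paper.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit: k*r_cf / sqrt(k + k*(k-1)*r_ff), where r_cf is the mean
    feature-class correlation and r_ff the mean feature-feature correlation."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_search(X, y):
    """Greedy forward search (a simplified stand-in for best-first search):
    add the feature that most improves the merit until no addition helps."""
    remaining, subset, best = list(range(X.shape[1])), [], 0.0
    while remaining:
        merit, j = max((cfs_merit(X, y, subset + [j]), j) for j in remaining)
        if merit <= best:
            break
        best, subset = merit, subset + [j]
        remaining.remove(j)
    return subset, best
```

The merit function rewards subsets whose features correlate strongly with the class while being weakly correlated with each other, which is how CFS discards redundant features before DRSA refines the result.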

2018 ◽  
Vol 1 (2) ◽  
pp. 109-117
Author(s):  
Muhammad Imron Rosadi ◽  
Cahya Bagus Sanjaya ◽  
Lukman Hakim

Diabetic retinopathy (DR) is a common complication of diabetes mellitus that damages parts of the retina. High blood glucose levels cause small capillaries to rupture and can lead to blindness. Symptoms exhibited by DR sufferers include microaneurysms, hemorrhages, soft and hard exudates, and neovascularization; at certain intensities, these symptoms indicate the stage (severity level) of the disease. The pattern-recognition process comprises four stages: preprocessing, feature extraction, feature selection, and classification. Preprocessing converts the RGB image to the green channel, applies adaptive histogram equalization, removes blood vessels and the optic disk, and detects exudates. The preprocessed results are collected into a feature vector using first- and second-order GLCM feature extraction, which then serves as input to a Support Vector Machine (SVM). Three issues arise with SVM: how to select a kernel function, what the optimal number of input features is, and how to determine the best kernel parameters. These issues matter because the number of features affects the required kernel parameter values and vice versa, so feature selection is needed when building the classification system. This research presents GLCM feature extraction, feature selection, and SVM for detecting diabetic retinopathy. The feature selection process uses the F-score to filter the extracted features, and the selected features are used as classification input. The dataset comprises 50 samples divided into two classes: 25 normal retinal scans and 25 scans with diabetic retinopathy.
SVM classification with feature selection improves accuracy and computational time over classification without feature selection, achieving 90% accuracy with a computational time of 0.010 seconds.
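The F-score filter used above has a simple closed form for binary classes. A minimal sketch, assuming the common Chen-and-Lin formulation of the F-score (the abstract does not spell out its exact variant):

```python
import numpy as np

def f_score(X, y):
    """Per-feature F-score for a binary-labelled feature matrix.
    Higher scores indicate better between-class separation for that feature."""
    pos, neg = X[y == 1], X[y == 0]
    mean, mp, mn = X.mean(0), pos.mean(0), neg.mean(0)
    num = (mp - mean) ** 2 + (mn - mean) ** 2       # between-class spread
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)   # within-class spread
    return num / den
```

Ranking features by this score and keeping the top-scoring ones is what reduces the GLCM feature vector before it reaches the SVM.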


2014 ◽  
Vol 1 (2) ◽  
pp. 293-314 ◽  
Author(s):  
Jianqing Fan ◽  
Fang Han ◽  
Han Liu

Abstract Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper gives an overview of the salient features of Big Data and how these features drive a paradigm change in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. They can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
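The spurious-correlation phenomenon the authors describe is easy to demonstrate: among many mutually independent features, the maximum sample correlation with the response grows with dimensionality even though every true correlation is zero. A small illustrative simulation (not from the paper):

```python
import numpy as np

def max_spurious_corr(n, p, seed=0):
    """Max absolute sample correlation between a response y and p features,
    all drawn independently, so every population correlation is exactly 0."""
    rng = np.random.default_rng(seed)
    y = rng.normal(size=n)
    X = rng.normal(size=(n, p))
    yc = (y - y.mean()) / y.std()
    Xc = (X - X.mean(0)) / X.std(0)
    return float(np.max(np.abs(Xc.T @ yc) / n))
```

With a fixed sample size, sweeping `p` from a handful of features to thousands shows the maximum sample correlation climbing well away from zero, which is exactly why naive marginal screening in high dimensions picks up noise variables.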


Entropy ◽  
2020 ◽  
Vol 22 (9) ◽  
pp. 989
Author(s):  
Rui Ying Goh ◽  
Lai Soon Lee ◽  
Hsin-Vonn Seow ◽  
Kathiresan Gopal

Credit scoring is an important tool used by financial institutions to correctly identify defaulters and non-defaulters. Support Vector Machines (SVM) and Random Forest (RF) are Artificial Intelligence techniques that have been attracting interest due to their flexibility in accounting for various data patterns. Both are black-box models that are sensitive to hyperparameter settings. Feature selection can be performed on SVM to enable explanation with the reduced features, whereas feature importance computed by RF can be used for model explanation. The benefits of accuracy and interpretation allow for significant improvement in the area of credit risk and credit scoring. This paper proposes the use of Harmony Search (HS) to form a hybrid HS-SVM that performs feature selection and hyperparameter tuning simultaneously, and a hybrid HS-RF that tunes the hyperparameters. A Modified HS (MHS) is also proposed with the main objective of achieving results comparable to the standard HS in a shorter computational time. MHS makes four main modifications to the standard HS: (i) elitism selection during memory consideration instead of random selection, (ii) dynamic exploration and exploitation operators in place of the original static operators, (iii) a self-adjusted bandwidth operator, and (iv) inclusion of additional termination criteria to reach faster convergence. Along with parallel computing, MHS effectively reduces the computational time of the proposed hybrid models. The proposed hybrid models are compared with standard statistical models across three different datasets commonly used in credit scoring studies. The computational results show that MHS-RF is the most robust in terms of model performance, model explainability, and computational time.
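For reference, the standard HS loop that MHS modifies can be sketched as follows. This is the plain algorithm with illustrative parameter names, not the authors' MHS (which adds elitism, dynamic operators, a self-adjusted bandwidth, and extra termination criteria):

```python
import numpy as np

def harmony_search(f, bounds, hms=20, hmcr=0.9, par=0.3, bw=0.05,
                   iters=2000, seed=0):
    """Plain Harmony Search minimising f over box bounds.
    hms: harmony memory size; hmcr: memory consideration rate;
    par: pitch adjustment rate; bw: bandwidth as a fraction of the range."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    hm = rng.uniform(lo, hi, size=(hms, len(lo)))       # harmony memory
    fit = np.array([f(x) for x in hm])
    for _ in range(iters):
        new = np.empty(len(lo))
        for d in range(len(lo)):
            if rng.random() < hmcr:                     # memory consideration
                new[d] = hm[rng.integers(hms), d]
                if rng.random() < par:                  # pitch adjustment
                    new[d] += rng.uniform(-bw, bw) * (hi[d] - lo[d])
            else:                                       # random improvisation
                new[d] = rng.uniform(lo[d], hi[d])
        new = np.clip(new, lo, hi)
        worst = int(np.argmax(fit))
        fn = f(new)
        if fn < fit[worst]:                             # replace worst harmony
            hm[worst], fit[worst] = new, fn
    best = int(np.argmin(fit))
    return hm[best], float(fit[best])
```

In the hybrid models, `f` would be a cross-validated loss over SVM or RF hyperparameters (and, for HS-SVM, a feature-mask encoding); here it can be any box-constrained objective.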


2020 ◽  
Vol 14 (3) ◽  
pp. 269-279
Author(s):  
Hayet Djellali ◽  
Nacira Ghoualmi-Zine ◽  
Souad Guessoum

This paper investigates feature selection methods based on a hybrid architecture using an algorithm called Adapted Fast Correlation-Based Feature selection with Support Vector Machine Recursive Feature Elimination (AFCBF-SVMRFE). AFCBF-SVMRFE has three stages and combines the SVMRFE embedded method with correlation-based feature selection: the first stage is relevance analysis, the second is redundancy analysis, and the third is performance evaluation and feature restoration. Experiments show that the proposed method, tested with different classifiers (Support Vector Machine (SVM) and K-nearest neighbors (KNN)), provides the best accuracy on various datasets. The SVM classifier outperforms the KNN classifier on these data. AFCBF-SVMRFE outperforms the FCBF multivariate filter, SVMRFE, Particle Swarm Optimization (PSO), and Artificial Bee Colony (ABC).
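The SVMRFE component works by repeatedly training a linear model and discarding the feature with the smallest absolute weight. A hedged sketch, substituting a ridge least-squares linear model for the linear SVM so the example stays self-contained:

```python
import numpy as np

def rfe_rank(X, y, ridge=1e-3):
    """Recursive feature elimination with a ridge least-squares linear model
    standing in for the linear SVM: repeatedly drop the active feature with
    the smallest |weight|. Returns feature indices ordered best-first
    (last eliminated comes first)."""
    X = (X - X.mean(0)) / X.std(0)          # standardise so weights compare
    active = list(range(X.shape[1]))
    eliminated = []
    while active:
        A = X[:, active]
        w = np.linalg.solve(A.T @ A + ridge * np.eye(len(active)), A.T @ y)
        drop = int(np.argmin(np.abs(w)))    # weakest feature this round
        eliminated.append(active.pop(drop))
    return eliminated[::-1]
```

The real SVMRFE retrains an SVM at each round and uses its weight vector (or weight-squared criterion); the recursion structure is the same.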


2019 ◽  
Vol 10 (1) ◽  
pp. 47-54
Author(s):  
Abdullah Jafari Chashmi ◽  
Mehdi Chehel Amirani

Abstract Early recognition of heart diseases by exploiting computer-aided diagnosis (CAD) machines decreases the high fatality rate among cardiac patients. Recognizing heart abnormalities is a challenging task because subtle changes in ECG signals may not be discernible by eye. In this paper, an efficient approach for ECG arrhythmia diagnosis is proposed based on a combination of discrete wavelet transform and higher-order statistics for feature extraction with entropy-based feature selection. Using a neural network and a support vector machine, five classes of heartbeat categories are classified; the proposed system achieves high accuracy of 99.83% and 99.03%, respectively. The advantage of the presented procedure over other recently presented methods has been experimentally demonstrated in terms of accuracy.
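Two of the building blocks named here, the discrete wavelet transform and entropy-based scoring, can be sketched compactly. The sketch assumes a one-level Haar transform and Shannon entropy over a coefficient histogram; the paper's exact wavelet family and entropy measure are not specified in this abstract.

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar discrete wavelet transform.
    Returns (approximation, detail) coefficients for an even-length signal."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def shannon_entropy(coeffs, bins=16):
    """Shannon entropy (bits) of a coefficient histogram; low-entropy
    sub-bands carry little information and can be dropped in selection."""
    counts, _ = np.histogram(coeffs, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

In a pipeline like the one described, each heartbeat segment would be decomposed over several DWT levels, higher-order statistics computed per sub-band, and the entropy score used to keep only informative features before the NN/SVM classifiers.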

