New feature selection based on kernel

2020 ◽  
Vol 9 (4) ◽  
pp. 1569-1577
Author(s):  
Zuherman Rustam ◽  
Sri Hartini

Feature selection is an essential issue in machine learning: it discards unnecessary or redundant features in a dataset. This paper introduced a new feature selection method based on kernel functions, using 16 real-world datasets from the UCI data repository; k-means clustering was utilized as the classifier with the radial basis function (RBF) and polynomial kernel functions. After sorting the features with the new feature selection method, 75 percent of them were examined and evaluated using 10-fold cross-validation, and accuracy, F1-score, and running time were compared. From the experiments, it was concluded that the performance of the new feature selection based on the RBF kernel varied with the value of the kernel parameter, unlike the polynomial kernel function. Moreover, the new feature selection based on the RBF kernel has a faster running time than the polynomial kernel. Besides, the proposed method achieves higher accuracy and F1-score, with up to a 40 percent difference on several datasets, compared to commonly used feature selection techniques such as the Fisher score, the Chi-square test, and the Laplacian score. Therefore, this method can be considered for feature selection.
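The abstract does not give the exact kernel-based score, so the following is only a minimal sketch of one plausible formulation: each feature is scored by how much more similar same-class samples are than different-class samples under a one-dimensional RBF kernel. The function name and scoring rule are illustrative assumptions, not the authors' method.

```python
import numpy as np

def rbf_feature_scores(X, y, gamma=1.0):
    """Hypothetical per-feature score: mean within-class RBF similarity
    minus mean between-class RBF similarity (not the paper's exact rule)."""
    n_samples, n_features = X.shape
    same = y[:, None] == y[None, :]          # boolean same-class mask
    diff = ~same
    np.fill_diagonal(same, False)            # ignore self-similarity
    scores = np.empty(n_features)
    for j in range(n_features):
        d = X[:, j][:, None] - X[:, j][None, :]
        K = np.exp(-gamma * d ** 2)          # 1-D RBF kernel matrix
        scores[j] = K[same].mean() - K[diff].mean()
    return scores

# Toy data: feature 0 separates the two classes, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.r_[rng.normal(0, 0.1, 20), rng.normal(5, 0.1, 20)],
                     rng.normal(0, 1, 40)])
y = np.r_[np.zeros(20), np.ones(20)].astype(int)
scores = rbf_feature_scores(X, y, gamma=1.0)
print(scores.argmax())  # feature 0 ranks highest
```

Sorting features by such a score and keeping the top 75 percent mirrors the evaluation protocol described in the abstract.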

Author(s):  
FENGXI SONG ◽  
DAVID ZHANG ◽  
YONG XU ◽  
JIZHONG WANG

Feature selection has been extensively applied in statistical pattern recognition as a mechanism for cleaning up the set of features used to represent data and as a way of improving classifier performance. Four schemes commonly used for feature selection are Exponential Searches, Stochastic Searches, Sequential Searches, and Best Individual Features. The most popular scheme in text categorization is Best Individual Features, as the extremely high dimensionality of text feature spaces renders the other three schemes time-prohibitive. This paper proposes five new metrics for selecting Best Individual Features for use in text categorization. Their effectiveness has been empirically tested on two well-known data collections, Reuters-21578 and 20 Newsgroups. Experimental results show that the performance of two of the five new metrics, Bayesian Rule and F-one Value, is not significantly below that of a good traditional text categorization selection metric, Document Frequency. The performance of another two of these five new metrics, Low Loss Dimensionality Reduction and Relative Frequency Difference, is equal to or better than that of conventional good feature selection metrics such as Mutual Information and the Chi-square Statistic.


Author(s):  
Esraa H. Abd Al-Ameer, Ahmed H. Aliwy

Document classification is one of the most important fields in natural language processing and text mining, and many algorithms can be used for this task. This paper focuses on improving text classification through feature selection, that is, keeping only some of the original features without affecting the accuracy of the work. A new feature selection method is suggested that can be seen as a general formulation and mathematical model of Recursive Feature Elimination (RFE). The method was compared with two other well-known feature selection methods: Chi-square and threshold. The results proved that the new method is comparable with the other methods: the best results were 83% when 60% of the features were used, 82% with 40% of the features, and 82% with 20% of the features. The tests were done with the Naïve Bayes (NB) and decision tree (DT) classification algorithms, where the used dataset is the well-known English dataset "20 Newsgroups", consisting of approximately 18,846 files. The results showed that the suggested feature selection method is comparable with standard methods like Chi-square.
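The paper's generalized formulation is not reproduced in the abstract; as a baseline sketch, scikit-learn's stock RFE with a decision tree (one of the two classifiers used) shows the recursive idea the method generalizes: train, eliminate the weakest features, and repeat until the target fraction remains. The synthetic data stands in for the real corpus.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
# Recursively drop one feature per round until 40% of the features remain.
rfe = RFE(DecisionTreeClassifier(random_state=0),
          n_features_to_select=4, step=1).fit(X, y)
print(rfe.support_.sum())  # → 4 features kept
```

Running the wrapped classifier on the reduced matrix `rfe.transform(X)` corresponds to the 60%/40%/20% feature-budget experiments reported above.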


Author(s):  
Václav Klepáč ◽  
David Hampel

This article focuses on predicting bankruptcy for 850 medium-sized retail companies in the EU, of which 48 went bankrupt in 2014, with respect to the lag of the used features. From the various types of classification models, we chose the Support Vector Machine method with linear, polynomial, and radial kernels to obtain the best results. Pre-processing is enhanced with filter-based feature selection, such as Gain Ratio, Chi-square, and the Relief algorithm, to acquire the attributes with the best information value. On this basis, we work with random samples of financial data and measure prediction accuracy using confusion matrices and area-under-curve values for the different kernel types and selected features. The results make it obvious that the precision of bankruptcy prediction drops as the distance to the bankruptcy grows. The last year (2013) with available financial data offers the best total prediction accuracy, so we also report both Type I and Type II errors for better insight. The 3rd-order polynomial kernel offers better accuracy for bankruptcy prediction than the linear and radial versions, but in terms of total accuracy we recommend using the radial kernel without feature selection.
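The kernel comparison described above can be sketched as follows. The real financial dataset is not available here, so synthetic data with a similar class imbalance (about 6% bankrupt) stands in; the confusion matrix exposes the Type I (false positive) and Type II (false negative) errors the study reports.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: ~850 firms, ~6% positive ("bankrupt") class.
X, y = make_classification(n_samples=850, n_features=12, weights=[0.94],
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
for kernel, kwargs in [("linear", {}), ("poly", {"degree": 3}), ("rbf", {})]:
    clf = SVC(kernel=kernel, **kwargs).fit(Xtr, ytr)
    cm = confusion_matrix(yte, clf.predict(Xte))
    # cm.ravel() gives (tn, fp, fn, tp): fp = Type I error, fn = Type II.
    print(kernel, cm.ravel())
```

Which kernel wins on such data depends on the sample, matching the paper's finding that the best kernel differs between bankruptcy precision and total accuracy.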


2020 ◽  
Vol 10 (15) ◽  
pp. 5351
Author(s):  
Zafer Erenel ◽  
Oluwatayomi Rereloluwa Adegboye ◽  
Huseyin Kusetogullari

This paper presents a new scheme for term selection in the field of emotion recognition from text. The proposed framework is based on utilizing moderately frequent terms during term selection. More specifically, all terms are evaluated by considering their relevance scores, based on the idea that moderately frequent terms may also carry valuable information for discrimination. The proposed feature selection scheme performs better than conventional filter-based feature selection measures such as Chi-Square and Gini-Text in numerous cases. The bag-of-words approach is used to construct the vectors for document representation, where each selected term is assigned weight 1 if it is present in the document and weight 0 otherwise. The proposed scheme includes terms that are not selected by Chi-Square or Gini-Text. Experiments conducted on a benchmark dataset show that moderately frequent terms boost the representation power of the term subsets, as noticeable improvements are observed in terms of accuracy.


Author(s):  
Nguyen Thi Anh Dao ◽  
Le Trung Thanh ◽  
Viet-Dung Nguyen ◽  
Nguyen Linh-Trung ◽  
Ha Vu Le

Epilepsy is one of the most common and severe brain disorders. The electroencephalogram (EEG) is widely used in epilepsy diagnosis and treatment, as it allows epileptic spikes to be observed. Tensor decomposition-based feature extraction has been proposed to facilitate automatic detection of EEG epileptic spikes. However, tensor decomposition may still produce a large number of features that contribute little to the expected output performance. We proposed a new feature selection method that combines the Fisher score and p-value feature selection methods to rank the features, using longest common sequences (LCS) to separate epileptic and non-epileptic spikes. The proposed method significantly outperformed several state-of-the-art feature selection methods.
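Of the two criteria the method combines, the Fisher score has a standard closed form: the ratio of between-class to within-class variance per feature. How the paper fuses it with p-values via longest common sequences is specific to that work and not reproduced here; this is only the standard score on synthetic data.

```python
import numpy as np

def fisher_score(X, y):
    """Standard Fisher score: between-class over within-class variance,
    computed independently for each feature."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / den

# Synthetic spikes: feature 0 is class-separating, feature 1 is noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.r_[rng.normal(0, 1, 30), rng.normal(3, 1, 30)],
                     rng.normal(0, 1, 60)])
y = np.r_[np.zeros(30), np.ones(30)]
scores = fisher_score(X, y)
print(scores.argmax())  # feature 0, the discriminative one
```

Features with a high Fisher score and a low p-value would both be ranked highly, which is the intuition behind combining the two criteria.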


2018 ◽  
Vol 7 (1) ◽  
pp. 57-72
Author(s):  
H.P. Vinutha ◽  
Poornima Basavaraju

Day by day, network security is becoming a more challenging task. Intrusion detection systems (IDSs) are one of the methods used to monitor network activities, and data mining algorithms play a major role in the field of IDS. The NSL-KDD'99 dataset is used to study the network traffic pattern, which helps us identify possible attacks taking place on the network. The dataset contains 41 attributes and one class attribute categorized as normal, DoS, Probe, R2L, and U2R. The proposed methodology aims to reduce the false-positive rate and improve the detection rate by reducing the dimensionality of the dataset, since using all 41 attributes in detection is not good practice. Four feature selection methods, Chi-Square, SU, Gain Ratio, and Information Gain, are used to evaluate the attributes, and unimportant features are removed to reduce the dimension of the data. Ensemble classification techniques, Boosting, Bagging, Stacking, and Voting, are used to observe the detection rate separately with three base algorithms: Decision Stump, J48, and Random Forest.
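The pipeline above, filter-based selection followed by an ensemble over a weak base learner, can be sketched in scikit-learn. Mutual information stands in for the Information Gain filter, `max_depth=1` trees play the role of decision stumps, and synthetic 41-attribute data replaces NSL-KDD'99; the k=10 feature budget is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 41 attributes, as in NSL-KDD'99.
X, y = make_classification(n_samples=300, n_features=41, n_informative=8,
                           random_state=0)
# Filter step: keep the 10 attributes with the highest mutual information.
X_red = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)
stump = DecisionTreeClassifier(max_depth=1)   # decision-stump base learner
for ens in (BaggingClassifier(stump, n_estimators=25, random_state=0),
            AdaBoostClassifier(stump, n_estimators=25, random_state=0)):
    acc = cross_val_score(ens, X_red, y, cv=5).mean()
    print(type(ens).__name__, round(acc, 3))
```

Swapping the filter (Chi-Square, Gain Ratio) or the ensemble (Stacking, Voting) into the same two-stage structure reproduces the study's comparison grid.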


2021 ◽  
Vol 25 (1) ◽  
pp. 21-34
Author(s):  
Rafael B. Pereira ◽  
Alexandre Plastino ◽  
Bianca Zadrozny ◽  
Luiz H.C. Merschmann

In many important application domains, such as text categorization, biomolecular analysis, scene or video classification and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to a substantial amount of research in multi-label classification. More specifically, feature selection methods have been developed to allow the identification of relevant and informative features for multi-label classification. This work presents a new feature selection method based on the lazy feature selection paradigm and specific for the multi-label context. Experimental results show that the proposed technique is competitive when compared to multi-label feature selection techniques currently used in the literature, and is clearly more scalable, in a scenario where there is an increasing amount of data.
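The paper's lazy feature selection method is not spelled out in the abstract. As a point of reference, a common multi-label baseline it would be compared against is the binary-relevance reduction: score features against each label independently and aggregate; everything here (the chi2 scorer, the max aggregation, the k=5 budget) is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import chi2

# Synthetic multi-label data: 20 non-negative count features, several labels.
X, Y = make_multilabel_classification(n_samples=100, n_features=20,
                                      n_labels=2, random_state=0)
# Binary relevance: chi2 score of every feature against each label column.
per_label = np.column_stack([chi2(X, Y[:, j])[0] for j in range(Y.shape[1])])
best = per_label.max(axis=1)           # best score across the label set
top_k = np.argsort(best)[::-1][:5]     # indices of the 5 strongest features
print(sorted(top_k.tolist()))
```

A lazy method, by contrast, defers this scoring to classification time and conditions it on the instance being classified, which is where the scalability claim above comes from.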

