Benefiting feature selection by the discovery of false irrelevant attributes

Author(s): Lidia S. Chao, Derek F. Wong, Philip C. L. Chen, Wing W. Y. Ng, Daniel S. Yeung

Ordinary feature selection methods select only the explicitly relevant attributes by filtering out the irrelevant ones, trading selection accuracy for execution time and complexity. In the process, the hidden supportive information possessed by the irrelevant attributes may be lost, so good attribute combinations may be missed. We believe that attributes that are useless for the classification task by themselves may sometimes provide potentially useful supportive information to other attributes and thus benefit the classification task. Such a strategy minimizes information loss and is therefore able to maximize classification accuracy, especially for datasets containing hidden interactions among attributes. This paper proposes a feature selection methodology from a new angle: it selects not only the relevant features but also targets the potentially useful false irrelevant attributes by measuring their supportive importance to other attributes. The empirical results validate the hypothesis by demonstrating that the proposed approach outperforms most state-of-the-art filter-based feature selection methods.
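
A minimal sketch of the idea, not the authors' algorithm: flag attributes whose individual relevance is near zero but which improve a classifier when added to an already-selected relevant subset. The MI threshold, the choice of classifier, and the synthetic data are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)
relevant = list(np.argsort(mi)[::-1][:3])           # explicitly relevant attributes
candidates = [i for i in range(X.shape[1]) if i not in relevant]

clf = DecisionTreeClassifier(random_state=0)
base = cross_val_score(clf, X[:, relevant], y, cv=5).mean()

for a in candidates:
    combined = cross_val_score(clf, X[:, relevant + [a]], y, cv=5).mean()
    # "false irrelevant": useless alone, supportive in combination
    tag = "false irrelevant?" if mi[a] < 0.01 and combined > base else ""
    print(f"feature {a}: MI={mi[a]:.3f}, joint accuracy={combined:.3f} {tag}")
```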

Author(s): B. Venkatesh, J. Anuradha

In microarray data, achieving high classification accuracy is complicated by high dimensionality and irrelevant, noisy data; such data also contain many gene expression measurements but few samples. To increase the classification accuracy and processing speed of the model, an optimal number of features needs to be extracted, which can be achieved by applying a feature selection method. In this paper, we propose a hybrid ensemble feature selection method with two phases, filter and wrapper. In the filter phase, an ensemble technique aggregates the feature ranks of the Relief, minimum Redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods, using Fuzzy Gaussian membership function ordering to aggregate the ranks. In the wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) selects the optimal features, with an RBF kernel-based Support Vector Machine (SVM) classifier as the evaluator. The performance of the proposed model is compared with state-of-the-art feature selection methods on five benchmark datasets, using performance metrics such as Accuracy, Recall, Precision, and F1-Score. The experimental results show that the proposed method outperforms the other feature selection methods.
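
A rough sketch of the filter phase only, under stated substitutions: f_classif, mutual information, and absolute Pearson correlation stand in for Relief/mRMR/FC, and a Gaussian membership over ranks is an illustrative stand-in for the paper's fuzzy ordering; the IBPSO wrapper phase is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=1)

def ranks(scores):
    # rank 0 = best feature under this criterion
    return np.argsort(np.argsort(-np.asarray(scores)))

r1 = ranks(f_classif(X, y)[0])
r2 = ranks(mutual_info_classif(X, y, random_state=1))
r3 = ranks(np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]))

# Gaussian membership: the weight decays smoothly as the rank worsens.
sigma = X.shape[1] / 4.0
member = lambda r: np.exp(-(r ** 2) / (2 * sigma ** 2))
agg = member(r1) + member(r2) + member(r3)

top = np.argsort(-agg)[:8]   # candidate pool handed to the wrapper phase
print("filter-phase candidates:", top)
```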


2016, Vol 9 (2), pp. 106
Author(s): Ratri Enggar Pawening, Tio Darmawan, Rizqa Raaiqa Bintana, Agus Zainal Arifin, Darlis Herumurti

Datasets with heterogeneous features can yield inappropriate feature selection results because it is difficult to evaluate heterogeneous features concurrently. Feature transformation (FT) is one way to handle heterogeneous feature subset selection, but transforming non-numerical features into numerical ones may produce redundancy with the original numerical features. In this paper, we propose a method to select a feature subset based on mutual information (MI) for classifying heterogeneous features. We use the unsupervised feature transformation (UFT) method to transform non-numerical features into numerical features and the joint mutual information maximisation (JMIM) method to select the feature subset with consideration of the class label. The transformed and original features are combined, a feature subset is determined using JMIM, and the result is classified with the support vector machine (SVM) algorithm. Classification accuracy is measured for each number of selected features and compared between the UFT-JMIM and Dummy-JMIM methods. The average classification accuracy across all experiments is about 84.47% for UFT-JMIM and about 84.24% for Dummy-JMIM. This result shows that UFT-JMIM can minimize information loss between transformed and original features and select a feature subset that avoids redundant and irrelevant features.
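
An illustrative JMIM selection on discretized features: greedily pick the feature maximizing the worst-case joint information I(f, s; y) with the already-selected features. The simple integer encoding here is an assumed stand-in for the UFT step, and the synthetic data is for demonstration only.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def joint_mi(f, s, y):
    # I(f, s; y): encode the (f, s) pair as one discrete variable
    pair = f * (s.max() + 1) + s
    return mutual_info_score(pair, y)

def jmim(Xd, y, k):
    # Xd: matrix of discretized (integer) features
    n = Xd.shape[1]
    first = max(range(n), key=lambda j: mutual_info_score(Xd[:, j], y))
    S = [first]
    while len(S) < k:
        rest = [j for j in range(n) if j not in S]
        # JMIM criterion: maximize the minimum joint information
        best = max(rest, key=lambda j: min(joint_mi(Xd[:, j], Xd[:, s], y) for s in S))
        S.append(best)
    return S

rng = np.random.default_rng(0)
Xd = rng.integers(0, 4, size=(300, 12))
y = (Xd[:, 0] + Xd[:, 3] > 3).astype(int)     # depends on features 0 and 3
print("selected:", jmim(Xd, y, 4))
```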


Author(s): Abimbola G Akintola, Abdullateef Balogun, Fatimah B Lafenwa-Balogun, Hameed A Mojeed

Classification is a popular approach to predicting software defects; it involves categorizing modules, represented by a set of metrics or code attributes, into fault-prone (FP) and non-fault-prone (NFP) classes by means of a classification model. Nevertheless, low-quality, unreliable, redundant, and noisy data negatively affect the process of discovering knowledge and useful patterns. Researchers therefore need to retrieve relevant data from huge records using feature selection methods. Feature selection is the process of identifying the most relevant attributes and removing the redundant and irrelevant ones. In this study, the researchers investigated the effect of filter feature selection on classification techniques in software defect prediction. Ten publicly available datasets from the NASA and Metric Data Program software repositories were used. The most discriminatory attributes of each dataset were selected using Principal Component Analysis (PCA), CFS, and FilterSubsetEval. The datasets were classified by classifiers carefully selected for heterogeneity: Naïve Bayes from the Bayes category, KNN from the instance-based learners, the J48 decision tree from the tree-based classifiers, and a Multilayer Perceptron from the neural network classifiers. The experimental results revealed that applying feature selection to datasets before classification in software defect prediction is beneficial and should be encouraged, with the Multilayer Perceptron combined with FilterSubsetEval achieving the best accuracy. It can be concluded that feature selection methods are capable of improving the performance of learning algorithms in software defect prediction.
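
A compact sketch of this experimental setup using scikit-learn stand-ins rather than the study's Weka components: PCA and SelectKBest replace PCA/CFS/FilterSubsetEval, DecisionTreeClassifier stands in for J48, and an imbalanced synthetic dataset stands in for the NASA/MDP data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Defect data is typically imbalanced: few fault-prone modules.
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

classifiers = {"NB": GaussianNB(), "KNN": KNeighborsClassifier(),
               "J48-like": DecisionTreeClassifier(random_state=0),
               "MLP": MLPClassifier(max_iter=1000, random_state=0)}
filters = {"none": None, "PCA": PCA(n_components=10),
           "KBest": SelectKBest(f_classif, k=10)}

for cname, clf in classifiers.items():
    for fname, filt in filters.items():
        steps = [StandardScaler()] + ([filt] if filt else []) + [clf]
        acc = cross_val_score(make_pipeline(*steps), X, y, cv=5).mean()
        print(f"{cname:9s} + {fname:5s}: {acc:.3f}")
```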


2018, Vol 8 (3), pp. 46-67
Author(s): Mehrnoush Barani Shirzad, Mohammad Reza Keyvanpour

This article describes how feature selection for learning-to-rank algorithms has become an interesting issue. Noisy and irrelevant features hurt performance and cause overfitting in ranking systems; reducing the number of features by eliminating irrelevant and noisy ones is a solution. Several studies have applied feature selection to learning to rank, improving the efficiency and effectiveness of ranking models. As the number of features, and consequently the number of irrelevant and noisy features, increases, a systematic review of feature selection methods for learning to rank is required. In this article, a framework to examine research on feature selection for learning to rank (FSLR) is proposed. Under this framework, the authors review the most state-of-the-art methods and suggest several criteria to analyze them. FSLR offers a structured classification of current algorithms for future research to: a) properly select strategies from existing algorithms using certain criteria, or b) find ways to develop existing methodologies.


Author(s): Heba F. Eid, Mostafa A. Salama, Aboul Ella Hassanien

Feature selection is a preprocessing step for machine learning that increases classification accuracy and reduces complexity. Feature selection methods fall into two main categories: filter and wrapper. Filter methods evaluate features without involving any learning algorithm, while wrapper methods depend on a learning algorithm for feature evaluation. A variety of hybrid filter-wrapper methods have been proposed in the literature. However, hybrid filter-wrapper approaches suffer from the problem of determining the cut-off point of the ranked features, which can decrease classification accuracy by eliminating important features. In this paper, the authors propose a hybrid bi-layer behavioral-based feature selection approach that combines filter and wrapper feature selection methods and solves the cut-off point problem for the ranked features. It consists of two layers: at the first layer, information gain is used to rank the features and select a new feature set corresponding to a global maximum of classification accuracy; at the second layer, a new subset of features is selected from the first layer's reduced dataset by searching for a group of local maxima of classification accuracy. To evaluate the proposed approach, it is applied to the NSL-KDD dataset, where the number of features is reduced from 41 to 34 at the first layer and then from 34 to 20 at the second layer, improving the classification accuracy to 99.2%.
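
A simplified two-layer sketch of the cut-off idea: information gain ranks the features, layer 1 picks the cut-off k with the best cross-validated accuracy (the global maximum) instead of fixing it a priori, and layer 2 greedily drops features from that subset while accuracy does not fall, as a stand-in for the paper's local-maxima search. The synthetic 41-feature dataset and the decision tree evaluator are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=41, n_informative=10, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
order = np.argsort(-mutual_info_classif(X, y, random_state=0))  # IG-style ranking

def acc(idx):
    return cross_val_score(clf, X[:, list(idx)], y, cv=5).mean()

# Layer 1: sweep every cut-off point and keep the global maximum.
best_k = max(range(1, len(order) + 1), key=lambda k: acc(order[:k]))
layer1 = list(order[:best_k])

# Layer 2: backward pass over the reduced set.
subset = list(layer1)
for f in list(subset):
    trial = [g for g in subset if g != f]
    if trial and acc(trial) >= acc(subset):
        subset = trial

print(f"layer 1: {len(layer1)} features, layer 2: {len(subset)} features")
```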


Author(s): Mohsin Iqbal, Saif Ur Rehman, Saira Gillani, Sohail Asghar

The key objective of this chapter is to study classification accuracy when using feature selection with machine learning algorithms. Feature selection reduces the dimensionality of the data and improves the accuracy of the learning algorithm. We test how integrated feature selection affects the accuracy of three classifiers by applying several feature selection methods. Among the filter methods, Information Gain (IG), Gain Ratio (GR), and Relief-F, and among the wrapper methods, Bagging with Naive Bayes (NB), enabled the classifiers to achieve the highest average increase in classification accuracy while reducing the number of unnecessary attributes. These conclusions can advise machine learning users on which classifier and feature selection methods to use to optimize classification accuracy. This is especially important in risk-sensitive applications of machine learning, where one aim is to reduce the costs of collecting, processing, and storing unnecessary data.
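
For reference, a worked example of the two filter scores on a small discrete feature: information gain IG(Y; X) = H(Y) - H(Y|X), and gain ratio GR = IG / H(X), which penalizes many-valued features. The toy arrays are illustrative.

```python
import numpy as np

def entropy(v):
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(x, y):
    # H(Y) minus the weighted entropy of Y within each value of X
    cond = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
    return entropy(y) - cond

def gain_ratio(x, y):
    return info_gain(x, y) / entropy(x)   # normalize by the feature's own entropy

x = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])
print(f"IG = {info_gain(x, y):.3f}, GR = {gain_ratio(x, y):.3f}")
```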


2021, pp. 1-11
Author(s): Carolina Martín-del-Campo-Rodríguez, Grigori Sidorov, Ildar Batyrshin

This paper presents a computational model for the unsupervised authorship attribution task based on a traditional machine learning scheme. An improvement over the state of the art is achieved by comparing different feature selection methods on the PAN17 author clustering dataset. To achieve this improvement, specific pre-processing and feature extraction methods are proposed, such as a method that separates tokens by type so each is assigned to only one category. Similarly, special characters are treated as punctuation marks to improve the result obtained when applying typed character n-grams. A weighted cosine similarity measure is applied to improve the B³ F-score by reducing the vector values where attributes are exclusive. This measure defines distances between documents, which are later used by the clustering algorithm to perform authorship attribution.
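
A rough sketch of such a pipeline: character n-gram vectors, cosine distances, and agglomerative clustering. The `char_wb` analyzer and plain TF-IDF weighting are simplified stand-ins for the paper's typed character n-grams and weighted cosine similarity, and the toy documents replace the PAN17 corpus.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

docs = ["First sample document by author A!", "Another text by author A!!",
        "A note written by author B?", "More prose from author B?"]

# char_wb n-grams approximate typed character n-grams (word-boundary aware).
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
X = vec.fit_transform(docs)

dist = cosine_distances(X)
# (older scikit-learn versions call the `metric` parameter `affinity`)
labels = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                 linkage="average").fit_predict(dist)
print("cluster assignments:", labels)
```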


Entropy, 2020, Vol 22 (7), pp. 797
Author(s): Ping Zhang, Wanfu Gao, Juncheng Hu, Yonghao Li

Multi-label data often involve features with high dimensionality and complicated label correlations, posing a great challenge for multi-label learning. Feature selection plays an important role in multi-label learning, and exploring label correlations is crucial for multi-label feature selection. Previous information-theoretic methods employ a cumulative-summation approximation to evaluate candidate features, which considers only low-order label correlations. In fact, high-order label correlations exist in the label set: labels naturally cluster into several groups, with similar labels tending to fall into the same group and dissimilar labels into different groups. The cumulative-summation approximation tends to select features related to the groups containing more labels while ignoring the classification information of groups containing fewer labels. As a result, many features related to similar labels are selected, which leads to poor classification performance. To this end, a Max-Correlation term that considers high-order label correlations is proposed. Additionally, we combine the Max-Correlation term with a feature redundancy term to ensure that the selected features are relevant to different label groups. Finally, a new method named Multi-label Feature Selection considering Max-Correlation (MCMFS) is proposed. Experimental results demonstrate the classification superiority of MCMFS in comparison to eight state-of-the-art multi-label feature selection methods.
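
An illustrative greedy selector in this flavor, not the authors' exact objective: the relevance term takes the maximum mutual information over labels (a max-correlation term), and a mean pairwise-MI redundancy term is subtracted. Features are assumed already discretized; the synthetic data is for demonstration.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mcmfs_like(Xd, Y, k):
    # Xd: discretized features; Y: binary label matrix (n_samples, n_labels)
    n = Xd.shape[1]
    S = []
    while len(S) < k:
        best, best_score = None, -np.inf
        for j in range(n):
            if j in S:
                continue
            # max over labels instead of a cumulative sum over labels
            relevance = max(mutual_info_score(Xd[:, j], Y[:, l])
                            for l in range(Y.shape[1]))
            redundancy = (np.mean([mutual_info_score(Xd[:, j], Xd[:, s]) for s in S])
                          if S else 0.0)
            if relevance - redundancy > best_score:
                best, best_score = j, relevance - redundancy
        S.append(best)
    return S

rng = np.random.default_rng(0)
Xd = rng.integers(0, 3, size=(200, 15))
Y = np.stack([(Xd[:, 0] > 1).astype(int), (Xd[:, 4] > 0).astype(int)], axis=1)
print("selected:", mcmfs_like(Xd, Y, 5))
```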


2019
Author(s): Gopi Battineni, Francesco Amenta, Nalini Chintalapudi

Abstract Background: Predictions of cancer development have been improving through Machine Learning (ML) methods compared to pathologists. This study aims to develop an ML-model-based feature selection methodology to accurately predict breast cancers. Methods: We consider a database of 569 breast cancer cases, diagnosed as 212 malignant and 357 benign, and adopt the logistic regression (LR) ML model, run with both the total feature set and a selective feature set. Results: The outcomes produce a cancer prediction accuracy of 90.86% with the total features and 93.84% with the selective features. In addition, we validated the results with the Area Under the Curve (AUC): the AUC of the Receiver Operating Characteristic (ROC) curve for the selective feature set (99.8%) is greater than the AUC for the total feature set (96.2%). In short, the LR model achieved accurate tumor classification with the selective features. Conclusions: It is very important to address cancer tumor classification with ML, as it will help pathologists provide preventive care for patients. However, other cancer features should be incorporated in order to understand further causes and to support more accurate classification in decision-making.
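
A reproduction-style sketch on the Wisconsin diagnostic dataset, which matches the case counts reported here (569 cases, 212 malignant / 357 benign). SelectKBest with k=10 is an assumed stand-in for the study's feature selection step, so the exact numbers will differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

variants = {"all features": [StandardScaler(), LogisticRegression(max_iter=5000)],
            "selected":     [StandardScaler(), SelectKBest(f_classif, k=10),
                             LogisticRegression(max_iter=5000)]}

for name, steps in variants.items():
    model = make_pipeline(*steps).fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    print(f"{name:12s} acc={accuracy_score(y_te, model.predict(X_te)):.3f} "
          f"AUC={roc_auc_score(y_te, proba):.3f}")
```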


Author(s): Sergio Damian, Hiram Calvo, Alexander Gelbukh

The paper presents a classifier for detecting fake news spreaders in social media. Detecting fake news spreaders is an important task because this kind of disinformation aims to change readers' opinions about topics relevant to society. This work presents a classifier that can compete with those found in the state of the art. In addition, it applies Explainable Artificial Intelligence (XAI) methods to understand the corpora used and how the model estimates its results. The work focuses on the corpora developed by members of the PAN@CLEF 2020 competition. The score obtained surpasses the state of the art with a mean accuracy of 0.7825. The solution uses XAI methods for the feature selection process, since they offer more stable selections than most traditional feature selection methods. This work also concludes that the detection performed by the proposed approach is generally based on the topic of the text.
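
A hedged sketch of explanation-driven feature selection: permutation importance, a model-agnostic explainability score, stands in for whichever XAI method the paper uses, and the toy texts and labels replace the PAN@CLEF 2020 corpora.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

texts = ["breaking!!! shocking truth they hide", "weather was nice today",
         "click now, the media lies!!!", "my cat sat on the keyboard"]
y = np.array([1, 0, 1, 0])   # 1 = spreader-like profile (toy labels)

vec = TfidfVectorizer(max_features=50)
X = vec.fit_transform(texts).toarray()
clf = LogisticRegression().fit(X, y)

# Rank features by how much shuffling each one hurts the model,
# then keep the top scorers as the selected feature set.
imp = permutation_importance(clf, X, y, n_repeats=20, random_state=0)
keep = np.argsort(-imp.importances_mean)[:10]
print([vec.get_feature_names_out()[i] for i in keep])
```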

