An Efficient, Parallelized Algorithm for Optimal Conditional Entropy-Based Feature Selection

Entropy ◽  
2020 ◽  
Vol 22 (4) ◽  
pp. 492
Author(s):  
Gustavo Estrela ◽  
Marco Dimas Gubitoso ◽  
Carlos Eduardo Ferreira ◽  
Junior Barrera ◽  
Marcelo S. Reis

In Machine Learning, feature selection is an important step in classifier design. It consists of finding a subset of features that is optimal for a given cost function. One way to solve feature selection is to organize all possible feature subsets into a Boolean lattice and to exploit the fact that the costs of chains in that lattice describe U-shaped curves. Minimization of such a cost function is known as the U-curve problem. Recently, a study proposed U-Curve Search (UCS), an optimal algorithm for that problem, which was successfully used for feature selection. However, despite the algorithm's optimality, the time required by UCS in computational assays was exponential in the number of features. Here, we report that this scalability issue arises from the fact that the U-curve problem is NP-hard. We then introduce the Parallel U-Curve Search (PUCS), a new algorithm for the U-curve problem. In PUCS, we present a novel way to partition the search space into smaller Boolean lattices, thus rendering the algorithm highly parallelizable. We also provide computational assays with both synthetic data and Machine Learning datasets, in which the performance of PUCS was assessed against UCS and other gold-standard algorithms in feature selection.
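The Boolean-lattice formulation can be made concrete with a small sketch. The code below is not UCS or PUCS: it simply enumerates all 2^n feature subsets (the full lattice) and scores each with a penalized empirical conditional entropy H(Y | X_S). The dataset, cost function, and penalty weight are assumptions for demonstration only.

```python
# Minimal illustration of searching the Boolean lattice of feature subsets
# with a conditional-entropy-based cost. Exhaustive, so only viable for
# small n; UCS/PUCS exist precisely to avoid this brute force.
from collections import Counter
from itertools import chain, combinations
from math import log2

def conditional_entropy(rows, labels, subset):
    """Empirical H(Y | X_S) for the feature subset `subset`."""
    joint = Counter()        # counts of (x_S, y)
    marginal = Counter()     # counts of x_S
    for row, y in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        joint[(key, y)] += 1
        marginal[key] += 1
    n = len(rows)
    return -sum((c / n) * log2(c / marginal[key])
                for (key, _), c in joint.items())

def best_subset(rows, labels, n_features, penalty=0.05):
    """Exhaustive minimization over the Boolean lattice (2^n subsets)."""
    all_subsets = chain.from_iterable(
        combinations(range(n_features), k) for k in range(n_features + 1))
    # Cost = conditional entropy + a size penalty; along chains of the
    # lattice this kind of cost tends to trace the U-shaped curves the
    # paper's algorithms exploit.
    return min(all_subsets,
               key=lambda s: conditional_entropy(rows, labels, s)
                             + penalty * len(s))

# Tiny synthetic example: the label depends on features 0 and 2 only.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
labels = [r[0] ^ r[2] for r in rows]
print(best_subset(rows, labels, 3))   # expected: (0, 2)
```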

2021 ◽  
Author(s):  
Bing Xue ◽  
Mengjie Zhang ◽  
William Browne ◽  
Xin Yao

Feature selection is an important task in data mining and machine learning to reduce the dimensionality of the data and increase the performance of an algorithm, such as a classification algorithm. However, feature selection is a challenging task, due mainly to the large search space. A variety of methods have been applied to solve feature selection problems, where evolutionary computation techniques have recently gained much attention and shown some success. However, there are no comprehensive guidelines on the strengths and weaknesses of alternative approaches. This leads to a disjointed and fragmented field with ultimately lost opportunities for improving performance and successful applications. This paper presents a comprehensive survey of the state-of-the-art work on evolutionary computation for feature selection, which identifies the contributions of these different algorithms. In addition, current issues and challenges are also discussed to identify promising areas for future research.

Index Terms: Evolutionary computation, feature selection, classification, data mining, machine learning.
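As a concrete instance of the kind of evolutionary-computation wrapper the survey covers, the sketch below runs a minimal genetic algorithm over feature bitmasks, with cross-validated k-NN accuracy as fitness. All choices here (population size, mutation rate, classifier, dataset) are illustrative assumptions, not settings from the survey.

```python
# Minimal GA feature-selection sketch: bitmask chromosomes, tournament
# selection, one-point crossover, bit-flip mutation, elitism.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_wine(return_X_y=True)
pop_size, n_gen, mut_rate = 20, 25, 0.05
n_feats = X.shape[1]

def fitness(mask):
    """Wrapper fitness: 3-fold CV accuracy of k-NN on the selected features."""
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()

pop = rng.random((pop_size, n_feats)) < 0.5        # random initial bitmasks
for _ in range(n_gen):
    fit = np.array([fitness(ind) for ind in pop])
    new_pop = [pop[fit.argmax()].copy()]           # elitism: keep the best
    while len(new_pop) < pop_size:
        # Tournament selection of two parents (tournament size 2).
        a, b = (pop[max(rng.integers(0, pop_size, 2), key=lambda i: fit[i])]
                for _ in range(2))
        cut = rng.integers(1, n_feats)             # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        child ^= rng.random(n_feats) < mut_rate    # bit-flip mutation
        new_pop.append(child)
    pop = np.array(new_pop)

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(f"selected {int(best.sum())}/{n_feats} features, "
      f"CV accuracy = {fitness(best):.3f}")
```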


2021 ◽  
Vol 5 (3) ◽  
pp. 36
Author(s):  
Gabriella Kicska ◽  
Attila Kiss

Nowadays, the high dimensionality of data causes a variety of problems in machine learning. It is necessary to reduce the number of features by selecting only the most relevant ones; approaches to this task are collectively called Feature Selection. In this paper, we propose a Feature Selection method that uses Swarm Intelligence techniques. Swarm Intelligence algorithms perform optimization by searching for optimal points in the search space. We show the usability of these techniques for solving Feature Selection and compare the performance of five major swarm algorithms: Particle Swarm Optimization, Artificial Bee Colony, Invasive Weed Optimization, Bat Algorithm, and Grey Wolf Optimizer. The accuracy of a decision tree classifier was used to evaluate the algorithms. It turned out that the dimensionality of the data can be roughly halved without loss of accuracy; moreover, accuracy increased when redundant features were discarded. Based on our experiments, GWO performed best: it achieved the highest ranking on different datasets, and its average number of iterations to find the best solution was 30.8. ABC obtained the lowest ranking on high-dimensional datasets.
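A minimal sketch of this wrapper setup follows, using binary Particle Swarm Optimization with decision-tree accuracy as the fitness function, in the spirit of the comparison above. The sigmoid transfer function, swarm size, and coefficients are assumptions for illustration, not the authors' configuration.

```python
# Binary PSO feature selection: particles are 0/1 vectors over features,
# velocities are mapped to bit probabilities by a sigmoid transfer function.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_particles, n_features, n_iters = 20, X.shape[1], 30
w, c1, c2 = 0.7, 1.5, 1.5            # inertia and acceleration coefficients

def fitness(bits):
    """3-fold CV accuracy of a decision tree on the selected features."""
    mask = bits > 0.5
    if not mask.any():
        return 0.0
    tree = DecisionTreeClassifier(random_state=0)
    return cross_val_score(tree, X[:, mask], y, cv=3).mean()

pos = (rng.random((n_particles, n_features)) < 0.5).astype(float)
vel = rng.normal(0.0, 1.0, (n_particles, n_features))
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random(vel.shape), rng.random(vel.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    # The sigmoid turns each velocity into the probability of setting a bit.
    pos = (rng.random(vel.shape) < 1.0 / (1.0 + np.exp(-vel))).astype(float)
    fit = np.array([fitness(p) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print(f"selected {int(gbest.sum())}/{n_features} features, "
      f"CV accuracy = {pbest_fit.max():.3f}")
```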


2021 ◽  
Vol 15 (8) ◽  
pp. 912-926
Author(s):  
Ge Zhang ◽  
Pan Yu ◽  
Jianlin Wang ◽  
Chaokun Yan

Background: There have been rapid developments in various bioinformatics technologies, which have led to the accumulation of large amounts of biomedical data. However, these datasets usually involve thousands of features and include much irrelevant or redundant information, which leads to confusion during diagnosis. Feature selection is a solution that consists of finding the optimal subset, which is known to be an NP-hard problem because of the large search space. Objective: To address this issue, this paper proposes a hybrid feature selection method, called IGICRO, which combines an improved chemical reaction optimization algorithm (ICRO) with an information gain (IG) approach. Methods: IG is adopted to screen out important features. A neighborhood search mechanism is combined with ICRO to increase the diversity of the population and improve the capacity of local search. Results: Experimental results on eight publicly available datasets demonstrate that our proposed approach outperforms the original CRO and other state-of-the-art approaches.
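The IG filter stage can be sketched as follows; the ICRO search itself is the paper's contribution and is not reproduced here. The information gain of a feature with respect to the class equals the mutual information I(X; Y) = H(Y) - H(Y | X); sklearn's mutual_info_classif serves as a stand-in estimator, and the top-20% cutoff is an assumption.

```python
# Filter stage of a hybrid pipeline: rank features by (estimated)
# information gain and keep the top fraction to shrink the search space
# before a metaheuristic search refines the subset.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
gain = mutual_info_classif(X, y, random_state=0)   # one score per feature

k = max(1, int(0.2 * X.shape[1]))                  # keep the top 20%
top = np.argsort(gain)[::-1][:k]
X_filtered = X[:, top]
print(f"kept features {sorted(top.tolist())} -> shape {X_filtered.shape}")
# X_filtered would then seed the metaheuristic search stage, which refines
# the subset with neighborhood-search moves for better local exploration.
```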


2021 ◽  
Vol 15 (4) ◽  
pp. 1-46
Author(s):  
Kui Yu ◽  
Lin Liu ◽  
Jiuyong Li

In this article, we aim to develop a unified view of causal and non-causal feature selection methods. The unified view fills a gap in research on the relation between the two types of methods. Based on the Bayesian network framework and information theory, we first show that causal and non-causal feature selection methods share the same objective: to find the Markov blanket of a class attribute, the theoretically optimal feature set for classification. We then examine the assumptions made by causal and non-causal feature selection methods when searching for the optimal feature set, and unify the assumptions by mapping them to restrictions on the structure of the Bayesian network model of the studied problem. We further analyze in detail how these structural assumptions lead to the different levels of approximation employed by the methods in their search, which in turn yield approximations, with respect to the optimal feature set, in the feature sets the methods find. With the unified view, we can interpret the output of non-causal methods from a causal perspective and derive the error bounds of both types of methods. Finally, we present a practical understanding of the relation between causal and non-causal methods using extensive experiments with synthetic data and various types of real-world data.
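The shared objective, that conditioning on the Markov blanket of the class leaves no residual information in any other variable, can be illustrated on a toy Bayesian network. The network structure (grandparent → parent → Y → child ← spouse) and the plug-in conditional mutual information estimator below are assumptions for demonstration, not the article's experimental setup.

```python
# Toy check: MB(Y) = {parent, child, spouse}. The grandparent carries
# information about Y marginally, but none once we condition on the blanket.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 200_000

def flip(bits, p):
    """Copy `bits`, flipping each entry with probability p."""
    return bits ^ (rng.random(n) < p)

grand  = rng.integers(0, 2, n)        # grandparent of Y (outside the blanket)
parent = flip(grand, 0.1)             # parent of Y
y      = flip(parent, 0.1)            # class attribute
spouse = rng.integers(0, 2, n)        # co-parent of `child`
child  = flip(y ^ spouse, 0.1)        # child of Y
mb = parent * 4 + child * 2 + spouse  # encode the blanket as one variable

def cond_mi(a, b, cond):
    """I(A; B | C), estimated by averaging MI within each stratum of C."""
    return sum((cond == c).mean() * mutual_info_score(a[cond == c],
                                                      b[cond == c])
               for c in np.unique(cond))

print(f"I(Y; grand)      = {mutual_info_score(y, grand):.4f}")  # clearly > 0
print(f"I(Y; grand | MB) = {cond_mi(y, grand, mb):.4f}")        # approx. 0
```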


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1226
Author(s):  
Saeed Najafi-Zangeneh ◽  
Naser Shams-Gharneh ◽  
Ali Arjomandi-Nezhad ◽  
Sarfaraz Hashemkhani Zolfani

Companies constantly seek ways to retain their professional employees in order to reduce extra recruiting and training costs. Predicting whether a particular employee is likely to leave helps the company make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula; therefore, machine learning approaches are the best tools for this aim. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction. An IBM HR dataset is chosen as the case study. Since there are numerous features in the dataset, the “max-out” feature selection method is proposed for dimension reduction in the pre-processing stage, and is implemented for the IBM HR dataset. The coefficient of each feature in the logistic regression model indicates the importance of that feature for attrition prediction. The results show an improvement in the F1-score performance measure due to the “max-out” feature selection method. Finally, the validity of the parameters is checked by training the model on multiple bootstrap datasets; the average and standard deviation of the parameters are then analyzed to assess the confidence and stability of the model’s parameters. A small standard deviation of the parameters indicates that the model is stable and is more likely to generalize well.
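The bootstrap stability check from the post-processing stage can be sketched as follows. The dataset below is a stand-in (the paper uses an IBM HR dataset), the resample count is an assumption, and the “max-out” selection itself is not reproduced.

```python
# Refit a logistic regression on bootstrap resamples and inspect the mean
# and standard deviation of each coefficient; small std relative to |mean|
# suggests a stable, trustworthy parameter.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scaling makes coefficients comparable

n_boot = 200
coefs = np.empty((n_boot, X.shape[1]))
for b in range(n_boot):
    idx = rng.integers(0, len(y), len(y))          # bootstrap resample
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    coefs[b] = model.coef_[0]

mean, std = coefs.mean(axis=0), coefs.std(axis=0)
for i in np.argsort(-np.abs(mean))[:5]:            # five largest coefficients
    print(f"feature {i:2d}: coef = {mean[i]:+.3f} ± {std[i]:.3f}")
```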


2021 ◽  
Vol 11 (4) ◽  
pp. 1742
Author(s):  
Ignacio Rodríguez-Rodríguez ◽  
José-Víctor Rodríguez ◽  
Wai Lok Woo ◽  
Bo Wei ◽  
Domingo-Javier Pardo-Quiles

Type 1 diabetes mellitus (DM1) is a metabolic disease derived from falls in pancreatic insulin production, resulting in chronic hyperglycemia. DM1 subjects usually have to undertake a number of assessments of blood glucose levels every day, employing capillary glucometers to monitor blood glucose dynamics. In recent years, advances in technology have allowed for the creation of revolutionary biosensors and continuous glucose monitoring (CGM) techniques, enabling the monitoring of a subject’s blood glucose level in real time. On the other hand, few attempts have been made to apply machine learning techniques to predicting glycaemia levels, and dealing with a database containing such a large number of variables is problematic. In this sense, to the best of the authors’ knowledge, the issue of proper feature selection (FS), the stage before applying predictive algorithms, has not been subject to in-depth discussion and comparison in past research on forecasting glycaemia. Therefore, in order to assess how a proper FS stage could improve the accuracy of glycaemia forecasting, this work developed six FS techniques alongside four predictive algorithms and applied them to a full dataset of biomedical features related to glycaemia, harvested through a wide-ranging passive monitoring process involving 25 patients with DM1 in practical real-life scenarios. From the obtained results, we affirm that Random Forest (RF), as both predictive algorithm and FS strategy, offers the best average performance (Root Median Square Error, RMSE = 18.54 mg/dL) throughout the 12 considered predictive horizons (up to 60 min in steps of 5 min), while Support Vector Machines (SVM) achieved the best accuracy as a forecasting algorithm when considering, in turn, the average over the six FS techniques applied (RMSE = 20.58 mg/dL).
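A sketch of the shape of the winning pipeline (random forest for both FS and forecasting) over the paper's horizon grid follows. Real CGM data is not available here, so a synthetic AR(1) glucose series with lagged features stands in; the importance threshold, lag window, and model settings are assumptions, not the authors' configuration.

```python
# RF importances select the lag features; a second RF then forecasts glucose
# at each horizon (5 to 60 min in 5-min steps), reporting test RMSE in mg/dL.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic CGM-like series: AR(1) around 120 mg/dL, one reading per 5 min.
g = [120.0]
for _ in range(4000):
    g.append(120 + 0.95 * (g[-1] - 120) + rng.normal(0, 5))
g = np.array(g)

lags = 12                                 # the last hour of readings
X = np.column_stack([g[i:len(g) - lags + i + 1] for i in range(lags)])

# FS stage: keep lags whose one-step-ahead RF importance clears a threshold.
fs = RandomForestRegressor(n_estimators=100, random_state=0)
fs.fit(X[:-1], g[lags:])
selected = fs.feature_importances_ > 0.05

for horizon in range(5, 65, 5):           # predictive horizons in minutes
    shift = horizon // 5                  # samples per horizon (5-min grid)
    Xh = X[:len(g) - lags - shift + 1, selected]
    yh = g[lags - 1 + shift:]
    cut = int(0.8 * len(yh))              # temporal train/test split
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(Xh[:cut], yh[:cut])
    rmse = mean_squared_error(yh[cut:], rf.predict(Xh[cut:])) ** 0.5
    print(f"horizon {horizon:2d} min: RMSE = {rmse:5.2f} mg/dL")
```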

