Joint Semi-Supervised Feature Selection and Classification through Bayesian Approach

Author(s):  
Bingbing Jiang ◽  
Xingyu Wu ◽  
Kui Yu ◽  
Huanhuan Chen

As data dimensionality increases, feature selection has become a fundamental task for handling high-dimensional data. Semi-supervised feature selection addresses the problem of learning a relevant feature subset when unlabeled data are abundant and labeled data are scarce. In recent years, many semi-supervised feature selection algorithms have been proposed. However, these algorithms separate feature selection from classifier training, so they cannot simultaneously select features and learn a classifier on the selected features. Moreover, they ignore differences in reliability among unlabeled samples and use all of them directly in the training stage, which might degrade performance. In this paper, we propose a joint semi-supervised feature selection and classification algorithm (JSFS), which adopts a Bayesian approach to automatically select the relevant features and simultaneously learn a classifier. Instead of using all unlabeled samples indiscriminately, JSFS associates each unlabeled sample with a self-adjusting weight that distinguishes among them; by introducing a left-truncated Gaussian prior, it can effectively eliminate irrelevant unlabeled samples. Experiments on various datasets demonstrate the effectiveness and superiority of JSFS.

2015 ◽  
Vol 2015 ◽  
pp. 1-13 ◽  
Author(s):  
Jin-Jia Wang ◽  
Fang Xue ◽  
Hui Li

Feature extraction and classification of EEG signals are core parts of brain-computer interfaces (BCIs). Because the EEG feature vector is high-dimensional, an effective feature selection algorithm has become an integral part of research studies. In this paper, we present a new method based on a wrapped Sparse Group Lasso for channel and feature selection of fused EEG signals. The high-dimensional fused features are obtained first; they include the power spectrum, time-domain statistics, AR model, and wavelet coefficient features extracted from the preprocessed EEG signals. The wrapped channel and feature selection method is then applied, which uses a logistic regression model with a Sparse Group Lasso penalty. The model is fitted on the training data, and the parameters are estimated by modified blockwise coordinate descent and coordinate gradient descent. The best parameters and feature subset are selected using 10-fold cross-validation. Finally, the test data are classified using the trained model. Compared with existing channel and feature selection methods, results show that the proposed method is more suitable, more stable, and faster for high-dimensional feature fusion. It can simultaneously achieve channel and feature selection with a lower error rate. The test accuracy on data from the international BCI Competition IV reached 84.72%.
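The abstract does not state the exact form of the penalty, so the following is only a minimal sketch of one commonly used Sparse Group Lasso formulation: a weighted sum of a group-wise L2 term (which can zero out a whole group, here imagined as one EEG channel) and an element-wise L1 term (which zeroes individual features). The function name, the parameters `lam` and `alpha`, and the square-root group-size weighting are assumptions made for illustration, not details taken from the paper.

```python
import math

def sparse_group_lasso_penalty(weights, groups, lam, alpha):
    """Sketch of a Sparse Group Lasso penalty (assumed form, not the paper's):

    lam * ((1 - alpha) * sum_g sqrt(|g|) * ||w_g||_2 + alpha * ||w||_1)

    weights: flat list of coefficients
    groups:  list of index lists, e.g. one list per channel (hypothetical layout)
    """
    # Element-wise L1 term: encourages sparsity of individual features.
    l1_term = sum(abs(w) for w in weights)
    # Group-wise L2 term: encourages whole groups to drop out together.
    group_term = 0.0
    for idx in groups:
        norm = math.sqrt(sum(weights[i] ** 2 for i in idx))
        group_term += math.sqrt(len(idx)) * norm
    return lam * ((1 - alpha) * group_term + alpha * l1_term)
```

With `alpha = 1` this reduces to the ordinary lasso penalty, and with `alpha = 0` to the group lasso penalty; intermediate values trade off the two kinds of sparsity.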


2021 ◽  
Vol 11 (1) ◽  
pp. 35-43
Author(s):  
Wen Xin Ng ◽  
Weng Howe Chan

In healthcare, biomarkers play an important role in disease classification. Many existing works focus on identifying potential biomarkers from gene expression. However, the large number of redundant features in a high-dimensional dataset such as gene expression data can bias the classifier and reduce the classifier’s performance. Embedded feature selection methods such as ranked guided iterative feature elimination have been widely adopted owing to their good performance in identifying informative features. Nevertheless, methods like ranked guided iterative feature elimination do not consider redundancy among the features. Thus, this paper proposes an improved ranked guided iterative feature elimination method that introduces an additional filter selection step based on minimum redundancy maximum relevance, which filters out redundant features and keeps the relevant feature subset to be ranked and used for classification. Experiments are conducted on two gene expression datasets, for prostate cancer and the central nervous system. Classification performance is measured in terms of accuracy and compared with existing methods. In addition, biological context verification of the identified features is carried out using available knowledge databases. Our method shows improved classification accuracy, and the selected genes were found to be related to the diseases.
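As a rough illustration of the minimum redundancy maximum relevance idea mentioned above, the sketch below performs a greedy selection in the "difference" form: at each step it adds the feature whose relevance minus its mean redundancy with the already selected features is largest. The function name and the inputs (`relevance`, `redundancy`) are hypothetical; in practice these scores are often mutual information estimates, and the paper's actual filter step may differ.

```python
def mrmr_select(relevance, redundancy, k):
    """Greedy mRMR sketch (difference form), under assumed precomputed scores.

    relevance:  relevance[j] is the relevance of feature j to the class label
    redundancy: redundancy[j][s] is the pairwise redundancy between features j and s
    k:          number of features to keep
    """
    selected = []
    remaining = set(range(len(relevance)))

    def score(j):
        # The first pick is based on relevance alone.
        if not selected:
            return relevance[j]
        mean_red = sum(redundancy[j][s] for s in selected) / len(selected)
        return relevance[j] - mean_red

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In a toy case with two highly relevant but nearly identical features and one moderately relevant independent feature, this procedure keeps one of the near-duplicates and then prefers the independent feature over the other duplicate.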


Symmetry ◽  
2020 ◽  
Vol 12 (11) ◽  
pp. 1782
Author(s):  
Supailin Pichai ◽  
Khamron Sunat ◽  
Sirapat Chiewchanwattana

This paper presents a method for feature selection in a high-dimensional classification context. The proposed method finds a candidate solution based on quality criteria using subset searching. In this study, the competitive swarm optimization (CSO) algorithm was implemented to solve feature selection problems in high-dimensional data. A new asymmetric chaotic function, whose histogram is right-skewed, was proposed and used to generate the population and search for a CSO solution. The proposed method is named the asymmetric chaotic competitive swarm optimization algorithm (ACCSO). Owing to the asymmetry of the proposed chaotic map, ACCSO prefers zero to one. Therefore, the solution is very compact and can achieve high classification accuracy with a minimal feature subset for high-dimensional datasets. The proposed method was evaluated on 12 datasets, with dimensions ranging from 4 to 10,304. ACCSO was compared to the original CSO algorithm and other metaheuristic algorithms. Experimental results show that the proposed method can increase accuracy while reducing the number of selected features. Compared to different optimization algorithms with other wrappers, the proposed method exhibits excellent performance.
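For readers unfamiliar with CSO, the sketch below shows one generation of the standard competitive swarm optimizer as it is usually described: particles are paired at random, the particle with the better fitness in each pair is left unchanged, and the loser moves toward the winner and the swarm mean. It does not reproduce the asymmetric chaotic map of ACCSO, which the abstract does not specify. The function name, the `phi` parameter, and the minimization convention are assumptions for illustration.

```python
import random

def cso_generation(swarm, velocity, fitness, phi=0.1, rng=random):
    """One generation of a standard CSO update (sketch; minimizes `fitness`).

    swarm, velocity: lists of equal-length float lists, modified in place
    phi:             weight of the pull toward the swarm mean (assumed default)
    """
    n, dim = len(swarm), len(swarm[0])
    # Mean position of the swarm at the start of the generation.
    mean = [sum(p[d] for p in swarm) / n for d in range(dim)]
    order = list(range(n))
    rng.shuffle(order)
    # Pair particles at random; with an odd swarm size one particle sits out.
    for a, b in zip(order[0::2], order[1::2]):
        if fitness(swarm[a]) <= fitness(swarm[b]):
            win, lose = a, b
        else:
            win, lose = b, a
        # Only the loser is updated; it learns from the winner and the mean.
        for d in range(dim):
            r1, r2, r3 = rng.random(), rng.random(), rng.random()
            velocity[lose][d] = (r1 * velocity[lose][d]
                                 + r2 * (swarm[win][d] - swarm[lose][d])
                                 + phi * r3 * (mean[d] - swarm[lose][d]))
            swarm[lose][d] += velocity[lose][d]
    return swarm, velocity
```

For feature selection, a continuous position is typically mapped to a bit mask (for example by thresholding each coordinate), and the fitness is the wrapper classifier's error on the selected features; both choices are left to the caller here.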


2020 ◽  
Vol 43 (1) ◽  
pp. 103-125
Author(s):  
Yi Zhong ◽  
Jianghua He ◽  
Prabhakar Chalise

With the advent of high-throughput technologies, high-dimensional datasets are increasingly available. This has not only opened up new insights into biological systems but also posed analytical challenges. One important problem is the selection of an informative feature subset and the prediction of future outcomes. It is crucial that models are not overfitted and give accurate results on new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interest in clinical settings. We propose a two-step framework for feature selection and classification model construction, which utilizes a nested and repeated cross-validation method. We evaluated our approach using both simulated data and two publicly available gene expression datasets. The proposed method showed comparatively better predictive accuracy for new cases than the standard cross-validation method.
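The key property of nested cross-validation is that feature selection and tuning happen only inside the outer training fold, so the outer test fold never influences the model being evaluated. The sketch below generates such index splits in plain Python; the function names are hypothetical, shuffling and repetition are omitted for brevity, and this is a generic illustration rather than the authors' two-step framework.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def nested_cv_splits(n, outer_k, inner_k):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation.

    Each inner split is a (fit, validation) pair drawn only from outer_train,
    so the outer test indices are never seen during selection or tuning.
    """
    for test in k_fold_indices(n, outer_k):
        test_set = set(test)
        train = [i for i in range(n) if i not in test_set]
        inner = []
        for fold in k_fold_indices(len(train), inner_k):
            val = [train[i] for i in fold]
            val_set = set(val)
            fit = [i for i in train if i not in val_set]
            inner.append((fit, val))
        yield train, test, inner
```

A repeated variant would shuffle the sample indices with a different seed before each call and average the outer-fold scores across repetitions.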


2016 ◽  
Vol 2016 ◽  
pp. 1-6 ◽  
Author(s):  
Gürcan Yavuz ◽  
Doğan Aydin

Optimal feature subset selection is an important and difficult task for pattern classification, data mining, and machine intelligence applications. The objective of feature subset selection is to eliminate irrelevant and noisy features in order to select optimal feature subsets and increase accuracy. A large number of features in a dataset increases the computational complexity, leading to performance degradation. In this paper, to overcome this problem, the angle modulation technique is used to reduce the feature subset selection problem to a four-dimensional continuous optimization problem instead of representing it as a high-dimensional bit vector. To demonstrate the effectiveness of the angle modulation representation and to determine the efficiency of the proposed method, six variants of the Artificial Bee Colony (ABC) algorithm employ angle modulation for feature selection. Experimental results on six high-dimensional datasets show that the Angle Modulated ABC algorithms improved classification accuracy with smaller feature subsets.
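To make the four-dimensional encoding concrete, the sketch below uses the generating function commonly associated with angle modulation, g(x) = sin(2π(x − a) · b · cos(2π(x − a) · c)) + d, sampled at evenly spaced points; a positive value yields a 1 bit (feature selected) and a non-positive value a 0 bit. The abstract does not spell out the function or the sampling scheme, so the integer sample points and the function name are assumptions.

```python
import math

def angle_modulated_bits(a, b, c, d, n_bits):
    """Map four continuous coefficients to an n_bits-long feature mask (sketch).

    Samples g(x) = sin(2*pi*(x - a) * b * cos(2*pi*(x - a) * c)) + d
    at x = 0, 1, ..., n_bits - 1 (assumed sampling) and thresholds at zero.
    """
    bits = []
    for x in range(n_bits):
        inner = 2 * math.pi * (x - a) * c
        g = math.sin(2 * math.pi * (x - a) * b * math.cos(inner)) + d
        bits.append(1 if g > 0 else 0)
    return bits
```

The optimizer then searches only over `(a, b, c, d)`, regardless of how many features the dataset has; the vertical shift `d` loosely controls how many bits come out as 1.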


2021 ◽  
Author(s):  
Jing Wang ◽  
Yuanzi Zhang ◽  
Minglin Hong ◽  
Haiyang He ◽  
Shiguo Huang

Feature selection is an important data preprocessing method in data mining and machine learning, yet it faces the “curse of dimensionality” when dealing with high-dimensional data. In this paper, a self-adaptive level-based learning artificial bee colony (SLLABC) algorithm is proposed for the high-dimensional feature selection problem. The SLLABC algorithm includes three new mechanisms: (1) A novel level-based learning mechanism is introduced to accelerate the convergence of the basic artificial bee colony algorithm. It divides the population into several levels, and the individuals on each level learn from the individuals on higher levels; in particular, the individuals on the highest level learn from each other. (2) A self-adaptive method is proposed to keep the balance between exploration and exploitation, which takes the diversity of the population into account to determine the number of levels. The lower the diversity, the fewer levels the population is divided into. (3) A new update mechanism is proposed to reduce the number of selected features. In this mechanism, if an offspring has an error rate higher than or equal to that of its parent but selects more features, the offspring is discarded and the parent is retained; otherwise, the offspring replaces its parent. Further, we discuss and analyze the contribution of these novelties to the diversity of the population and the classification performance. Finally, the results, compared with 8 state-of-the-art algorithms on 12 high-dimensional datasets, confirm the competitive performance of the proposed SLLABC in terms of both classification accuracy and the size of the feature subset.
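The third mechanism can be sketched as a small survivor-selection function. The version below follows a literal reading of the abstract: the parent is kept only when the offspring's error rate is no lower than the parent's and the offspring also selects more features; in every other case the offspring replaces the parent. That reading is an assumption, and the names and the dictionary layout are hypothetical.

```python
def select_survivor(parent, offspring):
    """Survivor selection sketch under a literal reading of the update rule.

    parent, offspring: dicts with keys "error" (float) and "mask" (list of 0/1),
    a hypothetical layout chosen for this example.
    """
    parent_size = sum(parent["mask"])
    offspring_size = sum(offspring["mask"])
    # Discard the offspring only if it is no better in error AND uses more features.
    no_better = offspring["error"] >= parent["error"]
    larger = offspring_size > parent_size
    if no_better and larger:
        return parent
    return offspring
```

Under this rule an offspring with the same error and the same or fewer features replaces its parent, which steadily pushes the population toward smaller subsets.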


2018 ◽  
Author(s):  
João Emanoel Ambrósio Gomes ◽  
Ricardo B. C. Prudêncio ◽  
André C. A. Nascimento

Group profiling methods aim to construct a descriptive profile for communities in social networks. Before a profiling algorithm is applied, it is necessary to collect and preprocess the users’ content information, i.e., to build a representation of each user in the network. Usually, existing group profiling strategies define the users’ representation by uniformly processing the entire content information in the network and then apply traditional feature selection methods to the user features in a group. However, such a strategy may ignore specific characteristics of each group. This can lead to a limited representation for some communities, disregarding attributes that are relevant from the network perspective and that describe a particular community more clearly than the others. In this context, we propose the community-based user’s representation method (CUR). In this proposal, feature selection algorithms are applied to the user features of each network community individually, aiming to assign relevant feature sets to each particular community. Such a strategy avoids the bias caused by larger communities on the overall user representation. Experiments were conducted on a co-authorship network to evaluate the CUR representation with different group profiling strategies, and the results were assessed by human evaluators. The results showed that profiles obtained after applying the CUR module were better than those obtained with the conventional users’ representation in 76.54% of the evaluations on average.


2019 ◽  
Vol 35 (1) ◽  
pp. 9-14 ◽  
Author(s):  
P. S. Maya Gopal ◽  
R Bhargavi

In agriculture, crop yield prediction is critical. Crop yield depends on various features, which can be categorized as geographical, climatic, and biological. Geographical features consist of cultivable land in hectares, canal length to cover the cultivable land, and the number of tanks and tube wells available for irrigation. Climatic features consist of rainfall, temperature, and radiation. Biological features consist of seeds, minerals, and nutrients. In total, 15 features were considered in this study to understand their impact on paddy crop yield for all seasons of each year. To select the vital features, five filter and wrapper approaches were applied. To assess the predictive accuracy of each feature selection algorithm, a Multiple Linear Regression (MLR) model was used. The RMSE, MAE, R, and RRMSE metrics were used to evaluate the performance of the feature selection algorithms. The data used for the analysis were drawn from secondary sources of the state Agriculture Department, Government of Tamil Nadu, India, covering more than 30 years. Seventy-five percent of the data was used for training and 25% for testing. Low computational time was also considered in selecting the best feature subset. All feature selection algorithms gave similar results in terms of RMSE, RRMSE, R, and MAE. The adjusted R² value was used to find the optimum feature subset despite all the deviations. The evaluation of the dataset used in this work shows that the total area of cultivation, the number of tanks and open wells used for irrigation, the length of canals used for irrigation, and the average maximum temperature during the crop season are the best features for crop yield prediction in the study area. The MLR model achieves 85% accuracy for the selected features with low computational time. Keywords: Feature selection algorithm, Model validation, Multiple linear regression, Performance metrics.
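The evaluation metrics named above can be computed as in the following sketch. The abstract does not define them, so the usual textbook definitions are assumed: RRMSE is taken as RMSE divided by the mean of the observed values (other normalizations exist), R² is the coefficient of determination, and adjusted R² penalizes R² by the number of features. The function name and the returned keys are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Compute RMSE, MAE, RRMSE, R^2 and adjusted R^2 (assumed definitions).

    Requires len(y_true) > n_features + 1 and a non-constant, non-zero-mean y_true.
    """
    n = len(y_true)
    mean_y = sum(y_true) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)    # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 shrinks R^2 as more features are used for the same n.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"rmse": rmse, "mae": mae, "rrmse": rmse / mean_y,
            "r2": r2, "adj_r2": adj_r2}
```

Because adjusted R² decreases when a feature is added without a matching gain in fit, it can separate feature subsets whose RMSE and MAE are nearly identical.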

