optimal feature subset
Recently Published Documents


TOTAL DOCUMENTS

114
(FIVE YEARS 53)

H-INDEX

11
(FIVE YEARS 3)

2022 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Deepti Sisodia ◽  
Dilip Singh Sisodia

Purpose
The problem of choosing the most useful features from hundreds of features in time-series user click data arises in online advertising when classifying fraudulent publishers. Selecting feature subsets is a key issue in such classification tasks. In practice, filter approaches are common, but they neglect correlations among features; conversely, wrapper approaches are often too computationally complex to apply. Moreover, existing feature selection methods struggle with such data, which is a major cause of unstable feature selection.

Design/methodology/approach
To overcome these issues, a majority-voting-based hybrid feature selection method, feature distillation and accumulated selection (FDAS), is proposed to find the optimal subset of relevant features for analyzing fraudulent publisher conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained by majority voting; (2) accumulated selection, where an accumulated evaluation of the relevant feature subset is used to search for an optimal feature subset with effective machine learning (ML) models.

Findings
Empirical results show improved classification performance with the proposed features in average precision, recall, F1-score and AUC for publisher identification and classification.

Originality/value
FDAS is evaluated on the FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizability: first with the original features, second with relevant feature subsets selected by feature selection (FS) methods, and third with the optimal feature subset obtained by the proposed approach. An ANOVA significance test demonstrates significant differences between independent features.
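The two-phase idea described above (several filter/wrapper selectors voting, then evaluation of the surviving subset) can be sketched generically. The dataset, the three base selectors, and the majority threshold below are illustrative stand-ins, not the authors' FDAS implementation:

```python
import numpy as np
from functools import partial
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 20 features, 5 of them informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Three base selectors: two filters and one wrapper, each keeping 8 features.
selectors = [
    SelectKBest(f_classif, k=8).fit(X, y),
    SelectKBest(partial(mutual_info_classif, random_state=0), k=8).fit(X, y),
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=8).fit(X, y),
]

# "Distillation": keep only features chosen by a majority (>= 2 of 3) of selectors.
votes = np.sum([s.get_support() for s in selectors], axis=0)
distilled = np.where(votes >= 2)[0]
print(distilled)
```

The distilled indices would then feed the accumulated-selection phase, where candidate subsets are re-evaluated with the downstream ML models.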


2022 ◽  
Vol 13 (1) ◽  
pp. 0-0

Feature selection is performed to eliminate irrelevant features and reduce computational overhead. Metaheuristic algorithms have become popular for feature selection due to their effectiveness and flexibility, and hybridizing two or more such metaheuristics has become a popular way to solve optimization problems. In this paper, we propose a hybrid wrapper feature selection technique based on the binary butterfly optimization algorithm (bBOA) and simulated annealing (SA). The SA is combined with the bBOA in a pipeline fashion: the best solution obtained by the bBOA is passed to the SA for further improvement, and the SA improves it by searching in its neighborhood, thereby enhancing the exploitation ability of the bBOA. The proposed method is tested on twenty datasets from the UCI repository, and the results are compared with five popular feature selection algorithms. The results confirm the effectiveness of the hybrid approach in improving classification accuracy and selecting the optimal feature subset.
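The SA refinement stage of such a pipeline can be sketched as follows. A random initial mask stands in for the bBOA's best solution, and the classifier, cooling schedule, and iteration count are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)

def fitness(mask):
    # Cross-validated accuracy of the candidate subset; empty masks score 0.
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()

cur = rng.random(X.shape[1]) < 0.5          # stand-in for the bBOA's best solution
cur_f = fitness(cur)
best, best_f = cur.copy(), cur_f
T = 1.0
for _ in range(60):
    cand = cur.copy()
    cand[rng.integers(X.shape[1])] ^= True  # neighborhood move: flip one feature
    f = fitness(cand)
    # Accept improvements always; accept worse moves with probability e^(dF/T).
    if f >= cur_f or rng.random() < np.exp((f - cur_f) / T):
        cur, cur_f = cand, f
        if f > best_f:
            best, best_f = cand.copy(), f
    T *= 0.9                                # geometric cooling
print(round(best_f, 3), int(best.sum()))
```

As the temperature T shrinks, worse moves become increasingly unlikely, so the search narrows toward pure exploitation, which is exactly the role the abstract assigns to the SA stage.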


Sensors ◽  
2021 ◽  
Vol 21 (24) ◽  
pp. 8370
Author(s):  
Ala Hag ◽  
Dini Handayani ◽  
Maryam Altalhi ◽  
Thulasyammal Pillai ◽  
Teddy Mantoro ◽  
...  

In real-life applications, electroencephalogram (EEG) signals for mental stress recognition require a conventional wearable device. This, in turn, requires an efficient number of EEG channels and an optimal feature set. This study aims to identify an optimal feature subset that can discriminate mental stress states while enhancing overall classification performance. We extracted multi-domain features in the time domain, frequency domain, and time-frequency domain, along with network connectivity features, to form a prominent feature vector space for stress. We then proposed a hybrid feature selection (FS) method using minimum redundancy maximum relevance with particle swarm optimization and support vector machines (mRMR-PSO-SVM) to select the optimal feature subset. The performance of the proposed method is evaluated and verified using four datasets, namely EDMSS, DEAP, SEED, and EDPMSC. To further validate these findings, the effectiveness of the proposed method is compared with that of state-of-the-art metaheuristic methods. The proposed model significantly reduced the feature vector space by an average of 70% compared with the state-of-the-art methods while significantly increasing overall detection performance.
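The wrapper stage of an mRMR-PSO-SVM pipeline can be sketched with a binary particle swarm whose fitness is SVM cross-validation accuracy. This is a generic re-creation under stated assumptions (the mRMR pre-filtering step is omitted, and swarm size, coefficients, and dataset are illustrative), not the authors' code:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(1)
n_particles, n_feat = 8, X.shape[1]

def fitness(p):
    mask = p.astype(bool)
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(), X[:, mask], y, cv=3).mean()

# Binary positions (0/1 per feature) and real-valued velocities.
pos = (rng.random((n_particles, n_feat)) < 0.5).astype(int)
vel = rng.normal(0.0, 1.0, (n_particles, n_feat))
pbest = pos.copy()
pbest_f = np.array([fitness(p) for p in pos])
g = int(pbest_f.argmax())
gbest, gbest_f = pbest[g].copy(), pbest_f[g]

for _ in range(10):
    r1 = rng.random((n_particles, n_feat))
    r2 = rng.random((n_particles, n_feat))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    # Sigmoid transfer function turns velocities back into bit probabilities.
    pos = (rng.random((n_particles, n_feat)) < 1.0 / (1.0 + np.exp(-vel))).astype(int)
    for i, p in enumerate(pos):
        f = fitness(p)
        if f > pbest_f[i]:
            pbest[i], pbest_f[i] = p.copy(), f
            if f > gbest_f:
                gbest, gbest_f = p.copy(), f
print(round(gbest_f, 3), int(gbest.sum()))
```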


Biosensors ◽  
2021 ◽  
Vol 11 (12) ◽  
pp. 499
Author(s):  
Chien-Te Wu ◽  
Hao-Chuan Huang ◽  
Shiuan Huang ◽  
I-Ming Chen ◽  
Shih-Cheng Liao ◽  
...  

Major depressive disorder (MDD) is a global healthcare issue and one of the leading causes of disability. Machine learning combined with non-invasive electroencephalography (EEG) has recently been shown to have the potential to diagnose MDD. However, most of these studies analyzed small samples of participants recruited from a single source, raising serious concerns about the generalizability of these results in clinical practice. Thus, it has become critical to re-evaluate the efficacy of various common EEG features for MDD detection across large and diverse datasets. To address this issue, we collected resting-state EEG data from 400 participants across four medical centers and tested the classification performance of four common EEG features: band power (BP), coherence, Higuchi’s fractal dimension, and Katz’s fractal dimension. Then, a sequential backward selection (SBS) method was used to determine the optimal feature subset. To overcome the large data variability introduced by the increased data size and multi-site EEG recordings, we applied the conformal kernel (CK) transformation to further improve the MDD versus healthy control (HC) classification performance of the support vector machine (SVM). The results show that (1) coherence features account for 98% of the optimal feature subset; (2) the CK-SVM outperforms other classifiers such as K-nearest neighbors (K-NN), linear discriminant analysis (LDA), and the standard SVM; (3) the combination of the optimal feature subset and CK-SVM achieves a high five-fold cross-validation accuracy of 91.07% on the training set (140 MDD and 140 HC) and 84.16% on the independent test set (60 MDD and 60 HC). These results suggest that coherence-based connectivity is a more reliable feature for achieving high and generalizable MDD detection performance in real-life clinical practice.
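Sequential backward selection, as used above, starts from the full feature set and repeatedly drops the feature whose removal hurts least (or helps most). A minimal sketch of the generic procedure (a plain SVM and a toy dataset stand in; the conformal-kernel transform is not reproduced):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def score(feats):
    return cross_val_score(SVC(), X[:, feats], y, cv=5).mean()

selected = list(range(X.shape[1]))
best = score(selected)
while len(selected) > 1:
    # Try removing each remaining feature; keep the removal that scores best.
    s, drop = max((score([f for f in selected if f != d]), d) for d in selected)
    if s < best:          # stop once every possible removal hurts performance
        break
    best = s
    selected.remove(drop)
print(selected, round(best, 3))
```

scikit-learn's `SequentialFeatureSelector(direction="backward")` implements the same idea for a fixed target subset size; the manual loop above additionally shows the early-stopping criterion.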


2021 ◽  
Vol 12 ◽  
Author(s):  
Dongxu Zhao ◽  
Zhixia Teng ◽  
Yanjuan Li ◽  
Dong Chen

Recently, several anti-inflammatory peptides (AIPs) have been found in the inflammatory response process, and these peptides have been used to treat some inflammatory and autoimmune diseases. Therefore, accurately identifying AIPs from given amino acid sequences is critical for the discovery of novel and efficient anti-inflammatory peptide-based therapeutics and for accelerating their application in therapy. In this paper, a random forest-based model called iAIPs is proposed for identifying AIPs. First, the original samples were encoded with three feature extraction methods: g-gap dipeptide composition (GDC), dipeptide deviation from the expected mean (DDE), and amino acid composition (AAC). Second, the optimal feature subset was generated by a two-step feature selection method, in which features were ranked by the analysis of variance (ANOVA) method and the optimal subset was generated by an incremental feature selection strategy. Finally, the optimal feature subset was fed into a random forest classifier to construct the identification model. Experimental results showed that iAIPs achieved an AUC value of 0.822 on an independent test dataset, indicating that the proposed model performs better than existing methods. Furthermore, the extracted peptide-sequence features provide a basis for evolutionary analysis; the study of peptide identification helps in understanding species diversity and analyzing the evolutionary history of species.
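The two-step scheme (ANOVA ranking, then incremental feature selection) can be sketched generically. Synthetic data stands in for the GDC/DDE/AAC peptide encodings, which are not reproduced here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded peptide features.
X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)

# Step 1: rank features by ANOVA F-value, best first.
order = np.argsort(f_classif(X, y)[0])[::-1]

# Step 2: grow the subset one ranked feature at a time; keep the best-scoring size.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = [cross_val_score(clf, X[:, order[:k]], y, cv=3).mean()
          for k in range(1, len(order) + 1)]
k_opt = int(np.argmax(scores)) + 1
print(k_opt, round(max(scores), 3))
```

The incremental step turns a bare ranking into an actual subset by letting the classifier, rather than the filter statistic alone, decide where to cut the ranked list.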


Author(s):  
Ala Hag ◽  
Dini Handayani ◽  
Maryam Altalhi ◽  
Thulasyammal Pillai ◽  
Teddy Mantoro ◽  
...  

Mental stress state recognition using electroencephalogram (EEG) signals in real-life applications needs a conventional wearable device. This requires an efficient number of EEG channels and an optimal feature set. The main objective of this study is to identify an optimal feature subset that can best discriminate mental stress states while enhancing overall performance. Thus, multi-domain feature extraction methods were employed, namely time-domain, frequency-domain, time-frequency-domain, and network connectivity features, to form a large feature vector space. To avoid the computational complexity of this high-dimensional space, a hybrid feature selection (FS) method combining minimum redundancy maximum relevance with particle swarm optimization and a support vector machine (mRMR-PSO-SVM) is proposed to remove noisy, redundant, and irrelevant features and keep the optimal feature subset. The performance of the proposed method is evaluated and verified using four datasets, namely EDMSS, DEAP, SEED, and EDPMSC. To further validate these findings, the effectiveness of the proposed method is compared with that of state-of-the-art heuristic methods. The proposed model significantly reduced the feature vector space by an average of 70% in comparison with the state-of-the-art methods while significantly increasing overall detection performance.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Jia Yun-Tao ◽  
Zhang Wan-Qiu ◽  
He Chun-Lin

For high-dimensional data with many redundant features, existing feature selection algorithms still suffer from the “curse of dimensionality.” In view of this, this paper studies a new two-phase evolutionary feature selection algorithm called the clustering-guided integer brain storm optimization algorithm (IBSO-C). In the first phase, an importance-guided feature clustering method is proposed to group similar features, so that the search space in the second phase is markedly reduced. The second phase focuses on finding the optimal feature subset using an improved integer brain storm optimization. Moreover, a new encoding strategy and a time-varying integer update method for individuals are proposed to improve the search performance of brain storm optimization in the second phase. Since the number of feature clusters is far smaller than the number of original features, IBSO-C can find an optimal feature subset quickly. Experimental results on several real-world datasets show that, compared with several existing algorithms, IBSO-C finds feature subsets with high classification accuracy at a lower computational cost.
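The first-phase idea (cluster similar features, then search over clusters instead of raw features) can be sketched as follows. This is a hypothetical re-creation of the clustering step only, using correlation-based hierarchical clustering and F-value importance; the integer brain storm optimizer itself is not reproduced, and the distance threshold is an illustrative choice:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

X, y = load_breast_cancer(return_X_y=True)

# Distance between features: 1 - |correlation|, so correlated features cluster.
corr = np.abs(np.corrcoef(X, rowvar=False))
dist = squareform(np.clip(1.0 - corr, 0.0, None), checks=False)
labels = fcluster(linkage(dist, method="average"), t=0.3, criterion="distance")

# Keep one representative per cluster: the member with the highest F-value.
importance = f_classif(X, y)[0]
reps = [int(np.where(labels == c)[0][np.argmax(importance[labels == c])])
        for c in np.unique(labels)]
print(len(reps), sorted(reps)[:5])
```

A downstream search then only has to decide among the representatives, which is why the reduced space makes the second phase fast.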


2021 ◽  
pp. 1-18
Author(s):  
Zhang Zixian ◽  
Liu Xuning ◽  
Li Zhixiang ◽  
Hu Hongqiang

The influencing factors of coal and gas outburst are complex, and the accuracy and efficiency of current outburst prediction are not high. To obtain effective features from the influencing factors and realize accurate, fast dynamic prediction of coal and gas outbursts, this article proposes an outburst prediction model that couples feature selection with an intelligently optimized classifier. First, in view of the redundancy and irrelevance among the influencing factors of coal and gas outburst, the Boruta feature selection method is used to obtain the optimal feature subset from those factors. Second, the Apriori association-rule mining method is used to mine the internal associations among the influencing factors and to extract the strong association rules, among both factors and samples, that affect outburst classification. Finally, an SVM is used to classify coal and gas outbursts based on the obtained optimal feature subset and sample data, with a Bayesian optimization algorithm tuning the SVM kernel parameters; the resulting pattern recognition model is compared with existing coal and gas outburst prediction models from the literature. Compared with using feature selection or association-rule mining alone, the proposed model achieves its highest prediction accuracy of 93% at a feature dimension of 3, higher than Apriori association rules or Boruta feature selection alone, with significantly improved classification accuracy and a markedly reduced feature dimension. The results show that the proposed model outperforms the other prediction models, further verifying the accuracy and applicability of the coupled prediction model and its high stability and robustness.
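The final stage pairs an SVM with kernel-parameter tuning. As a runnable stand-in for the Bayesian optimizer, a coarse log-spaced grid search over (C, gamma) is shown here on a toy dataset; the Boruta and Apriori stages are omitted:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Coarse log-spaced grid over the RBF kernel parameters C and gamma.
grid = {"C": np.logspace(-1, 2, 4), "gamma": np.logspace(-4, -1, 4)}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

A Bayesian optimizer explores the same (C, gamma) space but spends each evaluation where its surrogate model predicts the most improvement, which typically needs far fewer classifier fits than an exhaustive grid.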


2021 ◽  
pp. 1-18
Author(s):  
Rikta Sen ◽  
Saptarsi Goswami ◽  
Ashis Kumar Mandal ◽  
Basabi Chakraborty

The Jeffries-Matusita (JM) distance, a transformation of the Bhattacharyya distance, is a widely used measure of the spectral separability between two class density functions and is generally used as a class separability measure. It therefore has good potential for evaluating how effectively a feature discriminates two classes. The capability of the JM distance as a ranking-based feature selection technique for binary classification problems has been verified in several research works as well as in our earlier work; our simulation experiments with benchmark datasets found that the JM distance works as well as other popular feature ranking methods based on mutual information, information gain, or Relief. Extensions of the JM distance for feature ranking in multiclass problems have also been reported in the literature, but all of them are rank-based approaches that deliver a ranking of the features and do not automatically produce the final optimal feature subset. In this work, a novel heuristic approach for finding the optimal feature subset from JM-distance-based ranked feature lists for multiclass problems is developed without explicitly using any specific search technique. The proposed approach integrates the extension of the JM measure to multiclass problems and the selection of the final optimal feature subset in a unified process. The performance of the proposed algorithm is evaluated by simulation experiments with benchmark datasets, in comparison with two previously developed rank-based feature selection algorithms using multiclass JM distance measures (weighted average JM distance and another multiclass extension equivalent to the Bhattacharyya bound) and several other popular filter-based feature ranking algorithms. The proposed algorithm performs better in terms of classification accuracy, F-measure, and AUC, with a reduced feature set and lower computational cost.
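For the binary case the per-feature JM distance has a closed form under a univariate Gaussian assumption: JM = 2(1 - e^(-B)), with B the Bhattacharyya distance between the two class-conditional densities. A small illustrative ranking (not the paper's multiclass extension):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
a, b = X[y == 0], X[y == 1]

m1, m2 = a.mean(axis=0), b.mean(axis=0)
v1, v2 = a.var(axis=0) + 1e-12, b.var(axis=0) + 1e-12  # epsilon avoids div-by-zero

# Bhattacharyya distance per feature, then JM = 2(1 - e^-B), bounded in [0, 2).
B = (m1 - m2) ** 2 / (4 * (v1 + v2)) + 0.5 * np.log((v1 + v2) / (2 * np.sqrt(v1 * v2)))
jm = 2 * (1 - np.exp(-B))

ranking = np.argsort(jm)[::-1]          # larger JM distance = more separable feature
print(ranking[:5], np.round(jm[ranking[:5]], 3))
```

The saturation at 2 is what distinguishes JM from the raw Bhattacharyya distance: well-separated features all approach 2 rather than growing without bound, which keeps the ranking scale comparable across features.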


2021 ◽  
Vol 12 ◽  
Author(s):  
Guoqing Liu ◽  
Shuangjian Song ◽  
Qiguo Zhang ◽  
Biyu Dong ◽  
Yu Sun ◽  
...  

Characterization and identification of recombination hotspots provide important insights into the mechanisms of recombination and genome evolution. In contrast with existing sequence-based models for predicting recombination hotspots, which were defined in an ORF-based manner, here we first defined recombination hot/cold spots based on public high-resolution Spo11-oligo-seq data, then characterized them in terms of DNA sequence and epigenetic marks, and finally presented classifiers to identify hotspots. We found that, in addition to some previously discovered DNA-based features like GC-skew, recombination hotspots in yeast can also be characterized by remarkable features associated with DNA physical properties and shape. More importantly, using DNA-based features and several epigenetic marks, we built several classifiers to discriminate hotspots from coldspots and found that an SVM classifier performs best, with an accuracy of ∼92%, the highest among the models compared. Feature importance analysis combined with the prediction results shows that epigenetic marks and the variation of sequence-based features along the hotspots contribute dominantly to hotspot identification. Using an incremental feature selection method, an optimal feature subset consisting of far fewer features was obtained without sacrificing prediction accuracy.
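GC-skew, one of the DNA-based features the abstract mentions, is the normalized excess of G over C nucleotides, typically computed in sliding windows. A small illustrative helper (the window and step sizes are arbitrary choices, not the authors' pipeline):

```python
def gc_skew(seq, window=100, step=50):
    """(G - C) / (G + C) per sliding window; 0.0 where a window has no G or C."""
    out = []
    for i in range(0, max(len(seq) - window, 0) + 1, step):
        w = seq[i:i + window]
        g, c = w.count("G"), w.count("C")
        out.append((g - c) / (g + c) if g + c else 0.0)
    return out

# Each 60 bp window of this toy sequence holds 40 G and 20 C, so skew = 1/3.
print(gc_skew("GGGGCC" * 50, window=60, step=60))
```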

