Preserving Ordinal Consensus: Towards Feature Selection for Unlabeled Data

2020 ◽  
Vol 34 (01) ◽  
pp. 75-82
Author(s):  
Jun Guo ◽  
Heng Chang ◽  
Wenwu Zhu

To better pre-process unlabeled data, most existing feature selection methods remove redundant and noisy information by exploring intrinsic structures embedded in the samples. However, these unsupervised studies focus too much on the relations among samples, entirely neglecting feature-level geometric information. This paper proposes an unsupervised triplet-induced graph to explore a new type of potential structure at the feature level, and incorporates it into simultaneous feature selection and clustering. In the feature selection part, we design an ordinal consensus preserving term based on the triplet-induced graph. This term constrains the projection vectors to preserve the relative proximity of the original features, which contributes to selecting more relevant features. In the clustering part, Self-Paced Learning (SPL) is introduced to gradually learn from ‘easy’ to ‘complex’ samples; SPL alleviates the risk of falling into bad local minima caused by noise and outliers. Specifically, we propose a compelling regularizer for SPL to obtain a robust loss. Finally, an alternating minimization algorithm is developed to optimize the proposed model efficiently. Extensive experiments on different benchmark datasets consistently demonstrate the superiority of the proposed method.
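As an illustration of the self-paced learning component only (the paper's own regularizer is not reproduced here), the following minimal Python/NumPy sketch shows the classic hard-threshold SPL weighting, in which samples whose loss falls below a pace parameter are admitted first and the threshold is then relaxed; the losses and parameter values are placeholders.

```python
import numpy as np

def spl_weights(losses, lam):
    """Hard self-paced weights: include a sample only if its loss is below lam.

    Corresponds to the classic SPL regularizer f(v) = -lam * sum(v); the
    closed-form minimizer is v_i = 1 if loss_i < lam else 0.
    """
    return (losses < lam).astype(float)

# Toy alternating loop: reweight samples, then relax the pace parameter so
# progressively 'harder' (higher-loss) samples are admitted.
losses = np.array([0.1, 0.4, 2.5, 0.3, 5.0])   # per-sample losses from some model
lam = 0.5
for step in range(3):
    v = spl_weights(losses, lam)
    print(f"step {step}: lam={lam:.2f}, active samples = {np.flatnonzero(v)}")
    lam *= 2.0   # grow the pace parameter: learn from easy to complex samples
```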

2015 ◽  
Vol 1 (311) ◽  
Author(s):  
Katarzyna Stąpor

Discriminant analysis can best be defined as a technique which allows the classification of an individual into one of several distinct populations on the basis of a set of measurements. Stepwise discriminant analysis (SDA) is concerned with selecting the most important variables while retaining the highest possible discrimination power. The process of selecting a smaller number of variables is often necessary for a variety of reasons. In existing statistical software packages, SDA is based on classic feature selection methods, and many problems with such stepwise procedures have been identified. In this work a new method based on the tabu search metaheuristic is presented, together with experimental results on selected benchmark datasets. The results are promising.
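As a hedged sketch of the general idea (not the algorithm described in the paper), the following Python example runs a simple tabu search over binary feature masks, scoring each candidate subset by cross-validated LDA accuracy; the dataset, tabu tenure, and iteration count are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def tabu_feature_search(X, y, n_iter=30, tabu_tenure=5, seed=0):
    """Tabu search over binary feature masks, scored by cross-validated LDA accuracy."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    mask = rng.integers(0, 2, n_features).astype(bool)
    if not mask.any():
        mask[rng.integers(n_features)] = True

    def score(m):
        return cross_val_score(LinearDiscriminantAnalysis(), X[:, m], y, cv=5).mean()

    best_mask, best_score = mask.copy(), score(mask)
    tabu = {}  # feature index -> iteration until which flipping it is forbidden
    current = mask.copy()
    for it in range(n_iter):
        candidates = []
        for j in range(n_features):
            neighbour = current.copy()
            neighbour[j] = not neighbour[j]      # move = flip one feature in/out
            if not neighbour.any():
                continue
            s = score(neighbour)
            # Aspiration criterion: a tabu move is allowed if it beats the best so far.
            if tabu.get(j, -1) < it or s > best_score:
                candidates.append((s, j, neighbour))
        if not candidates:
            break
        s, j, current = max(candidates, key=lambda c: c[0])
        tabu[j] = it + tabu_tenure
        if s > best_score:
            best_mask, best_score = current.copy(), s
    return best_mask, best_score

X, y = load_wine(return_X_y=True)
mask, acc = tabu_feature_search(X, y)
print("selected features:", np.flatnonzero(mask), "cv accuracy: %.3f" % acc)
```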


Entropy ◽  
2019 ◽  
Vol 21 (10) ◽  
pp. 988 ◽  
Author(s):  
Fazakis ◽  
Kanas ◽  
Aridas ◽  
Karlos ◽  
Kotsiantis

One of the major aspects affecting the performance of classification algorithms is the amount of labeled data available during the training phase. It is widely accepted that labeling vast amounts of data is both expensive and time-consuming, since it requires human expertise. In a wide variety of scientific fields, unlabeled examples are easy to collect but hard to exploit in a way that adds useful information to a dataset. In this context, a variety of learning methods have been studied in the literature that aim to efficiently utilize the vast amounts of unlabeled data during the learning process. The most common approaches tackle problems of this kind by applying either active learning or semi-supervised learning methods individually. In this work, a combination of active learning and semi-supervised learning methods is proposed, under a common self-training scheme, in order to efficiently utilize the available unlabeled data. Entropy and the distribution of class probabilities over the unlabeled set are used as effective and robust criteria for selecting the most suitable unlabeled examples with which to augment the initial labeled set. The superiority of the proposed scheme is validated by comparing it against the baseline approaches of supervised, semi-supervised, and active learning on a wide range of fifty-five benchmark datasets.
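To make the entropy criterion concrete, here is a minimal Python sketch (not the authors' exact scheme) in which a classifier's predictive entropy over the unlabeled pool drives both halves of the idea: low-entropy examples are pseudo-labeled to augment the labeled set, while high-entropy examples are the ones an active learner would send to an annotator. The dataset, classifier, and selection counts are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy pools: a small labeled set and a larger unlabeled pool.
X, y = make_classification(n_samples=600, n_informative=5, random_state=0)
X_lab, y_lab, X_unlab = X[:50], y[:50], X[50:]

clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
proba = clf.predict_proba(X_unlab)
entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)

# Semi-supervised half: the lowest-entropy (most confident) examples are
# pseudo-labeled and appended to the labeled set.
confident = np.argsort(entropy)[:20]
X_aug = np.vstack([X_lab, X_unlab[confident]])
y_aug = np.concatenate([y_lab, clf.predict(X_unlab[confident])])

# Active-learning half: the highest-entropy examples would be sent to a
# human annotator; here we only report their indices.
to_query = np.argsort(entropy)[-10:]
print("pseudo-labeled:", len(confident), "queried for labels:", to_query)
```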


BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Zhixun Zhao ◽  
Xiaocai Zhang ◽  
Fang Chen ◽  
Liang Fang ◽  
Jinyan Li

Abstract Background DNA N4-methylcytosine (4mC) is a critical epigenetic modification and has various roles in the restriction-modification system. Due to the high cost of experimental laboratory detection, computational methods using sequence characteristics and machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because of the lack of effective sequence features and the ad hoc choice of learning algorithms. This paper aims to propose a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem. Results The feature importance score distributions in datasets of six species are first reported and analyzed. The impact of feature selection on model performance is then evaluated by independent testing on benchmark datasets, where the ACC and MCC measurements after feature selection increase by 2.3% to 9.7% and 0.05 to 0.19, respectively. The proposed method is compared with three state-of-the-art predictors using independent tests and 10-fold cross-validation, and it outperforms them on all datasets, improving ACC by 3.02% to 7.89% and MCC by 0.06 to 0.15 in the independent test. Two detailed case studies confirm the excellent overall performance of the proposed method, which correctly identified 24 of 26 4mC sites from the C. elegans gene and 126 out of 137 4mC sites from the D. melanogaster gene. Conclusions The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies prove the effectiveness of our method in practical situations.
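A hedged sketch of the kind of importance-based feature selection and ACC/MCC evaluation described above, with synthetic stand-in features rather than the paper's 4mC sequence feature space, and a random forest standing in for the paper's learning algorithm:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

# Stand-in data for sequence-derived features (the paper's 4mC feature space
# is not reproduced here).
X, y = make_classification(n_samples=1000, n_features=200, n_informative=20,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Rank features by importance on the training split, then keep the top k.
ranker = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
top_k = np.argsort(ranker.feature_importances_)[::-1][:50]

# Retrain on the selected features and report ACC and MCC on the held-out set.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr[:, top_k], y_tr)
pred = clf.predict(X_te[:, top_k])
print("ACC: %.3f  MCC: %.3f" % (accuracy_score(y_te, pred),
                                matthews_corrcoef(y_te, pred)))
```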


2018 ◽  
Vol 6 (1) ◽  
pp. 58-72
Author(s):  
Omar A. M. Salem ◽  
Liwei Wang

Building classification models from real-world datasets has become a difficult task, especially for datasets with high-dimensional features. Unfortunately, these datasets may include irrelevant or redundant features which have a negative effect on classification performance. Selecting the significant features and eliminating undesirable ones can improve the classification models. Fuzzy mutual information is a widely used feature selection criterion for finding the best feature subset before the classification process; however, it requires considerable computation and storage space. To overcome these limitations, this paper proposes an improved fuzzy mutual information feature selection method based on representative samples. Experiments on benchmark datasets show that the proposed method achieves better results in terms of classification accuracy, selected feature subset size, storage, and stability.
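As a rough analogue only (classical rather than fuzzy mutual information, and a random subsample standing in for the representative samples), the following Python sketch estimates feature relevance on a reduced sample set to limit computation and storage:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Estimate mutual information on a random subsample rather than the full
# dataset, echoing the idea of working with representative samples to cut
# computation and storage.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=200, replace=False)
mi = mutual_info_classif(X[idx], y[idx], random_state=0)

top = np.argsort(mi)[::-1][:10]
print("top-10 features by mutual information:", top)
```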


Author(s):  
J Qu ◽  
Z Liu ◽  
M J Zuo ◽  
H-Z Huang

Feature selection is an effective way of improving classification, reducing feature dimension, and speeding up computation. This work studies a reported support vector machine (SVM) based method of feature selection. Our results reveal discrepancies in both its feature ranking and feature selection schemes. Modifications are thus made, on the basis of which our SVM-based method of feature selection is proposed. Using a weighting fusion technique and the one-against-all approach, our binary model has been extended to multi-class classification problems. Three benchmark datasets are employed to demonstrate the performance of the proposed method. The multi-class model of the proposed method is also used for feature selection in planetary gear damage degree classification. The results on all datasets exhibit the consistently effective classification made possible by the proposed method.
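A minimal sketch of SVM-weight-based feature ranking with a one-against-all multi-class model, assuming a simple sum-of-magnitudes weighting fusion (the paper's exact ranking and fusion rules may differ):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# One-vs-rest linear SVMs; each row of coef_ is the weight vector of one
# binary "class versus all" problem.
svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# Simple weighting fusion: combine the per-class weight magnitudes into a
# single relevance score per feature, then rank.
scores = np.abs(svm.coef_).sum(axis=0)
ranking = np.argsort(scores)[::-1]
print("feature ranking (best first):", ranking)
```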


2019 ◽  
Vol 16 (8) ◽  
pp. 3603-3607 ◽  
Author(s):  
Shraddha Khonde ◽  
V. Ulagamuthalvi

In the current network scenario, hackers and intruders have become a major threat. As new technologies emerge rapidly and computers are used ever more extensively, security plays a critical role: most computers in a network can easily be compromised by attacks, and the increase in new types of attack is a serious concern. Security of sensitive data is a high-priority issue that must be addressed immediately. Highly efficient Intrusion Detection Systems (IDS) are available nowadays that detect various types of attacks on a network, but an IDS is needed that is intelligent enough to detect and analyze all types of new threats, and maximum accuracy is expected from any such intelligent intrusion detection system. An Intrusion Detection System can be hardware or software that monitors and analyzes all network activities to detect malicious activity inside the network. It also informs and helps the administrator to deal with malicious packets which, if they enter the network, can harm many of the connected computers. In our work we have implemented an intelligent IDS that helps the administrator analyze real-time network traffic by classifying packets entering the system as normal or malicious. This paper mainly focuses on the techniques used for feature selection to reduce the number of features from the KDD-99 dataset. It also explains the classification algorithm used, Random Forest, which builds a forest of trees to classify a real-time packet as normal or malicious. Random Forest uses ensembling: the final output is derived by combining the outputs of the trees that make up the forest. The dataset used in the experiments is KDD-99, which is used to train all the trees to obtain higher accuracy with the random forest. The results show that the random forest algorithm gives higher accuracy in a distributed network with a reduced false alarm rate.
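A hedged sketch of the overall pipeline, with synthetic data standing in for the 41 KDD-99 features (loading and encoding the real dataset is omitted) and a simple filter-based feature selection step feeding a random forest that votes packets as normal or malicious:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in for the 41 KDD-99 features; class 1 plays the role of "malicious".
X, y = make_classification(n_samples=5000, n_features=41, n_informative=12,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0,
                                          stratify=y)

# Feature selection trims the 41 inputs, then an ensemble of trees whose
# votes are combined into the final normal / malicious decision.
ids = make_pipeline(SelectKBest(mutual_info_classif, k=15),
                    RandomForestClassifier(n_estimators=100, random_state=0))
ids.fit(X_tr, y_tr)
print(classification_report(y_te, ids.predict(X_te),
                            target_names=["normal", "malicious"]))
```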


2019 ◽  
Vol 28 (05) ◽  
pp. 1950016
Author(s):  
H. Benjamin Fredrick David ◽  
A. Suruliandi ◽  
S. P. Raja

Data mining integrates statistical analysis, machine learning and database technology to extract hidden patterns and relationships from data. The presence of irrelevant, redundant and inconsistent attributes in the data leads to poor classification accuracy. In this paper, a novel bio-inspired heuristic swarm optimization algorithm for feature selection, namely the Constructive Lazy Wolf Search Algorithm, is proposed on the backbone of the Wolf Search Algorithm. It is based on the behavior of real wolves, which search for food while surviving threats by avoiding them. Based on a study of wolf behavior, two natural factors, laziness and health, are introduced to attain the highest efficiency. Restricting and controlling the wolves' behavior by allowing only healthy and constructive lazy wolves to take part in the search reduces the search time and the complexity required to find the best fitness. The proposed algorithm is applied to a prisoner dataset for crime propensity prediction, along with a few benchmark datasets, to demonstrate the stability of the improved performance compared with other bio-inspired optimization algorithms. The accuracy achieved by fine-tuning the proposed algorithm was 98.19%, supporting accurate crime prevention.
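The Constructive Lazy Wolf Search Algorithm itself is not reproduced here; as a loose illustration of wrapper-style, population-based feature selection, the following Python sketch keeps a small population of candidate feature masks and accepts random bit-flip moves only when cross-validated fitness improves. All parameters and the dataset are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n_feat = X.shape[1]

def fitness(mask):
    """Wrapper fitness: cross-validated accuracy of a classifier on the subset."""
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()

# A small population of candidate feature masks (the "agents"); each iteration,
# every agent flips a few bits and keeps the move only if fitness improves.
pop = rng.integers(0, 2, size=(6, n_feat)).astype(bool)
scores = np.array([fitness(m) for m in pop])
for _ in range(10):
    for i in range(len(pop)):
        trial = pop[i].copy()
        flips = rng.choice(n_feat, size=3, replace=False)
        trial[flips] = ~trial[flips]
        s = fitness(trial)
        if s > scores[i]:
            pop[i], scores[i] = trial, s

best = pop[scores.argmax()]
print("best CV accuracy %.3f with %d features" % (scores.max(), best.sum()))
```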


2020 ◽  
Vol 9 (2) ◽  
pp. 59-79
Author(s):  
Heisnam Rohen Singh ◽  
Saroj Kr Biswas

Recent trends in data mining and machine learning focus on knowledge extraction and explanation, to make crucial decisions from data; however, data is virtually enormous in size and mostly associated with noise. Neuro-fuzzy systems are well suited to representing knowledge in a data-driven environment. Many neuro-fuzzy systems have been proposed for feature selection and classification, but they focus more on the quantitative aspect (accuracy) than the qualitative one (transparency). Such neuro-fuzzy systems for feature selection and classification include Enhance Neuro-Fuzzy (ENF) and Adaptive Dynamic Clustering Neuro-Fuzzy (ADCNF). Here a neuro-fuzzy system is proposed for feature selection and classification with improved accuracy and transparency. The novelty of the proposed system lies in determining a significant number of linguistic features for each input and in suggesting a compelling order of classification rules using the importance of the input features and the certainty of the rules. The performance of the proposed system is tested on 8 benchmark datasets. 10-fold cross-validation is used to compare the accuracy of the systems, and other performance measures such as false positive rate, precision, recall, f-measure, Matthews correlation coefficient and Nauck's index are also used for comparison. The experimental results show that the proposed system is superior to the existing neuro-fuzzy systems.
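The neuro-fuzzy systems themselves are not sketched here, but the evaluation protocol described above can be illustrated in a few lines of Python: 10-fold cross-validation reporting accuracy, precision, recall, f-measure and Matthews correlation coefficient, with an ordinary classifier standing in for the compared systems (Nauck's index has no off-the-shelf scorer and is omitted).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A decision tree stands in for the compared systems so the 10-fold protocol
# and the metric set can be shown.
scoring = {"acc": "accuracy", "precision": "precision", "recall": "recall",
           "f1": "f1", "mcc": make_scorer(matthews_corrcoef)}
cv = cross_validate(DecisionTreeClassifier(random_state=0), X, y,
                    cv=10, scoring=scoring)
for name in scoring:
    print("%s: %.3f" % (name, cv["test_" + name].mean()))
```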


2014 ◽  
Vol 536-537 ◽  
pp. 450-453 ◽  
Author(s):  
Jiang Jiang ◽  
Xi Chen ◽  
Hai Tao Gan

In this paper, a sparsity-based model is proposed for feature selection in kernel minimum squared error (KMSE). By imposing a sparsity shrinkage term, we formulate the procedure of subset selection as an optimization problem. With the chosen small portion of training examples, the computational burden of feature extraction is largely alleviated. Experimental results on several benchmark datasets indicate the effectiveness and efficiency of our method.
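As a hedged illustration of the idea (not the paper's exact formulation), the following Python sketch fits a kernel minimum squared error model with an L1 shrinkage term on the expansion coefficients, so that only a small portion of training examples keep nonzero coefficients and need to be evaluated at prediction time; the kernel width and penalty strength are placeholders.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Lasso
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
t = np.where(y == 1, 1.0, -1.0)          # +/-1 targets, as in minimum squared error

# Least-squares fit of the kernel expansion K(X, X) @ alpha ~= t with an L1
# shrinkage term, so only a sparse subset of expansion coefficients survive.
K = rbf_kernel(X, X, gamma=0.05)
model = Lasso(alpha=0.01, max_iter=50000).fit(K, t)

# Prediction only needs kernel evaluations against the retained examples.
support = np.flatnonzero(model.coef_)
pred = np.sign(rbf_kernel(X, X[support], gamma=0.05) @ model.coef_[support]
               + model.intercept_)
print("retained examples: %d / %d, training accuracy: %.3f"
      % (len(support), len(X), (pred == t).mean()))
```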

