BDselect: a package for k-mer selection based on the binomial distribution

Background: The curse of dimensionality is often associated with feature extraction. The extracted features may contain a large amount of redundant information, which strains computing resources and leads to overfitting. Objective: Feature selection is an important strategy for overcoming these problems. In most machine learning tasks, the features determine the upper limit of model performance, so new feature selection methods are needed to optimize them. Methods: In this paper, we introduce a new technique for optimizing sequence features based on the binomial distribution (BD). First, the principle of the binomial distribution algorithm is introduced in detail. Then, the proposed algorithm is compared with other commonly used feature selection methods on three different types of datasets, using a Random Forest classifier with the same parameters. Results and Conclusion: The results confirm that BD yields a promising improvement in feature selection and classification accuracy. Finally, we provide the source code and an executable program package (http://lin-group.cn/server/BDselect/), with which users can easily apply our algorithm in their research.
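The abstract does not state the scoring rule, so the following is only a minimal sketch of one plausible binomial-distribution formulation, not the BDselect implementation. It assumes that a k-mer occurring N_i times in total, n_ij of them in class j, is scored by the confidence level 1 - P(X >= n_ij) with X ~ Binomial(N_i, q_j), where q_j is the share of all k-mer occurrences falling in class j. The function names and the toy counts are hypothetical.

```python
from math import comb


def binomial_confidence(n_ij, N_i, q_j):
    """Confidence level 1 - P(X >= n_ij) for X ~ Binomial(N_i, q_j).

    Assumed scoring rule: a value near 1 means the k-mer occurs in class j
    far more often than the class prior q_j alone would explain.
    """
    tail = sum(
        comb(N_i, m) * q_j**m * (1 - q_j) ** (N_i - m)
        for m in range(n_ij, N_i + 1)
    )
    return 1.0 - tail


def rank_kmers(counts):
    """Rank k-mers by their best per-class confidence level.

    counts maps each k-mer to a list of occurrence counts, one per class.
    Returns (k-mers sorted from most to least informative, score dict).
    """
    n_classes = len(next(iter(counts.values())))
    total = sum(sum(c) for c in counts.values())
    # Class prior q_j: fraction of all k-mer occurrences observed in class j.
    q = [sum(c[j] for c in counts.values()) / total for j in range(n_classes)]
    scores = {}
    for kmer, c in counts.items():
        N_i = sum(c)
        scores[kmer] = max(
            binomial_confidence(c[j], N_i, q[j]) for j in range(n_classes)
        )
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores


# Hypothetical toy counts: [occurrences in class 0, occurrences in class 1].
toy_counts = {"AAA": [30, 2], "ACG": [10, 10], "TTT": [1, 25]}
ranked, scores = rank_kmers(toy_counts)
print(ranked)
```

In this toy input, "AAA" and "TTT" are each concentrated in one class and score close to 1, while the evenly spread "ACG" scores much lower and is ranked last. A fixed number of top-ranked k-mers would then be passed to the classifier.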

A comparative study of redundant feature detection based feature selection methods

2014 International Conference on Computer, Information and Telecommunication Systems (CITS) ◽

10.1109/cits.2014.6878974 ◽

2014 ◽

Author(s):

Xue-Qiang Zeng ◽

Qian-Sheng Chen

Keyword(s):

Feature Selection ◽

Comparative Study ◽

Feature Detection ◽

Selection Methods ◽

Redundant Feature

A Comparative Study on Feature Selection of Text Categorization for Hidden Markov Models

Proceedings of the Annual Conference of CAIS / Actes du congrès annuel de l'ACSI ◽

10.29173/cais341 ◽

2013 ◽

Author(s):

Kwan Yi ◽

Jamshid Beheshti

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Markov Models ◽

Hidden Markov ◽

Model Performance ◽

Document Representation ◽

Selection Methods ◽

Learning Models ◽

Text Feature ◽

Selection Of

In document representation for digitized text, feature selection refers to selecting the terms that represent a document and distinguish it from other documents. This study examines different feature selection methods for HMM learning models to explore how they affect model performance in the context of a text categorization task.

Psychophysiological Modeling of Trust In Technology

Proceedings of the ACM on Human-Computer Interaction ◽

10.1145/3459745 ◽

2021 ◽

Vol 5 (EICS) ◽

pp. 1-25

Author(s):

Ighoyota Ben Ajenaghughrure ◽

Sonia Cláudia Da Costa Sousa ◽

David Lamas

Keyword(s):

Feature Selection ◽

Risk Perception ◽

Autonomous Vehicles ◽

Electrodermal Activity ◽

Feature Selection Method ◽

Model Performance ◽

Selection Method ◽

Primary Objective ◽

Selection Methods ◽

Trust In Technology

Trust as a precursor for users' acceptance of artificial intelligence (AI) technologies that operate as a conceptual extension of humans (e.g., autonomous vehicles (AVs)) is highly influenced by users' risk perception, among other factors. Prior studies that investigated the interplay between risk and trust perception recommended the development of real-time tools for monitoring cognitive states (e.g., trust). The primary objective of this study was to investigate a feature selection method that yields feature sets that can help develop a highly optimized and stable ensemble trust classifier model. The secondary objective of this study was to investigate how varying levels of risk perception influence users' trust and overall reliance on technology. A within-subject four-condition experiment was implemented with an AV driving game. This experiment involved 25 participants, and their electroencephalogram, electrodermal activity, and facial electromyogram psychophysiological signals were acquired. We applied wrapper, filter, and hybrid feature selection methods to the 82 features extracted from the psychophysiological signals. We trained and tested five voting-based ensemble trust classifier models using training and testing datasets containing only the features identified by the feature selection methods. The results indicate the superiority of the hybrid feature selection method over the other methods in terms of model performance. In addition, the self-reported trust measurement and the participants' overall reliance on the technology (AV), measured with joystick movements throughout the game, reveal that a reduction in risk results in an increase in trust and overall reliance on technology.

TTC-3600: A new benchmark dataset for Turkish text categorization

Journal of Information Science ◽

10.1177/0165551515620551 ◽

2015 ◽

Vol 43 (2) ◽

pp. 174-185 ◽

Cited By ~ 23

Author(s):

Deniz Kılınç ◽

Akın Özçift ◽

Fatma Bozyigit ◽

Pelin Yıldırım ◽

Fatih Yücalar ◽

...

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Feature Selection Method ◽

Learning Task ◽

Selection Method ◽

Random Forest Classifier ◽

Experimental Results ◽

Selection Methods ◽

File Formats ◽

Accuracy Criterion

Owing to the rapid growth of the World Wide Web, the number of documents accessible via the Internet is increasing explosively each day. On news portals in particular, documents belonging to categories such as technology, sports and politics sometimes appear in the wrong category or are placed in a generic category called "others". At this point, text categorization (TC), which is generally treated as a supervised learning task, is needed. Although a substantial number of studies have been conducted on TC in other languages, the number of studies conducted in Turkish is very limited owing to the lack of accessible and usable datasets. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy, 91.03%, is obtained with the combination of the Random Forest classifier and the attribute ranking-based feature selection method across all comparisons performed after the pre-processing and feature selection steps. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.

A New Technique for Sentiment Analysis System Based on Deep Learning Using Chi-Square Feature Selection Methods

Balkan Journal of Electrical and Computer Engineering ◽

10.17694/bajece.887339 ◽

2021 ◽

Author(s):

Mohammed HUSSEİN ◽

Fatih ÖZYURT

Keyword(s):

Feature Selection ◽

Deep Learning ◽

Sentiment Analysis ◽

New Technique ◽

Selection Methods ◽

Chi Square ◽

A New Technique ◽

Analysis System

Feature Selection of the Rich Model Based on the Correlation of Feature Components

Security and Communication Networks ◽

10.1155/2021/6680528 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Shunhao Jin ◽

Fenlin Liu ◽

Chunfang Yang ◽

Yuanyuan Ma ◽

Yuan Liu

Keyword(s):

Feature Selection ◽

Feature Selection Method ◽

Correlation Coefficients ◽

Selection Method ◽

Strongly Correlated ◽

Selection Methods ◽

Fisher Criterion ◽

Redundant Feature ◽

The Rich ◽

Selection Of

Currently, the popular Rich Model steganalysis features usually contain a large number of redundant feature components, which may bring the “curse of dimensionality” and a large computation cost, and existing feature selection methods struggle to reduce the dimensionality effectively when there are many strongly correlated effective feature components. This paper proposes a novel selection method for Rich Model steganalysis features. First, the separability of each feature component in the submodels of the Rich Model is measured based on the Fisher criterion, and the feature components are sorted in descending order of separability. Second, the correlation coefficient between any two feature components in each submodel is calculated, and feature selection is performed according to the Fisher value of each component and the correlation coefficients. Finally, the selected submodels are combined as the final steganalysis feature. The results show that the proposed feature selection method can effectively reduce the dimensionality of JPEG domain and spatial domain Rich Model steganalysis features without affecting detection accuracy.
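The two steps described above (rank by Fisher criterion, then prune strongly correlated components) can be illustrated with a minimal sketch for a single submodel. This is not the authors' exact procedure: the greedy pruning rule, the two-class Fisher score formula, the threshold `max_corr`, and all function names are assumptions made for illustration.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def fisher_score(values, labels):
    """Two-class Fisher criterion: (mean gap)^2 / (sum of class variances)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    gap = (mean(pos) - mean(neg)) ** 2
    spread = variance(pos) + variance(neg)
    if spread == 0:
        # Zero within-class spread: perfectly separable if the means differ.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def pearson(a, b):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm if norm > 0 else 0.0


def select_features(columns, labels, max_corr=0.9):
    """Visit features in descending Fisher order and keep a feature only if
    its absolute correlation with every kept feature is below max_corr."""
    order = sorted(
        range(len(columns)),
        key=lambda i: fisher_score(columns[i], labels),
        reverse=True,
    )
    kept = []
    for i in order:
        if all(abs(pearson(columns[i], columns[k])) < max_corr for k in kept):
            kept.append(i)
    return kept


# Hypothetical toy data: feature 1 is roughly twice feature 0, feature 2 is noise.
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    [2.0, 2.6, 1.7, 6.0, 6.3, 5.7],
    [0.5, 1.5, 1.0, 1.2, 0.4, 0.9],
]
print(select_features(columns, labels))
```

On this toy input, feature 1 is dropped because it is almost perfectly correlated with the higher-ranked feature 0. A real pipeline would presumably also discard components with a low Fisher value, which this sketch does not do.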

Redundant Feature Selection Methods in Text Classification

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.1044-1045.1258 ◽

2014 ◽

Vol 1044-1045 ◽

pp. 1258-1261

Author(s):

Su Fen Chen

Keyword(s):

Feature Selection ◽

Text Mining ◽

Mutual Information ◽

Text Classification ◽

Feature Space ◽

High Dimensional ◽

Selection Methods ◽

Redundant Feature ◽

Label Information ◽

Better Than

Feature selection is an effective pre-processing technique that facilitates text mining in high-dimensional feature spaces. In recent years, many effective redundant feature selection methods have been proposed from different motivations. However, a comparative experimental study of redundant feature selection methods in the field of text mining has not yet been reported. To fill this gap, this paper presents an extensive empirical comparative study on the task of text classification. The experimental results indicate that 3-way Mutual Information represents redundancy much better than traditional 2-way Mutual Information, since 3-way Mutual Information takes the label information into account. As a result, redundant feature selection methods based on 3-way Mutual Information outperform the other methods.

Sentiment Analysis of Movie Reviews: A Study of Machine Learning Algorithms with Various Feature Selection Methods

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v5i9.113121 ◽

2017 ◽

Vol 5 (9) ◽

Cited By ~ 1

Author(s):

Rajwinder Kaur

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Selection Methods

The Effectiveness of the Fused Weighted Filter Feature Selection Method to Improve Software Fault Prediction

Journal of Communications Technology Electronics and Computer Science ◽

10.22385/jctecs.v8i0.96 ◽

2016 ◽

Vol 8 ◽

pp. 5 ◽

Cited By ~ 1

Author(s):

Fatemeh Alighardashi ◽

Mohammad Ali Zare Chahooki

Keyword(s):

Feature Selection ◽

Feature Selection Method ◽

Selection Method ◽

Machine Learning Algorithms ◽

Fault Prediction ◽

Filter Method ◽

Selection Methods ◽

Software Projects ◽

Software Fault Prediction ◽

Software Fault

Improving software product quality before release through periodic testing is one of the most expensive activities in software projects. Because resources for testing modules are limited, it is important to identify fault-prone modules and focus testing resources on them. Software fault predictors based on machine learning algorithms are effective tools for identifying fault-prone modules. Extensive studies are being done in this field to find the connection between the features of software modules and their fault-proneness. Some features used in predictive algorithms are ineffective and reduce prediction accuracy, so feature selection methods are widely used to increase the performance of prediction models for fault-prone modules. In this study, we propose a feature selection method that combines several filter feature selection methods into a fused weighted filter method. The proposed method improves both the convergence rate of feature selection and the accuracy. The results obtained on ten datasets from NASA and PROMISE indicate the effectiveness of the proposed method in improving the accuracy and convergence of software fault prediction.
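The abstract does not say how the filter scores are fused, so the sketch below shows only one generic way to combine several filter rankings. It min-max normalizes each filter's scores and takes a weighted average. The normalization, the weights, the function names and the toy scores are all assumptions and should not be read as the paper's method.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_filter_scores(score_lists, weights):
    """Weighted average of min-max normalized scores, one list per filter.

    score_lists[f][i] is the score that filter f assigns to feature i.
    """
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per filter")
    normalized = [min_max(scores) for scores in score_lists]
    total_weight = sum(weights)
    n_features = len(score_lists[0])
    return [
        sum(w * norm[i] for w, norm in zip(weights, normalized)) / total_weight
        for i in range(n_features)
    ]


def top_k(fused, k):
    """Indices of the k features with the highest fused score."""
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]


# Hypothetical scores from two filters on different scales, four features each.
filter_a = [0.9, 0.1, 0.5, 0.3]
filter_b = [10.0, 2.0, 8.0, 9.0]
fused = fuse_filter_scores([filter_a, filter_b], weights=[0.5, 0.5])
print(top_k(fused, 2))
```

Normalizing before fusing keeps a filter with a large numeric range, such as `filter_b` here, from dominating the combined score. The weights could be tuned on validation data if one filter is known to be more reliable.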

Feature Selection Methods in Sentiment Analysis

Proceedings of the 3rd International Conference on Networking, Information Systems & Security ◽

10.1145/3386723.3387840 ◽

2020 ◽

Author(s):

Nurilhami Izzatie Khairi ◽

Azlinah Mohamed ◽

Nor Nadiah Yusof

Keyword(s):

Feature Selection ◽

Sentiment Analysis ◽

Selection Methods
