BDselect: a package for k-mer selection based on the binomial distribution

2021 ◽  
Vol 16 ◽  
Author(s):  
Fu-Ying Dao ◽  
Hao Lv ◽  
Zhao-Yue Zhang ◽  
Hao Lin

Background: Feature extraction often runs into the curse of dimensionality. The extracted features may carry substantial redundant information, which strains computing resources and leads to overfitting. Objective: Feature selection is an important strategy for overcoming the problems caused by high dimensionality. In most machine learning tasks, the features determine the upper limit of model performance, so there is a continuing need for new feature selection methods. Methods: In this paper, we introduce a new technique for optimizing sequence features based on the binomial distribution (BD). First, the principle of the binomial distribution algorithm is introduced in detail. Then, the proposed algorithm is compared with other commonly used feature selection methods on three different types of datasets, using a Random Forest classifier with the same parameters. Results and Conclusion: The results confirm that BD yields a promising improvement in feature selection and classification accuracy. Finally, we provide the source code and an executable program package (http://lin-group.cn/server/BDselect/), with which users can easily apply our algorithm in their research.
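
The abstract does not spell out the ranking rule, so the following is only a minimal sketch of one plausible binomial-distribution scoring scheme, not the BDselect implementation. It assumes that each k-mer is scored by how unlikely its count in a class would be if occurrences were spread across classes in proportion to class size, and that k-mers are ranked by their highest per-class confidence. The function names and the toy counts are hypothetical.

```python
from math import comb


def binomial_tail(n_total, n_class, q):
    """P(X >= n_class) for X ~ Binomial(n_total, q)."""
    return sum(
        comb(n_total, m) * q**m * (1 - q) ** (n_total - m)
        for m in range(n_class, n_total + 1)
    )


def rank_kmers(counts):
    """Rank k-mers by an assumed confidence score (1 - binomial tail).

    counts maps each k-mer to a list of occurrence counts, one per class.
    """
    n_classes = len(next(iter(counts.values())))
    # Assumed null hypothesis: a k-mer lands in class j with probability
    # equal to that class's share of all k-mer occurrences.
    class_totals = [sum(c[j] for c in counts.values()) for j in range(n_classes)]
    grand_total = sum(class_totals)
    priors = [t / grand_total for t in class_totals]

    scores = {}
    for kmer, per_class in counts.items():
        n_total = sum(per_class)
        # Keep the highest confidence across classes as the k-mer's score.
        scores[kmer] = max(
            1.0 - binomial_tail(n_total, per_class[j], priors[j])
            for j in range(n_classes)
        )
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical counts for three 3-mers in two classes.
toy_counts = {"AAA": [9, 1], "CCC": [5, 5], "GGG": [1, 9]}
ranking = rank_kmers(toy_counts)
```

In this toy input, "CCC" is spread evenly across the two classes, so it ranks last; the two class-skewed k-mers share the top score.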

Author(s):  
Kwan Yi ◽  
Jamshid Beheshti

In document representation for digitized text, feature selection refers to selecting the terms that represent a document and distinguish it from other documents. This study examines different feature selection methods for HMM learning models to explore how they affect model performance, evaluated in the context of a text categorization task.


2021 ◽  
Vol 5 (EICS) ◽  
pp. 1-25
Author(s):  
Ighoyota Ben Ajenaghughrure ◽  
Sonia Cláudia Da Costa Sousa ◽  
David Lamas

Trust as a precursor for users' acceptance of artificial intelligence (AI) technologies that operate as a conceptual extension of humans (e.g., autonomous vehicles (AVs)) is highly influenced by users' risk perception, among other factors. Prior studies that investigated the interplay between risk and trust perception recommended the development of real-time tools for monitoring cognitive states (e.g., trust). The primary objective of this study was to investigate a feature selection method that yields feature sets that can help develop a highly optimized and stable ensemble trust classifier model. The secondary objective was to investigate how varying levels of risk perception influence users' trust and overall reliance on technology. A within-subject four-condition experiment was implemented with an AV driving game. The experiment involved 25 participants, whose electroencephalogram, electrodermal activity, and facial electromyogram psychophysiological signals were acquired. We applied wrapper, filter, and hybrid feature selection methods to the 82 features extracted from the psychophysiological signals. We trained and tested five voting-based ensemble trust classifier models using training and testing datasets containing only the features identified by the feature selection methods. The results indicate the superiority of the hybrid feature selection method over the other methods in terms of model performance. In addition, the self-reported trust measurement and the participants' overall reliance on the technology (AV), measured with joystick movements throughout the game, reveal that a reduction in risk results in an increase in trust and overall reliance on technology.
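
The abstract does not say how its hybrid method combines the filter and wrapper stages. As a generic illustration only, the sketch below assumes a common pattern: a filter stage keeps the top-scoring features, then a wrapper stage runs greedy forward selection over the survivors using a caller-supplied evaluation function. The names `hybrid_select`, `evaluate`, and `keep_top`, and the toy scoring function, are hypothetical.

```python
def hybrid_select(filter_scores, evaluate, keep_top=3):
    """Filter stage followed by a greedy forward-selection wrapper stage.

    filter_scores: one relevance score per feature (higher is better).
    evaluate: callable taking a list of feature indices and returning a
              model-quality score (e.g. cross-validated accuracy).
    """
    # Filter stage: keep only the keep_top best-scoring features.
    candidates = sorted(
        range(len(filter_scores)), key=lambda i: filter_scores[i], reverse=True
    )[:keep_top]

    # Wrapper stage: add one feature per round while the score improves.
    selected, best = [], float("-inf")
    improved = True
    while improved:
        improved = False
        choice = None
        for i in candidates:
            if i in selected:
                continue
            score = evaluate(selected + [i])
            if score > best:
                best, choice, improved = score, i, True
        if improved:
            selected.append(choice)
    return selected, best


# Toy stand-in for a classifier: features 0 and 2 are useful, and every
# extra feature costs a small penalty. Purely illustrative.
def toy_evaluate(subset):
    useful = {0, 2}
    return len(useful & set(subset)) - 0.1 * len(subset)


chosen, chosen_score = hybrid_select([0.9, 0.8, 0.7, 0.1], toy_evaluate)
```

In practice `evaluate` would train and validate a real classifier, which is what makes wrapper stages expensive and motivates filtering first.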


2015 ◽  
Vol 43 (2) ◽  
pp. 174-185 ◽  
Author(s):  
Deniz Kılınç ◽  
Akın Özçift ◽  
Fatma Bozyigit ◽  
Pelin Yıldırım ◽  
Fatih Yücalar ◽  
...  

Owing to the rapid growth of the World Wide Web, the number of documents accessible via the Internet grows explosively with each passing day. On news portals in particular, documents related to categories such as technology, sports and politics sometimes appear in the wrong category, or are placed in a generic category called "others". This calls for text categorization (TC), which is generally addressed as a supervised learning task. Although a substantial number of TC studies have been conducted in other languages, the number conducted in Turkish is very limited, owing to the poor accessibility and usability of existing datasets. In this paper, a new dataset named TTC-3600, which can be widely used in TC studies of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five classifiers widely used in TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that, across all comparisons performed after the pre-processing and feature selection steps, the best accuracy of 91.03% is obtained by combining the Random Forest classifier with the attribute ranking-based feature selection method. The publicly available TTC-3600 dataset and the experimental results of this study can be used in comparative experiments by other researchers.


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Shunhao Jin ◽  
Fenlin Liu ◽  
Chunfang Yang ◽  
Yuanyuan Ma ◽  
Yuan Liu

The popular Rich Model steganalysis features usually contain a large number of redundant feature components, which may bring the “curse of dimensionality” and a large computation cost, yet existing feature selection methods struggle to reduce the dimensionality effectively when there are many strongly correlated effective feature components. This paper proposes a novel selection method for Rich Model steganalysis features. First, the separability of each feature component in the submodels of the Rich Model is measured using the Fisher criterion, and the feature components are sorted in descending order of separability. Second, the correlation coefficient between every pair of feature components in each submodel is calculated, and feature selection is performed according to the Fisher value of each component and the correlation coefficients. Finally, the selected submodels are combined into the final steganalysis feature. The results show that the proposed feature selection method can effectively reduce the dimensionality of JPEG-domain and spatial-domain Rich Model steganalysis features without affecting detection accuracy.
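
The two steps described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the paper's procedure: it uses the standard two-class Fisher score, ignores the submodel structure, and assumes a greedy rule that drops a component when its absolute Pearson correlation with an already kept component reaches a threshold. The threshold of 0.9, the function names, and the toy data are all hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(cover, stego):
    """Two-class Fisher score: squared mean gap over summed variances."""
    denom = pvariance(cover) + pvariance(stego)
    return (mean(cover) - mean(stego)) ** 2 / denom if denom else 0.0


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def select_components(cover_cols, stego_cols, max_corr=0.9):
    """Sort components by Fisher score, then prune correlated ones.

    cover_cols[i] and stego_cols[i] hold component i's values over the
    cover and stego samples respectively.
    """
    order = sorted(
        range(len(cover_cols)),
        key=lambda i: fisher_score(cover_cols[i], stego_cols[i]),
        reverse=True,
    )
    kept = []
    for i in order:
        col_i = cover_cols[i] + stego_cols[i]
        # Assumed rule: keep i only if it is not strongly correlated
        # with any higher-ranked component that was already kept.
        if all(
            abs(pearson(col_i, cover_cols[j] + stego_cols[j])) < max_corr
            for j in kept
        ):
            kept.append(i)
    return kept


# Toy data: component 1 is nearly a rescaled copy of component 0.
toy_cover = [[0.0, 1.0, 0.0, 1.0], [0.0, 2.0, 0.0, 2.0], [1.0, 1.0, 2.0, 2.0]]
toy_stego = [[4.0, 5.0, 4.0, 5.0], [7.0, 9.0, 7.0, 9.0], [2.0, 2.0, 3.0, 3.0]]
kept_components = select_components(toy_cover, toy_stego)
```

Here component 1 separates the classes well but is dropped as redundant with component 0, while the weaker but less correlated component 2 is retained.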


2014 ◽  
Vol 1044-1045 ◽  
pp. 1258-1261
Author(s):  
Su Fen Chen

Feature selection is an effective pre-processing technique that facilitates text mining in high-dimensional feature spaces. In recent years, many effective redundant feature selection methods have been proposed from different motivations. However, no comparative experimental study of redundant feature selection methods in the field of text mining has been reported yet. To fill this gap, this paper presents an extensive empirical comparative study on the task of text classification. The experimental results indicate that 3-way Mutual Information represents redundancy much better than traditional 2-way Mutual Information, since 3-way Mutual Information takes the label information into account. As a result, redundant feature selection methods based on 3-way Mutual Information outperform the other methods.
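
The abstract does not define its 3-way measure. As one possible reading, the sketch below contrasts the ordinary 2-way mutual information I(X;Y) between two features with the class-conditional quantity I(X;Y|C), which brings in the label C; treating the latter as the relevant "3-way" term is an assumption, as are the function names and the toy data.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(rows).values())


def mutual_information(x, y):
    """2-way mutual information: I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(x, y, c):
    """I(X;Y|C) = H(X,C) + H(Y,C) - H(X,Y,C) - H(C)."""
    return entropy(x, c) + entropy(y, c) - entropy(x, y, c) - entropy(c)


# Toy data: the label is the XOR of two binary features.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
label = [a ^ b for a, b in zip(x, y)]

pairwise = mutual_information(x, y)
given_label = conditional_mutual_information(x, y, label)
```

On this toy data the pairwise measure sees no relationship between the two features (0 bits), while conditioning on the label reveals a full bit of dependence, which illustrates how label information can change a redundancy estimate.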


Author(s):  
Fatemeh Alighardashi ◽  
Mohammad Ali Zare Chahooki

Improving software product quality before release through periodic testing is one of the most expensive activities in software projects. Because the resources available for testing modules are limited, it is important to identify fault-prone modules and to concentrate testing resources on them. Software fault predictors based on machine learning algorithms are effective tools for identifying fault-prone modules. Extensive studies are being carried out in this field to find the connection between the features of software modules and their fault-proneness. Some of the features used by predictive algorithms are ineffective and reduce the accuracy of the prediction process, so feature selection methods are widely used to increase the performance of fault prediction models. In this study, we propose a feature selection method that combines several filter feature selection methods into a fused weighted filter method. The proposed method improves both the convergence rate of feature selection and the prediction accuracy. The results obtained on ten datasets from NASA and PROMISE indicate the effectiveness of the proposed method in improving the accuracy and convergence of software fault prediction.
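
How the filters are fused and weighted is not stated in the abstract. A minimal sketch of one way such a fusion could work, assuming each filter's scores are min-max normalized and then combined as a weighted sum, is shown below. The weights, the function names, and the example scores are hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_filters(filter_scores, weights):
    """Weighted sum of normalized per-feature scores from several filters.

    filter_scores: one list of per-feature scores for each filter.
    weights: one weight per filter (assumed to be chosen by the user).
    """
    normalized = [min_max(scores) for scores in filter_scores]
    n_features = len(filter_scores[0])
    return [
        sum(w * norm[i] for w, norm in zip(weights, normalized))
        for i in range(n_features)
    ]


# Two hypothetical filters scoring three features on different scales.
fused = fuse_filters([[0.1, 0.5, 0.9], [10, 30, 20]], weights=[0.7, 0.3])
feature_ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

Normalizing first keeps a filter with a large numeric range from dominating the sum; the weights then express how much each filter is trusted.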

