Enhanced Filter Feature Selection Methods for Arabic Text Categorization

2018 ◽  
Vol 8 (2) ◽  
pp. 1-24 ◽  
Author(s):  
Abdullah Saeed Ghareb ◽  
Azuraliza Abu Bakar ◽  
Qasem A. Al-Radaideh ◽  
Abdul Razak Hamdan

The filtering of large amounts of data is an important process in data mining tasks, particularly for the categorization of unstructured, high-dimensional data. A feature selection process is therefore desired to reduce the space of high-dimensional data into a small subset of relevant dimensions that represent the best features for text categorization. In this article, three enhanced filter feature selection methods are proposed: the Category Relevant Feature Measure, the Modified Category Discriminated Measure, and Odd Ratio2. These methods combine relevant information about features both within and across categories. The effectiveness of the proposed methods with Naïve Bayes and associative classification is evaluated using traditional measures of text categorization, namely macro-averaged precision, recall, and F-measure. Experiments are conducted on three Arabic text categorization datasets. The experimental results show that the proposed methods achieve results that are better than or comparable to those of 12 well-known traditional methods.
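The macro-averaged measures used for evaluation above can be sketched in a few lines: per-category precision, recall, and F-measure are computed first and then averaged with equal weight, so rare categories count as much as frequent ones. A minimal pure-Python illustration (the category labels and predictions are invented for the example):

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F-measure over all
    categories appearing in the gold labels or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# toy 3-category example
y_true = ["sport", "sport", "news", "news", "econ"]
y_pred = ["sport", "news", "news", "news", "econ"]
p, r, f = macro_scores(y_true, y_pred)
```

Macro averaging is the natural choice for text categorization benchmarks whose category sizes are skewed, since micro averaging would be dominated by the largest categories.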

2021 ◽  
Vol 26 (1) ◽  
pp. 67-77
Author(s):  
Siva Sankari Subbiah ◽  
Jayakumar Chinnappan

Nowadays, organizations collect huge volumes of data without knowing their usefulness. The rapid development of the Internet helps organizations capture data in many different formats through the Internet of Things (IoT), social media, and other disparate sources. The dimensionality of datasets increases day by day at an extraordinary rate, resulting in large-scale datasets with high dimensionality. The present paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big data world, feature selection plays a significant role in reducing the dimensionality and the overfitting of the learning process. Many feature selection methods have been proposed by researchers for obtaining more relevant features, especially from big datasets, which help to provide accurate learning results without degradation in performance. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and provides a summary of the related research work done by various researchers. As a result, big data analysis with feature selection improves the accuracy of learning.
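The basic filter approach that such reviews cover scores each feature independently and keeps the top-ranked columns. A minimal sketch, using variance as a stand-in score (real pipelines use task-aware scores such as chi-square or mutual information, but the ranking structure is the same; the data matrix is invented for illustration):

```python
def variance_filter(X, k):
    """Filter-style feature selection: score every feature column
    independently by its variance and keep the k highest-scoring
    column indices."""
    n = len(X)
    d = len(X[0])
    scores = []
    for j in range(d):
        col = [row[j] for row in X]
        mean = sum(col) / n
        scores.append(sum((v - mean) ** 2 for v in col) / n)
    # indices of the k features with the largest variance
    ranked = sorted(range(d), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

X = [[1.0, 0.0, 5.0],
     [1.0, 1.0, 1.0],
     [1.0, 0.0, 9.0]]
keep = variance_filter(X, 2)  # the constant column 0 is dropped
```

Because the score is computed per feature, this scales linearly in the number of features, which is what makes filters attractive for the distributed big-data settings the review discusses.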


2017 ◽  
Vol 56 (2) ◽  
pp. 395-442 ◽  
Author(s):  
V. Bolón-Canedo ◽  
D. Rego-Fernández ◽  
D. Peteiro-Barral ◽  
A. Alonso-Betanzos ◽  
B. Guijarro-Berdiñas ◽  
...  

2015 ◽  
Vol 2015 ◽  
pp. 1-18 ◽  
Author(s):  
Thanh-Tung Nguyen ◽  
Joshua Zhexue Huang ◽  
Thuy Thi Nguyen

Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting, which gives RFs poor accuracy when working with high-dimensional data. In addition, RFs are biased in the feature selection process, favoring multivalued features. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features when learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while reducing the dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results show that RFs with the proposed approach outperformed existing random forests in both accuracy and AUC.
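The feature weighting sampling step can be illustrated in isolation: instead of the uniform feature subsampling a plain random forest uses at each node, candidate features are drawn with probability proportional to an informativeness score. This is a simplified sketch of the general idea, not the authors' xRF implementation; the score values are assumed inputs:

```python
import random

def weighted_feature_sample(scores, m, rng=random.Random(42)):
    """Draw m distinct feature indices, each with probability
    proportional to its score, by repeated weighted draws
    without replacement."""
    remaining = dict(scores)  # feature index -> informativeness score
    chosen = []
    for _ in range(min(m, len(remaining))):
        total = sum(remaining.values())
        r = rng.random() * total
        acc = 0.0
        pick = None
        for f, s in remaining.items():
            acc += s
            if r <= acc:
                pick = f
                break
        if pick is None:  # guard against floating-point round-off
            pick = f
        chosen.append(pick)
        del remaining[pick]
    return chosen

scores = {0: 0.1, 1: 0.5, 2: 0.2, 3: 0.2}
candidates = weighted_feature_sample(scores, 2)
```

Biasing the draw toward informative features raises the chance that each node split sees at least one useful candidate, which is the mechanism behind the accuracy gains reported for high-dimensional data.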


Author(s):  
Damien François

In many applications, such as function approximation, pattern recognition, time series prediction, and data mining, one has to build a model relating some features describing the data to some response value. Often, the features that are relevant for building the model are not known in advance. Feature selection methods allow removing irrelevant and/or redundant features to keep only the feature subset that is most useful for building a prediction model. The model is simpler and easier to interpret, reducing the risks of overfitting, non-convergence, etc. By contrast with other dimensionality reduction techniques, such as principal component analysis or more recent nonlinear projection techniques (Lee & Verleysen 2007), which build a new, smaller set of features, the features selected by feature selection methods preserve their initial meaning, potentially bringing extra information about the process being modeled (Guyon 2006). Recently, the advent of high-dimensional data has raised new challenges for feature selection methods, from both the algorithmic and the conceptual point of view (Liu & Motoda 2007). The problem of feature selection is exponential in nature, and many approximate algorithms are cubic with respect to the initial number of features, which may be intractable when the dimensionality of the data is large. Furthermore, high-dimensional data are often highly redundant, and two distinct subsets of features may have very similar predictive power, which can make it difficult to identify the best subset.
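The exponential nature of the subset search is why approximate strategies such as greedy sequential forward selection are widely used: instead of scoring all 2^d subsets, each step adds the single feature that most improves the current subset's score. A minimal sketch (the additive toy score is invented for illustration; in practice the score would be cross-validated model accuracy):

```python
def forward_selection(features, score, k):
    """Greedy sequential forward selection: grow the selected
    subset one feature at a time, always adding the feature that
    maximises the score of the enlarged subset."""
    selected = []
    for _ in range(k):
        best_f, best_s = None, float("-inf")
        for f in features:
            if f in selected:
                continue
            s = score(tuple(selected + [f]))
            if s > best_s:
                best_f, best_s = f, s
        selected.append(best_f)
    return selected

# toy additive score: each feature contributes a fixed gain,
# so the greedy result matches the true best 2-feature subset
gain = {"a": 0.1, "b": 0.5, "c": 0.3}
score = lambda subset: sum(gain[f] for f in subset)
chosen = forward_selection(list(gain), score, 2)
```

With d features and k selection steps, this performs O(k·d) score evaluations, which is the kind of polynomial approximation the text contrasts with the intractable exact search; note that greedy selection can miss the optimum when features interact.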


Author(s):  
T. Sneka ◽  
K. Palanivel

Recognition of gene expression has become an important research issue in diagnosing genetic diseases. Microarrays are considered a representation for identifying gene behaviors that may help in the detection process. Hence, they are used in analyzing samples that may be normal or affected, and in diagnosing various gene-based diseases. Various clustering and classification techniques have been used to address the challenges of handling microarray data. High dimensionality is one of the major issues arising when handling microarrays, and because of it, redundant, irrelevant, and noisy data may occur. To solve this problem, a feature selection process that optimally extracts features is introduced into clustering and classification techniques. This survey examines various techniques for classification, gene clustering, and feature selection, including supervised, unsupervised, and semi-supervised methods, with the aim of determining a suitable semi-supervised algorithm for detecting new or difficult mutated diseases. The survey shows how the semi-supervised approach evolves and outperforms existing algorithms.


Entropy ◽  
2020 ◽  
Vol 22 (6) ◽  
pp. 613
Author(s):  
Yu Zhou ◽  
Junhao Kang ◽  
Xiao Zhang

Recent discretization-based feature selection methods show great advantages by introducing entropy-based cut-points for features, integrating discretization and feature selection into one stage for high-dimensional data. However, current methods usually consider individual features independently, ignoring the interaction between features with cut-points and those without, which results in information loss. In this paper, we propose a cooperative coevolutionary algorithm based on the genetic algorithm (GA) and particle swarm optimization (PSO), which searches for the feature subsets with and without entropy-based cut-points simultaneously. For the features with cut-points, a ranking mechanism is used to control the probability of mutation and crossover in the GA. In addition, a binary-coded PSO is applied to update the indices of the selected features without cut-points. Experimental results on 10 real datasets verify the effectiveness of our algorithm in terms of classification accuracy against several state-of-the-art competitors.
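The entropy-based cut-points such methods build on can be illustrated for a single feature: candidate thresholds between consecutive sorted values are scored by information gain, the drop in class entropy after splitting. A minimal sketch of that scoring, not the paper's algorithm (the toy feature values and labels are invented):

```python
import math

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_cut_point(values, labels):
    """Try each midpoint between consecutive distinct sorted values
    and return the cut with the highest information gain. Features
    whose best gain is ~0 carry no useful cut-point."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_cut, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v <= cut]
        right = [y for v, y in pairs if v > cut]
        gain = (base
                - len(left) / len(pairs) * entropy(left)
                - len(right) / len(pairs) * entropy(right))
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain

# perfectly separable toy feature: the cut at 2.5 recovers all 1 bit
# of class entropy
cut, gain = best_cut_point([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

Evolutionary search over which features (and which of their cut-points) to keep, as in the GA/PSO scheme described above, then operates on top of per-feature scores like this one.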

