Efficient multi-cluster feature selection on text data

This paper is concerned with variable screening when highly correlated variables exist in high-dimensional linear models. We propose a novel cluster feature selection (CFS) procedure that combines the elastic net with linear correlation variable screening to gain the benefits of both methods. When calculating the correlation between predictors and the response, we consider highly correlated groups of predictors rather than individual predictors, in contrast to the usual linear correlation variable screening. Within each correlated group, we apply the elastic net to select variables and estimate their parameters. This avoids a drawback of LASSO [R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B 58 (1996) 268–288], which can mistakenly eliminate truly relevant variables when they are highly correlated. After applying the CFS procedure, the maximum absolute correlation coefficient between clusters becomes smaller, and any common model selection method, such as sure independence screening (SIS) [J. Fan and J. Lv, Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. Ser. B 70 (2008) 849–911] or LASSO, can then be applied to improve the results. Extensive numerical examples, including pure simulation examples and semi-real examples, demonstrate the good performance of our procedure.
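The grouping idea in this abstract can be illustrated with a minimal sketch. It is not the authors' CFS implementation: the 0.9 threshold, the function names, and the choice to score a group by the largest absolute correlation of its members with the response are all assumptions made for illustration, and the within-group elastic net fit is omitted.

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlated_groups(columns, threshold=0.9):
    """Group predictor indices whose pairwise |correlation| >= threshold.

    Uses union-find, so groups are the connected components of the
    "highly correlated" graph (an assumed grouping rule).
    """
    n = len(columns)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Attach the larger root to the smaller one so each
                    # group's root is its smallest predictor index.
                    parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [groups[root] for root in sorted(groups)]

def group_scores(columns, y, groups):
    """Score each group by its members' largest |correlation| with y (assumed rule)."""
    return [max(abs(pearson(columns[i], y)) for i in g) for g in groups]

# Toy data (hypothetical): predictors 0 and 1 are near-duplicates.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.0, 3.2, 3.9, 5.1],
    [2.0, -1.0, 0.0, 1.0, -2.0],
]
y = [1.0, 2.1, 2.9, 4.2, 5.0]

groups = correlated_groups(columns, threshold=0.9)
scores = group_scores(columns, y, groups)
```

Screening would then rank the groups by score and keep the top ones, so two highly correlated relevant predictors are retained or dropped together instead of competing with each other.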


A Clustering Based Feature Selection Method Using Feature Information Distance for Text Data

Intelligent Computing Theories and Application - Lecture Notes in Computer Science
DOI: 10.1007/978-3-319-42291-6_12
2016, pp. 122-132
Cited by: ~2

Author(s): Shilong Chao, Jie Cai, Sheng Yang, Shulin Wang

Keyword(s): Feature Selection, Feature Selection Method, Selection Method, Text Data, Information Distance, Feature Information


New approach for Arabic named entity recognition on social media based on feature selection using genetic algorithm

International Journal of Electrical and Computer Engineering (IJECE)
DOI: 10.11591/ijece.v11i2.pp1485-1497
2021, Vol 11 (2), pp. 1485

Author(s): Brahim Ait Benali, Soukaina Mihi, Ismail El Bazi, Nabil Laachfoubi

Keyword(s): Genetic Algorithm, Social Media, Feature Selection, Named Entity Recognition, Entity Recognition, Support Vector, Text Data, Impact Performance, Named Entity, Feature Selection Approach

Many features of different types can be extracted from the massive volume of data now available on social media. The growing demand for multimedia applications has been an essential factor in this regard, particularly in the case of text data. Using the full feature set for each of these activities is often time-consuming and can also hurt performance. Because the number of features is large, it is challenging to find a subset that is useful for a given task. In this paper, we employ a feature selection approach using a genetic algorithm to identify the optimized feature set. The best combination from the optimal feature set is then used to identify and classify Arabic named entities (NEs) with a support vector-based classifier. Experimental results show that our system reaches state-of-the-art performance for Arabic NER on social media and significantly outperforms previous systems.
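As a rough illustration of genetic-algorithm feature selection in general, the sketch below evolves bit masks over a feature set. It is not the system described in the abstract: the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), the parameter values, and `toy_fitness` are all assumptions. In a real setting the fitness would be a validation score of the downstream classifier trained on the selected features.

```python
import random

def ga_select(n_features, fitness, pop_size=20, generations=30,
              p_mut=0.05, seed=0):
    """Search for a high-fitness 0/1 feature mask (requires n_features >= 2)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def pick():
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, n_features)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g  # bit-flip mutation
                     for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best, fitness(best)

# Hypothetical stand-in for a classifier's validation score:
# features 0, 3 and 5 are "useful", and every selected feature costs 0.1.
USEFUL = {0, 3, 5}

def toy_fitness(mask):
    return sum(mask[i] for i in USEFUL) - 0.1 * sum(mask)

mask, score = ga_select(8, toy_fitness)
```

Because of elitism, the best fitness never decreases from one generation to the next, though the search is stochastic and is not guaranteed to reach the global optimum.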


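Optimasi Seleksi Fitur dengan Teknik Reduksi Dimensi pada Klasifikasi Abstrak Jurnal [Feature Selection Optimization with a Dimension Reduction Technique for Journal Abstract Classification]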

Jurnal Penelitian Enjiniring
DOI: 10.25042/jpe.052018.08
2019, Vol 22 (1), pp. 44-48

Author(s): Syukriyanto Latif

Keyword(s): Feature Selection, Dimension Reduction, Selection Process, Computation Time, Training Data, Intelligent Transport System, Text Data, Intelligent Transport, Average Accuracy, Bayes Algorithm

The purpose of this research is to determine the dimension reduction parameter value in feature selection that improves accuracy and reduces computation time. The system uses text mining to extract information from a set of documents. Word weighting and the term reduction technique Term Frequency Thresholding are used in the feature selection process, while the Naive Bayes algorithm is used for classification. Journal abstracts are categorized into three classes: Data Mining (DM), Intelligent Transport System (ITS), and Multimedia (MM). The test and training data together total 150 documents. The best classification results are obtained when the dimension reduction parameter value is 30%. Under that condition, the average accuracy is 87.33% with a computation time of 4 minutes 12 seconds.
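The pipeline in this abstract (frequency-based term reduction followed by Naive Bayes) can be sketched in a few lines. This is only an illustration under assumptions: the abstract does not define how its percentage parameter maps to a threshold, so the sketch uses a hypothetical `min_count` cutoff instead, and the toy documents, whitespace tokenization, and Laplace smoothing are likewise not taken from the paper.

```python
from collections import Counter
from math import log

def select_terms(docs, min_count=2):
    """Keep terms whose total frequency across all documents is >= min_count."""
    totals = Counter(tok for doc in docs for tok in doc.split())
    return {t for t, c in totals.items() if c >= min_count}

def train_nb(docs, labels, vocab):
    """Count documents per class and in-vocabulary term occurrences per class."""
    class_docs = Counter(labels)
    term_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        term_counts[label].update(t for t in doc.split() if t in vocab)
    return class_docs, term_counts

def predict_nb(doc, model, vocab):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""
    class_docs, term_counts = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for c in class_docs:
        total = sum(term_counts[c].values())
        score = log(class_docs[c] / n_docs)  # log prior
        for t in doc.split():
            if t in vocab:  # terms removed by thresholding are ignored
                score += log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label

# Toy corpus (hypothetical) using two of the abstract's class labels.
docs = [
    "clustering mining patterns",
    "mining association rules",
    "traffic vehicle routing",
    "vehicle traffic signal",
]
labels = ["DM", "DM", "ITS", "ITS"]

vocab = select_terms(docs, min_count=2)
model = train_nb(docs, labels, vocab)
pred_dm = predict_nb("mining frequent patterns", model, vocab)
pred_its = predict_nb("vehicle traffic flow", model, vocab)
```

Raising `min_count` shrinks the vocabulary, which reduces the work done per prediction but may discard informative rare terms. That is the accuracy versus computation time trade-off the abstract tunes.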


A Brief Study of Approaches to Text Feature Selection

Modern Technologies for Big Data Classification and Clustering - Advances in Data Mining and Database Management
DOI: 10.4018/978-1-5225-2805-0.ch009
2018, pp. 216-243

Author(s): Ravindra Babu Tallamaraju, Manas Kirti

Keyword(s): Feature Selection, Comparative Study, Social Networking, Text Data, Social Networking Services, Storage Devices, Text Feature, Wide Range, Efficiency And Effectiveness, Language Text

With the falling cost of storage devices, increasing amounts of data are being stored and processed to extract intelligence. Classification and clustering have been two major approaches to generating data abstractions. Over the last few years, text data has come to dominate the types of data shared and stored. Sources of such datasets include mobile data, e-commerce, and a wide range of continuously expanding social networking services. Within each of these sources, the nature of the data differs drastically, from formal language text to Twitter or SMS slang, which calls for different ways of processing the data to produce meaningful summaries. Such summaries can be used effectively for business advantage. Processing such data requires identifying an appropriate set of features for both efficiency and effectiveness. In this chapter, we discuss approaches to text feature selection and present a comparative study.


Generalized refined composite multiscale fuzzy entropy and multi-cluster feature selection based intelligent fault diagnosis of rolling bearing

ISA Transactions
DOI: 10.1016/j.isatra.2021.05.042
2021

Author(s): Jinde Zheng, Haiyang Pan, Jinyu Tong, Qingyun Liu

Keyword(s): Feature Selection, Fault Diagnosis, Rolling Bearing, Fuzzy Entropy, Intelligent Fault Diagnosis, Cluster Feature
