A Comparative Study on Feature Selection in Chinese Text Classification Problem

The information explosion poses many challenges for text classification. The curse of dimensionality sharply increases computational complexity and lowers classification accuracy, so it is critical to apply feature selection before the actual classification. Automatic classification of English text has been researched for many years, but little work has addressed Chinese text. In this paper, several classic feature selection methods, namely TF, IG and CHI, are compared on Chinese text classification. We also take imbalanced data into consideration. Experimental results show that CHI performs better than IG and TF when the dataset is imbalanced, but that there is no obvious difference on balanced data.
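The CHI criterion named in this abstract can be illustrated with a minimal sketch. It assumes the standard chi-square statistic for a term and a category, computed from a 2x2 table of document counts. The function names and the toy corpus are hypothetical, and the tokens are assumed to be already segmented; the paper's own preprocessing and parameters are not given in the abstract.

```python
def chi_square(a, b, c, d):
    """Chi-square score of one term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (empty row or column) carries no information.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_square_scores(docs, category):
    """Score every term in `docs` against `category`.

    docs: list of (tokens, label) pairs (assumed input format).
    Returns a dict mapping each term to its chi-square score.
    """
    docs = [(set(tokens), label) for tokens, label in docs]
    vocab = set()
    for tokens, _ in docs:
        vocab |= tokens
    n_pos = sum(1 for _, label in docs if label == category)
    n_neg = len(docs) - n_pos
    scores = {}
    for term in vocab:
        a = sum(1 for tokens, label in docs if label == category and term in tokens)
        b = sum(1 for tokens, label in docs if label != category and term in tokens)
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    return scores


# Hypothetical toy corpus: two "sports" and two "finance" documents.
toy_docs = [
    (["football", "match"], "sports"),
    (["football", "team"], "sports"),
    (["stock", "market"], "finance"),
    (["stock", "match"], "finance"),
]
toy_scores = chi_square_scores(toy_docs, "sports")
```

Terms would then be ranked by score and the top-scoring ones kept as features. In the toy corpus, "football" appears only in sports documents and scores highest, while "match" is split evenly between the two categories and scores zero.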

2012 First National Conference for Engineering Sciences (FNCES 2012) ◽

10.1109/nces.2012.6544065 ◽

2012 ◽

Author(s):

Hu Li ◽

Peng Zou ◽

Weihong Han

Keyword(s):

Feature Selection ◽

Comparative Study ◽

Chinese Text ◽

Text Classification ◽

Classification Problem ◽

Chinese Text Classification

CLASSIFICATION OF HIGH-DIMENSIONAL MICROARRAY DATA WITH A TWO-STEP PROCEDURE VIA A WILCOXON CRITERION AND MULTILAYER PERCEPTRON

International Journal of Computational Intelligence and Applications ◽

10.1142/s1469026811002969 ◽

2011 ◽

Vol 10 (01) ◽

pp. 1-14

Author(s):

VLADIMIR NIKULIN ◽

TIAN-HSIANG HUANG ◽

GEOFFREY J. MCLACHLAN

Keyword(s):

Data Mining ◽

Feature Selection ◽

High Dimensional ◽

Second Step ◽

Support Vector ◽

Step Procedure ◽

Leave One Out ◽

Natural Combination ◽

Feature Selection Techniques

The method presented in this paper is a novel, natural combination of two mutually dependent steps. Feature selection, the first step, is a key element of our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier, such as linear regression, a support vector machine or a neural network. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation-type Wilcoxon-based criterion for all final submissions. The method was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.
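The first step described in this abstract can be illustrated with a rough sketch of rank-based feature ranking. The abstract does not define the authors' separation-type criterion, so the code below substitutes a generic Wilcoxon rank-sum (Mann-Whitney U) separation score for two classes labelled 0 and 1. All function names and the scoring formula are assumptions made for illustration only.

```python
def rank_sum_separation(values, labels):
    """Separation of one feature between classes 0 and 1.

    Computes the Mann-Whitney U statistic for class 1 from average ranks
    and returns |U / (n1 * n0) - 0.5|, which ranges from 0.0 (no
    separation) to 0.5 (perfect separation).
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values and give each the average 1-based rank.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1 = sum(1 for _, label in pairs if label == 1)
    n0 = n - n1
    if n1 == 0 or n0 == 0:
        return 0.0
    rank_sum_1 = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u1 = rank_sum_1 - n1 * (n1 + 1) / 2
    return abs(u1 / (n1 * n0) - 0.5)


def select_top_features(rows, labels, k):
    """Return the indices of the k best-separating feature columns."""
    n_features = len(rows[0])
    scores = [
        rank_sum_separation([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]
```

The selected column indices would then be passed to whichever classifier is used in the second step.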

Impact of feature selection techniques in Text Classification: An Experimental study

JOURNAL OF MECHANICS OF CONTINUA AND MATHEMATICAL SCIENCES ◽

10.26782/jmcms.spl.3/2019.09.00004 ◽

2019 ◽

Vol 1 (3) ◽

Author(s):

S Rahamat Basha

Keyword(s):

Experimental Study ◽

Feature Selection ◽

Text Classification ◽

Feature Selection Techniques

Performance Analysis of Feature Selection Techniques for Text Classification

International Research Journal on Advanced Science Hub ◽

10.47392/irjash.2020.259 ◽

2020 ◽

Vol 2 (Special Issue ICSTM 12S) ◽

pp. 44-50

Author(s):

Hemlata Patel ◽

Dhanraj Verma

Keyword(s):

Feature Selection ◽

Performance Analysis ◽

Text Classification ◽

Feature Selection Techniques

Hybrid Ensemble Learning Methods for Classification of Microarray Data

Data Analytics in Medicine ◽

10.4018/978-1-7998-1204-3.ch038 ◽

2020 ◽

pp. 707-725

Author(s):

Sujata Dash

Keyword(s):

Feature Extraction ◽

Feature Selection ◽

Microarray Data ◽

Classification Model ◽

Rotation Forest ◽

Ensemble Technique ◽

Basic Characteristics ◽

Microarray Datasets ◽

Feature Selection Techniques

Efficient classification and feature extraction techniques pave an effective way for diagnosing cancers from microarray datasets. It has been observed that conventional classification techniques have major limitations in discriminating genes accurately. However, such problems can be addressed to a great extent by an ensemble technique. In this paper, a hybrid RotBagg ensemble framework is proposed to address this problem. The technique integrates the Rotation Forest and Bagging ensembles, which in turn preserves the basic characteristics of an ensemble architecture, i.e., diversity and accuracy. Three different feature selection techniques are employed to select subsets of genes to improve the effectiveness and generalization of the RotBagg ensemble. The efficiency is validated on five microarray datasets and compared with the results of the base learners. The experimental results show that the correlation-based FRFR with the PCA-based RotBagg ensemble forms a highly efficient classification model.

A Comparative Study of Recent Feature Selection Techniques Used in Text Classification

IOT with Smart Systems - Smart Innovation, Systems and Technologies ◽

10.1007/978-981-16-3945-6_41 ◽

2022 ◽

pp. 423-436

Author(s):

Gunjan Singh ◽

Rashmi Priya

Keyword(s):

Feature Selection ◽

Comparative Study ◽

Text Classification ◽

Feature Selection Techniques

Dynamic feature selection strategy in incremental Chinese text classification

2012 2nd International Conference on Applied Robotics for the Power Industry (CARPI) ◽

10.1109/carpi.2012.6356526 ◽

2012 ◽

Author(s):

Dan Yang ◽

Xinghua Fan

Keyword(s):

Feature Selection ◽

Chinese Text ◽

Text Classification ◽

Selection Strategy ◽

Dynamic Feature ◽

Chinese Text Classification

TEXT CLASSIFICATION BASED ON FUZZY RADIAL BASIS FUNCTION

Iraqi Journal for Computers and Informatics ◽

10.25195/ijci.v45i1.40 ◽

2019 ◽

Vol 45 (1) ◽

pp. 11-14

Author(s):

Zuhair Ali

Keyword(s):

Radial Basis Function ◽

Language Processing ◽

Text Classification ◽

Basis Function ◽

Automated Classification ◽

New Methods ◽

Radial Basis ◽

Document Collection ◽

Better Than

Automated classification of text into predefined categories has always been considered a vital method in the natural language processing field. In this paper, new methods based on the Radial Basis Function (RBF) and the Fuzzy Radial Basis Function (FRBF) are used to solve the problem of text classification. A set of features is extracted for each sentence in the document collection, and these features are fed to FRBF and RBF to classify the documents. The Reuters 21578 dataset is used for the text classification experiments. The results show that FRBF is more effective than RBF.

IoT information theft prediction using ensemble feature selection

Journal Of Big Data ◽

10.1186/s40537-021-00558-z ◽

2022 ◽

Vol 9 (1) ◽

Author(s):

Joffrey L. Leevy ◽

John Hancock ◽

Taghi M. Khoshgoftaar ◽

Jared M. Peterson

Keyword(s):

Feature Selection ◽

Operating Characteristic ◽

Characteristic Curve ◽

Classification Performance ◽

Feature Reduction ◽

Security Risk ◽

Precision Recall Curve ◽

Iot Devices ◽

Feature Selection Techniques ◽

Better Than

Recent years have seen a proliferation of Internet of Things (IoT) devices and an associated security risk from an increasing volume of malicious traffic worldwide. For this reason, datasets such as Bot-IoT were created to train machine learning classifiers to identify attack traffic in IoT networks. In this study, we build predictive models with Bot-IoT to detect attacks represented by dataset instances from the Information Theft category, as well as dataset instances from the data exfiltration and keylogging subcategories. Our contribution is centered on the evaluation of ensemble feature selection techniques (FSTs) on classification performance for these specific attack instances. A group or ensemble of FSTs will often perform better than the best individual technique. The classifiers that we use are a diverse set of four ensemble learners (LightGBM, CatBoost, XGBoost, and random forest (RF)) and four non-ensemble learners (logistic regression (LR), decision tree (DT), Naive Bayes (NB), and a multi-layer perceptron (MLP)). The metrics used for evaluating classification performance are area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPRC). For the most part, we determined that our ensemble FSTs do not affect classification performance but are beneficial because feature reduction eases computational burden and provides insight through improved data visualization.

Research on the internal influence factors of the text multi-classification problem

MATEC Web of Conferences ◽

10.1051/matecconf/201817303072 ◽

2018 ◽

Vol 173 ◽

pp. 03072

Author(s):

Wu Mingqiang ◽

Furong Chang ◽

Kui Zhang

Keyword(s):

Text Classification ◽

Optical Network ◽

Influence Factors ◽

Lower Class ◽

Classification Problem ◽

Classification Method ◽

Internal Factors ◽

Text Type ◽

Class Definition

This paper mainly deals with the classification of text data. Statistics show that more than 8000 articles of all kinds have been retrieved through the optical network. However, few papers examine the factors that affect text classification. The classification method used is important, but internal factors sometimes play a great role and can even determine the success or failure of the whole classification. To address this gap, this paper selects the Rocchio algorithm as the classification method and tests experimentally the influence of five internal factors on text classification: category clustering density, class complexity, category definition, stop words and document length. The experiments show that the higher the clustering density, the lower the class complexity and the clearer the class definition, the higher the accuracy of text classification; stop-word processing also improves the results. Document length does not directly affect text classification, but a document length suited to the classification algorithm should be chosen.
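The Rocchio classification method named in this abstract can be sketched as a nearest-centroid classifier. The version below assumes raw term counts as features and cosine similarity as the matching rule. The abstract specifies neither the term weighting nor the similarity measure, so both choices, along with all function names, are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_rocchio(docs):
    """Build one centroid per class.

    docs: iterable of (tokens, label) pairs (assumed input format).
    Returns {label: {term: mean term count over that class's documents}}.
    """
    sums = defaultdict(Counter)
    counts = Counter()
    for tokens, label in docs:
        sums[label].update(tokens)
        counts[label] += 1
    return {
        label: {term: total / counts[label] for term, total in vec.items()}
        for label, vec in sums.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(centroids, tokens):
    """Assign the label whose centroid is most similar to the document."""
    vec = Counter(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

Factors such as stop-word handling or document length, which the paper studies, would change the token lists passed to `train_rocchio` and `classify` rather than the classifier itself.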
