A Comparative Study on Feature Selection in Chinese Text Classification Problem

2013 ◽  
Vol 380-384 ◽  
pp. 2854-2857
Author(s):  
Hu Li ◽  
Peng Zou ◽  
Wei Hong Han

Information explosion brings lots of challenges to text classification. The dimension disaster led to a sharp increase of computational complexity and lower classification accuracy. Therefore, it is critical to use feature selection techniques before actual classification. Automatic classification of English text has been researched for many years, but little on Chinese text. In this paper, several classic feature selection methods, namely TF, IG and CHI, are compared on classifying Chinese text. Meanwhile, we take imbalanced data into consideration in the paper. Experimental results show that CHI performed better than IG and TF when the dataset is imbalanced, but no obvious difference on balanced data.

Author(s):  
VLADIMIR NIKULIN ◽  
TIAN-HSIANG HUANG ◽  
GEOFFREY J. MCLACHLAN

The method presented in this paper is novel as a natural combination of two mutually dependent steps. Feature selection is a key element (first step) in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier such as linear regression, support vector machine or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.


2020 ◽  
pp. 707-725
Author(s):  
Sujata Dash

Efficient classification and feature extraction techniques pave an effective way for diagnosing cancers from microarray datasets. It has been observed that the conventional classification techniques have major limitations in discriminating the genes accurately. However, such kind of problems can be addressed by an ensemble technique to a great extent. In this paper, a hybrid RotBagg ensemble framework has been proposed to address the problem specified above. This technique is an integration of Rotation Forest and Bagging ensemble which in turn preserves the basic characteristics of ensemble architecture i.e., diversity and accuracy. Three different feature selection techniques are employed to select subsets of genes to improve the effectiveness and generalization of the RotBagg ensemble. The efficiency is validated through five microarray datasets and also compared with the results of base learners. The experimental results show that the correlation based FRFR with PCA-based RotBagg ensemble form a highly efficient classification model.


2019 ◽  
Vol 45 (1) ◽  
pp. 11-14
Author(s):  
Zuhair Ali

Automated classification of text into predefined categories has always been considered as a vital method in thenatural language processing field. In this paper new methods based on Radial Basis Function (RBF) and Fuzzy Radial BasisFunction (FRBF) are used to solve the problem of text classification, where a set of features extracted for each sentencein the document collection these set of features introduced to FRBF and RBF to classify documents. Reuters 21578 datasetutilized for the purpose of text classification. The results showed the effectiveness of FRBF is better than RBF.


2022 ◽  
Vol 9 (1) ◽  
Author(s):  
Joffrey L. Leevy ◽  
John Hancock ◽  
Taghi M. Khoshgoftaar ◽  
Jared M. Peterson

AbstractThe recent years have seen a proliferation of Internet of Things (IoT) devices and an associated security risk from an increasing volume of malicious traffic worldwide. For this reason, datasets such as Bot-IoT were created to train machine learning classifiers to identify attack traffic in IoT networks. In this study, we build predictive models with Bot-IoT to detect attacks represented by dataset instances from the Information Theft category, as well as dataset instances from the data exfiltration and keylogging subcategories. Our contribution is centered on the evaluation of ensemble feature selection techniques (FSTs) on classification performance for these specific attack instances. A group or ensemble of FSTs will often perform better than the best individual technique. The classifiers that we use are a diverse set of four ensemble learners (Light GBM, CatBoost, XGBoost, and random forest (RF)) and four non-ensemble learners (logistic regression (LR), decision tree (DT), Naive Bayes (NB), and a multi-layer perceptron (MLP)). The metrics used for evaluating classification performance are area under the receiver operating characteristic curve (AUC) and Area Under the precision-recall curve (AUPRC). For the most part, we determined that our ensemble FSTs do not affect classification performance but are beneficial because feature reduction eases computational burden and provides insight through improved data visualization.


2018 ◽  
Vol 173 ◽  
pp. 03072
Author(s):  
Wu Mingqiang ◽  
Furong Chang ◽  
Kui Zhang

This paper mainly deals with the classification of text type data. The statistics show that more than 8000 articles have been reached in all kinds of documents retrieved by the optical network. However, there are few papers on the factors that affect the classification of text. The text classification method used is important, but the internal factors sometimes play a great role, and even affect the success or failure of the whole text classification. In order to make up for this deficiency, this paper selects the Rocchio algorithm as the classification method, mainly from the category clustering density, class complexity, category definition, stop words and document’s length five internal factors, we tested their influences on text classification by the experiment. Experiment shows that the clustering density is higher and the complexity of the lower class, class definition is higher, the higher the accuracy of text classification, text classification effect is better, and better effect to text stop words, the length of the text does not directly affect the effect of text classification, but according to the text classification algorithm is more suitable to choose the length of the document.


Sign in / Sign up

Export Citation Format

Share Document