scholarly journals A New Feature Selection Method for Text Classification Based on Independent Feature Space Search

2020 ◽  
Vol 2020 ◽  
pp. 1-14 ◽  
Author(s):  
Yong Liu ◽  
Shenggen Ju ◽  
Junfeng Wang ◽  
Chong Su

Feature selection method is designed to select the representative feature subsets from the original feature set by different evaluation of feature relevance, which focuses on reducing the dimension of the features while maintaining the predictive accuracy of a classifier. In this study, we propose a feature selection method for text classification based on independent feature space search. Firstly, a relative document-term frequency difference (RDTFD) method is proposed to divide the features in all text documents into two independent feature sets according to the features’ ability to discriminate the positive and negative samples, which has two important functions: one is to improve the high class correlation of the features and reduce the correlation between the features and the other is to reduce the search range of feature space and maintain appropriate feature redundancy. Secondly, the feature search strategy is used to search the optimal feature subset in independent feature space, which can improve the performance of text classification. Finally, we evaluate several experiments conduced on six benchmark corpora, the experimental results show the RDTFD method based on independent feature space search is more robust than the other feature selection methods.

Author(s):  
GULDEN UCHYIGIT ◽  
KEITH CLARK

Text classification is the problem of classifying a set of documents into a pre-defined set of classes. A major problem with text classification problems is the high dimensionality of the feature space. Only a small subset of these words are feature words which can be used in determining a document's class, while the rest adds noise and can make the results unreliable and significantly increase computational time. A common approach in dealing with this problem is feature selection where the number of words in the feature space are significantly reduced. In this paper we present the experiments of a comparative study of feature selection methods used for text classification. Ten feature selection methods were evaluated in this study including the new feature selection method, called the GU metric. The other feature selection methods evaluated in this study are: Chi-Squared (χ2) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F1 and F2 scores. The experiments were performed on the 20 Newsgroups data sets with the Naive Bayesian Probabilistic Classifier.


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Heyong Wang ◽  
Ming Hong

With the rapid development of web applications such as social network, a large amount of electric text data is accumulated and available on the Internet, which causes increasing interests in text mining. Text classification is one of the most important subfields of text mining. In fact, text documents are often represented as a high-dimensional sparse document term matrix (DTM) before classification. Feature selection is essential and vital for text classification due to high dimensionality and sparsity of DTM. An efficient feature selection method is capable of both reducing dimensions of DTM and selecting discriminative features for text classification. Laplacian Score (LS) is one of the unsupervised feature selection methods and it has been successfully used in areas such as face recognition. However, LS is unable to select discriminative features for text classification and to effectively reduce the sparsity of DTM. To improve it, this paper proposes an unsupervised feature selection method named Distance Variance Score (DVS). DVS uses feature distance contribution (a ratio) to rank the importance of features for text documents so as to select discriminative features. Experimental results indicate that DVS is able to select discriminative features and reduce the sparsity of DTM. Thus, it is much more efficient than LS.


2016 ◽  
Vol 78 (8-2) ◽  
Author(s):  
Jafreezal Jaafar ◽  
Zul Indra ◽  
Nurshuhaini Zamin

Text classification (TC) provides a better way to organize information since it allows better understanding and interpretation of the content. It deals with the assignment of labels into a group of similar textual document. However, TC research for Asian language documents is relatively limited compared to English documents and even lesser particularly for news articles. Apart from that, TC research to classify textual documents in similar morphology such Indonesian and Malay is still scarce. Hence, the aim of this study is to develop an integrated generic TC algorithm which is able to identify the language and then classify the category for identified news documents. Furthermore, top-n feature selection method is utilized to improve TC performance and to overcome the online news corpora classification challenges: rapid data growth of online news documents, and the high computational time. Experiments were conducted using 280 Indonesian and 280 Malay online news documents from the year 2014 – 2015. The classification method is proven to produce a good result with accuracy rate of up to 95.63% for language identification, and 97.5%% for category classification. While the category classifier works optimally on n = 60%, with an average of 35 seconds computational time. This highlights that the integrated generic TC has advantage over manual classification, and is suitable for Indonesian and Malay news classification.


Author(s):  
ShuRui Li ◽  
Jing Jin ◽  
Ian Daly ◽  
Chang Liu ◽  
Andrzej Cichocki

Abstract Brain–computer interface (BCI) systems decode electroencephalogram signals to establish a channel for direct interaction between the human brain and the external world without the need for muscle or nerve control. The P300 speller, one of the most widely used BCI applications, presents a selection of characters to the user and performs character recognition by identifying P300 event-related potentials from the EEG. Such P300-based BCI systems can reach good levels of accuracy but are difficult to use in day-to-day life due to redundancy and noisy signal. A room for improvement should be considered. We propose a novel hybrid feature selection method for the P300-based BCI system to address the problem of feature redundancy, which combines the Menger curvature and linear discriminant analysis. First, selected strategies are applied separately to a given dataset to estimate the gain for application to each feature. Then, each generated value set is ranked in descending order and judged by a predefined criterion to be suitable in classification models. The intersection of the two approaches is then evaluated to identify an optimal feature subset. The proposed method is evaluated using three public datasets, i.e., BCI Competition III dataset II, BNCI Horizon dataset, and EPFL dataset. Experimental results indicate that compared with other typical feature selection and classification methods, our proposed method has better or comparable performance. Additionally, our proposed method can achieve the best classification accuracy after all epochs in three datasets. In summary, our proposed method provides a new way to enhance the performance of the P300-based BCI speller.


2020 ◽  
Vol 10 (2) ◽  
pp. 370-379 ◽  
Author(s):  
Jie Cai ◽  
Lingjing Hu ◽  
Zhou Liu ◽  
Ke Zhou ◽  
Huailing Zhang

Background: Mild cognitive impairment (MCI) patients are a high-risk group for Alzheimer's disease (AD). Each year, the diagnosed of 10–15% of MCI patients are converted to AD (MCI converters, MCI_C), while some MCI patients remain relatively stable, and unconverted (MCI stable, MCI_S). MCI patients are considered the most suitable population for early intervention treatment for dementia, and magnetic resonance imaging (MRI) is clinically the most recommended means of imaging examination. Therefore, using MRI image features to reliably predict the conversion from MCI to AD can help physicians carry out an effective treatment plan for patients in advance so to prevent or slow down the development of dementia. Methods: We proposed an embedded feature selection method based on the least squares loss function and within-class scatter to select the optimal feature subset. The optimal subsets of features were used for binary classification (AD, MCI_C, MCI_S, normal control (NC) in pairs) based on a support vector machine (SVM), and the optimal 3-class features were used for 3-class classification (AD, MCI_C, MCI_S, NC in triples) based on one-versus-one SVMs (OVOSVMs). To ensure the insensitivity of the results to the random train/test division, a 10-fold cross-validation has been repeated for each classification. Results: Using our method for feature selection, only 7 features were selected from the original 90 features. With using the optimal subset in the SVM, we classified MCI_C from MCI_S with an accuracy, sensitivity, and specificity of 71.17%, 68.33% and 73.97%, respectively. In comparison, in the 3-class classification (AD vs. MCI_C vs. MCI_S) with OVOSVMs, our method selected 24 features, and the classification accuracy was 81.9%. The feature selection results were verified to be identical to the conclusions of the clinical diagnosis. Our feature selection method achieved the best performance, comparing with the existing methods using lasso and fused lasso for feature selection. Conclusion: The results of this study demonstrate the potential of the proposed approach for predicting the conversion from MCI to AD by identifying the affected brain regions undergoing this conversion.


2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Canyi Huang ◽  
Keding Li ◽  
Jianqiang Du ◽  
Bin Nie ◽  
Guoliang Xu ◽  
...  

The basic experimental data of traditional Chinese medicine are generally obtained by high-performance liquid chromatography and mass spectrometry. The data often show the characteristics of high dimensionality and few samples, and there are many irrelevant features and redundant features in the data, which bring challenges to the in-depth exploration of Chinese medicine material information. A hybrid feature selection method based on iterative approximate Markov blanket (CI_AMB) is proposed in the paper. The method uses the maximum information coefficient to measure the correlation between features and target variables and achieves the purpose of filtering irrelevant features according to the evaluation criteria, firstly. The iterative approximation Markov blanket strategy analyzes the redundancy between features and implements the elimination of redundant features and then selects an effective feature subset finally. Comparative experiments using traditional Chinese medicine material basic experimental data and UCI’s multiple public datasets show that the new method has a better advantage to select a small number of highly explanatory features, compared with Lasso, XGBoost, and the classic approximate Markov blanket method.


2019 ◽  
Vol 9 (4) ◽  
pp. 665 ◽  
Author(s):  
Lu Zhang ◽  
Qingling Duan

Multi-label text classification refers to a text divided into multiple categories simultaneously, which corresponds to a text associated with multiple topics in the real world. The feature space generated by text data has the characteristics of high dimensionality and sparsity. Feature selection is an efficient technology that removes useless and redundant features, reduces the dimension of the feature space, and avoids dimension disaster. A feature selection method for multi-label text based on feature importance is proposed in this paper. Firstly, multi-label texts are transformed into single-label texts using the label assignment method. Secondly, the importance of each feature is calculated using the method based on Category Contribution (CC). Finally, features with higher importance are selected to construct the feature space. In the proposed method, the feature importance is calculated from the perspective of the category, which ensures the selected features have strong category discrimination ability. Specifically, the contributions of the features to each category from two aspects of inter-category and intra-category are calculated, then the importance of the features is obtained with the combination of them. The proposed method is tested on six public data sets and the experimental results are good, which demonstrates the effectiveness of the proposed method.


2018 ◽  
Vol 8 (11) ◽  
pp. 2143 ◽  
Author(s):  
Xianghong Tang ◽  
Jiachen Wang ◽  
Jianguang Lu ◽  
Guokai Liu ◽  
Jiadui Chen

Effective feature selection can help improve the classification performance in bearing fault diagnosis. This paper proposes a novel feature selection method based on bearing fault diagnosis called Feature-to-Feature and Feature-to-Category- Maximum Information Coefficient (FF-FC-MIC), which considers the relevance among features and relevance between features and fault categories by exploiting the nonlinearity capturing capability of maximum information coefficient. In this method, a weak correlation feature subset obtained by a Feature-to-Feature-Maximum Information Coefficient (FF-MIC) matrix and a strong correlation feature subset obtained by a Feature-to-Category-Maximum Information Coefficient (FC-MIC) matrix are merged into a final diagnostic feature set by an intersection operation. To evaluate the proposed FF-FC-MIC method, vibration data collected from two bearing fault experiment platforms (CWRU dataset and CUT-2 dataset) were employed. Experimental results showed that accuracy of FF-FC-MIC can achieve 97.50%, and 98.75% on the CWRU dataset at the motor speeds of 1750 rpm, and 1772 rpm, respectively, and reach 91.75%, 94.69%, and 99.07% on CUT-2 dataset at the motor speeds of 2000 rpm, 2500 rpm, 3000 rpm, respectively. A significant improvement of FF-FC-MIC has been confirmed, since the p-values between FF-FC-MIC and the other methods are 1.166 × 10 − 3 , 2.509 × 10 − 5 , and 3.576 × 10 − 2 , respectively. Through comparison with other methods, FF-FC-MIC not only exceeds each of the baseline feature selection method in diagnosis accuracy, but also reduces the number of features.


Sign in / Sign up

Export Citation Format

Share Document