Multi-Step Iterative Algorithm for Feature Selection on Dynamic Documents

2016 ◽  
Vol 6 (2) ◽  
pp. 24-40 ◽  
Author(s):  
Prafulla Bharat Bafna ◽  
Shailaja Shirwaikar ◽  
Dhanya Pramod

The authors propose a clustering-based multi-step iterative algorithm. Its key step groups terms by synonyms, exploiting a semantic relativity measure between terms. The term frequency of each synonym group is computed by weighting every term occurring in the document by its relativity to the group's parent term. This raises the importance of terms that individually appear infrequently but together show a strong presence. The authors ran experiments on several real and artificial datasets, including NEWS 20, Reuters, emails, and research papers on different topics. The resulting entropy shows that the algorithm gives improved results on sets of well-articulated documents, such as research papers. The gains are marginal on documents whose message is emphasized by repetition of terms, particularly rapidly generated documents such as emails. The authors also observed that newly arriving documents are mapped appropriately based on their proximity to the semantic groups.
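A minimal Python sketch of the synonym-group frequency step described above (the synonym map, relativity scores, and function names here are hypothetical illustrations, not the authors' implementation):

```python
from collections import defaultdict

def grouped_term_frequency(doc_terms, synonym_groups, relativity):
    """Aggregate term frequency over synonym groups, weighting each
    occurrence by its semantic relativity to the group's parent term."""
    term_to_parent = {t: parent
                      for parent, terms in synonym_groups.items()
                      for t in terms}
    group_tf = defaultdict(float)
    for term in doc_terms:
        parent = term_to_parent.get(term)
        if parent is not None:
            # weight this occurrence by relativity(parent, term)
            group_tf[parent] += relativity.get((parent, term), 1.0)
    return dict(group_tf)

# Hypothetical synonym group with relativity scores to the parent "vehicle"
groups = {"vehicle": ["car", "automobile", "vehicle"]}
rel = {("vehicle", "car"): 0.9, ("vehicle", "automobile"): 0.8,
       ("vehicle", "vehicle"): 1.0}
doc = ["car", "automobile", "car", "train"]
tf = grouped_term_frequency(doc, groups, rel)  # vehicle: 0.9 + 0.8 + 0.9
```

Here "car" and "automobile" each occur too rarely to matter individually, but the group accumulates a weighted count of 2.6, reflecting the strong joint presence the abstract describes.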

2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yifei Chen ◽  
Yuxing Sun ◽  
Bing-Qing Han

Protein interaction article classification is a text classification task in the biological domain: determining which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used to reduce the dimensionality of features and speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measures of document frequency and term frequency; one potential drawback of these methods is that they treat features in isolation. Hence, we first design a similarity measure over context information that takes word co-occurrences and phrase chunks around the features into account. We then substitute this context similarity for document and term frequency in the importance measure of the features, yielding new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.
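One plausible reading of the context-similarity idea, sketched below with hypothetical data and names: build a bag-of-words vector from a small token window around each occurrence of a feature, then compare vectors with cosine similarity. A feature whose contexts differ sharply between the two article classes is likely distinctive:

```python
from collections import Counter
import math

def context_vector(docs, feature, window=2):
    """Bag of tokens co-occurring with `feature` within +/- `window`
    positions, pooled over all documents."""
    ctx = Counter()
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok == feature:
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        ctx[tokens[j]] += 1
    return ctx

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical toy corpora: articles that do / do not describe interactions
pos_docs = [["proteinA", "binds", "proteinB"],
            ["proteinA", "interacts", "with", "proteinB"]]
neg_docs = [["proteinA", "is", "expressed", "in", "liver"]]
sim = cosine(context_vector(pos_docs, "proteinA"),
             context_vector(neg_docs, "proteinA"))
```

In this toy case the contexts share no tokens, so the similarity is 0 and "proteinA"'s surroundings separate the classes cleanly.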


2014 ◽  
Vol 45 ◽  
pp. 1-10 ◽  
Author(s):  
Deqing Wang ◽  
Hui Zhang ◽  
Rui Liu ◽  
Weifeng Lv ◽  
Datao Wang

2020 ◽  
Vol 167 ◽  
pp. 110616
Author(s):  
Jinfu Chen ◽  
Patrick Kwaku Kudjo ◽  
Solomon Mensah ◽  
Selasie Aformaley Brown ◽  
George Akorfu

2020 ◽  
Vol 16 (3) ◽  
pp. 168-182
Author(s):  
Zi-Hung You ◽  
Ya-Han Hu ◽  
Chih-Fong Tsai ◽  
Yen-Ming Kuo

Opinion mining focuses on extracting polarity information from texts. For textual term representation, different feature selection methods, e.g. term frequency (TF) or term frequency-inverse document frequency (TF-IDF), can yield diverse numbers of text features. In text classification, however, a selected training set may contain noisy documents (or outliers), which can degrade classification performance. To solve this problem, instance selection can be adopted to filter out unrepresentative training documents. This article therefore investigates opinion mining performance with the feature selection and instance selection steps considered together. Two combination processes, performing feature selection and instance selection in different orders, were compared. Specifically, two feature selection methods, namely TF and TF-IDF, and two instance selection methods, namely DROP3 and IB3, were employed for comparison. Experimental results from sentiment classifiers built on three Twitter datasets showed that TF-IDF followed by DROP3 performs best.
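DROP3 itself combines a noise filter with a reduction pass; the sketch below (with hypothetical toy data) shows only the initial noise-filtering step, an edited-nearest-neighbour pass of the kind DROP3 applies first to discard unrepresentative training documents after vectorisation:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbour: keep an instance only if the majority of
    its k nearest neighbours share its label. DROP3 runs a pass like this
    to remove noisy instances before its reduction step."""
    keep = []
    for i in range(len(X)):
        neighbours = sorted(range(len(X)), key=lambda j: euclidean(X[i], X[j]))
        votes = [y[j] for j in neighbours[1:k + 1]]  # skip self at index 0
        if votes.count(y[i]) > k // 2:
            keep.append(i)
    return keep

# Toy 2-D "document vectors": two clean clusters plus one mislabelled point
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.5)]
y = [0, 0, 0, 1, 1, 1, 1]
kept = enn_filter(X, y)  # → [0, 1, 2, 3, 4, 5]: index 6 is dropped
```

The mislabelled point at (0.5, 0.5) sits inside the class-0 cluster, so all of its neighbours outvote its label and it is filtered out.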


2014 ◽  
Vol 548-549 ◽  
pp. 1102-1109
Author(s):  
Qiang Li ◽  
Liang He ◽  
Xin Lin

Feature selection is an important process for choosing a subset of features relevant to a particular application in document classification. Terms that occur unevenly across categories carry strong discriminative information for categorization. Firstly, based on the categorical document frequency probability (CTFP), a CTFP_VM feature selection algorithm was designed. Secondly, a maximum term frequency conditional distribution factor was proposed to further improve the CTFP_VM criterion. Document categorization experiments were performed with SVM classifiers on the well-known Reuters-21578 and 20news-18828 corpora, as unbalanced and balanced corpora respectively. The experiments compare the novel methods with other conventional feature selection algorithms, and the proposed method yields an excellent feature set for document categorization.
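The exact CTFP_VM formula is not reproduced in the abstract; the sketch below is only a stand-in for the underlying idea, scoring a term by how unevenly its per-category document-frequency probability is distributed (all names and data are hypothetical):

```python
def unevenness_score(term, docs_by_cat):
    """Score a term by how unevenly its document-frequency probability is
    spread across categories: max per-category probability over the mean.
    A stand-in for the uneven-occurrence idea, not the CTFP_VM formula."""
    probs = []
    for docs in docs_by_cat.values():
        df = sum(1 for d in docs if term in d)
        probs.append(df / len(docs))
    mean = sum(probs) / len(probs)
    return max(probs) / mean if mean else 0.0

# Hypothetical two-category corpus; each document is a set of terms
docs_by_cat = {
    "sport":    [{"goal", "match"}, {"goal", "team"}],
    "politics": [{"vote", "law"}, {"law", "goal"}],
}
# "law" appears only in politics → maximally uneven, score 2.0
# "goal" appears in both categories → closer to the minimum score of 1.0
```

A term concentrated in one category scores high (here 2.0 for "law"), while a term spread evenly scores near 1.0, matching the intuition that unevenly distributed terms discriminate best.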


2017 ◽  
Vol 2 (1) ◽  
pp. 14
Author(s):  
Yono Cahyono

Social media use is now enormous; everyone expresses opinions, comments, criticism, and more. These data provide valuable information that can help people or organizations make decisions. The sheer volume of data makes it impossible for humans to read and analyze it manually. Sentiment analysis is the process of analyzing, understanding, and classifying opinions, evaluations, judgments, attitudes, and emotions toward a particular entity such as a product, service, organization, individual, event, or topic, in order to extract information. This study aims to classify Indonesian-language tweets on the social media platform Twitter into positive, negative, and neutral categories. A Naïve Bayes Classifier (NBC) with Particle Swarm Optimization (PSO) feature selection was applied to the dataset to remove less relevant attributes during the classification process. Test results show that the Naïve Bayes Classifier with PSO feature selection, using the term frequency (TF) parameter, achieved an accuracy of 97.48%.
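A minimal binary-PSO sketch of the feature selection step (the fitness function below is a hypothetical stand-in for the Naïve Bayes classification accuracy the study optimizes; particle counts and coefficients are illustrative defaults):

```python
import math
import random

def binary_pso(fitness, n_features, n_particles=10, iters=20, seed=0):
    """Minimal binary PSO: each particle is a 0/1 feature mask; velocities
    are squashed through a sigmoid to give per-bit flip probabilities."""
    rnd = random.Random(seed)
    pos = [[rnd.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pfit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pfit[i])
    gbest, gfit = pbest[g][:], pfit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rnd.random(), rnd.random()
                # cognitive + social pull toward personal and global bests
                vel[i][d] += 2 * r1 * (pbest[i][d] - pos[i][d]) \
                           + 2 * r2 * (gbest[d] - pos[i][d])
                pos[i][d] = 1 if rnd.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            f = fitness(pos[i])
            if f > pfit[i]:
                pbest[i], pfit[i] = pos[i][:], f
                if f > gfit:
                    gbest, gfit = pos[i][:], f
    return gbest, gfit

# Stand-in fitness: features 0 and 2 are "informative", extras are penalized
fit_fn = lambda mask: mask[0] + mask[2] - 0.1 * sum(mask)
best, best_fit = binary_pso(fit_fn, n_features=4)
```

In the real pipeline, `fitness` would train and evaluate the NBC on TF features restricted to the mask, so the swarm converges on a subset that drops the less relevant attributes.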


2019 ◽  
Vol 8 (3) ◽  
pp. 2138-2143

Aspect-oriented sentiment analysis proceeds in two phases: identifying aspect terms in a review and determining the related opinion. Features play an important role in determining the accuracy of the model, and feature extraction and feature selection techniques contribute to increased classification accuracy. Feature selection strategies reduce computation time, improve prediction performance, and provide a better understanding of the data in machine learning and pattern recognition applications. This work focuses specifically on aspect extraction from a restaurant review dataset but can also be applied to other datasets. We propose a multivariate filter strategy of feature selection that works on lemma features; it selects relevant features and avoids redundant ones. Initially, the extracted features undergo preprocessing, and a term-frequency matrix is generated containing the occurrence count of each feature with respect to each aspect category. In the next phase, different feature selection strategies are applied: selecting features based on correlation, on weighted term frequency, and on weighted term frequency with the correlation coefficient. The performance of the weighted term frequency with correlation coefficient approach is compared with the existing system and shows a significant improvement in F1 score.
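A sketch of one way to realize the weighted-term-frequency-with-correlation idea (hypothetical data and names; the paper's exact weighting over the term-frequency matrix may differ): weight each lemma's total frequency by the Pearson correlation between its per-document counts and membership in the target aspect category:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def weighted_tf_with_correlation(counts, labels, category):
    """For each term, weight its total term frequency by the correlation
    between its per-document counts and membership in `category`."""
    ind = [1.0 if l == category else 0.0 for l in labels]
    terms = {t for c in counts for t in c}
    scores = {}
    for t in terms:
        vec = [c.get(t, 0) for c in counts]
        scores[t] = sum(vec) * pearson(vec, ind)
    return scores

# Hypothetical restaurant reviews as per-document term counts
counts = [{"pizza": 2}, {"pizza": 1, "service": 1}, {"service": 2}]
labels = ["food", "food", "staff"]
scores = weighted_tf_with_correlation(counts, labels, "food")
```

Terms that co-vary with the "food" category (like "pizza") get a positive score, while terms tied to other categories score negatively, so ranking by score both selects and weights features in one step.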


2019 ◽  
Vol 6 (1) ◽  
pp. 138-149
Author(s):  
Ukhti Ikhsani Larasati ◽  
Much Aziz Muslim ◽  
Riza Arifudin ◽  
Alamsyah Alamsyah

Data processing can be done with text mining techniques; processing large text data requires a machine to mine opinions, whether positive or negative. Sentiment analysis applies text mining methods to determine whether the textual content of a dataset is positive or negative. The support vector machine is one classification algorithm that can be used for sentiment analysis; however, it performs less well on large data. In addition, the text mining process is constrained by the number of attributes used: many attributes reduce classifier performance and thus yield low accuracy. The purpose of this research is to increase support vector machine accuracy by implementing feature selection and feature weighting. Feature selection removes a large number of irrelevant attributes; in this study, the top K = 500 features are selected. Feature weighting is then performed to calculate the weight of each selected attribute. The feature selection method used is the chi-square statistic, and feature weighting uses Term Frequency-Inverse Document Frequency (TF-IDF). Experiments in Matlab R2017b with 10-fold cross-validation show that integrating the support vector machine with the chi-square statistic and TF-IDF increases accuracy by 11.5%: the support vector machine without chi-square statistic and TF-IDF achieved 68.7% accuracy, while applying them raised accuracy to 80.2%.
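A pure-Python sketch of the two preprocessing steps on hypothetical toy data (the study itself used Matlab): a chi-square score over the 2x2 term/class contingency table ranks candidate features, and TF-IDF then weights the survivors:

```python
import math

def chi_square(term, docs, labels):
    """Chi-square statistic from the 2x2 contingency table of
    (term present?) x (class), used to rank candidate features."""
    n = len(docs)
    a = sum(1 for d, l in zip(docs, labels) if term in d and l == 1)
    b = sum(1 for d, l in zip(docs, labels) if term in d and l == 0)
    c = sum(1 for d, l in zip(docs, labels) if term not in d and l == 1)
    e = n - a - b - c  # term absent, class 0
    num = n * (a * e - b * c) ** 2
    den = (a + b) * (c + e) * (a + c) * (b + e)
    return num / den if den else 0.0

def tf_idf(term, doc, docs):
    """Term frequency in one document times inverse document frequency."""
    df = sum(1 for d in docs if term in d)
    return doc.count(term) * math.log(len(docs) / df) if df else 0.0

# Toy corpus: label 1 = positive review, 0 = negative
docs = [["good", "movie"], ["bad", "movie"], ["good", "plot"], ["bad", "acting"]]
labels = [1, 0, 1, 0]
ranked = sorted({t for d in docs for t in d},
                key=lambda t: chi_square(t, docs, labels), reverse=True)
# keep the top-K terms (the study used K = 500), then weight them with tf_idf
```

"good" and "bad" split perfectly along the class boundary and score highest, while "movie" appears equally in both classes and scores zero, so it would be dropped before TF-IDF weighting.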

