Multi-Step Iterative Algorithm for Feature Selection on Dynamic Documents

2016 ◽  
Vol 6 (2) ◽  
pp. 24-40 ◽  
Author(s):  
Prafulla Bharat Bafna ◽  
Shailaja Shirwaikar ◽  
Dhanya Pramod

The authors propose a clustering-based multi-step iterative algorithm. Its key step groups terms by synonyms, exploiting a semantic relativity measure between terms. The term frequency of each synonym group is computed by weighting every term occurring in the document by its relativity to the group's parent term. This raises the importance of terms that individually appear infrequently but together show a strong presence. The authors ran experiments on several real and artificial datasets, including NEWS 20, Reuters, emails, and research papers on different topics. The resulting entropy shows that the algorithm gives improved results on sets of well-articulated documents, such as research papers. The gains are marginal on documents whose message is emphasized by repetition of terms, particularly rapidly generated documents such as emails. The authors also observed that newly arriving documents are mapped appropriately based on their proximity to the semantic groups.
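A minimal Python sketch of the synonym-group frequency step described above (the synonym map, relativity scores, and function names here are hypothetical illustrations, not the authors' implementation):

```python
from collections import defaultdict

def grouped_term_frequency(doc_terms, synonym_groups, relativity):
    """Aggregate term frequency over synonym groups, weighting each
    occurrence by its semantic relativity to the group's parent term."""
    term_to_parent = {t: parent
                      for parent, terms in synonym_groups.items()
                      for t in terms}
    group_tf = defaultdict(float)
    for term in doc_terms:
        parent = term_to_parent.get(term)
        if parent is not None:
            # weight this occurrence by relativity(parent, term)
            group_tf[parent] += relativity.get((parent, term), 1.0)
    return dict(group_tf)

# Hypothetical synonym group with relativity scores to the parent "vehicle"
groups = {"vehicle": ["car", "automobile", "vehicle"]}
rel = {("vehicle", "car"): 0.9, ("vehicle", "automobile"): 0.8,
       ("vehicle", "vehicle"): 1.0}
doc = ["car", "automobile", "car", "train"]
tf = grouped_term_frequency(doc, groups, rel)  # vehicle: 0.9 + 0.8 + 0.9
```

Here "car" and "automobile" each occur too rarely to matter individually, but the group accumulates a weighted count of 2.6, reflecting the strong joint presence the abstract describes.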

2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yifei Chen ◽  
Yuxing Sun ◽  
Bing-Qing Han

Protein interaction article classification is a text classification task in the biological domain: determining which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used to reduce the dimensionality of features and speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measures of document frequency and term frequency; one potential drawback of these methods is that they treat features in isolation. Hence, we first design a similarity measure over context information that takes word co-occurrences and phrase chunks around the features into account. We then substitute this context similarity for document and term frequency in the importance measure of the features, yielding new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.
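One plausible reading of the context-similarity idea, sketched below with hypothetical data and names: build a bag-of-words vector from a small token window around each occurrence of a feature, then compare vectors with cosine similarity. A feature whose contexts differ sharply between the two article classes is likely distinctive:

```python
from collections import Counter
import math

def context_vector(docs, feature, window=2):
    """Bag of tokens co-occurring with `feature` within +/- `window`
    positions, pooled over all documents."""
    ctx = Counter()
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok == feature:
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        ctx[tokens[j]] += 1
    return ctx

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical toy corpora: articles that do / do not describe interactions
pos_docs = [["proteinA", "binds", "proteinB"],
            ["proteinA", "interacts", "with", "proteinB"]]
neg_docs = [["proteinA", "is", "expressed", "in", "liver"]]
sim = cosine(context_vector(pos_docs, "proteinA"),
             context_vector(neg_docs, "proteinA"))
```

In this toy case the contexts share no tokens, so the similarity is 0 and "proteinA"'s surroundings separate the classes cleanly.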


2014 ◽  
Vol 45 ◽  
pp. 1-10 ◽  
Author(s):  
Deqing Wang ◽  
Hui Zhang ◽  
Rui Liu ◽  
Weifeng Lv ◽  
Datao Wang

2020 ◽  
Vol 167 ◽  
pp. 110616
Author(s):  
Jinfu Chen ◽  
Patrick Kwaku Kudjo ◽  
Solomon Mensah ◽  
Selasie Aformaley Brown ◽  
George Akorfu

2020 ◽  
Vol 16 (3) ◽  
pp. 168-182
Author(s):  
Zi-Hung You ◽  
Ya-Han Hu ◽  
Chih-Fong Tsai ◽  
Yen-Ming Kuo

Opinion mining focuses on extracting polarity information from texts. For textual term representation, different feature selection methods, e.g. term frequency (TF) or term frequency-inverse document frequency (TF-IDF), can yield diverse numbers of text features. In text classification, however, a selected training set may contain noisy documents (or outliers), which can degrade classification performance. To solve this problem, instance selection can be adopted to filter out unrepresentative training documents. This article therefore investigates opinion mining performance with the feature selection and instance selection steps considered together. Two combination processes, performing feature selection and instance selection in different orders, were compared. Specifically, two feature selection methods, namely TF and TF-IDF, and two instance selection methods, namely DROP3 and IB3, were employed for comparison. Experimental results from sentiment classifiers built on three Twitter datasets showed that TF-IDF followed by DROP3 performs best.
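DROP3 itself combines a noise filter with a reduction pass; the sketch below (with hypothetical toy data) shows only the initial noise-filtering step, an edited-nearest-neighbour pass of the kind DROP3 applies first to discard unrepresentative training documents after vectorisation:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbour: keep an instance only if the majority of
    its k nearest neighbours share its label. DROP3 runs a pass like this
    to remove noisy instances before its reduction step."""
    keep = []
    for i in range(len(X)):
        neighbours = sorted(range(len(X)), key=lambda j: euclidean(X[i], X[j]))
        votes = [y[j] for j in neighbours[1:k + 1]]  # skip self at index 0
        if votes.count(y[i]) > k // 2:
            keep.append(i)
    return keep

# Toy 2-D "document vectors": two clean clusters plus one mislabelled point
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.5)]
y = [0, 0, 0, 1, 1, 1, 1]
kept = enn_filter(X, y)  # → [0, 1, 2, 3, 4, 5]: index 6 is dropped
```

The mislabelled point at (0.5, 0.5) sits inside the class-0 cluster, so all of its neighbours outvote its label and it is filtered out.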


2014 ◽  
Vol 548-549 ◽  
pp. 1102-1109
Author(s):  
Qiang Li ◽  
Liang He ◽  
Xin Lin

Feature selection is an important process for choosing a subset of features relevant to a particular application in document classification. Terms that occur unevenly across categories carry strong discriminative information for categorization. Firstly, based on the categorical document frequency probability (CTFP), a CTFP_VM feature selection algorithm was designed. Secondly, a maximum term frequency conditional distribution factor was proposed to further improve the CTFP_VM criterion. Document categorization experiments were performed with SVM classifiers on the well-known Reuters-21578 and 20news-18828 corpora, as unbalanced and balanced corpora respectively. The experiments compare the novel methods with other conventional feature selection algorithms, and the proposed method yields an excellent feature set for document categorization.
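The exact CTFP_VM formula is not reproduced in the abstract; the sketch below is only a stand-in for the underlying idea, scoring a term by how unevenly its per-category document-frequency probability is distributed (all names and data are hypothetical):

```python
def unevenness_score(term, docs_by_cat):
    """Score a term by how unevenly its document-frequency probability is
    spread across categories: max per-category probability over the mean.
    A stand-in for the uneven-occurrence idea, not the CTFP_VM formula."""
    probs = []
    for docs in docs_by_cat.values():
        df = sum(1 for d in docs if term in d)
        probs.append(df / len(docs))
    mean = sum(probs) / len(probs)
    return max(probs) / mean if mean else 0.0

# Hypothetical two-category corpus; each document is a set of terms
docs_by_cat = {
    "sport":    [{"goal", "match"}, {"goal", "team"}],
    "politics": [{"vote", "law"}, {"law", "goal"}],
}
# "law" appears only in politics → maximally uneven, score 2.0
# "goal" appears in both categories → closer to the minimum score of 1.0
```

A term concentrated in one category scores high (here 2.0 for "law"), while a term spread evenly scores near 1.0, matching the intuition that unevenly distributed terms discriminate best.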


2017 ◽  
Vol 2 (1) ◽  
pp. 14
Author(s):  
Yono Cahyono

Social media use is now enormous; everyone expresses opinions, comments, criticism, and more. These data provide valuable information that can help people or organizations make decisions. The sheer volume of data makes it impossible for humans to read and analyze it manually. Sentiment analysis is the process of analyzing, understanding, and classifying opinions, evaluations, judgments, attitudes, and emotions toward a particular entity such as a product, service, organization, individual, event, or topic, in order to extract information. This study aims to classify Indonesian-language tweets on the social media platform Twitter into positive, negative, and neutral categories. A Naïve Bayes Classifier (NBC) with Particle Swarm Optimization (PSO) feature selection was applied to the dataset to remove less relevant attributes during the classification process. Test results show that the Naïve Bayes Classifier with PSO feature selection, using the term frequency (TF) parameter, achieved an accuracy of 97.48%.
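A minimal binary-PSO sketch of the feature selection step (the fitness function below is a hypothetical stand-in for the Naïve Bayes classification accuracy the study optimizes; particle counts and coefficients are illustrative defaults):

```python
import math
import random

def binary_pso(fitness, n_features, n_particles=10, iters=20, seed=0):
    """Minimal binary PSO: each particle is a 0/1 feature mask; velocities
    are squashed through a sigmoid to give per-bit flip probabilities."""
    rnd = random.Random(seed)
    pos = [[rnd.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pfit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pfit[i])
    gbest, gfit = pbest[g][:], pfit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rnd.random(), rnd.random()
                # cognitive + social pull toward personal and global bests
                vel[i][d] += 2 * r1 * (pbest[i][d] - pos[i][d]) \
                           + 2 * r2 * (gbest[d] - pos[i][d])
                pos[i][d] = 1 if rnd.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            f = fitness(pos[i])
            if f > pfit[i]:
                pbest[i], pfit[i] = pos[i][:], f
                if f > gfit:
                    gbest, gfit = pos[i][:], f
    return gbest, gfit

# Stand-in fitness: features 0 and 2 are "informative", extras are penalized
fit_fn = lambda mask: mask[0] + mask[2] - 0.1 * sum(mask)
best, best_fit = binary_pso(fit_fn, n_features=4)
```

In the real pipeline, `fitness` would train and evaluate the NBC on TF features restricted to the mask, so the swarm converges on a subset that drops the less relevant attributes.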


2019 ◽  
Vol 8 (3) ◽  
pp. 2138-2143

Aspect-oriented sentiment analysis proceeds in two phases: identifying aspect terms in a review and determining the related opinion. Features play an important role in determining the accuracy of the model, and feature extraction and feature selection techniques contribute to increased classification accuracy. Feature selection strategies reduce computation time, improve prediction performance, and provide a better understanding of the data in machine learning and pattern recognition applications. This work focuses specifically on aspect extraction from a restaurant review dataset but can also be applied to other datasets. We propose a multivariate filter strategy of feature selection that works on lemma features; it selects relevant features and avoids redundant ones. Initially, the extracted features undergo preprocessing, and a term-frequency matrix is generated containing the occurrence count of each feature with respect to each aspect category. In the next phase, different feature selection strategies are applied: selecting features based on correlation, on weighted term frequency, and on weighted term frequency with the correlation coefficient. The performance of the weighted term frequency with correlation coefficient approach is compared with the existing system and shows a significant improvement in F1 score.
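A sketch of one way to realize the weighted-term-frequency-with-correlation idea (hypothetical data and names; the paper's exact weighting over the term-frequency matrix may differ): weight each lemma's total frequency by the Pearson correlation between its per-document counts and membership in the target aspect category:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def weighted_tf_with_correlation(counts, labels, category):
    """For each term, weight its total term frequency by the correlation
    between its per-document counts and membership in `category`."""
    ind = [1.0 if l == category else 0.0 for l in labels]
    terms = {t for c in counts for t in c}
    scores = {}
    for t in terms:
        vec = [c.get(t, 0) for c in counts]
        scores[t] = sum(vec) * pearson(vec, ind)
    return scores

# Hypothetical restaurant reviews as per-document term counts
counts = [{"pizza": 2}, {"pizza": 1, "service": 1}, {"service": 2}]
labels = ["food", "food", "staff"]
scores = weighted_tf_with_correlation(counts, labels, "food")
```

Terms that co-vary with the "food" category (like "pizza") get a positive score, while terms tied to other categories score negatively, so ranking by score both selects and weights features in one step.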


2019 ◽  
Vol 6 (1) ◽  
pp. 138-149
Author(s):  
Ukhti Ikhsani Larasati ◽  
Much Aziz Muslim ◽  
Riza Arifudin ◽  
Alamsyah Alamsyah

Data processing can be done with text mining techniques; processing large text data requires a machine to mine opinions, whether positive or negative. Sentiment analysis applies text mining methods to determine whether the textual content of a dataset is positive or negative. The support vector machine is one classification algorithm that can be used for sentiment analysis; however, it performs less well on large data. In addition, the text mining process is constrained by the number of attributes used: many attributes reduce classifier performance and thus yield low accuracy. The purpose of this research is to increase support vector machine accuracy by implementing feature selection and feature weighting. Feature selection removes a large number of irrelevant attributes; in this study, the top K = 500 features are selected. Feature weighting is then performed to calculate the weight of each selected attribute. The feature selection method used is the chi-square statistic, and feature weighting uses Term Frequency-Inverse Document Frequency (TF-IDF). Experiments in Matlab R2017b with 10-fold cross-validation show that integrating the support vector machine with the chi-square statistic and TF-IDF increases accuracy by 11.5%: the support vector machine without chi-square statistic and TF-IDF achieved 68.7% accuracy, while applying them raised accuracy to 80.2%.
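A pure-Python sketch of the two preprocessing steps on hypothetical toy data (the study itself used Matlab): a chi-square score over the 2x2 term/class contingency table ranks candidate features, and TF-IDF then weights the survivors:

```python
import math

def chi_square(term, docs, labels):
    """Chi-square statistic from the 2x2 contingency table of
    (term present?) x (class), used to rank candidate features."""
    n = len(docs)
    a = sum(1 for d, l in zip(docs, labels) if term in d and l == 1)
    b = sum(1 for d, l in zip(docs, labels) if term in d and l == 0)
    c = sum(1 for d, l in zip(docs, labels) if term not in d and l == 1)
    e = n - a - b - c  # term absent, class 0
    num = n * (a * e - b * c) ** 2
    den = (a + b) * (c + e) * (a + c) * (b + e)
    return num / den if den else 0.0

def tf_idf(term, doc, docs):
    """Term frequency in one document times inverse document frequency."""
    df = sum(1 for d in docs if term in d)
    return doc.count(term) * math.log(len(docs) / df) if df else 0.0

# Toy corpus: label 1 = positive review, 0 = negative
docs = [["good", "movie"], ["bad", "movie"], ["good", "plot"], ["bad", "acting"]]
labels = [1, 0, 1, 0]
ranked = sorted({t for d in docs for t in d},
                key=lambda t: chi_square(t, docs, labels), reverse=True)
# keep the top-K terms (the study used K = 500), then weight them with tf_idf
```

"good" and "bad" split perfectly along the class boundary and score highest, while "movie" appears equally in both classes and scores zero, so it would be dropped before TF-IDF weighting.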

