Performance Assessment of Multiple Classifiers Based on Ensemble Feature Selection Scheme for Sentiment Analysis

2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Monalisa Ghosh ◽  
Goutam Sanyal

Sentiment classification, or sentiment analysis, is acknowledged as an open research domain, and in recent years a large body of work has applied a wide range of methodologies to it. Feature generation and selection are crucial for text mining, since a high-dimensional feature set can degrade the performance of sentiment analysis. This paper investigates the limitations of the widely used feature selection methods (IG, Chi-square, and Gini Index) with unigram and bigram feature sets on four machine learning classification algorithms (MNB, SVM, KNN, and ME). The proposed methods are evaluated on three standard datasets: the IMDb movie review dataset and the electronics and kitchen product review datasets. Initially, unigram and bigram features are extracted using the n-gram method. In addition, we generate a composite feature vector, CompUniBi (unigram + bigram), which is passed to the feature selection methods Information Gain (IG), Gini Index (GI), and Chi-square (CHI) to obtain an optimal feature subset by assigning a score to each feature. These methods rank the features by their scores, so a prominent feature vector (CompIG, CompGI, or CompCHI) can be generated easily for classification. Finally, the machine learning classifiers SVM, MNB, KNN, and ME use the prominent feature vector to classify each review document as either positive or negative. Performance is measured with precision, recall, and F-measure. Experimental results show that the composite feature vector achieves better performance than unigram features alone, which is encouraging as well as comparable to related research. The best results, in terms of highest accuracy, were obtained from the combination of Information Gain with SVM.
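The CompUniBi pipeline described above can be sketched with scikit-learn. The toy reviews, the value of k, and the use of chi2 (with mutual_info_classif as a stand-in for Information Gain) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the CompUniBi pipeline: extract unigrams + bigrams, rank
# the composite features with a score function, keep the top-k, and classify
# with SVM. The toy reviews and k are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

reviews = ["great movie, loved it", "terrible plot and bad acting",
           "excellent kitchen gadget", "the blender broke after a week"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# CompUniBi: one vocabulary holding both unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)

# Score every feature and keep the top-k (CompCHI here); swapping in
# mutual_info_classif would approximate the Information Gain variant.
X_sel = SelectKBest(chi2, k=5).fit_transform(X, labels)

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, labels, test_size=0.5,
                                          random_state=0, stratify=labels)
clf = LinearSVC().fit(X_tr, y_tr)  # MultinomialNB, KNN, etc. are drop-in swaps
p, r, f, _ = precision_recall_fscore_support(y_te, clf.predict(X_te),
                                             average="binary", zero_division=0)
print(f"precision={p:.2f} recall={r:.2f} F-measure={f:.2f}")
```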

2017 ◽  
Vol 24 (1) ◽  
pp. 3-37 ◽  
Author(s):  
SANDRA KÜBLER ◽  
CAN LIU ◽  
ZEESHAN ALI SAYYED

We investigate feature selection methods for machine learning approaches to sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams. This results in a large set of features, of which only a small subset may be good indicators of the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 percentage points.
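Two of the ideas above, a single multi-class information-gain-style score per feature and random over-sampling of minority rating classes, can be sketched as follows. The synthetic counts and skewed label distribution are assumptions standing in for the Epicurious data.

```python
# Sketch of two ideas from the abstract: a single multi-class information-gain
# style score per feature, and random over-sampling of minority rating classes.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(60, 8))          # 60 reviews, 8 n-gram counts
y = np.array([4] * 40 + [3] * 15 + [1] * 5)   # heavily skewed ratings

# One score per feature over all classes at once, rather than an ensemble of
# one-vs-rest binary scores.
ig = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print("top features:", np.argsort(ig)[::-1][:4])

# Random over-sampling: replicate minority-class reviews up to the majority size.
target = np.bincount(y).max()
parts = [resample(X[y == c], y[y == c], replace=True, n_samples=target,
                  random_state=0) for c in np.unique(y)]
X_bal = np.vstack([p[0] for p in parts])
y_bal = np.concatenate([p[1] for p in parts])
print("balanced counts:", {int(c): int(n) for c, n in
                           zip(*np.unique(y_bal, return_counts=True))})
```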


Machines ◽  
2018 ◽  
Vol 6 (4) ◽  
pp. 65 ◽  
Author(s):  
Jingwei Too ◽  
Abdul Abdullah ◽  
Norhashimah Mohd Saad ◽  
Nursabillilah Mohd Ali

Electromyography (EMG) has been widely used in rehabilitation and myoelectric prosthetic applications. However, the recent increase in the number of EMG features has led to a high-dimensional feature vector, which degrades classification performance and increases the complexity of the recognition system. In this paper, we propose two new feature selection methods based on a tree growth algorithm (TGA) for EMG signal classification. In the first approach, two transfer functions are implemented to convert the continuous TGA into a binary version. In the second approach, swap, crossover, and mutation operators are introduced in a modified binary tree growth algorithm (MBTGA) to enhance its exploitation and exploration behaviors. In this study, the short-time Fourier transform (STFT) is employed to transform the EMG signals into a time-frequency representation. The features are then extracted from the STFT coefficients to form a feature vector. Afterward, the proposed feature selection methods are applied to select the best feature subset from the large available feature set. The experimental results show the superiority of MBTGA not only in terms of feature reduction, but also in classification performance.
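How a transfer function turns a continuous metaheuristic position into a binary feature mask can be illustrated with a generic S-shaped (sigmoid) mapping. The abstract does not specify the paper's two transfer functions, so the one below is a common textbook choice rather than the authors' exact operator.

```python
# Generic illustration of binarizing a continuous search-agent position with an
# S-shaped (sigmoid) transfer function, as is common in binary metaheuristics.
import numpy as np

def s_shaped_binarize(position, rng):
    """Map each continuous dimension to {0, 1}; 1 means the feature is kept."""
    prob = 1.0 / (1.0 + np.exp(-position))          # sigmoid in (0, 1)
    return (rng.random(position.shape) < prob).astype(int)

rng = np.random.default_rng(42)
position = rng.normal(0.0, 2.0, size=10)            # one agent, 10 candidate features
mask = s_shaped_binarize(position, rng)
print("continuous position:", np.round(position, 2))
print("binary feature mask :", mask)
# A wrapper fitness would then train a classifier on the masked features and
# combine its error rate with the size of the selected subset.
```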


2014 ◽  
Vol 988 ◽  
pp. 511-516 ◽  
Author(s):  
Jin Tao Shi ◽  
Hui Liang Liu ◽  
Yuan Xu ◽  
Jun Feng Yan ◽  
Jian Feng Xu

Machine learning is an important solution in research on Chinese text sentiment categorization, and text feature selection is critical to classification performance. However, while classical feature selection methods perform well on the global categories, they miss many representative feature words of each category. This paper presents an improved information gain method that integrates word frequency and the degree of feature word sentiment into the traditional information gain method. Experiments show that a classifier improved by this method achieves better classification performance.
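The abstract does not give the exact combination rule, so the sketch below is only one plausible reading: scale the classic information gain of a word by its normalized frequency and a sentiment-strength weight from a hypothetical lexicon. The toy documents and weights are assumptions.

```python
# One plausible reading of the improvement (the exact rule is not given in the
# abstract): weight classic information gain by normalized word frequency and
# a sentiment-strength weight from an assumed lexicon.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term, docs, labels):
    """IG(t) = H(C) - P(t)H(C|t) - P(not t)H(C|not t)."""
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_t = [lab for doc, lab in zip(docs, labels) if term not in doc]
    ig = entropy(labels)
    for subset in (with_t, without_t):
        if subset:
            ig -= len(subset) / len(labels) * entropy(subset)
    return ig

SENTIMENT_WEIGHT = {"excellent": 1.0, "bad": 0.9, "screen": 0.1}  # assumed lexicon

def improved_ig(term, docs, labels):
    tf = sum(doc.count(term) for doc in docs) / sum(len(d) for d in docs)
    return information_gain(term, docs, labels) * (1 + tf) * SENTIMENT_WEIGHT.get(term, 0.5)

docs = [["excellent", "phone", "screen"], ["bad", "battery"],
        ["excellent", "camera"], ["bad", "screen", "bad"]]
labels = ["pos", "neg", "pos", "neg"]
for term in ("excellent", "bad", "screen"):
    print(term, round(information_gain(term, docs, labels), 3),
          round(improved_ig(term, docs, labels), 3))
```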


2018 ◽  
Vol 5 (5) ◽  
pp. 537 ◽  
Author(s):  
Oman Somantri ◽  
Dyah Apriliani

Customers want decision support when choosing a restaurant or culinary venue that matches their preferences, for example in the city of Tegal. Sentiment analysis is used to provide a solution to this problem by applying a Support Vector Machine (SVM) algorithm model. The purpose of this research is to optimize the resulting model by applying feature selection with the Information Gain (IG) and Chi-square algorithms to the best model produced by SVM for classifying customer satisfaction with culinary stalls and restaurants in Tegal, thereby increasing the accuracy of the resulting model. The results show that the best accuracy was produced by the SVM-IG model at 72.45%, an increase of about 3.08% from the initial 69.36%. The average improvement in accuracy after optimizing SVM with feature selection is 2.51%. Based on these results, feature selection using Information Gain (SVM-IG) achieves better accuracy than SVM alone and than Chi-square (SVM-CS), so the proposed model improves on the accuracy produced by SVM.
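The reported comparison, SVM accuracy with and without Information Gain feature selection, can be sketched as follows. Synthetic data stands in for the Tegal review corpus, and mutual_info_classif serves as the Information Gain surrogate; both are assumptions.

```python
# Sketch of the reported comparison: SVM accuracy with and without information
# gain feature selection. make_classification stands in for the review corpus.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=200, n_informative=15,
                           random_state=0)

models = {
    "SVM": SVC(kernel="linear"),
    "SVM-IG": make_pipeline(SelectPercentile(mutual_info_classif, percentile=10),
                            SVC(kernel="linear")),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy = {acc:.3f}")
```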


Author(s):  
Mohsin Iqbal ◽  
Saif Ur Rehman ◽  
Saira Gillani ◽  
Sohail Asghar

The key objective of this chapter is to study classification accuracy when feature selection is used with machine learning algorithms. Feature selection reduces the dimensionality of the data and improves the accuracy of the learning algorithm. We test how integrated feature selection affects the accuracy of three classifiers by applying several feature selection methods. Among the filter methods, Information Gain (IG), Gain Ratio (GR), and Relief-F, and among the wrapper methods, Bagging and Naive Bayes (NB), enabled the classifiers to achieve the largest average increase in classification accuracy while reducing the number of unnecessary attributes. These conclusions can advise machine learning users on which classifier and feature selection methods to use to optimize classification accuracy. This is especially important in risk-sensitive applications of machine learning, where one aim is to reduce the cost of collecting, processing, and storing unnecessary data.
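The contrast between filter scores (IG, Gain Ratio, Relief-F) and wrapper selection built around a classifier can be illustrated with scikit-learn's SequentialFeatureSelector wrapped around Naive Bayes. The dataset and subset size are illustrative assumptions, not the chapter's setup.

```python
# Minimal wrapper-style selection: candidate subsets are scored by the accuracy
# of the Naive Bayes classifier itself (sequential forward selection), in
# contrast to filter scores such as IG, Gain Ratio, or Relief-F.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
nb = GaussianNB()

selector = SequentialFeatureSelector(nb, n_features_to_select=8, cv=5)
X_reduced = selector.fit_transform(X, y)

full_acc = cross_val_score(nb, X, y, cv=5).mean()
reduced_acc = cross_val_score(nb, X_reduced, y, cv=5).mean()
print(f"all {X.shape[1]} features: {full_acc:.3f}   selected 8: {reduced_acc:.3f}")
```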


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Fei Guo ◽  
Zhixiang Yin ◽  
Kai Zhou ◽  
Jiasi Li

Long noncoding RNAs (lncRNAs) are a class of RNAs longer than 200 nt that do not encode proteins. Studies have shown that lncRNAs can regulate gene expression at the epigenetic, transcriptional, and posttranscriptional levels; they are not only closely related to the occurrence, development, and prevention of human diseases, but can also regulate plant flowering and participate in plant abiotic stress responses such as drought and salt stress. Therefore, accurately and efficiently identifying lncRNAs remains an essential research task. A large number of identification tools based on machine learning and deep learning algorithms already exist, but most use human and mouse gene sequences as training sets, seldom plant sequences, and apply only one feature selection method (or one class of methods) after feature extraction. We developed an identification model covering dicots, monocots, algae, mosses, and ferns. After comparing 20 feature selection methods (seven filter and thirteen wrapper methods) combined with seven classifiers, while considering both the correlation between features and model redundancy, we found that the WOA-XGBoost-based model performed best, with accuracy, AUC, and F1 score of 91.55%, 96.78%, and 91.68%, respectively. Meanwhile, the number of elements in the feature subset was reduced to 23, which effectively improved prediction accuracy and modeling efficiency.
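A wrapper such as WOA-XGBoost repeatedly scores candidate feature subsets. The sketch below shows only that inner evaluation step, with a random mask standing in for the subset the whale optimization search would propose; the synthetic data and the xgboost dependency are assumptions.

```python
# Sketch of the wrapper's inner evaluation step only: given a candidate feature
# mask (random here, where the paper's WOA search would supply it), score an
# XGBoost model by accuracy, AUC, and F1. Requires the xgboost package.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=60, n_informative=12,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rng = np.random.default_rng(1)
mask = rng.random(X.shape[1]) < 0.4           # candidate subset from the search

model = XGBClassifier(n_estimators=100)
model.fit(X_tr[:, mask], y_tr)
proba = model.predict_proba(X_te[:, mask])[:, 1]
pred = (proba >= 0.5).astype(int)

print(f"features kept: {int(mask.sum())}")
print(f"accuracy={accuracy_score(y_te, pred):.3f}  "
      f"AUC={roc_auc_score(y_te, proba):.3f}  F1={f1_score(y_te, pred):.3f}")
```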


2019 ◽  
Vol 8 (4) ◽  
pp. 1333-1338

Text classification is a vital process due to the large volume of electronic articles. One of the drawbacks of text classification is the high dimensionality of the feature space. Scholars have developed several algorithms to choose relevant features from article text, such as Chi-square (χ²), Information Gain (IG), and Correlation-based Feature Selection (CFS). These algorithms have been investigated widely for English text, while studies on Arabic text are still limited. In this paper, we investigated four well-known algorithms, Support Vector Machines (SVM), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree, against a benchmark Arabic textual dataset, the Saudi Press Agency (SPA) dataset, to evaluate the impact of feature selection methods. Using the WEKA tool, we experimented with the four classification algorithms with and without feature selection. The results provide clear evidence that the three feature selection methods often improve classification accuracy by eliminating irrelevant features.
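Of the three filters mentioned, CFS is the least self-explanatory; its standard merit function for a k-feature subset is sketched below. The average correlation values are assumptions chosen to show the trade-off it encodes.

```python
# The CFS filter mentioned above scores a subset of k features by
#   merit = k * r_cf / sqrt(k + k*(k-1)*r_ff)
# where r_cf is the mean feature-class correlation and r_ff the mean
# feature-feature correlation. The correlation values below are assumptions.
import math

def cfs_merit(k, r_cf, r_ff):
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# A larger but redundant subset can score lower than a small, less redundant one.
print(round(cfs_merit(k=10, r_cf=0.50, r_ff=0.6), 3))   # 0.625
print(round(cfs_merit(k=4,  r_cf=0.45, r_ff=0.2), 3))   # 0.712
```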


Author(s):  
Durmuş Özkan Şahin ◽  
Erdal Kılıç

In this study, the authors provide both theoretical and experimental information about text mining, one of the topics of natural language processing. Three different text mining problems, news classification, sentiment analysis, and author recognition, are discussed for Turkish. The aim is to reduce running time and increase the performance of machine learning algorithms. Four machine learning algorithms and two feature selection metrics are used to solve these text classification problems. The classification algorithms are random forest (RF), logistic regression (LR), naive Bayes (NB), and sequential minimal optimization (SMO). Chi-square and information gain are used as the feature selection metrics. The highest classification performance achieved in this study is 0.895 according to the F-measure, obtained by using the SMO classifier and the information gain metric for news classification. This study is important for comparing the performance of classification algorithms and feature selection methods.
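The comparison metric reported above is the F-measure; a small sketch of the per-class computation and its macro average follows, with assumed confusion counts.

```python
# The comparison metric used above: per-class F-measure F = 2PR / (P + R),
# macro-averaged over the classes of a multi-class task such as news
# classification. The confusion counts are illustrative assumptions.
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

per_class = [f_measure(90, 10, 12), f_measure(80, 25, 9), f_measure(70, 8, 30)]
print("macro F-measure:", round(sum(per_class) / len(per_class), 3))
```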


Author(s):  
Hadeel N. Alshaer ◽  
Mohammed A. Otair ◽  
Laith Abualigah

Feature selection is one of the main problems in the text and data mining domain. This paper presents a comparative study of feature selection methods for Arabic text classification. Five feature selection methods were selected: ICHI square, CHI square, Information Gain, Mutual Information, and Wrapper. They were tested with five classification algorithms: Bayes Net, Naive Bayes, Random Forest, Decision Tree, and Artificial Neural Networks. An Arabic data collection consisting of 9055 documents was used, and the methods were compared by four criteria: precision, recall, F-measure, and time to build the model. The results showed that the improved ICHI feature selection achieved almost all of the best results in comparison with the other methods.
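The chi-square score that ICHI refines is computed from a term/category contingency table; a sketch with assumed counts follows (the ICHI modification itself is not specified in the abstract).

```python
# The chi-square score that ICHI builds on, from a 2x2 term/category table:
# A = term present & class, B = term present & other classes,
# C = term absent & class,  D = term absent & other classes. Counts are assumed.
def chi_square(A, B, C, D):
    N = A + B + C + D
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / den if den else 0.0

print(round(chi_square(A=40, B=5, C=10, D=945), 1))   # term concentrated in one class
print(round(chi_square(A=20, B=20, C=480, D=480), 1)) # term independent of the class -> 0.0
```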

