A New Feature Selection Method for Sentiment Analysis in Short Text

2018 ◽  
Vol 29 (1) ◽  
pp. 1122-1134
Author(s):  
H. M. Keerthi Kumar ◽  
B. S. Harish

In the recent internet era, micro-blogging sites produce enormous amounts of short textual information in the form of users' opinions and sentiments. Sentiment analysis of short text is challenging because of informal language, misspellings, and shortened word forms, which lead to high dimensionality and sparsity. To deal with these challenges, this paper proposes a novel, simple, yet effective feature selection method that selects frequently distributed features related to each class. The method uses class-wise information to identify the features relevant to each class. We evaluate the proposed method by comparing it with existing feature selection methods such as chi-square (χ2), entropy, information gain, and mutual information. Performance is measured by the classification accuracy obtained with support vector machine, K-nearest neighbors, and random forest classifiers on two publicly available datasets, viz., the Stanford Twitter dataset and the Ravikiran Janardhana dataset. To demonstrate the effectiveness of the proposed method, we conducted extensive experiments with different feature set sizes. The proposed feature selection method outperforms the existing methods in terms of classification accuracy on the Stanford Twitter dataset, and performs comparably to the other feature selection methods on most feature subsets of the Ravikiran Janardhana dataset.
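The abstract does not give the exact class-wise scoring formula, so the following is only a minimal sketch of the general idea under an assumed rule: take the top-k most frequent terms within each class and use their union as the feature set for a downstream classifier (the toy documents, k, and the SVM are placeholders).

```python
# Minimal sketch (assumed rule): keep the top-k most frequent terms of each class and
# train a classifier on the union of those class-wise features.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["great phone, love it", "worst service ever", "love the battery life", "so bad, hate it"]
labels = np.array([1, 0, 1, 0])                  # 1 = positive, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(docs)                      # documents x terms
terms = np.array(vec.get_feature_names_out())

k = 3
selected = set()
for c in np.unique(labels):
    class_counts = np.asarray(X[labels == c].sum(axis=0)).ravel()   # class-wise term frequencies
    selected.update(terms[np.argsort(class_counts)[::-1][:k]])      # top-k terms of this class

cols = [vec.vocabulary_[t] for t in sorted(selected)]
clf = LinearSVC().fit(X[:, cols], labels)        # SVM on the class-wise selected features
print(sorted(selected), clf.score(X[:, cols], labels))
```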

Author(s):  
B. Venkatesh ◽  
J. Anuradha

In microarray data, achieving high classification accuracy is difficult because of the high dimensionality and the presence of irrelevant and noisy features; such data also contain many gene expression values but few samples. To increase the classification accuracy and the processing speed of the model, an optimal set of features needs to be extracted, which can be achieved by applying a feature selection method. In this paper, we propose a hybrid ensemble feature selection method with two phases, a filter phase and a wrapper phase. In the filter phase, an ensemble technique aggregates the feature ranks produced by the Relief, minimum redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods, using Fuzzy Gaussian membership function ordering for the aggregation. In the wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) selects the optimal features, with an RBF kernel Support Vector Machine (SVM) classifier as the evaluator. The performance of the proposed model is compared with state-of-the-art feature selection methods on five benchmark datasets, using Accuracy, Recall, Precision, and F1-Score as evaluation metrics. The experimental results show that the proposed method outperforms the other feature selection methods.
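Relief, mRMR, and the fuzzy Gaussian ordering are not available off the shelf in scikit-learn, so the sketch below substitutes mutual information, the ANOVA F statistic, and feature-class correlation as the filter scores, and a plain mean of ranks as the aggregation; the IBPSO wrapper phase is omitted. It is meant only to illustrate the filter-phase rank aggregation feeding an RBF-kernel SVM.

```python
# Sketch of the filter phase only: aggregate feature ranks from several filter scores
# (stand-ins for Relief/mRMR/FC) and evaluate the top-ranked subset with an RBF SVM.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)       # stand-in for a microarray dataset

scores = [
    mutual_info_classif(X, y, random_state=0),
    f_classif(X, y)[0],
    np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]),  # feature-class correlation
]
# Turn each score vector into ranks (1 = best) and average the ranks across the filters.
ranks = np.mean([len(s) - np.argsort(np.argsort(s)) for s in scores], axis=0)
top = np.argsort(ranks)[:10]                     # 10 best features by aggregated rank

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print("selected:", top, "CV accuracy: %.3f" % cross_val_score(svm, X[:, top], y, cv=5).mean())
```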


Author(s):  
Gang Liu ◽  
Chunlei Yang ◽  
Sen Liu ◽  
Chunbao Xiao ◽  
Bin Song

A feature selection method based on mutual information and the support vector machine (SVM) is proposed in order to eliminate redundant features and improve classification accuracy. First, the local correlation between features and the overall correlation are calculated by mutual information. This correlation reflects the information inclusion relationship between features, so features are evaluated and redundant features are eliminated by analyzing it. Next, the concept of mean impact value (MIV) is defined, and the influence of each input variable on the output of the SVM network is calculated based on MIV. The importance weights of the features, described by MIV, are sorted in descending order. Finally, the SVM classifier performs feature selection according to the classification accuracy of feature combinations, taking the MIV order of the features as a reference. Simulation experiments on three standard UCI data sets show that this method not only effectively reduces the feature dimension while keeping classification accuracy high, but also ensures good robustness.
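The abstract does not reproduce the exact MIV formula. A common formulation perturbs each input and averages the resulting change in the model output, and the hedged sketch below follows that pattern; the ±10%-of-standard-deviation perturbation and the use of the SVM decision function are assumptions that may differ from the paper.

```python
# MIV-style ranking sketch: perturb each feature up and down and record the mean change
# in the SVM decision function; larger mean impact = more important feature.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)                # stand-in UCI dataset
X = StandardScaler().fit_transform(X)
svm = SVC(kernel="rbf").fit(X, y)

miv = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    delta = 0.10 * X[:, j].std()                 # perturbation: 10% of the feature's spread
    up, down = X.copy(), X.copy()
    up[:, j] += delta
    down[:, j] -= delta
    diff = svm.decision_function(up) - svm.decision_function(down)
    miv[j] = np.abs(diff).mean()                 # mean impact of perturbing feature j

order = np.argsort(miv)[::-1]                    # features sorted by descending impact value
print("MIV-ranked feature indices:", order)
```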


Author(s):  
GULDEN UCHYIGIT ◽  
KEITH CLARK

Text classification is the problem of classifying a set of documents into a pre-defined set of classes. A major difficulty in text classification is the high dimensionality of the feature space. Only a small subset of the words are feature words that help determine a document's class, while the rest add noise, can make the results unreliable, and significantly increase computational time. A common way of dealing with this problem is feature selection, in which the number of words in the feature space is significantly reduced. In this paper we present a comparative study of feature selection methods for text classification. Ten feature selection methods were evaluated, including a new method called the GU metric. The other methods evaluated are: the Chi-Squared (χ2) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, and BSS/WSS coefficient. The experimental evaluation shows that the GU metric obtained the best F1 and F2 scores. The experiments were performed on the 20 Newsgroups data set with a Naive Bayesian probabilistic classifier.
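For reference, one cell of such a comparison can be sketched with scikit-learn: chi-squared selection feeding a naive Bayes classifier on 20 Newsgroups, scored with macro F1. The value of k and the TF-IDF weighting are assumptions; other selection metrics would slot in by swapping the score function passed to SelectKBest.

```python
# One feature-selection method x one classifier on 20 Newsgroups, as a reference point.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    SelectKBest(chi2, k=2000),                   # keep the 2000 highest-scoring terms
    MultinomialNB(),
)
pipe.fit(train.data, train.target)
pred = pipe.predict(test.data)
print("macro F1: %.3f" % f1_score(test.target, pred, average="macro"))
```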


2014 ◽  
Vol 2014 ◽  
pp. 1-17 ◽  
Author(s):  
Jieming Yang ◽  
Zhaoyang Qu ◽  
Zhiying Liu

Filtering feature-selection algorithms are an important approach to dimensionality reduction in text categorization. Most filtering feature-selection algorithms evaluate the significance of a feature for a category on the basis of a balanced dataset and do not consider the imbalance of the corpus. In this paper, a new scheme is proposed that weakens the adverse effect caused by the imbalance factor in the corpus. We evaluated the improved versions of nine well-known feature-selection methods (Information Gain, Chi statistic, Document Frequency, Orthogonal Centroid Feature Selection, DIA association factor, Comprehensive Measurement Feature Selection, Deviation from Poisson Feature Selection, improved Gini index, and Mutual Information) using naïve Bayes and support vector machines on three benchmark document collections (20-Newsgroups, Reuters-21578, and WebKB). The experimental results show that the improved scheme significantly enhances the performance of the feature-selection methods.
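The correction itself is not spelled out in the abstract; the toy sketch below only illustrates the kind of adjustment it targets, namely normalising per-class document frequencies by class size before scoring a term, so that the majority class cannot dominate a frequency-based score. The scoring rule shown is an illustrative assumption, not the paper's scheme.

```python
# Raw vs class-size-normalised document-frequency gaps on an imbalanced toy corpus.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cheap deal now", "cheap offer today", "cheap cheap deal",
        "cheap meeting agenda", "meeting agenda", "project meeting notes"]
y = np.array([1, 1, 1, 1, 0, 0])                 # imbalanced: four class-1 vs two class-0 documents

vec = CountVectorizer().fit(docs)
X = (vec.transform(docs) > 0).astype(int).toarray()     # binary document-term matrix
terms = vec.get_feature_names_out()

df = np.array([X[y == c].sum(axis=0) for c in (0, 1)], dtype=float)   # per-class document counts
df_norm = df / np.array([(y == c).sum() for c in (0, 1)])[:, None]    # fraction of each class's documents

score_raw = np.abs(df[1] - df[0])                # raw gap: inflated for majority-class terms
score_bal = np.abs(df_norm[1] - df_norm[0])      # gap after normalising by class size

for t, a, b in zip(terms, score_raw, score_bal):
    print(f"{t:10s} raw gap = {a:.0f}   normalised gap = {b:.2f}")
```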


2020 ◽  
Vol 3 (1) ◽  
pp. 58-63
Author(s):  
Y. Mansour Mansour ◽  
Majed A. Alenizi

Email is currently the main communication method worldwide, having proven its efficiency. Phishing emails, on the other hand, are one of the major threats and result in significant losses, estimated at billions of dollars. Phishing email detection is a dynamic problem, a struggle between phishers and defenders in which phishers have more flexibility to manipulate email features and evade anti-phishing techniques. Many solutions have been proposed to mitigate the impact of phishing emails on the targeted sectors, but none has achieved 100% detection accuracy. As phishing techniques evolve, the solutions need to evolve and generalize in order to mitigate as much of the threat as possible. This article presents a new classification model based on a hybrid feature selection method that combines two common feature selection methods, Information Gain and a Genetic Algorithm, to keep only significant and high-quality features in the final classifier. The proposed hybrid approach achieved a 98.9% accuracy rate on a phishing email dataset comprising 8266 instances, an improvement of almost 4%. Furthermore, the presented technique reduces the search space by reducing the number of selected features.
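The phishing-email feature set is not available here, so the sketch below runs the same Information Gain + Genetic Algorithm pattern on synthetic data: mutual information stands in for information gain, and the GA is a deliberately tiny illustration (truncation selection, uniform crossover, bit-flip mutation, cross-validated accuracy as fitness). Population size, rates, and the random forest evaluator are arbitrary choices, not the paper's settings.

```python
# Filter stage (information-gain-style score) followed by a small GA wrapper over
# binary feature masks, with cross-validated accuracy as the fitness function.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=40, n_informative=8, random_state=0)

keep = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1][:20]   # filter: top 20 features
Xf = X[:, keep]

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, Xf[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(12, Xf.shape[1]))          # initial random population of masks
for gen in range(10):
    fit = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(fit)[::-1][:6]]              # truncation selection: fittest half
    children = []
    while len(children) < len(pop) - len(parents):
        a, b = parents[rng.integers(len(parents), size=2)]
        cross = rng.integers(0, 2, size=a.shape).astype(bool)   # uniform crossover
        child = np.where(cross, a, b)
        flip = rng.random(child.shape) < 0.05                   # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("features selected:", int(best.sum()), "CV accuracy: %.3f" % fitness(best))
```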


2019 ◽  
Vol 8 (4) ◽  
pp. 1333-1338

Text classification is a vital process due to the large volume of electronic articles. One of its drawbacks is the high dimensionality of the feature space. Scholars have developed several algorithms to choose relevant features from article text, such as Chi-square (χ2), Information Gain (IG), and Correlation-based Feature Selection (CFS). These algorithms have been investigated widely for English text, while studies for Arabic text are still limited. In this paper, we investigated four well-known algorithms, Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree, against a benchmark Arabic textual dataset, the Saudi Press Agency (SPA) corpus, to evaluate the impact of feature selection methods. Using the WEKA tool, we ran the four classification algorithms with and without the feature selection algorithms. The results provide clear evidence that the three feature selection methods often improve classification accuracy by eliminating irrelevant features.
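A scikit-learn equivalent of the with/without comparison can be sketched as follows; the SPA corpus is not publicly packaged, so 20 Newsgroups stands in for the Arabic data, and chi-squared represents the filter step. Classifier settings are library defaults, not the paper's WEKA configurations.

```python
# Each classifier evaluated with and without a chi-squared feature-selection step.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

cats = ["sci.med", "sci.space", "rec.autos"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

classifiers = {"SVM": LinearSVC(), "NB": MultinomialNB(),
               "KNN": KNeighborsClassifier(), "DT": DecisionTreeClassifier(random_state=0)}
for name, clf in classifiers.items():
    for selector in (None, SelectKBest(chi2, k=1000)):
        steps = [TfidfVectorizer()] + ([selector] if selector is not None else []) + [clf]
        pipe = make_pipeline(*steps).fit(train.data, train.target)
        tag = "with chi2" if selector is not None else "no selection"
        print(f"{name:3s} {tag:12s} accuracy = {pipe.score(test.data, test.target):.3f}")
```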


2020 ◽  
pp. 407-421
Author(s):  
Noria Bidi ◽  
Zakaria Elberrichi

This article presents a new adaptive algorithm called FS-SLOA (Feature Selection-Seven Spot Ladybird Optimization Algorithm), a meta-heuristic feature selection method based on the foraging behavior of the seven-spot ladybird. The technique is applied to find the best feature subset, i.e. the one that achieves the highest classification accuracy with three classifiers: Naive Bayes (NB), K-Nearest Neighbors (KNN), and the Support Vector Machine (SVM). The authors' approach is evaluated on four well-known benchmark datasets (Wisconsin Breast Cancer, Pima Diabetes, Mammographic Mass, and Dermatology) taken from the UCI machine learning repository. Experimental results show that FS-SLOA achieves the best classification accuracy across the different datasets.
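The ladybird search itself is not described in the abstract and is not reproduced here; the sketch below is only the fitness evaluation any such wrapper method needs, scoring a candidate binary feature subset on the Wisconsin Breast Cancer data with the three classifiers the paper uses. The candidate subset shown is a placeholder for one solution produced by the search.

```python
# Fitness evaluation for a wrapper feature-selection search: score one candidate subset.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)       # Wisconsin Breast Cancer dataset

def evaluate(mask):
    """Mean 5-fold CV accuracy of NB, KNN and SVM on the features flagged by `mask`."""
    Xs = X[:, mask.astype(bool)]
    models = {"NB": GaussianNB(),
              "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
              "SVM": make_pipeline(StandardScaler(), SVC())}
    return {name: cross_val_score(m, Xs, y, cv=5).mean() for name, m in models.items()}

candidate = np.zeros(X.shape[1], dtype=int)
candidate[:10] = 1                               # placeholder subset: keep the first ten features
print(evaluate(candidate))                       # one evaluation the FS-SLOA search would call repeatedly
```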


2020 ◽  
Vol 4 (1) ◽  
pp. 29
Author(s):  
Sasan Sarbast Abdulkhaliq ◽  
Aso Mohammad Darwesh

Nowadays, people from every part of the world use social media and social networks to express their feelings toward different topics and aspects. One of the trendiest social media platforms is Twitter, a microblogging website that lets its users publicly share their views and feelings about products, services, events, and more. This makes Twitter one of the most valuable sources for researchers and developers who collect and analyze data to reveal people's sentiment about different topics and services, such as products of commercial companies or well-known people such as politicians and athletes, by classifying those sentiments as positive or negative. Sentiment classification can be automated using machine learning algorithms and enhanced with appropriate feature selection methods. We collected the most recent tweets about Amazon, Trump, Chelsea FC, and CR7 using the Twitter Application Programming Interface and assigned sentiment scores using a lexicon rule-based approach. We then propose a machine learning model that improves classification accuracy through a hybrid feature selection method: the filter-based Chi-square (Chi-2) method followed by wrapper-based binary coordinate ascent (Chi-2 + BCA). The hybrid method selects an optimal subset of features from term frequency-inverse document frequency (TF-IDF) features for a support vector machine (SVM) classifier and from bag-of-words features for a logistic regression (LR) classifier, using different n-gram ranges. Comparing the hybrid (Chi-2 + BCA) method with Chi-2-selected features alone, and with the classifiers without feature subset selection, the results show that the hybrid feature selection method increases classification accuracy in all cases. The maximum accuracy attained with LR is 86.55% using the (1 + 2 + 3)-gram range, and with SVM 85.575% using the unigram range, both on the CR7 dataset.
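A hedged sketch of the Chi-2 + BCA idea on substitute data (collecting tweets and lexicon-based labelling are outside its scope): a chi-squared filter first narrows the TF-IDF vocabulary, then a simple binary coordinate ascent toggles each remaining feature in or out whenever that improves cross-validated SVM accuracy. Only one pass over the features is shown; in practice the pass repeats until no toggle helps.

```python
# Chi-2 filter stage followed by a one-pass binary coordinate ascent wrapper around an SVM.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

data = fetch_20newsgroups(subset="train", categories=["rec.sport.hockey", "sci.space"])
X = TfidfVectorizer(ngram_range=(1, 1), max_features=5000).fit_transform(data.data)
y = data.target

X = SelectKBest(chi2, k=100).fit_transform(X, y).toarray()    # filter stage (Chi-2)

def acc(mask):
    return cross_val_score(LinearSVC(), X[:, mask], y, cv=3).mean() if mask.any() else 0.0

mask = np.ones(X.shape[1], dtype=bool)                        # wrapper stage: binary coordinate ascent
best = acc(mask)
for j in range(X.shape[1]):
    mask[j] = ~mask[j]                                        # toggle feature j in/out
    new = acc(mask)
    if new > best:
        best = new                                            # keep the toggle
    else:
        mask[j] = ~mask[j]                                    # revert the toggle
print("features kept:", int(mask.sum()), "CV accuracy: %.3f" % best)
```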


2021 ◽  
Vol 9 ◽  
Author(s):  
Naresh Mali ◽  
Varun Dutt ◽  
K. V. Uday

Landslide disaster risk reduction necessitates the investigation of different geotechnical causal factors for slope failures. Machine learning (ML) techniques have been proposed to study causal factors across many application areas. However, the development of ensemble ML techniques for identifying the geotechnical causal factors for slope failures, and for their subsequent prediction, has been lacking in the literature. The primary goal of this research is to develop and evaluate novel feature selection methods for identifying causal factors for slope failures and to assess the potential of ensemble and individual ML techniques for slope failure prediction. Twenty-one geotechnical causal factors were obtained from 60 sites (both landslide and non-landslide) spread across a landslide-prone area in Mandi, India. Relevant causal factors were evaluated by developing a novel ensemble feature selection method that averages the outputs of individual feature selection methods: correlation, information gain, gain ratio, OneR, and F-ratio. Furthermore, different ensemble ML techniques (Random Forest (RF), AdaBoost (AB), Bagging, Stacking, and Voting) and individual ML techniques (Bayesian network (BN), decision tree (DT), multilayer perceptron (MLP), and support vector machine (SVM)) were calibrated on 70% of the locations and tested on the remaining 30% of the sites. The ensemble feature selection method yielded six major contributing parameters for slope failures: relative compaction, porosity, saturated permeability, slope angle, angle of internal friction, and in-situ moisture content. Furthermore, the ensemble RF and AB techniques performed best on the test data compared to the other ensemble and individual ML techniques. The present study discusses the implications of the different causal factors for slope failure prediction.
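The Mandi site data are not public, so the sketch below uses synthetic stand-in features; it only illustrates the 70/30 protocol of calibrating ensemble and individual classifiers on the six selected causal factors and comparing test accuracy (Stacking, Voting, and the Bayesian network are omitted for brevity).

```python
# 70/30 train/test comparison of ensemble vs individual classifiers on six stand-in features.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=60, n_features=6, n_informative=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=1)

models = {
    "RF": RandomForestClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Bagging": BaggingClassifier(random_state=1),
    "DT": DecisionTreeClassifier(random_state=1),
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=1)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:8s} test accuracy = {model.score(X_te, y_te):.2f}")
```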


2020 ◽  
Vol 20 ◽  
pp. 81-95
Author(s):  
Manal Nejjari ◽  
Abdelouafi Meziane

Sentiment Analysis, or Opinion Mining, has recently captivated the interest of scientists worldwide. With the increasing use of the internet, the web is becoming overloaded with data that contains useful information which can be exploited in different fields. Many studies have shed light on Sentiment Analysis of online data in different languages; however, the amount of research dealing with the Arabic language is still limited. In this paper, an empirical study is conducted on Sentiment Analysis of online reviews written in Modern Standard Arabic. A new system called SSAAR (System for Sentiment Analysis of Arabic Reviews) is proposed, allowing computational classification of reviews into three classes (positive, negative, neutral). The input data of this system are built using a proposed framework called SPPARF (Scraping and double Preprocessing Arabic Reviews Framework), which generates a structured and clean dataset. Moreover, the system evaluates two improved approaches for sentiment classification based on supervised learning: a double preprocessing method and a feature selection method. Both approaches are trained with five algorithms (Naïve Bayes, stochastic gradient descent classifier (SGD), Logistic Regression, K-Nearest Neighbors, and Random Forest) and compared under the same conditions. The experimental results show that the feature selection method with the SGD classifier achieves the best accuracy (77.1%). The SSAAR system therefore proved to be efficient and gives better results when using the feature selection method; nevertheless, satisfactory results were obtained with the other approach, which is consequently also suitable for the proposed system.
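A minimal sketch of the best-performing configuration the abstract reports, an SGD classifier trained on selected features for three-way labels; the SPPARF-built Arabic reviews are replaced by a tiny placeholder corpus, and the TF-IDF weighting, chi-squared selection, and loss choice are assumptions rather than the paper's settings.

```python
# Placeholder three-class sentiment pipeline: TF-IDF -> feature selection -> SGD classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

reviews = ["خدمة ممتازة وسريعة", "تجربة سيئة جدا", "المنتج عادي",
           "جودة رائعة", "لن أكرر الشراء", "مقبول"]          # toy stand-in for SPPARF output
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral"]

model = make_pipeline(
    TfidfVectorizer(),                        # SPPARF cleaning/normalisation steps omitted
    SelectKBest(chi2, k=8),                   # feature selection step
    SGDClassifier(loss="log_loss", random_state=0),
)
model.fit(reviews, labels)
print(model.predict(["خدمة رائعة"]))
```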

