A Statistical Parsing Framework for Sentiment Classification

2015 ◽  
Vol 41 (2) ◽  
pp. 293-336 ◽  
Author(s):  
Li Dong ◽  
Furu Wei ◽  
Shujie Liu ◽  
Ming Zhou ◽  
Ke Xu

We present a statistical parsing framework for sentence-level sentiment classification in this article. Unlike previous works that use syntactic parsing results for sentiment analysis, we develop a statistical parser to directly analyze the sentiment structure of a sentence. We show that complicated phenomena in sentiment analysis (e.g., negation, intensification, and contrast) can be handled the same way as simple and straightforward sentiment expressions in a unified and probabilistic way. We formulate the sentiment grammar upon Context-Free Grammars (CFGs), and provide a formal description of the sentiment parsing framework. We develop the parsing model to obtain possible sentiment parse trees for a sentence, from which the polarity model is proposed to derive the sentiment strength and polarity, and the ranking model is dedicated to selecting the best sentiment tree. We train the parser directly from examples of sentences annotated only with sentiment polarity labels but without any syntactic annotations or polarity annotations of constituents within sentences. Therefore we can obtain training data easily. In particular, we train a sentiment parser, s.parser, from a large amount of review sentences with users' ratings as rough sentiment polarity labels. Extensive experiments on existing benchmark data sets show significant improvements over baseline sentiment classification approaches.

Author(s):  
Cuong V. Nguyen ◽  
Khiem H. Le ◽  
Anh M. Tran ◽  
Binh T. Nguyen

With the booming development of e-commerce platforms in many countries, there is a massive amount of customer review data on different products and services. Understanding customers’ feedback on both current and new products can give online retailers the possibility to improve product quality, meet customers’ expectations, and increase the corresponding revenue. In this paper, we investigate the Vietnamese sentiment classification problem on two datasets containing Vietnamese customers’ reviews. We propose eight different approaches, including Bi-LSTM, Bi-LSTM + Attention, Bi-GRU, Bi-GRU + Attention, Recurrent CNN, Residual CNN, Transformer, and PhoBERT, and conduct all experiments on two datasets: AIVIVN 2019 and our own dataset collected from multiple Vietnamese e-commerce websites. The experimental results show that all our proposed methods outperform the winning solution of the competition “AIVIVN 2019 Sentiment Champion” by a significant margin. In particular, Recurrent CNN has the best performance among the algorithms in terms of both AUC (98.48%) and F1-score (93.42%) on the competition dataset, and it also surpasses the other techniques on our collected dataset. Finally, we aim to publish our code and these two datasets later to contribute to the research community in the field of sentiment analysis.


2016 ◽  
Vol 12 (4) ◽  
pp. 448-476 ◽  
Author(s):  
Amir Hosein Keyhanipour ◽  
Behzad Moshiri ◽  
Maryam Piroozmand ◽  
Farhad Oroumchian ◽  
Ali Moeini

Purpose Learning to rank algorithms inherently face many challenges. The most important are the high dimensionality of the training data, the dynamic nature of Web information resources and the lack of click-through data. High dimensionality of the training data affects the effectiveness and efficiency of learning algorithms. Besides, most learning to rank benchmark data sets do not include click-through data, a very rich source of information about the search behavior of users as they deal with ranked lists of search results. To deal with these limitations, this paper aims to introduce a novel learning to rank algorithm that uses a set of complex click-through features in a reinforcement learning (RL) model. These features are calculated from the existing click-through information in the data set or even from data sets without any explicit click-through information. Design/methodology/approach The proposed ranking algorithm (QRC-Rank) applies RL techniques to a set of calculated click-through features. QRC-Rank is a two-step process. In the first step, the Transformation phase, a compact benchmark data set is created which contains a set of click-through features. These features are calculated from the original click-through information available in the data set and constitute a compact representation of click-through information. To find the most effective click-through features, a number of scenarios are investigated. The second step is the Model-Generation phase, in which an RL model is built to rank the documents. This model is created by applying temporal difference learning methods such as Q-Learning and SARSA. Findings The proposed learning to rank method, QRC-Rank, is evaluated on the WCL2R and LETOR4.0 data sets. Experimental results demonstrate that QRC-Rank outperforms state-of-the-art learning to rank methods such as SVMRank, RankBoost, ListNet and AdaRank on the precision and normalized discounted cumulative gain evaluation criteria. The use of the click-through features calculated from the training data set is a major contributor to the performance of the system. Originality/value In this paper, we have demonstrated the viability of the proposed features, which provide a compact representation of the click-through data in a learning to rank application. These compact click-through features are calculated from the original features of the learning to rank benchmark data set. In addition, a Markov Decision Process model is proposed for the learning to rank problem using RL, including the sets of states, actions, rewarding strategy and the transition function.
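The temporal difference learning mentioned in this abstract can be illustrated with the standard tabular Q-learning update. The following is a minimal sketch, not the QRC-Rank implementation: the state encoding (rank positions), the action encoding (document ids), the reward, and the `alpha` and `gamma` values are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the updated value."""
    # Best value reachable from the next state (0.0 if no actions remain).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the TD target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical toy step: states are rank positions, actions are document ids.
q = defaultdict(float)
value = q_learning_update(q, state=0, action="doc_a", reward=1.0,
                          next_state=1, next_actions=["doc_b", "doc_c"])
```

SARSA differs only in the target: it uses the value of the action actually taken in the next state instead of the maximum over all next actions.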


2021 ◽  
Vol 11 (18) ◽  
pp. 8489
Author(s):  
Girma Neshir ◽  
Andreas Rauber ◽  
Solomon Atnafu

The emergence of the World Wide Web facilitates the growth of user-generated texts in less-resourced languages. Sentiment analysis of these texts may serve as a key performance indicator of the quality of services delivered by companies and government institutions. The presence of user-generated texts is an opportunity for assisting managers and policy-makers. These texts are used to improve performance and increase the level of customers’ satisfaction. Because of this potential, sentiment analysis has been widely researched in the past few years. A plethora of approaches and tools have been developed, albeit predominantly for well-resourced languages such as English. Resources for less-resourced languages such as Amharic, the language studied in this paper, are much less developed. As a result, sentiment analysis for such languages requires cost-effective approaches and massive amounts of annotated training data, calling for different approaches to be applied. This research investigates the performance of a combination of heterogeneous machine learning algorithms (base learners such as SVM, RF, and NB). These models are fused in the framework by a meta-learner (in this case, logistic regression) for Amharic sentiment classification. An annotated corpus is provided for evaluation of the classification framework. The proposed stacked approach, applying SMOTE to TF-IDF character (1,7)-gram features, has achieved an accuracy of 90%. The overall results of the meta-learner (i.e., the stacked ensemble) show a performance gain over the base learners with TF-IDF character n-grams.
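The TF-IDF character (1,7)-gram features named above can be sketched in plain Python. This is an illustration only, assuming the unsmoothed weight tf × log(N/df); the function names are hypothetical, and the paper's actual weighting and tooling are not stated in the abstract (libraries often use smoothed variants).

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=7):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf(docs, n_min=1, n_max=7):
    """Return one {ngram: weight} dict per document (plain tf * idf)."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    return [{g: tf * math.log(n_docs / df[g]) for g, tf in c.items()}
            for c in counts]


# Toy usage with unigrams only; any Unicode text works the same way.
weights = tfidf(["ab", "ac"], n_min=1, n_max=1)
```

Character n-grams need no tokenizer or stemmer, which is one reason they are a common choice for languages with few preprocessing tools.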


2017 ◽  
Vol 8 (3) ◽  
pp. 24-36 ◽  
Author(s):  
Rabindra K. Barik ◽  
Rojalina Priyadarshini ◽  
Nilamadhab Dash

The paper presents an extensive experimental study that focuses on the idea of Target Optimization (TO) prior to the training process of artificial machines. Generally, during the training process of an artificial machine, the output is computed from two important parameters, i.e., input and target. In general practice, the input is taken from the training data and the target is randomly chosen, which may not be relevant to the corresponding training data. Hence, the overall training of the neural network becomes inefficient. The present study puts forward TO as an efficient methodology which may be helpful in addressing this problem. The proposed work implements the concept of TO and compares the outcomes with those of conventional classifiers. In this regard, different benchmark data sets are used to compare the effect of TO on data classification using the Particle Swarm Optimization (PSO) and Gravitational Search Algorithm (GSA) optimization techniques.


2018 ◽  
Vol 42 (3) ◽  
pp. 343-354 ◽  
Author(s):  
Mike Thelwall

Purpose The purpose of this paper is to investigate whether machine learning induces gender biases in the sense of results that are more accurate for male authors or for female authors. It also investigates whether training separate male and female variants could improve the accuracy of machine learning for sentiment analysis. Design/methodology/approach This paper uses ratings-balanced sets of reviews of restaurants and hotels (3 sets) to train algorithms with and without gender selection. Findings Accuracy is higher on female-authored reviews than on male-authored reviews for all data sets, so applications of sentiment analysis using mixed-gender data sets will overrepresent the opinions of women. Training on same-gender data improves performance less than having additional data from both genders. Practical implications End users of sentiment analysis should be aware that its small gender biases can affect the conclusions drawn from it and apply correction factors when necessary. Users of systems that incorporate sentiment analysis should be aware that performance will vary by author gender. Developers do not need to create gender-specific algorithms unless they have more training data than their system can cope with. Originality/value This is the first demonstration of gender bias in machine learning sentiment analysis.


2019 ◽  
Vol 5 ◽  
pp. e194 ◽  
Author(s):  
Hyukjun Gweon ◽  
Matthias Schonlau ◽  
Stefan H. Steiner

The k nearest neighbor (kNN) approach is a simple and effective nonparametric algorithm for classification. One of the drawbacks of kNN is that the method can only give coarse estimates of class probabilities, particularly for low values of k. To avoid this drawback, we propose a new nonparametric classification method based on nearest neighbors conditional on each class: the proposed approach calculates the distance between a new instance and the kth nearest neighbor from each class, estimates posterior probabilities of class memberships using the distances, and assigns the instance to the class with the largest posterior. We prove that the proposed approach converges to the Bayes classifier as the size of the training data increases. Further, we extend the proposed approach to an ensemble method. Experiments on benchmark data sets show that both the proposed approach and the ensemble version of the proposed approach on average outperform kNN, weighted kNN, probabilistic kNN and two similar algorithms (LMkNN and MLM-kHNN) in terms of the error rate. A simulation shows that the proposed approach (kCNN) may be useful for estimating posterior probabilities when the class distributions overlap.
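The three steps described above (per-class kth-neighbor distance, a distance-based class score, and an argmax) can be sketched as follows. The abstract does not give the posterior formula, so the inverse-distance normalization below is an assumption made purely for illustration, and all function names are hypothetical.

```python
import math


def kth_class_distances(x, points, labels, k):
    """Distance from x to its k-th nearest neighbor within each class."""
    per_class = {}
    for p, y in zip(points, labels):
        per_class.setdefault(y, []).append(math.dist(x, p))
    # Classes with fewer than k points are skipped in this sketch.
    return {y: sorted(d)[k - 1] for y, d in per_class.items() if len(d) >= k}


def class_scores(x, points, labels, k, eps=1e-12):
    """ASSUMPTION: score each class by inverse k-th distance, then normalize."""
    dist = kth_class_distances(x, points, labels, k)
    inv = {y: 1.0 / (d + eps) for y, d in dist.items()}
    total = sum(inv.values())
    return {y: v / total for y, v in inv.items()}


def predict(x, points, labels, k):
    """Assign x to the class with the largest score."""
    scores = class_scores(x, points, labels, k)
    return max(scores, key=scores.get)


# Toy data: two well-separated classes in the plane.
points = [(0, 0), (0, 1), (5, 5), (6, 5)]
labels = ["a", "a", "b", "b"]
scores = class_scores((0.5, 0.5), points, labels, k=2)
```

Unlike plain kNN, whose probability estimates can only take values that are multiples of 1/k, a score built from continuous distances varies smoothly with the query point.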


Sentiment analysis, also known as opinion mining, is one of the hottest topics and fields on social networking sites nowadays. Here, we use Twitter, one of the biggest web destinations for people to communicate with each other, to perform sentiment analysis and opinion mining by extracting tweets from various users. Users can post brief text updates on Twitter, as it allows only 140 characters in one text message. Hashtags help to search for tweets dealing with a specified subject. In previous research, polarity classification usually relies on the sentiment polarity (positive, negative and neutral). The advantage is that different meanings of the same word may have different polarities, which can then be identified easily. In multiclass classification, many tweets of one class are classified as if they belonged to the others. The neutral class presented the lowest precision in all the research carried out in this area. The set of tweets containing text and emoticon data will be classified into 13 classes. From each tweet, we extract different sets of features using a one-hot encoding algorithm and use machine learning algorithms to perform classification. The tweets will be divided into training and testing data sets. The training data set will be pre-processed and classified using various artificial neural network algorithms such as the Recurrent Neural Network and the Convolutional Neural Network. Moreover, the same procedure will be followed for the text and emoticon data. The developed model will be tested using the testing data set. More accurate results can be obtained using this multiclass classification of text and emoticons. Four key performance indicators will be used to evaluate the effectiveness of the approach.
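One-hot encoding, the feature extraction step named in this abstract, maps each distinct token to a vector with a single 1. The sketch below is a generic illustration; the tokenization and the handling of unseen tokens (an all-zero vector) are assumptions, not details from the paper.

```python
def build_vocab(token_lists):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    return vocab


def one_hot(token, vocab):
    """Return a vector with a 1 at the token's index (all zeros if unseen)."""
    vec = [0] * len(vocab)
    if token in vocab:
        vec[vocab[token]] = 1
    return vec


# Hypothetical tokenized tweets mixing words and emoticons.
vocab = build_vocab([["good", ":)"], ["bad", "good"]])
encoded = [one_hot(t, vocab) for t in ["bad", ":)"]]
```

Emoticons are treated as ordinary tokens here, so text and emoticon data can share one vocabulary.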


2021 ◽  
Vol 35 (4) ◽  
pp. 307-314
Author(s):  
Redouane Karsi ◽  
Mounia Zaim ◽  
Jamila El Alami

Traditionally, pharmacovigilance data are collected during clinical trials on a small sample of patients and are therefore insufficient to adequately assess drugs. Nowadays, consumers use online drug forums to share their opinions and experiences about medication. This feedback, which is widely available on the web, is automatically analyzed to extract relevant information for decision-making. Currently, sentiment analysis methods are being put forward to leverage consumers' opinions and produce useful drug monitoring indicators. However, these methods' effectiveness depends on the quality of word representation, which presents a real challenge because the information contained in user reviews is noisy and very subjective. Over time, several sentiment classification approaches have used machine learning methods based on the traditional bag-of-words model, sometimes enhanced with lexical resources. In recent years, word embedding models have significantly improved classification performance due to their ability to capture words' syntactic and semantic properties. Unfortunately, these latter models are weak in sentiment classification tasks because they are unable to encode sentiment information in the word representation. Indeed, two words with opposite polarities can have close word embeddings because they appear in the same contexts. To overcome this drawback, some studies have proposed refining pre-trained word embeddings with lexical resources or learning word embeddings using training data. However, these models depend on external resources and are complex to implement. This work proposes a deep contextual word embeddings model called ELMo that inherently captures the sentiment information by providing separate vectors for words with opposite polarities. Different variants of our proposed model are compared with a benchmark of pre-trained word embedding models using an SVM classifier trained on the Drug Review Dataset. Experimental results show that ELMo embeddings improve classification performance in sentiment analysis tasks in the pharmaceutical domain.


2014 ◽  
Vol 644-650 ◽  
pp. 2009-2012 ◽  
Author(s):  
Hai Tao Zhang ◽  
Bin Jun Wang

To address the low efficiency of KNN- or K-Means-like algorithms in classification, a novel extension distance of interval is proposed to measure the similarity between testing data and the class domain. The method constructs representatives for data points in a shorter time than traditional methods, and these representatives replace the original dataset as the basis of classification. In effect, the construction of the model containing the representatives makes classification faster. Experimental results on two benchmark data sets verify the effectiveness and applicability of the proposed work. The model-based method using extension distance can effectively build data models that represent the whole training data, and thus the problem of the high cost of classifying new instances is solved.


Author(s):  
Jingjing Wang ◽  
Jie Li ◽  
Shoushan Li ◽  
Yangyang Kang ◽  
Min Zhang ◽  
...  

Aspect sentiment classification, a challenging task in sentiment analysis, has been attracting more and more attention in recent years. In this paper, we highlight the need for incorporating the importance degrees of both words and clauses inside a sentence and propose a hierarchical network with both word-level and clause-level attentions to aspect sentiment classification. Specifically, we first adopt sentence-level discourse segmentation to segment a sentence into several clauses. Then, we leverage multiple Bi-directional LSTM layers to encode all clauses and propose a word-level attention layer to capture the importance degrees of words in each clause. Third and finally, we leverage another Bi-directional LSTM layer to encode the outputs from the former layers and propose a clause-level attention layer to capture the importance degrees of all the clauses inside a sentence. Experimental results on the laptop and restaurant datasets from SemEval-2015 demonstrate the effectiveness of our proposed approach to aspect sentiment classification.
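The two attention levels described above can be illustrated with a small attention-pooling routine applied first over the words in each clause and then over the clauses. This is a simplified sketch: the Bi-directional LSTM encoders are omitted, and the dot-product scoring, the fixed query vectors, and the toy inputs are assumptions rather than the authors' architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weight each vector by softmax(dot(vector, query)) and sum them."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, vectors))
              for j in range(dim)]
    return pooled, weights


# Hypothetical sentence with two clauses; each word is a 2-d vector.
clauses = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
word_query = [1.0, 0.0]    # assumed word-level attention query
clause_query = [0.0, 1.0]  # assumed clause-level attention query

# Word-level attention yields one vector per clause ...
clause_vecs = [attention_pool(c, word_query)[0] for c in clauses]
# ... and clause-level attention yields one vector for the sentence.
sentence_vec, clause_weights = attention_pool(clause_vecs, clause_query)
```

In a trained model the query vectors would be learned parameters, and the resulting weights indicate which words and clauses matter most for the aspect in question.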

