One for “All”: a unified model for fine-grained sentiment analysis under three tasks

PeerJ Computer Science ◽

10.7717/peerj-cs.816 ◽

2021 ◽

Vol 7 ◽

pp. e816

Author(s):

Heng-yang Lu ◽

Jun Yang ◽

Cong Hu ◽

Wei Fang

Keyword(s):

Sentiment Analysis ◽

Data Augmentation ◽

Language Model ◽

Unified Model ◽

Training Data ◽

Low Resource ◽

Fine Grained ◽

Questions And Answers ◽

Resource Conditions ◽

Media Data

Background: Fine-grained sentiment analysis is used to interpret, from consumers’ written comments, their sentiments towards specific entities on specific aspects. Previous researchers have introduced three main tasks in this field (ABSA, TABSA, MEABSA), covering all kinds of social media data (e.g., review specific, questions and answers, and community-based). In this paper, we identify and address two common challenges encountered in these three tasks: the low-resource problem and sentiment polarity bias.

Methods: We propose a unified model called PEA, which integrates data augmentation with a pre-trained language model and is suitable for all three tasks (ABSA, TABSA and MEABSA). Two data augmentation methods, entity replacement and dual noise injection, are introduced to address both challenges at the same time. An ensemble method is also introduced to combine the results of the basic RNN-based and BERT-based models.

Results: PEA shows significant improvements on all three fine-grained sentiment analysis tasks when compared with state-of-the-art models. It also achieves results comparable to those of the baseline models while using only 20% of their training data, which demonstrates its extraordinary performance under extreme low-resource conditions.
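To make the idea of entity replacement concrete, the following is a minimal sketch only. The abstract does not describe the authors' actual procedure, so the data layout (text, entity, aspect, polarity), the function names `replace_entity` and `augment`, and the product names are all hypothetical. The sketch swaps the target entity for another known entity while keeping the aspect and polarity label unchanged, which yields extra labeled examples without new annotation.

```python
import random
from typing import List, Tuple

# (text, entity, aspect, polarity); this layout is an assumption for illustration.
Example = Tuple[str, str, str, str]


def replace_entity(example: Example, candidates: List[str], rng: random.Random) -> Example:
    """Return a copy of the example with its entity swapped for another candidate."""
    text, entity, aspect, polarity = example
    options = [c for c in candidates if c != entity]
    if not options or entity not in text:
        # Nothing to swap: return the example unchanged.
        return example
    new_entity = rng.choice(options)
    # The aspect and polarity label are kept as they are.
    return (text.replace(entity, new_entity), new_entity, aspect, polarity)


def augment(dataset: List[Example], candidates: List[str],
            copies: int = 2, seed: int = 0) -> List[Example]:
    """Keep the original examples and append `copies` replaced variants of each."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for example in dataset:
        for _ in range(copies):
            augmented.append(replace_entity(example, candidates, rng))
    return augmented


# Hypothetical toy data, not taken from the paper.
train = [
    ("The battery of PhoneA lasts all day", "PhoneA", "battery", "positive"),
    ("PhoneB has a dim screen", "PhoneB", "screen", "negative"),
]
entities = ["PhoneA", "PhoneB", "PhoneC"]
augmented = augment(train, entities)
for row in augmented:
    print(row)
```

Which variants are produced depends on the random seed; the original examples are always kept at the front of the returned list.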


Semi-Supervised Aspect-Based Sentiment Analysis for Case-Related Microblog Reviews Using Case Knowledge Graph Embedding

International Journal of Asian Language Processing ◽

10.1142/s2717554520500125 ◽

2021 ◽

pp. 2050012

Author(s):

Peilian Zhao ◽

Cunli Mao ◽

Zhengtao Yu

Keyword(s):

Sentiment Analysis ◽

Domain Knowledge ◽

Opinion Mining ◽

Data Augmentation ◽

Training Data ◽

Knowledge Graph ◽

Fine Grained ◽

Learning Framework ◽

Proposed Model ◽

Real World Applications

Aspect-Based Sentiment Analysis (ABSA), a fine-grained opinion mining task that aims to extract the sentiment toward a specific target from text, is important in many real-world applications, especially in the legal field. In this paper, we study two problems for End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA) in the legal field: the limited amount of labeled training data and the neglect of in-domain knowledge representation. We propose a new deep learning method, named Semi-ETEKGs, which applies the E2E framework with knowledge graph (KG) embedding in the legal field after data augmentation (DA). Specifically, we pre-train the BERT embedding and the in-domain KG embedding on unlabeled data and on labeled data with case elements after DA, and then feed the two embeddings into the E2E framework to classify the polarity of the target entity. Finally, we build a case-related dataset based on a popular ABSA benchmark to demonstrate the effectiveness of Semi-ETEKGs; experiments on this case-related dataset of microblog comments show that our proposed model significantly outperforms the compared methods.


Knowing What, How and Why: A Near Complete Solution for Aspect-Based Sentiment Analysis

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6383 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8600-8607

Author(s):

Haiyun Peng ◽

Lu Xu ◽

Lidong Bing ◽

Fei Huang ◽

Wei Lu ◽

...

Keyword(s):

Sentiment Analysis ◽

State Of The Art ◽

Complete Solution ◽

Unified Model ◽

Two Stage ◽

Fine Grained ◽

Aspect Extraction ◽

Second Stage ◽

Opinion Extraction ◽

Complete Story

Target-based sentiment analysis or aspect-based sentiment analysis (ABSA) refers to addressing various sentiment analysis tasks at a fine-grained level, which includes but is not limited to aspect extraction, aspect sentiment classification, and opinion extraction. There exist many solvers of the above individual subtasks or a combination of two subtasks, and they can work together to tell a complete story, i.e. the discussed aspect, the sentiment on it, and the cause of the sentiment. However, no previous ABSA research tried to provide a complete solution in one shot. In this paper, we introduce a new subtask under ABSA, named aspect sentiment triplet extraction (ASTE). Particularly, a solver of this task needs to extract triplets (What, How, Why) from the inputs, which show WHAT the targeted aspects are, HOW their sentiment polarities are and WHY they have such polarities (i.e. opinion reasons). For instance, one triplet from “Waiters are very friendly and the pasta is simply average” could be (‘Waiters’, positive, ‘friendly’). We propose a two-stage framework to address this task. The first stage predicts what, how and why in a unified model, and then the second stage pairs up the predicted what (how) and why from the first stage to output triplets. In the experiments, our framework has set a benchmark performance in this novel triplet extraction task. Meanwhile, it outperforms a few strong baselines adapted from state-of-the-art related methods.
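The triplet format and the pairing step can be illustrated with a small sketch. This is not the authors' second-stage model: the abstract does not say how pairing is done, so the sketch swaps in a simple nearest-span heuristic. The `Span` and `Triplet` classes, the function `pair_by_distance`, and the stage-one outputs below (including the label for "pasta" and the opinion span "simply average") are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    text: str
    start: int  # index of the first token
    end: int    # index of the last token (inclusive)


@dataclass(frozen=True)
class Triplet:
    aspect: str    # WHAT
    polarity: str  # HOW
    opinion: str   # WHY


def gap(a: Span, b: Span) -> int:
    """Token distance between two non-overlapping spans."""
    return min(abs(b.start - a.end), abs(a.start - b.end))


def pair_by_distance(aspects: List[Tuple[Span, str]], opinions: List[Span]) -> List[Triplet]:
    """Pair each (aspect, polarity) with its nearest opinion span.

    This heuristic is an assumption used for illustration only.
    """
    triplets = []
    for aspect, polarity in aspects:
        if not opinions:
            break
        nearest = min(opinions, key=lambda o: gap(aspect, o))
        triplets.append(Triplet(aspect.text, polarity, nearest.text))
    return triplets


# Token indices: 0 Waiters, 1 are, 2 very, 3 friendly, 4 and,
#                5 the, 6 pasta, 7 is, 8 simply, 9 average
# Hypothetical stage-one predictions for the example sentence.
aspects = [(Span("Waiters", 0, 0), "positive"), (Span("pasta", 6, 6), "neutral")]
opinions = [Span("friendly", 3, 3), Span("simply average", 8, 9)]

triplets = pair_by_distance(aspects, opinions)
for t in triplets:
    print(t)
```

With these inputs the first triplet matches the example in the abstract, (‘Waiters’, positive, ‘friendly’). A distance heuristic breaks down when several aspects share one opinion or when opinions are far from their aspects, which is one reason a learned pairing stage may be preferable.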


A review: preprocessing techniques and data augmentation for sentiment analysis

Computational Social Networks ◽

10.1186/s40649-020-00080-x ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Huu-Thanh Duong ◽

Tram-Anh Nguyen-Thi

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Supervised Learning ◽

Data Augmentation ◽

Original Data ◽

Training Data ◽

Unseen Data ◽

Augmentation Techniques ◽

User Intervention

In the literature, machine learning-based studies of sentiment analysis usually rely on supervised learning, which requires sufficiently large pre-labeled datasets in the target domain. Such datasets are tedious, expensive and time-consuming to build, and unseen data remains hard to handle. This paper applies semi-supervised learning to Vietnamese sentiment analysis, for which datasets are limited. We summarize many preprocessing techniques used to clean and normalize data, including negation handling and intensification handling, to improve performance. Moreover, we present data augmentation techniques, which generate new data from the original data to enrich the training data without user intervention. In our experiments, we evaluated various aspects and obtained competitive results, which may motivate further proposals.


Deep Persian sentiment analysis: Cross-lingual training for low-resource languages

Journal of Information Science ◽

10.1177/0165551520962781 ◽

2020 ◽

pp. 016555152096278

Author(s):

Rouzbeh Ghasemi ◽

Seyed Arad Ashrafi Asli ◽

Saeedeh Momtazi

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Sentiment Analysis ◽

Language Processing ◽

Training Data ◽

Target Language ◽

Low Resource ◽

Proposed Model ◽

Significant Difference ◽

Cross Lingual

With the advent of deep neural models in natural language processing tasks, having a large amount of training data plays an essential role in achieving accurate models. Creating valid training data, however, is a challenging issue in many low-resource languages. This problem results in a significant difference between the accuracy of available natural language processing tools for low-resource languages and those for rich-resource languages. To address this problem in the sentiment analysis task in the Persian language, we propose a cross-lingual deep learning framework to benefit from available training data of English. We deployed cross-lingual embedding to model sentiment analysis as a transfer learning model which transfers a model from a rich-resource language to low-resource ones. Our model can flexibly use any cross-lingual word embedding model and any deep architecture for text classification. Our experiments on the English Amazon dataset and the Persian Digikala dataset using two different embedding models and four different classification networks show the superiority of the proposed model compared with the state-of-the-art monolingual techniques. Based on our experiments, the performance of Persian sentiment analysis improves by 22% with static embedding and by 9% with dynamic embedding. Our proposed model is general and language-independent; that is, it can be used for any low-resource language, once a cross-lingual embedding is available for the source–target language pair. Moreover, by benefitting from word-aligned cross-lingual embedding, the only data required for a reliable cross-lingual embedding is a bilingual dictionary, which is available between almost all languages and English as a potential source language.


Dialog State Tracking with Reinforced Data Augmentation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6491 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9474-9481

Author(s):

Yichun Yin ◽

Lifeng Shang ◽

Xin Jiang ◽

Xiao Chen ◽

Qun Liu

Keyword(s):

Data Augmentation ◽

State Of The Art ◽

The State ◽

Training Data ◽

Quality Data ◽

Specific Context ◽

High Quality ◽

High Quality Data ◽

Fine Grained ◽

State Tracking

Neural dialog state trackers are generally limited by the lack of quantity and diversity in annotated training data. In this paper, we address this difficulty by proposing a reinforcement learning (RL) based framework for data augmentation that can generate high-quality data to improve the neural state tracker. Specifically, we introduce a novel contextual bandit generator to learn fine-grained augmentation policies that can generate new effective instances by choosing suitable replacements for a specific context. Moreover, by alternating learning between the generator and the state tracker, we can keep refining the generative policies to generate more high-quality training data for the neural state tracker. Experimental results on the WoZ and MultiWoZ (restaurant) datasets demonstrate that the proposed framework significantly improves performance over the state-of-the-art models, especially with limited training data.


Neural language model based training data augmentation for weakly supervised early rumor detection

Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining ◽

10.1145/3341161.3342892 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sooji Han ◽

Jie Gao ◽

Fabio Ciravegna

Keyword(s):

Data Augmentation ◽

Language Model ◽

Training Data ◽

Model Based ◽

Weakly Supervised ◽

Rumor Detection


Data Augmentation for Low Resource Sentiment Analysis Using Generative Adversarial Networks

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2019.8682544 ◽

2019 ◽

Cited By ~ 1

Author(s):

Rahul Gupta

Keyword(s):

Sentiment Analysis ◽

Data Augmentation ◽

Generative Adversarial Networks ◽

Low Resource ◽

Adversarial Networks


A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation

Information ◽

10.3390/info11050255 ◽

2020 ◽

Vol 11 (5) ◽

pp. 255

Author(s):

Yu Li ◽

Xiao Li ◽

Yating Yang ◽

Rui Dong

Keyword(s):

Machine Translation ◽

English Translation ◽

Data Augmentation ◽

Sampling Strategy ◽

Training Data ◽

Neural Machine Translation ◽

Low Resource ◽

Parallel Data ◽

Augmentation Strategy ◽

Diverse Data

One important issue that affects the performance of neural machine translation is the scale of available parallel data. For low-resource languages, the amount of parallel data is not sufficient, which results in poor translation quality. In this paper, we propose a diverse data augmentation method that does not use extra monolingual data. We expand the training data by generating diverse pseudo-parallel data on the source and target sides. To generate diverse data, a restricted sampling strategy is employed at the decoding steps. Finally, we filter and merge the original data and the synthetic parallel corpus to train the final model. In the experiment, the proposed approach achieved an improvement of 1.96 BLEU points in the IWSLT2014 German–English translation tasks, which were used to simulate a low-resource language. Our approach also consistently and substantially obtained 1.0 to 2.0 BLEU improvement in three other low-resource translation tasks, including English–Turkish, Nepali–English, and Sinhala–English translation tasks.
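The abstract does not define its restricted sampling strategy. As a rough sketch under one common reading, the code below assumes that each decoding step samples the next token only from the k highest-scoring candidates instead of always taking the single best one. The function name `restricted_sample`, the value of k, and the toy scores are hypothetical and are not taken from the paper.

```python
import math
import random
from typing import Dict


def restricted_sample(scores: Dict[str, float], k: int, rng: random.Random) -> str:
    """Sample one token from the k highest-scoring candidates.

    `scores` maps each candidate token to an unnormalized log-score.
    Restricting to the top k is an assumption about what "restricted" means.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    best = max(score for _, score in top)
    # Softmax weights over the kept candidates (shifted for numerical stability).
    weights = [math.exp(score - best) for _, score in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical scores for one decoding step.
step_scores = {"house": 2.0, "home": 1.5, "building": 0.2, "banana": -3.0}
rng = random.Random(0)
samples = [restricted_sample(step_scores, 2, rng) for _ in range(200)]
print(sorted(set(samples)))
```

With k = 1 the function reduces to greedy decoding. Larger k gives more varied outputs at the cost of more low-quality candidates, which is consistent with the filtering step the abstract mentions.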


Low-Resource Named Entity Recognition via the Pre-Training Model

Symmetry ◽

10.3390/sym13050786 ◽

2021 ◽

Vol 13 (5) ◽

pp. 786

Author(s):

Siqi Chen ◽

Yijie Pei ◽

Zunwang Ke ◽

Wushour Silamu

Keyword(s):

Data Augmentation ◽

Language Model ◽

Named Entity Recognition ◽

Name Entity Recognition ◽

Fine Tuning ◽

Entity Recognition ◽

Language Models ◽

Low Resource ◽

Named Entity ◽

High Resource

Named entity recognition (NER) is an important natural language processing task that requires determining entity boundaries and classifying entities into pre-defined categories. Most state-of-the-art systems require tens of thousands of annotated sentences to obtain high performance. However, minimal annotated data is available for Uyghur and Hungarian (UH languages) NER tasks. Each task also has its own specificities: differences in words and word order across languages make it a challenging problem. In this paper, we present an effective solution that provides a meaningful and easy-to-use feature extractor for named entity recognition tasks: fine-tuning the pre-trained language model. We propose a fine-tuning method for a low-resource language model, which constructs a fine-tuning dataset through data augmentation; then the dataset of a high-resource language is added; and finally the cross-language pre-trained model is fine-tuned on this dataset. In addition, we propose an attention-based fine-tuning strategy that uses symmetry to better select relevant semantic and syntactic information from pre-trained language models and applies these symmetry features to named entity recognition tasks. We evaluated our approach on Uyghur and Hungarian datasets, where it showed strong performance compared to some strong baselines. We close with an overview of the available resources for named entity recognition and some of the open research questions.


Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion

Computational Intelligence and Neuroscience ◽

10.1155/2021/9975078 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Chenggang Mi ◽

Shaolin Zhu ◽

Rui Nie

Keyword(s):

Language Processing ◽

Data Augmentation ◽

Feature Fusion ◽

Training Data ◽

Low Resource ◽

High Resource ◽

Part Of Speech ◽

Word Level ◽

Cross Lingual ◽

Log Linear

Loanword identification has been studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic usually focus on high-resource languages (such as Chinese, English, and Russian); for low-resource languages, such as Uyghur and Mongolian, loanword identification tends to perform worse because of limited resources and a lack of annotated data. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource language loanword identification; then, a loanword identification model based on a log-linear RNN is introduced to improve the performance of low-resource loanword identification by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) into one model. Experimental results on loanword identification in Uyghur (in this study, we mainly focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) showed that our proposed method achieves the best performance compared with several strong baseline systems.
