BengSentiLex and BengSwearLex: creating lexicons for sentiment analysis and profanity detection in low-resource Bengali language

PeerJ Computer Science ◽

10.7717/peerj-cs.681 ◽

2021 ◽

Vol 7 ◽

pp. e681

Author(s):

Salim Sazzed

Keyword(s):

Social Media ◽

Sentiment Analysis ◽

Language Processing ◽

Supervised Machine Learning ◽

Translation System ◽

Low Resource ◽

Part Of Speech ◽

Sentiment Lexicon ◽

Cross Lingual ◽

Bengali Language

Bengali is a low-resource language that lacks tools and resources for various natural language processing (NLP) tasks, such as sentiment analysis or profanity identification. In Bengali, only the translated versions of English sentiment lexicons are available. Moreover, no dictionary exists for detecting profanity in Bengali social media text. This study introduces a Bengali sentiment lexicon, BengSentiLex, and a Bengali swear lexicon, BengSwearLex. For creating BengSentiLex, a cross-lingual methodology is proposed that utilizes a machine translation system, a review corpus, two English sentiment lexicons, pointwise mutual information (PMI), and supervised machine learning (ML) classifiers in various stages. A semi-automatic methodology is presented to develop BengSwearLex that leverages an obscene corpus, word embedding, and part-of-speech (POS) taggers. The performance of BengSentiLex compared with the translated English lexicons in three evaluation datasets. BengSentiLex achieves 5%–50% improvement over the translated lexicons. For identifying profanity, BengSwearLex achieves documentlevel coverage of around 85% in an document-level in the evaluation dataset. The experimental results imply that BengSentiLex and BengSwearLex are effective resources for classifying sentiment and identifying profanity in Bengali social media content, respectively.

Download Full-text

Deep Persian sentiment analysis: Cross-lingual training for low-resource languages

Journal of Information Science ◽

10.1177/0165551520962781 ◽

2020 ◽

pp. 016555152096278

Author(s):

Rouzbeh Ghasemi ◽

Seyed Arad Ashrafi Asli ◽

Saeedeh Momtazi

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Sentiment Analysis ◽

Language Processing ◽

Training Data ◽

Target Language ◽

Low Resource ◽

Proposed Model ◽

Significant Difference ◽

Cross Lingual

With the advent of deep neural models in natural language processing tasks, having a large amount of training data plays an essential role in achieving accurate models. Creating valid training data, however, is a challenging issue in many low-resource languages. This problem results in a significant difference between the accuracy of available natural language processing tools for low-resource languages compared with rich languages. To address this problem in the sentiment analysis task in the Persian language, we propose a cross-lingual deep learning framework to benefit from available training data of English. We deployed cross-lingual embedding to model sentiment analysis as a transfer learning model which transfers a model from a rich-resource language to low-resource ones. Our model is flexible to use any cross-lingual word embedding model and any deep architecture for text classification. Our experiments on English Amazon dataset and Persian Digikala dataset using two different embedding models and four different classification networks show the superiority of the proposed model compared with the state-of-the-art monolingual techniques. Based on our experiment, the performance of Persian sentiment analysis improves 22% in static embedding and 9% in dynamic embedding. Our proposed model is general and language-independent; that is, it can be used for any low-resource language, once a cross-lingual embedding is available for the source–target language pair. Moreover, by benefitting from word-aligned cross-lingual embedding, the only required data for a reliable cross-lingual embedding is a bilingual dictionary that is available between almost all languages and the English language, as a potential source language.

Download Full-text

Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion

Computational Intelligence and Neuroscience ◽

10.1155/2021/9975078 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Chenggang Mi ◽

Shaolin Zhu ◽

Rui Nie

Keyword(s):

Language Processing ◽

Data Augmentation ◽

Feature Fusion ◽

Training Data ◽

Low Resource ◽

High Resource ◽

Part Of Speech ◽

Word Level ◽

Cross Lingual ◽

Log Linear

Loanword identification is studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation, cross-lingual information retrieval, and so on. However, recent studies on this topic usually put efforts on high-resource languages (such as Chinese, English, and Russian); for low-resource languages, such as Uyghur and Mongolian, due to the limitation of resources and lack of annotated data, loanword identification on these languages tends to have lower performance. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource language loanword identification; then, a loanword identification model based on a log-linear RNN is introduced to improve the performance of low-resource loanword identification by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) into one model. Experimental results on loanword identification in Uyghur (in this study, we mainly focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) showed that our proposed method achieves best performance compared with several strong baseline systems.

Download Full-text

Sentiment Analysis of Tweets Using Hadoop

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i3.12.16123 ◽

2018 ◽

Vol 7 (3.12) ◽

pp. 434

Author(s):

Pranav Seth ◽

Apoorv Sharma ◽

R Vidhya

Keyword(s):

Social Media ◽

Sentiment Analysis ◽

Language Processing ◽

Software Toolkit ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

The World ◽

Media Channels ◽

Speech Tagging ◽

Source Of Information

Blogging and networking platforms like Facebook, Reddit, Twitter and LinkedIn are social media channels where users can share their thoughts and opinions. Since online chatter is a vital and exhaustive source of information, these thoughts and opinions hold the key to the success of any endeavour. Tweets which are posted by millions all over the world can be used to analyse consumers’ opinions about individual products, services and campaigns. These tweets have proven to be a valuable source of information in the recent years, playing key roles in success of brands, businesses and politicians. We have tackled Sentiment Analysis with a lexicon-based approach for extracting positive, negative, and neutral tweets by using part-of-speech tagging from natural language processing. The approach manifests in the design of a software toolkit that facilitates the sentiment analysis. We collect dataset, i.e. the tweets are fetched from Twitter and text mining techniques like tokenization are executed to use it for building classifier that is able to predict sentiments for each tweet.

Download Full-text

Generation of Cross-Lingual Word Vectors for Low-Resourced Languages Using Deep Learning and Topological Metrics in a Data-Efficient Way

Electronics ◽

10.3390/electronics10121372 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1372

Author(s):

Sanjanasri JP ◽

Vijay Krishna Menon ◽

Soman KP ◽

Rajendran S ◽

Agnieszka Wolk

Keyword(s):

Deep Learning ◽

Language Processing ◽

Semantic Space ◽

Semantic Interpretation ◽

Learning Approaches ◽

Qualitative Comparison ◽

Bilingual Dictionary ◽

Pos Tagging ◽

Part Of Speech ◽

Cross Lingual

Linguists have been focused on a qualitative comparison of the semantics from different languages. Evaluation of the semantic interpretation among disparate language pairs like English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has enabled a felicitous opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generation of embeddings to assess their effectiveness, using the original embeddings as ground truths. Transferability across other target languages of the proposed model was assessed via pre-trained Word2Vec embeddings from Hindi and Chinese languages. We empirically prove that with a bilingual dictionary of a thousand words and a corresponding small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that those are not the only possible applications.

Download Full-text

Two New Large Corpora for Vietnamese Aspect-based Sentiment Analysis at Sentence Level

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3446678 ◽

2021 ◽

Vol 20 (4) ◽

pp. 1-22

Author(s):

Dang Van Thin ◽

Ngan Luu-Thuy Nguyen ◽

Tri Minh Truong ◽

Lac Si Le ◽

Duy Tin Vo

Keyword(s):

Sentiment Analysis ◽

Language Processing ◽

Network Architectures ◽

Low Resource ◽

The Neural Network ◽

Sentence Level ◽

Push Forward ◽

Polarity Classification ◽

Learning Architectures ◽

Single Approach

Aspect-based sentiment analysis has been studied in both research and industrial communities over recent years. For the low-resource languages, the standard benchmark corpora play an important role in the development of methods. In this article, we introduce two benchmark corpora with the largest sizes at sentence-level for two tasks: Aspect Category Detection and Aspect Polarity Classification in Vietnamese. Our corpora are annotated with high inter-annotator agreements for the restaurant and hotel domains. The release of our corpora would push forward the low-resource language processing community. In addition, we deploy and compare the effectiveness of supervised learning methods with a single and multi-task approach based on deep learning architectures. Experimental results on our corpora show that the multi-task approach based on BERT architecture outperforms the neural network architectures and the single approach. Our corpora and source code are published on this footnoted site. 1

Download Full-text

Análise de discursos em notícias sobre homofobia, racismo e sexismo em comentários de portais brasileiros de notícias

10.14210/cotb.v12.p467-474 ◽

2021 ◽

Author(s):

Lucas Rodrigues ◽

Antonio Jacob Junior ◽

Fábio Lobato

Keyword(s):

Social Media ◽

Natural Language Processing ◽

Sentiment Analysis ◽

Data Visualization ◽

Language Processing ◽

Topic Modeling ◽

Hate Speech ◽

Psychological Impact ◽

Internet Service ◽

General Law

Posts with defamatory content or hate speech are constantly foundon social media. The results for readers are numerous, not restrictedonly to the psychological impact, but also to the growth of thissocial phenomenon. With the General Law on the Protection ofPersonal Data and the Marco Civil da Internet, service providersbecame responsible for the content in their platforms. Consideringthe importance of this issue, this paper aims to analyze the contentpublished (news and comments) on the G1 News Portal with techniquesbased on data visualization and Natural Language Processing,such as sentiment analysis and topic modeling. The results showthat even with most of the comments being neutral or negative andclassified or not as hate speech, the majority of them were acceptedby the users.

Download Full-text

SENTIMENT ANALYSIS FOR SARCASTIC MESSAGES IN SOCIAL MEDIA USING DEEP LEARNING TECHNIQUES - AN EMPIRICAL STUDY

INFORMATION TECHNOLOGY IN INDUSTRY ◽

10.17762/itii.v9i2.451 ◽

2021 ◽

Vol 9 (2) ◽

pp. 1051-1052

Author(s):

K. Kavitha, Et. al.

Keyword(s):

Social Media ◽

Deep Learning ◽

Empirical Study ◽

Sentiment Analysis ◽

Learning Strategies ◽

Language Processing ◽

The People ◽

Learning Techniques ◽

Key Terms ◽

Learning Frameworks

Sentiments is the term of opinion or views about any topic expressed by the people through a source of communication. Nowadays social media is an effective platform for people to communicate and it generates huge amount of unstructured details every day. It is essential for any business organization in the current era to process and analyse the sentiments by using machine learning and Natural Language Processing (NLP) strategies. Even though in recent times the deep learning strategies are becoming more familiar due to higher capabilities of performance. This paper represents an empirical study of an application of deep learning techniques in Sentiment Analysis (SA) for sarcastic messages and their increasing scope in real time. Taxonomy of the sentiment analysis in recent times and their key terms are also been highlighted in the manuscript. The survey concludes the recent datasets considered, their key contributions and the performance of deep learning model applied with its primary purpose like sarcasm detection in order to describe the efficiency of deep learning frameworks in the domain of sentimental analysis.

Download Full-text

Quantitative Methods for Analyzing Intimate Partner Violence in Microblogs: Observational Study

Journal of Medical Internet Research ◽

10.2196/15347 ◽

2020 ◽

Vol 22 (11) ◽

pp. e15347

Author(s):

Christopher Michael Homan ◽

J Nicolas Schrading ◽

Raymond W Ptucha ◽

Catherine Cerulli ◽

Cecilia Ovesdotter Alm

Keyword(s):

Machine Learning ◽

Social Media ◽

Intimate Partner Violence ◽

Language Processing ◽

Partner Violence ◽

Quantitative Methods ◽

Intimate Partner ◽

Support Vector ◽

Data Set ◽

Part Of Speech

Background Social media is a rich, virtually untapped source of data on the dynamics of intimate partner violence, one that is both global in scale and intimate in detail. Objective The aim of this study is to use machine learning and other computational methods to analyze social media data for the reasons victims give for staying in or leaving abusive relationships. Methods Human annotation, part-of-speech tagging, and machine learning predictive models, including support vector machines, were used on a Twitter data set of 8767 #WhyIStayed and #WhyILeft tweets each. Results Our methods explored whether we can analyze micronarratives that include details about victims, abusers, and other stakeholders, the actions that constitute abuse, and how the stakeholders respond. Conclusions Our findings are consistent across various machine learning methods, which correspond to observations in the clinical literature, and affirm the relevance of natural language processing and machine learning for exploring issues of societal importance in social media.

Download Full-text

Aspect-Based Sentiment Analysis of Online Product Reviews

Advances in Business Information Systems and Analytics - Handbook of Research on Advanced Data Mining Techniques and Applications for Business Intelligence ◽

10.4018/978-1-5225-2031-3.ch010 ◽

2017 ◽

pp. 175-191 ◽

Cited By ~ 1

Author(s):

Vinod Kumar Mishra ◽

Himanshu Tiruwa

Keyword(s):

Sentiment Analysis ◽

Computational Linguistics ◽

Language Processing ◽

Future Trend ◽

Product Reviews ◽

Customer Reviews ◽

Part Of Speech ◽

Lexical Approach ◽

Sentiment Score ◽

Pos Tagger

Sentiment analysis is a part of computational linguistics concerned with extracting sentiment and emotion from text. It is also considered as a task of natural language processing and data mining. Sentiment analysis mainly concentrate on identifying whether a given text is subjective or objective and if it is subjective, then whether it is negative, positive or neutral. This chapter provide an overview of aspect based sentiment analysis with current and future trend of research on aspect based sentiment analysis. This chapter also provide a aspect based sentiment analysis of online customer reviews of Nokia 6600. To perform aspect based classification we are using lexical approach on eclipse platform which classify the review as a positive, negative or neutral on the basis of features of product. The Sentiwordnet is used as a lexical resource to calculate the overall sentiment score of each sentence, pos tagger is used for part of speech tagging, frequency based method is used for extraction of the aspects/features and used negation handling for improving the accuracy of the system.

Download Full-text

Discovery of Sustainable Transport Modes Underlying TripAdvisor Reviews With Sentiment Analysis

Advances in Business Information Systems and Analytics - Natural Language Processing for Global and Local Business ◽

10.4018/978-1-7998-4240-8.ch008 ◽

2021 ◽

pp. 180-199

Author(s):

Ainhoa Serna ◽

Jon Kepa Gerrikagoitia

Keyword(s):

Social Media ◽

Sentiment Analysis ◽

Language Processing ◽

Predictive Analytics ◽

Data Gathering ◽

Point Of View ◽

Training Data ◽

Complete Analysis ◽

Sustainable Transport ◽

Transport Modes

In recent years, digital technology and research methods have developed natural language processing for better understanding consumers and what they share in social media. There are hardly any studies in transportation analysis with TripAdvisor, and moreover, there is not a complete analysis from the point of view of sentiment analysis. The aim of study is to investigate and discover the presence of sustainable transport modes underlying in non-categorized TripAdvisor texts, such as walking mobility in order to impact positively in public services and businesses. The methodology follows a quantitative and qualitative approach based on knowledge discovery techniques. Thus, data gathering, normalization, classification, polarity analysis, and labelling tasks have been carried out to obtain sentiment labelled training data set in the transport domain as a valuable contribution for predictive analytics. This research has allowed the authors to discover sustainable transport modes underlying the texts, focused on walking mobility but extensible to other means of transport and social media sources.

Download Full-text