A LEXICON BASED ALGORITHM FOR NOISY TEXT NORMALIZATION AS PRE-PROCESSING FOR SENTIMENT ANALYSIS

Sudipta Roy .

doi:10.15623/ijret.2013.0214013

An Enhancement of Malay Social Media Text Normalization for Lexicon-Based Sentiment Analysis

2019 International Conference on Asian Language Processing (IALP) ◽

10.1109/ialp48816.2019.9037700 ◽

2019 ◽

Author(s):

Muhammad Fakhrur Razi Abu Bakar ◽

Norisma Idris ◽

Liyana Shuib

Keyword(s):

Social Media ◽

Sentiment Analysis ◽

Social Media Text ◽

Text Normalization

Enhanced Text Stemmer with Noisy Text Normalization for Malay Texts

Smart Innovation, Systems and Technologies - Smart Trends in Computing and Communications ◽

10.1007/978-981-15-0077-0_44 ◽

2019 ◽

pp. 433-444

Author(s):

Mohamad Nizam Kassim ◽

Shaiful Hisham Mat Jali ◽

Mohd Aizaini Maarof ◽

Anazida Zainal ◽

Amirudin Abdul Wahab

Keyword(s):

Noisy Text ◽

Text Normalization

IoT-Based Pervasive Sentiment Analysis: A Fine-Grained Text Normalization Framework for Context Aware Hybrid Applications

10.1007/978-3-030-75123-4_10 ◽

2021 ◽

pp. 201-226

Author(s):

Asad Habib ◽

Arslan Ali Raza

Keyword(s):

Sentiment Analysis ◽

Context Aware ◽

Fine Grained ◽

Text Normalization

Normalization of Transliterated Mongolian Words Using Seq2Seq Model with Limited Data

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3464361 ◽

2021 ◽

Vol 20 (6) ◽

pp. 1-19

Author(s):

Zolzaya Byambadorj ◽

Ryota Nishimura ◽

Altangerel Ayush ◽

Norihide Kitaoka

Keyword(s):

Social Media ◽

Performance Enhancement ◽

Statistical Machine Translation ◽

Training Data ◽

Daily Lives ◽

Latin Script ◽

N Gram ◽

Noisy Text ◽

Different Cultures ◽

Text Normalization

The huge increase in social media use in recent years has resulted in new forms of social interaction, changing our daily lives. Due to increasing contact between people from different cultures as a result of globalization, there has also been an increase in the use of the Latin alphabet, and as a result a large amount of transliterated text is being used on social media. In this study, we propose a variety of character level sequence-to-sequence (seq2seq) models for normalizing noisy, transliterated text written in Latin script into Mongolian Cyrillic script, for scenarios in which there is a limited amount of training data available. We applied performance enhancement methods, which included various beam search strategies, N-gram-based context adoption, edit distance-based correction and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed seq2seq models improved the robustness of the basic seq2seq models for normalizing out-of-vocabulary (OOV) words, and most of our models achieved higher normalization performance than the conventional method. When using test data during our text normalization experiment, our proposed method which included checking each hypothesis during the inference period achieved the lowest word error rate (WER = 13.41%), which was 4.51% fewer errors than when using the conventional SMT method.

Low-resource neural character-based noisy text normalization

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-179039 ◽

2019 ◽

Vol 36 (5) ◽

pp. 4921-4929 ◽

Cited By ~ 1

Author(s):

Manuel Mager ◽

Mónica Jasso Rosales ◽

Özlem Çetinoğlu ◽

Ivan Meza

Keyword(s):

Low Resource ◽

Noisy Text ◽

Text Normalization

Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis

Social Network Analysis and Mining ◽

10.1007/s13278-019-0557-y ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 19

Author(s):

Monika Arora ◽

Vineet Kansal

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Sentiment Analysis ◽

Deep Convolutional Neural Network ◽

Unstructured Data ◽

Text Normalization

Text normalization of code mix and sentiment analysis

2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI) ◽

10.1109/icacci.2015.7275819 ◽

2015 ◽

Cited By ~ 7

Author(s):

Shashank Sharma ◽

PYKL Srinivas ◽

Rakesh Chandra Balabantaray

Keyword(s):

Sentiment Analysis ◽

Text Normalization

An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection

Journal of Data and Information Quality ◽

10.1145/3418036 ◽

2021 ◽

Vol 13 (3) ◽

pp. 1-25

Author(s):

Anurag Roy ◽

Shalmoli Ghosh ◽

Kripabandhu Ghosh ◽

Saptarshi Ghosh

Keyword(s):

Large Fraction ◽

Training Data ◽

Writing Style ◽

Robust Algorithms ◽

Normalization Methods ◽

Textual Data ◽

Digitized Documents ◽

Noisy Text ◽

Text Normalization

A large fraction of textual data available today contains various types of “noise,” such as OCR noise in digitized documents, noise due to informal writing style of users on microblogging sites, and so on. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many of the existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that is difficult to obtain. We propose an unsupervised algorithm for text normalization that does not need any training data/human intervention. The proposed algorithm is applicable to text over different languages and can handle both machine-generated and human-generated noise. Experiments over several standard datasets show that text normalization through the proposed algorithm enables better retrieval and stance detection, as compared to that using several baseline text normalization methods.

Study of Sentiment of Governor's Election Opinion in 2018

International Journal of Scientific Research in Science Engineering and Technology ◽

10.32628/ijsrset21841124 ◽

2018 ◽

pp. 231-238

Author(s):

Agung Eddy Suryo Saputro ◽

Khairil Anwar Notodiputro ◽

Indahwati A

Keyword(s):

Sentiment Analysis ◽

Naive Bayes ◽

Naïve Bayes ◽

Addition Method ◽

Sentiment Mining ◽

Positive Sentiment ◽

An historical analysis of species references in American English

Corpora ◽

10.3366/cor.2019.0177 ◽

2019 ◽

Vol 14 (3) ◽

pp. 327-349

Author(s):

Craig Frayne

Keyword(s):

Environmental Change ◽

Sentiment Analysis ◽

Quantitative Methods ◽

English Language ◽

Language Use ◽

American English ◽

Historical Analysis ◽

The Past ◽

Corpus Studies ◽

Google Books

This study uses the two largest available American English language corpora, Google Books and the Corpus of Historical American English (coha), to investigate relations between ecology and language. The paper introduces ecolinguistics as a promising theme for corpus research. While some previous ecolinguistic research has used corpus approaches, there is a case to be made for quantitative methods that draw on larger datasets. Building on other corpus studies that have made connections between language use and environmental change, this paper investigates whether linguistic references to other species have changed in the past two centuries and, if so, how. The methodology consists of two main parts: an examination of the frequency of common names of species followed by aspect-level sentiment analysis of concordance lines. Results point to both opportunities and challenges associated with applying corpus methods to ecolinguistc research.