Enhanced Text Stemmer with Noisy Text Normalization for Malay Texts

Author(s):  
Mohamad Nizam Kassim ◽  
Shaiful Hisham Mat Jali ◽  
Mohd Aizaini Maarof ◽  
Anazida Zainal ◽  
Amirudin Abdul Wahab

Author(s):  
Zolzaya Byambadorj ◽  
Ryota Nishimura ◽  
Altangerel Ayush ◽  
Norihide Kitaoka

The sharp rise in social media use in recent years has created new forms of social interaction and changed our daily lives. Globalization has increased contact between people from different cultures and, with it, use of the Latin alphabet, so a large amount of transliterated text now appears on social media. In this study, we propose a variety of character-level sequence-to-sequence (seq2seq) models for normalizing noisy, transliterated text written in Latin script into Mongolian Cyrillic script, for scenarios in which only a limited amount of training data is available. We applied performance enhancement methods, including various beam search strategies, N-gram-based context adoption, edit distance-based correction, and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed seq2seq models improved the robustness of the basic seq2seq models for normalizing out-of-vocabulary (OOV) words, and most of our models achieved higher normalization performance than the conventional method. On the test data in our text normalization experiment, the proposed method that checks each hypothesis during inference achieved the lowest word error rate (WER = 13.41%), with 4.51% fewer errors than the conventional SMT method.
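The word error rate reported above can be illustrated with a minimal sketch. It uses the standard definition of WER (word-level edit distance divided by the number of reference words) and assumes whitespace tokenization; the abstract does not describe the authors' scoring setup, so the function names `edit_distance` and `word_error_rate` and the sample strings are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()  # assumption: whitespace tokenization
        total_edits += edit_distance(ref_tokens, hyp.split())
        total_words += len(ref_tokens)
    if total_words == 0:
        raise ValueError("reference corpus is empty")
    return total_edits / total_words


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion over four words.
    print(word_error_rate(["a b c d"], ["a x c"]))
```

Because `edit_distance` accepts any sequence of strings, the same function also works at the character level, which is the granularity an edit distance-based correction step would typically operate on.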



2019 ◽  
Vol 36 (5) ◽  
pp. 4921-4929 ◽  
Author(s):  
Manuel Mager ◽  
Mónica Jasso Rosales ◽  
Özlem Çetinoğlu ◽  
Ivan Meza


2021 ◽  
Vol 13 (3) ◽  
pp. 1-25
Author(s):  
Anurag Roy ◽  
Shalmoli Ghosh ◽  
Kripabandhu Ghosh ◽  
Saptarshi Ghosh

A large fraction of the textual data available today contains various types of “noise,” such as OCR noise in digitized documents and noise from the informal writing style of users on microblogging sites. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that are difficult to obtain. We propose an unsupervised algorithm for text normalization that needs no training data or human intervention. The proposed algorithm is applicable to text in different languages and can handle both machine-generated and human-generated noise. Experiments over several standard datasets show that text normalized by the proposed algorithm yields better retrieval and stance detection than text normalized by several baseline methods.





Author(s):  
Ernest Pusateri ◽  
Bharat Ram Ambati ◽  
Elizabeth Brooks ◽  
Ondrej Platek ◽  
Donald McAllaster ◽  
...  


2021 ◽  
Author(s):  
Divya ◽  
Sunil Kumar ◽  
Debanjan Sadhya ◽  
Santosh Singh Rathore


2021 ◽  
Author(s):  
Yang Zhang ◽  
Evelina Bakhturina ◽  
Kyle Gorman ◽  
Boris Ginsburg



