noisy text Latest Research Papers

Mixed script identification is a hindrance for automated natural language processing systems. Mixing cursive scripts of different languages is a challenge because NLP methods like POS tagging and word sense disambiguation suffer from noisy text. This study tackles the challenge of mixed script identification for mixed-code dataset consisting of Roman Urdu, Hindi, Saraiki, Bengali, and English. The language identification model is trained using word vectorization and RNN variants. Moreover, through experimental investigation, different architectures are optimized for the task associated with Long Short-Term Memory (LSTM), Bidirectional LSTM, Gated Recurrent Unit (GRU), and Bidirectional Gated Recurrent Unit (Bi-GRU). Experimentation achieved the highest accuracy of 90.17 for Bi-GRU, applying learned word class features along with embedding with GloVe. Moreover, this study addresses the issues related to multilingual environments, such as Roman words merged with English characters, generative spellings, and phonetic typing.

Download Full-text

Normalization of Transliterated Mongolian Words Using Seq2Seq Model with Limited Data

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3464361 ◽

2021 ◽

Vol 20 (6) ◽

pp. 1-19

Author(s):

Zolzaya Byambadorj ◽

Ryota Nishimura ◽

Altangerel Ayush ◽

Norihide Kitaoka

Keyword(s):

Social Media ◽

Performance Enhancement ◽

Statistical Machine Translation ◽

Training Data ◽

Daily Lives ◽

Latin Script ◽

N Gram ◽

Noisy Text ◽

Different Cultures ◽

Text Normalization

The huge increase in social media use in recent years has resulted in new forms of social interaction, changing our daily lives. Due to increasing contact between people from different cultures as a result of globalization, there has also been an increase in the use of the Latin alphabet, and as a result a large amount of transliterated text is being used on social media. In this study, we propose a variety of character level sequence-to-sequence (seq2seq) models for normalizing noisy, transliterated text written in Latin script into Mongolian Cyrillic script, for scenarios in which there is a limited amount of training data available. We applied performance enhancement methods, which included various beam search strategies, N-gram-based context adoption, edit distance-based correction and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed seq2seq models improved the robustness of the basic seq2seq models for normalizing out-of-vocabulary (OOV) words, and most of our models achieved higher normalization performance than the conventional method. When using test data during our text normalization experiment, our proposed method which included checking each hypothesis during the inference period achieved the lowest word error rate (WER = 13.41%), which was 4.51% fewer errors than when using the conventional SMT method.

Download Full-text

Enhancement of Text Analysis Using Context-Aware Normalization of Social Media Informal Text

Applied Sciences ◽

10.3390/app11178172 ◽

2021 ◽

Vol 11 (17) ◽

pp. 8172

Author(s):

Jebran Khan ◽

Sungchang Lee

Keyword(s):

Social Media ◽

Text Analysis ◽

State Of The Art ◽

Context Aware ◽

Intended Meaning ◽

Text Data ◽

Recall Accuracy ◽

Wide Range ◽

Textual Data ◽

Noisy Text

We proposed an application and data variations-independent, generic social media Textual Variations Handler (TVH) to deal with a wide range of noise in textual data generated in various social media (SM) applications for enhanced text analysis. The aim is to build an effective hybrid normalization technique that ensures the use of useful information of the noisy text in its intended form instead of filtering them out to analyze SM text better. The proposed TVH performs context-aware text normalization based on intended meaning to avoid the wrong word substitution. We integrate the TVH with state-of-the-art (SOTA) deep-learning-based text analysis methods to enhance their performance for noisy SM text data. The proposed scheme shows promising improvement in the text analysis of informal SM text in terms of precision, recall, accuracy, and F1-score in simulation.

Download Full-text

From General Language Understanding to Noisy Text Comprehension

Applied Sciences ◽

10.3390/app11177814 ◽

2021 ◽

Vol 11 (17) ◽

pp. 7814

Author(s):

Buddhika Kasthuriarachchy ◽

Madhu Chetty ◽

Adrian Shatte ◽

Darren Walls

Keyword(s):

Text Comprehension ◽

State Of The Art ◽

Language Model ◽

General Purpose ◽

Language Models ◽

Language Understanding ◽

English Usage ◽

Latent Representations ◽

Noisy Text ◽

Better Than

Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. For this, we propose a new generic methodology to derive a diverse set of sentence vectors combining and extracting various linguistic characteristics from latent representations of multi-layer, pre-trained language models. Further, we clearly establish how BERT, a state-of-the-art pre-trained language model, comprehends the linguistic attributes of Tweets to identify appropriate sentence representations. Five new probing tasks are developed for Tweets, which can serve as benchmark probing tasks to study noisy text comprehension. Experiments are carried out for classification accuracy by deriving the sentence vectors from GloVe-based pre-trained models and Sentence-BERT, and by using different hidden layers from the BERT model. We show that the initial and middle layers of BERT have better capability for capturing the key linguistic characteristics of noisy texts than its latter layers. With complex predictive models, we further show that the sentence vector length has lesser importance to capture linguistic information, and the proposed sentence vectors for noisy texts perform better than the existing state-of-the-art sentence vectors.

Download Full-text

Bootstrapping Dependency Treebank of Urdu Noisy Text

International Journal of Emerging Trends in Engineering Research ◽

10.30534/ijeter/2021/12982021 ◽

2021 ◽

Vol 9 (8) ◽

pp. 1102-1106

Keyword(s):

Data Driven ◽

Baseline Score ◽

Corpus Annotation ◽

Training Set ◽

Test Set ◽

Dependency Parser ◽

Noisy Text ◽

Manual Correction

This paper describes how bootstrapping was used to extend the development of the Urdu Noisy Text dependency treebank. To overcome the bottleneck of manually annotating corpus for a new domain of user-generated text, MaltParser, an opensource, data-driven dependency parser, is used to bootstrap the treebank in semi-automatic manner for corpus annotation after being trained on 500 tweet Urdu Noisy Text Dependency Treebank. Total four bootstrapping iterations were performed. At the end of each iteration, 300 Urdu tweets were automatically tagged, and the performance of parser model was evaluated against the development set. 75 automatically tagged tweets were randomly selected out of pre-tagged 300 tweets for manual correction, which were then added in the training set for parser retraining. Finally, at the end of last iteration, parser performance was evaluated against test set. The final supervised bootstrapping model obtains a LA of 72.1%, UAS of 75.7% and LAS of 64.9%, which is a significant improvement over baseline score of 69.8% LA, 74% UAS, and 62.9% LAS

Download Full-text

Universal Dependencies for Urdu Noisy Text

International Journal of Advanced Trends in Computer Science and Engineering ◽

10.30534/ijatcse/2021/371032021 ◽

2021 ◽

Vol 10 (3) ◽

pp. 1751-1757

Keyword(s):

Social Media ◽

Cross Validation ◽

Weighted Kappa ◽

Annotation Scheme ◽

Text Annotation ◽

Annotation Process ◽

Social Media Text ◽

Noisy Text ◽

Total Agreement ◽

Fold Cross Validation

In this paper, the process of creating a Dependency Treebank for tweetsin Urdu,a morphologically rich and less-resourced languageis described. The 500 Urdu tweets treebank iscreated by manually annotating the treebank withlemma, POS tags, morphological and syntacticrelations using the Universal Dependencies annotation scheme, adopted to the peculiarities of Urdu social media text. annotation process is evaluated through Inter-annotator agreement for dependency relations and total agreement of 94.5% and resultant weighted Kappa = 0.876was observed. The treebank is evaluated through 10-fold cross validation using Maltparserwith various feature settings. Results show average UAS score of 74%, LAS score of 62.9% and LA score of 69.8%.

Download Full-text

An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection

Journal of Data and Information Quality ◽

10.1145/3418036 ◽

2021 ◽

Vol 13 (3) ◽

pp. 1-25

Author(s):

Anurag Roy ◽

Shalmoli Ghosh ◽

Kripabandhu Ghosh ◽

Saptarshi Ghosh

Keyword(s):

Large Fraction ◽

Training Data ◽

Writing Style ◽

Robust Algorithms ◽

Normalization Methods ◽

Textual Data ◽

Digitized Documents ◽

Noisy Text ◽

Text Normalization

A large fraction of textual data available today contains various types of “noise,” such as OCR noise in digitized documents, noise due to informal writing style of users on microblogging sites, and so on. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many of the existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that is difficult to obtain. We propose an unsupervised algorithm for text normalization that does not need any training data/human intervention. The proposed algorithm is applicable to text over different languages and can handle both machine-generated and human-generated noise. Experiments over several standard datasets show that text normalization through the proposed algorithm enables better retrieval and stance detection, as compared to that using several baseline text normalization methods.

Download Full-text

Product Review Translation: Parallel Corpus Creation and Robustness towards User-generated Noisy Text

10.18653/v1/2021.ecnlp-1.21 ◽

2021 ◽

Author(s):

Kamal Kumar Gupta ◽

Soumya Chennabasavaraj ◽

Nikesh Garera ◽

Asif Ekbal

Keyword(s):

Product Review ◽

Parallel Corpus ◽

Corpus Creation ◽

Noisy Text

Download Full-text

Developing a POS Tagged Corpus of Urdu Tweets

Computers ◽

10.3390/computers9040090 ◽

2020 ◽

Vol 9 (4) ◽

pp. 90

Author(s):

Amber Baig ◽

Mutee U Rahman ◽

Hameedullah Kazi ◽

Ahsanullah Baloch

Keyword(s):

Social Media ◽

Language Processing ◽

Text Processing ◽

Media Content ◽

Pos Tagging ◽

Part Of Speech ◽

Social Media Text ◽

Traditional Natural ◽

Pos Tagger ◽

Noisy Text

Processing of social media text like tweets is challenging for traditional Natural Language Processing (NLP) tools developed for well-edited text due to the noisy nature of such text. However, demand for tools and resources to correctly process such noisy text has increased in recent years due to the usefulness of such text in various applications. Literature reports various efforts made to develop tools and resources to process such noisy text for various languages, notably, part-of-speech (POS) tagging, an NLP task having a direct effect on the performance of other successive text processing activities. Still, no such attempt has been made to develop a POS tagger for Urdu social media content. Thus, the focus of this paper is on POS tagging of Urdu tweets. We introduce a new tagset for POS-tagging of Urdu tweets along with the POS-tagged Urdu tweets corpus. We also investigated bootstrapping as a potential solution for overcoming the shortage of manually annotated data and present a supervised POS tagger with an accuracy of 93.8% precision, 92.9% recall and 93.3% F-measure.

Download Full-text

Reconstructing Arguments from Noisy Text

Datenbank-Spektrum ◽

10.1007/s13222-020-00342-y ◽

2020 ◽

Vol 20 (2) ◽

pp. 123-129

Author(s):

Natalie Dykes ◽

Stefan Evert ◽

Merlin Göttlinger ◽

Philipp Heinrich ◽

Lutz Schröder

Keyword(s):

Noisy Text

Download Full-text

noisy text
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

Mixed Script Identification Using Automated DNN Hyperparameter Optimization

Normalization of Transliterated Mongolian Words Using Seq2Seq Model with Limited Data

Enhancement of Text Analysis Using Context-Aware Normalization of Social Media Informal Text

From General Language Understanding to Noisy Text Comprehension

Bootstrapping Dependency Treebank of Urdu Noisy Text

Universal Dependencies for Urdu Noisy Text

An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection

Product Review Translation: Parallel Corpus Creation and Robustness towards User-generated Noisy Text

Developing a POS Tagged Corpus of Urdu Tweets

Reconstructing Arguments from Noisy Text

Export Citation Format

noisy textRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

Mixed Script Identification Using Automated DNN Hyperparameter Optimization

Normalization of Transliterated Mongolian Words Using Seq2Seq Model with Limited Data

Enhancement of Text Analysis Using Context-Aware Normalization of Social Media Informal Text

From General Language Understanding to Noisy Text Comprehension

Bootstrapping Dependency Treebank of Urdu Noisy Text

Universal Dependencies for Urdu Noisy Text

An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection

Product Review Translation: Parallel Corpus Creation and Robustness towards User-generated Noisy Text

Developing a POS Tagged Corpus of Urdu Tweets

Reconstructing Arguments from Noisy Text

noisy text
Recently Published Documents