text normalization
Recently Published Documents


TOTAL DOCUMENTS

131
(FIVE YEARS 51)

H-INDEX

9
(FIVE YEARS 3)

Author(s):  
Zolzaya Byambadorj ◽  
Ryota Nishimura ◽  
Altangerel Ayush ◽  
Norihide Kitaoka

The huge increase in social media use in recent years has created new forms of social interaction and changed our daily lives. Globalization has increased contact between people from different cultures, which in turn has increased the use of the Latin alphabet, and as a result a large amount of transliterated text now appears on social media. In this study, we propose a variety of character-level sequence-to-sequence (seq2seq) models for normalizing noisy, transliterated text written in Latin script into Mongolian Cyrillic script, for scenarios in which only a limited amount of training data is available. We applied performance enhancement methods, including various beam search strategies, N-gram-based context adoption, edit distance-based correction and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed seq2seq models were more robust than the basic seq2seq models when normalizing out-of-vocabulary (OOV) words, and most of our models achieved higher normalization performance than the conventional method. On the test data, the proposed method that checks each hypothesis during inference achieved the lowest word error rate (WER = 13.41%), which was 4.51% fewer errors than the conventional SMT method.
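As a rough illustration of two ideas named in this abstract, edit distance-based correction with dictionary checking and the word error rate (WER) metric, here is a minimal Python sketch. It is not the authors' implementation: the function names, the `max_dist` threshold and the tiny vocabulary are all assumptions made for the example.

```python
from typing import Iterable, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # delete x
                            curr[j - 1] + 1,       # insert y
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def correct_with_dictionary(word: str, vocabulary: Iterable[str],
                            max_dist: int = 1) -> str:
    """Keep in-vocabulary words; otherwise pick the closest entry within max_dist.

    max_dist is a hypothetical threshold chosen for this sketch.
    """
    vocab = list(vocabulary)
    if word in vocab:
        return word
    best = min(vocab, key=lambda v: edit_distance(word, v), default=word)
    return best if edit_distance(word, best) <= max_dist else word


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical data: a three-word vocabulary and one noisy model output.
vocabulary = ["сайн", "байна", "уу"]
reference = "сайн байна уу"
hypothesis = "сайн баина уу"

corrected = " ".join(correct_with_dictionary(w, vocabulary)
                     for w in hypothesis.split())
wer_before = wer(reference, hypothesis)
wer_after = wer(reference, corrected)
```

In this toy case one of the three words is wrong before correction, and the dictionary check repairs it. A real system would need a far larger vocabulary and a tie-breaking rule for candidates at equal distance, which this sketch resolves only by list order.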


2021 ◽  
Author(s):  
Divya ◽  
Sunil Kumar ◽  
Debanjan Sadhya ◽  
Santosh Singh Rathore

2021 ◽  
Author(s):  
Nana Mulyana Maghfur ◽  
Muhammad Okky Ibrohim ◽  
Junaedi Fahmi ◽  
Achmad Satria Putera ◽  
Oskar Riandi

2021 ◽  
Author(s):  
Benjamin M. Gyori ◽  
Charles Tapley Hoyt ◽  
Albert Steppi

Summary: Gilda is a software tool and web service that implements a scored string matching algorithm for names and synonyms across entries in biomedical ontologies covering genes, proteins (and their families and complexes), small molecules, biological processes and diseases. Gilda integrates machine-learned disambiguation models to choose between ambiguous strings given relevant surrounding text as context, and supports species prioritization in case of ambiguity. Availability: The Gilda web service is available at http://grounding.indra.bio; source code, documentation and tutorials are available via https://github.com/indralab/[email protected]
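To make the idea of scored string matching against names and synonyms concrete, here is a toy Python sketch. It does not use Gilda's code or API and does not reproduce its scoring. The lexicon, the `EX:` identifiers, the `Match` class, the `ground` function and the score values are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Match:
    entry_id: str   # hypothetical identifier, not a real ontology ID
    name: str       # standard name of the matched entry
    score: float    # illustrative score only


# Hypothetical mini-lexicon of (synonym, entry_id, standard_name) rows.
LEXICON: List[Tuple[str, str, str]] = [
    ("ERK2", "EX:1", "MAPK1"),
    ("MAPK1", "EX:1", "MAPK1"),
    ("ER", "EX:2", "ESR1"),
    ("ER", "EX:3", "endoplasmic reticulum"),
]


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def ground(text: str, lexicon=LEXICON) -> List[Match]:
    """Return candidate entries, best first, using assumed score tiers."""
    matches = []
    for synonym, entry_id, name in lexicon:
        if synonym == text:
            score = 1.0     # exact match
        elif synonym.lower() == text.lower():
            score = 0.9     # differs only in letter case
        elif normalize(synonym) == normalize(text):
            score = 0.8     # differs in punctuation or spacing
        else:
            continue
        matches.append(Match(entry_id, name, score))
    return sorted(matches, key=lambda m: m.score, reverse=True)


loose = ground("erk-2")      # matches ERK2 only after normalization
ambiguous = ground("ER")     # two entries tie on an exact match
```

The `ambiguous` result shows why string matching alone is not enough: both candidates score equally, which is the situation the abstract says Gilda resolves using surrounding text as context.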


2021 ◽  
Author(s):  
Yang Zhang ◽  
Evelina Bakhturina ◽  
Kyle Gorman ◽  
Boris Ginsburg

Author(s):  
Monica Sunkara ◽  
Chaitanya Shivade ◽  
Sravan Bodapati ◽  
Katrin Kirchhoff

2021 ◽  
Vol 13 (3) ◽  
pp. 1-25
Author(s):  
Anurag Roy ◽  
Shalmoli Ghosh ◽  
Kripabandhu Ghosh ◽  
Saptarshi Ghosh

A large fraction of textual data available today contains various types of “noise,” such as OCR noise in digitized documents, noise due to the informal writing style of users on microblogging sites, and so on. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many of the existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that are difficult to obtain. We propose an unsupervised algorithm for text normalization that does not need any training data or human intervention. The proposed algorithm is applicable to text in different languages and can handle both machine-generated and human-generated noise. Experiments over several standard datasets show that text normalization with the proposed algorithm enables better retrieval and stance detection than normalization with several baseline methods.


2021 ◽  
Vol 35 (3) ◽  
pp. 193-205
Author(s):  
Oanh Thi Tran ◽  
Viet The Bui