A survey of diacritic restoration in abjad and alphabet writing systems

2017 ◽  
Vol 24 (1) ◽  
pp. 123-154 ◽  
Author(s):  
FRANKLIN ỌLÁDIÍPỌ̀ ASAHIAH ◽  
ỌDẸ́TÚNJÍ ÀJÀDÍ ỌDẸ́JỌBÍ ◽  
EMMANUEL RÓTÌMÍ ADÁGÚNODÒ

Abstract: A diacritic is a mark placed near or through a character to alter its original phonetic or orthographic value. Many languages around the world use diacritics in their orthography, whatever writing system that orthography is based on. In many of these languages, diacritics are omitted either by convention or as a matter of convenience. For readers unfamiliar with the text domain, the absence of diacritics is known to cause mild to serious readability and comprehension problems; for natural language processing systems, it causes near-intractable problems. This situation has led to extensive research on diacritization. Several techniques have been applied to diacritic restoration (or diacritization), but existing surveys of these techniques have been restricted to a few languages and hence leave gaps for practitioners to fill. Our survey examines diacritization from the angle of the resources deployed and the various formulations employed. We conclude by recommending that (a) any proposed diacritization technique should consider the features of the language and the purpose served by its diacritics, and (b) evaluation metrics need to be more rigorously defined so that the performance of models can be compared easily.
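As a minimal illustration of the task (not drawn from the survey itself), diacritization is commonly set up by stripping diacritics from well-edited text to create input–target training pairs; Python's unicodedata module makes the stripping step straightforward:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks, yielding the undiacritized form of the text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Training pairs for a restoration model are typically built by stripping
# diacritics from well-edited text: the stripped form is the model input
# and the original is the target.
print(strip_diacritics("Ọdẹ́jọbí"))  # "Odejobi"
print(strip_diacritics("résumé"))   # "resume"
```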

Author(s):  
Longtu Zhang ◽  
Mamoru Komachi

Logographic and alphabetic languages (e.g., Chinese vs. English) use linguistically different writing systems. Languages that belong to the same writing system usually share more information, which can be exploited in natural language processing tasks such as neural machine translation (NMT). This article takes advantage of the logographic characters in Chinese and Japanese by decomposing them into smaller units, thereby making better use of the information these characters share in both the encoding and decoding stages of NMT training. Experiments show that the proposed method robustly improves NMT performance for both the “logographic” language pair (JA–ZH) and the “logographic + alphabetic” language pairs (JA–EN and ZH–EN), in both supervised and unsupervised NMT scenarios. Moreover, because the decomposed sequences are usually very long, extra position features for the Transformer encoder help with modeling these long sequences. The results also indicate that, in principle, linguistic features can be manipulated to obtain higher shared-token rates and further improve the performance of natural language processing systems.
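A minimal sketch of the decomposition idea, with a toy component table standing in for a full Ideographic Description Sequence (IDS) database; the table entries and the shared-token computation below are illustrative assumptions, not the paper's actual resources:

```python
# Toy decomposition table; real systems use IDS databases covering tens of
# thousands of characters.
DECOMP = {
    "好": ["女", "子"],
    "明": ["日", "月"],
    "林": ["木", "木"],
}

def decompose(sentence: str) -> list[str]:
    """Replace each character with its component sequence when one is known."""
    units = []
    for ch in sentence:
        units.extend(DECOMP.get(ch, [ch]))
    return units

zh = decompose("好明")  # ['女', '子', '日', '月']
ja = decompose("明林")  # ['日', '月', '木', '木']
shared = set(zh) & set(ja)
print(len(shared) / len(set(zh) | set(ja)))  # shared-token rate: 0.4
```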


Author(s):  
TIAN-SHUN YAO

Based on a word-based theory of natural language processing, a word-based Chinese language understanding system has been developed. Drawing on psycholinguistic analysis and the features of the Chinese language, the theory is presented along with a description of the computer programs built on it. The heart of the system is the definition of a Total Information Dictionary and the World Knowledge Source used by the system. The purpose of this research is to develop a system that can understand not only individual Chinese sentences but also whole texts.
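A hedged sketch of the dictionary-lookup core of a word-based approach, using greedy forward maximum matching; the paper's Total Information Dictionary stores much richer per-word information (syntactic, semantic, and world knowledge) than this toy word list:

```python
# Hypothetical mini-dictionary for illustration only.
DICTIONARY = {"自然", "语言", "处理", "自然语言", "自然语言处理"}
MAX_LEN = max(len(w) for w in DICTIONARY)

def forward_max_match(text: str) -> list[str]:
    """Greedy longest-match segmentation, a classic word-based first step."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in DICTIONARY or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

print(forward_max_match("自然语言处理"))  # ['自然语言处理']
```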


2020 ◽  
Author(s):  
David DeFranza ◽  
Himanshu Mishra ◽  
Arul Mishra

Language provides an ever-present context for our cognitions and has the ability to shape them. Languages across the world can be gendered (languages in which the form of a noun, verb, or pronoun marks it as female or male) or genderless. In an ongoing debate, one stream of research suggests that gendered languages are more likely to display gender prejudice than genderless languages; another stream suggests that language does not have the ability to shape gender prejudice. In this research, we contribute to the debate using a Natural Language Processing (NLP) method that captures the meaning of a word from the context in which it occurs. Using text data from Wikipedia and the Common Crawl project (which contains text from billions of publicly facing websites) across 45 world languages, covering the majority of the world’s population, we test for gender prejudice in gendered and genderless languages. We find that gender prejudice occurs more in gendered than in genderless languages. Moreover, we examine whether the genderedness of a language influences the stereotypic dimensions of warmth and competence, using the same NLP method.
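The association test below is a minimal sketch of how such an NLP method can quantify gender associations: toy random vectors stand in for embeddings trained on Wikipedia and Common Crawl, and the scoring function is a simplified, assumed variant of standard embedding-association measures, not the authors' exact pipeline:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(word_vec, female_vecs, male_vecs):
    """Mean similarity to female terms minus mean similarity to male terms."""
    return (np.mean([cosine(word_vec, f) for f in female_vecs])
            - np.mean([cosine(word_vec, m) for m in male_vecs]))

# Toy 3-d vectors stand in for real trained embeddings.
rng = np.random.default_rng(0)
female = [rng.normal(size=3) for _ in range(2)]
male = [rng.normal(size=3) for _ in range(2)]
career = rng.normal(size=3)
print(association(career, female, male))  # > 0 leans female, < 0 leans male
```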


Author(s):  
Laura Buszard-Welcher

This chapter presents three technologies essential to enabling any language in the digital domain: language identifiers (ISO 639-3), Unicode (including fonts and keyboards), and the building of corpora to enable natural language processing. Just a few major languages of the world are well-enabled for use with electronically mediated communication. Another few hundred languages are arguably on their way to being well-enabled, if for market reasons alone. For all the remaining languages of the world, inclusion in the digital domain remains a distant possibility, and one that likely requires sustained interest, attention, and resources on the part of the language community itself. The good news is that the same technologies that enable the more widespread languages can also enable the less widespread, and even endangered ones, and bootstrapping is possible for all of them. The examples and resources described in this chapter can serve as inspiration and guidance in getting started.
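One concrete reason Unicode support matters for diacritic-heavy orthographies: the same vowel can arrive as different codepoint sequences depending on keyboard and font, and only normalization makes them compare equal. A minimal sketch (the codepoint choices are illustrative):

```python
import unicodedata

# Two encodings of the same Yorùbá vowel ọ́:
a = "\u1ecd\u0301"   # precomposed ọ (U+1ECD) + combining acute accent
b = "o\u0301\u0323"  # o + acute + dot below, as some keyboards emit it

print(a == b)  # False: different codepoint sequences for the same letter
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))  # True after normalization
```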


News is part of everyone's daily routine: it enhances our knowledge of what happens around the world. Fake news is fictional information made up with the intention to delude, so the knowledge acquired from it is useless. Because fake news spreads extensively and has a negative impact on society, fake news detection has become an emerging research area. This paper presents a solution to fake news detection based on deep learning and Natural Language Processing. A deep neural network is trained on the dataset, which must be well formatted before it is given to the network; this preprocessing uses Natural Language Processing techniques, and the trained model then predicts whether a news item is fake or not.
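A minimal sketch of this pipeline, with toy data and a small scikit-learn network standing in for the paper's (unspecified) dataset and deeper architecture:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Toy examples; the paper's dataset and exact architecture are not given here.
texts = [
    "scientists publish peer reviewed climate study",
    "celebrity secretly replaced by clone says source",
    "central bank announces interest rate decision",
    "miracle fruit cures all diseases overnight",
]
labels = [0, 1, 0, 1]  # 0 = real, 1 = fake

# NLP preprocessing step: lowercase, tokenize, and weight terms by TF-IDF.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(texts)

# A small feed-forward network stands in for the deeper model in the paper.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["aliens endorse new diet pill"])))
```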


2021 ◽  
Author(s):  
AISDL

The meteoric rise of social media news during the ongoing COVID-19 pandemic is worthy of advanced research. Freedom of speech in many parts of the world, especially in developed countries, together with the liberty to socialize, drives noteworthy information sharing during the panic of a pandemic. Social media has served as a communication intervention in past crises, but the volume of tweets generated on Twitter during COVID-19 is incomparable with former records. This study examines social media news trends and compares tweets on COVID-19 as a corpus collected from Twitter. By applying Natural Language Processing (NLP) methods to the tweets, we extract and quantify the similarities between tweets over time, finding that some people say the same thing about the pandemic while other Twitter users view it differently. The tools used are spaCy, NetworkX, WordCloud, and Python's re module. This study contributes to the social media literature by characterizing the similarity and divergence between COVID-19 tweets from the public and from health agencies such as the World Health Organization (WHO). It also sheds light on the sparse and dense COVID-19 text networks and their implications for policymakers, and it discusses the study's limitations and proposes future work.
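A minimal sketch of the similarity-network step using two of the named tools, spaCy and NetworkX; the model name and similarity threshold below are assumptions for illustration, not the study's settings:

```python
import itertools
import spacy
import networkx as nx

# "en_core_web_md" (with word vectors) is an assumed choice; install via:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

tweets = [
    "Wash your hands and stay home to stop the spread.",
    "Staying home and hand washing slow the spread of the virus.",
    "The stock market fell sharply again today.",
]
docs = [nlp(t) for t in tweets]

# Build a similarity network: nodes are tweets, edges link similar pairs.
G = nx.Graph()
G.add_nodes_from(range(len(tweets)))
for i, j in itertools.combinations(range(len(docs)), 2):
    sim = docs[i].similarity(docs[j])
    if sim > 0.8:  # threshold chosen for illustration only
        G.add_edge(i, j, weight=sim)

print(G.edges(data=True))  # dense clusters = people saying the same thing
```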


2016 ◽  
Vol 22 (3) ◽  
pp. 491-495 ◽  
Author(s):  
ROBERT DALE

Abstract: Ten years ago, Microsoft Word's grammar checker was really the only game in town. The software world, and the world of natural language processing, have changed a lot in that time, so what does the grammar checker marketplace have to offer today?


2018 ◽  
Vol 3 (1) ◽  
pp. 492 ◽
Author(s):  
Denis Cedeño Moreno ◽  
Miguel Vargas Lombardo

At present, the convergence of several areas of knowledge has led to the design and implementation of ICT systems that integrate heterogeneous tools, such as artificial intelligence (AI), statistics, and databases (DB), among others. In computing, ontologies belong to the field of AI and refer to formal representations of an area of knowledge or domain. Ontological engineering is the discipline concerned with studying and building tools that accelerate the creation of ontologies from natural language. In this paper, we propose a knowledge management model based on patients' clinical histories in Panama, built on information extraction (IE), natural language processing (NLP), and the development of a domain ontology.

Keywords: knowledge, information extraction, ontology, automatic population of ontologies, natural language processing.
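A minimal sketch of the ontology-population step using rdflib (a common choice, though not named in the paper); the namespace, class, and property names are hypothetical:

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Hypothetical namespace and vocabulary for illustration; the paper's actual
# domain ontology for Panamanian clinical histories is not reproduced here.
EX = Namespace("http://example.org/clinical#")
g = Graph()
g.bind("ex", EX)

# Ontology schema: a Patient class with a diagnosis property.
g.add((EX.Patient, RDF.type, RDFS.Class))
g.add((EX.hasDiagnosis, RDF.type, RDF.Property))

# "Population" step: an entity extracted from free-text clinical notes
# becomes an individual in the ontology.
g.add((EX.patient_001, RDF.type, EX.Patient))
g.add((EX.patient_001, EX.hasDiagnosis, Literal("type 2 diabetes")))

print(g.serialize(format="turtle"))
```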

