2020 ◽  
Vol 0 (0) ◽  
Author(s):  
Fridah Katushemererwe ◽  
Andrew Caines ◽  
Paula Buttery

This paper describes an endeavour to build natural language processing (NLP) tools for Runyakitara, a group of four closely related Bantu languages spoken in western Uganda. In contrast with major world languages such as English, for which corpora are comparatively abundant and NLP tools are well developed, computational linguistic resources for Runyakitara are in short supply. We therefore first need to collect corpora for these languages before we can proceed to the design of a spell-checker, grammar-checker and applications for computer-assisted language learning (CALL). We explain how we are collecting primary data for a new Runya Corpus of speech and writing, outline the design of a morphological analyser, and discuss how we can use these new resources to build NLP tools. We are initially working with Runyankore–Rukiga, a closely related pair of Runyakitara languages, and we frame our project in the context of NLP for low-resource languages, as well as CALL for the preservation of endangered languages. We put our project forward as a test case for the revitalization of endangered languages through education and technology.


2018 ◽  
Vol 25 (4) ◽  
pp. 435-458
Author(s):  
Nadezhda S. Lagutina ◽  
Ksenia V. Lagutina ◽  
Aleksey S. Adrianov ◽  
Ilya V. Paramonov

The paper reviews existing Russian-language thesauri in digital form and methods for their automatic construction and application. The authors analyzed the main characteristics of open-access thesauri for scientific research and evaluated trends in their development and their effectiveness in solving natural language processing tasks. They studied statistical and linguistic methods of thesaurus construction that make it possible to automate development and reduce the labor costs of expert linguists. In particular, the authors considered algorithms for extracting keywords and semantic thesaurus relationships of all types, as well as the quality of thesauri generated with these tools. To illustrate the features of various methods for constructing thesaurus relationships, the authors developed a combined method that generates a specialized thesaurus fully automatically, taking into account a text corpus in a particular domain and several existing linguistic resources. With the proposed method, experiments were conducted on two Russian-language text corpora from two subject areas: articles about migrants and tweets. The resulting thesauri were assessed using an integrated assessment developed in the authors’ previous study, which makes it possible to analyze various aspects of a thesaurus and the quality of the generation methods. The analysis revealed the main advantages and disadvantages of various approaches to thesaurus construction and to the extraction of semantic relationships of different types, and made it possible to determine directions for future study.


2020 ◽  
Author(s):  
Mario Crespo Miguel

Computational linguistics is the scientific study of language from a computational perspective. Its aim is to provide computational models of natural language processing (NLP) and incorporate them into practical applications such as speech synthesis, speech recognition, automatic translation and many others where automatic processing of language is required. The use of good linguistic resources is crucial for the development of computational linguistics systems. Real-world applications need resources which systematize the way linguistic information is structured in a certain language. There is a continuous effort to increase the number of linguistic resources available to the linguistic and NLP community. Most of the existing linguistic resources have been created for English, mainly because most modern approaches to computational lexical semantics emerged in the United States. This situation is changing over time and some of these projects have subsequently been extended to other languages; however, in all cases, much time and effort need to be invested in creating such resources. Because of this, one of the main purposes of this work is to investigate the possibility of extending these resources to other languages such as Spanish. In this work, we introduce some of the most important resources devoted to lexical semantics, such as WordNet or FrameNet, and those focusing on Spanish such as 3LB-LEX or Adesse. Of these, this project focuses on FrameNet. FrameNet aims to document the range of semantic and syntactic combinatory possibilities of words in English. Words are grouped according to the different frames or situations evoked by their meaning. If we focus on a particular topic domain like medicine and try to describe it in terms of FrameNet, we would probably obtain frames such as CURE, formed by words like cure.v, heal.v or palliative.a, or MEDICAL CONDITIONS, with lexical units such as arthritis.n, asphyxia.n or asthma.n.
The purpose of this work is to develop an automatic means of selecting frames from a particular domain and to translate them into Spanish. As we have stated, we will focus on medicine. The selection of the medical frames will be corpus-based, that is, we will extract all the frames that are statistically significant from a representative corpus. We will discuss why using a corpus-based approach is a reliable and unbiased way of dealing with this task. We will present an automatic method for the selection of FrameNet frames and, in order to make sure that the results obtained are coherent, we will contrast them with a previous manual selection or benchmark. Outcomes will be analysed by using the F-score, a measure widely used in this type of application. We obtained a 0.87 F-score according to our benchmark, which demonstrates the applicability of this type of automatic approach. The second part of the book is devoted to the translation of this selection into Spanish. The translation will be made using EuroWordNet, an extension of the Princeton WordNet for some European languages. We will explore different ways to link the different units of our medical FrameNet selection to a certain WordNet synset or set of words that have similar meanings. Matching the frame units to a specific synset in EuroWordNet allows us both to translate them into Spanish and to add new terms provided by WordNet into FrameNet. The results show that translation can be done quite accurately (95.6%). We hope this work can add new insight into the field of natural language processing.
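To make the evaluation step concrete, the following is a minimal sketch of how an F-score can be computed when an automatic frame selection is compared against a manual benchmark. It is an illustration only, not the author's implementation: the function name `f_score`, the two example sets, and the placeholder frame names `FRAME_A` and `FRAME_B` are all assumptions. Only CURE and MEDICAL CONDITIONS come from the abstract.

```python
def f_score(selected, benchmark, beta=1.0):
    """Return the F-beta score of a selected set against a benchmark set.

    Precision = |selected ∩ benchmark| / |selected|
    Recall    = |selected ∩ benchmark| / |benchmark|
    """
    selected, benchmark = set(selected), set(benchmark)
    true_positives = len(selected & benchmark)
    if true_positives == 0:
        # No overlap (or an empty set): define the score as 0.
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(benchmark)
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical data: FRAME_A and FRAME_B are placeholders, not real results.
automatic_selection = {"CURE", "MEDICAL CONDITIONS", "FRAME_A"}
manual_benchmark = {"CURE", "MEDICAL CONDITIONS", "FRAME_B"}

score = f_score(automatic_selection, manual_benchmark)
print(f"F1 = {score:.3f}")
```

With `beta=1.0` the function returns the balanced F1 score, the harmonic mean of precision and recall. The abstract does not state which variant of the F-score was used, so the default here is an assumption.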


2007 ◽  
Vol 10 (1) ◽  
pp. 3-23 ◽  
Author(s):  
Eleni Efthimiou ◽  
Stavroula-Evita Fotinea ◽  
Galini Sapountzaki

The work reported in this study is based on research carried out while developing a sign synthesis system for Greek Sign Language (GSL): theoretical linguistic analysis as well as lexicon and grammar resources derived from this analysis. We focus on the organisation of linguistic knowledge that initiates the multi-functional processing required to achieve sign generation performed by a virtual signer. In this context, structure rules and lexical coding support sign synthesis of GSL utterances, by exploitation of avatar technologies for the representation of the linguistic message. Sign generation involves two subsystems: a Greek-to-GSL conversion subsystem and a sign performance subsystem. The conversion subsystem matches input strings of written Greek to GSL structure patterns, exploiting Natural Language Processing (NLP) mechanisms. The sign performance subsystem uses parsed output of GSL structure patterns, enriched with sign-specific information, to activate a virtual signer for the performance of properly coded linguistic messages. Both the conversion and the synthesis procedures are based on adequately constructed electronic linguistic resources. Applicability of sign synthesis is demonstrated with the example of a Web-based prototype environment for GSL grammar teaching.


2021 ◽  
Vol 16 (1) ◽  
pp. 49
Author(s):  
Mario Crespo Miguel

In the field of Natural Language Processing, linguistic resources are structured and detailed descriptions of a certain language. They are considered key elements for studying languages and developing applications. However, these repositories are slow and difficult to build, and most of them focus on English. This work tries to address the lack of linguistic resources in Spanish by transferring part of the information encoded in the FrameNet project into Spanish. For this purpose, we developed an automatic procedure able to align the different frame predicates with the WordNet synsets that best represent them. Our system reaches 88% precision and makes it possible to reuse this semantic resource for linguistic studies in Spanish.


2020 ◽  
pp. 3-17
Author(s):  
Peter Nabende

Natural Language Processing for under-resourced languages is now a mainstream research area. However, there are limited studies on Natural Language Processing applications for many indigenous East African languages. As a contribution to closing the current knowledge gap, this paper focuses on evaluating the application of well-established machine translation methods to one heavily under-resourced indigenous East African language called Lumasaaba. Specifically, we review the most common machine translation methods in the context of Lumasaaba, including both rule-based and data-driven methods. We then apply a state-of-the-art data-driven machine translation method to learn models for automating translation between Lumasaaba and English using a very limited data set of parallel sentences. Automatic evaluation results show that a transformer-based Neural Machine Translation model architecture leads to consistently better BLEU scores than the recurrent neural network-based models. Moreover, the automatically generated translations can be comprehended to a reasonable extent and are usually associated with the source language input.


Diabetes ◽  
2019 ◽  
Vol 68 (Supplement 1) ◽  
pp. 1243-P
Author(s):  
JIANMIN WU ◽  
FRITHA J. MORRISON ◽  
ZHENXIANG ZHAO ◽  
XUANYAO HE ◽  
MARIA SHUBINA ◽  
...  
