Automatic Corpus-Based Translation of a Spanish FrameNet Medical Glossary

2020
Author(s):  
Mario Crespo Miguel

Computational linguistics is the scientific study of language from a computational perspective. Its aim is to provide computational models of natural language processing (NLP) and to incorporate them into practical applications such as speech synthesis, speech recognition, automatic translation and many others where automatic processing of language is required. The use of good linguistic resources is crucial for the development of computational linguistics systems: real-world applications need resources that systematize the way linguistic information is structured in a given language. There is a continuous effort to increase the number of linguistic resources available to the linguistics and NLP community. Most existing linguistic resources have been created for English, mainly because most modern approaches to computational lexical semantics emerged in the United States. This situation is changing over time, and some of these projects have since been extended to other languages; in all cases, however, creating such resources demands considerable time and effort. One of the main purposes of this work is therefore to investigate the possibility of extending these resources to other languages such as Spanish.

In this work, we introduce some of the most important resources devoted to lexical semantics, such as WordNet and FrameNet, and those focusing on Spanish, such as 3LB-LEX and Adesse. Of these, this project focuses on FrameNet, which aims to document the range of semantic and syntactic combinatory possibilities of words in English. Words are grouped according to the different frames, or situations, evoked by their meaning. If we focus on a particular topic domain like medicine and try to describe it in terms of FrameNet, we would probably obtain frames representing it such as CURE, formed by words like cure.v, heal.v or palliative.a, or MEDICAL CONDITIONS, with lexical units such as arthritis.n, asphyxia.n or asthma.n.

The purpose of this work is to develop an automatic means of selecting frames from a particular domain and translating them into Spanish. As stated above, we will focus on medicine. The selection of the medical frames will be corpus-based; that is, we will extract all the frames that are statistically significant from a representative corpus. We will discuss why a corpus-based approach is a reliable and unbiased way of dealing with this task. We will present an automatic method for the selection of FrameNet frames and, to make sure that the results obtained are coherent, we will contrast them with a previous manual selection, or benchmark. Outcomes will be analysed using the F-score, a measure widely used in this type of application. We obtained an F-score of 0.87 against our benchmark, which demonstrates the applicability of this type of automatic approach.

The second part of the book is devoted to the translation of this selection into Spanish. The translation will be made using EuroWordNet, an extension of the Princeton WordNet to some European languages. We will explore different ways to link the units of our medical FrameNet selection to a particular WordNet synset, a set of words that share a similar meaning. Matching the frame units to a specific synset in EuroWordNet allows us both to translate them into Spanish and to add new terms provided by WordNet to FrameNet. The results show that translation can be done quite accurately (95.6%).
We hope this work can add new insight into the field of natural language processing.
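To make the evaluation concrete, here is a minimal sketch of how an automatic frame selection can be scored against a manual benchmark using the F-score; the frame names and sets are illustrative placeholders, not the study's actual data:

    # Precision, recall and F-score of an automatic frame selection
    # against a manual benchmark (illustrative frame names only).
    benchmark = {"CURE", "MEDICAL_CONDITIONS", "RECOVERY", "EXPERIENCE_BODILY_HARM"}
    automatic = {"CURE", "MEDICAL_CONDITIONS", "RECOVERY", "INGEST_SUBSTANCE"}

    true_positives = len(automatic & benchmark)
    precision = true_positives / len(automatic)   # 0.75
    recall = true_positives / len(benchmark)      # 0.75
    f_score = 2 * precision * recall / (precision + recall)
    print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")  # F=0.75

The same harmonic-mean formula underlies the 0.87 figure reported above, applied to the full frame selection rather than this toy example.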

Author(s):  
Kevin Bretonnel Cohen

Computational linguistics has its origins in the post-Second World War research on translation of Russian-language scientific journal articles in the United States. Today, biomedical natural language processing treats clinical data, the scientific literature, and social media, with use cases ranging from studying adverse effects of drugs to interpreting high-throughput genomic assays (Névéol and Zweigenbaum 2018). Many of the most prominent research areas in the field involve extracting information from text and normalizing it to enormous databases of domain-relevant semantic classes, such as genes, diseases, and biological processes. Moving forward, the field is expected to play a significant role in understanding reproducibility in natural language processing.


2020
Author(s):  
Fridah Katushemererwe ◽  
Andrew Caines ◽  
Paula Buttery

This paper describes an endeavour to build natural language processing (NLP) tools for Runyakitara, a group of four closely related Bantu languages spoken in western Uganda. In contrast with major world languages such as English, for which corpora are comparatively abundant and NLP tools are well developed, computational linguistic resources for Runyakitara are in short supply. First, therefore, we need to collect corpora for these languages before we can proceed to the design of a spell-checker, grammar-checker and applications for computer-assisted language learning (CALL). We explain how we are collecting primary data for a new Runya Corpus of speech and writing, outline the design of a morphological analyser, and discuss how we can use these new resources to build NLP tools. We are initially working with Runyankore–Rukiga, a closely related pair of Runyakitara languages, and we frame our project in the context of NLP for low-resource languages, as well as CALL for the preservation of endangered languages. We put our project forward as a test case for the revitalization of endangered languages through education and technology.
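To illustrate the kind of rule-based analysis a morphological analyser for an agglutinative Bantu language performs, here is a toy prefix-stripping sketch; the affix and root tables are hypothetical placeholders, not the project's actual Runyakitara grammar:

    # Toy morphological analyser for verb forms of the shape
    # subject prefix + tense marker + root + final vowel.
    # All affixes and roots below are hypothetical placeholders.
    SUBJECT_PREFIXES = {"n": "1SG", "tu": "1PL", "a": "3SG"}
    TENSE_MARKERS = {"ka": "PAST", "raa": "FUT"}
    ROOTS = {"gur": "buy", "kor": "work"}

    def analyse(verb):
        """Return (subject, tense, root gloss) or None."""
        for sp, s_gloss in SUBJECT_PREFIXES.items():
            for tm, t_gloss in TENSE_MARKERS.items():
                for root, r_gloss in ROOTS.items():
                    if verb == sp + tm + root + "a":  # final vowel -a
                        return (s_gloss, t_gloss, r_gloss)
        return None

    print(analyse("tukagura"))  # ('1PL', 'PAST', 'buy')

A real analyser would replace this exhaustive search with a finite-state lexicon and would also need to handle morphophonological alternations between morphemes.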


Author(s):  
Krzysztof Fiok ◽  
Waldemar Karwowski ◽  
Edgar Gutierrez ◽  
Maham Saeidi ◽  
Awad M. Aljuaid ◽  
...  

The COVID-19 pandemic has changed our lifestyles, habits, and daily routines. Some of the impacts of COVID-19 have already been widely reported; however, many effects of the pandemic are still to be discovered. The main objective of this study was to assess changes in the frequency of physical back pain complaints reported during the COVID-19 pandemic. In contrast to other published studies, we target the general population using Twitter as a data source. Specifically, we aim to investigate differences in the number of back pain complaints between the pre-pandemic and pandemic periods. A total of 53,234 and 78,559 tweets were analyzed for November 2019 and November 2020, respectively. Because Twitter users do not always complain explicitly when they tweet about experiencing back pain, we designed an intelligent filter based on natural language processing (NLP) to automatically classify the examined tweets into back pain complaints and all other tweets. Analysis of the filtered tweets indicated an 84% increase in back pain complaints in November 2020 compared to November 2019. These results might indicate significant changes in lifestyle during the COVID-19 pandemic, including restrictions on daily body movements and reduced exposure to routine physical exercise.
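As an illustration of how such an NLP filter might be structured, here is a minimal two-stage sketch (keyword pre-selection followed by a supervised classifier); the keywords and training examples are invented for illustration and are not the study's actual pipeline or data:

    # Two-stage filter: keyword pre-selection, then a supervised
    # classifier separating genuine complaints from other mentions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    KEYWORDS = ("back pain", "backache", "my back hurts")  # illustrative

    train_texts = [
        "my back hurts so much after sitting all day",    # complaint
        "working from home gave me constant backache",    # complaint
        "this song cures any backache lol",               # not a complaint
        "read an article about back pain statistics",     # not a complaint
    ]
    train_labels = [1, 1, 0, 0]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(train_texts, train_labels)

    def is_complaint(tweet):
        t = tweet.lower()
        return any(k in t for k in KEYWORDS) and clf.predict([tweet])[0] == 1

    print(is_complaint("ugh my back hurts from this desk"))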


2021
Author(s):  
Peter Baldwin ◽  
Janet Mee ◽  
Victoria Yaneva ◽  
Miguel Paniagua ◽  
Jean D’Angelo ◽  
...  

One of the most challenging aspects of writing multiple-choice test questions is identifying plausible incorrect response options—i.e., distractors. To help with this task, a procedure is introduced that can mine existing item banks for potential distractors by considering the similarities between a new item's stem and answer and the stems and response options of items in the bank. This approach uses natural language processing to measure similarity and requires a substantial pool of items for constructing the generating model. The procedure is demonstrated with data from the United States Medical Licensing Examination (USMLE®). For about half the items in the study, at least one of the top three system-produced candidates matched a human-produced distractor exactly; and for about one quarter of the items, two of the top three candidates matched human-produced distractors. A study was conducted in which a sample of system-produced candidates was shown to 10 experienced item writers. Overall, participants judged that about 81% of the candidates were on topic and that 56% would help human item writers with the task of writing distractors.
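A rough sketch of the mining step follows; TF-IDF cosine similarity stands in for the paper's similarity model, and the items are invented toys rather than USMLE content:

    # Rank bank items by similarity to the new item's stem+answer, then
    # pool the most similar item's response options as candidate
    # distractors. Items are invented; TF-IDF cosine similarity is a
    # stand-in for the paper's similarity measure.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    bank = [
        ("A patient presents with joint pain and morning stiffness.",
         ["osteoarthritis", "rheumatoid arthritis", "gout", "lupus"]),
        ("A patient presents with chest pain radiating to the left arm.",
         ["myocardial infarction", "pericarditis", "angina", "GERD"]),
    ]
    new_item = ("A patient presents with symmetric joint swelling. "
                "Most likely diagnosis? rheumatoid arthritis")

    stems = [stem for stem, _ in bank]
    vec = TfidfVectorizer().fit(stems + [new_item])
    sims = cosine_similarity(vec.transform([new_item]), vec.transform(stems))[0]
    best = max(range(len(bank)), key=lambda i: sims[i])
    candidates = [o for o in bank[best][1] if o != "rheumatoid arthritis"]
    print(candidates)  # candidate distractors from the most similar item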


2021
Author(s):  
Monique B. Sager ◽  
Aditya M. Kashyap ◽  
Mila Tamminga ◽  
Sadhana Ravoori ◽  
Christopher Callison-Burch ◽  
...  

BACKGROUND: Reddit, the fifth most popular website in the United States, boasts a large and engaged user base on its dermatology forums, where users crowdsource free medical opinions. Unfortunately, much of the advice provided is unvalidated and could lead to inappropriate care. Initial testing has shown that artificially intelligent bots can detect misinformation on Reddit forums and may be able to produce responses to posts containing misinformation.
OBJECTIVE: To analyze the ability of bots to find and respond to health misinformation on Reddit's dermatology forums in a controlled test environment.
METHODS: Using natural language processing techniques, we trained bots to target misinformation using relevant keywords and to post pre-fabricated responses. We compared performance by evaluating different model architectures on a held-out test set.
RESULTS: Our models yielded test accuracies ranging from 95% to 100%, with a fine-tuned BERT model achieving the highest test accuracy. The bots were then able to post corrective pre-fabricated responses to misinformation.
CONCLUSIONS: Using a limited data set, bots had a near-perfect ability to detect these examples of health misinformation within Reddit dermatology forums. Given that these bots can then post pre-fabricated responses, this technique may allow for the interception of misinformation. Providing correct information, even instantly, does not, however, mean users will be receptive or find such interventions persuasive. Further work should investigate this strategy's effectiveness to inform future deployment of bots as a technique for combating health misinformation.
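A minimal sketch of the detect-and-respond loop is shown below; the model path, label name, and response text are placeholders, assuming a BERT checkpoint fine-tuned on labeled forum posts as described above:

    # Detect-and-respond loop: a fine-tuned BERT classifier flags
    # misinformation, and a pre-fabricated response is returned.
    # Model path, label name, and response text are placeholders.
    from transformers import pipeline

    detector = pipeline("text-classification", model="path/to/finetuned-bert")

    PREFAB_RESPONSE = ("This advice is not supported by current dermatology "
                       "guidance; please consult a board-certified dermatologist.")

    def respond_if_misinformation(post_text):
        result = detector(post_text)[0]  # {'label': ..., 'score': ...}
        if result["label"] == "MISINFORMATION" and result["score"] > 0.9:
            return PREFAB_RESPONSE  # in deployment, posted as a bot reply
        return None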


Author(s):  
Mans Hulden

Finite-state machines—automata and transducers—are ubiquitous in natural-language processing and computational linguistics. This chapter introduces the fundamentals of finite-state automata and transducers, both probabilistic and non-probabilistic, illustrating the technology with example applications and common usage. It also covers the construction of transducers, which correspond to regular relations, and automata, which correspond to regular languages. The technologies introduced are widely employed in natural language processing, computational phonology and morphology in particular, and this is illustrated through common practical use cases.
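As a minimal illustration of the formalism (a hand-written toy, not any particular finite-state toolkit):

    # A deterministic finite automaton over {a, b} that accepts exactly
    # the strings containing an even number of b's.
    DFA = {
        "start": "even",
        "accept": {"even"},
        "delta": {("even", "a"): "even", ("even", "b"): "odd",
                  ("odd", "a"): "odd", ("odd", "b"): "even"},
    }

    def accepts(dfa, s):
        state = dfa["start"]
        for ch in s:
            state = dfa["delta"].get((state, ch))
            if state is None:  # no transition defined: reject
                return False
        return state in dfa["accept"]

    print(accepts(DFA, "abba"))  # True: two b's
    print(accepts(DFA, "ab"))    # False: one b

A transducer extends this picture by attaching an output symbol to each transition, so the machine computes a regular relation (an input-output mapping) rather than merely accepting a regular language.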


Author(s):  
Sourajit Roy ◽  
Pankaj Pathak ◽  
S. Nithya

The turn of the 21st century brought a series of technical breakthroughs, and Natural Language Processing (NLP) has been among the most dynamic of the resulting disciplines, driven by groundbreaking findings across computing. With the digital revolution, the amount of data generated by machine-to-machine (M2M) communication across devices and platforms such as Amazon Alexa, Apple Siri, and Microsoft Cortana has increased significantly. This produces a great deal of unstructured data that does not fit standard computational models. In addition, the growing challenges of language complexity, data variability, and speech ambiguity make models increasingly hard to implement. The current study provides an overview of the potential and breadth of the NLP market and its industry-wide acceptance, particularly after Covid-19. It also gives a macroscopic picture of progress in natural language processing research, development, and implementation.


Author(s):  
Ayush Srivastav ◽  
Hera Khan ◽  
Amit Kumar Mishra

The chapter provides an eloquent account of the major methodologies and advances in the field of Natural Language Processing. The most popular models used over time for Natural Language Processing tasks are discussed along with their applications. The chapter begins with the fundamental concepts of regex and tokenization. It provides an insight into text preprocessing and its methodologies, such as Stemming and Lemmatization and Stop Word Removal, followed by Part-of-Speech tagging and Named Entity Recognition. Further, the chapter elaborates on the concept of Word Embedding, its various types, and some common frameworks such as word2vec, GloVe, and fastText. A brief description of classification algorithms used in Natural Language Processing is provided next, followed by Neural Networks and their advanced forms, such as Recursive Neural Networks and Seq2seq models used in Computational Linguistics. A brief description of chatbots and Memory Networks concludes the chapter.
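As a schematic companion to the preprocessing steps listed above, here is a toy pipeline covering regex tokenization, stop word removal, and naive suffix-stripping stemming; the stop word list and suffix rules are deliberately minimal illustrations, not a production implementation:

    # Toy preprocessing pipeline: regex tokenization, stop word
    # removal, and naive suffix-stripping "stemming".
    import re

    STOP_WORDS = {"the", "a", "an", "of", "and", "is"}  # tiny illustrative list

    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def stem(token):
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    text = "The models are processing the tokenized sentences"
    tokens = [stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    print(tokens)  # ['model', 'are', 'process', 'tokeniz', 'sentence']

The over-aggressive 'tokeniz' output illustrates why lemmatization, which maps tokens to dictionary forms, is often preferred over crude stemming.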

