Multilingual Facilitation

2021

This is the Festschrift of Dr. Jack Rueter. The book presents peer-reviewed scientific work from Dr. Rueter's colleagues on the latest advances in natural language processing, digital resources and endangered languages, covering languages such as historical English, Chukchi, Mansi, Erzya, Komi, Finnish, Apurinã, sign languages, Sámi languages and Japanese. Most of the papers present work on endangered languages or on domains with a limited number of resources available for NLP. The book collects original and insightful papers from well-established researchers in NLP, linguistics, philology and digital humanities. It is a tribute to Dr. Rueter's long career, characterized by constant altruistic work towards a greater good in building free and open-source tools and resources for endangered languages. Dr. Rueter is a true pioneer in the field of digital documentation of endangered languages.

2020
Vol 0 (0)
Author(s): Fridah Katushemererwe, Andrew Caines, Paula Buttery

Abstract: This paper describes an endeavour to build natural language processing (NLP) tools for Runyakitara, a group of four closely related Bantu languages spoken in western Uganda. In contrast with major world languages such as English, for which corpora are comparatively abundant and NLP tools are well developed, computational linguistic resources for Runyakitara are in short supply. We therefore first need to collect corpora for these languages before we can proceed to the design of a spell-checker, grammar-checker and applications for computer-assisted language learning (CALL). We explain how we are collecting primary data for a new Runya Corpus of speech and writing, outline the design of a morphological analyser, and discuss how we can use these new resources to build NLP tools. We are initially working with Runyankore–Rukiga, a closely related pair of Runyakitara languages, and we frame our project in the context of NLP for low-resource languages, as well as CALL for the preservation of endangered languages. We put our project forward as a test case for the revitalization of endangered languages through education and technology.
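A practical starting point for such a spell-checker, once corpus collection is underway, is a frequency list of attested word forms plus edit-distance candidate generation. The following Python sketch is a minimal illustration of that idea only; the corpus file name, the edit-distance-1 strategy, and the function names are assumptions, not the authors' design:

```python
from collections import Counter

def edits1(word, alphabet):
    """All strings at edit distance 1 from `word` (deletes, swaps, replaces, inserts)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def suggest(word, counts, alphabet):
    """Return corpus-attested candidates at edit distance 1, most frequent first."""
    if word in counts:
        return [word]
    candidates = [w for w in edits1(word, alphabet) if w in counts]
    return sorted(candidates, key=counts.get, reverse=True)

# Hypothetical usage: build word frequencies from an (assumed) corpus file.
with open("runya_corpus.txt", encoding="utf-8") as f:
    counts = Counter(f.read().lower().split())
alphabet = sorted({ch for w in counts for ch in w})
print(suggest("okukora", counts, alphabet))
```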


2019
Author(s): Daniel M. Bean, James Teo, Honghan Wu, Ricardo Oliveira, Raj Patel, ...

Abstract: Atrial fibrillation (AF) is the most common arrhythmia and significantly increases stroke risk. This risk is effectively managed by oral anticoagulation (OAC). Recent studies using national registry data indicate increased use of anticoagulation resulting from changes in guidelines and the availability of newer drugs.

The aim of this study is to develop and validate an open-source risk-scoring pipeline for free-text electronic health record (EHR) data using natural language processing.

AF patients discharged from 1st January 2011 to 1st October 2017 were identified from discharge summaries (N=10,030, 64.6% male, average age 75.3 ± 12.3 years). A natural language processing pipeline was developed to identify risk factors in clinical text and calculate risk for ischaemic stroke (CHA2DS2-VASc) and bleeding (HAS-BLED). Scores were validated against two independent experts for 40 patients.

Automatic risk scores were in strong agreement with the two independent experts for CHA2DS2-VASc (average kappa 0.78 vs experts, compared to 0.85 between experts). Agreement was lower for HAS-BLED (average kappa 0.54 vs experts, compared to 0.74 between experts).

In high-risk patients (CHA2DS2-VASc ≥ 2), OAC use has increased significantly over the last 7 years, driven by the availability of direct oral anticoagulants (DOACs) and the transitioning of patients from antiplatelet medication alone to OAC. Factors independently associated with OAC use included components of the CHA2DS2-VASc and HAS-BLED scores as well as discharging specialty and frailty. OAC use was highest in patients discharged under cardiology (69%).

Electronic health record text can be used for automatic calculation of clinical risk scores at scale. Open-source tools are available today for this task but require further validation. Analysis of routinely collected EHR data can replicate findings from large-scale curated registries.
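Once the risk factors have been extracted from the discharge summaries, the CHA2DS2-VASc arithmetic itself is a fixed point scheme (age 65-74 scores 1, age ≥75 scores 2, prior stroke/TIA scores 2, and the remaining factors score 1 each). Below is a minimal Python sketch of that scoring step; the function signature is an assumption, and the NLP extraction that feeds it is the hard part the paper addresses:

```python
def cha2ds2_vasc(age: int, female: bool, chf: bool, hypertension: bool,
                 diabetes: bool, stroke_tia: bool, vascular_disease: bool) -> int:
    """Standard CHA2DS2-VASc score: each risk factor contributes fixed points."""
    score = 0
    score += 2 if age >= 75 else (1 if age >= 65 else 0)  # A2 / A
    score += 1 if female else 0                           # Sc (sex category)
    score += 1 if chf else 0                              # C (congestive heart failure)
    score += 1 if hypertension else 0                     # H
    score += 1 if diabetes else 0                         # D
    score += 2 if stroke_tia else 0                       # S2 (prior stroke/TIA)
    score += 1 if vascular_disease else 0                 # V
    return score

# e.g. a 78-year-old woman with hypertension: 2 (age) + 1 (sex) + 1 (HTN) = 4
assert cha2ds2_vasc(78, True, False, True, False, False, False) == 4
```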


Author(s): Laura Buszard-Welcher

This chapter presents three technologies essential to enabling any language in the digital domain: language identifiers (ISO 639-3), Unicode (including fonts and keyboards), and the building of corpora to enable natural language processing. Just a few major languages of the world are well-enabled for use with electronically mediated communication. Another few hundred languages are arguably on their way to being well-enabled, if for market reasons alone. For all the remaining languages of the world, inclusion in the digital domain remains a distant possibility, and one that likely requires sustained interest, attention, and resources on the part of the language community itself. The good news is that the same technologies that enable the more widespread languages can also enable the less widespread, and even endangered ones, and bootstrapping is possible for all of them. The examples and resources described in this chapter can serve as inspiration and guidance in getting started.
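To make the Unicode point concrete: the same visible word can be encoded with different code-point sequences, so text must be normalized before search and comparison work reliably in any language. A small sketch using only the Python standard library; the example word is arbitrary:

```python
import unicodedata

# "ñ" can be one precomposed code point, or "n" plus a combining tilde.
precomposed = "espa\u00f1ol"    # uses U+00F1
decomposed = "espan\u0303ol"    # uses U+006E followed by U+0303
assert precomposed != decomposed  # identical on screen, unequal as code points

# Normalizing both to NFC makes them comparable.
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)
```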


2018
Vol 11 (3)
pp. 1-25
Author(s): Leonel Figueiredo de Alencar, Bruno Cuconato, Alexandre Rademaker

ABSTRACT: One of the prerequisites for many natural language processing technologies is the availability of large lexical resources. This paper reports on MorphoBr, an ongoing project aiming at building a comprehensive full-form lexicon for morphological analysis of Portuguese. A first version of the resource is already freely available online under an open-source, free-software license. MorphoBr combines analogous free resources, correcting several thousand errors and gaps, and systematically adding new entries. In comparison to the integrated resources, lexical entries in MorphoBr follow a more user-friendly format, which can be straightforwardly compiled into finite-state transducers for morphological analysis, e.g. in the context of syntactic parsing with a grammar in the LFG formalism using the XLE system. MorphoBr results from a combination of computational techniques. Errors and the more obvious gaps in the integrated resources were automatically corrected with scripts. However, MorphoBr's main contribution is the expansion of the inventory of nouns and adjectives. This was carried out by systematically modeling diminutive formation in the paradigm of finite-state morphology, which allowed MorphoBr to significantly outperform analogous resources in the coverage of diminutives. The first evaluation results show MorphoBr to be a promising initiative that will directly contribute to the development of more robust natural language processing tools and applications which depend on wide-coverage morphological analysis.

KEYWORDS: computational linguistics; natural language processing; morphological analysis; full-form lexicon; diminutive formation.
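As a rough intuition for what the finite-state diminutive rules encode, here is a deliberately simplified Python sketch of two common Portuguese patterns (-inho/-inha after a final unstressed -o/-a, -zinho/-zinha otherwise). Real Portuguese diminutive morphology has many more patterns and exceptions, and this toy rule is not MorphoBr's actual rule set:

```python
def diminutive(noun: str, feminine: bool) -> str:
    """Toy rule: -o/-a nouns swap the final vowel for -inho/-inha;
    other endings take -zinho/-zinha. Ignores stress, plurals, exceptions."""
    if noun.endswith(("o", "a")):
        return noun[:-1] + ("inha" if feminine else "inho")
    return noun + ("zinha" if feminine else "zinho")

assert diminutive("livro", feminine=False) == "livrinho"
assert diminutive("casa", feminine=True) == "casinha"
assert diminutive("flor", feminine=True) == "florzinha"
```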


2021
Author(s): Christoph Stich, Emmanouil Tranos, Max Nathan

This paper proposes a new methodological framework to identify economic clusters over space and time. We employ a unique open-source dataset of geolocated and archived business webpages and interrogate them using natural language processing to build bottom-up classifications of economic activities. We validate our method on an iconic UK tech cluster: Shoreditch, East London. We benchmark our results against existing case studies and administrative data, replicating the main features of the cluster and providing fresh insights. As well as overcoming limitations in conventional industrial classification, our method addresses some of the spatial and temporal limitations of the clustering literature.
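The authors' exact pipeline is not reproduced here, but one plausible bottom-up classification of economic activity from webpage text is to vectorize the pages and cluster them. A minimal sketch with scikit-learn; the toy documents and the choice of TF-IDF with k-means are assumptions, not the paper's method:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

pages = [  # stand-ins for archived business webpage text
    "bespoke software development and app design agency",
    "full stack web development, cloud APIs and devops",
    "artisan coffee roastery and espresso bar in east london",
    "specialty coffee, brunch and pastries near old street",
]

# Vectorize page text, then group pages with similar vocabulary.
vectors = TfidfVectorizer(stop_words="english").fit_transform(pages)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]: tech pages vs. coffee pages
```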


Author(s): Keliang Chen, Yunxiao Zu, Weizheng Ren

The digitization of content resources has displaced the traditional paper-based publishing model and spread widely. Digital resources processed at the level of text structure must also be structured with professional domain knowledge, so that they can be stored in a knowledge base of professional digital content and provide basic metadata for an intelligent knowledge-service platform. The domain-specific knowledge-system construction platform explored in this study is designed around natural language processing, an important branch of artificial intelligence that applies AI technology to linguistics. The system first extracts a professional thesaurus and domain ontology from the digital resources, then uses a new-word discovery algorithm based on label weights to extract and clean new words for the base thesaurus. At the same time, it establishes a relationship system between knowledge points and their elements to support targeted extraction of knowledge-point associations, and finally enriches the output from individual knowledge points into a connected knowledge system. To improve scalability and universality, the design accounts for extensible thesauri, algorithms, computational capacity, tags, and an exception thesaurus, and adopts an "artificial intelligence + manual assistance" implementation. On the basis of improved system availability, it provides an experimental foundation for optimizing the algorithms. The results of this research bring artificial-intelligence innovation to the publishing industry after digitization, transforming content services into intelligent services based on the knowledge system.
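The label-weight algorithm itself is not specified in enough detail to reproduce, but new-word discovery is commonly bootstrapped by scoring adjacent token pairs with pointwise mutual information (PMI): pairs that co-occur far more often than chance predicts are candidate lexical units. A stand-alone sketch of that general idea, not the paper's algorithm:

```python
import math
from collections import Counter

def pmi_candidates(tokens, min_count=2):
    """Score adjacent token pairs by pointwise mutual information.
    High-PMI, frequent pairs are candidate new multi-token words."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (a, b), c in bigrams.items():
        if c >= min_count:
            scores[(a, b)] = math.log((c / (n - 1)) /
                                      ((unigrams[a] / n) * (unigrams[b] / n)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text = "knowledge base service knowledge base platform knowledge base model".split()
print(pmi_candidates(text))  # ("knowledge", "base") is the only frequent pair
```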


2013
Vol 100 (1)
pp. 73-82
Author(s): Anthony Rousseau

Abstract: In this paper we describe XenC, an open-source tool for data selection aimed at natural language processing (NLP) in general and statistical machine translation (SMT) or automatic speech recognition (ASR) in particular. Usually, when building an SMT or ASR system, the task at hand is tied to a specific domain of application, such as news articles or scientific talks. The goal of XenC is to select the data relevant to that task, which will then be used to build the statistical models for such a system. It does so by computing, for each sentence, the difference between its cross-entropy score under a model of a corpus considered in-domain for the task and its score under a model of a large out-of-domain corpus. Written in C++, the tool can operate on monolingual or bilingual data and is language-independent. XenC, now part of the LIUM toolchain for SMT, has been actively developed since December 2011 and is used in many MT projects.
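The criterion described above matches the well-known cross-entropy difference method: a sentence is kept if it is likely under an in-domain language model and unlikely under an out-of-domain one. A toy Python illustration with unigram models and add-one smoothing; XenC itself uses proper n-gram language models, and the corpora below are placeholders:

```python
import math
from collections import Counter

def unigram_lm(corpus_tokens):
    """Add-one-smoothed unigram log-probabilities, returned as a closure."""
    counts = Counter(corpus_tokens)
    total, vocab = sum(counts.values()), len(counts) + 1
    return lambda w: math.log((counts[w] + 1) / (total + vocab))

def cross_entropy(sentence, logprob):
    words = sentence.split()
    return -sum(logprob(w) for w in words) / len(words)

in_domain = "the patient received oral anticoagulation therapy".split()
out_domain = "the striker scored a late goal in the derby".split()
h_in, h_out = unigram_lm(in_domain), unigram_lm(out_domain)

pool = ["the patient started therapy", "the goal came late"]
# Lower H_in - H_out means more in-domain: keep those sentences first.
ranked = sorted(pool, key=lambda s: cross_entropy(s, h_in) - cross_entropy(s, h_out))
print(ranked)
```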


2015
Vol 1 (1)
pp. 198-205
Author(s): Daniela Gîfu, Marius Cioca

Abstract: The paper presents the importance of isotopy analysis of anonymous readers' comments as part of the deep interpretation of texts. We describe a methodology for classifying anonymous readers' comments on online articles through overlapping isotopies, complementing traditional analytical methods. Automatic recognition of isotopies is an important topic in natural language processing (NLP), especially in semantic disambiguation. The aim of this article is the automatic comparative analysis of the isotopies identified in articles and comments, which reveals an important part of online behaviour. Moreover, we present a new tool that classifies online commentators based on existing resources that are open-source or freely available for research purposes. This study is intended to help direct beneficiaries (journalists, businesses, educators, managers, PR specialists), but also specialists and researchers in the field of natural language processing, linguists, psychologists, etc.
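The abstract does not specify how isotopies are recognized; as a crude lexical proxy for the idea, one can measure how much of an article's content vocabulary recurs in a comment. A deliberately simplified sketch (real isotopy analysis operates on semantic features, not surface words):

```python
def content_overlap(article: str, comment: str,
                    stopwords=frozenset({"the", "a", "of", "and", "to", "is"})) -> float:
    """Jaccard overlap of content words, a rough stand-in for shared isotopies."""
    art = {w for w in article.lower().split() if w not in stopwords}
    com = {w for w in comment.lower().split() if w not in stopwords}
    return len(art & com) / len(art | com) if art | com else 0.0

print(content_overlap("the election results surprised analysts",
                      "analysts predicted different election results"))  # 0.5
```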

