MorphoBr: an open source large-coverage full-form lexicon for morphological analysis of Portuguese

2018
Vol 11 (3)
pp. 1-25
Author(s): Leonel Figueiredo de Alencar, Bruno Cuconato, Alexandre Rademaker

ABSTRACT: One of the prerequisites for many natural language processing technologies is the availability of large lexical resources. This paper reports on MorphoBr, an ongoing project aiming at building a comprehensive full-form lexicon for morphological analysis of Portuguese. A first version of the resource is already freely available online under an open source, free software license. MorphoBr combines analogous free resources, correcting several thousand errors and gaps, and systematically adding new entries. In comparison to the integrated resources, lexical entries in MorphoBr follow a more user-friendly format, which can be straightforwardly compiled into finite-state transducers for morphological analysis, e.g. in the context of syntactic parsing with a grammar in the LFG formalism using the XLE system. MorphoBr results from a combination of computational techniques. Errors and the more obvious gaps in the integrated resources were automatically corrected with scripts. However, MorphoBr's main contribution is the expansion of the inventory of nouns and adjectives, carried out by systematically modeling diminutive formation in the paradigm of finite-state morphology. This allowed MorphoBr to significantly outperform analogous resources in the coverage of diminutives. The first evaluation results show MorphoBr to be a promising initiative that will directly contribute to the development of more robust natural language processing tools and applications that depend on wide-coverage morphological analysis.

KEYWORDS: computational linguistics; natural language processing; morphological analysis; full-form lexicon; diminutive formation.
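MorphoBr is described as a full-form lexicon, i.e. a resource that lists every inflected surface form together with its lemma and morphological tags. As a minimal sketch of how such a resource can be consumed even before compiling it into a transducer, the Python snippet below loads tab-separated entries into a dictionary and looks up a diminutive. The entry layout and tag names shown here are assumptions for illustration, not necessarily MorphoBr's own format.

```python
from collections import defaultdict

def load_lexicon(path):
    """Load a full-form lexicon into a dict: surface form -> list of analyses.

    Assumes one entry per line, "surface<TAB>lemma+TAG+TAG..." (an illustrative
    layout; check the MorphoBr repository for the actual file format).
    """
    lexicon = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                surface, analysis = line.split("\t", 1)
                lexicon[surface].append(analysis)
    return lexicon

def analyze(lexicon, word):
    """Return all analyses recorded for a surface form (empty list if unknown)."""
    return lexicon.get(word, [])

if __name__ == "__main__":
    # Hypothetical entries with made-up tags, including diminutive nouns.
    sample = {
        "casinha": ["casa+N+F+SG+DIM"],
        "gatinhos": ["gato+N+M+PL+DIM"],
    }
    print(analyze(sample, "casinha"))   # ['casa+N+F+SG+DIM']
    print(analyze(sample, "xyz"))       # []
```

A dictionary lookup like this is only a stand-in; the abstract's point is that the same entries compile straightforwardly into finite-state transducers, which is what the project actually targets.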

Author(s): Mans Hulden

Finite-state machines—automata and transducers—are ubiquitous in natural-language processing and computational linguistics. This chapter introduces the fundamentals of finite-state automata and transducers, both probabilistic and non-probabilistic, illustrating the technology with example applications and common usage. It also covers the construction of transducers, which correspond to regular relations, and automata, which correspond to regular languages. The technologies introduced are widely employed in natural language processing, computational phonology and morphology in particular, and this is illustrated through common practical use cases.
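To make the automaton/regular-language correspondence concrete, here is a self-contained toy example (not code from the chapter): a deterministic finite automaton, written as a transition table in Python, that accepts the regular language of strings over {a, b} ending in "ab".

```python
# States: 0 = start, 1 = just saw 'a', 2 = just saw "ab" (accepting).
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
ACCEPTING = {2}

def accepts(word: str) -> bool:
    """Run the DFA over `word`; reject on symbols outside the alphabet."""
    state = 0
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

if __name__ == "__main__":
    for w in ["ab", "aab", "abb", "ba", ""]:
        print(w or "<empty>", accepts(w))
```

A transducer adds an output component to each transition, so the same table-driven style extends from recognizing a regular language to computing a regular relation.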


2011
Vol 17 (2)
pp. 141-144
Author(s): Anssi Yli-Jyrä, András Kornai, Jacques Sakarovitch

For the past two decades, specialised events on finite-state methods have been successful in presenting interesting studies on natural language processing to the public through journals and collections. The FSMNLP workshops have become well known among researchers and are now the main forum of the Association for Computational Linguistics' (ACL) Special Interest Group on Finite-State Methods (SIGFSM). The current issue on finite-state methods and models in natural language processing was planned in 2008 in this context, as a response to a call for special-issue proposals. In 2010, the issue received a total of sixteen submissions, some of which were extended and updated versions of workshop papers, while others were completely new. The final selection, consisting of only seven papers that could fit into one issue, is not fully representative, but it complements the prior special issues nicely. The selected papers showcase a few areas where finite-state methods have less than obvious, and sometimes even groundbreaking, relevance to natural language processing (NLP) applications.


2003
Vol 9 (1)
pp. 1-3
Author(s): Lauri Karttunen, Kimmo Koskenniemi, Gertjan van Noord

Finite state methods have been in common use in various areas of natural language processing (NLP) for many years, as a series of specialized workshops in this area illustrates. In 1996, András Kornai organized a very successful workshop entitled Extended Finite State Models of Language. One of the results of that workshop was a special issue of Natural Language Engineering (Volume 2, Number 4). In 1998, Kemal Oflazer organized a workshop called Finite State Methods in Natural Language Processing. A selection of submissions to this workshop was later included in a special issue of Computational Linguistics (Volume 26, Number 1). Inspired by these events, Lauri Karttunen, Kimmo Koskenniemi and Gertjan van Noord took the initiative for a workshop on finite state methods in NLP in Helsinki, as part of the European Summer School in Logic, Language and Information. As a related special event, the 20th anniversary of two-level morphology was celebrated. The appreciation of these events led us to believe that, once again, it should be possible, with some additional submissions, to compose an interesting special issue of this journal.


Author(s): Justin F. Brunelle, Chutima Boonthum-Denecke

This chapter discusses a subset of Natural Language Processing (NLP) tools available to researchers and enthusiasts in computer science, computational linguistics, and other fields that may use or benefit from NLP. Several tools are described, along with their background, a brief account of the algorithms they use, typical usage, and examples. While the chapter is not comprehensive, it gives extensive exposure to various NLP tools through examples, aims to provide an overview of the available resources, and concentrates mainly on open-source applications. Open-source applications were chosen because they are freely available for download by all users. Open-source software commonly ships with the code that makes up the tool, allowing users to inspect its inner workings or even modify it. By working from open-source examples, readers can extend their investigation of NLP tools beyond the pages of this text.


2008
Vol 14 (2)
pp. 173-190
Author(s): S. Yona, S. Wintner

Morphological analysis is a crucial component of several natural language processing tasks, especially for languages with a highly productive morphology, where stipulating a full lexicon of surface forms is not feasible. This paper describes HAMSAH (HAifa Morphological System for Analyzing Hebrew), a morphological processor for Modern Hebrew, based on linguistically motivated finite-state rules and a broad-coverage lexicon. The set of rules comprehensively covers the morphological, morpho-phonological and orthographic phenomena that are observable in contemporary Hebrew texts. Reliance on finite-state technology facilitates the construction of a highly efficient, completely bidirectional system for analysis and generation.


2019
Author(s): Daniel M. Bean, James Teo, Honghan Wu, Ricardo Oliveira, Raj Patel, ...

Atrial fibrillation (AF) is the most common arrhythmia and significantly increases stroke risk. This risk is effectively managed by oral anticoagulation (OAC). Recent studies using national registry data indicate increased use of anticoagulation resulting from changes in guidelines and the availability of newer drugs.

The aim of this study is to develop and validate an open-source risk-scoring pipeline for free-text electronic health record data using natural language processing.

AF patients discharged from 1st January 2011 to 1st October 2017 were identified from discharge summaries (N=10,030, 64.6% male, average age 75.3 ± 12.3 years). A natural language processing pipeline was developed to identify risk factors in clinical text and calculate risk for ischaemic stroke (CHA2DS2-VASc) and bleeding (HAS-BLED). Scores were validated against two independent experts for 40 patients.

Automatic risk scores were in strong agreement with the two independent experts for CHA2DS2-VASc (average kappa 0.78 vs experts, compared to 0.85 between experts). Agreement was lower for HAS-BLED (average kappa 0.54 vs experts, compared to 0.74 between experts).

In high-risk patients (CHA2DS2-VASc ≥2), OAC use has increased significantly over the last 7 years, driven by the availability of direct oral anticoagulants (DOACs) and the transitioning of patients from antiplatelet (AP) medication alone to OAC. Factors independently associated with OAC use included components of the CHA2DS2-VASc and HAS-BLED scores as well as discharging specialty and frailty. OAC use was highest in patients discharged under cardiology (69%).

Electronic health record text can be used for automatic calculation of clinical risk scores at scale. Open-source tools are available today for this task but require further validation. Analysis of routinely collected EHR data can replicate findings from large-scale curated registries.
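The NLP pipeline itself is not reproduced in the abstract, but the downstream arithmetic is the published CHA2DS2-VASc rule. The sketch below scores a patient from risk factors that are assumed to have already been extracted from the text; it shows only the scoring step, not the authors' extraction method.

```python
def cha2ds2_vasc(age: int, female: bool, chf: bool, hypertension: bool,
                 diabetes: bool, stroke_or_tia: bool, vascular_disease: bool) -> int:
    """Standard CHA2DS2-VASc stroke-risk score from already-extracted risk factors.

    Only the published scoring rules are reproduced here; extracting the factors
    from discharge summaries (the paper's contribution) is not shown.
    """
    score = 0
    score += 2 if age >= 75 else (1 if 65 <= age <= 74 else 0)  # age band
    score += 1 if female else 0            # sex category
    score += 1 if chf else 0               # congestive heart failure
    score += 1 if hypertension else 0
    score += 1 if diabetes else 0
    score += 2 if stroke_or_tia else 0     # prior stroke/TIA/thromboembolism
    score += 1 if vascular_disease else 0
    return score

if __name__ == "__main__":
    # A 78-year-old woman with hypertension: 2 (age) + 1 (sex) + 1 (hypertension) = 4,
    # above the >=2 high-risk threshold used in the abstract.
    print(cha2ds2_vasc(age=78, female=True, chf=False, hypertension=True,
                       diabetes=False, stroke_or_tia=False, vascular_disease=False))
```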


Author(s): Ayush Srivastav, Hera Khan, Amit Kumar Mishra

The chapter provides an eloquent account of the major methodologies and advances in the field of Natural Language Processing. The most popular models used over time for Natural Language Processing tasks are discussed along with their applications. The chapter begins with the fundamental concepts of regular expressions (regex) and tokenization. It then gives an insight into text preprocessing and its methods, such as stemming, lemmatization, and stop-word removal, followed by part-of-speech tagging and named entity recognition. Further, the chapter elaborates on the concept of word embeddings, their various types, and some common frameworks such as word2vec, GloVe, and fastText. A brief description of classification algorithms used in Natural Language Processing comes next, followed by neural networks and their more advanced forms, such as recursive neural networks and seq2seq models used in computational linguistics. A brief description of chatbots and memory networks concludes the chapter.
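Since the chapter opens with regex and tokenization before moving to heavier machinery, a minimal standard-library sketch of those first steps (regex tokenization plus stop-word removal, with a deliberately tiny illustrative stop-word list) might look like this:

```python
import re

# Minimal token pattern and stop-word list, purely for illustration;
# real pipelines use much richer resources.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def tokenize(text: str) -> list[str]:
    """Lower-case the text and extract word-like tokens with a regular expression."""
    return TOKEN_PATTERN.findall(text.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop high-frequency function words before steps such as tagging or embedding."""
    return [t for t in tokens if t not in STOP_WORDS]

if __name__ == "__main__":
    sentence = "The advances in Natural Language Processing are remarkable."
    tokens = tokenize(sentence)
    print(tokens)                      # ['the', 'advances', 'in', 'natural', ...]
    print(remove_stop_words(tokens))   # ['advances', 'natural', 'language', ...]
```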


1996
Vol 16
pp. 70-85
Author(s): Thomas C. Rindflesch

Work in computational linguistics began very soon after the development of the first computers (Booth, Brandwood and Cleave 1958), yet in the intervening four decades there has been a pervasive feeling that progress in computer understanding of natural language has not been commensurate with progress in other computer applications. Recently, a number of prominent researchers in natural language processing met to assess the state of the discipline and discuss future directions (Bates and Weischedel 1993). The consensus of this meeting was that increased attention to large amounts of lexical and domain knowledge was essential for significant progress, and current research efforts in the field reflect this point of view.

