Discourse structure and language technology

2011 ◽  
Vol 18 (4) ◽  
pp. 437-490 ◽  
Author(s):  
B. WEBBER ◽  
M. EGG ◽  
V. KORDONI

Abstract
An increasing number of researchers and practitioners in Natural Language Engineering face the prospect of having to work with entire texts, rather than individual sentences. While it is clear that text must have useful structure, its nature may be less clear, making it more difficult to exploit in applications. This survey of work on discourse structure thus provides a primer on the bases on which discourse is structured, along with some of their formal properties. It then lays out the current state of the art in algorithms for recognizing these different structures, and shows how these algorithms are currently being used in Language Technology applications. After identifying resources that should prove useful in improving algorithm performance across a range of languages, we conclude by speculating on future discourse structure-enabled technology.

2019 ◽  
Vol 25 (3) ◽  
pp. 405-418
Author(s):  
John Tait ◽  
Yorick Wilks

Abstract
The paper reviews the state of the art of natural language engineering (NLE) around 1995, when this journal first appeared, and makes a critical comparison with the current state of the art in 2018, as we prepare the 25th volume. Specifically, the then state of the art in parsing, information extraction, chatbots and dialogue systems, speech processing, and machine translation is briefly reviewed. The emergence in the 1980s and 1990s of machine learning (ML) and statistical methods (SM) is noted. Important trends and areas of progress in the subsequent years are identified. In particular, the move away from whole-sentence parsing and towards n-grams or skip-grams and/or chunking with part-of-speech tagging is noted, as is the increasing dominance of SM and ML. Some outstanding issues which merit further research are briefly pointed out, including metaphor processing and the ethical implications of NLE.
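The n-gram representation the review refers to can be shown in a few lines. This is a generic illustration, not code from the paper; the function name is ours:

```python
# Illustration (not from the review) of the n-gram representation it
# mentions: the contiguous n-token subsequences of a tokenised sentence.
def ngrams(tokens, n):
    """Return every contiguous n-token subsequence as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Skip-grams generalise this by allowing gaps between the selected tokens, trading exactness of context for coverage.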


2018 ◽  
Vol 61 ◽  
pp. 65-170 ◽  
Author(s):  
Albert Gatt ◽  
Emiel Krahmer

This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past two decades, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organised; (b) highlight a number of recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges of NLG evaluation, relating them to similar challenges faced in other areas of NLP, with an emphasis on different evaluation methods and the relationships between them.


1995 ◽  
Vol 1 (1) ◽  
pp. 29-81 ◽  
Author(s):  
I. Androutsopoulos ◽  
G.D. Ritchie ◽  
P. Thanisch

Abstract
This paper is an introduction to natural language interfaces to databases (NLIDBs). A brief overview of the history of NLIDBs is first given. Some advantages and disadvantages of NLIDBs are then discussed, comparing NLIDBs to formal query languages, form-based interfaces, and graphical interfaces. An introduction to some of the linguistic problems NLIDBs have to confront follows, for the benefit of readers less familiar with computational linguistics. The discussion then moves on to NLIDB architectures, portability issues, restricted natural language input systems (including menu-based NLIDBs), and NLIDBs with reasoning capabilities. Some less explored areas of NLIDB research are then presented, namely database updates, meta-knowledge questions, temporal questions, and multi-modal NLIDBs. The paper ends with reflections on the current state of the art.
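The central task of an NLIDB, translating a natural language question into a formal database query, can be illustrated with a minimal pattern-based sketch. The patterns, the `employees` table, and the `nl_to_sql` function here are hypothetical illustrations, not the architectures the paper surveys:

```python
import re

# Hypothetical toy NLIDB: maps a restricted class of English questions
# onto SQL over an assumed employees(name, department, salary) table.
PATTERNS = [
    # "which employees work in sales" -> filter on department
    (re.compile(r"which employees work in (\w+)", re.I),
     "SELECT name FROM employees WHERE department = '{0}'"),
    # "what is the salary of alice" -> lookup by name
    (re.compile(r"what is the salary of (\w+)", re.I),
     "SELECT salary FROM employees WHERE name = '{0}'"),
]

def nl_to_sql(question):
    """Return an SQL query for a recognised question, else None."""
    for pattern, template in PATTERNS:
        match = pattern.search(question)
        if match:
            return template.format(match.group(1).lower())
    return None
```

Real NLIDBs replace such surface patterns with syntactic and semantic analysis, which is precisely where the linguistic problems the paper introduces (ambiguity, modifier attachment, quantification) arise.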


1990 ◽  
Vol 5 (4) ◽  
pp. 225-249 ◽  
Author(s):  
Ann Copestake ◽  
Karen Sparck Jones

Abstract
This paper reviews the current state of the art in natural language access to databases. This has been a long-standing area of work in natural language processing. But though some commercial systems are now available, providing front ends has proved much harder than was expected, and the necessary limitations on front ends have to be recognized. The paper discusses the issues, both general to language and task-specific, involved in front end design, and the way these have been addressed, concentrating on the work of the last decade. The focus is on the central process of translating a natural language question into a database query, but other supporting functions are also covered. The points are illustrated by the use of a single example application. The paper concludes with an evaluation of the current state, indicating that future progress will depend on the one hand on general advances in natural language processing, and on the other on expanding the capabilities of traditional databases.


1999 ◽  
Vol 5 (1) ◽  
pp. 17-44 ◽  
Author(s):  
BRANIMIR BOGURAEV ◽  
CHRISTOPHER KENNEDY

The identification and extraction of technical terms is one of the better understood and most robust Natural Language Processing (NLP) technologies within the current state of the art of language engineering. In generic information management contexts, terms have been used primarily for procedures seeking to identify a set of phrases that is useful for tasks such as text indexing, computational lexicology, and machine-assisted translation: such tasks make important use of the assumption that terminology is representative of a given domain. This paper discusses an extension of basic terminology identification technology for the application to two higher level semantic tasks: domain description, the specification of the technical domain of a document, and content characterisation, the construction of a compact, coherent and useful representation of the topical content of a text. With these extensions, terminology identification becomes the foundation of an operational environment for document processing and content abstraction.
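A common baseline for the term identification technology described here filters part-of-speech tag sequences for noun-phrase patterns such as (ADJ|NOUN)* NOUN. The sketch below illustrates that generic baseline, not the paper's own extension; the function name and tag set are ours:

```python
# Illustrative sketch (not the paper's algorithm): collect term candidates
# as adjective/noun sequences ending in a noun, from (word, pos) pairs.
def extract_term_candidates(tagged_tokens, max_len=4):
    """Return the set of (ADJ|NOUN)* NOUN spans up to max_len tokens."""
    candidates = set()
    n = len(tagged_tokens)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            span = tagged_tokens[start:end]
            if (all(pos in {"ADJ", "NOUN"} for _, pos in span)
                    and span[-1][1] == "NOUN"):
                candidates.add(" ".join(word for word, _ in span))
    return candidates

sentence = [("statistical", "ADJ"), ("machine", "NOUN"),
            ("translation", "NOUN"), ("improves", "VERB"),
            ("quickly", "ADV")]
```

Candidate lists produced this way are usually ranked by frequency or a termhood measure before being used for indexing or, as in this paper, for domain description and content characterisation.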


Author(s):  
Carlos Ramisch ◽  
Aline Villavicencio

In natural-language processing, multiword expressions (MWEs) have been the focus of much attention in their many forms, including idioms, nominal compounds, verbal expressions, and collocations. In addition to their relevance for lexicographic and terminographic work, their ubiquity in language affects the performance of tasks like parsing, word sense disambiguation, and natural-language generation. They lend a mark of naturalness and fluency to applications that can deal with them, ranging from machine translation to information retrieval. This chapter presents an overview of their linguistic characteristics and discusses a variety of proposals for incorporating them into language technology, covering type-based discovery, token-based identification, and MWE-aware language technology applications.


2019 ◽  
Vol 7 ◽  
pp. 677-694
Author(s):  
Ellie Pavlick ◽  
Tom Kwiatkowski

We analyze humans' disagreements about the validity of natural language inferences. We show that, very often, disagreements are not dismissible as annotation "noise", but rather persist as we collect more ratings and as we vary the amount of context provided to raters. We further show that the type of uncertainty captured by current state-of-the-art models for natural language inference is not reflective of the type of uncertainty present in human disagreements. We discuss the implications of our results for the recognizing textual entailment (RTE)/natural language inference (NLI) task. We argue for a refined evaluation objective that requires models to explicitly capture the full distribution of plausible human judgments.
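The evaluation objective argued for here — scoring a model against the empirical distribution of human labels rather than a single gold label — can be sketched as follows. The function names and the use of total variation distance are our illustrative choices, not the paper's exact proposal:

```python
from collections import Counter

LABELS = ("entailment", "neutral", "contradiction")

def human_distribution(annotations):
    """Normalised label frequencies from a list of annotator labels."""
    counts = Counter(annotations)
    total = len(annotations)
    return {label: counts[label] / total for label in LABELS}

def total_variation(p, q):
    """Total variation distance between two label distributions:
    0.0 when they agree exactly, 1.0 when they are disjoint."""
    return 0.5 * sum(abs(p[l] - q[l]) for l in LABELS)
```

Under such an objective, a model that confidently predicts one label for an item that humans genuinely split 3-to-2 on is penalised, even if its argmax matches the majority vote.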


2015 ◽  
Vol 2015 ◽  
pp. 1-19 ◽  
Author(s):  
Jorge A. Vanegas ◽  
Sérgio Matos ◽  
Fabio González ◽  
José L. Oliveira

This paper presents a review of state-of-the-art approaches to automatic extraction of biomolecular events from scientific texts. Events involving biomolecules such as genes, transcription factors, or enzymes, for example, have a central role in biological processes and functions and provide valuable information for describing physiological and pathogenesis mechanisms. Event extraction from biomedical literature has a broad range of applications, including support for information retrieval, knowledge summarization, and information extraction and discovery. However, automatic event extraction is a challenging task due to the ambiguity and diversity of natural language and higher-level linguistic phenomena, such as speculations and negations, which occur in biological texts and can lead to misunderstanding or incorrect interpretation. Many strategies have been proposed in the last decade, originating from different research areas such as natural language processing, machine learning, and statistics. This review summarizes the most representative approaches in biomolecular event extraction and presents an analysis of the current state of the art and of commonly used methods, features, and tools. Finally, current research trends and future perspectives are also discussed.


2021 ◽  
pp. 1-23
Author(s):  
Yerai Doval ◽  
Jose Camacho-Collados ◽  
Luis Espinosa-Anke ◽  
Steven Schockaert

Abstract
Word embeddings have become a standard resource in the toolset of any Natural Language Processing practitioner. While monolingual word embeddings encode information about words in the context of a particular language, cross-lingual embeddings define a multilingual space where word embeddings from two or more languages are integrated together. Current state-of-the-art approaches learn these embeddings by aligning two disjoint monolingual vector spaces through an orthogonal transformation which preserves the structure of the monolingual counterparts. In this work, we propose to apply an additional transformation after this initial alignment step, which aims to bring the vector representations of a given word and its translations closer to their average. Since this additional transformation is non-orthogonal, it also affects the structure of the monolingual spaces. We show that our approach improves both the integration of the monolingual spaces and the quality of the monolingual spaces themselves. Furthermore, because our transformation can be applied to an arbitrary number of languages, we are able to effectively obtain a truly multilingual space. The resulting (monolingual and multilingual) spaces show consistent gains over the current state of the art in standard intrinsic tasks, namely dictionary induction and word similarity, as well as in extrinsic tasks such as cross-lingual hypernym discovery and cross-lingual natural language inference.
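The second step the abstract describes — pulling a word and its translation toward their average after the initial orthogonal alignment — can be sketched as below. This assumes the orthogonal alignment has already been applied; the function name and the interpolation weight `alpha` are our illustrative additions, not the paper's exact formulation:

```python
import numpy as np

# Sketch of the averaging refinement: for each translation pair, move both
# vectors toward the pair's mean. With alpha=1.0 both land exactly on the
# mean; alpha < 1.0 interpolates. Because unpaired rows are untouched,
# the overall map is non-orthogonal and reshapes the monolingual spaces.
def average_refine(src, tgt, pairs, alpha=1.0):
    """src, tgt: (n_words, dim) aligned embedding matrices.
    pairs: list of (src_index, tgt_index) translation pairs."""
    src, tgt = src.copy(), tgt.copy()
    for i, j in pairs:
        mean = (src[i] + tgt[j]) / 2.0
        src[i] = (1 - alpha) * src[i] + alpha * mean
        tgt[j] = (1 - alpha) * tgt[j] + alpha * mean
    return src, tgt
```

Applying the same averaging across several language pairs is what lets the approach scale to a single shared multilingual space.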

