The Oxford Handbook of Computational Linguistics 2nd edition
Latest Publications

Total documents: 54 (last five years: 6)
H-index: 2 (last five years: 0)
Published by: Oxford University Press
ISBN: 9780199573691

Author(s):  
Kevin Bretonnel Cohen

Computational linguistics has its origins in the post-Second World War research on translation of Russian-language scientific journal articles in the United States. Today, biomedical natural language processing treats clinical data, the scientific literature, and social media, with use cases ranging from studying adverse effects of drugs to interpreting high-throughput genomic assays (Névéol and Zweigenbaum 2018). Many of the most prominent research areas in the field involve extracting information from text and normalizing it to enormous databases of domain-relevant semantic classes, such as genes, diseases, and biological processes. Moving forward, the field is expected to play a significant role in understanding reproducibility in natural language processing.


Author(s):  
Constantin Orasan ◽  
Ruslan Mitkov

Natural Language Processing (NLP) is a dynamic and rapidly developing field in which new trends, techniques, and applications are constantly emerging. This chapter focuses mainly on recent developments in NLP which could not be covered in other chapters of the Handbook. Topics such as crowdsourcing and the processing of large datasets, which are no longer especially recent but are widely used and not covered at length in any other chapter, are also presented. The chapter starts by describing how the availability of tools and resources has had a positive impact on the field. The proliferation of user-generated content has led to the emergence of research topics such as sarcasm and irony detection, automatic assessment of user-generated content, and stance detection, all of which are discussed in the chapter. The field of NLP is approaching maturity, a fact corroborated by the latest developments in the processing of texts for financial purposes and for helping users with disabilities, two topics that are also discussed here. The chapter presents examples of how researchers have successfully combined research in computer vision and natural language processing to enable the processing of multimodal information, as well as how the latest advances in deep learning have revitalized research on chatbots and conversational agents. The chapter concludes with a comprehensive list of further reading material and additional resources.


Author(s):  
Ruslan Mitkov

This chapter provides a theoretical background on anaphora and introduces the varieties of this pervasive linguistic phenomenon. Next, it defines the task of anaphora resolution and presents it as a three-stage process: identification of anaphors, location of candidate antecedents, and selection of an antecedent by the resolution algorithm. The chapter then outlines a selection of influential and extensively cited anaphora resolution algorithms and discusses issues related to their evaluation. Recent deep learning work on anaphora and coreference resolution is also briefly presented. Finally, the chapter explains why anaphora resolution is important for various NLP applications.
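
A minimal sketch may help make the three-stage process concrete. The code below is not any specific published algorithm; its helper names and rules (recency plus a simple gender check) are invented for illustration only. It identifies pronominal anaphors, collects noun-phrase candidates as they are encountered, and resolves each anaphor to the most recent agreeing candidate.

```python
# Illustrative three-stage anaphora resolution sketch (invented, not from the chapter):
# stage 1 identifies pronominal anaphors, stage 2 collects candidate antecedents,
# stage 3 picks the most recent candidate that agrees in gender.

PRONOUNS = {"he": "masc", "she": "fem", "it": "neut"}

def resolve_anaphors(tokens):
    """tokens: list of (word, tag, gender) triples in document order,
    where tag is 'NP' for candidate antecedents or 'PRON' for anaphors."""
    resolved = []
    candidates = []                                   # stage 2: candidates seen so far
    for i, (word, tag, gender) in enumerate(tokens):
        if tag == "NP":
            candidates.append((i, word, gender))
        elif tag == "PRON" and word.lower() in PRONOUNS:    # stage 1: identify anaphor
            wanted = PRONOUNS[word.lower()]
            # stage 3: most recent candidate that agrees with the pronoun
            for j, cand, cand_gender in reversed(candidates):
                if cand_gender == wanted:
                    resolved.append((word, i, cand, j))
                    break
    return resolved

sample = [("Mary", "NP", "fem"), ("met", "V", None), ("John", "NP", "masc"),
          (".", ".", None), ("She", "PRON", "fem"), ("smiled", "V", None)]
print(resolve_anaphors(sample))   # [('She', 4, 'Mary', 0)]
```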


Author(s):  
Patrick Hanks

This chapter discusses computational lexicography in two senses: the function of a lexicon in computer applications, and the use of computational techniques in compiling dictionaries. After a short historical survey, the chapter distinguishes scholarly dictionaries based on historical principles from practical synchronic dictionaries of contemporary words and meanings. Only the latter are suitable for computational applications, but many computational linguists are unaware of the difference. The chapter goes on to describe the ways in which computational techniques are bringing about radical changes in the methodology of compiling new dictionaries. It argues that future dictionaries, if they are to be maximally useful to both learners and computer programs, will need to make a more serious effort to report the stereotypical phraseology that is associated with each meaning of a word and the ways in which these stereotypes are exploited. Current developments and future possibilities are surveyed. The chapter closes with some suggestions for further reading, and is designed to be read in conjunction with Chapter 3 of this volume.
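
As a small illustration of the corpus-driven methodology the chapter refers to, the sketch below scores adjacent word pairs in a toy corpus by pointwise mutual information (PMI), one common way of surfacing recurrent phraseology. The corpus is invented, and real lexicographic pipelines work over much larger corpora and grammatical relations rather than raw bigrams.

```python
# PMI over adjacent word pairs in a toy corpus: a sketch of how corpus
# statistics can surface the recurrent phraseology dictionaries might report.
import math
from collections import Counter

corpus = ("she took a deep breath . he took a deep interest in the case . "
          "a deep breath helps .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n = len(corpus)

def pmi(w1, w2):
    p_xy = bigrams[(w1, w2)] / (n - 1)
    p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
    return math.log2(p_xy / (p_x * p_y)) if p_xy else float("-inf")

# Collocates of 'deep', ranked by PMI
pairs = sorted(((pmi("deep", w2), w2) for (w1, w2) in bigrams if w1 == "deep"),
               reverse=True)
for score, w2 in pairs:
    print(f"deep {w2}: PMI = {score:.2f}")
```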


Author(s):  
Michael P. Oakes

Author profiling is the analysis of people’s writing in an attempt to find out which classes they belong to, such as gender, age group, or native language. Many of the techniques for author profiling are derived from the related task of author identification, so we will look at this topic first. Author identification is the task of finding out who is most likely to have written a disputed document, and there are a number of computational approaches to this. The three main subtasks are the compilation of corpora of texts known to be written by the candidate authors, the selection of linguistic features to represent those texts, and the use of statistics to determine which features are most indicative of a particular author’s writing style. Plagiarism is the unacknowledged use of another author’s original work, and we will look at software for its detection. The chapter covers the types of text obfuscation strategies used by plagiarists, commercial plagiarism detection software and its shortcomings, and recent research systems. Strategies have been developed for both external plagiarism detection (where the original source is searched for in a large document collection) and intrinsic plagiarism detection (where the source text is not available, necessitating a search for inconsistencies within the suspicious document). The specific problems of plagiarism by translation of an original written in another language, and the unauthorized copying of sections of computer code, are also described. Evaluation forums and publicly available test data sets are covered for each of the main topics of this chapter.
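
As a hedged illustration of the feature-based approach to author identification outlined above, the following toy example represents each text by the relative frequencies of a fixed set of function words and attributes a disputed text to the nearest known author, in the spirit of Burrows’ Delta. The texts, word list, and distance measure are placeholders, not material taken from the chapter.

```python
# Toy feature-based author identification: function-word frequencies plus a
# simple distance measure (a simplified Delta). Corpora here are placeholders.
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "was", "he"]

def profile(text):
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(p, q):
    # Mean absolute difference of the feature frequencies
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

known = {
    "Author A": "the cat sat on the mat and it was the best of times",
    "Author B": "he said that to a friend in a letter that he wrote",
}
disputed = "it was the age of wisdom and the age of foolishness"

scores = {name: distance(profile(text), profile(disputed))
          for name, text in known.items()}
print(min(scores, key=scores.get), scores)
```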


Author(s):  
Christer Samuelsson ◽  
Sanja Štajner

Current mainstream natural language processing tasks rely heavily on statistical methods. This chapter presents statistical methods from their fundamentals to more complex techniques, including hidden Markov and maximum entropy models applied in language modelling, and the expectation maximization method used in machine translation. It also introduces robust estimation and methods for calculating inter-annotator agreement and statistical significance, and closes with suggestions for further reading and some mathematical details of the methods presented.
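
Of the methods mentioned, inter-annotator agreement is easy to illustrate compactly. The sketch below computes Cohen’s kappa for two annotators over invented labels; it is a minimal illustration, not code from the chapter.

```python
# Minimal Cohen's kappa for two annotators; the label sequences are invented.
from collections import Counter

def cohens_kappa(ann1, ann2):
    assert len(ann1) == len(ann2)
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Expected agreement by chance, from each annotator's label distribution
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum((c1[l] / n) * (c2[l] / n) for l in set(ann1) | set(ann2))
    return (observed - expected) / (1 - expected)

a1 = ["POS", "POS", "NEG", "NEG", "POS", "NEG", "POS", "NEG"]
a2 = ["POS", "NEG", "NEG", "NEG", "POS", "NEG", "POS", "POS"]
print(f"kappa = {cohens_kappa(a1, a2):.3f}")   # 0.500 for this toy data
```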


Author(s):  
Andrei Mikheev

Electronic text is essentially just a sequence of characters, but the majority of text processing tools operate in terms of linguistic units such as words and sentences. Tokenization is the process of segmenting text into words, and sentence splitting is the process of determining sentence boundaries in the text. In this chapter we describe the major challenges of text tokenization and sentence splitting in different languages, and outline various computational approaches to tackling them.
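
To make the baseline concrete, here is a deliberately naive regex-based tokenizer and sentence splitter for English (invented for illustration, not taken from the chapter). Its failure on abbreviations such as “Mr.” is exactly the kind of challenge the approaches discussed in the chapter address.

```python
# Naive regex-based tokenizer and sentence splitter. It mishandles
# abbreviations ("Mr."), numbers like "3.14", and languages written
# without spaces, which motivates the approaches described in the chapter.
import re

def tokenize(text):
    # Words (with internal apostrophes/hyphens) or single punctuation marks
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

def split_sentences(text):
    # Split after ., ! or ? followed by whitespace and an uppercase letter
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

text = "Mr. Smith arrived. He didn't stay long! Was that a surprise?"
for sent in split_sentences(text):
    print(tokenize(sent))   # note the spurious break after "Mr."
```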


Author(s):  
Omer Levy

A fundamental challenge in natural-language processing is to represent words as mathematical entities that can be read, reasoned about, and manipulated by computational models. The current leading approach represents words as vectors in a continuous real-valued space, in such a way that similarities in the vector space correlate with semantic similarities between words. This chapter surveys various frameworks and methods for acquiring word vectors, while tying together related ideas and concepts.
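
As one concrete instance of the count-based family of methods such a survey covers, the sketch below builds word vectors from a toy corpus by reweighting a word–context co-occurrence matrix with positive pointwise mutual information (PPMI) and compares words by cosine similarity. The corpus, window size, and code are illustrative assumptions, not material from the chapter.

```python
# Count-based word vectors: co-occurrence counts -> PPMI weights -> cosine similarity.
import math
from collections import defaultdict

corpus = ["the cat sat on the mat", "the dog sat on the rug",
          "a cat chased a dog"]
window = 2

cooc = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if j != i:
                cooc[w][toks[j]] += 1

vocab = sorted({w for sent in corpus for w in sent.split()})
total = sum(sum(ctx.values()) for ctx in cooc.values())
w_count = {w: sum(cooc[w].values()) for w in vocab}
c_count = {c: sum(cooc[w][c] for w in vocab) for c in vocab}

def ppmi_vector(w):
    vec = []
    for c in vocab:
        count = cooc[w][c]
        if count:
            pmi = math.log2(count * total / (w_count[w] * c_count[c]))
            vec.append(max(pmi, 0.0))
        else:
            vec.append(0.0)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(ppmi_vector("cat"), ppmi_vector("dog")))   # shared contexts -> higher similarity
print(cosine(ppmi_vector("cat"), ppmi_vector("on")))
```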


Author(s):  
Thierry Dutoit ◽  
Yannis Stylianou

Text-to-speech (TTS) synthesis is the art of designing talking machines. Seen from this functional perspective, the task looks simple, but this chapter shows that delivering intelligible, natural-sounding, and expressive speech, while also taking into account engineering costs, is a real challenge. Speech synthesis has made a long journey from the big controversy of the 1980s between MIT’s formant synthesis and Bell Labs’ diphone-based concatenative synthesis. While unit selection technology, which appeared in the mid-1990s, can be seen as an extension of diphone-based approaches, the appearance of hidden Markov model (HMM) synthesis around 2005 resulted in a major shift back to models. More recently, statistical approaches supported by advanced deep learning architectures have been shown to advance text analysis and normalization as well as waveform generation. Important recent milestones have been Google’s WaveNet (September 2016) and the sequence-to-sequence models referred to as Tacotron (I and II).


Author(s):  
Ronald M. Kaplan

This chapter introduces some of the phenomena that theories of natural-language syntax aim to account for. It briefly discusses the correspondence between the sentences of a language and the semantic predicate-argument relations that they express, indicating how that correspondence is encoded in terms of word order, phrase structure, agreement, and valence. It surveys some of the grammatical notations, syntactic representations, and theoretical approaches that have figured prominently in linguistic research and that have particularly influenced the development of natural-language processing algorithms and implementations.
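
A toy phrase-structure grammar can make the encoding of predicate-argument relations concrete. The grammar below is invented for illustration and parsed with NLTK’s chart parser; it is not drawn from the chapter.

```python
# Toy phrase-structure grammar: constituent structure encodes who did what
# to whom. The grammar covers only the two example sentences below.
import nltk

grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> Det N | PropN
  VP -> V NP
  Det -> 'the' | 'a'
  N  -> 'dog' | 'cat'
  V  -> 'chased' | 'saw'
  PropN -> 'Kim'
""")

parser = nltk.ChartParser(grammar)
for sentence in ["the dog chased a cat", "Kim saw the dog"]:
    for tree in parser.parse(sentence.split()):
        tree.pretty_print()
```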

