Corpus Linguistics: Quantitative Methods

Author(s):  
Stefan Th. Gries
2016
Vol 21 (4)
pp. 439–464
Author(s):  
Douglas Biber ◽  
Randi Reppen ◽  
Erin Schnur ◽  
Romy Ghanem

This paper explores the effectiveness of Juilland’s D as a measure of vocabulary dispersion in large corpora. Through a series of experiments using the BNC, we explore the influence of three variables: the number of corpus parts used to compute D, the frequency of the target word, and the distribution of those words across parts. The experiments demonstrate that the effective range of D is greatly reduced when computations are based on a large number of corpus parts: even words with highly skewed distributions receive D values that indicate a relatively uniform distribution. We also briefly explore an alternative measure, Gries’ DP (Gries 2008), and show that it is a more reliable and effective measure of dispersion in a large corpus divided into many parts. In conclusion, we discuss the implications of these findings for quantitative methods applied to the creation of vocabulary lists, as well as for research questions in other areas of corpus linguistics.
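To make the two measures concrete, here is a minimal sketch (our own illustration, not the authors’ code; the toy frequencies and part sizes are invented) of Juilland’s D and Gries’ DP for a word counted across n corpus parts. It also reproduces the extreme case the abstract alludes to: a maximally bursty word gets D = 0 and DP close to 1.

```python
import math

def juilland_d(freqs, part_sizes):
    """Juilland's D = 1 - V / sqrt(n - 1), where V is the coefficient of
    variation of the word's relative frequencies across the n parts."""
    rel = [f / s for f, s in zip(freqs, part_sizes)]
    n = len(rel)
    mean = sum(rel) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in rel) / n)  # population sd
    return 1 - (sd / mean) / math.sqrt(n - 1)

def gries_dp(freqs, part_sizes):
    """Gries' DP (2008): half the summed absolute differences between each
    part's share of the word's tokens and that part's share of the corpus."""
    total_f, total_s = sum(freqs), sum(part_sizes)
    return 0.5 * sum(abs(f / total_f - s / total_s)
                     for f, s in zip(freqs, part_sizes))

# A maximally skewed toy case: all 100 tokens fall into 1 of 10 equal parts.
freqs = [100] + [0] * 9
sizes = [1000] * 10
print(juilland_d(freqs, sizes))  # 0.0  -> maximally bursty
print(gries_dp(freqs, sizes))    # 0.9  -> far from the even baseline
```

As n grows, the effective range of D shrinks for realistically skewed words, which is the compression effect the experiments quantify; DP’s even-distribution baseline does not behave this way.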


2018
Vol 1 (2)
pp. 277–309
Author(s):  
Stefan Th. Gries

Abstract This paper critically discusses how corpus linguistics in general, and learner corpus research in particular, has dealt with frequency data of all sorts, and with over- and underuse frequencies in particular. On the basis of learner corpus data, I demonstrate the pitfalls of the aggregate data and the lack of statistical control that much work is unfortunately characterized by. In fact, I will demonstrate that monofactorial methods have very little to offer to research on observational data. While this paper is admittedly very didactic and methodological, I think the discussion of the empirical data offered here – a reanalysis of previously published work – shows how misleading many studies potentially are, and it has far-reaching implications for much of corpus linguistics and learner corpus research. Ideally, this paper, together with Paquot & Plonsky (2017, International Journal of Learner Corpus Research), would lead to a complete revision of how learner corpus linguists use quantitative methods and study over-/underuse; minimally, it should stimulate a much-needed discussion of the field’s currently lacking methodological sophistication.
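To see the kind of pitfall at issue, consider this small invented example (ours, not from the paper): pooled learner counts suggest overuse of an item, but the per-file rates show that a single outlier file produces the entire aggregate effect.

```python
# (hit_count, token_count) per corpus file; all figures are invented.
learner_files = {"L01": (40, 1000), "L02": (2, 1000), "L03": (3, 1000)}
native_files  = {"N01": (5, 1000), "N02": (6, 1000), "N03": (4, 1000)}

def pooled_rate(files):
    """Aggregate relative frequency: total hits over total tokens."""
    hits = sum(h for h, _ in files.values())
    toks = sum(t for _, t in files.values())
    return hits / toks

print(pooled_rate(learner_files))  # 0.015 -- looks like clear overuse ...
print(pooled_rate(native_files))   # 0.005

# ... but the file-level rates tell a different story:
for name, (h, t) in {**learner_files, **native_files}.items():
    print(name, h / t)  # L01 alone (0.040) drives the aggregate difference
```

A multifactorial model with per-file (or per-learner) terms would catch this; a monofactorial test on the pooled 2-by-2 table would not.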


Author(s):  
Marco Marelli

Corpora are an all-important resource in linguistics, as they constitute the primary source of large-scale examples of language usage. This has become even more evident in recent years, with the increasing availability of texts in digital format steering corpus linguistics more and more toward a “big data” approach. As a consequence, the quantitative methods adopted in the field are becoming more sophisticated and varied. When it comes to morphology, corpora represent a primary source of evidence for describing morpheme usage, and in particular how often a given morphological pattern is attested in a language. There is hence a tight relation between corpus linguistics and the study of morphology and the lexicon. This relation, however, is bidirectional. On the one hand, corpora are used as a source of evidence to develop metrics and train computational models of morphology: by means of corpus data it is possible to quantitatively characterize morphological notions such as productivity, and corpus data are fed to computational models to capture morphological phenomena at different levels of description. On the other hand, morphology has also been applied as an organizing principle for corpora. Annotations of linguistic data often adopt morphological notions as guidelines. The resulting information, whether obtained from human annotators or from automatic systems, makes corpora easier to analyze and more convenient to use in a number of applications.
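As one concrete instance of the productivity metrics mentioned above, the sketch below computes Baayen’s hapax-based potential productivity P = V1/N for a derivational pattern; the token list and the crude “-ness” matcher are simplifying assumptions of ours, not part of the source.

```python
from collections import Counter

# Invented corpus tokens; a real study would lemmatize and parse affixes.
tokens = ["darkness", "darkness", "happiness", "blueness",
          "kindness", "kindness", "kindness", "greenness"]
counts = Counter(w for w in tokens if w.endswith("ness"))

N = sum(counts.values())                        # token frequency of the pattern
V1 = sum(1 for c in counts.values() if c == 1)  # hapax legomena with the affix
print(V1 / N)  # 0.375 here; a higher P suggests a more productive pattern
```

Counts of this kind are also what gets fed to the computational models of morphology the text mentions.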


2021
Author(s):  
Paul Baker ◽  
Rachelle Vessey ◽  
Tony McEnery

How do violent jihadists use language to try to persuade people to carry out violent acts? This book analyses over two million words of texts produced by violent jihadists to identify and examine the linguistic strategies they employ. Taking a mixed-methods approach, the authors combine quantitative methods from corpus linguistics, which allow the identification of frequent words and phrases, with close reading of texts via discourse analysis. The analysis compares language use across three sets of texts: those which advocate violence, those which take a hostile but non-violent standpoint, and those which take a moderate perspective, identifying the different uses of language associated with different stages of radicalisation. The book also discusses how strategies including the use of Arabic, romanisation, formal English, quotation, metaphor, dehumanisation and collectivisation are used to create in- and out-groups and to justify violence.
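The quantitative side of such a comparison can be pictured with a standard keyness statistic; the sketch below (ours, not the authors’ pipeline, with invented counts) scores how strongly a word distinguishes one set of texts from another using Dunning’s log-likelihood.

```python
import math

def log_likelihood(a, b, total_a, total_b):
    """Dunning's log-likelihood for a word occurring a times in corpus A
    (total_a tokens) and b times in corpus B (total_b tokens)."""
    e_a = total_a * (a + b) / (total_a + total_b)  # expected count in A
    e_b = total_b * (a + b) / (total_a + total_b)  # expected count in B
    ll = 0.0
    for obs, exp in ((a, e_a), (b, e_b)):
        if obs > 0:
            ll += obs * math.log(obs / exp)
    return 2 * ll

# Invented counts: 120 hits per 100k tokens in one set vs 30 in the other.
print(log_likelihood(120, 30, 100_000, 100_000))  # ~57.8: strongly "key"
```

Words and phrases that score highly in the violence-advocating texts but not in the comparison sets are then candidates for close discourse-analytic reading.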


Author(s):  
Sascha Hinkel ◽  
Jörg Hörnschemeyer

Abstract Eugenio Pacelli, who would later become Pope Pius XII, was already considered a leading diplomat of the Holy See while serving as nuncio in Germany. Using the critical online edition of his nunciature reports, digital humanities methodologies are employed to explore which topics dominated his reports, what characterized the diplomatic style of his nunciature, and whether Pacelli adapted the form and content of his reports to their various recipients. To this end, quantitative methods of textual analysis are used, such as the frequency distribution of types in relation to tokens (the type-token ratio), key terms, and multi-word units (n-grams). Moreover, methods from corpus linguistics, information retrieval (TF-IDF) and quantitative stylometry (contrastive analyses with Burrows’ Delta and Zeta) are applied to evaluate both stylistic and content-related questions.
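As a rough illustration of two of the measures named here, the following sketch (ours, using toy texts rather than the nunciature reports; document names and word counts are invented) computes type-token ratios and a bare-bones Burrows’ Delta over the most frequent words.

```python
from collections import Counter
import math

docs = {"report_a": "the pope wrote the report the nuncio sent".split(),
        "report_b": "the nuncio sent a long report to the pope".split(),
        "report_c": "a short note the nuncio wrote in haste".split()}

def ttr(tokens):
    """Type-token ratio: vocabulary size over text length."""
    return len(set(tokens)) / len(tokens)

# Relative frequencies of the n most frequent words (MFW) in the whole corpus.
mfw = [w for w, _ in Counter(t for d in docs.values() for t in d).most_common(5)]
rel = {name: {w: d.count(w) / len(d) for w in mfw} for name, d in docs.items()}

def zscores(word):
    """z-score each document's relative frequency of `word` against the
    corpus-wide mean and standard deviation for that word."""
    vals = [rel[name][word] for name in docs]
    mean = sum(vals) / len(vals)
    sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1.0
    return {name: (rel[name][word] - mean) / sd for name in docs}

z = {w: zscores(w) for w in mfw}

def burrows_delta(d1, d2):
    """Burrows' Delta: mean absolute z-score difference over the MFW;
    a lower Delta means a more similar frequency profile."""
    return sum(abs(z[w][d1] - z[w][d2]) for w in mfw) / len(mfw)

print({name: round(ttr(t), 2) for name, t in docs.items()})
print(round(burrows_delta("report_a", "report_b"), 2))  # closer ...
print(round(burrows_delta("report_a", "report_c"), 2))  # ... than this
```

TF-IDF and Zeta likewise normalize raw counts before comparing documents, which is what makes stylistic contrasts between recipients measurable.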


2017
Vol 47 (2)
pp. 18–25
Author(s):  
Iveta Dinžíková

This article studies phrasemes comprising an ethnonym in the source language (French) as well as in the target language (Slovak). The approach is contrastive, and the phrasemes have been classified according to their type of equivalence (total equivalents, partial equivalents, and phrasemes without an equivalent). The aim of the research was to analyse 27 phrasemes with the help of corpus linguistics methods (relative frequency and the logDice association measure) and four monolingual corpora (frTenTen12, skTenTen11, Emolex and prim-7.0-public-all), ranging from approx. 130 million to approx. 10 billion words each, so the analysis rests on a fairly wide range of language material.

Firstly, we survey the current state of French and Slovak phraseology. We present a distribution of phrasemes into three types: general, professional and so-called mixed (the last type being our own proposal). Then, moving from the source language-culture to the target language-culture, we outline the three basic types of phraseme equivalence, focusing on the first two. Afterwards, we present the quantitative methods of corpus linguistics used (the four monolingual corpora, relative frequency and the logDice association measure). We then analyse the 27 phrasemes themselves, both qualitatively (their distribution across the three types of equivalence and across general, professional and mixed phrasemes) and quantitatively (analyses based on relative frequency and on the logDice association measure). Finally, we present and evaluate the results of the research.

The research objectives are to establish the frequency of the phrasemes in various types of texts and their degree of specificity within each corpus. On this basis it becomes possible to propose which phrasemes should be given priority as dictionary look-up entries, whether under their individual components or as whole units, and thus to support current French/Slovak lexicography and phraseography.

Moreover, from the point of view of foreign language teaching, the second type of equivalence can serve as a contrasting factor between the French and Slovak language-cultures, since such phrasemes can easily interfere with other phrasemes of the target language-culture or may not be well understood in it.
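For concreteness, here is a minimal sketch of the two corpus measures the study relies on; the counts are invented for illustration and do not come from the four corpora named above.

```python
import math

def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2 * f(x,y) / (f(x) + f(y))); its maximum is 14,
    and unlike raw mutual information it is largely frequency-independent."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

def rel_freq_pmw(hits, corpus_size):
    """Relative frequency per million words, comparable across corpora
    of very different sizes (here: ~130M up to ~10B tokens)."""
    return hits * 1_000_000 / corpus_size

# e.g. a phraseme component pair: f(x)=1200, f(y)=800, co-occurring 150 times
print(round(log_dice(150, 1200, 800), 2))       # 11.26: a strong association
print(round(rel_freq_pmw(150, 130_000_000), 2)) # 1.15 occurrences per million
```

Because logDice is stable across corpus sizes, scores from corpora spanning two orders of magnitude, as here, can be compared directly.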


Author(s):  
Gard B. Jenset ◽  
Barbara McGillivray

An innovative guide to quantitative, corpus-based research in historical and diachronic linguistics, this book provides an original and thoroughly worked-out methodological framework which encompasses the entire research process. The authors argue that, although historical linguistics has been successful in using the comparative method, the field lags behind other branches of linguistics in adopting quantitative methods. In a theoretically agnostic way, the book provides a framework for quantitatively assessing models and hypotheses in historical linguistics on the basis of corpus data. Using case studies, the authors illustrate how research questions in historical linguistics can be answered within a framework of quantitative corpus linguistics. With an eye to the needs of researchers, the book explains and exemplifies the benefits of working with quantitative methods, corpus data and corpus annotation, as well as the benefits of open and reproducible research. Historical corpora, corpus annotation and historical language resources are discussed in depth, with the aim of enabling researchers to identify appropriate existing resources or to create their own. The view of quantitative corpus linguistics advocated here offers a unified account of how these methods fit into the bigger picture of historical linguistics research.

