Making corpus data visible: visualising text with research intermediaries

Corpora ◽  
2017 ◽  
Vol 12 (3) ◽  
pp. 459-482 ◽  
Author(s):  
William Allen

Researchers using corpora can visualise their data and analyses using a growing number of tools. Visualisations are especially valuable in environments where researchers communicate and work with public-facing partners under the auspices of ‘knowledge exchange’ or ‘impact’, and corpus data are more available thanks to digital methods. However, although the field of corpus linguistics continues to generate its own range of techniques, it largely remains orientated towards finding ways for academics to communicate results directly with other academics rather than with or through groups outside universities. There is also a lack of discussion about how communication, motivations and values feature in the process of making corpus data visible. My argument is that these sociocultural and practical factors influence visualisation outputs alongside technical aspects. I draw upon two corpus-based projects about press portrayal of migrants, conducted by an intermediary organisation that links university researchers with users outside academia. Analysing these projects' visualisation outputs in their organisational and communication contexts produces key lessons for researchers wanting to visualise text: consider the aims and values of partners; develop communication strategies that acknowledge different areas of expertise; and link visualisation choices with wider project objectives.

Author(s):  
Erla Hallsteinsdóttir

Multiword expressions – i.e. phraseological units – like idioms and collocations are among the most interesting parts of every language. In this article, I investigate phraseological units from a lexicographical point of view. I discuss the theoretical and methodological basis of phraseography as a discipline that includes aspects of lexicography, phraseology, corpus linguistics and theories of language learning. I demonstrate the importance of corpora as a source for the lexicographer and the use of corpus data. I also discuss the requirements for the lexicographical treatment of phraseological units in the compilation of a phraseological database for language learners, in relation to their assumed needs, which have already been described in detail.


2021 ◽  
Vol 8 (2) ◽  
pp. 79-91
Author(s):  
Zuraidah Mohd Don ◽  
Gerry Knowles

This paper is intended for researchers involved in or contemplating research in corpus linguistics, and is concerned in particular with the language of corpus linguistics. It introduces and explains technical terms in the context in which they are normally used. Technical terms lead on to the concepts to which they refer, and the concepts are related to the procedures, including tagging and parsing, by which they are implemented. English and Malay are used as the languages of illustration, and for the benefit of readers who do not know Malay, Malay examples are translated into English. The paper has a historical dimension, and the language of corpus linguistics is traced to traditional usage in the language classroom, and in particular to the study of Latin in Europe. The inheritance from the past is evident in the design of MaLex, which is a working device that does empirical Malay corpus linguistics, and is presented here as a contribution to the digital humanities.


Author(s):  
Stefan Th. Gries

This paper discusses the degree to which some of the most widely-used measures of association in corpus linguistics are not particularly valid in the sense of actually measuring association rather than some amalgam of a lot of frequency and a little association. The paper demonstrates these issues on the basis of hypothetical and actual corpus data and outlines implications of the findings. I then outline how to design an association measure that only measures association and show that its behavior supports the use of the log odds ratio as a true association-only measure but separately from frequency; in addition, this paper sets the stage for an analogous review of dispersion measures in corpus linguistics.
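
As an illustration of the log odds ratio mentioned in this abstract, here is a minimal sketch in Python. It assumes a standard 2x2 co-occurrence table; the cell names `a`, `b`, `c`, `d`, the function name and the 0.5 zero-cell correction are assumptions of this sketch and are not taken from the paper.

```python
import math


def log_odds_ratio(a, b, c, d, correction=0.5):
    """Natural-log odds ratio for a 2x2 co-occurrence table.

    Hypothetical cell layout (an assumption of this sketch):
        a = word occurs in the construction
        b = word occurs elsewhere
        c = other words occur in the construction
        d = other words occur elsewhere
    """
    if min(a, b, c, d) < 0:
        raise ValueError("cell counts must be non-negative")
    # If any cell is zero, add a small constant to every cell so that the
    # ratio stays finite (a common continuity correction, assumed here).
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    return math.log((a * d) / (b * c))


# Same odds in both rows: no association.
no_association = log_odds_ratio(10, 90, 100, 900)

# Multiplying the word's row by ten changes its frequency, not its odds,
# so the two values below are equal.
rare_word = log_odds_ratio(20, 80, 100, 900)
frequent_word = log_odds_ratio(200, 800, 100, 900)
```

In this sketch, a word that is ten times more frequent but distributed in the same proportions receives the same score, which is the sense in which the measure reflects association rather than frequency.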


2020 ◽  
pp. 1329878X2094712
Author(s):  
Monika Bednarek ◽  
Georgia Carr

Digital methods are becoming more and more important for text analysis in communications research. However, many computational methods require either relevant technical expertise or multi-disciplinary collaboration, which has impeded their uptake. This article introduces an alternative: computer-assisted linguistic analysis (corpus linguistics), an approach that is increasingly being used outside linguistics and requires less expertise. The article uses a dataset of almost 700 items of health news to demonstrate how such techniques can aid the analysis of (dis)preferred language, sources, stigma and responsibility, framing, and project-specific text analysis. We conclude with an evaluation of the key advantages and limitations of corpus linguistic analysis.


2021 ◽  
Vol 9 (1) ◽  
pp. 35-62
Author(s):  
Nele Põldvere ◽  
Johan Frid ◽  
Victoria Johansson ◽  
Carita Paradis

This article aims to describe key challenges of preparing and releasing audio material for spoken data and to propose solutions to these challenges. We draw on our experience of compiling the new London-Lund Corpus 2 (LLC-2), where transcripts are released together with the audio files. However, making the audio material publicly available required careful consideration of how to, most effectively, 1) align the transcripts with the audio and 2) anonymise personal information in the recordings. First, audio-to-text alignment was solved through the insertion of timestamps in front of speaker turns in the transcription stage, which, as we show in the article, may later be used as a valuable complement to more robust automatic segmentation. Second, anonymisation was done by means of a Praat script, which replaced all personal information with a sound that made the lexical information incomprehensible but retained the prosodic characteristics. The public release of the LLC-2 audio material is a valuable feature of the corpus that allows users to extend the corpus data relative to their own research interests and, thus, broaden the scope of corpus linguistics. To illustrate this, we present three studies that have successfully used the LLC-2 audio material.


2018 ◽  
Vol 26 (4) ◽  
pp. 1531 ◽  
Author(s):  
George Christodoulides

In this article we investigate the acoustic correlates of prosodic boundaries in French speech. We compare the prosodic structure annotation performed by experts in two multi-genre corpora (Rhapsodie and LOCAS-F). A uniform analysis procedure is applied to both corpora. The results show that the main acoustic correlates of prosodic boundaries are silent pauses and pre-boundary syllable lengthening. Pitch movements contribute to the perception of boundaries but are essentially correlates of boundary function, rather than boundary strength. Two levels of the four-level annotation of boundary strength in the Rhapsodie corpus (periods and packages) correspond to the two levels of strength in the LOCAS-F corpus.

Keywords: prosody; speech segmentation; prosodic boundaries; corpus linguistics; French.


2021 ◽  
pp. 162-177
Author(s):  
Antra Kļavinska ◽  

Several text corpora have been created in Latvia, including learner corpora. One of the latest projects is the Latvian Language Learner Corpus (LaVA), which contains the works of international students studying in Latvian higher education institutions who are learning Latvian as a foreign language. The texts are morphologically tagged automatically, and learner errors are tagged manually. A sufficient scope of publications is available, which provides the theoretical basis for the creation of Latvian language learner corpora; however, there is a lack of studies or practical methodological guidelines concerning the opportunities for their application, and there is little data about the use of text corpora in language acquisition. The aim of this study is to explain from the theoretical perspective for what purposes learner corpus data may be used, as well as to illustrate the methodological groundwork with examples from the LaVA corpus. Analysis of theoretical literature has demonstrated the functions and meaning of learner corpora in research, and experience with the use of corpora in acquiring a foreign language has been analysed. Examples of the use of the LaVA corpus as a didactic resource have been prepared using Corpus Linguistics methods. The study was conducted within the state research programme project “The Latvian Language”. After studying the functions of learner corpora from the theoretical perspective, it was concluded that the target audience of the LaVA corpus mainly includes teachers of Latvian as a foreign language (LATS), authors of teaching materials, as well as Latvian language learners. To facilitate the use of the LaVA corpus, it is important to have basic knowledge of Corpus Linguistics, an understanding of the theory of language, as well as an understanding of foreign language teaching methodology. 
LATS teachers can use the LaVA corpus data in the creation of curricula and teaching materials, in the preparation of language proficiency tests, etc. Using the inductive approach in language acquisition, language learners can also become language researchers, can analyse the errors of other learners, etc. Undeniably, the LaVA corpus can be used in broader linguistic research, for example, in contrastive interlanguage analysis, comparing the data of language learners with the data of native speakers or the data of different groups of language learners.


2019 ◽  
Vol 15 (2) ◽  
pp. 383-417 ◽  
Author(s):  
Roland Schäfer

Over the past years, multifactorial corpus-based explorations of alternations in grammar have become an accepted major tool in cognitively oriented corpus linguistics. For example, prototype theory as a theory of similarity-based and inherently probabilistic linguistic categorization has received support from studies showing that alternating constructions and items often occur with probabilities influenced by prototypical formal, semantic or contextual factors. In this paper, I analyze a low-frequency alternation effect in German noun inflection in terms of prototype theory, based on strong hypotheses from the existing literature that I integrate into an established theoretical framework of usage-based probabilistic morphology, which allows us to account for similarity effects even in seemingly regular areas of the grammar. Specifically, the so-called weak masculine nouns in German, which follow an unusual pattern of case marking and often have characteristic lexical properties, sporadically occur in forms of the dominant strong masculine nouns. Using data from the nine-billion-token DECOW12A web corpus of contemporary German, I demonstrate that the probability of the alternation is influenced by the presence or absence of semantic, phonotactic, and paradigmatic features. Token frequency is also shown to have an effect on the alternation, in line with common assumptions about the relation between frequency and entrenchment. I use a version of prototype theory with weighted features and polycentric categories, but I also discuss the question of whether such corpus data can be taken as strong evidence for or against specific models of cognitive representation (prototypes vs. exemplars).


1998 ◽  
Vol 3 (2) ◽  
pp. 189-210 ◽  
Author(s):  
Jan Aarts ◽  
Hans van Halteren ◽  
Nelleke Oostdijk

The article discusses the role of linguistic annotation in corpus linguistics as opposed to annotation in natural language processing. In corpus linguistics, annotation is an integral part of the process of linguistic interpretation and description of the data. Tagging and parsing are discussed as the automatic counterparts of, respectively, the paradigmatic and the syntagmatic description of corpus data. The requirements for a corpus linguistic annotation system are considered. An account is given of the TOSCA analysis system as representative of such an annotation system. Performance results of the system are given, and an evaluation is made.


2011 ◽  
Vol 16 (1) ◽  
pp. 3-44 ◽  
Author(s):  
Michael Barlow

This paper examines the relationship between corpus linguistics and theoretical linguistics from a variety of standpoints. We consider the nature of the fit between particular theoretical approaches and the three areas in which corpus linguistics has made a significant contribution to our understanding of language: the provision of frequency information, the highlighting of the importance of collocations, and the description of variation and text types. The complex relationship between data, theory, and representation is described with the aim of situating corpus-based research with respect to different linguistic theories, looking broadly at British and American traditions and paying particular attention to usage-based models of language. We then briefly discuss some current issues surrounding theoretical developments within corpus linguistics, including the divide between cognitive and social perspectives; the representation of corpus-based generalisations; and the relationship between patterns in corpus data and patterns in the mind.

