Research in Corpus Linguistics
Latest Publications


Published By Research In Corpus Linguistics

2243-4712
Updated Saturday, 10 July 2021

2021, Vol 9 (1), pp. 104-131
Author(s): Lassi Saario, Tanja Säily, Samuli Kaislaniemi, Terttu Nevalainen

This paper discusses the process of part-of-speech tagging the Corpus of Early English Correspondence Extension (CEECE), as well as the end result. The process involved normalisation of historical spelling variation, conversion from a legacy format into TEI-XML, and finally, tokenisation and tagging by the CLAWS software. At each stage, we had to face and work around problems such as whether to retain original spelling variants in corpus markup, how to implement overlapping hierarchies in XML, and how to calculate the accuracy of tagging in a way that acknowledges errors in tokenisation. The final tagged corpus is estimated to have an accuracy of 94.5 per cent (in the C7 tagset), which is circa two percentage points (pp) lower than that of present-day corpora but respectable for Late Modern English. The most accurate tag groups include pronouns and numerals, whereas adjectives and adverbs are among the least accurate. Normalisation increased the overall accuracy of tagging by circa 3.7pp. The combination of POS tagging and social metadata will make the corpus attractive to linguists interested in the interplay between language-internal and -external factors affecting variation and change.
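
One way to score tagging so that tokenisation errors count against accuracy can be sketched as follows. The abstract does not spell out the paper's own procedure, so the scheme here is an assumption: a tag is counted as correct only if the automatic token covers exactly the same characters as a gold-standard token and carries the same tag. The function names and the example tokens are hypothetical; the tags are written in C7 style.

```python
def char_spans(tokens):
    """Map each token to its (start, end) character span, ignoring whitespace."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def tagging_accuracy(gold, auto):
    """Share of gold tokens reproduced with identical boundaries and tag.

    gold and auto are lists of (word, tag) pairs over the same text.
    A mis-tokenised stretch yields no matching span, so it is scored as wrong.
    Assumes gold is non-empty.
    """
    gold_by_span = dict(zip(char_spans([w for w, _ in gold]), (t for _, t in gold)))
    auto_by_span = dict(zip(char_spans([w for w, _ in auto]), (t for _, t in auto)))
    correct = sum(
        1 for span, tag in auto_by_span.items() if gold_by_span.get(span) == tag
    )
    return correct / len(gold_by_span)


# Hypothetical example: the tagger fails to split "don't" into two tokens.
gold = [("do", "VD0"), ("n't", "XX"), ("go", "VVI")]
auto = [("don't", "NN1"), ("go", "VVI")]
accuracy = tagging_accuracy(gold, auto)
```

Under this scheme only "go" matches, so both gold tokens inside the unsplit "don't" are counted as errors rather than being silently dropped from the denominator.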


2021, Vol 9 (1), pp. 35-62
Author(s): Nele Põldvere, Johan Frid, Victoria Johansson, Carita Paradis

This article aims to describe key challenges of preparing and releasing audio material for spoken data and to propose solutions to these challenges. We draw on our experience of compiling the new London-Lund Corpus 2 (LLC-2), where transcripts are released together with the audio files. However, making the audio material publicly available required careful consideration of how to, most effectively, 1) align the transcripts with the audio and 2) anonymise personal information in the recordings. First, audio-to-text alignment was solved through the insertion of timestamps in front of speaker turns in the transcription stage, which, as we show in the article, may later be used as a valuable complement to more robust automatic segmentation. Second, anonymisation was done by means of a Praat script, which replaced all personal information with a sound that made the lexical information incomprehensible but retained the prosodic characteristics. The public release of the LLC-2 audio material is a valuable feature of the corpus that allows users to extend the corpus data relative to their own research interests and, thus, broaden the scope of corpus linguistics. To illustrate this, we present three studies that have successfully used the LLC-2 audio material.


2021, Vol 9 (1), pp. 19-34
Author(s): Mikko Tolonen, Eetu Mäkelä, Ali Ijaz, Leo Lahti

Eighteenth Century Collections Online (ECCO) is the most comprehensive dataset available in machine-readable form for eighteenth-century printed texts. It plays a crucial role in studies of eighteenth-century language and it has vast potential for corpus linguistics. At the same time, it is an unbalanced corpus that poses a series of different problems. The aim of this paper is to offer a general overview of ECCO for corpus linguistics by analysing, for example, its publication countries and languages. We also analyse the role of the substantial number of reprints and new editions in the data, and discuss genres and estimates of Optical Character Recognition (OCR) quality. Our conclusion is that, whereas ECCO provides a valuable source for corpus linguistics, scholars need to pay attention to historical source criticism. We highlight key aspects that need to be taken into account when considering its possible uses.


2021, Vol 10 (1), pp. 45-62
Author(s): Ece Genç-Yöntem, Evrim Eveyik-Aydın

Although compiling a spoken learner corpus is not a recent enterprise, the number of developmental spoken learner corpora in the field of corpus linguistics remains unsatisfactory. This report describes the compilation of the Yeditepe Spoken Corpus of Learner English (YESCOLE), a 119,787-word corpus of Turkish students’ spoken English at tertiary level. YESCOLE was compiled to generate a developmental corpus of spoken interlanguage by collecting samples from learners of different English proficiency levels at regular short intervals over seven months. In order to shed light on the laborious methodology of compiling a developmental spoken learner corpus, this paper elucidates the steps taken to build YESCOLE and discusses its potential benefits for research and instructional purposes.


2021, Vol 10 (1), pp. 31-44
Author(s): Gaëtanelle Gilquin

The Process Corpus of English in Education (PROCEED) is a learner corpus of English which, in addition to written texts, consists of data that make the writing process visible in the form of keystroke log files and screencast videos. It comes with rich metadata about each learner, including indices of exposure to the target language and cognitive measures such as working memory and fluid intelligence. It also includes an L1 component which is made up of similar data produced by the learners in their mother tongue. PROCEED opens new perspectives in the study of learner writing by going beyond the written product. It makes it possible to investigate aspects such as writing fluency, use of online resources, cognitive phenomena like automaticity and avoidance, and theoretical modelling of the writing process. It also has applications for teaching, e.g. by showing students screencast video clips from the corpus illustrating effective writing strategies, as well as for testing, e.g. by establishing a corpus-derived standard of writing fluency for learners at a certain proficiency level.


2021, Vol 10 (1), pp. 63-88
Author(s): Marie-Louise Brunner, Stefan Diemer

The article discusses how to integrate annotation for nonverbal elements (NVE) from multimodal raw data as part of a standardized corpus transcription. We argue that it is essential to include multimodal elements when investigating conversational data, and that in order to integrate these elements, a structured approach to complex multimodal data is needed. We discuss how to formulate a structured corpus-suitable standard syntax and taxonomy for nonverbal features such as gesture, facial expressions, and physical stance, and how to integrate it in a corpus. Using corpus examples, the article describes the development of a robust annotation system for spoken language in the corpus of Video-mediated English as a Lingua Franca Conversations (ViMELF 2018) and illustrates how the system can be used for the study of spoken discourse. The system takes into account previous research on multimodality, transcribes salient nonverbal features in a concise manner, and uses a standard syntax. While such an approach introduces a degree of subjectivity through the criteria of salience and conciseness, the system also offers considerable advantages: it is versatile and adaptable, flexible enough to work with a wide range of multimodal data, and it allows both quantitative and qualitative research on the pragmatics of interaction.


2021, Vol 9 (2), pp. 1-33
Author(s): Stefan Th. Gries

A widely used method in corpus-linguistic approaches to discourse analysis, register/text type/genre analysis, and educational/curriculum questions is keywords analysis, a simple statistical method that aims to identify words that are key to, i.e. characteristic of, certain discourses, text types, or topic domains. The vast majority of keywords analyses have relied on the same statistical measure that most collocation studies use, the log-likelihood ratio, computed from the frequencies of occurrence in the two corpora under consideration. In a recent paper, Egbert and Biber (2019) advocated a different approach, one that involves computing log-likelihood ratios for word types based on the range of their distribution rather than their frequencies in the target and reference corpora under consideration. In this paper, I argue that their approach is a most welcome addition to keywords analysis but can still be profitably extended by utilizing both frequency and dispersion for keyness computations. I present a new two-dimensional approach to keyness and exemplify it on the basis of the Clinton-Trump Corpus and the British National Corpus.
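
The two kinds of log-likelihood keyness contrasted in this abstract can be illustrated with a short sketch. It computes the standard G² statistic for a 2×2 table twice: once on token frequencies and once on range (the number of texts that contain the word). This is only an illustration of the general idea, not the two-dimensional measure proposed in the paper, whose details the abstract does not give; the function names and the counts in the example are hypothetical.

```python
import math


def g2(k_target, n_target, k_ref, n_ref):
    """Log-likelihood ratio (G2) for a 2x2 contingency table.

    k_* is the number of hits and n_* the number of units (tokens or texts)
    in the target and reference corpus respectively.
    """
    total = n_target + n_ref
    hits = k_target + k_ref
    observed = [k_target, n_target - k_target, k_ref, n_ref - k_ref]
    expected = [
        n_target * hits / total,
        n_target * (total - hits) / total,
        n_ref * hits / total,
        n_ref * (total - hits) / total,
    ]
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def keyness_pair(freq_t, tokens_t, freq_r, tokens_r, range_t, texts_t, range_r, texts_r):
    """Return (frequency-based G2, range-based G2) for one word type."""
    return (
        g2(freq_t, tokens_t, freq_r, tokens_r),  # units are tokens
        g2(range_t, texts_t, range_r, texts_r),  # units are texts
    )


# Hypothetical counts: the word is frequent in the target corpus but
# concentrated in few texts, so the two values can diverge.
by_frequency, by_range = keyness_pair(500, 100_000, 200, 100_000, 5, 200, 4, 200)
```

Keeping the two values separate, rather than collapsing them into one score, makes it visible when a word looks key only because a handful of texts use it heavily.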


2021, Vol 9 (1), pp. i-viii
Author(s): Tanja Säily, Jukka Tyrkkö

Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities. However, combining unstructured text with images, video, audio as well as structured metadata poses a variety of challenges to corpus compilers. This paper presents an overview of the topic to contextualise this special issue of Research in Corpus Linguistics. The aim of the special issue is to highlight some of the challenges faced and solutions developed in several recent and ongoing corpus projects. Rather than providing overall descriptions of corpora, each contributor discusses specific challenges they faced in the corpus development process, summarised in this paper. We hope that the special issue will benefit future corpus projects by providing solutions to common problems and by paving the way for new best practices for the compilation and development of rich-data corpora. We also hope that this collection of articles will help keep the conversation going on the theoretical and methodological challenges of corpus compilation.


2021, Vol 10 (1), pp. 1-30
Author(s): Tieu-Thuy Chung, Luyen-Thi Bui, Peter Crosthwaite

Appraisal theory (Martin and White 2005), an approach to discourse analysis dealing with evaluative language, has been employed in several earlier studies to analyse newspaper articles and spoken discourse, and it is now gaining in popularity as a framework for comparing first and second language (L1/L2) writing. This study investigated 40 English majors’ Vietnamese and English paragraphs for evaluative language, a key component of successful academic writing, as realised under Appraisal theory. To this end, we collected L1 Vietnamese and L2 English data from the same student writers across the same topics and used a corpus-informed Contrastive Interlanguage Analysis approach to the annotation and analysis of appraisal. A range of commonalities were present in the use of appraisal across the two language varieties, while the results also suggest significant differences between students’ evaluative expressions in Vietnamese as a mother tongue and English as a second or foreign language. This variation includes the comparative under- and over-use of specific appraisal resources in L1 and L2 writing respectively, in particular regarding writers’ use of attitudinal features. The findings serve to inform future pedagogical applications regarding explicit instruction in stance and appraisal features for novice L2 English writers in Vietnam.

