Research in Corpus Linguistics

The FGLOCTweet Corpus: An English tweet-based corpus for fine-grained location-detection tasks

Location detection in social-media microtexts is an important natural language processing task for emergency contexts, in which locative references are identified in text data. Spatial information obtained from texts is essential to understand where an incident happened, where people are in need of help and/or which areas have been affected. This information helps raise awareness of the emergency situation and is passed on to emergency responders and competent authorities so that they can act as quickly as possible. Annotated text data are necessary for building and evaluating location-detection systems. The problem is that available corpora of tweets for location-detection tasks are either lacking or, at best, annotated with coarse-grained location types (e.g. cities, towns, countries, some buildings, etc.). To bridge this gap, we present our semi-automatically annotated corpus, the Fine-Grained LOCation Tweet Corpus (FGLOCTweet Corpus), an English tweet-based corpus for fine-grained location-detection tasks, including fine-grained locative references (i.e. geopolitical entities, natural landforms, points of interest and traffic ways) together with their surrounding locative markers (i.e. direction, distance, movement or time). It includes annotated tweet data for training and evaluation purposes, which can be used to advance research in location detection, as well as in the study of the linguistic representation of place or of the microtext genre of social media.

How is information content distributed in RA introductions across disciplines? An entropy-based approach

Research in Corpus Linguistics, 2022, Vol 10 (1), pp. 63-83. DOI: 10.32714/ricl.10.01.04

Author(s): Wei Xiao, Jin Liu, Li Li

Keyword(s): Social Sciences, Information Content, Academic Writing, Natural Sciences, Linguistic Features, Corpus Linguistic, Research Article, Based Instruction, Linguistic Methods, Interest In Research

Recent years have witnessed a growing interest in research article (RA hereafter) introductions. Most previous studies focused on the macro structures, rhetorical functions and linguistic realizations of RA introductions, but few investigated the distribution of information content from the perspective of information theory. The current study takes an entropy-based approach to the distributional patterns of information content in RA introductions and their variation across disciplines (humanities, natural sciences, and social sciences). Three indices, namely one-, two-, and three-gram entropies, were used to analyze 120 RA introductions (40 introductions from each disciplinary area). The results reveal that, first, information content is unevenly distributed in RA introductions, with Move 1 having the highest information content, followed in sequence by Move 3 and Move 2; second, the three entropy indices may reflect different linguistic features of RA introductions; and, third, information content varies across disciplines. In Move 1, the RA introductions of natural sciences are more informative than those of the other two disciplines, and in Move 3 those of social sciences are likewise more informative. This study has implications for genre-based instruction in the pedagogy of academic writing, as well as for extending quantitative corpus-linguistic methods to less explored fields.
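As a rough illustration of what a one-, two-, or three-gram entropy index could look like, the sketch below computes the Shannon entropy (in bits) of the n-gram distribution of a tokenized text. The abstract does not state the exact formula or tokenization used in the study, so the function name `ngram_entropy`, the whitespace tokenization and the choice of base-2 logarithms are all assumptions made for this example.

```python
import math
from collections import Counter


def ngram_entropy(tokens, n):
    """Shannon entropy (bits) of the n-gram distribution in a token list.

    Hypothetical helper for illustration; not the study's own implementation.
    """
    # Build all contiguous n-grams of length n.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens: no n-grams, no entropy
    counts = Counter(ngrams)
    total = len(ngrams)
    # H = -sum(p * log2(p)) over the relative frequencies of the n-grams.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Toy "introduction", tokenized naively on whitespace (an assumption).
    tokens = "recent years have witnessed a growing interest in research".split()
    for n in (1, 2, 3):
        print(n, ngram_entropy(tokens, n))
```

Under this reading, a move whose n-grams are more varied and more evenly distributed receives a higher entropy value, which is one way to interpret "more informative" in the abstract.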

Libya, the media and the language of violence: A Corpus-Assisted Discourse Analysis

Research in Corpus Linguistics, 2022, Vol 10 (1), pp. 84-116. DOI: 10.32714/ricl.10.01.05

Author(s): Safa Attia

Keyword(s): Discourse Analysis, Frequency Distribution, Media Coverage, Al Jazeera, News Values, Story Content, The World, The Media, Similarities And Differences, Al Arabiya

The Arab revolution euphoria of 2011 was covered around the clock by different media sites, engaging millions of followers around the world, and eventually turning into discontent in some affected countries. This study examines the outcomes of the Libyan uprising (2011–2015), specifically the topics of civil war and terrorism, through the lenses of the Arab written media in Arabic (Al Jazeera and Al Arabiya), the Arab written media in English (Al Jazeera and Al Arabiya), and the Western written media in English (BBC and CNN). Through Corpus-Assisted Discourse Analysis (CADS), integrating discursive news values analysis (DNVA), this study highlights the ideological representations of these media, and examines their similarities and differences in terms of frequency distribution and story content. The findings indicate that the media coverage of the outcomes of the Libyan Revolution, when reporting on the topics of war and terrorism, follows similar directions in story content and frequency distribution, with some differences in the latter between the analysed media sites. Also, the collocations, concordances, and DNVA results, especially NEGATIVITY, IMPACT and ELITENESS, show the media's emphasis on violent language, which makes terrorism appear the norm and thus manipulates the audience and affects their understanding of the news.

Linguistic democratization in HKE across registers: The effects of prescriptivism

Research in Corpus Linguistics, 2021, Vol 9 (2), pp. 64-89. DOI: 10.32714/ricl.09.02.04

Author(s): Lucía Loureiro-Porto

Keyword(s): Twentieth Century, Hong Kong, The Other, Cultural Norms, Outer Circle, Inner Circle, The One, Hong Kong English

The second half of the twentieth century witnessed the emergence and expansion of linguistic changes associated with a number of processes related to changes in socio-cultural norms, such as colloquialization, informalization and democratization. This paper focuses on the latter, a phenomenon that has been claimed to be responsible for several ongoing changes in inner-circle varieties of English, but is rather unexplored in outer-circle varieties. The paper explores Hong Kong English and studies two linguistic sets of markers that include items that represent the (old) undemocratic alternative and the (new) democratic option, namely modal must vs. semi-modals have (got) to, need (to) and want to, and epicene pronouns including undemocratic generic he, on the one hand, and democratic singular they and conjoined he or she, on the other. Using the Hong Kong component of the International Corpus of English, and adopting a register approach, the paper reaches conclusions regarding the role played by prescriptivism in the diffusion of democratic items.

Corpus Linguistics and Eighteenth Century Collections Online (ECCO)

Research in Corpus Linguistics, 2021, Vol 9 (1), pp. 19-34. DOI: 10.32714/ricl.09.01.03

Author(s): Mikko Tolonen, Eetu Mäkelä, Ali Ijaz, Leo Lahti

Keyword(s): Eighteenth Century, Corpus Linguistics, Character Recognition, Optical Character Recognition, Historical Source, Optical Character, Key Aspects, Machine Readable, Machine Readable Form

Eighteenth Century Collections Online (ECCO) is the most comprehensive dataset available in machine-readable form for eighteenth-century printed texts. It plays a crucial role in studies of eighteenth-century language and it has vast potential for corpus linguistics. At the same time, it is an unbalanced corpus that poses a series of different problems. The aim of this paper is to offer a general overview of ECCO for corpus linguistics by analysing, for example, its publication countries and languages. We will also analyse the role of the substantial number of reprints and new editions in the data, and discuss genres and the estimates of Optical Character Recognition (OCR) quality. Our conclusion is that whereas ECCO provides a valuable source for corpus linguistics, scholars need to pay attention to historical source criticism. We have highlighted key aspects that need to be taken into account when considering its possible uses.

Challenges of releasing audio material for spoken data: The case of the London-Lund Corpus 2

Research in Corpus Linguistics, 2021, Vol 9 (1), pp. 35-62. DOI: 10.32714/ricl.09.01.04

Author(s): Nele Põldvere, Johan Frid, Victoria Johansson, Carita Paradis

Keyword(s): Corpus Linguistics, Personal Information, Automatic Segmentation, Careful Consideration, Lexical Information, The Public, Valuable Complement, Corpus Data, Audio Files, New London

This article aims to describe key challenges of preparing and releasing audio material for spoken data and to propose solutions to these challenges. We draw on our experience of compiling the new London-Lund Corpus 2 (LLC-2), where transcripts are released together with the audio files. However, making the audio material publicly available required careful consideration of how to, most effectively, 1) align the transcripts with the audio and 2) anonymise personal information in the recordings. First, audio-to-text alignment was solved through the insertion of timestamps in front of speaker turns in the transcription stage, which, as we show in the article, may later be used as a valuable complement to more robust automatic segmentation. Second, anonymisation was done by means of a Praat script, which replaced all personal information with a sound that made the lexical information incomprehensible but retained the prosodic characteristics. The public release of the LLC-2 audio material is a valuable feature of the corpus that allows users to extend the corpus data relative to their own research interests and, thus, broaden the scope of corpus linguistics. To illustrate this, we present three studies that have successfully used the LLC-2 audio material.

The burden of legacy: Producing the Tagged Corpus of Early English Correspondence Extension (TCEECE)

Research in Corpus Linguistics, 2021, Vol 9 (1), pp. 104-131. DOI: 10.32714/ricl.09.01.07

Author(s): Lassi Saario, Tanja Säily, Samuli Kaislaniemi, Terttu Nevalainen

Keyword(s): Factors Affecting, Part Of Speech Tagging, Pos Tagging, Part Of Speech, Accurate Normalisation, Spelling Variation, Early English, Internal And External Factors, Speech Tagging, Late Modern

This paper discusses the process of part-of-speech tagging the Corpus of Early English Correspondence Extension (CEECE), as well as the end result. The process involved normalisation of historical spelling variation, conversion from a legacy format into TEI-XML, and finally, tokenisation and tagging by the CLAWS software. At each stage, we had to face and work around problems such as whether to retain original spelling variants in corpus markup, how to implement overlapping hierarchies in XML, and how to calculate the accuracy of tagging in a way that acknowledges errors in tokenisation. The final tagged corpus is estimated to have an accuracy of 94.5 per cent (in the C7 tagset), which is circa two percentage points (pp) lower than that of present-day corpora but respectable for Late Modern English. The most accurate tag groups include pronouns and numerals, whereas adjectives and adverbs are among the least accurate. Normalisation increased the overall accuracy of tagging by circa 3.7pp. The combination of POS tagging and social metadata will make the corpus attractive to linguists interested in the interplay between language-internal and -external factors affecting variation and change.

Review of Blanco, Marta, Hella Olbertz and Victoria Vázquez Rozas eds. 2019. Corpus y Construcciones: Perspectivas Hispánicas. (Verba: Anexo 79). Santiago de Compostela: Universidade de Santiago de Compostela. ISBN: 978-8-417-59587-6. https://dx.doi.org/10.15304/9788417595876

Research in Corpus Linguistics, 2021, Vol 9 (2), pp. 201-210. DOI: 10.32714/ricl.09.02.12

Author(s): Miriam Thegel

Keyword(s): Santiago De Compostela

A new approach to (key) keywords analysis: Using frequency, and now also dispersion

Research in Corpus Linguistics, 2021, Vol 9 (2), pp. 1-33. DOI: 10.32714/ricl.09.02.02

Author(s): Stefan Th. Gries

Keyword(s): Statistical Measure, Dimensional Approach, Likelihood Ratios, New Approach, Text Type, Corpus Linguistic, Log Likelihood, British National Corpus, Linguistic Approaches, National Corpus

A widely used method in corpus-linguistic approaches to discourse analysis, register/text type/genre analysis, and educational/curriculum questions is keywords analysis, a simple statistical method that aims to identify words that are key to, i.e. characteristic of, certain discourses, text types, or topic domains. The vast majority of keywords analyses have relied on the same statistical measure that most collocation studies use, the log-likelihood ratio, which is computed from frequencies of occurrence in the two corpora under consideration. In a recent paper, Egbert and Biber (2019) advocated a different approach, one that involves computing log-likelihood ratios for word types based on the range of their distribution rather than their frequencies in the target and reference corpora under consideration. In this paper, I argue that their approach is a most welcome addition to keywords analysis but can still be profitably extended by utilizing both frequency and dispersion for keyness computations. I present a new two-dimensional approach to keyness and exemplify it on the basis of the Clinton-Trump Corpus and the British National Corpus.
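To make the contrast between frequency-based and range-based keyness concrete, here is a minimal sketch that computes the standard log-likelihood ratio (G2) of a 2x2 contingency table twice: once from token frequencies and once from the number of texts in which a word occurs. This is only an illustration of the two kinds of input described above; it is not the two-dimensional measure proposed in the paper, whose details the abstract does not give. The function names `log_likelihood` and `keyness_2d` and the toy corpora are hypothetical.

```python
import math


def log_likelihood(k_target, n_target, k_ref, n_ref):
    """G2 for a 2x2 table: item present/absent by target/reference corpus."""
    observed = [
        [k_target, n_target - k_target],
        [k_ref, n_ref - k_ref],
    ]
    total = n_target + n_ref
    row_totals = [n_target, n_ref]
    col_totals = [k_target + k_ref, total - k_target - k_ref]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # cells with zero observations contribute nothing
                e = row_totals[i] * col_totals[j] / total
                g2 += o * math.log(o / e)
    return 2 * g2


def keyness_2d(word, target_texts, ref_texts):
    """Return (frequency-based G2, range-based G2) for one word.

    Each corpus is a list of texts; each text is a list of tokens.
    """
    freq_t = sum(text.count(word) for text in target_texts)
    freq_r = sum(text.count(word) for text in ref_texts)
    size_t = sum(len(text) for text in target_texts)
    size_r = sum(len(text) for text in ref_texts)
    # Range: in how many texts does the word occur at least once?
    range_t = sum(word in text for text in target_texts)
    range_r = sum(word in text for text in ref_texts)
    return (
        log_likelihood(freq_t, size_t, freq_r, size_r),
        log_likelihood(range_t, len(target_texts), range_r, len(ref_texts)),
    )


if __name__ == "__main__":
    target = [["tax", "cut", "tax"], ["tax", "jobs"], ["wall", "jobs"]]
    reference = [["the", "cat", "sat"], ["tax", "on", "tea"], ["a", "dog"]]
    print(keyness_2d("tax", target, reference))
```

A word that is frequent in the target corpus only because it is repeated in a single text would score high on the first value but low on the second, which is the kind of case a combined frequency-and-dispersion view is meant to expose.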

Multimodal meaning making: The annotation of nonverbal elements in multimodal corpus transcription

Research in Corpus Linguistics, 2021, Vol 9 (1), pp. 63-88. DOI: 10.32714/ricl.09.01.05

Author(s): Marie-Louise Brunner, Stefan Diemer

Keyword(s): Meaning Making, Spoken Language, Multimodal Data, Spoken Discourse, Annotation System, Quantitative And Qualitative Research, Wide Range, Multimodal Corpus, Suitable Standard, Structured Approach

The article discusses how to integrate annotation for nonverbal elements (NVE) from multimodal raw data as part of a standardized corpus transcription. We argue that it is essential to include multimodal elements when investigating conversational data, and that in order to integrate these elements, a structured approach to complex multimodal data is needed. We discuss how to formulate a structured corpus-suitable standard syntax and taxonomy for nonverbal features such as gesture, facial expressions, and physical stance, and how to integrate it in a corpus. Using corpus examples, the article describes the development of a robust annotation system for spoken language in the corpus of Video-mediated English as a Lingua Franca Conversations (ViMELF 2018) and illustrates how the system can be used for the study of spoken discourse. The system takes into account previous research on multimodality, transcribes salient nonverbal features in a concise manner, and uses a standard syntax. While such an approach introduces a degree of subjectivity through the criteria of salience and conciseness, the system also offers considerable advantages: it is versatile and adaptable, flexible enough to work with a wide range of multimodal data, and it allows both quantitative and qualitative research on the pragmatics of interaction.
