Corpora
Latest Publications


TOTAL DOCUMENTS

311
(FIVE YEARS 74)

H-INDEX

21
(FIVE YEARS 3)

Published By Edinburgh University Press

1755-1676, 1749-5032

Corpora ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. 337-348
Author(s):  
Paulo Almeida ◽  
Manuel Marques-Pita ◽  
Joana Gonçalves-Sá

In a representative democracy, some decide in the name of the rest. These elected officials are commonly gathered in public assemblies, such as parliaments, where they discuss policies, legislate, and vote on fundamental initiatives. A core aspect of such democratic processes is the plenary debate, where important public discussions take place. Many parliaments around the world are increasingly keeping the transcripts of such debates and other parliamentary data in digital formats that are accessible to the public. This is meant to increase transparency and accountability, but these records are often only provided as raw text or even as images, with little, if any, annotation and in inconsistent formats, making them difficult to analyse. Here, we present ptarl-d, an annotated corpus of debates in the Portuguese Parliament, covering the years 1976 to 2019 and representing the entire period of Portuguese democracy.


Corpora ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. 379-416
Author(s):  
Tatyana Karpenko-Seccombe

This paper considers the role of historical context in initiating shifts in word meaning. The study focusses on two words – the translation equivalents separatist and separatism – in the discourses of Russian and Ukrainian parliamentary debates before and during the Russian–Ukrainian conflict which emerged at the beginning of 2014. The paper employs a cross-linguistic corpus-assisted discourse analysis to investigate the way wider socio-political context affects word usage and meaning. To allow a comparison of discourses around separatism between two parliaments, four corpora were compiled covering the debates in both parliaments before and during the conflict. Keywords, collocations and n-grams were studied and compared, and this was followed by qualitative analysis of concordance lines, co-text and the larger context in which these words occurred. The results show how originally close meanings of translation equivalents began to diverge and manifest noticeable changes in their connotative, affective and, to an extent, denotative meanings at a time of conflict in line with the dominant ideologies of the parliaments as well as the political affiliations of individuals.


Corpora ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. 305-335
Author(s):  
Andrew Hardie ◽  
Wesam Ibrahim

Arabic syntax has yet to be studied in detail from a corpus-based perspective. The Arabic copula kāna (‘be’) also functions as an auxiliary, creating periphrastic tense–aspect constructions, but the literature on these functions is far from exhaustive. To analyse kāna within the one-million-word Corpus of Contemporary Arabic, we apply part-of-speech tagging to disambiguate copula and auxiliary at a high rate of accuracy. The tagging uses novel, targeted enhancements to a previously described program which improves the accessibility for linguistic analysis of the output of Habash et al.’s [2012] MADA disambiguator for the Buckwalter Arabic morphological analyser. Concordances of both are extracted, and 10 percent samples (499 instances of copula kāna and 387 of auxiliary kāna) are analysed manually to identify surface-level grammatical patterns and meanings. This raw analysis is then systematised according to the more general patterns’ main parameters of variation; special descriptions are developed for specific, apparently fixed-form expressions (including two phraseologies which afford expression of verbal and adjectival modality). Overall, we uncover substantial new detail not mentioned in existing grammars (e.g., the quantitative predominance of the past imperfect construction over other uses of auxiliary kāna). These corpus-based findings have notable potential to inform and enhance not only grammatical descriptions but also the pedagogy of Arabic as a first or second/foreign language.


Corpora ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. 417-445
Author(s):  
Josip Batinić ◽  
Elena Frick ◽  
Thomas Schmidt

In this paper, we present an overview of freely available web applications providing online access to spoken language corpora. We explore and discuss various solutions with which the corpus providers and corpus platform developers address the needs of researchers who are working with spoken language. The paper aims to contribute to the long-overdue exchange and discussion of methods and best practices in the design of online access to spoken language corpora.


Corpora ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. 349-378
Author(s):  
Adrian Yip

Sport is a powerful social institution where hegemonic masculinity is constantly constructed and naturalised through the positioning of physicality and athleticism alongside maleness. Female athletes continue to be subordinated by means of under-representation and trivialising gender discourses. So far, the extensive discussion of gendered language in sports media has primarily focussed on identifying the manifestations of gender bias in traditional news media. There has been little endeavour to explore the language of online media and tournament organisers. This study addresses that gap by comparing online gender representations of tennis players during the Wimbledon Championships 2018 on five online news websites and the tournament website. It also contributes to existing literature by providing corpus evidence of gender bias in sports media. The corpus consists of 1,622 articles (1,076,475 tokens). Findings from frequency, collocation and concordance analysis indicate that despite some instances of gender-neutral representations, female players are prone to gender marking and gender-bland sexism on all websites. I argue that the challenges women face relate to the tension between femininity and athleticism, and the misguided belief that women need to but can never eliminate the muscle gap.


Corpora ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. 191-203
Author(s):  
Timo Korkiakangas

This paper describes the construction and annotation of the Late Latin Charter Treebank, a set of three dependency treebanks (LLCT1, LLCT2 and LLCT3) which together contain 1,261 Early Medieval Latin documentary texts (i.e., original charters) written in Italy between AD 714 and 1000 (about 594,000 tokens). The paper focusses on matters which a linguistically or philologically inclined user of LLCT needs to know: the criteria on which the charters were selected, the special characteristics of the annotation types utilised, and the geographical and chronological distribution of the data. In addition to normal queries on forms, lemmas, morphology and syntax, complex philological research settings are enabled by the textual annotation layer of LLCT, which indicates abbreviated and damaged words, as well as the formulaic and non-formulaic passages of each charter.


Corpora ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. 287-299
Author(s):  
Prihantoro  

MorphInd (Larasati et al., 2011) is a state-of-the-art morphological analyser for Indonesian. To date, there has not been any comprehensive evaluation of the morphological annotation scheme which MorphInd implements. My evaluation of this annotation scheme reveals a number of significant drawbacks. Some analytical features encoded in MorphInd's tagset seem not to reflect features actually present in Indonesian morphology, while certain common features in the analysis of Indonesian are absent. Likewise, the Part of Speech (POS) hierarchy in the MorphInd tagset does not reflect the usual POS hierarchy used by Indonesian reference grammars. Moreover, the MorphInd output does not link morphological tags to the corresponding morpheme. Finally, a number of issues which might problematise text/corpus querying in the annotation's layout are observable, particularly relating to affixes, reduplication, and the affix–reduplication interface.

