Corpus-based Studies of Lexical and Semantic Variation: The Importance of Both Corpus Size and Corpus Design

Corpora ◽  
2020 ◽  
Vol 15 (2) ◽  
pp. 125-140
Author(s):  
Yukiko Ohashi ◽  
Noriaki Katagiri ◽  
Katsutoshi Oka ◽  
Michiko Hanada

This paper reports on two research results: (1) the design of an English for Specific Purposes (ESP) corpus architecture, complete with annotations structured by regular expressions; and (2) a case study that tests the design by creating a specific vocabulary list from the compiled corpus. The first half of the study involved designing a precisely structured ESP corpus from 190 veterinary medical charts, with the data arranged in a hierarchy. The data hierarchy in the corpus consists of document types, outline elements and inline elements, such as species and breed. Perl scripts extracted the data attached to veterinary-specific categories, and the extracted data were used to create wordlists. The second part of the research tested the corpus mode, creating a list of commonly observed lexical items in veterinary medicine. The coverage of the wordlists by the General Service List (GSL) and the Academic Word List (AWL) was tested, with the result that 66.4 percent of all lexical items appeared in GSL and AWL, whereas 33.7 percent appeared in neither list. The corpus compilation procedures and the annotation scheme introduced in this study enable the compilation of specific corpora with explicit annotations, giving teachers access to the data required for creating ESP classroom materials.
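
The coverage test described in this abstract can be illustrated with a minimal sketch. It assumes that coverage is counted over distinct lexical items (types) rather than running tokens, which the abstract does not specify. The function name `coverage` and the toy word sets are hypothetical and are not taken from the paper, whose own scripts were written in Perl.

```python
def coverage(wordlist, reference_lists):
    """Return the share of distinct items in `wordlist` that occur in
    at least one of the reference lists (case-insensitive).

    Assumption: coverage is computed over types, not tokens.
    """
    items = {w.lower() for w in wordlist}
    if not items:
        return 0.0
    # Merge all reference lists into one lowercase lookup set.
    known = {w.lower() for ref in reference_lists for w in ref}
    return len(items & known) / len(items)


# Toy data for illustration only; these are not the real GSL or AWL.
toy_gsl = {"dog", "eat", "water"}
toy_awl = {"analyse", "data"}
toy_wordlist = ["dog", "canine", "data", "emesis", "water"]

share = coverage(toy_wordlist, [toy_gsl, toy_awl])
print(f"covered: {share:.1%}, not covered: {1 - share:.1%}")
```

With the real lists, the items outside the covered share are the candidates for a specialised vocabulary list.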


2004 ◽  
Author(s):  
J. Bruce Millar ◽  
Michael Wagner ◽  
Roland Goecke

2021 ◽  
Vol 12 (4) ◽  
pp. 612-648
Author(s):  
Johannes Scherling

For a few decades now, and most prominently promoted by the US, neoliberal economics has been on the rise, epitomized in recent austerity policies towards countries that have run into financial trouble. In particular, the drive for privatization of core public services relating to basic human needs, such as water, social services or pensions, has been increasingly criticized because of a perceived incompatibility between the profit motive and social solidarity. This article presents a corpus-based analysis of the US discourse on privatization by its proponents and its opponents, with an overall corpus size of about 230,000 tokens. It examines how the two groups conceptualize privatization differently and which strategies they apply to foreground or background particular aspects of it.


Author(s):  
Miroslav Kubát ◽  
Jan Hůla ◽  
Xinying Chen ◽  
Radek Čech ◽  
Jiří Milička

This is a pilot study of the usability of the Context Specificity measure for stylometric purposes. Specifically, the Word2vec word-embedding approach, based on measuring lexical context similarity between lemmas, is applied to the analysis of texts that belong to different styles. Three types of Czech texts are investigated: fiction, non-fiction, and journalism. Forty lemmas were observed (10 lemmas each for verbs, nouns, adjectives, and adverbs). The aim of the study is to introduce the concept of Context Specificity and to test whether this measure is sensitive to different styles. The results show that the proposed method, Closest Context Specificity (CCS), is independent of corpus size and shows promise for analyzing different styles.

