From key words to key semantic domains

2008 ◽  
Vol 13 (4) ◽  
pp. 519-549 ◽  
Author(s):  
Paul Rayson

This paper reports the extension of the key words method for the comparison of corpora. Using automatic tagging software that assigns part-of-speech and semantic field (domain) tags, a method is described which permits the extraction of key domains by applying the keyness calculation to tag frequency lists. The combination of the key words and key domains methods is shown to allow macroscopic analysis (the study of the characteristics of whole texts or varieties of language) to inform the microscopic level (focussing on the use of a particular linguistic feature) and thereby to suggest those linguistic features which should be investigated further. The resulting ‘data-driven’ approach presented here combines elements of both the ‘corpus-based’ and ‘corpus-driven’ paradigms in corpus linguistics. A web-based tool, Wmatrix, implementing the proposed method is applied in a case study: the comparison of UK 2001 general election manifestos of the Labour and Liberal Democratic parties.
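
The idea of applying a keyness calculation to tag frequency lists can be illustrated with a minimal sketch. The abstract does not say which statistic is used, so this example assumes the log-likelihood statistic that is commonly used for keyness in corpus comparison. The function names and the tag labels in the usage note are hypothetical placeholders, not output from Wmatrix.

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one item (word or tag).

    a, b: frequency of the item in corpus 1 and corpus 2
    c, d: total size (token count) of corpus 1 and corpus 2
    """
    # Expected frequencies if the item were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_items(freq1, freq2):
    """Rank every item in two frequency lists by keyness, highest first.

    freq1, freq2: dicts mapping an item (word, POS tag or semantic tag)
    to its frequency in each corpus. Corpus sizes are assumed to be the
    sums of the frequency lists.
    """
    c = sum(freq1.values())
    d = sum(freq2.values())
    scores = {
        item: log_likelihood(freq1.get(item, 0), freq2.get(item, 0), c, d)
        for item in set(freq1) | set(freq2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because the same function accepts any frequency list, calling `key_items` on word counts gives key words, while calling it on semantic tag counts (for example `{"TAG_POLITICS": 30, "TAG_OTHER": 70}` against `{"TAG_POLITICS": 10, "TAG_OTHER": 90}`) gives key domains.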

2014 ◽  
Vol 2014 ◽  
pp. 1-12 ◽  
Author(s):  
Aaron L.-F. Han ◽  
Derek F. Wong ◽  
Lidia S. Chao ◽  
Liangye He ◽  
Yi Lu

With the rapid development of machine translation (MT), MT evaluation has become important for telling us in a timely manner whether an MT system is making progress. Conventional MT evaluation methods tend to calculate the similarity between hypothesis translations produced by automatic translation systems and reference translations produced by professional translators. Existing evaluation metrics have several weaknesses. Firstly, they are designed around an incomplete set of factors, which leads to a language-bias problem: they perform well on some language pairs but poorly on others. Secondly, they tend to use either no linguistic features or too many; using none draws criticism from linguists, while using too many makes the model hard to reproduce. Thirdly, the reference translations they rely on are expensive and sometimes unavailable in practice. In this paper, the authors propose an unsupervised MT evaluation metric that uses a universal part-of-speech tagset and does not rely on reference translations. The authors also explore the performance of the designed metric on traditional supervised evaluation tasks. Both the supervised and unsupervised experiments show that the designed methods yield higher correlation scores with human judgments.


Author(s):  
Maryna Baklanova ◽  
Oleksandra Popova

This article is devoted to the problem of reproducing communicative semantics when translating English and Chinese economic and political texts into Ukrainian. The content and structure of simultaneous translation were analysed. A contrastive analysis of the linguistic features of English, Chinese and Ukrainian communicative semantics was made. Some tactics enabling the reproduction of the texts under research in the Ukrainian language within simultaneous translation were specified. Key words: simultaneous translation, transformations, the Chinese language, the English language, the Ukrainian language, speech tempo, time frame.


Author(s):  
Valentina Kisil ◽  
Svitlana Yukhymets

The article is devoted to the study of the peculiarities of translating terminology from English business discourse into Ukrainian and Chinese. The study presents the main approach to defining such concepts as “business discourse” and “translation operation” in current linguistics and translation studies; analyzes the linguistic features of business discourse; analyzes the translation operations applied at the lexical-semantic and structural component levels when translating English terms of business discourse into Ukrainian and Chinese; and considers the choice of translation operations when translating the terms of English discourse as a method of achieving an adequate translation. Key words: business discourse, translation operation, terminology, a term, the Chinese language.


Author(s):  
Una Stojnić

On the received view, the resolution of context-sensitivity is at least partly determined by non-linguistic features of the utterance situation. If I say ‘He’s happy’, what ‘he’ picks out is underspecified by its linguistic meaning, and is only fixed through extra-linguistic supplementation: the speaker’s intention, and/or some objective, non-linguistic feature of the utterance situation. This underspecification is exhibited by most context-sensitive expressions, with the exception of pure indexicals, like ‘I.’ While this received view is prima facie appealing, I argue it is deeply mistaken. I defend an account according to which context-sensitivity resolution is governed by linguistic mechanisms determining prominence of candidate resolutions of context-sensitive items. Thus, on this account, the linguistic meaning of a context-sensitive expression fully specifies its resolution in a context, automatically selecting the resolution antecedently set by the prominence-governing linguistic mechanisms.


Author(s):  
Vittoria Cuteri ◽  
Giulia Minori ◽  
Gloria Gagliardi ◽  
Fabio Tamburini ◽  
Elisabetta Malaspina ◽  
...  

Abstract. Purpose: Attention has recently been paid to Clinical Linguistics for the detection and support of clinical conditions. Many works have been published on the “linguistic profile” of various clinical populations, but very few papers have been devoted to linguistic changes in patients with eating disorders. Patients with Anorexia Nervosa (AN) share similar psychological features such as disturbances in self-perceived body image, inflexible and obsessive thinking and anxious or depressive traits. We hypothesize that these characteristics can result in altered linguistic patterns and be detected using Natural Language Processing tools. Methods: We enrolled 51 young participants from December 2019 to February 2020 (age range: 14–18): 17 girls with a clinical diagnosis of AN, and 34 normal-weight peers, matched by gender, age and educational level. Participants in each group were asked to produce three written texts (around 10–15 lines long). A rich set of linguistic features was extracted from the text samples and the statistical significance in pinpointing the pathological process was measured. Results: Comparison between the two groups showed several linguistic indexes as statistically significant, with syntactic reduction as the most relevant trait of AN productions. In particular, the following features emerge as statistically significant in distinguishing AN girls from their normal-weight peers: the length of the sentences, the complexity of the noun phrase, and the global syntactic complexity. This peculiar pattern of linguistic erosion may be due to the severe metabolic impairment also affecting the central nervous system in AN. Conclusion: These preliminary data showed the existence of linguistic parameters as probable linguistic markers of AN. However, the analysis of a bigger cohort, still ongoing, is needed to consolidate this assumption. Level of evidence: III, evidence obtained from case–control analytic studies.


2014 ◽  
Vol 7 (2) ◽  
pp. 205
Author(s):  
Ulin Nuha

In this study, the researcher analyzed the transactional and interpersonal conversation texts found in the grade VIII English textbook entitled “EOS English on Sky 2” and also analyzed the linguistic features of the transactional and interpersonal conversations in the English textbook. This study focuses on the structural-functional approach, which analyzes the speech function, and the structural approach, which analyzes linguistic features. This is a qualitative study. In calculating the data and the final result of data percentage, quantification was used to support this study. Units of analysis in this study are moves and clauses. The conversation texts are presented in 8 units. The moves were analyzed functionally and the clauses were analyzed structurally. The result shows that 54.5% of the speech functions of the transactional conversation texts match the standard of content, while 2.1% of the speech functions of the interpersonal conversation texts match the standard of content. The linguistic features applied in the transactional and interpersonal conversation texts are those of the functional literacy level. The speech functions of conversation texts introduced in EOS English on Sky 2 for junior high school grade VIII are less compatible with the standard of content based on the compatibility levels. Keywords: Transactional and interpersonal conversation texts; Speech function; linguistic feature.


2018 ◽  
Vol 6 (2) ◽  
pp. 83
Author(s):  
Refat Aljumily

The aim of this paper was to evaluate the efficiency of automated linguistic features and to test their capacity, or discriminating power, as style markers for author identification in short text messages of the Facebook genre. The corpus used to evaluate the automated linguistic features was compiled from 221 Facebook texts (each about 2 to 3 lines, or 35-40 words) written in English, in the same genre and on the same topic, and posted in the same year group, totaling 7530 words. To compose the dataset for evaluating linguistic feature performance, frequency values were collected from 16 linguistic feature types involving parts of speech, function words, word bigrams, character trigrams, average sentence length in words, average sentence length in characters, Yule’s K measure, Simpson’s D measure, average word length, FW/CW ratio, average characters, content-specific key words, type/token ratio, total number of short words (fewer than four characters), contractions, and total number of characters in words, which were selected from five corpora, totalling 328 test features. The evaluation of the 16 linguistic feature types differs from that of other analyses because the study used different variable selection methods, including feature type frequency, variance, term frequency/inverse document frequency (TF.IDF), signal-noise ratio, and Poisson term distribution. The relationships between known and anonymous text messages were examined using hierarchical linear and non-hierarchical nonlinear clustering methods, taking into account the nonlinear patterns in the data. There were similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms of function word and part-of-speech usage based on the TF.IDF technique, with an efficiency of 60% for function word usage and 50% for part-of-speech frequencies. There were no similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms of the other features using the feature type frequency and variance techniques in this test, and the efficiency of these features in the corpus was below 40%. There was a positive effect on identification performance from using part-of-speech and function word frequencies and applying the TF.IDF technique as the length of text messages increased (N ≥ 100). In this way, the performance and efficiency of syntactic features and function word usage in identifying anonymous authors or text messages improved as the length of the text messages increased under the TF.IDF variable selection technique, but decreased when the feature type frequency and variance techniques were applied in the selection process.
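
Three of the lexical richness measures named in this abstract (type/token ratio, Yule's K and Simpson's D) can be computed from a token list alone. The following is a minimal sketch using the standard textbook definitions of these measures; it is not the paper's feature extraction pipeline, and the whitespace tokenisation in the usage note is an assumption made for illustration.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def yules_k(tokens):
    """Yule's K: 10^4 * (sum over m of m^2 * V_m - N) / N^2.

    V_m is the number of types that occur exactly m times and
    N is the total number of tokens.
    """
    n = len(tokens)
    type_freqs = Counter(tokens)
    spectrum = Counter(type_freqs.values())  # maps m to V_m
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def simpsons_d(tokens):
    """Simpson's D: chance that two tokens drawn without replacement
    are the same type. Requires at least two tokens."""
    n = len(tokens)
    type_freqs = Counter(tokens)
    return sum(f * (f - 1) for f in type_freqs.values()) / (n * (n - 1))
```

For example, `yules_k("a a b c".split())` evaluates to 1250.0. On texts of only 35-40 words these measures are very sensitive to text length, which fits the abstract's observation that identification improves as the messages get longer.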


Data ◽  
2020 ◽  
Vol 5 (3) ◽  
pp. 60
Author(s):  
Nasser Alshammari ◽  
Saad Alanazi

This article outlines a novel data descriptor that provides the Arabic natural language processing community with a dataset dedicated to named entity recognition tasks for diseases. The dataset comprises more than 60 thousand words, which were annotated manually by two independent annotators using the inside–outside (IO) annotation scheme. To ensure the reliability of the annotation process, the inter-annotator agreement rate was calculated, and it scored 95.14%. Due to the lack of research efforts in the literature dedicated to studying Arabic multi-annotation schemes, a distinguishing and novel aspect of this dataset is the inclusion of six more annotation schemes that will bridge the gap by allowing researchers to explore and compare the effects of these schemes on the performance of Arabic named entity recognizers. These annotation schemes are IOE, IOB, BIES, IOBES, IE, and BI. Additionally, five linguistic features, including part-of-speech tags, stopwords, gazetteers, lexical markers, and the presence of the definite article, are provided for each record in the dataset.
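
To show how two of the listed annotation schemes relate, here is a minimal sketch that converts an IOB tag sequence to IOBES. It assumes the common `PREFIX-LABEL` tag format in which every entity starts with a `B-` tag; the abstract does not give the dataset's actual tag strings, so the function name and the `DIS` label are hypothetical.

```python
def iob_to_iobes(tags):
    """Convert IOB tags (every entity starts with B-) to IOBES tags.

    B- begins a multi-token entity, I- is inside it, E- ends it,
    S- marks a single-token entity and O is outside any entity.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The entity continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        elif prefix == "I":
            out.append(("I-" if continues else "E-") + label)
        else:
            raise ValueError(f"unexpected tag: {tag}")
    return out
```

With this sketch, `["O", "B-DIS", "I-DIS", "O", "B-DIS"]` becomes `["O", "B-DIS", "E-DIS", "O", "S-DIS"]`. Going from plain IO to the richer schemes is not always this mechanical, because IO cannot mark the boundary between two adjacent entities of the same type.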


2002 ◽  
Vol 29 (2) ◽  
pp. 449-488 ◽  
Author(s):  
DOUGLAS BIBER ◽  
RANDI REPPEN ◽  
SUSAN CONRAD

In their conceptual framework for linguistic literacy development, Ravid & Tolchinsky synthesize research studies from several perspectives. One of these is corpus-based research, which has been used for several large-scale research studies of spoken and written registers over the past 20 years. In this approach, a large, principled collection of natural texts (a ‘corpus’) is analysed using computational and interactive techniques, to identify the salient linguistic characteristics of each register or text variety. Three characteristics of corpus-based analysis are particularly important (see Biber, Conrad & Reppen 1998):
• a special concern for the representativeness of the text sample being analysed, and for the generalizability of findings;
• overt recognition of the interactions among linguistic features: the ways in which features co-occur and alternate;
• a focus on register as the most important parameter of linguistic variation: strong patterns of use in one register often represent only weak patterns in other registers.


Author(s):  
Cristina Crocamo ◽  
Marco Viviani ◽  
Francesco Bartoli ◽  
Giuseppe Carrà ◽  
Gabriella Pasi

Binge Drinking (BD) is a common risky behaviour that people rarely report to healthcare professionals, although it is not uncommon to find personal communications about alcohol-related behaviours on social media. By following a data-driven approach focusing on User-Generated Content, we aimed to detect potential binge drinkers through the investigation of their language and shared topics. First, we gathered Twitter threads quoting BD and alcohol-related behaviours, by considering unequivocal keywords, identified by experts, from previous evidence on BD. Subsequently, a random sample of the gathered tweets was manually labelled, and two supervised learning classifiers were trained on both linguistic and metadata features, to distinguish tweets of genuine unique users from those of media, bot, and commercial accounts. Based on this classification, we observed that approximately 55% of the 1 million alcohol-related collected tweets were automatically identified as belonging to non-genuine users. A third classifier was then trained on a subset of manually labelled tweets among those previously identified as belonging to genuine accounts, to automatically identify potential binge drinkers based only on linguistic features. On average, users classified as binge drinkers were quite similar to the standard genuine Twitter users in our sample. Nonetheless, the analysis of social media contents of genuine users reporting risky behaviours remains a promising source for informed preventive programs.

