Linguistic Annotation of Translated Chinese Texts: Coordinating Theory, Algorithms and Data

2021 ◽  
Vol 72 (2) ◽  
pp. 590-602
Author(s):  
Kirill I. Semenov ◽  
Armine K. Titizian ◽  
Aleksandra O. Piskunova ◽  
Yulia O. Korotkova ◽  
Alena D. Tsvetkova ◽  
...  

Abstract The article tackles the problems of linguistic annotation of the Chinese texts presented in the Ruzhcorp – Russian-Chinese Parallel Corpus of RNC, and the ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present a theoretical comparison of the widespread standards of Chinese text processing. On the other hand, we describe our experiments in three fields (word segmentation, grapheme-to-phoneme conversion, and PoS-tagging) on specific corpus data that contains many transliterations and loanwords. As a result, we propose a preprocessing pipeline for the Chinese texts that will be implemented in Ruzhcorp.
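To illustrate why transliterated loanwords complicate word segmentation, here is a minimal sketch of dictionary-based forward maximum matching. It is not the pipeline proposed in the article; the function name `forward_max_match`, the toy lexicon, and the `max_len` parameter are assumptions made for this example only.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon entry
    starting at the current position, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicons: one knows the transliteration 莫斯科 (Moscow), one does not.
with_loanword = {"我", "喜欢", "莫斯科"}
without_loanword = {"我", "喜欢"}

print(forward_max_match("我喜欢莫斯科", with_loanword))
print(forward_max_match("我喜欢莫斯科", without_loanword))
```

When the transliteration is missing from the lexicon, this sketch splits it into single characters, which is the kind of error that loanword-heavy corpus data can be expected to trigger in dictionary-driven segmenters.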

Entropy ◽  
2020 ◽  
Vol 22 (3) ◽  
pp. 275
Author(s):  
Igor A. Bessmertny ◽  
Xiaoxi Huang ◽  
Aleksei V. Platonov ◽  
Chuqiao Yu ◽  
Julia A. Koroleva

Search engines are able to find documents containing patterns from a query. This approach can be used for alphabetic languages such as English. However, Chinese is highly dependent on context. A significant problem of Chinese text processing is the absence of spaces between words, so the text must be segmented into words before any other action. Algorithms for Chinese text segmentation should consider context; that is, the word segmentation process depends on the surrounding ideograms. As the existing segmentation algorithms are imperfect, we have considered an approach that builds the context from all possible n-grams surrounding the query words. This paper proposes a quantum-inspired approach to ranking Chinese text documents by their relevance to the query. In particular, this approach uses Bell’s test, which measures the quantum entanglement of two words within the context. The contexts of words are built using the hyperspace analogue to language (HAL) algorithm. Experiments carried out in three domains demonstrated that the proposed approach provides acceptable results.
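The HAL step mentioned in this abstract can be sketched as follows. This is a minimal illustration of the general HAL idea (a sliding window in which nearer preceding tokens receive larger weights), not the authors' implementation; the function name `hal_matrix`, the window size, and the use of single characters as tokens are assumptions. The Bell's test ranking step is not attempted here.

```python
from collections import defaultdict


def hal_matrix(tokens, window=3):
    """Build a HAL-style co-occurrence table.

    matrix[w][c] accumulates a weight each time c precedes w within
    `window` positions; the weight is window - distance + 1, so adjacent
    tokens count most.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break
            matrix[word][tokens[j]] += window - distance + 1
    return matrix


# Assumption for illustration: each Chinese character is treated as one token.
context = hal_matrix(list("中文分词"), window=2)
print(dict(context["分"]))
```

Each row of the table can then serve as a context vector for its token, which is the role the abstract assigns to HAL.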


2012 ◽  
Vol 56 (4) ◽  
pp. 998-1021 ◽  
Author(s):  
Miguel Ángel Jiménez-Crespo ◽  
Maribel Tercedor

Localization is increasingly making its way into translation training programs at university level. However, there is still little empirical research addressing issues such as how to define localization in relation to translation, what localization competence entails, or how best to incorporate intercultural differences between digital genres, text types and conventions, among other aspects. In this paper, we propose a foundation for the study of localization competence based upon previous research on translation competence. This project was developed following an empirical corpus-based contrastive study of student translations (learner corpus), combined with data from a comparable corpus made up of an original Spanish corpus and a Spanish localized corpus. The objective of the study is to identify differences in production between digital texts localized by students and professionals on the one hand, and original texts on the other. This contrastive study allows us to gain insight into how localization competence interrelates with the superordinate concept of translation competence, thus shedding light on which aspects need to be addressed during localization training in university translation programs.


2017 ◽  
pp. 35-46 ◽  
Author(s):  
Irene Doval

This paper reviews the author’s experiences of tokenizing and POS tagging a bilingual parallel corpus, the PaGeS Corpus, consisting mostly of German and Spanish fictional texts. This is part of an ongoing process of annotating the corpus for part-of-speech information. This study discusses the specific problems encountered so far. On the one hand, tagging performance degrades significantly when applied to fictional data and, on the other, pre-existing annotation schemes are all language specific. To further improve accuracy during post-editing, the author has developed a common tagset and identified major error patterns.


2010 ◽  
Vol 34 (1) ◽  
pp. 36-74 ◽  
Author(s):  
Nicole Dehé ◽  
Anne Wichmann

Sentence-initial pronoun-verb combinations such as I think, I believe are ambiguous between main clause use on the one hand and adverbial or discourse use on the other hand. We approach the topic from a prosodic perspective. Based on corpus data from spoken British English the prosodic patterns of sentence-initial I think and I believe are analysed and related to their interpretation in context. We show that these expressions may function as main clause (MC), comment clause (CC) or discourse markers (DM) and that the speaker’s choice is reflected in the prosody. The key feature is prosodic prominence: MCs are reflected by accent placement on the pronoun, CCs by an accent on the verb, while DMs are unstressed.


Author(s):  
Svetlana Shevchenko

The article deals with the interdiction convergence on the example of evolutionary changes in lexical semantics of poetic language. The current study contributes to the development of the methodology for studying the language evolutionary processes. The paper describes certain trends of dynamic changes and their specifics; it offers some predictions about the further lexical convergence of different types of functional styles. The findings contribute to the development of lexicography, which is going to reflect not only static but also dynamic characteristics of lexical units, including stylistic ones. The subjectivity of labeling poetic vocabulary in dictionaries can be partially removed through the analysis of corpus data by comparing frequency indices in different subsections. However, this method is not always accurate, and it does not effectively trace evolutionary changes. The data from the psycholinguistic experiments can help reveal the dynamics of changes. On the one hand, the results of scaling show the extent of poetry in connotative meanings; on the other hand, the open-response associative experiment allows us to calculate the archaization index of a lexeme by summing up the numerical values of certain selected parameters. The research gives clear evidence of active archaization of some specific poetic lexemes. The findings also show that the dynamic changes in stylistic connotation are not synchronous with the changes in the denotative layer of a lexical unit.


2021 ◽  
Vol 44 (3) ◽  
pp. 335-350
Author(s):  
Chuming Wang ◽  
Wei Hong

Abstract This study investigated the efficiency of learning the Chinese numeral classifiers by L2 Chinese learners by means of an alignment-oriented task. Participants were a total of 96 intermediate learners of L2 Chinese, who were randomly assigned to two experimental groups and one control group, with each group consisting of 32 participants. The continuation task used in this study consisted of a picture-based Chinese text depicting a room with an array of objects, which necessitates the use of classifiers. The two experimental groups were both required to first read the text and then write to describe their own rooms in comparison with the one in the text. One group was instructed to use the classifiers from the text as much as possible in their writing, whereas the other was not required to do so. Participants in the control group were first given the picture to look at in the absence of the text and then asked to describe their own rooms. The results showed that the continuation task significantly enhanced participants’ retention of the Chinese numeral classifiers, suggesting that the alignment-based approach is an effective way to learn difficult linguistic categories such as the Chinese classifiers.


1996 ◽  
Vol 2 (4) ◽  
pp. 355-364 ◽  
Author(s):  
Eva Ejerhed

The paper presents background and motivation for a processing model that segments discourse into units that are simple, non-nested clauses, prior to the recognition of clause internal phrasal constituents, and experimental results in support of this model. One set of results is derived from a statistical reanalysis of the Swedish empirical data in Strangert, Ejerhed and Huber 1993 concerning the linguistic structure of major prosodic units. The other set of results is derived from experiments in segmenting part of speech annotated Swedish text corpora into clauses, using a new clause segmentation algorithm. The clause segmented corpus data is taken from the Stockholm Umeå Corpus (SUC), 1 M words of Swedish texts from different genres, part of speech annotated by hand, and from the Umeå corpus DAGENS INDUSTRI 1993 (DI93), 5 M words of Swedish financial newspaper text, processed by fully automatic means consisting of tokenizing, lexical analysis, and probabilistic POS tagging. The results of these two experiments show that the proposed clause segmentation algorithm is 96% correct when applied to manually tagged text, and 91% correct when applied to probabilistically tagged text.


Author(s):  
Michael D. K. Ing

This chapter explores irresolvable value conflicts with regard to sages. It begins with portrayals of early sages such as Yao and Shun as compromised figures in a broad array of early Chinese texts. This serves as a context for understanding how early Confucians stressed the virtuous nature of sages on the one hand and accepted portions of the broader discourse of compromise on the other. To illustrate this the chapter looks more closely at the case of Wu Wang 武王‎ and shows that early Confucian texts were ambivalent about his violent overthrow of the Shang dynasty. This chapter also looks at Kongzi and builds off the notion of transgression discussed in chapter 4 to show that he was also understood as a conflicted figure.


2021 ◽  
Vol 32 (2) ◽  
pp. 219-250
Author(s):  
Yu Fang ◽  
Haitao Liu

Abstract This paper investigates the effects of 10 factors on the choice between alternative ba sentences and SVO sentences in Mandarin Chinese. These factors are givenness, definiteness, animacy and pronominality of NP2s, NP2 length, VP length, verb sense, syntactic parallelism, dependency distance, and surprisal. Using corpus data and mixed-effects logistic regression modeling, we find that on the one hand, givenness, syntactic parallelism, and the log-transformed ratio of NP2 length and VP length are significant predictors of the choice between ba sentences and SVO sentences. A new NP2, a large length ratio and a parallel construction predict an SVO sentence rather than a ba sentence. On the other hand, dependency distance and surprisal estimated by the trigram model are effective in predicting the choice between naturally occurring ba/SVO sentences and their alternatives. Naturally occurring sentences are more likely to have shorter dependency distances and smaller surprisal values than the converted sentences. The effects of these five factors on syntactic choice are congruent with results of previous studies, which suggests that some determinants of syntactic choice are shared among languages.
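Three of the quantitative predictors named in this abstract can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the ratio is taken as NP2 length over VP length with a natural logarithm (the abstract specifies neither the direction nor the base), and surprisal is computed from a probability that a trigram model would have to supply.

```python
import math


def mean_dependency_distance(heads):
    """heads[i] is the 1-based position of the head of word i + 1; 0 marks the root.
    Returns the mean absolute distance between dependents and their heads."""
    distances = [abs(head - (i + 1)) for i, head in enumerate(heads) if head != 0]
    return sum(distances) / len(distances) if distances else 0.0


def log_length_ratio(np2_len, vp_len):
    """Log-transformed length ratio (assumed here: NP2 length / VP length, natural log)."""
    return math.log(np2_len / vp_len)


def surprisal(probability):
    """Surprisal in bits of an event with the given conditional probability."""
    return -math.log2(probability)


# Hypothetical four-word sentence: word 2 is the root, words 1 and 4 depend on it,
# and word 3 depends on word 4.
print(mean_dependency_distance([2, 0, 4, 2]))
print(log_length_ratio(4, 2))
print(surprisal(0.25))
```

In a study like the one described, values of this kind would be computed per sentence and entered as predictors in the regression model; the sketch stops short of the modeling step.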


Pragmatics ◽  
2019 ◽  
Vol 30 (1) ◽  
pp. 142-168 ◽  
Author(s):  
Dániel Z. Kádár ◽  
Juliane House

Abstract Our study provides a corpus-based contrastive pragmatic investigation of the expressions please in English and qing 请 in Chinese. We define such expressions as ‘ritual frame indicating expressions’ (henceforth RFIEs) and argue that RFIEs are deployed in settings where it is important to show awareness of rights and obligations. ‘Ritual frame’ encompasses a cluster of standard situations. On the one hand, the corpus-based investigation of ritual provides an innovative complement to sociopragmatic approaches to ritual behaviour because it reveals how RFIEs that indicate ritual spread across a cluster of standard situations. On the other hand, it allows the researcher to contrast the scope of ritual across lingua-cultures by comparatively looking into the standard situations in which a particular RFIE is deployed. Findings of our data analysis point to intriguing differences between English and Chinese RFIEs, as well as relevant lingua-cultural reasons behind such differences.

