Computational Linguistics and Intellectual Technologies
Latest Publications


TOTAL DOCUMENTS

79
(FIVE YEARS 79)

H-INDEX

1
(FIVE YEARS 1)

Published By Russian State University For The Humanities

9785728129462

Author(s):  
A. A. Goncharov
O. Yu. Inkova

One of the main characteristics of logical-semantic relations (LSRs) between two fragments of a text is that these relations can be either explicit (expressed by some marker, e.g. a connective) or implicit (derived from the interrelation of these fragments’ semantics). Since implicit LSRs have no marker, they are difficult to find in a text, whether automatically or manually. In this paper, approaches to analysing implicit LSRs are compared, an original definition of them is offered, and the differences between implicit LSRs and LSRs expressed by non-prototypical means are described. A method is proposed to identify implicit LSRs using a parallel corpus and a supracorpora database of connectives. It builds on the well-known observation that LSRs can be made explicit by adding connectives in translation. The argument is that one can select translation pairs in which the translated fragment uses a connective to express an LSR while the corresponding source fragment does not contain any of the translation stimuli standard for this connective in the source language; this yields an array of contexts in which the LSR is implicit in the source text (or expressed by means other than connectives). The method is then applied to study the French causal connectives car, parce que and puisque using a Russian-French parallel corpus. The corpus data are analysed to obtain information about LSRs, particularly about cases where the causal LSR in Russian is implicit, as well as about the use of causal connectives in French. The results show that the proposed method makes it possible to quickly create a representative array of contexts with implicit LSRs, which can be useful both in text analysis and in machine learning.
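
The selection step of this method can be illustrated with a minimal Python sketch. It assumes that the pairs of interest are those whose translation contains the connective while the source fragment contains none of its standard translation stimuli. The `AlignedPair` type, the stimuli list for car and the example sentences are hypothetical and are not taken from the paper or from its database of connectives.

```python
import re
from dataclasses import dataclass


@dataclass
class AlignedPair:
    source: str       # Russian source fragment
    translation: str  # French translation fragment


# Hypothetical stimuli list for illustration only; the paper draws such
# lists from a supracorpora database of connectives.
STIMULI = {"car": ["потому что", "так как", "ибо"]}


def contains_phrase(text: str, phrase: str) -> bool:
    """Case-insensitive whole-phrase match (no partial-word hits)."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def implicit_candidates(pairs, connective, stimuli):
    """Keep pairs where the translation uses the connective but the
    source contains none of its standard translation stimuli."""
    return [
        p for p in pairs
        if contains_phrase(p.translation, connective)
        and not any(contains_phrase(p.source, s) for s in stimuli)
    ]


# Toy, invented examples.
PAIRS = [
    AlignedPair("Он остался дома: шёл дождь.",
                "Il est resté à la maison, car il pleuvait."),
    AlignedPair("Он остался дома, потому что шёл дождь.",
                "Il est resté à la maison, car il pleuvait."),
    AlignedPair("Шёл дождь.", "Il pleuvait."),
]

candidates = implicit_candidates(PAIRS, "car", STIMULI["car"])
```

Only the first toy pair is kept: its translation contains car, while its source has no listed causal marker. A real pipeline would also have to handle elided forms such as parce qu', which this simple phrase match does not cover.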


Author(s):  
N. V. Remnev

Native Language Identification (NLI) is the task of automatically recognizing an author’s native language (L1) from texts written in a language that is not native to the author. The NLI task has been studied in detail for English, and two shared tasks were held, in 2013 and 2017, where TOEFL English essays and essay samples were used as data. A small number of works address the NLI problem for other languages. For Russian, it was investigated by Ladygina (2017) and Remnev (2019). This paper discusses the use of approaches that proved effective in the NLI Shared Task 2013 and 2017 competitions to recognize the author’s native language, as well as to recognize the type of speaker: learners of Russian or heritage speakers of Russian. The native language identification task is also solved on the basis of the types of errors specific to different languages. The study is data-driven and is made possible by the Russian Learner Corpus, developed by the Higher School of Economics (HSE) Learner Russian Research Group, on which the experiments are conducted.


Author(s):  
V. F. Vydrin
J. J. Méric

A model for the development of a corpus-driven spelling dictionary for the Bambara language is described. First, a list of about 4,000 lexemes characterized by spelling variability is extracted from an electronic Bambara-French dictionary. At the next stage, a script is applied to determine the number of occurrences of each spelling variant in the Bambara Reference Corpus, separately for the entire Corpus (more than 11 million words) and for its disambiguated subcorpus (about 1.5 million words). Statistics on the diversity of sources and authors are also obtained automatically. The statistical data are then sorted manually into two lists of lexemes: those whose standard spelling can be established statistically, and those requiring evaluation by expert linguists. Some difficult cases are discussed in the paper. At the final stage, a representative expert commission will discuss all those lexemes for which statistical data alone do not suffice to define a standard spelling variant, before taking a final decision on each. The resulting Bambara spelling dictionary will be published electronically and on paper.
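
The counting stage can be sketched in a few lines of Python. This is only an illustration, not the authors' script: the variant names are placeholders rather than real Bambara forms, and the threshold rule in `suggest_standard` (with its hypothetical `min_total` and `min_share` parameters) is an assumption, since the abstract states that the sorting into two lists is done manually.

```python
from collections import Counter


def variant_counts(tokens, variants_by_lexeme):
    """Count corpus occurrences of every spelling variant of every lexeme."""
    counts = Counter(tokens)
    return {
        lexeme: {v: counts[v] for v in variants}
        for lexeme, variants in variants_by_lexeme.items()
    }


def suggest_standard(counts, min_total=10, min_share=0.8):
    """Return the dominant variant, or None if experts should decide.

    min_total and min_share are illustrative thresholds, not values
    taken from the paper.
    """
    total = sum(counts.values())
    if total >= min_total:
        best, n = max(counts.items(), key=lambda kv: kv[1])
        if n / total >= min_share:
            return best
    return None


# Placeholder data: two lexemes with two spelling variants each.
tokens = ["a1"] * 9 + ["a2"] * 1 + ["b1"] * 3 + ["b2"] * 3
variants = {"A": ["a1", "a2"], "B": ["b1", "b2"]}

stats = variant_counts(tokens, variants)
decisions = {lex: suggest_standard(c) for lex, c in stats.items()}
```

In this toy run lexeme A gets a statistical suggestion, while lexeme B (too few and evenly split occurrences) is left for expert evaluation. The same function could be run once on the whole corpus and once on the disambiguated subcorpus.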


Author(s):  
A. A. Zinina
L. Y. Zaidelman
A. A. Kotov
N. A. Arinkin
...  

The emotional behavior of a companion robot is important for human-robot interaction in training tasks. We examined how the robot’s emotional gestures and emotional speech influence its perception by primary school students (N=52, male, female, mean age 9.8) who solved the spatial Tangram puzzle jointly with the robot. Emotional gestures were shown to contribute significantly to the attractiveness of the robot for the child. Test subjects also preferred the robot with emotional gestures and speech over the robot with neutral gestures and speech. The study also analyzed the communicative behavior of the children and identified communicative signs typical of starting the interaction with the robot, of monitoring the game and of difficult situations. We describe typical mistakes that children make when assembling a puzzle together with the robot.


Author(s):  
D. A. Chernova
S. V. Alexeeva
N. A. Slioussar
...  

Even if we know how to spell, we often see words misspelled by other people, especially nowadays when we constantly read unedited texts on social media and in personal messages. In this paper, we present two experiments showing that the incidence of orthographic errors reduces the quality of lexical representations in the mental lexicon: even if one knows how to spell a word, repeated exposure to incorrect spellings blurs its orthographic representation and weakens the connection between form and meaning. As a result, it is more difficult to judge whether the word is spelled correctly, and, more surprisingly, it takes more time to read the word even when there are no errors. We show that, when all other factors are balanced, the effect of misspellings is more pronounced for words with lower frequency. We compare our results with the only previous study addressing the influence of misspellings on the processing of correctly spelled words, which was conducted on English data. It may be interesting to explore this issue from a cross-linguistic perspective. In this study, we turn to Russian, which differs from English in having a more transparent orthography. Much larger corpora of unedited texts are available for English than for Russian, but, using a different way to estimate the incidence of misspellings, we obtained similar results and could also make some novel generalizations. In Experiment 1 we selected 44 words that are frequently misspelled; they were presented in two conditions (with or without spelling errors) and distributed across two experimental lists. For every word, participants were asked to determine whether it was spelled correctly or not. The frequency of the word and the relative frequency of its misspelled occurrences significantly influenced the number of incorrect responses: not only does it take longer to read frequently misspelled words, it is also more difficult to decide whether they are spelled correctly.
In Experiment 2 we selected 30 words from the materials of Experiment 1, and for every selected word we found a counterpart that is matched for length and frequency but is rarely misspelled due to its orthographic transparency. We used a lexical decision task, presenting these 60 words in the correct spelling, as well as 60 nonwords. Linear mixed-effects models (LMMs) were used for the statistical analysis. Firstly, the word type factor was significant: it takes more time to recognize a frequently misspelled word, which replicates the results obtained for English. Secondly, the interaction between the word type factor and the frequency factor was significant: the effect of misspellings was more pronounced for words of lower frequency. We can conclude that high-frequency words have more robust representations that resist blurring more efficiently than low-frequency ones. Finally, we conducted a separate analysis showing that the number of incorrect responses in Experiment 1 correlates with RTs in Experiment 2. Thus, whether we consciously try to find an error or simply read words, orthographic representations blurred by exposure to frequent misspellings make the task more difficult.


Author(s):  
V. I. Podlesskaya

Based on data from the Russian National Corpus and the General Internet Corpus of Russian, the paper addresses syntactic, semantic and prosodic features of constructions with the demonstrative TOT used as an anaphor. These constructions have gained some attention in earlier studies [Paducheva 2016], [Berger, Weiss 1987], [Kibrik 2011], [Podlesskaya 2001], but their analysis (a) covered primarily their prototypical uses; and (b) was based on written data. However, data from informal and especially from spoken discourse show that the actual use of these constructions may deviate considerably from the known prototype. The paper aims at bridging this gap. I claim (i) that the function of TOT is to temporarily promote a referent from a less privileged discourse status to a more privileged one; and (ii) that TOT can be analyzed on a par with switch-reference devices in languages where the latter are grammatically marked (e.g. on verb forms). The following parameters of TOT-constructions are discussed: syntactic and semantic roles of TOT and of its antecedent in their respective clauses, linear and structural distances between TOT and its antecedent, and animacy of the maintained referent. Special attention is paid to the information structure of the TOT construction: I give structural and prosodic evidence that TOT never has a rhematic status. The revealed actual distribution of TOT (a) adds to our understanding of cross-linguistic variation in the anaphoric functions of demonstratives; and, hopefully, (b) may contribute to further developing computational approaches to coreference and anaphora resolution for Russian, e.g. by improving datasets necessary for this task.


Author(s):  
A. A. Endresen
V. A. Zhukova
D. D. Mordashova
E. V. Rakhilina
...  

We present a new open-access electronic resource named the Russian Constructicon that offers a searchable database of Russian constructions accompanied by descriptions of their properties and illustrated with corpus examples. The project was carried out over the period 2016–2020 and at present contains an inventory of over 2200 multi-word constructions of Contemporary Standard Russian. We prioritize “partially schematic” constructions that lie between the two extremes of fully compositional syntactic sequences on the one hand and fully idiomatic (phraseological) expressions on the other hand. Constructions of this type are difficult to account for in terms of either lexicon or grammar alone, and are often underrepresented in reference works of Russian. A typical construction in our database contains a fixed part (anchor words) and an open slot that can be filled with a restricted set of lexemes. In this paper we first focus on key characteristics of this resource that make it different from existing constructicons of other languages. Second, we describe how the new interface will be designed and how it will serve the needs of both linguists and L2 learners of Russian. In particular, we discuss various search possibilities relevant for different users and those parameters that are available for specifying the retrieval output. An example of an entry is given to show how the information about each construction is structured and presented. Third, we provide an overview of our multi-level semantic classification of constructions. We argue that our system of semantic and syntactic tags subdivides our items into meaningful classes and smaller groups and eventually facilitates the identification of constructional families and clusters. This methodology works well in turning the initial list of constructions as unrelated units into a structured network and makes it possible to refine and expand the collected inventory of constructions in a systematic way.


Author(s):  
O. N. Lyashevskaya
L. N. Ostyakova
E. A. Salnikov
O. A. Semenova
...  

The orthographic and morphological heterogeneity of historical texts in premodern Slavic causes many difficulties in POS and morphological tagging. Existing approaches to these tasks show state-of-the-art results without normalization, but they are still very sensitive to properties of the training data such as genre and origin. In this paper, we investigate to what extent the heterogeneity and size of the training corpus influence the quality of POS tagging and morphological analysis. We observe that UDPipe trained on different parts of the Middle Russian corpus demonstrates a boost in accuracy when using less training data. We resolve this paradox by analyzing the distribution of POS tags and short words across subcorpora.
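
One simple way to compare tag distributions across subcorpora is sketched below in Python. The abstract does not say which measure the authors used; Jensen-Shannon divergence is chosen here purely as an assumption, and the tag sequences are invented toy data.

```python
import math
from collections import Counter


def tag_distribution(tags):
    """Relative frequency of each POS tag in a sequence of tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    0.0 means identical distributions; 1.0 means no shared tags.
    """
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Invented toy subcorpora, given as flat lists of POS tags.
sub_a = ["NOUN", "VERB", "NOUN", "ADP"]
sub_b = ["NOUN", "VERB", "NOUN", "ADP"]
sub_c = ["PRON", "CCONJ"]

same = js_divergence(tag_distribution(sub_a), tag_distribution(sub_b))
disjoint = js_divergence(tag_distribution(sub_a), tag_distribution(sub_c))
```

A high divergence between a training subcorpus and the test data would be one possible signal that adding the subcorpus may hurt rather than help the tagger.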

