Computational Linguistics and Intellectual Technologies

Published By Russian State University For The Humanities

9785728129462

Author(s):  
N. V. Remnev

Native Language Identification (NLI) is the task of automatically recognizing an author’s native language (L1) from texts written in a language that is non-native to the author. NLI has been studied in detail for English, and two shared tasks were held in 2013 and 2017, using TOEFL English essays and essay samples as data. A small number of studies address NLI for other languages. For Russian, the problem was investigated by Ladygina (2017) and Remnev (2019). This paper discusses applying approaches that proved effective in the NLI Shared Task 2013 and 2017 competitions to recognizing the author’s native language, as well as to recognizing the type of speaker: learners of Russian or heritage speakers of Russian. Native language identification is also performed based on the types of errors specific to different native languages. The study is data-driven and is made possible by the Russian Learner Corpus, developed by the Higher School of Economics (HSE) Learner Russian Research Group, on which the experiments are conducted.



Author(s):  
V. F. Vydrin
J. J. Méric

A model for the development of a corpus-driven spelling dictionary for the Bambara language is described. First, a list of about 4,000 lexemes characterized by spelling variability is extracted from an electronic Bambara-French dictionary. At the next stage, a script is applied to determine the number of occurrences of each spelling variant in the Bambara Reference Corpus, separately for the entire Corpus (more than 11 million words) and for its disambiguated subcorpus (about 1.5 million words). Statistics on the diversity of sources and authors are also obtained automatically. The statistical data are then sorted manually into two lists of lexemes: those whose standard spelling can be established statistically, and those requiring evaluation by expert linguists. Some difficult cases are discussed in the paper. At the final stage, a representative expert commission will discuss all those lexemes for which statistical data alone do not suffice to define a standard spelling variant, before taking a final decision on each. The resulting Bambara spelling dictionary will be published electronically and on paper.
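
The counting stage described above can be illustrated with a minimal Python sketch. It is not the authors' script: the function names, the thresholds `min_total` and `min_share`, and the placeholder tokens are all assumptions made for illustration. In the abstract the sorting into two lists is done manually, and the diversity of sources and authors is also taken into account, which this sketch does not model.

```python
from collections import Counter

def count_variants(tokens, variant_groups):
    """Count corpus occurrences of every spelling variant of each lexeme."""
    freq = Counter(tokens)
    return {lexeme: {v: freq[v] for v in variants}
            for lexeme, variants in variant_groups.items()}

def triage(counts, min_total=10, min_share=0.8):
    """Split lexemes into those with a clear majority variant and the rest.

    min_total and min_share are hypothetical cut-offs, not values from the paper.
    Each lexeme is assumed to have at least one listed variant.
    """
    standard, for_experts = {}, []
    for lexeme, by_variant in counts.items():
        total = sum(by_variant.values())
        best, best_n = max(by_variant.items(), key=lambda kv: kv[1])
        if total > 0 and total >= min_total and best_n / total >= min_share:
            standard[lexeme] = best
        else:
            for_experts.append(lexeme)
    return standard, for_experts

# Placeholder data (not real Bambara forms): two lexemes with two variants each.
tokens = ["aa"] * 9 + ["ab"] * 1 + ["ba"] * 3 + ["bb"] * 2
variant_groups = {"lex1": ["aa", "ab"], "lex2": ["ba", "bb"]}

counts = count_variants(tokens, variant_groups)
standard, for_experts = triage(counts)
```

With this toy input, `lex1` has a dominant variant and enough occurrences to be settled by the counts, while `lex2` is too rare and is passed on for expert evaluation.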



Author(s):  
A. A. Zinina
L. Y. Zaidelman
A. A. Kotov
N. A. Arinkin
...

The emotional behavior of a companion robot is important for human-robot interaction in training tasks. We examined the influence of the robot's emotional gestures and emotional speech on how it is perceived by primary school students (N=52, male and female, mean age 9.8) while they jointly solved the spatial Tangram puzzle with the robot. Emotional gestures were shown to contribute significantly to the attractiveness of the robot for the child. Test subjects also preferred the robot with emotional gestures and speech over the robot with neutral gesture and speech behavior. The study also analyzed the communicative behavior of the children and identified communicative signs typical of the start of interaction with the robot, of monitoring the game, and of difficult situations. We describe typical mistakes that children make when assembling a puzzle together with the robot.





Author(s):  
D. A. Chernova
S. V. Alexeeva
N. A. Slioussar
...

Even if we know how to spell, we often see words misspelled by other people, especially nowadays, when we constantly read unedited texts on social media and in personal messages. In this paper, we present two experiments showing that the incidence of orthographic errors reduces the quality of lexical representations in the mental lexicon: even if one knows how to spell a word, repeated exposure to incorrect spellings blurs its orthographic representation and weakens the connection between form and meaning. As a result, it is more difficult to judge whether the word is spelled correctly, and, more surprisingly, it takes more time to read the word even when there are no errors. We show that, when all other factors are balanced, the effect of misspellings is more pronounced for words with lower frequency. We compare our results with the only previous study addressing the influence of misspellings on the processing of correctly spelled words, which was conducted on English data. It may be interesting to explore this issue from a cross-linguistic perspective. In this study, we turn to Russian, which differs from English in having a more transparent orthography. Much larger corpora of unedited texts are available for English than for Russian, but, using a different way to estimate the incidence of misspellings, we obtained similar results and could also make some novel generalizations. In Experiment 1 we selected 44 words that are frequently misspelled; they were presented in two conditions (with or without spelling errors) and distributed across two experimental lists. For every word, participants were asked to determine whether it was spelled correctly or not. The frequency of the word and the relative frequency of its misspelled occurrences significantly influenced the number of incorrect responses: not only does it take longer to read frequently misspelled words, it is also more difficult to decide whether they are spelled correctly.
In Experiment 2 we selected 30 words from the materials of Experiment 1, and for every selected word we found a pair that is matched for length and frequency but is rarely misspelled due to its orthographic transparency. We used a lexical decision task, presenting these 60 words in the correct spelling, as well as 60 nonwords. Linear mixed-effects models (LMMs) were used for the statistical analysis. Firstly, the word type factor was significant: it takes more time to recognize a frequently misspelled word, which replicates the results obtained for English. Secondly, the interaction between the word type factor and the frequency factor was significant: the effect of misspellings was more pronounced for words of lower frequency. We can conclude that high-frequency words have more robust representations that resist blurring more efficiently than low-frequency ones. Finally, we conducted a separate analysis showing that the number of incorrect responses in Experiment 1 correlates with reaction times (RTs) in Experiment 2. Thus, whether we consciously try to find an error or simply read words, orthographic representations blurred by exposure to frequent misspellings make the task more difficult.



Author(s):  
O. N. Lyashevskaya
L. N. Ostyakova
E. A. Salnikov
O. A. Semenova
...

The orthographic and morphological heterogeneity of historical texts in premodern Slavic causes many difficulties in POS and morphological tagging. Existing approaches to these tasks show state-of-the-art results without normalization, but they are still very sensitive to properties of the training data such as genre and origin. In this paper, we investigate to what extent the heterogeneity and size of the training corpus influence the quality of POS tagging and morphological analysis. We observe that UDPipe trained on different parts of the Middle Russian corpus demonstrates a boost in accuracy when using less training data. We resolve this paradox by analyzing the distribution of POS tags and short words across subcorpora.



Author(s):  
I. M. Boguslavsky
V. G. Dikonov
T. I. Frolova
L. L. Iomdin
...

Text interpretation often requires common sense knowledge and reasoning. Convenient tools for developing methods of common sense reasoning are special sets of challenge problems whose interpretation requires sophisticated reasoning. An interesting example is a recently published data set called Triangle Choice of Plausible Alternatives (Triangle-COPA), which contains 100 multiple-choice problems that test the interpretation of social scenarios. Each problem includes a statement and two alternatives. The task is to identify the more plausible alternative. To process Triangle-COPA data we use SemETAP, a general-purpose semantic analyzer. We implement the full scenario of natural language (NL) understanding, starting from NL texts rather than from manually composed simplified logical formulas, which are common practice in logic-based approaches to common sense reasoning. We produce Enhanced Semantic Structures of the statement and both alternatives and check which alternative manifests more semantic agreement with the statement in terms of inferences.



Author(s):  
E. V. Budennaya
A. A. Evdokimova
Ju. V. Nikolaeva
N. V. Sukhova
...

The article addresses the relation between referential expressions and co-occurring kinetic phenomena (hand and head gestures), based on the RUPEX multimodal corpus. The results reflect significant differences in how individual movements and gestures are aligned with two major types of reference (full NPs vs. reduced expressions). It was initially assumed that full NPs are more often accompanied by a gesture. Our data support this hypothesis not only for hand gestures but also for head movements. Moreover, full NPs are more likely than reduced referential units (personal and demonstrative pronouns) to be accompanied by downward movements in both the manual and cephalic channels, as well as by metadiscourse gestures. In addition, pronouns are more likely to be aligned with pointing hand gestures, and zero reference is often accompanied by descriptive hand gestures. However, the kinetic behavior of the interlocutors is determined by a variety of factors, including the topic of the conversation, which predisposes speakers to certain types of gestures, and the relative position of the interlocutors.



Author(s):  
O. Iu. Chuikova

The paper deals with a number of characteristics of the secondary imperfectivation of po-perfectives in Russian. The study is based on an analysis of the level of imperfectivability of Russian perfective verbs with the prefix po- compared to a number of other prefixed perfective verb groups (e.g. verbs with such perfectivizing prefixes as na-, za-, etc.) according to the Dictionary of Russian Language, the Russian National Corpus and the Russian-language Internet (Runet). It is shown that the perfective verb group under discussion is specific as a whole as well as with respect to its subgroups, i.e., deperfective perfective verbs and morphologically marked Aktionsarten. Po-perfectives demonstrate a low average imperfectivability in comparison to the corresponding figures for other prefixed verb groups. For the subgroup of deperfective verbs (those formed from perfective stems), the level of imperfectivability is also unusually low. The delimitative Aktionsart shows a higher imperfectivability than other morphologically marked Aktionsarten do. Possible explanations for the peculiarities of the imperfectivability of po-perfectives support rather than contradict the hypothesis about the regularity of secondary imperfectivation in Russian.



Author(s):  
V. Malykh
D. Cherniavskii
A. Valukov
...

