annotated corpora
Recently Published Documents


TOTAL DOCUMENTS

93
(FIVE YEARS 22)

H-INDEX

7
(FIVE YEARS 2)

Author(s):  
Алексей Сергеевич Серый ◽  
Анна Александровна Гриневич ◽  
Владислав Александрович Лисин

The paper presents an approach to building a research environment that integrates the information resources of a particular domain of knowledge and supports scientific research. The distinctive feature of the approach is that it combines, within a single information system, ontology-based tools for representing, systematizing, and annotating the resources integrated into the system, and that it is oriented towards collaborative work by specialists on creating an annotated corpus. The paper gives an example of applying the proposed approach to the development of an information system.


2021 ◽  
Vol 21 (2) ◽  
pp. 263-297
Author(s):  
Chiara Zanchi

This paper presents the Homeric Dependency Lexicon (HoDeL), a new resource with a user-friendly interface facilitating the study of Homeric verbs and dependents. HoDeL was induced from the analytical layer of AGDT 2.0, extracting all dependents tagged as SBJ, OBJ, PNOM, and OCOMP with a set of SQL queries. The paper illustrates HoDeL functionalities and shows how they can be employed by researchers to answer specific research questions about the Homeric language. Introducing the uses of HoDeL offers the opportunity to reexamine some crucial, though frequently underestimated, methodological challenges concerning annotated corpora and resources derived from them that relate to the linguistic theories underlying annotations and error propagation. It is argued that the careful documentation of how linguistic resources were created, what data they contain, and how they can be queried through their dedicated interfaces is essential to lay the groundwork for users’ investigations.
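The extraction step described in this abstract can be pictured with a minimal sketch. The table layout, column names, and toy rows below are assumptions made for illustration only; they are not the actual AGDT 2.0 or HoDeL schema, and the query is not one of the author's queries. The sketch only shows the general shape of a self-join that pairs each verb with its dependents carrying one of the four labels.

```python
import sqlite3

# Hypothetical schema (an assumption, not the real AGDT/HoDeL layout):
# one row per token, with the id of its syntactic head and its dependency label.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE tokens (
           sentence_id INTEGER,
           token_id    INTEGER,
           lemma       TEXT,
           pos         TEXT,
           head_id     INTEGER,   -- 0 marks the root of the sentence
           relation    TEXT)"""
)

# Invented toy sentence, not taken from the treebank.
toy_rows = [
    (1, 1, "ἀνήρ", "noun", 2, "SBJ"),
    (1, 2, "ὁράω", "verb", 0, "PRED"),
    (1, 3, "καλός", "adjective", 4, "ATR"),
    (1, 4, "ἵππος", "noun", 2, "OBJ"),
]
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?, ?)", toy_rows)

# Join every dependent (d) to its head (h) and keep only verb heads whose
# dependents carry one of the four labels named in the abstract.
QUERY = """
    SELECT h.lemma, d.relation, d.lemma
    FROM tokens AS d
    JOIN tokens AS h
      ON h.sentence_id = d.sentence_id AND h.token_id = d.head_id
    WHERE h.pos = 'verb'
      AND d.relation IN ('SBJ', 'OBJ', 'PNOM', 'OCOMP')
    ORDER BY d.sentence_id, d.token_id
"""
pairs = conn.execute(QUERY).fetchall()

for verb, relation, dependent in pairs:
    print(verb, relation, dependent)
```

The adjective labelled ATR is filtered out because its relation is not in the list and its head is not a verb, which is the kind of selection the abstract describes.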


2021 ◽  
Author(s):  
K. Abidi ◽  
K. Smaili

In this article, we tackle the issue of sentiment analysis in three Maghrebi dialects used in social networks. More precisely, we are interested in analysing sentiments in Algerian, Moroccan and Tunisian corpora. To do this, we automatically built three sentiment lexicons, one for each dialect. Each lexicon is composed of words with their polarities; a dialect word may be written in Arabic or in Latin script. These lexicons may include French or English words as well as words in Arabic dialect and standard Arabic. The semantic orientation of a word, represented by an embedding vector, is determined automatically by calculating its distance to several embedded seed words. The embedding vectors are trained on three large corpora collected from YouTube. The proposed approach is evaluated using a few existing annotated corpora in Tunisian and Moroccan dialects. For the Algerian dialect, in addition to a small corpus we found in the literature, we collected and annotated a corpus of 10k comments extracted from YouTube. This corpus is a valuable resource, which is made freely available.
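The idea of deriving a word's polarity from its distance to seed words can be sketched as follows. The abstract does not say which distance or aggregation the authors use, so cosine similarity and a mean over seeds are assumptions here, and the two-dimensional vectors are invented toy values rather than trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def polarity(word_vec, pos_seeds, neg_seeds):
    """Label a word by comparing its mean similarity to each seed set.

    Assumption: a word is positive when it is, on average, closer to the
    positive seeds than to the negative ones.
    """
    pos = sum(cosine(word_vec, s) for s in pos_seeds) / len(pos_seeds)
    neg = sum(cosine(word_vec, s) for s in neg_seeds) / len(neg_seeds)
    score = pos - neg
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Toy 2-d "embeddings" (hypothetical values, for illustration only).
positive_seeds = [(1.0, 0.1), (0.9, 0.2)]
negative_seeds = [(-1.0, 0.1), (-0.8, 0.3)]

print(polarity((0.7, 0.3), positive_seeds, negative_seeds))
print(polarity((-0.5, 0.2), positive_seeds, negative_seeds))
```

In a real setting the seed lists would hold embeddings of known sentiment words in each dialect, and a threshold around zero could be used to leave weakly oriented words out of the lexicon.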


2021 ◽  
Author(s):  
Yingqi Jing ◽  
Damián Ezequiel Blasi ◽  
Balthasar Bickel

A prominent principle in explaining a range of word order regularities is dependency locality, i.e. a principle that minimizes the linear distances (dependency lengths) between the head and its dependents. However, it remains unclear to what extent language users in fact observe locality when producing sentences under diverse conditions of cross-categorical harmony (such as the placement of verbal and nominal heads on the same vs different sides of their dependents), dependency direction (head-final vs head-initial) and parallel vs. hierarchical dependency structures (e.g. multiple adjectives dependent on the same head vs nested genitive dependents). Using 45 dependency-annotated corpora of diverse languages, we find that after controlling for harmony and conditioning on dependency types, dependency length minimization (DLM) is inversely correlated with the overall presence of head-final dependencies. This anti-DLM effect in sentences with more head-final dependencies is specifically associated with an accumulation of dependents in parallel structures and with disharmonic orders in hierarchical structures. We propose a detailed interpretation of these results and tentatively suggest a role for a probabilistic principle that favors embedding head-initial (e.g. VO) structures inside equally head-initial and thereby length-minimizing structures (e.g. relative clauses after the head noun) while head-final (OV) structures have a less pronounced preference for harmony and DLM. This is in line with earlier findings in research on the Greenbergian word order universals and with a probabilistic version of what has been suggested as the Final-Over-Final Condition more recently.
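The two quantities this abstract relates, dependency length and the share of head-final dependencies, can be computed from a list of head positions. The following is a minimal sketch assuming a CoNLL-style encoding in which each token stores the 1-based position of its head and 0 marks the root; the function names are hypothetical and the sketch does not reproduce the authors' statistical controls.

```python
def dependency_lengths(heads):
    """Linear distance between each dependent and its head.

    heads[i] is the 1-based position of the head of token i + 1;
    a value of 0 marks the root, which has no dependency length.
    """
    return [abs(h - (i + 1)) for i, h in enumerate(heads) if h != 0]


def head_final_share(heads):
    """Fraction of dependencies in which the head follows its dependent."""
    deps = [(i + 1, h) for i, h in enumerate(heads) if h != 0]
    if not deps:
        return 0.0
    return sum(1 for dep, head in deps if head > dep) / len(deps)


# "the dog chased a cat": the->dog, dog->chased, chased=root, a->cat, cat->chased
heads = [2, 3, 0, 5, 3]

lengths = dependency_lengths(heads)
print(lengths, sum(lengths))      # [1, 1, 1, 2] 5
print(head_final_share(heads))    # 0.75
```

Dependency length minimization is usually assessed by comparing the observed total against totals obtained from reordered baselines of the same tree, a step this sketch leaves out.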


PLoS ONE ◽  
2021 ◽  
Vol 16 (9) ◽  
pp. e0256503
Author(s):  
Alfonso Semeraro ◽  
Salvatore Vilella ◽  
Giancarlo Ruffo

The increasing availability of textual corpora and data fetched from social networks is fuelling a huge production of works based on the model proposed by psychologist Robert Plutchik, often referred to simply as the “Plutchik Wheel”. Related research ranges from descriptions of annotation tasks to emotion detection tools. Visualisation of such emotions is traditionally carried out using the most popular layouts, such as bar plots or tables, which are however sub-optimal. The classic representation of Plutchik’s wheel follows the principles of proximity and opposition between pairs of emotions: spatial proximity in this model is also a semantic proximity, as adjacent emotions elicit a complex emotion (a primary dyad) when triggered together; spatial opposition is a semantic opposition as well, as positive emotions are opposite to negative emotions. The most common layouts fail to preserve both features, not to mention the need to allow visual comparisons between different corpora at a glance, which is hard with basic design solutions. We introduce PyPlutchik, a Python module specifically designed for the visualisation of Plutchik’s emotions in texts or in corpora. The PyPlutchik package is available as a GitHub repository (http://github.com/alfonsosemeraro/pyplutchik) or through the installation commands pip or conda; for any enquiry about usage or installation, feel free to contact the corresponding author. PyPlutchik draws Plutchik’s flower with each emotion petal sized according to how much that emotion is detected or annotated in the corpus, also representing three degrees of intensity for each of them. Notably, PyPlutchik also allows users to display primary, secondary, tertiary and opposite dyads in a compact, intuitive way. We substantiate our claim that PyPlutchik outperforms other classic visualisations when displaying Plutchik emotions, and we showcase a few examples that display our module’s most compelling features.
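The proximity and opposition principles mentioned in this abstract follow from the circular order of the eight basic emotions. The sketch below is a standalone illustration of that geometry and does not use or reproduce the PyPlutchik API; the function names are hypothetical. It treats emotions one step apart on the wheel as a primary dyad, two steps as secondary, three as tertiary, and four as opposites.

```python
# The eight basic emotions in their circular order on Plutchik's wheel.
WHEEL = [
    "joy", "trust", "fear", "surprise",
    "sadness", "disgust", "anger", "anticipation",
]

RELATION_BY_DISTANCE = {
    0: "same emotion",
    1: "primary dyad",
    2: "secondary dyad",
    3: "tertiary dyad",
    4: "opposite",
}


def wheel_distance(a, b):
    """Number of steps between two emotions along the shorter arc of the wheel."""
    d = abs(WHEEL.index(a) - WHEEL.index(b))
    return min(d, len(WHEEL) - d)


def relation(a, b):
    """Name the relation between two basic emotions from their wheel distance."""
    return RELATION_BY_DISTANCE[wheel_distance(a, b)]


print(relation("joy", "trust"))         # adjacent petals
print(relation("anticipation", "joy"))  # adjacent across the wrap-around
print(relation("joy", "sadness"))       # facing petals
```

A bar plot or table loses this circular adjacency, which is the limitation of common layouts that the abstract points out.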


Author(s):  
Mykola Andrushchenko ◽  
Kirsi Sandberg ◽  
Risto Turunen ◽  
Jani Marjanen ◽  
Mari Hatavara ◽  
...  

Author(s):  
Jianwei Yan

Slavic languages are generally assumed to possess rich morphological features with free syntactic word order. Exploring this complexity trade-off can help us better understand the relationship between morphology and syntax within natural languages. However, few quantitative investigations have been carried out into this relationship within Slavic languages. Based on 34 annotated corpora from Universal Dependencies, this paper examines the correlations between morphology and syntax within Slavic languages by applying two metrics of morphological richness and two metrics of word order freedom. Our findings are as follows. First, the quantitative metrics adopted capture the distributions of morphological richness and word order freedom across languages well. Second, the metrics corroborate the correlation between morphological richness and word order freedom. Within Slavic languages, this correlation is moderate and statistically significant. Specifically, the richer the morphology, the less strict the word order. Third, Slavic languages can be clustered into three subgroups based on classification models. Most importantly, ancient Slavic languages are characterized by richer morphology and more flexible word order than modern ones. Fourth, of two possible confounding factors, corpus size does not greatly affect the results of the metrics, whereas corpus genre plays an important part in the measurements of word order freedom. Specifically, the word order of formal written genres tends to be more rigid than that of informal written and spoken ones. Overall, based on annotated corpora, the results verify the negative correlation between morphological richness and word order rigidity within Slavic languages, which might shed light on the dynamic relations between morphology and syntax in natural languages and provide quantitative instantiations of how languages encode lexical and syntactic information for the purpose of efficient communication.

