syntactic annotation
Recently Published Documents



Author(s):  
Hour Kaing ◽  
Chenchen Ding ◽  
Masao Utiyama ◽  
Eiichiro Sumita ◽  
Sethserey Sam ◽  
...  

As a highly analytic language, Khmer has considerable ambiguities in tokenization and part-of-speech (POS) tagging processing. This topic is investigated in this study. Specifically, a 20,000-sentence Khmer corpus with manual tokenization and POS-tagging annotation is released after a series of work over the last 4 years. This is the largest morphologically annotated Khmer dataset as of 2020, when this article was prepared. Based on the annotated data, experiments were conducted to establish a comprehensive benchmark on the automatic processing of tokenization and POS-tagging for Khmer. Specifically, a support vector machine, a conditional random field (CRF), a long short-term memory (LSTM)-based recurrent neural network, and an integrated LSTM-CRF model have been investigated and discussed. As a primary conclusion, processing at morpheme-level is satisfactory for the provided data. However, it is intrinsically difficult to identify further grammatical constituents of compounds or phrases because of the complex analytic features of the language. Syntactic annotation and automatic parsing for Khmer will be scheduled in the near future.


2021 ◽  
Vol 1 (1) ◽  
pp. 1-32
Author(s):  
Erica Biagetti ◽  
Oliver Hellwig ◽  
Salvatore Scarlata ◽  
Elia Ackermann ◽  
Paul Widmer

In this paper we introduce an extended version of the Vedic Treebank (VTB, Hellwig et al. 2020) which comes along with revisited and extended annotation guidelines. In order to assess the quality of our annotations as well as the usability and limits of the guidelines we performed an inter-annotator agreement test. The results show that agreement between annotators is hampered by various factors, most prominently by insufficient understanding of the content because of the cultural and temporal gap and incomplete knowledge of Vedic grammar. An in-depth discussion of disagreeing annotations demonstrates that the setup of the workflow, too, has a major influence on inter-annotator agreement. We suggest some measures that can help increase the transparency and annotation consistency according to current knowledge of the language when annotating Vedic Sanskrit, or ancient language varieties in general.
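
The abstract above does not say which agreement measure was used. As a minimal sketch, Cohen's kappa is one common choice for two annotators labelling the same tokens; it corrects raw agreement for the agreement expected by chance. The label names and the two annotation sequences below are hypothetical and are not taken from the Vedic Treebank.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical dependency labels assigned by two annotators to six tokens.
annotator_1 = ["nsubj", "obj", "obl", "nsubj", "advmod", "obj"]
annotator_2 = ["nsubj", "obj", "nsubj", "nsubj", "advmod", "obl"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on four of six tokens, but kappa is noticeably lower than the raw agreement rate because some of that agreement would be expected by chance.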


Author(s):  
Dana Halabi ◽  
Arafat Awajan ◽  
Ebaa Fayyoumi

Arabic dependency parsers perform poorly compared to parsers for other languages. Recently, the impact of lexical-level annotation in dependency treebanks on the overall performance of dependency parsers has been extensively investigated. This paper focuses on the impact of coarse-grained and fine-grained dependency relations on the performance of Arabic dependency parsers. Moreover, this paper introduces the annotation rules for the I3rab dependency treebank. Experimentally, the obtained results showed that having an appropriate set of dependency relations improves the performance of an Arabic dependency parser by up to 27.55%.
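
To make the coarse-grained versus fine-grained distinction concrete, the sketch below scores one parse with unlabeled and labeled attachment scores (UAS and LAS), once with fine-grained labels and once after mapping them to coarser ones. It covers only the scoring side; the gain reported in the abstract concerns parsers trained on different relation sets. The relation names and the mapping are hypothetical and are not the I3rab label set.

```python
def attachment_scores(gold, predicted, label_map=None):
    """Return (UAS, LAS) for one sentence.

    Each token is a (head_index, relation) pair. If label_map is given,
    relations are mapped to coarser labels before they are compared.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty parses of equal length")

    def relabel(relation):
        return label_map.get(relation, relation) if label_map else relation

    head_hits = 0
    labeled_hits = 0
    for (g_head, g_rel), (p_head, p_rel) in zip(gold, predicted):
        if g_head == p_head:
            head_hits += 1
            if relabel(g_rel) == relabel(p_rel):
                labeled_hits += 1
    n = len(gold)
    return head_hits / n, labeled_hits / n


# Hypothetical four-token sentence; head index 0 marks the root.
gold = [(2, "subj:verbal"), (0, "root"), (2, "obj"), (3, "mod:adj")]
pred = [(2, "subj:nominal"), (0, "root"), (2, "obj"), (2, "mod:adj")]

# Hypothetical mapping from fine-grained to coarse-grained relations.
coarse = {"subj:verbal": "subj", "subj:nominal": "subj"}

fine_scores = attachment_scores(gold, pred)
coarse_scores = attachment_scores(gold, pred, label_map=coarse)
print(fine_scores, coarse_scores)
```

The first token has the right head but the wrong fine-grained label, so it counts against LAS only under the fine-grained scheme; UAS is unaffected by label granularity.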


2020 ◽  
Vol 25 (1) ◽  
pp. 39-71
Author(s):  
Jelke Bloem

In this contribution, I discuss the use of automatic syntactic annotation in Dutch corpus research, using a case study of five-verb clusters. Large amounts of text can be annotated automatically, but the parser makes mistakes, while correct annotation is very important in linguistic research. How much of a problem is this, and how can we learn about the extent of these parsing mistakes? There are several approaches to evaluating the quality of automatic annotation for specific research questions. I demonstrate these approaches for the case study at hand, which will help us to make claims based on automatically annotated corpus data with greater confidence.


2020 ◽  
Vol 8 (2) ◽  
pp. 133-158
Author(s):  
María José López-Couso ◽  
Belén Méndez-Naya

This article discusses some of the potential problems derived from the syntactic annotation of historical corpora, especially in connection with low-frequency phenomena. By way of illustration, we examine the parsing scheme used in the Penn Parsed Corpora of Historical English (PPCHE) for clauses introduced by so-called ‘minor declarative complementizers’, originally adverbial links which come to be occasionally used in complementizer function. We show that the functional similarities between canonical declarative complement clauses introduced by the major declarative links that and zero and those headed by minor declarative complementizers are not captured by the PPCHE parsing, where the latter constructions are not tagged as complement clauses, but rather as adverbial clauses. The examples discussed reveal that, despite the obvious advantages of parsed corpora, annotation may sometimes mask interesting linguistic facts.


Author(s):  
A. V. Zimmerling

This paper offers a corpus analysis of the Russian verb быть ‘be’, which has an abnormal present tense paradigm including a zero form ØBE.PRES and overt forms естьBE.PRES and сутьBE.PRES, which do not discriminate person and number and are distributed syntactically. I discuss different approaches to the grammar of быть and argue that Apresjan’s model, which recognizes ØBE.PRES, естьBE.PRES and сутьBE.PRES as parts of one and the same lemma, is superior to alternative models that split быть into two lemmas representing copula vs content verb ‘be’. The peripheral status of overt present BE-forms compared with ØBE.PRES in the Russian National Corpus (RNC) is confirmed by three measures: 1) dispersion of texts where a BE-form occurs; 2) uneven coverage in different persons and numbers; 3) ratio of copular uses vs content verb uses. First- and second-person present tense BE-forms attested in the RNC are internal borrowings from Old Russian and Old Church Slavonic, while естьBE.PRES and сутьBE.PRES are inherited 3rd person elements which take over first- and second-person uses. The historical 3Pl суть is redundant in a system where a more frequent 3rd person form есть is licensed in the plural: it survives among a minority of speakers either as an optional 3Pl copula in formal discourse or as an emphatic copula in oral discourse. The form естьBE.PRES occurs in all persons and numbers both as content verb and as copula but is underrepresented as 3Pl copula: this gap is filled by ØBE.PRES. The frequency of the zero copula ØBE.PRES can be measured in corpora without syntactic annotation on the basis of systemic proportion between present vs past tense uses of быть and on the basis of approximation samples for contexts where overt copulas alternate with ØBE.PRES.


2019 ◽  
Vol 34 (4) ◽  
pp. 283-294
Author(s):  
Huyen T M Nguyen ◽  
Quyen T Ngo ◽  
Luong X Vu ◽  
Vu M Tran ◽  
Hien T T Nguyen

Named entities (NE) are phrases that contain the names of persons, organizations, locations, times and quantities, monetary values, percentages, etc. Named Entity Recognition (NER) is the task of recognizing named entities in documents. NER is an important subtask of Information Extraction, which has attracted researchers all over the world since the 1990s. For Vietnamese, although some research projects and publications on the NER task existed before 2016, no systematic comparison of the performance of NER systems had been done. In 2016, the organizing committee of the VLSP workshop decided to launch the first NER shared task, in order to get an objective evaluation of Vietnamese NER systems and to promote the development of high-quality systems. As a result, the first dataset with morpho-syntactic and NE annotations was released for benchmarking NER systems. At VLSP 2018, the NER shared task was organized for the second time, providing a bigger dataset containing texts from various domains, but without morpho-syntactic annotation. These resources are available for research purposes via the VLSP website vlsp.org.vn/resources. In this paper, we describe the datasets as well as the evaluation results obtained from these two campaigns.


FORUM ◽  
2018 ◽  
Vol 16 (2) ◽  
pp. 241-264
Author(s):  
Daria Dayter

The paper introduces a corpus of simultaneous interpretation, SIREN. SIREN is a parallel aligned bidirectional corpus of original and simultaneously interpreted speech in Russian and English. At the moment the corpus contains 235,040 words and is enriched with POS and shallow syntactic annotation. After outlining the corpus design, I use scores for lexical variety, density and POS proportionalities to make tentative claims about the linguistic variation between originals and interpretations. Low lexical variety and density are taken as indicators of simplification, while a higher ratio of nominal to pronominal reference is seen as an indicator of explicitation. Atypical wordclass distribution indicates the source language shining through. Somewhat contradictory results, with the Russian subcorpus conforming to the predictions of translation theory and the English subcorpus exhibiting the opposite trend in all universals but still shining through, invite further investigation of the data and once again put into question unequivocal claims about T-universals.
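
The abstract does not give the exact formulas behind its scores. As a rough sketch under common definitions, lexical variety can be approximated by the type-token ratio, lexical density by the share of content words, and the nominal-to-pronominal ratio by dividing noun tokens by pronoun tokens. The tag names assume a Universal Dependencies style tag set, and the sample sentence is hypothetical; neither is taken from SIREN.

```python
# Assumption: UPOS-style tags; content words are nouns, verbs, adjectives, adverbs.
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def lexical_variety(tokens):
    """Type-token ratio over (word, tag) pairs, ignoring case."""
    if not tokens:
        raise ValueError("empty token list")
    words = [word.lower() for word, _ in tokens]
    return len(set(words)) / len(words)


def lexical_density(tokens):
    """Share of tokens whose tag marks a content word."""
    if not tokens:
        raise ValueError("empty token list")
    return sum(tag in CONTENT_TAGS for _, tag in tokens) / len(tokens)


def nominal_to_pronominal(tokens):
    """Noun tokens divided by pronoun tokens; None if there are no pronouns."""
    nouns = sum(tag in {"NOUN", "PROPN"} for _, tag in tokens)
    pronouns = sum(tag == "PRON" for _, tag in tokens)
    return nouns / pronouns if pronouns else None


# Hypothetical POS-tagged sample of ten tokens.
sample = [
    ("the", "DET"), ("delegate", "NOUN"), ("said", "VERB"),
    ("the", "DET"), ("delegate", "NOUN"), ("agreed", "VERB"),
    ("fully", "ADV"), ("with", "ADP"), ("them", "PRON"), ("today", "NOUN"),
]
print(lexical_variety(sample), lexical_density(sample), nominal_to_pronominal(sample))
```

The type-token ratio falls as texts get longer, so comparisons between subcorpora are only meaningful on samples of similar size or with a length-normalized variant.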

