Corpus annotation

2021 ◽  
pp. 110-135
Author(s):  
Danielle Barth ◽  
Stefan Schnell
Keyword(s):  
Corpora

2009 ◽  
Vol 4 (2) ◽  
pp. 189-208
Author(s):  
Yufang Qian ◽  
Scott Piao

In this paper, we propose a corpus annotation scheme and lexicon for Chinese kinship terms. We adapt existing traditional Chinese kinship schemes into a comprehensive semantic field framework that covers kinship semantic categories in contemporary Chinese. The scheme is inspired by the Lancaster USAS (UCREL Semantic Analysis System) taxonomy, which contains categories for English kinship terms. We show how our scheme works with a Chinese kinship semantic lexicon which covers parents, siblings, marital relations, offspring and same-sex partnerships. The kinship lexicon was created through a pilot study involving the Lancaster University Mandarin Corpus. We foresee that our annotation scheme and lexicon will provide a framework and resource for the kinship annotation of Chinese corpora and corpus-based kinship studies.
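
As a rough illustration of how a kinship lexicon might drive annotation, the sketch below tags kinship terms in a string by greedy longest-match lookup. It is not the authors' implementation: the eight lexicon entries, the category labels (borrowed from the field names listed in the abstract) and the function name `tag_kinship` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-lexicon for illustration only; the published lexicon
# and its tag labels are not reproduced here.
KINSHIP_LEXICON: Dict[str, str] = {
    "父亲": "parents",            # father
    "母亲": "parents",            # mother
    "哥哥": "siblings",           # older brother
    "妹妹": "siblings",           # younger sister
    "丈夫": "marital relations",  # husband
    "妻子": "marital relations",  # wife
    "儿子": "offspring",          # son
    "女儿": "offspring",          # daughter
}


def tag_kinship(
    text: str, lexicon: Dict[str, str] = KINSHIP_LEXICON
) -> List[Tuple[int, str, str]]:
    """Return (start offset, term, category) for every lexicon term in text.

    At each position the longest matching lexicon entry wins, so the
    unsegmented Chinese text does not need to be tokenised first.
    """
    max_len = max(len(term) for term in lexicon)
    matches: List[Tuple[int, str, str]] = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                matches.append((i, candidate, lexicon[candidate]))
                i += length  # skip past the matched term
                break
        else:
            i += 1  # no term starts here; move one character on
    return matches


if __name__ == "__main__":
    for offset, term, category in tag_kinship("我的父亲和妹妹"):
        print(offset, term, category)
```

A dictionary lookup like this ignores context, so it would need manual checking or disambiguation before being used for real corpus annotation.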


2012 ◽  
Author(s):  
Felipe Rodrigues ◽  
Richard Semolini ◽  
Norton Trevisan Roman ◽  
Ana Maria Monteiro

This paper describes TSeg, a Java application that allows for both manual and automatic segmentation of a source text into basic units of annotation. TSeg provides a straightforward way to approach this task through a clear point-and-click interface. Once the text segmentation is finished, the application outputs an XML file that may be used as input to more problem-specific annotation software. Hence, TSeg separates the identification of basic units of annotation from the task of annotating these units, making it possible for both problems to be analysed in isolation, thereby reducing the cognitive load on the user and preventing potential damage to the overall outcome of the annotation process.
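
The segment-then-export workflow described above can be sketched in a few lines. This is a Python illustration under stated assumptions, not TSeg itself (which is a Java application): the punctuation-based splitting rule stands in for whatever automatic segmentation TSeg uses, and the XML element names `document` and `unit` are hypothetical, since the abstract does not describe the output schema.

```python
import re
import xml.etree.ElementTree as ET
from typing import List


def segment_sentences(text: str) -> List[str]:
    """Split text into units after '.', '!' or '?' followed by whitespace.

    A deliberately naive rule, used here only as a placeholder for an
    automatic segmentation step.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def segments_to_xml(segments: List[str]) -> str:
    """Serialise the units as XML so a separate annotation tool can read them.

    Element and attribute names are assumptions made for this sketch.
    """
    root = ET.Element("document")
    for index, segment in enumerate(segments, start=1):
        unit = ET.SubElement(root, "unit", id=str(index))
        unit.text = segment
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    units = segment_sentences("First unit. Second unit! Third?")
    print(segments_to_xml(units))
```

Keeping segmentation and serialisation in separate functions mirrors the separation the abstract argues for: the units can be corrected by hand before export without touching the annotation step that consumes the XML.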


2020 ◽  
Vol 27 (4) ◽  
pp. 889-931
Author(s):  
Yudai Kishimoto ◽  
Yugo Murawaki ◽  
Daisuke Kawahara ◽  
Sadao Kurohashi

2021 ◽  
Vol 29 (2) ◽  
pp. 859
Author(s):  
Márcio De Souza Dias ◽  
Ariani Di Felippo ◽  
Amanda Pontes Rassi ◽  
Paula Cristina Figueira Cardoso ◽  
Fernando Antônio Asevedo Nóbrega ◽  
...  

Abstract: Automatic summaries commonly present diverse linguistic problems that affect textual quality and thus their understanding by users. Few studies have tried to characterize such problems and their relation to the performance of summarization systems. In this paper, we investigated the problems in multi-document extracts (i.e., summaries produced by concatenating several sentences taken exactly as they appear in the source texts) generated by systems for Brazilian Portuguese that have different approaches (i.e., superficial and deep) and performances (i.e., baseline and state-of-the-art methods). To do so, we first reviewed the main characterization studies, resulting in a typology of linguistic problems more suitable for multi-document summarization. Then, we manually annotated a corpus of automatic multi-document extracts in Portuguese based on the typology, which showed that some linguistic problems are significantly more recurrent than others. Thus, this corpus annotation may support research on the detection and correction of linguistic problems for summary improvement, allowing the production of automatic summaries that are not only informative (i.e., they convey the content of the source material), but also linguistically well structured.

Keywords: automatic summarization; multi-document summary; linguistic problem; corpus annotation.


Author(s):  
A. McEnery ◽  
I. Tanaka ◽  
S. Botley

Author(s):  
John Newman ◽  
Christopher Cox
