Proceedings of the 13th Linguistic Annotation Workshop

Design and development of Iberia: a corpus of scientific Spanish

Corpora

10.3366/cor.2011.0010

2011

Vol 6 (2)

pp. 145-158

Cited By ~ 2

Author(s):

Jordi Porta Zamorano

Emilio del Rosal García

Ignacio Ahumada Lara

Keyword(s):

User Interface

Design And Development

Linguistic Annotation

Iberia is a synchronic corpus of scientific Spanish designed mainly for terminological studies. In this paper, we describe its design and the infrastructure for its acquisition, processing and exploitation, including mark-up, linguistic annotation, indexing and the user interface. Two pre-processing tasks affecting a large number of words are described in detail: de-hyphenation and identification of text fragments in other languages. We also show how some of the reported statistics, namely, dispersion and association, are used for research on lexis.


Proceedings of The 9th Linguistic Annotation Workshop

10.3115/v1/w15-16

2015

Keyword(s):

Linguistic Annotation


Initial Step of Specialized Corpora Building: Cleaning Procedures

NORDSCI Conference proceedings, Book 1 Volume 3

10.32008/nordsci2020/b1/v3/16

2020

Author(s):

Vera Yakubson

Victor Zakharov

Keyword(s):

Text Processing

Research Question

Initial Step

Main Research

Academic Texts

Linguistic Annotation

Significant Difference

Cleaning Procedures

Unique Source

Future Work

This paper deals with building specialized corpora, specifically an academic-language corpus in the biotechnology field. As part of a larger research project devoted to the creation and use of a specialized parallel corpus, it analyzes the initial step of corpus building. Our main research question was which procedures need to be applied to the texts before they are used to develop the corpus. A review of previous research showed a significant number of papers devoted to corpus creation, including specialized academic corpora. These studies analyzed different aspects of the process, including the types of texts used, the principles of crawling, the recommended length of texts, etc. As for text processing for corpus creation, only linguistic annotation issues had been examined previously. At the same time, preliminary cleaning of texts before their use in corpora may have a significant influence on corpus quality and on the corpus's utility for linguistic research. In this paper, we considered three small corpora derived from the same set of academic texts in the biotechnology field: a “raw” corpus without any preliminary cleaning and two corpora with different levels of cleaning. Using different Sketch Engine tools, we analyzed these corpora from the perspective of their future users, predominantly as sources for academic wordlists and specialized multi-word units. The research showed very little difference between the two cleaned corpora, meaning that only basic cleaning procedures, such as removal of reference lists, can be useful in corpus design. At the same time, we found a significant difference between the raw and cleaned corpora and argue that this difference can affect the quality of wordlist and multi-word term extraction; therefore, these cleaning procedures are meaningful.
The main limitation of the study is that all texts were taken from a single source, so the conclusions could be affected by this specific journal’s peculiarities. Therefore, future work should verify the results on different text collections.
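The basic cleaning step named in the abstract, removal of reference lists, might look roughly like the Python sketch below. This is only an illustration under stated assumptions, not the authors' procedure: the heading pattern `REFERENCE_HEADING` and the function name `strip_reference_list` are hypothetical, and the sketch assumes each text is plain text with the reference list introduced by a heading on a line of its own.

```python
import re

# Hypothetical heading pattern (an assumption, not taken from the paper);
# real journals may label the reference section differently.
REFERENCE_HEADING = re.compile(
    r"^\s*(references|bibliography|works cited)\s*$", re.IGNORECASE
)


def strip_reference_list(text: str) -> str:
    """Drop everything from the last reference-list heading to the end."""
    lines = text.splitlines()
    cut = None
    for i, line in enumerate(lines):
        if REFERENCE_HEADING.match(line):
            cut = i  # remember the last matching heading
    if cut is None:
        return text  # no heading found: leave the text untouched
    return "\n".join(lines[:cut]).rstrip()
```

A text with no recognizable heading is returned unchanged, so the step is safe to run over a mixed collection before loading it into a corpus tool.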


The linguistic annotation system of the Stockholm

Proceedings of the sixth conference on European chapter of the Association for Computational Linguistics

10.3115/976744.976808

1993

Cited By ~ 6

Author(s):

Gunnel Källgren

Gunnar Eriksson

Keyword(s):

Annotation System

Linguistic Annotation


Beyond the Edition: On the Linguistic Annotation of Vernacular Texts

Ars Edendi Lecture Series, vol. V

10.16993/bbd.e

2020

pp. 67-88

Author(s):

Odd Einar Haugen

Keyword(s):

Linguistic Annotation


Empirical evidence for prosodic phrasing: pauses as linguistic annotation in Korean read speech

10.21437/interspeech.2007-215

2007

Author(s):

Hyongsil Cho

Daniel Hirst

Keyword(s):

Empirical Evidence

Prosodic Phrasing

Linguistic Annotation

Read Speech


Amalgamated Approach for Devanagari Script Corpus for OCR & Demographic Purpose and XML for Linguistic Annotation

2017 13th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS)

10.1109/sitis.2017.50

2017

Cited By ~ 1

Author(s):

Maninder Singh Nehra

Neeta Nain

Mushtaq Ahmed

Prakash Choudhary

Deepa Modi

Keyword(s):

Linguistic Annotation


OLiA – Ontologies of Linguistic Annotation

Semantic Web

10.3233/sw-140167

2015

Vol 6 (4)

pp. 379-386

Cited By ~ 21

Author(s):

Christian Chiarcos

Maria Sukhareva

Keyword(s):

Linguistic Annotation


Scope definition in software product lines: A semi-automatic approach through linguistic annotation

2012 XXXVIII Conferencia Latinoamericana En Informatica (CLEI)

10.1109/clei.2012.6427193

2012

Cited By ~ 1

Author(s):

Andressa Ianzen

Andreia Malucelli

Sheila Reinehr

Keyword(s):

Software Product Lines

Product Lines

Scope Definition

Linguistic Annotation

Software Product


Abstractive Thai Opinion Summarization

Advanced Materials Research

10.4028/www.scientific.net/amr.971-973.2273

2014

Vol 971-973

pp. 2273-2280

Author(s):

Orawan Chaowalit

Ohm Sornil

Keyword(s):

Word Identification

Internet Technology

The Internet

Global Models

Customer Reviews

Thai Language

Novel Technique

Opinion Summarization

Linguistic Annotation

Daunting Task

With advances in Internet technology, customers can easily share opinions on services and products in the form of reviews. Popular products can attract large numbers of reviews, and manually summarizing those reviews for important issues is a daunting task. Automatic opinion summarization is a solution to the problem. The task is more complicated for reviews written in the Thai language: Thai words are written continuously without spaces, there is no symbol to signify the end of a sentence, and many reviews are written informally, so accurate word identification and linguistic annotation cannot be relied upon. This research proposes a novel technique to generate abstractive summaries of customer reviews written in Thai. The proposed technique, which consists of local and global models, is evaluated using actual reviews of fifty randomly selected products from a popular cosmetics website. The results show that the local model outperforms the global model and the two baseline methods both quantitatively and qualitatively.
