Term extraction from sparse, ungrammatical domain-specific documents

2013 ◽  
Vol 40 (7) ◽  
pp. 2530-2540 ◽  
Author(s):  
Ashwin Ittoo ◽  
Gosse Bouma

Terminology ◽ 
2015 ◽  
Vol 21 (2) ◽  
pp. 151-179 ◽  
Author(s):  
Carlos Periñán-Pascual

The corpus-based identification of the lexical units that describe a given specialized domain is usually a complex task, for which an analysis based on word frequency and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate the diversity of domain-specific glossaries to be constructed from small- and medium-sized specialized corpora of unstructured texts. Unlike most research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded in the theoretical principles of salience, relevance and cohesion, which have been rationally implemented in the metric's three components.
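The abstract does not give SRC's formulas, but the idea of a user-adjustable composite of salience, relevance and cohesion scores can be sketched as follows. The component functions below are illustrative stand-ins, not the published metric:

```python
# Hypothetical sketch of a user-adjustable composite term-scoring metric
# in the spirit of SRC. All three component functions are illustrative
# assumptions, not the formulas from the article.

def salience(freq, doc_count, total_docs):
    """Reward candidates that are frequent and spread across documents."""
    return freq * (doc_count / total_docs)

def relevance(freq_domain, freq_general):
    """Reward candidates more frequent in the domain corpus than in a
    general reference corpus."""
    return freq_domain / (freq_domain + freq_general + 1e-9)

def cohesion(ngram_freq, part_freqs):
    """Reward multiword candidates whose parts co-occur more often than
    their individual relative frequencies would suggest."""
    expected = 1.0
    for f in part_freqs:
        expected *= f
    return ngram_freq / (expected + 1e-9)

def src_score(stats, weights=(1.0, 1.0, 1.0)):
    """User-adjustable weighted combination of the three components,
    so the metric can be tuned to different domains and corpus sizes."""
    w_s, w_r, w_c = weights
    return (w_s * salience(stats["freq"], stats["docs"], stats["total_docs"])
            + w_r * relevance(stats["freq"], stats["general_freq"])
            + w_c * cohesion(stats["freq"], stats["part_freqs"]))
```

The weights make the metric adjustable: a user building a glossary from a very small corpus might down-weight salience, where frequency counts are unreliable, and up-weight relevance against a general corpus.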


2017 ◽  
Vol 24 (2) ◽  
pp. 163-198 ◽  
Author(s):  
Carlos Periñán-Pascual

Abstract Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those intended to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distributions of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded on the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.
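The second issue above, detecting corpus-specific stopwords by comparing frequency distributions, can be illustrated with a minimal sketch. This is not DEXTER's actual code (which also consults the IATE database); it only shows the frequency-contrast part, with the ratio threshold as an assumed parameter:

```python
# Illustrative sketch (not DEXTER's implementation) of flagging stopwords
# by contrasting a word's relative frequency in the domain corpus with
# its relative frequency in a general reference corpus.

from collections import Counter

def detect_stopwords(domain_tokens, general_tokens, threshold=1.0):
    """Flag words that are no more prominent in the domain corpus than in
    the general corpus, i.e. whose domain/general relative-frequency
    ratio falls at or below the threshold."""
    domain_freq = Counter(domain_tokens)
    general_freq = Counter(general_tokens)
    n_dom = sum(domain_freq.values())
    n_gen = sum(general_freq.values())
    stop = set()
    for word, f in domain_freq.items():
        rel_dom = f / n_dom
        rel_gen = general_freq[word] / n_gen
        if rel_gen and rel_dom / rel_gen <= threshold:
            stop.add(word)
    return stop
```

Words that never occur in the general corpus are left alone, since a zero general frequency is itself evidence of domain specificity.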


Terminology ◽  
2018 ◽  
Vol 24 (1) ◽  
pp. 122-147 ◽ 
Author(s):  
Mercè Vàzquez ◽  
Antoni Oliver

Abstract The identification of reliable terms from domain-specific corpora using computational methods is a task whose results have to be validated manually by specialists, a highly time-consuming activity. To reduce this effort and improve term candidate selection, we implemented the Token Slot Recognition method, a filtering method based on terminological tokens that ranks term candidates extracted from domain-specific corpora. This paper presents the implementation of this filtering method in both linguistic and statistical approaches to automatic term extraction, applied to several domain-specific corpora in different languages. We observed that the filtering method outperforms raw-frequency ranking by placing a higher number of true terms at the top of the candidate list; for statistical term extraction the improvement is between 15% and 25% in both precision and recall. Our analyses further revealed a reduction in the number of term candidates that specialists have to validate manually. In conclusion, the Token Slot Recognition filtering method significantly reduces the number of term candidates extracted automatically from domain-specific corpora, so candidates can be validated easily and quickly by specialists.
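The core idea of a filter based on terminological tokens can be sketched in a few lines. The scoring rule below (fraction of a candidate's tokens that are known terminological tokens, with frequency as a tie-breaker) is an assumption for illustration, not the authors' published formula:

```python
# Minimal sketch in the spirit of Token Slot Recognition: candidates are
# ranked by how many of their token slots are filled by tokens already
# attested in validated terms. The scoring is an illustrative assumption.

def tsr_rank(candidates, terminological_tokens):
    """Rank (term, frequency) pairs by the fraction of tokens that are
    known terminological tokens, breaking ties by frequency."""
    def score(item):
        term, freq = item
        tokens = term.split()
        hits = sum(1 for t in tokens if t in terminological_tokens)
        return (hits / len(tokens), freq)
    return sorted(candidates, key=score, reverse=True)
```

The effect matches the abstract's claim: a frequent but non-terminological candidate sinks below rarer candidates built from terminological tokens, which is what pure frequency ranking gets wrong.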


Author(s):  
Wilson Wong

As more electronic text becomes readily available, and more applications become knowledge-intensive and ontology-enabled, term extraction, also known as automatic term recognition or terminology mining, is increasingly in demand. This chapter first presents a comprehensive review of existing techniques, discusses several issues and open problems that prevent such techniques from being practical in real-life applications, and then proposes solutions to address these issues. Keeping abreast of recent advances in related areas such as text mining, we propose new measures for the determination of unithood, and a new scoring and ranking scheme for measuring termhood to recognise domain-specific terms. The chapter concludes with experiments that demonstrate the advantages of our new approach.
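To make the unithood/termhood distinction concrete: unithood measures how strongly a word sequence forms a stable unit, while termhood measures how specific a unit is to the domain. One classic termhood measure of the latter kind is "weirdness" (Ahmad et al.), a ratio of relative frequencies; the chapter proposes its own measures, so this is only a representative example:

```python
# Sketch of the classic "weirdness" termhood measure: the ratio of a
# candidate's relative frequency in the domain corpus to its relative
# frequency in a general reference corpus. The chapter's own measures
# differ; this is a representative baseline, with add-one smoothing
# as an assumed choice for words unseen in the general corpus.

def weirdness(freq_domain, size_domain, freq_general, size_general):
    """High values mean the candidate is far more typical of the
    domain corpus than of general language."""
    rel_dom = freq_domain / size_domain
    rel_gen = (freq_general + 1) / (size_general + 1)  # smoothing
    return rel_dom / rel_gen
```

A word like "ontology" scores high against a general-English corpus, while function words score near (or below) 1 and are filtered out.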


Terminology ◽  
2015 ◽  
Vol 21 (2) ◽  
pp. 205-236 ◽  
Author(s):  
Robert Gaizauskas ◽  
Monica Lestari Paramita ◽  
Emma Barker ◽  
Marcis Pinnis ◽  
Ahmet Aker ◽  
...  

In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System) designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligners. BiTES is readily extendable to new language pairs and has been successfully used to gather bilingual terminology for 24 language pairs, including English and all official EU languages, save Irish. Second, we describe a novel set of methods for evaluating the main components of BiTES and present the results of our evaluation for six language pairs. Results show that the BiTES approach can be used to successfully harvest quality bilingual term pairs from the Web. Our evaluation method delivers significant insights about the strengths and weaknesses of our techniques. It can be straightforwardly reused to evaluate other bilingual term extraction systems and makes a novel contribution to the study of how to evaluate bilingual terminology extraction systems.
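One weak but cheap signal a bilingual term aligner can exploit is surface similarity between candidate terms, which catches cognate pairs across related languages. BiTES combines several far richer components (domain classifiers, extraction systems, aligners), so the greedy edit-similarity matcher below is only a toy illustration, with the similarity threshold as an assumed parameter:

```python
# Toy sketch of one alignment signal: normalized string similarity
# between source- and target-language candidates, which pairs cognates
# such as "information"/"información". This is NOT BiTES's actual
# alignment method, only an illustration of the pairing step.

from difflib import SequenceMatcher

def align_pairs(source_terms, target_terms, min_similarity=0.7):
    """Greedily pair each source term with its most similar target term,
    keeping pairs whose similarity clears the threshold."""
    pairs = []
    for s in source_terms:
        best = max(target_terms,
                   key=lambda t: SequenceMatcher(None, s.lower(), t.lower()).ratio())
        sim = SequenceMatcher(None, s.lower(), best.lower()).ratio()
        if sim >= min_similarity:
            pairs.append((s, best))
    return pairs
```

Cognate matching fails for unrelated language pairs, which is one reason a production system layers multiple alignment components and evaluates each one separately, as the paper's evaluation methodology does.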

