Improving term candidates selection using terminological tokens

Terminology ◽  
2018 ◽  
Vol 24 (1) ◽  
pp. 122-147
Author(s):  
Mercè Vàzquez ◽  
Antoni Oliver

Abstract The identification of reliable terms from domain-specific corpora using computational methods is a task that has to be validated manually by specialists, which is a highly time-consuming activity. To reduce this effort and improve term candidate selection, we implemented the Token Slot Recognition method, a filtering method based on terminological tokens which is used to rank term candidates extracted from domain-specific corpora. This paper presents the implementation of this term candidate filtering method in linguistic and statistical approaches to automatic term extraction, using several domain-specific corpora in different languages. We observed that the filtering method improves term candidate selection, ranking a higher number of terms at the top of the term candidate list than raw frequency does, and for statistical term extraction the improvement is between 15% and 25% in both precision and recall. Our analyses further revealed a reduction in the number of term candidates to be validated manually by specialists. In conclusion, the number of term candidates extracted automatically from domain-specific corpora has been reduced significantly using the Token Slot Recognition filtering method, so term candidates can be easily and quickly validated by specialists.
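The ranking idea in this abstract can be illustrated with a minimal sketch. It is a simplification and not the authors' implementation: it assumes that a "slot" is a token's position within a term of a given length, that a small list of already validated reference terms is available, and that candidates are scored by how many of their tokens fill a slot seen in those reference terms, with raw frequency used only to break ties. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def build_slot_index(reference_terms):
    """Map each token to the (term_length, position) slots it fills in known terms."""
    slots = defaultdict(set)
    for term in reference_terms:
        tokens = term.lower().split()
        for pos, tok in enumerate(tokens):
            slots[tok].add((len(tokens), pos))
    return slots


def slot_score(candidate, slots):
    """Count the candidate's tokens that sit in a slot already seen in a known term."""
    tokens = candidate.lower().split()
    n = len(tokens)
    return sum(1 for pos, tok in enumerate(tokens) if (n, pos) in slots.get(tok, ()))


def rank_candidates(candidates, reference_terms):
    """Rank candidates (dict: candidate -> frequency) by slot score, then frequency."""
    slots = build_slot_index(reference_terms)
    return sorted(candidates, key=lambda c: (-slot_score(c, slots), -candidates[c], c))


# Hypothetical data, for illustration only.
reference = ["blood pressure", "heart rate"]
candidates = {"high pressure": 5, "blood rate": 2, "the result": 9, "blood pressure": 1}
ranked = rank_candidates(candidates, reference)
```

In this toy run the most frequent candidate, "the result", drops to the bottom of the list because none of its tokens occupies a known slot, which is the kind of reordering relative to raw frequency that the abstract describes.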

Terminology ◽  
2015 ◽  
Vol 21 (2) ◽  
pp. 151-179 ◽  
Author(s):  
Carlos Periñán-Pascual

The corpus-based identification of those lexical units which serve to describe a given specialized domain usually becomes a complex task, where an analysis oriented to the frequency of words and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate the diversity of domain-specific glossaries to be constructed from small- and medium-sized specialized corpora of non-structured texts. Unlike most research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded on the theoretical principles of salience, relevance and cohesion, which have been rationally implemented in the three components of this metric.


2017 ◽  
Vol 24 (2) ◽  
pp. 163-198 ◽  
Author(s):  
Carlos Periñán-Pascual

Abstract Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those oriented to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distribution of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded on the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.


Terminology ◽  
2014 ◽  
Vol 20 (2) ◽  
pp. 151-170 ◽  
Author(s):  
Katia Peruzzo

The paper examines the possible usage of event templates derived from Frame-Based Terminology (Faber et al. 2005, 2006, 2007) as an aid to the extraction and management of legal terminology embedded in the multi-level legal system of the European Union. The method proposed here, which combines semi-automatic term extraction and a simplified event template containing six categories, is applied to an English corpus of EU texts focusing on victims of crime and their rights. Such a combination allows for the extraction of category-relevant terminological units and additional information, which can then be used for populating a terminological knowledge base organised on the basis of the same event template, but which also employs additional classification criteria to account for the multidimensionality encountered in the corpus.


Terminology ◽  
2022 ◽  
Author(s):  
Ayla Rigouts Terryn ◽  
Véronique Hoste ◽  
Els Lefever

Abstract As with many tasks in natural language processing, automatic term extraction (ATE) is increasingly approached as a machine learning problem. So far, most machine learning approaches to ATE broadly follow the traditional hybrid methodology, by first extracting a list of unique candidate terms, and classifying these candidates based on the predicted probability that they are valid terms. However, with the rise of neural networks and word embeddings, the next development in ATE might be towards sequential approaches, i.e., classifying each occurrence of each token within its original context. To test the validity of such approaches for ATE, two sequential methodologies were developed, evaluated, and compared: one feature-based conditional random fields classifier and one embedding-based recurrent neural network. An additional comparison was added with a machine learning interpretation of the traditional approach. All systems were trained and evaluated on identical data in multiple languages and domains to identify their respective strengths and weaknesses. The sequential methodologies were proven to be valid approaches to ATE, and the neural network even outperformed the more traditional approach. Interestingly, a combination of multiple approaches can outperform all of them separately, showing new ways to push the state-of-the-art in ATE.
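To make the contrast with candidate-list approaches concrete, the sketch below shows how term annotations might be turned into per-token labels for a sequential classifier. It assumes the common IOB encoding (B begins a term, I continues it, O is outside any term) and exact, case-insensitive matching of whitespace-separated tokens; the abstract does not specify the encoding used, and the function name and sample sentence are hypothetical.

```python
def iob_labels(tokens, terms):
    """Label each token B (begins a term), I (inside a term) or O (outside)."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Try longer terms first so they take priority over terms nested inside them.
    for term in sorted(terms, key=lambda t: -len(t.split())):
        parts = term.lower().split()
        n = len(parts)
        if n == 0:
            continue  # skip empty annotations
        for i in range(len(tokens) - n + 1):
            span_is_free = all(label == "O" for label in labels[i:i + n])
            if lowered[i:i + n] == parts and span_is_free:
                labels[i] = "B"
                labels[i + 1:i + n] = ["I"] * (n - 1)
    return labels


# Hypothetical sentence and term list, for illustration only.
tokens = "Heart failure reduces blood pressure".split()
terms = ["heart failure", "blood pressure", "pressure"]
labels = iob_labels(tokens, terms)
```

Each occurrence of each token receives its own label in context, so a sequence model such as a conditional random field or a recurrent network can be trained on the (token, label) pairs directly instead of on a list of unique candidates.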

