Improving term candidates selection using terminological tokens

Abstract The identification of reliable terms from domain-specific corpora using computational methods is a task that has to be validated manually by specialists, which is a highly time-consuming activity. To reduce this effort and improve term candidate selection, we implemented the Token Slot Recognition method, a filtering method based on terminological tokens which is used to rank extracted term candidates from domain-specific corpora. This paper presents the implementation of the term candidates filtering method we developed in linguistic and statistical approaches applied for automatic term extraction using several domain-specific corpora in different languages. We observed that the filtering method outperforms term candidate selection by ranking a higher number of terms at the top of the term candidate list than raw frequency, and for statistical term extraction the improvement is between 15% and 25% both in precision and recall. Our analyses further revealed a reduction in the number of term candidates to be validated manually by specialists. In conclusion, the number of term candidates extracted automatically from domain-specific corpora has been reduced significantly using the Token Slot Recognition filtering method, so term candidates can be easily and quickly validated by specialists.
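The abstract does not spell out how Token Slot Recognition scores candidates, so the following is only a minimal sketch of the general idea of re-ranking a frequency list with known terminological tokens. The scoring function `token_score`, the helper `rank_candidates`, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def token_score(candidate, terminological_tokens):
    """Fraction of the candidate's tokens found in the known-token set.

    Hypothetical score for illustration; not the published method.
    """
    tokens = candidate.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in terminological_tokens)
    return hits / len(tokens)


def rank_candidates(candidate_freqs, terminological_tokens):
    """Sort candidates by token score, then by raw frequency (both descending)."""
    return sorted(
        candidate_freqs,
        key=lambda c: (token_score(c, terminological_tokens), candidate_freqs[c]),
        reverse=True,
    )


# Toy candidate list with raw frequencies, invented for illustration only.
candidate_freqs = Counter({
    "last year": 12,
    "blood pressure": 7,
    "high blood pressure": 3,
    "same time": 9,
})
# Assumed set of tokens already known to occur in validated terms.
terminological_tokens = {"blood", "pressure"}

ranked = rank_candidates(candidate_freqs, terminological_tokens)
print(ranked)
```

In this toy case a pure frequency ranking would put "last year" first, whereas the token-based ranking moves the two candidates containing known terminological tokens to the top of the list.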

The underpinnings of a composite measure for automatic term extraction

Terminology ◽

10.1075/term.21.2.02per ◽

2015 ◽

Vol 21 (2) ◽

pp. 151-179 ◽

Cited By ~ 2

Author(s):

Carlos Periñán-Pascual

Keyword(s):

Complex Task ◽

Composite Measure ◽

Domain Specific ◽

Term Extraction ◽

Automatic Term Extraction ◽

Three Components

The corpus-based identification of those lexical units which serve to describe a given specialized domain usually becomes a complex task, where an analysis oriented to the frequency of words and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate the diversity of domain-specific glossaries to be constructed from small- and medium-sized specialized corpora of non-structured texts. Unlike most research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded on the theoretical principles of salience, relevance and cohesion, which have been rationally implemented in the three components of this metric.

DEXTER: A workbench for automatic term extraction with specialized corpora

Natural Language Engineering ◽

10.1017/s1351324917000365 ◽

2017 ◽

Vol 24 (2) ◽

pp. 163-198 ◽

Cited By ~ 2

Author(s):

Carlos Periñán-Pascual

Keyword(s):

Frequency Distribution ◽

Extensive Literature ◽

Design And Development ◽

Priority Area ◽

Domain Specific ◽

Term Extraction ◽

Automatic Term Extraction ◽

A Priority ◽

Support Research

Abstract Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those oriented to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distribution of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded on the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.
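The abstract only says that stopword detection combines the IATE database with the frequency distributions of a domain-specific and a general corpus. As a rough illustration of the frequency-comparison part alone, the sketch below flags words whose relative frequency in the domain corpus is not much higher than in the general corpus. The function names, the `max_ratio` threshold, and the toy corpora are assumptions, and the IATE lookup is not modelled at all.

```python
from collections import Counter


def relative_freqs(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def common_word_candidates(domain_tokens, general_tokens, max_ratio=1.5):
    """Words whose domain/general relative-frequency ratio is at most max_ratio.

    max_ratio is an arbitrary illustrative threshold, not a published value.
    Words absent from the general corpus are never flagged.
    """
    dom = relative_freqs(domain_tokens)
    gen = relative_freqs(general_tokens)
    return {w for w, f in dom.items() if w in gen and f / gen[w] <= max_ratio}


# Toy corpora, invented for illustration only.
domain_tokens = "the valve controls the flow of the coolant".split()
general_tokens = "the cat sat on the mat and the end of time".split()

common = common_word_candidates(domain_tokens, general_tokens)
print(sorted(common))
```

Here "the" and "of" are about as frequent in both toy corpora and are flagged as common, while "valve", "flow" and "coolant" do not occur in the general corpus and are kept as potential term material.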

Automatic Term Extraction on Turkish Scientific Texts

2020 International Conference on Decision Aid Sciences and Application (DASA) ◽

10.1109/dasa51403.2020.9317125 ◽

2020 ◽

Author(s):

Irfan Aygun ◽

Mehmet Kaya

Keyword(s):

Scientific Texts ◽

Term Extraction ◽

Automatic Term Extraction

Term extraction and management based on event templates

Terminology ◽

10.1075/term.20.2.02per ◽

2014 ◽

Vol 20 (2) ◽

pp. 151-170 ◽

Cited By ~ 1

Author(s):

Katia Peruzzo

Keyword(s):

European Union ◽

Knowledge Base ◽

Legal System ◽

Classification Criteria ◽

The European Union ◽

Additional Information ◽

Term Extraction ◽

Automatic Term Extraction ◽

Multi Level ◽

Victims Of Crime

The paper examines the possible usage of event templates derived from Frame-Based Terminology (Faber et al. 2005, 2006, 2007) as an aid to the extraction and management of legal terminology embedded in the multi-level legal system of the European Union. The method proposed here, which combines semi-automatic term extraction and a simplified event template containing six categories, is applied to an English corpus of EU texts focusing on victims of crime and their rights. Such a combination allows for the extraction of category-relevant terminological units and additional information, which can then be used for populating a terminological knowledge base organised on the basis of the same event template, but which also employs additional classification criteria to account for the multidimensionality encountered in the corpus.

Automatic Term Extraction for Sentiment Classification of Dynamically Updated Text Collections into Three Classes

Knowledge Engineering and the Semantic Web - Communications in Computer and Information Science ◽

10.1007/978-3-319-11716-4_12 ◽

2014 ◽

pp. 140-149 ◽

Cited By ~ 2

Author(s):

Yuliya Rubtsova

Keyword(s):

Sentiment Classification ◽

Term Extraction ◽

Text Collections ◽

Automatic Term Extraction

Tagging terms in text

Terminology ◽

10.1075/term.21010.rig ◽

2022 ◽

Author(s):

Ayla Rigouts Terryn ◽

Véronique Hoste ◽

Els Lefever

Keyword(s):

Neural Network ◽

Machine Learning ◽

Language Processing ◽

Conditional Random Fields ◽

Traditional Approach ◽

Learning Approaches ◽

Term Extraction ◽

The Neural Network ◽

Predicted Probability ◽

Automatic Term Extraction

Abstract As with many tasks in natural language processing, automatic term extraction (ATE) is increasingly approached as a machine learning problem. So far, most machine learning approaches to ATE broadly follow the traditional hybrid methodology, by first extracting a list of unique candidate terms, and classifying these candidates based on the predicted probability that they are valid terms. However, with the rise of neural networks and word embeddings, the next development in ATE might be towards sequential approaches, i.e., classifying each occurrence of each token within its original context. To test the validity of such approaches for ATE, two sequential methodologies were developed, evaluated, and compared: one feature-based conditional random fields classifier and one embedding-based recurrent neural network. An additional comparison was added with a machine learning interpretation of the traditional approach. All systems were trained and evaluated on identical data in multiple languages and domains to identify their respective strengths and weaknesses. The sequential methodologies were proven to be valid approaches to ATE, and the neural network even outperformed the more traditional approach. Interestingly, a combination of multiple approaches can outperform all of them separately, showing new ways to push the state-of-the-art in ATE.
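The difference between the two task framings in this abstract can be shown on the data side. The sketch below is illustrative only: it assumes an IOB labelling scheme, which the abstract does not specify, and uses an invented sentence with hypothetical term annotations. The sequential view assigns one label per token occurrence, while the traditional view reduces the same annotations to a list of unique candidate terms.

```python
def iob_labels(tokens, term_spans):
    """Label every token occurrence as B (term start), I (inside term) or O.

    term_spans holds (start, end) token indices with end exclusive.
    The IOB scheme is an assumption made for this sketch.
    """
    labels = ["O"] * len(tokens)
    for start, end in term_spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def unique_candidates(tokens, term_spans):
    """Collapse annotated occurrences into a sorted list of unique lowercase terms."""
    return sorted({" ".join(tokens[s:e]).lower() for s, e in term_spans})


# Invented example sentence and annotations, for illustration only.
tokens = ["Heart", "failure", "is", "treated", "differently", "from",
          "chronic", "heart", "failure", "."]
term_spans = [(0, 2), (6, 9)]

labels = iob_labels(tokens, term_spans)
candidates = unique_candidates(tokens, term_spans)
print(list(zip(tokens, labels)))
print(candidates)
```

A sequence labeller would be trained to predict `labels` from `tokens` in context, whereas a candidate-based classifier would decide once per entry in `candidates` whether it is a valid term.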
