Term extraction from sparse, ungrammatical domain-specific documents

2013 ◽  
Vol 40 (7) ◽  
pp. 2530-2540 ◽  
Author(s):  
Ashwin Ittoo ◽  
Gosse Bouma

Terminology ◽ 
2015 ◽  
Vol 21 (2) ◽  
pp. 151-179 ◽  
Author(s):  
Carlos Periñán-Pascual

The corpus-based identification of the lexical units that describe a given specialized domain is usually a complex task, for which an analysis based on word frequency and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate the diversity of domain-specific glossaries to be constructed from small- and medium-sized specialized corpora of unstructured texts. Unlike most research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded in the theoretical principles of salience, relevance and cohesion, which have been rationally implemented in the metric's three components.
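The abstract does not give SRC's formulas, but the idea of a user-adjustable composite of salience, relevance and cohesion scores can be sketched as follows. The component functions below are illustrative stand-ins, not the published metric:

```python
# Hypothetical sketch of a user-adjustable composite term-scoring metric
# in the spirit of SRC. All three component functions are illustrative
# assumptions, not the formulas from the article.

def salience(freq, doc_count, total_docs):
    """Reward candidates that are frequent and spread across documents."""
    return freq * (doc_count / total_docs)

def relevance(freq_domain, freq_general):
    """Reward candidates more frequent in the domain corpus than in a
    general reference corpus."""
    return freq_domain / (freq_domain + freq_general + 1e-9)

def cohesion(ngram_freq, part_freqs):
    """Reward multiword candidates whose parts co-occur more often than
    their individual relative frequencies would suggest."""
    expected = 1.0
    for f in part_freqs:
        expected *= f
    return ngram_freq / (expected + 1e-9)

def src_score(stats, weights=(1.0, 1.0, 1.0)):
    """User-adjustable weighted combination of the three components,
    so the metric can be tuned to different domains and corpus sizes."""
    w_s, w_r, w_c = weights
    return (w_s * salience(stats["freq"], stats["docs"], stats["total_docs"])
            + w_r * relevance(stats["freq"], stats["general_freq"])
            + w_c * cohesion(stats["freq"], stats["part_freqs"]))
```

The weights make the metric adjustable: a user building a glossary from a very small corpus might down-weight salience, where frequency counts are unreliable, and up-weight relevance against a general corpus.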


2017 ◽  
Vol 24 (2) ◽  
pp. 163-198 ◽  
Author(s):  
Carlos Periñán-Pascual

Abstract Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those intended to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distributions of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded on the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.
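The second issue above, detecting corpus-specific stopwords by comparing frequency distributions, can be illustrated with a minimal sketch. This is not DEXTER's actual code (which also consults the IATE database); it only shows the frequency-contrast part, with the ratio threshold as an assumed parameter:

```python
# Illustrative sketch (not DEXTER's implementation) of flagging stopwords
# by contrasting a word's relative frequency in the domain corpus with
# its relative frequency in a general reference corpus.

from collections import Counter

def detect_stopwords(domain_tokens, general_tokens, threshold=1.0):
    """Flag words that are no more prominent in the domain corpus than in
    the general corpus, i.e. whose domain/general relative-frequency
    ratio falls at or below the threshold."""
    domain_freq = Counter(domain_tokens)
    general_freq = Counter(general_tokens)
    n_dom = sum(domain_freq.values())
    n_gen = sum(general_freq.values())
    stop = set()
    for word, f in domain_freq.items():
        rel_dom = f / n_dom
        rel_gen = general_freq[word] / n_gen
        if rel_gen and rel_dom / rel_gen <= threshold:
            stop.add(word)
    return stop
```

Words that never occur in the general corpus are left alone, since a zero general frequency is itself evidence of domain specificity.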


Terminology ◽  
2018 ◽  
Vol 24 (1) ◽  
pp. 122-147 ◽ 
Author(s):  
Mercè Vàzquez ◽  
Antoni Oliver

Abstract The identification of reliable terms from domain-specific corpora using computational methods is a task whose results have to be validated manually by specialists, a highly time-consuming activity. To reduce this effort and improve term candidate selection, we implemented the Token Slot Recognition method, a filtering method based on terminological tokens that ranks term candidates extracted from domain-specific corpora. This paper presents the implementation of this filtering method in both linguistic and statistical approaches to automatic term extraction, applied to several domain-specific corpora in different languages. We observed that the filtering method outperforms raw-frequency ranking by placing a higher number of true terms at the top of the candidate list; for statistical term extraction the improvement is between 15% and 25% in both precision and recall. Our analyses further revealed a reduction in the number of term candidates that specialists have to validate manually. In conclusion, the Token Slot Recognition filtering method significantly reduces the number of term candidates extracted automatically from domain-specific corpora, so candidates can be validated easily and quickly by specialists.
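The core idea of a filter based on terminological tokens can be sketched in a few lines. The scoring rule below (fraction of a candidate's tokens that are known terminological tokens, with frequency as a tie-breaker) is an assumption for illustration, not the authors' published formula:

```python
# Minimal sketch in the spirit of Token Slot Recognition: candidates are
# ranked by how many of their token slots are filled by tokens already
# attested in validated terms. The scoring is an illustrative assumption.

def tsr_rank(candidates, terminological_tokens):
    """Rank (term, frequency) pairs by the fraction of tokens that are
    known terminological tokens, breaking ties by frequency."""
    def score(item):
        term, freq = item
        tokens = term.split()
        hits = sum(1 for t in tokens if t in terminological_tokens)
        return (hits / len(tokens), freq)
    return sorted(candidates, key=score, reverse=True)
```

The effect matches the abstract's claim: a frequent but non-terminological candidate sinks below rarer candidates built from terminological tokens, which is what pure frequency ranking gets wrong.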


Author(s):  
Wilson Wong

As more electronic text becomes readily available, and more applications become knowledge-intensive and ontology-enabled, term extraction, also known as automatic term recognition or terminology mining, is increasingly in demand. This chapter first presents a comprehensive review of existing techniques, discusses several issues and open problems that prevent such techniques from being practical in real-life applications, and then proposes solutions to address these issues. Keeping abreast of recent advances in related areas such as text mining, we propose new measures for the determination of unithood, and a new scoring and ranking scheme for measuring termhood to recognise domain-specific terms. The chapter concludes with experiments that demonstrate the advantages of our new approach.
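To make the unithood/termhood distinction concrete: unithood measures how strongly a word sequence forms a stable unit, while termhood measures how specific a unit is to the domain. One classic termhood measure of the latter kind is "weirdness" (Ahmad et al.), a ratio of relative frequencies; the chapter proposes its own measures, so this is only a representative example:

```python
# Sketch of the classic "weirdness" termhood measure: the ratio of a
# candidate's relative frequency in the domain corpus to its relative
# frequency in a general reference corpus. The chapter's own measures
# differ; this is a representative baseline, with add-one smoothing
# as an assumed choice for words unseen in the general corpus.

def weirdness(freq_domain, size_domain, freq_general, size_general):
    """High values mean the candidate is far more typical of the
    domain corpus than of general language."""
    rel_dom = freq_domain / size_domain
    rel_gen = (freq_general + 1) / (size_general + 1)  # smoothing
    return rel_dom / rel_gen
```

A word like "ontology" scores high against a general-English corpus, while function words score near (or below) 1 and are filtered out.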


Terminology ◽  
2015 ◽  
Vol 21 (2) ◽  
pp. 205-236 ◽  
Author(s):  
Robert Gaizauskas ◽  
Monica Lestari Paramita ◽  
Emma Barker ◽  
Marcis Pinnis ◽  
Ahmet Aker ◽  
...  

In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System) designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligners. BiTES is readily extendable to new language pairs and has been successfully used to gather bilingual terminology for 24 language pairs, including English and all official EU languages, save Irish. Second, we describe a novel set of methods for evaluating the main components of BiTES and present the results of our evaluation for six language pairs. Results show that the BiTES approach can be used to successfully harvest quality bilingual term pairs from the Web. Our evaluation method delivers significant insights about the strengths and weaknesses of our techniques. It can be straightforwardly reused to evaluate other bilingual term extraction systems and makes a novel contribution to the study of how to evaluate bilingual terminology extraction systems.
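One weak but cheap signal a bilingual term aligner can exploit is surface similarity between candidate terms, which catches cognate pairs across related languages. BiTES combines several far richer components (domain classifiers, extraction systems, aligners), so the greedy edit-similarity matcher below is only a toy illustration, with the similarity threshold as an assumed parameter:

```python
# Toy sketch of one alignment signal: normalized string similarity
# between source- and target-language candidates, which pairs cognates
# such as "information"/"información". This is NOT BiTES's actual
# alignment method, only an illustration of the pairing step.

from difflib import SequenceMatcher

def align_pairs(source_terms, target_terms, min_similarity=0.7):
    """Greedily pair each source term with its most similar target term,
    keeping pairs whose similarity clears the threshold."""
    pairs = []
    for s in source_terms:
        best = max(target_terms,
                   key=lambda t: SequenceMatcher(None, s.lower(), t.lower()).ratio())
        sim = SequenceMatcher(None, s.lower(), best.lower()).ratio()
        if sim >= min_similarity:
            pairs.append((s, best))
    return pairs
```

Cognate matching fails for unrelated language pairs, which is one reason a production system layers multiple alignment components and evaluates each one separately, as the paper's evaluation methodology does.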

