terminology extraction
Recently Published Documents


TOTAL DOCUMENTS: 113 (FIVE YEARS: 30)

H-INDEX: 10 (FIVE YEARS: 1)

2021 ◽  
Vol 12 (5-2021) ◽  
pp. 10-21
Author(s):  
Maksim G. Shishaev ◽  
Vladimir V. Dikovitsky ◽  
Pavel A. Lomov ◽  
...  

The paper addresses the task of automated terminology extraction. It proposes a two-stage approach based on topic modeling and on analysis of the contexts in which lexical units are used. Results of an experimental verification of the approach and prospects for its further development are presented.


Terminology ◽  
2021 ◽  
Author(s):  
Marta García González

The paper discusses the main results of an analysis of Spanish accounting terminology, based on the exploitation of three different corpora. The analysis was aimed at measuring the level of terminology variation in Spanish accounting and at assessing the suitability of accounting standards and companies’ financial statements for terminology extraction in the translation of accounting texts. The results show a terminological variation of around 25% in international accounting standards and a considerable lack of consistency in the use of accounting terminology in the financial statements of Spanish companies, both in the Spanish originals and in their English translations.


2021 ◽  
pp. 85-92
Author(s):  
Sigita Rackevičienė ◽  
Liudmila Mockienė ◽  
Andrius Utka ◽  
Aivaras Rokas

The paper presents a methodological framework for developing an English-Lithuanian bilingual termbase in the cybersecurity domain, which can serve as a model for other language pairs and other specialised domains. It is argued that the presented approach can ensure the creation of high-quality bilingual termbases even with limited resources. The paper touches upon the methods and problems of dataset (corpus) compilation, terminology annotation, automatic bilingual term extraction (BiTE) and alignment, knowledge-rich context extraction, and linguistic linked open data (LLOD) technologies. It presents theoretical considerations as well as arguments for the effectiveness of the described methods. The theoretical analysis and a pilot study suggest that: 1) a combination of parallel and comparable corpora makes it possible to considerably expand the amount and variety of data sources that can be used for terminology extraction; this methodology is especially important for less-resourced languages, which often lack parallel data; 2) deep learning systems trained on manually annotated data (gold standard corpora) allow effective automation of the extraction of terminological data and metadata, which makes it possible to update termbases regularly with minimal manual input; 3) LLOD technologies make it possible to integrate the terminological data into the global linguistic data ecosystem and make it reusable, searchable and discoverable across the Web.


Terminology ◽  
2021 ◽  
Vol 27 (2) ◽  
pp. 219-253
Author(s):  
Natalia Rivas ◽  
Gabriel Quiroz ◽  
John Jairo Giraldo

This paper analyzes nested-abbreviated terms from a linguistic perspective by describing their morphological, syntactic, and semantic features for terminology purposes. Nested-abbreviated terms can be considered abbreviated forms, either initialisms or acronyms, which have within their meaning another abbreviated term. To carry out the analysis, 433 nested-abbreviated terms were extracted from two specialized dictionaries in English. Data analysis showed that, from the morphological and semantic perspective, nested-abbreviated terms behave like typical abbreviations. Important differences were found from a syntactic standpoint, where nested-abbreviated terms behave as premodifiers in the noun phrase (NP) in 98.93% of the cases. As this is the first time nested-abbreviated terms have been studied, they were not only described but also analyzed and defined. Although the percentage of nested-abbreviated terms obtained from the dictionaries is relatively low, less than 1% of total abbreviations, the study found that this growing phenomenon in specialized languages is highly relevant for terminology extraction, as well as for other purposes.


2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Rodrique Kafando ◽  
Rémy Decoupes ◽  
Sarah Valentin ◽  
Lucile Sautot ◽  
Maguelonne Teisseire ◽  
...  

Here, we introduce ITEXT-BIO, an intelligent process for biomedical domain terminology extraction from textual documents and subsequent analysis. The proposed methodology consists of two complementary approaches: free and driven term extraction. The first is based on term extraction with statistical measures, while the second considers morphosyntactic variation rules to extract term variants from the corpus. The combination of these two term extraction and analysis strategies is the keystone of ITEXT-BIO. These include combined intra-corpus strategies that enable term extraction and analysis either from a single corpus (intra), or from corpora (inter). We assessed the two approaches, the corpus or corpora to be analysed and the type of statistical measures used. Our experimental findings revealed that the proposed methodology could be used: (1) to efficiently extract representative, discriminant and new terms from a given corpus or corpora, and (2) to provide quantitative and qualitative analyses on these terms regarding the study domain.
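
The abstract does not name the statistical measures it uses. As a rough illustration of what ranking words with a statistical measure can look like, the sketch below uses plain TF-IDF as a stand-in. The function name `tfidf_scores` and the toy documents are assumptions made for this example and are not taken from ITEXT-BIO.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each word by its summed TF-IDF over tokenized, non-empty documents.

    TF-IDF is a stand-in measure chosen for illustration only.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each word appears in.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for word, count in tf.items():
            idf = math.log(n_docs / df[word])
            scores[word] += (count / len(doc)) * idf
    return scores

# Hypothetical toy corpus: two biomedical sentences and one unrelated one.
docs = [
    "the virus infects the host cell".split(),
    "the host immune response to the virus".split(),
    "the market price of the stock".split(),
]
scores = tfidf_scores(docs)
```

A word that occurs in every document, such as "the" here, gets an inverse document frequency of log(1) = 0 and therefore scores zero, while words confined to fewer documents score higher.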


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Dominika Kováříková

The method of automatic term recognition based on machine learning focuses primarily on the most important quantitative term attributes. It successfully distinguishes terms from non-terms (with a success rate of more than 95%) and identifies the characteristic features of a term as a terminological unit. A single-word term can be characterized as a low-frequency word that occurs considerably more often in specialized texts than in non-academic texts, occurs in a small number of disciplines, and is unevenly distributed in the corpus, with uneven distances between its instances. A multi-word term is a collocation that consists of low-frequency words and contains at least one single-word term. Because the method is based on quantitative features, the algorithms can be used in multiple disciplines and in cross-lingual applications (verified on Czech and English).
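
One of the quantitative features mentioned above, that a term occurs considerably more often in specialized texts than in non-academic texts, can be sketched as a ratio of relative frequencies. This is a minimal illustration and not the paper's feature set or classifier. The function name `frequency_ratio`, the add-one smoothing, and the toy corpora are all assumptions made for this example.

```python
from collections import Counter

def frequency_ratio(word, specialized, general):
    """Ratio of a word's relative frequency in specialized vs. general text."""
    spec_counts = Counter(specialized)
    gen_counts = Counter(general)
    # Add-one smoothing (an assumption of this sketch) so that words absent
    # from the general corpus do not cause a division by zero.
    spec_rel = (spec_counts[word] + 1) / (len(specialized) + 1)
    gen_rel = (gen_counts[word] + 1) / (len(general) + 1)
    return spec_rel / gen_rel

# Hypothetical toy corpora of ten tokens each.
specialized = "the enzyme binds the substrate and the enzyme is released".split()
general = "the cat sat on the mat and the dog slept".split()

term_ratio = frequency_ratio("enzyme", specialized, general)
common_ratio = frequency_ratio("the", specialized, general)
```

A ratio well above 1 marks a word as typical of the specialized corpus. A ratio near 1 suggests a word that is equally common in both.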


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Wei Shao ◽  
Bolin Hua ◽  
Linqi Song

Many new scientific documents are published on various platforms every day, and it is increasingly important to discover new words and meanings in them quickly and efficiently. However, most related work relies on labeled data and has difficulty handling unlabeled new documents efficiently. To address this, we introduce an unsupervised method based on sentence patterns and part-of-speech (POS) sequences. The method needs only a few initial learnable patterns to obtain the initial terminology tokens and their POS sequences. In this process, new patterns are constructed that can match more sentences and so find more POS sequences of terminology. Finally, we use the obtained POS sequences and sentence patterns to extract terms from new scientific text. Experiments on paper abstracts from Web of Knowledge show that the method is practical and achieves good performance on our test data.
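
The POS-sequence step described above can be sketched roughly as follows. The sketch assumes the text is already POS-tagged and covers only two parts of the idea: collecting the tag sequences of known seed terms, then matching those sequences in new text. It omits the iterative construction of new sentence patterns. The function names, tag labels, and example sentences are hypothetical and not taken from the paper.

```python
def learn_pos_sequences(tagged_sentences, seed_terms):
    """Collect the POS tag sequences of seed terms found in tagged sentences."""
    sequences = set()
    for sentence in tagged_sentences:
        words = [w.lower() for w, _ in sentence]
        tags = [t for _, t in sentence]
        for term in seed_terms:
            term_words = term.lower().split()
            n = len(term_words)
            for i in range(len(words) - n + 1):
                if words[i:i + n] == term_words:
                    sequences.add(tuple(tags[i:i + n]))
    return sequences

def extract_candidates(tagged_sentence, sequences):
    """Return every span whose tag sequence matches a learned sequence."""
    tags = [t for _, t in tagged_sentence]
    found = []
    for seq in sequences:
        n = len(seq)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == seq:
                found.append(" ".join(w for w, _ in tagged_sentence[i:i + n]))
    return found

# Hand-tagged toy data (hypothetical); a real pipeline would use a POS tagger.
training = [[("We", "PRON"), ("use", "VERB"), ("topic", "NOUN"),
             ("modeling", "NOUN"), ("here", "ADV")]]
sequences = learn_pos_sequences(training, ["topic modeling"])

new_sentence = [("The", "DET"), ("term", "NOUN"),
                ("extraction", "NOUN"), ("works", "VERB")]
candidates = extract_candidates(new_sentence, sequences)
```

Matching on tag sequences alone overgenerates, since any noun-noun pair would match here, so a practical system would presumably filter candidates further, for example with the sentence patterns the abstract mentions.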

