ITEXT-BIO: Intelligent Term EXTraction for BIOmedical analysis

2021, Vol 9 (1)
Author(s): Rodrique Kafando, Rémy Decoupes, Sarah Valentin, Lucile Sautot, Maguelonne Teisseire, ...

Abstract: Here, we introduce ITEXT-BIO, an intelligent process for extracting biomedical domain terminology from textual documents and for its subsequent analysis. The proposed methodology consists of two complementary approaches: free and driven term extraction. The first is based on term extraction with statistical measures, while the second applies morphosyntactic variation rules to extract term variants from the corpus. The keystone of ITEXT-BIO is the combination of two term extraction and analysis strategies: intra-corpus strategies, which extract and analyse terms within a single corpus, and inter-corpus strategies, which work across corpora. We assessed the two approaches, the corpus or corpora to be analysed, and the type of statistical measures used. Our experimental findings show that the proposed methodology can be used (1) to efficiently extract representative, discriminant and new terms from a given corpus or corpora, and (2) to provide quantitative and qualitative analyses of these terms with respect to the study domain.
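
As an illustration of the statistical side of free term extraction, the sketch below scores multi-word candidates with the C-value measure, a widely used statistical termhood score. This is not necessarily the measure used in ITEXT-BIO; the candidate list and frequencies are hypothetical.

```python
import math
from collections import defaultdict

def c_value(candidates):
    """candidates: dict mapping a candidate term (tuple of tokens) to its corpus frequency."""
    # For each candidate, collect the longer candidates that contain it as a contiguous token span.
    containers = defaultdict(list)
    for longer in candidates:
        for shorter in candidates:
            if len(shorter) < len(longer) and any(
                longer[i:i + len(shorter)] == shorter
                for i in range(len(longer) - len(shorter) + 1)
            ):
                containers[shorter].append(longer)

    scores = {}
    for term, freq in candidates.items():
        length_weight = math.log2(max(len(term), 2))  # avoid a zero weight for single-word terms
        nesting = containers.get(term, [])
        if nesting:
            # Discount occurrences that only appear as part of a longer candidate term.
            nested_freq = sum(candidates[t] for t in nesting) / len(nesting)
            scores[term] = length_weight * (freq - nested_freq)
        else:
            scores[term] = length_weight * freq
    return scores

# Hypothetical candidates with corpus frequencies.
example = {
    ("gene", "expression"): 40,
    ("gene", "expression", "profile"): 12,
    ("expression", "profile"): 15,
}
for term, score in sorted(c_value(example).items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```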

2020, Vol 26 (4), pp. 455-479
Author(s): Branislava Šandrih, Cvetana Krstev, Ranka Stanković

Abstract: In this paper, we present two approaches, and the implemented system, for bilingual terminology extraction that rely on an aligned bilingual domain corpus, a terminology extractor for a target language, and a tool for chunk alignment. The two approaches differ in the way the terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second uses a term extraction tool. For both approaches, four experiments were performed, with two parameters being varied. In the experiments presented in this paper, the source language was English, the target language was Serbian, and the selected domain was Library and Information Science, for which an aligned corpus exists, as well as a bilingual terminological dictionary. For term extraction, we used the FlexiTerm tool for the source language and a shallow parser for the target language, while for word alignment we used GIZA++. The evaluation results show that for the first approach the F1 score varies from 29.43% to 51.15%, while for the second it varies from 61.03% to 71.03%. On the basis of the evaluation results, we developed a binary classifier that decides whether a candidate pair, composed of aligned source and target terms, is valid. We trained and evaluated different classifiers on a list of manually labeled candidate pairs obtained after running our extraction system. The best results in a fivefold cross-validation setting were achieved with the Radial Basis Function Support Vector Machine classifier, giving an F1 score of 82.09% and an accuracy of 78.49%.
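
A minimal sketch of the validation step described above, assuming a tiny set of hypothetical labeled candidate pairs and toy surface features: an RBF-kernel SVM is scored with cross-validation (three folds here rather than five, because of the toy data size). It illustrates the setup rather than reproducing the paper's features or results.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def pair_features(src, tgt):
    # Toy surface features for an aligned (source, target) candidate pair.
    return [
        len(src.split()),                 # source length in words
        len(tgt.split()),                 # target length in words
        len(src) / max(len(tgt), 1),      # character-length ratio
        abs(len(src) - len(tgt)),         # character-length difference
    ]

# Hypothetical manually labeled candidate pairs: 1 = valid, 0 = invalid.
pairs = [
    ("information retrieval", "pretraživanje informacija", 1),
    ("digital library", "digitalna biblioteka", 1),
    ("subject heading", "predmetna odrednica", 1),
    ("library card", "informacioni sistem", 0),
    ("open access", "biblioteka", 0),
    ("metadata record", "skup podataka", 0),
]

X = np.array([pair_features(s, t) for s, t, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = SVC(kernel="rbf", gamma="scale")           # RBF-kernel SVM, as in the paper
scores = cross_val_score(clf, X, y, cv=3, scoring="f1")
print("mean F1 on the toy data:", scores.mean())
```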


2011, Vol 11 (2), pp. 159
Author(s): Rogelio Nazar

This paper argues in favor of a statistical approach to terminology extraction that is general to all languages but has language-specific parameters. In contrast to many application-oriented terminology studies, which focus on a particular language and domain, this paper adopts some general principles of the statistical properties of terms and a method to obtain the corresponding language-specific parameters. This method is used for the automatic identification of terminology and is quantitatively evaluated in an empirical study of English medical terms. The proposal is theoretically and computationally simple and disregards resources such as linguistic or ontological knowledge. The algorithm learns to identify terms during a training phase in which it is shown examples of both terminological and non-terminological units. From these examples, the algorithm creates a model of the terminology that accounts for the frequency of lexical, morphological and syntactic elements of the terms in relation to the non-terminological vocabulary. The model is then used for the identification of new terminology in previously unseen text. The comparative evaluation shows that its performance is significantly higher than that of other well-known systems.
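
The following sketch is a much simplified illustration of the training idea described above: it learns the frequency of one crude lexical/morphological cue (word-final character trigrams) from terminological and non-terminological examples and scores new candidates by a smoothed log-odds ratio. The cue and the examples are hypothetical, not Nazar's actual feature model.

```python
import math
from collections import Counter

def suffixes(unit, n=3):
    # Word-final character trigrams as a crude morphological cue.
    return [w[-n:] for w in unit.lower().split()]

def train(term_examples, nonterm_examples):
    t_counts = Counter(s for u in term_examples for s in suffixes(u))
    n_counts = Counter(s for u in nonterm_examples for s in suffixes(u))
    t_total, n_total = sum(t_counts.values()), sum(n_counts.values())

    def score(candidate):
        # Sum of smoothed log-odds: positive values lean terminological.
        logodds = 0.0
        for s in suffixes(candidate):
            p_term = (t_counts[s] + 1) / (t_total + len(t_counts) + 1)
            p_nonterm = (n_counts[s] + 1) / (n_total + len(n_counts) + 1)
            logodds += math.log(p_term / p_nonterm)
        return logodds

    return score

score = train(
    ["myocardial infarction", "renal failure", "pulmonary embolism"],  # terminological units
    ["next appointment", "last week", "hospital parking"],             # non-terminological units
)
print(score("cardiac arrhythmia"), score("waiting room"))
```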


2013, Vol 70 (1), pp. 157-172
Author(s): Wiktoria Golik, Robert Bossy, Zorana Ratkovic, Claire Nédellec

Terminology, 2003, Vol 9 (1), pp. 51-69
Author(s): Arendse Bernth, Michael McCord, Kara Warburton

The role of terminology in content management has often been underrated. Term extraction has been identified by the information industry as an area requiring focus. Term extraction benefits both the content authoring and the translation process. Supplying key product terms to translation services several weeks before the actual translation begins reduces translation time, improves translation quality, and saves effort (and thus money) by reducing duplication of work. Getting the key terms ready in a timely manner can be difficult without some automation. This paper describes the process of proposing, designing, developing, and deploying a terminology extraction tool. The tool extracts nouns and noun groups, excludes non-translatable terms and known product terms, and displays a context for each extracted item. This is done based on full parsing of the text with a broad-coverage parser. The tool is made available to users on a Web server.
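
A minimal sketch of the extraction step described above, using spaCy's noun chunks as a stand-in for the broad-coverage parser: it pulls noun groups, filters a hypothetical exclusion list of known product terms, and keeps a context sentence for each extracted item.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this English model is installed
EXCLUDE = {"ibm", "websphere"}       # hypothetical non-translatable / known product terms

def extract_terms(text):
    doc = nlp(text)
    hits = []
    for chunk in doc.noun_chunks:
        term = chunk.text.lower().strip()
        if term in EXCLUDE or chunk.root.pos_ == "PRON":
            continue
        hits.append((term, chunk.sent.text))  # keep the sentence as the displayed context
    return hits

sample = "The installation wizard copies the runtime library to the target directory."
for term, context in extract_terms(sample):
    print(term, "->", context)
```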


2021, pp. 85-92
Author(s): Sigita Rackevičienė, Liudmila Mockienė, Andrius Utka, Aivaras Rokas

The aim of the paper is to present a methodological framework for the development of an English-Lithuanian bilingual termbase in the cybersecurity domain, which can be applied as a model for other language pairs and other specialised domains. It is argued that the presented methodological approach can ensure the creation of high-quality bilingual termbases even with limited available resources. The paper touches upon the methods and problems of dataset (corpora) compilation, terminology annotation, automatic bilingual term extraction (BiTE) and alignment, knowledge-rich context extraction, and linguistic linked open data (LLOD) technologies. The paper presents theoretical considerations as well as arguments for the effectiveness of the described methods. The theoretical analysis and a pilot study support three claims: (1) combining parallel and comparable corpora considerably expands the amount and variety of data sources that can be used for terminology extraction; this is especially important for less-resourced languages, which often lack parallel data; (2) deep-learning systems trained on manually annotated data (gold standard corpora) effectively automate the extraction of terminological data and metadata, which allows termbases to be updated regularly with minimal manual input; (3) LLOD technologies integrate the terminological data into the global linguistic data ecosystem and make it reusable, searchable and discoverable across the Web.
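
One concrete piece of point (2) is preparing gold-standard annotations for a deep-learning term extractor. The sketch below converts manually annotated term spans into token-level BIO labels, the usual training format for sequence labelling; the sentence, spans and naive whitespace tokenisation are placeholders, not the project's actual data or pipeline.

```python
def to_bio(tokens, term_spans):
    """term_spans: list of (start_token, end_token_exclusive) pairs marking gold term annotations."""
    labels = ["O"] * len(tokens)
    for start, end in term_spans:
        labels[start] = "B-TERM"
        for i in range(start + 1, end):
            labels[i] = "I-TERM"
    return list(zip(tokens, labels))

# Hypothetical annotated sentence: "firewall", "unauthorised access", "internal network" are terms.
tokens = "The firewall blocks unauthorised access to the internal network".split()
for token, label in to_bio(tokens, [(1, 2), (3, 5), (7, 9)]):
    print(f"{token}\t{label}")
```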


Electronics, 2020, Vol 9 (4), pp. 608
Author(s): HoSung Woo, JaMee Kim, WonGyu Lee

The principles of computer skills have been included in primary and secondary education since the early 2000s, and the reform of curricula is tied to the development of IT. Therefore, curricula should reflect the latest technological trends and the needs of society. The development of a curriculum typically involves the subjective judgment of a few experts or professors who extract knowledge from several similar documents. More objective extraction needs to be based on standardized terminology, and professional terminology can help build content frames for organizing curricula. The purpose of this study is to develop a smart system for extracting terms from the body of computer science (CS) knowledge and organizing knowledge areas. The extracted terms are grouped into semantically similar knowledge areas using the word2vec model. We analyzed a higher-education CS standards document and compiled a dictionary of technical terms with a hierarchical clustering structure. Based on the developed terminology dictionary, a specialized system is proposed to enhance the efficiency and objectivity of terminology extraction. The analysis of high school education courses in India and Israel using the technical term extraction system found that (1) technical terms for Software Development Fundamentals were extracted at a high rate in entry-level courses, (2) in advanced courses, the ratio of technical terms in the areas of Architecture and Organization, Programming Languages, and Software Engineering was high, and (3) electives that deal with advanced content had a high percentage of technical terms related to information systems.
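
A minimal sketch of the grouping step described above: train a word2vec model on a toy corpus, then cluster a handful of extracted technical terms with agglomerative hierarchical clustering. The corpus, term list and parameters are placeholders, not the study's actual data.

```python
from gensim.models import Word2Vec
from scipy.cluster.hierarchy import linkage, fcluster

# Toy "curriculum" corpus; real input would be a CS standards document.
corpus = [
    "students implement sorting algorithms and analyse time complexity".split(),
    "the operating system schedules processes and manages memory".split(),
    "relational databases store records and answer sql queries".split(),
    "compilers translate source code into machine instructions".split(),
]
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=200, seed=1)

terms = ["algorithms", "complexity", "processes", "memory", "databases", "sql", "compilers"]
vectors = [model.wv[t] for t in terms]

# Agglomerative clustering over the term vectors; three clusters is an arbitrary choice.
tree = linkage(vectors, method="average", metric="cosine")
for term, group in zip(terms, fcluster(tree, t=3, criterion="maxclust")):
    print(group, term)
```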


Terminology, 2015, Vol 21 (2), pp. 205-236
Author(s): Robert Gaizauskas, Monica Lestari Paramita, Emma Barker, Marcis Pinnis, Ahmet Aker, ...

In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System) designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligners. BiTES is readily extendable to new language pairs and has been successfully used to gather bilingual terminology for 24 language pairs, including English and all official EU languages, save Irish. Second, we describe a novel set of methods for evaluating the main components of BiTES and present the results of our evaluation for six language pairs. Results show that the BiTES approach can be used to successfully harvest quality bilingual term pairs from the Web. Our evaluation method delivers significant insights about the strengths and weaknesses of our techniques. It can be straightforwardly reused to evaluate other bilingual term extraction systems and makes a novel contribution to the study of how to evaluate bilingual terminology extraction systems.
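
To make the alignment stage concrete, the sketch below ranks candidate source/target term pairs with a crude combination of seed-dictionary evidence and character-level cognate similarity. This is one simple heuristic chosen for illustration; it is not the aligner actually used in BiTES, and the dictionary entries are hypothetical.

```python
from difflib import SequenceMatcher

SEED_DICT = {"safety": "sicherheit", "vehicle": "fahrzeug"}   # hypothetical seed entries

def pair_score(src_term, tgt_term):
    words_src = src_term.lower().split()
    words_tgt = tgt_term.lower().split()

    # Dictionary evidence: share of source words whose seed translation appears in the target term.
    hits = 0
    for w in words_src:
        translation = SEED_DICT.get(w)
        if translation and any(translation in t for t in words_tgt):
            hits += 1
    dict_score = hits / max(len(words_src), 1)

    # Cognate evidence: character-level similarity of the two strings.
    cognate_score = SequenceMatcher(None, src_term.lower(), tgt_term.lower()).ratio()
    return 0.5 * dict_score + 0.5 * cognate_score

candidates = [
    ("vehicle safety system", "fahrzeug sicherheitssystem"),
    ("vehicle safety system", "bremsweg"),
]
for src, tgt in sorted(candidates, key=lambda p: -pair_score(*p)):
    print(round(pair_score(src, tgt), 3), src, "||", tgt)
```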


2016, Vol 36 (1), pp. 62
Author(s): Claudio Fantinuoli

http://dx.doi.org/10.5007/2175-7968.2016v36nesp1p62
Many translation scholars have proposed the use of corpora to allow professional translators to produce high-quality texts which read like originals. Yet the diffusion of this methodology has been modest, one reason being that software for corpus analysis has been developed with the linguist in mind: such tools are generally complex and cumbersome, offering many advanced features but lacking the level of usability and the specific features that meet translators’ needs. To overcome this shortcoming, we have developed TranslatorBank, a free corpus creation and analysis tool designed for translation tasks. TranslatorBank supports the creation of specialized monolingual corpora from the web; it includes a concordancer with a query system similar to a search engine; it uses basic statistical measures to indicate the reliability of results; it accesses the original documents directly for more contextual information; and it includes a statistical and linguistic terminology extraction utility to extract the relevant terminology of the domain and the typical collocations of a given term. Designed to be easy and intuitive to use, the tool may help translation students as well as professionals to increase their translation quality by adhering to the specific linguistic variety of the target text corpus.
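
A minimal sketch of the collocation side of such a utility: score words co-occurring with a given term by pointwise mutual information (PMI), one of the basic statistical association measures such a tool could rely on. The toy corpus, window size and threshold are placeholders, not TranslatorBank internals.

```python
import math
from collections import Counter

def pmi_collocates(sentences, target, window=2, min_count=2):
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for sent in sentences:
        tokens = sent.lower().split()
        total += len(tokens)
        word_counts.update(tokens)
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            # Count every word inside the co-occurrence window around the target term.
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pair_counts[tokens[j]] += 1

    scores = {}
    for word, co in pair_counts.items():
        if co < min_count or word == target:
            continue
        p_pair = co / total
        p_word = word_counts[word] / total
        p_target = word_counts[target] / total
        scores[word] = math.log2(p_pair / (p_word * p_target))  # pointwise mutual information
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus = [
    "the court dismissed the appeal on procedural grounds",
    "the appeal was dismissed by the supreme court",
    "the court upheld the appeal after review",
]
print(pmi_collocates(corpus, "appeal"))
```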

