automatic term recognition
Recently Published Documents


TOTAL DOCUMENTS: 33 (five years: 2)

H-INDEX: 10 (five years: 0)

2021 ◽  
Vol 2078 (1) ◽  
pp. 012031
Author(s):  
Ani Song ◽  
Xiaoxia Jia ◽  
Wei Jiang

Abstract With the development of military intelligence, higher requirements are placed on automatic term recognition in the military field. Because military requirement documents use flexible and diverse naming and lack an annotated corpus, the method of this paper uses an existing military-domain core database and matches the data set against it using the Aho-Corasick algorithm and word segmentation, so that the terms to be recognized in the data set can be divided into three types. Likely word-formation rules of military terms are summarized, and phrases that conform to these rules are collected from the documents as the candidate term set. The core database and the TF-IDF method are used to score the candidate terms, and candidates whose score exceeds a threshold are selected iteratively as real terms. The experimental results show that the F1 score of this method reaches 0.719, which is better than the traditional C-value method. The proposed method can therefore achieve good automatic term recognition for military requirement documents without annotation.
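The TF-IDF thresholding step the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate terms and toy documents below are invented, and TF here is simply corpus-wide frequency.

```python
import math
from collections import Counter

def tfidf_scores(candidates, documents):
    """Score each candidate term by TF-IDF over a small corpus.

    TF is the candidate's total frequency across the corpus; IDF is
    log(N / df), where df is the number of documents containing it.
    """
    n_docs = len(documents)
    tf = Counter()
    df = Counter()
    for doc in documents:
        for cand in candidates:
            count = doc.count(cand)
            if count:
                tf[cand] += count
                df[cand] += 1
    return {c: tf[c] * math.log(n_docs / df[c]) for c in candidates if df[c]}

# Toy requirement-document snippets (hypothetical, for illustration only).
docs = [
    "radar system requirement for target tracking radar",
    "communication system requirement and data link",
    "target tracking accuracy of the radar",
]
scores = tfidf_scores(["radar", "system requirement", "data link"], docs)
# Candidates scoring above a threshold are kept as real terms.
accepted = {c for c, s in scores.items() if s > 1.0}
```

In the paper this selection runs iteratively, with accepted terms feeding back into the core database; the sketch shows only a single pass.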


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Dominika Kováříková

Abstract The method of automatic term recognition based on machine learning focuses primarily on the most important quantitative term attributes. It successfully distinguishes terms from non-terms (with a success rate of more than 95%) and identifies the characteristic features of a term as a terminological unit. A single-word term can be characterized as a low-frequency word that occurs considerably more often in specialized texts than in non-academic texts, occurs in a small number of disciplines, and is unevenly distributed in the corpus, as is the distance between its successive occurrences. A multi-word term is a collocation consisting of low-frequency words that contains at least one single-word term. Because the method is based on quantitative features, the algorithms can be applied across multiple disciplines and in cross-lingual applications (verified on Czech and English).
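A few of the quantitative attributes the abstract mentions — relative frequency in specialized versus general texts, and document dispersion — can be computed directly. This is a hedged sketch with invented toy corpora; the actual feature set and classifier of the study are not reproduced here.

```python
def term_features(word, specialized_docs, general_docs):
    """Quantitative features of the kind the abstract describes:
    relative frequency in specialized vs. general texts, their ratio,
    and dispersion (share of specialized documents containing the word)."""
    spec_tokens = [t for d in specialized_docs for t in d.split()]
    gen_tokens = [t for d in general_docs for t in d.split()]
    spec_freq = spec_tokens.count(word) / max(len(spec_tokens), 1)
    gen_freq = gen_tokens.count(word) / max(len(gen_tokens), 1)
    dispersion = sum(word in d.split() for d in specialized_docs) / len(specialized_docs)
    ratio = spec_freq / gen_freq if gen_freq else float("inf")
    return {"spec_freq": spec_freq, "gen_freq": gen_freq,
            "spec_general_ratio": ratio, "dispersion": dispersion}

# Toy corpora: "phoneme" behaves like a single-word term (specialized only).
spec = ["the phoneme inventory of the dialect", "each phoneme is transcribed"]
gen = ["the weather is nice today", "a nice day in the city"]
feats = term_features("phoneme", spec, gen)
```

A classifier trained on such features would then separate terms from non-terms; uneven spacing between occurrences (mentioned in the abstract) would need positional data not modeled in this sketch.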


Terminology ◽  
2018 ◽  
Vol 24 (1) ◽  
pp. 41-65 ◽  
Author(s):  
Leonie Grön ◽  
Ann Bertels

Abstract Due to its specific linguistic properties, the language found in clinical records has been characterized as a distinct sublanguage. Even within the clinical domain, though, there are major differences in language use, which has led to more fine-grained distinctions based on medical fields and document types. However, previous work has mostly neglected the influence of term variation. By contrast, we propose to integrate the potential for term variation in the characterization of clinical sublanguages. By analyzing a corpus of clinical records, we show that the different sections of these records vary systematically with regard to their lexical, terminological and semantic composition, as well as their potential for term variation. These properties have implications for automatic term recognition, as they influence the performance of frequency-based term weighting.


Terminology ◽  
2015 ◽  
Vol 21 (2) ◽  
pp. 180-204 ◽  
Author(s):  
Malgorzata Marciniak ◽  
Agnieszka Mykowiecka

Domain corpora are often small, and even important terms may occur in them not as isolated maximal phrases but only within more complex constructions. Appropriate recognition of nested terms can thus influence both the content of the extracted candidate term list and its order. We propose a new method for identifying nested terms that combines two aspects: grammatical correctness and normalised pointwise mutual information (NPMI), computed for all bigrams in a given corpus. NPMI is typically used to recognize strong word connections, but our solution uses it to find the weakest points, suggesting the best place to divide a phrase into two parts. By creating at most two nested phrases in each step, we introduce a binary term structure. We test the impact of the proposed method, applied together with the C-value ranking method, on the automatic term recognition task performed on three corpora, two in Polish and one in English.
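The weakest-point splitting idea can be sketched with NPMI over toy counts. The counts, vocabulary, and phrase below are invented for illustration; the paper additionally checks grammatical correctness, which this sketch omits.

```python
import math

def npmi(bigram_counts, unigram_counts, total_bigrams, x, y):
    """Normalised PMI in [-1, 1] for the bigram (x, y)."""
    p_xy = bigram_counts[(x, y)] / total_bigrams
    p_x = unigram_counts[x] / total_bigrams  # unigram probability, approximated on the same scale
    p_y = unigram_counts[y] / total_bigrams
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)

def weakest_split(phrase, bigram_counts, unigram_counts, total):
    """Split a multi-word phrase at its lowest-NPMI bigram, yielding two
    nested candidates (one step of the binary term structure)."""
    words = phrase.split()
    scores = [npmi(bigram_counts, unigram_counts, total, a, b)
              for a, b in zip(words, words[1:])]
    i = scores.index(min(scores)) + 1
    return " ".join(words[:i]), " ".join(words[i:])

# Invented corpus statistics: "renal failure" is a strong collocation,
# "failure treatment" is weak, so the split lands between them.
bigram_counts = {("acute", "renal"): 8, ("renal", "failure"): 20,
                 ("failure", "treatment"): 2}
unigram_counts = {"acute": 30, "renal": 25, "failure": 40, "treatment": 60}
head, tail = weakest_split("acute renal failure treatment",
                           bigram_counts, unigram_counts, 1000)
```

Applying the split recursively to each half yields the binary term structure the abstract describes.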


2015 ◽  
Vol 41 (6) ◽  
pp. 336-349 ◽  
Author(s):  
N. A. Astrakhantsev ◽  
D. G. Fedorenko ◽  
D. Yu. Turdakov

2014 ◽  
Vol 1049-1050 ◽  
pp. 1544-1549
Author(s):  
Wen Xiong

Machine-aided human translation (MAHT) of patent abstracts is an important step in the deep processing of patent data, where terms have significant application value. This paper investigates automatic term recognition (ATR) and proposes a new hybrid method based on two-phase analysis and statistics to generate English candidate terms. Segments containing stop words are not simply discarded; instead, the second phase applies a rewriting method using beginning, ending, and inner patterns to process these segments. Generalized statistical measures, such as generalized mutual information (MI), the log-likelihood ratio (LLR), and the C-value, are then used to evaluate the candidates, filtering out low-scoring candidate terms and taking the intersection of the resulting sets. Experiments on randomly extracted patent abstract texts demonstrate the effectiveness of the method.
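Of the measures this abstract combines, the C-value is the one most specific to term recognition; a minimal sketch of it (following Frantzi and Ananiadou's formulation, with invented frequencies) is below. The MI and LLR components of the hybrid method are not reproduced here.

```python
import math

def c_value(term, freq, longer_terms):
    """C-value termhood: long, frequent candidates score high, but
    frequency inherited from longer candidates that nest this term is
    discounted.  `freq` maps candidate strings to corpus frequencies;
    `longer_terms` lists candidates containing `term` as a substring."""
    weight = math.log2(len(term.split()))  # some variants use log2(|a| + 1) to cover unigrams
    if not longer_terms:
        return weight * freq[term]
    nested = sum(freq[t] for t in longer_terms)
    return weight * (freq[term] - nested / len(longer_terms))

# Invented frequencies: "mutual information" occurs 10 times, 4 of which
# are inside the longer candidate "generalized mutual information".
freq = {"mutual information": 10, "generalized mutual information": 4}
full_score = c_value("generalized mutual information", freq, [])
nested_score = c_value("mutual information", freq, ["generalized mutual information"])
```

In the hybrid method, candidates would also need to clear the MI and LLR filters, with the final term list taken as the intersection of the three.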


Corpora ◽  
2014 ◽  
Vol 9 (1) ◽  
pp. 83-107 ◽  
Author(s):  
María José Marín

Specialised texts are characterised by, amongst other features, the presence of terminology which conveys domain-specific concepts that are essential for the specialist who is interested in analysing such texts. Automatic Term Recognition (ATR) methods are employed to identify those terms automatically, which is especially helpful in view of the large size of corpora nowadays. However, they tend to concentrate on the identification of Multi-Word Terms (MWTs), neglecting Single-Word Terms (SWTs) to a certain extent. This might be related to the greater number of the former found in fields such as biomedicine. However, so far as legal English is concerned, testing has shown that SWTs represent 65.22 percent of the items in the specialised glossary employed for the evaluation of the ATR methods examined herein. This paper presents the evaluation of five SWT recognition methods, namely, those of Chung (2003), Drouin (2003), Kit and Liu (2008), Keywords (2008), and TF-IDF (term frequency-inverse document frequency). These were tested on the United Kingdom Supreme Court Corpus (UKSCC), a legal corpus of 2.6 million words which was compiled for this purpose. The results indicate that Drouin's TermoStat software is the best performing method, achieving 73.45 percent precision on the top 2,000 candidate terms.
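One family of SWT recognition methods evaluated here ranks words by how much more frequent they are in the specialised corpus than in a reference corpus, in the spirit of Chung (2003). The sketch below uses invented counts and corpus sizes; it is not the paper's evaluation setup.

```python
def keyness_ratio(word, spec_freq, ref_freq, spec_total, ref_total):
    """Frequency-ratio keyness: relative frequency in the specialised
    corpus divided by relative frequency in the reference corpus.
    Words absent from the reference corpus rank highest."""
    spec_rel = spec_freq.get(word, 0) / spec_total
    ref_rel = ref_freq.get(word, 0) / ref_total
    return spec_rel / ref_rel if ref_rel else float("inf")

# Toy counts: "tort" is legal-only, "claimant" is heavily skewed legal,
# "house" is ordinary general vocabulary.
spec = {"tort": 120, "claimant": 300, "house": 40}
ref = {"house": 900, "claimant": 10}
ranked = sorted(spec, reverse=True,
                key=lambda w: keyness_ratio(w, spec, ref, 100_000, 1_000_000))
```

A cutoff on the ratio (or on the ranked list) then separates candidate SWTs from general vocabulary, which is where precision figures like those reported here are measured.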


Author(s):  
Ioannis Korkontzelos ◽  
Sophia Ananiadou

Automatic extraction of metadata from free text is key to digesting stored literature information, especially in dynamic and rapidly evolving fields such as biomedicine. Moreover, more and more applications depend heavily on knowledge and ontologies. Successfully recognizing or extracting terms and their relations in scientific and technical documents without human intervention is crucial to semantically structuring literature and populating ontologies. This task has been recognized as the bottleneck in exploiting fields that involve complex and dynamically changing terms, and has thus become an important research topic in Natural Language Processing. This chapter presents a brief but complete overview of automatic term recognition techniques and discusses a number of crucial practical issues. Subsequently, it focuses on evaluation, discusses available resources, and highlights a number of applications.


Author(s):  
Udo Kruschwitz ◽  
Nick Webb ◽  
Richard Sutcliffe

The theme of this chapter is the improvement of Information Retrieval and Question Answering systems by the analysis of query logs. Two case studies are discussed. The first describes an intranet search engine working on a university campus which can present sophisticated query modifications to the user. It does this via a hierarchical domain model built using multi-word term co-occurrence data. The usage log was analysed using mutual information scores between a query and its refinement, between a query and its replacement, and between two queries occurring in the same session. The results can be used to validate refinements in the domain model, and to suggest replacements such as domain-dependent spelling corrections. The second case study describes a dialogue-based question answering system working over a closed document collection largely derived from the Web. Logs here are based around explicit sessions in which an analyst interacts with the system. Analysis of the logs has shown that certain types of interaction lead to increased precision of the results. Future versions of the system will encourage these forms of interaction. The conclusions of this chapter are firstly that there is a growing literature on query log analysis, much of it reviewed here, secondly that logs provide many forms of useful information for improving a system, and thirdly that mutual information measures taken with automatic term recognition algorithms and hierarchy construction techniques comprise one approach for enhancing system performance.

