Developing the Persian Wordnet of Verbs Using Supervised Learning

Author(s):
Zahra Mousavi, Heshaam Faili

Nowadays, wordnets are extensively used as a major resource in natural language processing and information retrieval tasks, so their accuracy directly influences the performance of the applications that rely on them. This paper presents a fully automated method for extending a previously developed Persian wordnet with more comprehensive and accurate verbal entries. First, a bilingual dictionary is used to link Persian verbs to Princeton WordNet (PWN) synsets. We then propose a feature set that captures the semantic behavior of compound verbs, which constitute the majority of Persian verbs, and employ it in a supervised classification system to select the correct links for inclusion in the wordnet. We also draw on a pre-existing Persian wordnet, FarsNet, and a similarity-based method to produce a training set. The result is the largest automatically developed Persian wordnet to date, with more than 27,000 words, 28,000 PWN synsets, and 67,000 word-sense pairs; it substantially surpasses the previous Persian wordnet, which contains about 16,000 words, 22,000 PWN synsets, and 38,000 word-sense pairs.
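The abstract does not specify the classifier or the feature values, so the following is only a minimal sketch of the described link-selection step: candidate (Persian verb, PWN synset) links are scored by a binary classifier trained on features of the verb's compound structure. All feature names, feature values, and the example candidate link are illustrative assumptions, not the paper's actual feature set.

```python
# Hypothetical sketch of supervised link selection for wordnet extension.
# Feature vectors, labels, and the candidate link below are invented for
# illustration; the paper's actual feature set is not reproduced here.
from sklearn.linear_model import LogisticRegression

# Toy training set: each row describes one candidate (verb, synset) link,
# e.g. [translation overlap, light-verb compatibility, gloss similarity].
X_train = [
    [0.9, 1.0, 0.7],  # correct link
    [0.8, 0.0, 0.6],  # correct link
    [0.1, 1.0, 0.2],  # spurious link
    [0.2, 0.0, 0.1],  # spurious link
]
y_train = [1, 1, 0, 0]

clf = LogisticRegression().fit(X_train, y_train)

# Keep only the candidate links the classifier accepts.
candidates = {("dust dashtan", "love.v.01"): [0.85, 1.0, 0.65]}
for link, features in candidates.items():
    if clf.predict([features])[0] == 1:
        print("include in wordnet:", link)
```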

Author(s):
Rada Mihalcea, Dan I. Moldovan

Many natural language processing tasks, such as word sense disambiguation, knowledge acquisition, and information retrieval, use semantically tagged corpora. Until recently, these corpus-based systems relied on text manually annotated with semantic tags, but the massive human intervention this requires has become a serious impediment to building robust systems. In this paper, we present AutoASC, a system that automatically acquires sense-tagged corpora. It is based on (1) the information provided in WordNet, particularly the word definitions found within the glosses, and (2) information gathered from the Internet using existing search engines. The system was tested on a set of 46 concepts, for which 2071 example sentences were acquired; on these, a precision of 87% was observed.
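As a rough illustration of the gloss-driven acquisition idea, the sketch below builds one web-search query per WordNet sense of a word by pairing the word with content terms from that sense's gloss. It assumes NLTK's WordNet interface; AutoASC's actual query-construction heuristics are more elaborate and are not reproduced here.

```python
# Sketch of gloss-based query generation, assuming NLTK with the WordNet
# corpus installed (nltk.download("wordnet")). The content-word filter is
# deliberately crude and purely illustrative.
from nltk.corpus import wordnet as wn

def gloss_queries(word, pos=wn.NOUN):
    """Build one search query per sense of `word`, pairing the word with
    terms from that sense's WordNet gloss so that retrieved sentences are
    likely to use the word in that sense."""
    queries = []
    for synset in wn.synsets(word, pos=pos):
        gloss_terms = [t for t in synset.definition().split() if len(t) > 3]
        queries.append((synset.name(), f'"{word}" ' + " ".join(gloss_terms[:3])))
    return queries

for name, query in gloss_queries("interest"):
    print(name, "->", query)
```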


2019, Vol. 53 (2), pp. 3-10
Author(s):
Muthu Kumar Chandrasekaran, Philipp Mayr

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 aimed to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state of the art in scholarly document understanding, analysis, and retrieval at scale. The workshop comprised several paper sessions and the 5th edition of the CL-SciSumm Shared Task.


2018, Vol. 10 (10), pp. 3729
Author(s):
Hei Wang, Yung Chi, Ping Hsin

With the advent of the knowledge economy, firms often compete for intellectual property rights. Being the first to acquire high-potential patents can help firms achieve future competitive advantages. To identify patents worth developing, firms often search existing patent documents for a focus. Because of the rapid development of technology, the number of patent documents is immense, and a prominent question for firms is how to use this large corpus to discover new business opportunities while avoiding conflicts with existing patents. In the search for technological opportunities, a crucial task is to present results as an easily understood visualization, and natural language processing can help achieve this goal. In natural language processing, word sense disambiguation (WSD) is the problem of determining which "sense" (meaning) of a word is activated in a given context: given a word and its possible senses, as defined by a dictionary, the occurrence of the word in context is classified into one or more of its sense classes, with features of the context (such as neighboring words) providing evidence for the classification. Current methods for patent document analysis warrant improvement in areas such as the analysis of many dimensions and the development of recommendation methods. This study proposes a visualization method that supports semantics, reduces the number of dimensions formed by terms, and can easily be understood by users. Since polysemous words occur frequently in patent documents, we also propose a WSD method to reduce the distortion computed between terms. An analysis of outlier distributions is used to construct a patent map capable of distinguishing similar patents. During the development of new strategies, the constructed patent map can help firms understand patent distributions in commercial areas, thereby preventing patent infringement caused by the development of similar technologies. Technological opportunities can then be recommended according to the patent map, helping firms assess relevant patents in commercial areas early and sustainably achieve future competitive advantages.
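The WSD formulation above maps directly onto the classic Lesk gloss-overlap algorithm, shown below via NLTK's implementation as a minimal illustration; it is not the WSD method this study proposes for patent terms.

```python
# Minimal WSD illustration using NLTK's simplified Lesk implementation
# (requires nltk.download("wordnet") and nltk.download("punkt")).
# The sentence and target word are invented examples, not patent data.
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

context = word_tokenize(
    "The patent claims a battery cell with improved charge capacity.")
sense = lesk(context, "cell", pos="n")  # pick the sense whose gloss best
print(sense, "-", sense.definition())   # overlaps the surrounding words
```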


Author(s):
Sijia Liu, Yanshan Wang, Andrew Wen, Liwei Wang, Na Hong, ...

BACKGROUND: Widespread adoption of electronic health records (EHRs) has enabled the secondary use of EHR data for clinical research and health care delivery. Natural language processing (NLP) techniques have shown promise in extracting the information embedded in unstructured clinical data, and information retrieval (IR) techniques provide flexible and scalable solutions that can augment NLP systems for retrieving and ranking relevant records.

OBJECTIVE: In this paper, we present the implementation of a cohort retrieval system that can execute textual cohort selection queries on both structured data and unstructured text: Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records (CREATE).

METHODS: CREATE is a proof-of-concept system that combines structured queries with information retrieval techniques applied to natural language processing results to improve cohort retrieval performance, using the Observational Medical Outcomes Partnership (OMOP) Common Data Model to enhance model portability. The NLP component extracts Common Data Model concepts from textual queries, and a hierarchical index supports the concept search using information retrieval techniques and frameworks.

RESULTS: A case study on 5 cohort identification queries, evaluated with the precision at 5 (P@5) metric at both the patient and document levels, shows that CREATE achieves a mean P@5 of 0.90, outperforming systems that use only structured data (mean P@5 of 0.54) or only unstructured text (0.74).

CONCLUSIONS: Implementation and evaluation on Mayo Clinic Biobank data demonstrated that CREATE outperforms cohort retrieval systems that use only structured data or only unstructured text on complex textual cohort queries.
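For reference, precision at 5 is simply the fraction of the top 5 retrieved items (patients or documents) that are relevant to the cohort query; the small sketch below computes it on invented identifiers.

```python
# Illustrative P@5 computation; IDs and relevance judgments are made up.
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for i in ranked_ids[:k] if i in relevant_ids) / k

ranked = ["p07", "p21", "p03", "p42", "p18", "p11"]  # system ranking
relevant = {"p07", "p03", "p18", "p99"}              # gold cohort
print(precision_at_k(ranked, relevant))              # -> 0.6
```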


2015, Vol. 103 (1), pp. 131-138
Author(s):
Yves Bestgen

Average precision (AP) is one of the most widely used metrics in information retrieval and natural language processing research. It is usually thought that the expected AP of a system that ranks documents randomly is equal to the proportion of relevant documents in the collection. This paper shows that this value is only an approximation and provides a procedure for efficiently computing the exact value. An analysis of the difference between the approximate and exact values shows that the discrepancy is large when the collection contains few documents but becomes very small when it contains at least 600 documents.
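Under the assumption that the ranking is a uniformly random permutation of N documents of which R are relevant, the exact expected AP can be derived by linearity of expectation and written in closed form using the N-th harmonic number. The sketch below implements that reconstruction next to the usual R/N approximation; it is not necessarily the paper's own procedure.

```python
# Expected AP of a random ranking of N documents with R relevant, derived
# via linearity of expectation (a reconstruction, not the paper's code).
def expected_ap_exact(N, R):
    H_N = sum(1.0 / i for i in range(1, N + 1))  # N-th harmonic number
    return H_N / N + (R - 1) * (N - H_N) / (N * (N - 1))

def expected_ap_approx(N, R):
    return R / N  # the commonly assumed value

# The gap is large for small collections and shrinks as N grows.
for N in (10, 100, 600):
    print(N, expected_ap_exact(N, N // 10), expected_ap_approx(N, N // 10))
```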

