Exploiting extra-textual and linguistic information in keyphrase extraction

2014 ◽  
Vol 22 (1) ◽  
pp. 73-95 ◽  
Author(s):  
GÁBOR BEREND

Abstract
Keyphrases are the most important phrases of a document, and as such they are useful for improving natural language processing tasks, including information retrieval, document classification, document visualization, summarization, and categorization. Here, we propose a supervised framework augmented with novel extra-textual information derived primarily from Wikipedia. Unlike most other methods relying on Wikipedia, our approach does not require a full textual index of all Wikipedia articles: we exploit only the category hierarchy and a list of multiword expressions derived from Wikipedia. This approach is not only less resource intensive, but also produces results comparable or superior to previous similar work. Our thorough evaluations also suggest that the proposed framework performs consistently well on multiple datasets, being competitive with, or even outperforming, other state-of-the-art methods. Besides introducing features that incorporate extra-textual information, we also experimented with a novel way of representing features derived from the POS tagging of the keyphrase candidates.
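The kind of candidate features the abstract describes can be pictured with a small sketch. This is a hypothetical illustration, not the paper's actual feature set: the MWE list, category map, and feature names below are invented for the example.

```python
# Hypothetical sketch of extra-textual feature extraction for one keyphrase
# candidate, assuming a precomputed set of Wikipedia multiword expressions
# and a phrase -> Wikipedia categories map (both invented here).
def candidate_features(phrase, pos_tags, wiki_mwes, wiki_categories):
    key = phrase.lower()
    return {
        "length": len(key.split()),                     # tokens in the candidate
        "is_wiki_mwe": key in wiki_mwes,                # listed as a Wikipedia MWE
        "n_wiki_categories": len(wiki_categories.get(key, [])),
        "pos_pattern": "-".join(pos_tags),              # e.g. "JJ-NN"
        "ends_in_noun": pos_tags[-1].startswith("NN"),
    }

wiki_mwes = {"information retrieval", "natural language processing"}
wiki_cats = {"information retrieval": ["Information science", "Computing"]}
feats = candidate_features("Information Retrieval", ["NN", "NN"], wiki_mwes, wiki_cats)
```

A vector of such features per candidate would then feed a standard supervised classifier.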

2019 ◽  
Vol 53 (2) ◽  
pp. 3-10
Author(s):  
Muthu Kumar Chandrasekaran ◽  
Philipp Mayr

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 aimed to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state of the art in scholarly document understanding, analysis, and retrieval at scale. The workshop comprised several paper sessions and the 5th edition of the CL-SciSumm Shared Task.


2020 ◽  
Author(s):  
Yuqi Kong ◽  
Fanchao Meng ◽  
Ben Carterette

Comparing document semantics is one of the toughest tasks in both natural language processing and information retrieval. To date, tools for this task remain rare, and most relevant methods are devised from a statistical or vector space model perspective; nearly none take a topological perspective. In this paper, we take a different approach: we propose a novel algorithm based on topological persistence for comparing the semantic similarity of two documents. Our experiments are conducted on a document dataset with human judges' results, and a collection of state-of-the-art methods is selected for comparison. The experimental results show that our algorithm produces highly human-consistent results and beats most state-of-the-art methods, though it ties with NLTK.
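The abstract does not specify the construction, but one common entry point to topological persistence is the fact that the 0-dimensional persistence deaths of a Vietoris-Rips filtration over a point cloud (for a document, e.g. its word vectors) are exactly the minimum-spanning-tree edge weights. A minimal sketch under that assumption, with a naive diagram comparison standing in for a proper bottleneck or Wasserstein distance:

```python
import math

def mst_edge_weights(points):
    """Deaths of the 0-dimensional persistence features of a Rips filtration:
    these equal the minimum spanning tree edge weights (Prim's algorithm)."""
    best = {i: math.dist(points[0], points[i]) for i in range(1, len(points))}
    deaths = []
    while best:
        j = min(best, key=best.get)          # next point absorbed into the tree
        deaths.append(best.pop(j))
        for k in best:
            best[k] = min(best[k], math.dist(points[j], points[k]))
    return sorted(deaths)

def diagram_distance(d1, d2):
    """Naive L1 distance between zero-padded sorted death vectors,
    a stand-in for a real bottleneck/Wasserstein diagram distance."""
    m = max(len(d1), len(d2))
    p1, p2 = d1 + [0.0] * (m - len(d1)), d2 + [0.0] * (m - len(d2))
    return sum(abs(a - b) for a, b in zip(p1, p2))

# Toy "documents" as 2-D point clouds standing in for word embeddings.
doc_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
doc_b = [(0.0, 0.0), (2.0, 0.0)]
score = diagram_distance(mst_edge_weights(doc_a), mst_edge_weights(doc_b))
```

Real implementations would compute higher-dimensional features as well, but the 0-dimensional case already captures how cluster structure differs between two clouds.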


2018 ◽  
Author(s):  
Debanjan Mahata ◽  
John Kuriakose ◽  
Rajiv Ratn Shah ◽  
Roger Zimmermann

Keyphrase extraction is a fundamental task in natural language processing that facilitates the mapping of documents to a set of representative phrases. In this paper, we present an unsupervised technique (Key2Vec) that leverages phrase embeddings for ranking keyphrases extracted from scientific articles. Specifically, we propose an effective way of processing text documents for training multi-word phrase embeddings, which are used for the thematic representation of scientific articles and for ranking the keyphrases extracted from them using theme-weighted PageRank. Evaluations on benchmark datasets produce state-of-the-art results.
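Theme-weighted PageRank can be read as personalized PageRank in which the teleport distribution favors phrases similar to the document's theme. A minimal power-iteration sketch under that reading; the graph, weights, and parameter values here are illustrative, not the paper's:

```python
def theme_weighted_pagerank(graph, theme_weights, d=0.85, iters=50):
    """Personalized PageRank over a weighted phrase graph.
    graph: {phrase: {neighbor: edge_weight}}; theme_weights: {phrase:
    similarity to the document theme}, normalized into the teleport
    distribution so thematic phrases attract more rank mass."""
    nodes = list(graph)
    total = sum(theme_weights.get(n, 0.0) for n in nodes) or 1.0
    teleport = {n: theme_weights.get(n, 0.0) / total for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            inflow = 0.0
            for m in nodes:
                out = sum(graph[m].values())
                if out and n in graph[m]:
                    inflow += rank[m] * graph[m][n] / out
            new[n] = (1 - d) * teleport[n] + d * inflow
        rank = new
    return rank

# Toy phrase graph: "a" and "b" co-occur strongly; "c" is peripheral.
graph = {"a": {"b": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0}}
themes = {"a": 1.0, "b": 0.5, "c": 0.1}   # similarity to the document theme
rank = theme_weighted_pagerank(graph, themes)
```

The top-ranked nodes would then be returned as the document's keyphrases.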


Author(s):  
Isabella Gagliardi ◽  
Maria Teresa Artese

Keyword/keyphrase extraction is an important research activity in text mining, natural language processing, and information retrieval. A large number of algorithms, divided into supervised and unsupervised methods, have been designed and developed to solve the problem of automatic keyphrase extraction. The aim of the chapter is to critically discuss unsupervised automatic keyphrase extraction algorithms, analyzing their characteristics in depth. The methods presented are tested on different datasets, with the data, the algorithms, and the different options tested in the runs described in detail. Moreover, most studies and experiments have been conducted on texts in English, while there are few experiments concerning other languages, such as Italian. Particular attention is therefore paid to evaluating the results of the methods in two languages, English and Italian.
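Evaluations of this kind typically report precision, recall, and F1 over the top-k extracted keyphrases against gold annotations. A minimal sketch with case-insensitive exact matching; the specific metric variants used in the chapter may differ (e.g. stemmed matching):

```python
def prf_at_k(predicted, gold, k):
    """Exact-match precision/recall/F1 over the top-k extracted keyphrases,
    compared case-insensitively against the gold keyphrase set."""
    top = [p.lower() for p in predicted[:k]]
    gold_set = {g.lower() for g in gold}
    tp = sum(1 for p in top if p in gold_set)   # true positives in the top k
    precision = tp / k
    recall = tp / len(gold_set)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

p, r, f = prf_at_k(["topic model", "neural network", "parsing"],
                   ["Topic Model", "parsing"], k=3)
```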


Author(s):  
Natalia Loukachevitch ◽  
Boris Dobrov

Abstract
This chapter describes the Russian RuThes thesaurus, created as a linguistic and terminological resource for automatic document processing. Its structure combines two popular paradigms for computational thesauri: concept-based units, a small set of relation types, and rules for including multiword expressions, as in information retrieval thesauri; and language-motivated units, detailed sets of synonyms, and descriptions of ambiguous words, as in WordNet-like thesauri. Development of the RuThes thesaurus has been supported for many years: new concepts, new senses, and multiword expressions found in contemporary texts are introduced regularly. The chapter shows some examples of representing newly appeared concepts related to important internal and international events.
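The two-paradigm structure can be pictured with a small data-structure sketch; the relation names and entries below are generic illustrations, not the actual RuThes relation inventory or content:

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A thesaurus unit in the RuThes spirit: a concept with a detailed
    synonym set (language-motivated, WordNet-like) and a small fixed
    inventory of typed relations (information-retrieval-thesaurus style)."""
    name: str
    synonyms: set = field(default_factory=set)     # single- and multiword text entries
    relations: dict = field(default_factory=dict)  # relation type -> set of concept names

# Illustrative entry; the relation types here are generic, not RuThes's own.
retrieval = Concept("information retrieval",
                    synonyms={"information retrieval", "document retrieval"})
retrieval.relations["broader"] = {"information processing"}
retrieval.relations["associated"] = {"search engine"}
```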


Author(s):  
Zahra Mousavi ◽  
Heshaam Faili

Nowadays, wordnets are extensively used as a major resource in natural language processing and information retrieval tasks. Therefore, the accuracy of wordnets has a direct influence on the performance of the applications involved. This paper presents a fully automated method for extending a previously developed Persian wordnet to cover more comprehensive and accurate verbal entries. First, using a bilingual dictionary, some Persian verbs are linked to Princeton WordNet (PWN) synsets. A feature set related to the semantic behavior of compound verbs, which constitute the majority of Persian verbs, is proposed. This feature set is employed in a supervised classification system to select the proper links for inclusion in the wordnet. We also benefit from a pre-existing Persian wordnet, FarsNet, and a similarity-based method to produce a training set. The result is the largest automatically developed Persian wordnet, with more than 27,000 words, 28,000 PWN synsets, and 67,000 word-sense pairs, substantially outperforming the previous Persian wordnet, which has about 16,000 words, 22,000 PWN synsets, and 38,000 word-sense pairs.
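The link-selection step the abstract describes, a classifier scoring candidate (verb, synset) links and keeping those that pass a threshold, can be sketched as follows. The verbs, features, and weights are invented stand-ins, not the paper's actual feature set or trained model:

```python
def select_links(candidates, classify, threshold=0.5):
    """Keep (verb, synset) candidate links whose classifier score passes
    the threshold. `classify` maps a feature dict to a score in [0, 1]."""
    return [(v, s) for v, s, feats in candidates if classify(feats) >= threshold]

# Toy stand-in for the supervised classifier: a weighted sum of two
# invented features about a compound verb's semantic behavior.
def toy_classifier(feats):
    return 0.7 * feats["translation_overlap"] + 0.3 * feats["light_verb_match"]

candidates = [
    ("khordan", "eat.v.01",     {"translation_overlap": 0.9, "light_verb_match": 1.0}),
    ("khordan", "collide.v.01", {"translation_overlap": 0.2, "light_verb_match": 0.0}),
]
kept = select_links(candidates, toy_classifier)
```

In the paper the training signal for such a classifier comes from links already agreed on by FarsNet and the similarity-based method.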


2021 ◽  
Vol 11 (6) ◽  
pp. 2663
Author(s):  
Zhengru Shen ◽  
Marco Spruit

The summary of product characteristics (SmPC) from the European Medicines Agency is a reference document on medicines in the EU. It contains textual information for clinical experts on how to use medicines safely, including adverse drug reactions. Using natural language processing (NLP) techniques to automatically extract adverse drug reactions from such unstructured textual information helps clinical experts use them effectively and efficiently in daily practice. Such techniques have been developed for Structured Product Labels from the Food and Drug Administration (FDA), but no research has focused on extracting them from the summary of product characteristics. In this work, we built an NLP pipeline that automatically scrapes summaries of product characteristics online and then extracts adverse drug reactions from them. In addition, we have made the method and its output publicly available so that they can be reused and further evaluated in clinical practice. In total, we extracted 32,797 common adverse drug reactions for 647 common medicines scraped from the Electronic Medicines Compendium. A manual review of 37 commonly used medicines indicated good performance, with a recall and precision of 0.99 and 0.934, respectively.
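One step of such a pipeline, pulling the adverse-reactions section out of an SmPC document, might look like the sketch below. The EU SmPC template places undesirable effects in section 4.8; the sample text and regex are illustrative, not the paper's actual implementation:

```python
import re

def extract_adr_section(smpc_text):
    """Extract the body of section 4.8 ('Undesirable effects') from a
    plain-text summary of product characteristics."""
    m = re.search(
        r"4\.8\s+Undesirable effects\s*(.*?)(?=\n4\.9\s|\Z)",
        smpc_text, re.DOTALL | re.IGNORECASE,
    )
    return m.group(1).strip() if m else ""

# Invented miniature SmPC fragment for illustration.
sample = """4.7 Effects on ability to drive
None known.
4.8 Undesirable effects
Common: headache, nausea.
4.9 Overdose
Supportive care."""
section = extract_adr_section(sample)
```

Downstream steps would then split the section body into individual adverse drug reaction terms and normalize them.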

