Exploiting extra-textual and linguistic information in keyphrase extraction

2014 ◽  
Vol 22 (1) ◽  
pp. 73-95 ◽  
Author(s):  
GÁBOR BEREND

Abstract
Keyphrases are the most important phrases of a document, and as such they are useful for improving natural language processing tasks, including information retrieval, document classification, document visualization, summarization, and categorization. Here, we propose a supervised framework augmented with novel extra-textual information derived primarily from Wikipedia. Unlike most other methods relying on Wikipedia, our approach does not require a full textual index of all Wikipedia articles: we exploit only the category hierarchy and a list of multiword expressions derived from Wikipedia. This approach is not only less resource intensive, but also produces results comparable or superior to previous similar work. Our thorough evaluations also suggest that the proposed framework performs consistently well on multiple datasets, being competitive with, or even outperforming, other state-of-the-art methods. Besides introducing features that incorporate extra-textual information, we also experimented with a novel way of representing features derived from the POS tagging of the keyphrase candidates.
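The kind of candidate features the abstract describes can be pictured with a small sketch. This is a hypothetical illustration, not the paper's actual feature set: the MWE list, category map, and feature names below are invented for the example.

```python
# Hypothetical sketch of extra-textual feature extraction for one keyphrase
# candidate, assuming a precomputed set of Wikipedia multiword expressions
# and a phrase -> Wikipedia categories map (both invented here).
def candidate_features(phrase, pos_tags, wiki_mwes, wiki_categories):
    key = phrase.lower()
    return {
        "length": len(key.split()),                     # tokens in the candidate
        "is_wiki_mwe": key in wiki_mwes,                # listed as a Wikipedia MWE
        "n_wiki_categories": len(wiki_categories.get(key, [])),
        "pos_pattern": "-".join(pos_tags),              # e.g. "JJ-NN"
        "ends_in_noun": pos_tags[-1].startswith("NN"),
    }

wiki_mwes = {"information retrieval", "natural language processing"}
wiki_cats = {"information retrieval": ["Information science", "Computing"]}
feats = candidate_features("Information Retrieval", ["NN", "NN"], wiki_mwes, wiki_cats)
```

A vector of such features per candidate would then feed a standard supervised classifier.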

2019 ◽  
Vol 53 (2) ◽  
pp. 3-10
Author(s):  
Muthu Kumar Chandrasekaran ◽  
Philipp Mayr

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 aimed to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state of the art in scholarly document understanding, analysis, and retrieval at scale. The workshop comprised several paper sessions and the 5th edition of the CL-SciSumm Shared Task.


2020 ◽  
Author(s):  
Yuqi Kong ◽  
Fanchao Meng ◽  
Ben Carterette

Comparing document semantics is one of the toughest tasks in both natural language processing and information retrieval. To date, tools for this task remain rare, and most relevant methods are devised from a statistical or vector space model perspective; nearly none take a topological perspective. In this paper, we take a different approach: we propose a novel algorithm based on topological persistence for comparing the semantic similarity of two documents. Our experiments are conducted on a document dataset with human judges' results, and a collection of state-of-the-art methods is selected for comparison. The experimental results show that our algorithm produces highly human-consistent results and beats most state-of-the-art methods, though it ties with NLTK.
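The abstract does not specify the construction, but one common entry point to topological persistence is the fact that the 0-dimensional persistence deaths of a Vietoris-Rips filtration over a point cloud (for a document, e.g. its word vectors) are exactly the minimum-spanning-tree edge weights. A minimal sketch under that assumption, with a naive diagram comparison standing in for a proper bottleneck or Wasserstein distance:

```python
import math

def mst_edge_weights(points):
    """Deaths of the 0-dimensional persistence features of a Rips filtration:
    these equal the minimum spanning tree edge weights (Prim's algorithm)."""
    best = {i: math.dist(points[0], points[i]) for i in range(1, len(points))}
    deaths = []
    while best:
        j = min(best, key=best.get)          # next point absorbed into the tree
        deaths.append(best.pop(j))
        for k in best:
            best[k] = min(best[k], math.dist(points[j], points[k]))
    return sorted(deaths)

def diagram_distance(d1, d2):
    """Naive L1 distance between zero-padded sorted death vectors,
    a stand-in for a real bottleneck/Wasserstein diagram distance."""
    m = max(len(d1), len(d2))
    p1, p2 = d1 + [0.0] * (m - len(d1)), d2 + [0.0] * (m - len(d2))
    return sum(abs(a - b) for a, b in zip(p1, p2))

# Toy "documents" as 2-D point clouds standing in for word embeddings.
doc_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
doc_b = [(0.0, 0.0), (2.0, 0.0)]
score = diagram_distance(mst_edge_weights(doc_a), mst_edge_weights(doc_b))
```

Real implementations would compute higher-dimensional features as well, but the 0-dimensional case already captures how cluster structure differs between two clouds.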


2018 ◽  
Author(s):  
Debanjan Mahata ◽  
John Kuriakose ◽  
Rajiv Ratn Shah ◽  
Roger Zimmermann

Keyphrase extraction is a fundamental task in natural language processing that facilitates the mapping of documents to a set of representative phrases. In this paper, we present an unsupervised technique (Key2Vec) that leverages phrase embeddings for ranking keyphrases extracted from scientific articles. Specifically, we propose an effective way of processing text documents for training multi-word phrase embeddings, which are used for the thematic representation of scientific articles and for ranking the keyphrases extracted from them using theme-weighted PageRank. Evaluations on benchmark datasets produce state-of-the-art results.
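Theme-weighted PageRank can be read as personalized PageRank in which the teleport distribution favors phrases similar to the document's theme. A minimal power-iteration sketch under that reading; the graph, weights, and parameter values here are illustrative, not the paper's:

```python
def theme_weighted_pagerank(graph, theme_weights, d=0.85, iters=50):
    """Personalized PageRank over a weighted phrase graph.
    graph: {phrase: {neighbor: edge_weight}}; theme_weights: {phrase:
    similarity to the document theme}, normalized into the teleport
    distribution so thematic phrases attract more rank mass."""
    nodes = list(graph)
    total = sum(theme_weights.get(n, 0.0) for n in nodes) or 1.0
    teleport = {n: theme_weights.get(n, 0.0) / total for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            inflow = 0.0
            for m in nodes:
                out = sum(graph[m].values())
                if out and n in graph[m]:
                    inflow += rank[m] * graph[m][n] / out
            new[n] = (1 - d) * teleport[n] + d * inflow
        rank = new
    return rank

# Toy phrase graph: "a" and "b" co-occur strongly; "c" is peripheral.
graph = {"a": {"b": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0}}
themes = {"a": 1.0, "b": 0.5, "c": 0.1}   # similarity to the document theme
rank = theme_weighted_pagerank(graph, themes)
```

The top-ranked nodes would then be returned as the document's keyphrases.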


Author(s):  
Isabella Gagliardi ◽  
Maria Teresa Artese

Keyword/keyphrase extraction is an important research activity in text mining, natural language processing, and information retrieval. A large number of algorithms, divided into supervised and unsupervised methods, have been designed and developed to solve the problem of automatic keyphrase extraction. The aim of the chapter is to critically discuss unsupervised automatic keyphrase extraction algorithms, analyzing their characteristics in depth. The methods presented are tested on different datasets, with the data, the algorithms, and the different options tested in the runs described in detail. Moreover, most studies and experiments have been conducted on texts in English, while there are few experiments concerning other languages, such as Italian. Particular attention is therefore paid to evaluating the results of the methods in two languages, English and Italian.
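Evaluations of this kind typically report precision, recall, and F1 over the top-k extracted keyphrases against gold annotations. A minimal sketch with case-insensitive exact matching; the specific metric variants used in the chapter may differ (e.g. stemmed matching):

```python
def prf_at_k(predicted, gold, k):
    """Exact-match precision/recall/F1 over the top-k extracted keyphrases,
    compared case-insensitively against the gold keyphrase set."""
    top = [p.lower() for p in predicted[:k]]
    gold_set = {g.lower() for g in gold}
    tp = sum(1 for p in top if p in gold_set)   # true positives in the top k
    precision = tp / k
    recall = tp / len(gold_set)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

p, r, f = prf_at_k(["topic model", "neural network", "parsing"],
                   ["Topic Model", "parsing"], k=3)
```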


Author(s):  
Natalia Loukachevitch ◽  
Boris Dobrov

Abstract
This chapter describes the Russian RuThes thesaurus, created as a linguistic and terminological resource for automatic document processing. Its structure combines two popular paradigms for computational thesauri: concept-based units, a small set of relation types, and rules for including multiword expressions, as in information retrieval thesauri; and language-motivated units, detailed sets of synonyms, and descriptions of ambiguous words, as in WordNet-like thesauri. Development of the RuThes thesaurus has been supported for many years: new concepts, new senses, and multiword expressions found in contemporary texts are introduced regularly. The chapter shows some examples of representing newly appeared concepts related to important internal and international events.
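The two-paradigm structure can be pictured with a small data-structure sketch; the relation names and entries below are generic illustrations, not the actual RuThes relation inventory or content:

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A thesaurus unit in the RuThes spirit: a concept with a detailed
    synonym set (language-motivated, WordNet-like) and a small fixed
    inventory of typed relations (information-retrieval-thesaurus style)."""
    name: str
    synonyms: set = field(default_factory=set)     # single- and multiword text entries
    relations: dict = field(default_factory=dict)  # relation type -> set of concept names

# Illustrative entry; the relation types here are generic, not RuThes's own.
retrieval = Concept("information retrieval",
                    synonyms={"information retrieval", "document retrieval"})
retrieval.relations["broader"] = {"information processing"}
retrieval.relations["associated"] = {"search engine"}
```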


Author(s):  
Zahra Mousavi ◽  
Heshaam Faili

Nowadays, wordnets are extensively used as a major resource in natural language processing and information retrieval tasks. Therefore, the accuracy of wordnets has a direct influence on the performance of the applications involved. This paper presents a fully automated method for extending a previously developed Persian wordnet to cover more comprehensive and accurate verbal entries. First, using a bilingual dictionary, some Persian verbs are linked to Princeton WordNet (PWN) synsets. A feature set related to the semantic behavior of compound verbs, which constitute the majority of Persian verbs, is proposed. This feature set is employed in a supervised classification system to select the proper links for inclusion in the wordnet. We also benefit from a pre-existing Persian wordnet, FarsNet, and a similarity-based method to produce a training set. The result is the largest automatically developed Persian wordnet, with more than 27,000 words, 28,000 PWN synsets, and 67,000 word-sense pairs, substantially outperforming the previous Persian wordnet, which has about 16,000 words, 22,000 PWN synsets, and 38,000 word-sense pairs.
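The link-selection step the abstract describes, a classifier scoring candidate (verb, synset) links and keeping those that pass a threshold, can be sketched as follows. The verbs, features, and weights are invented stand-ins, not the paper's actual feature set or trained model:

```python
def select_links(candidates, classify, threshold=0.5):
    """Keep (verb, synset) candidate links whose classifier score passes
    the threshold. `classify` maps a feature dict to a score in [0, 1]."""
    return [(v, s) for v, s, feats in candidates if classify(feats) >= threshold]

# Toy stand-in for the supervised classifier: a weighted sum of two
# invented features about a compound verb's semantic behavior.
def toy_classifier(feats):
    return 0.7 * feats["translation_overlap"] + 0.3 * feats["light_verb_match"]

candidates = [
    ("khordan", "eat.v.01",     {"translation_overlap": 0.9, "light_verb_match": 1.0}),
    ("khordan", "collide.v.01", {"translation_overlap": 0.2, "light_verb_match": 0.0}),
]
kept = select_links(candidates, toy_classifier)
```

In the paper the training signal for such a classifier comes from links already agreed on by FarsNet and the similarity-based method.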


2021 ◽  
Vol 11 (6) ◽  
pp. 2663
Author(s):  
Zhengru Shen ◽  
Marco Spruit

The summary of product characteristics (SmPC) from the European Medicines Agency is a reference document on medicines in the EU. It contains textual information for clinical experts on how to use medicines safely, including adverse drug reactions. Using natural language processing (NLP) techniques to automatically extract adverse drug reactions from such unstructured textual information helps clinical experts use them effectively and efficiently in daily practice. Such techniques have been developed for Structured Product Labels from the Food and Drug Administration (FDA), but no research has focused on extracting them from the summary of product characteristics. In this work, we built an NLP pipeline that automatically scrapes summaries of product characteristics online and then extracts adverse drug reactions from them. In addition, we have made the method and its output publicly available so that they can be reused and further evaluated in clinical practice. In total, we extracted 32,797 common adverse drug reactions for 647 common medicines scraped from the Electronic Medicines Compendium. A manual review of 37 commonly used medicines indicated good performance, with a recall and precision of 0.99 and 0.934, respectively.
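One step of such a pipeline, pulling the adverse-reactions section out of an SmPC document, might look like the sketch below. The EU SmPC template places undesirable effects in section 4.8; the sample text and regex are illustrative, not the paper's actual implementation:

```python
import re

def extract_adr_section(smpc_text):
    """Extract the body of section 4.8 ('Undesirable effects') from a
    plain-text summary of product characteristics."""
    m = re.search(
        r"4\.8\s+Undesirable effects\s*(.*?)(?=\n4\.9\s|\Z)",
        smpc_text, re.DOTALL | re.IGNORECASE,
    )
    return m.group(1).strip() if m else ""

# Invented miniature SmPC fragment for illustration.
sample = """4.7 Effects on ability to drive
None known.
4.8 Undesirable effects
Common: headache, nausea.
4.9 Overdose
Supportive care."""
section = extract_adr_section(sample)
```

Downstream steps would then split the section body into individual adverse drug reaction terms and normalize them.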

