Leaving No Stone Unturned: Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus

2021, Vol 3 (1), pp. 263-283
Author(s): Callum Hughes, Maxim Filimonov, Alison Wray, Irena Spasić

Idioms are multi-word expressions whose meaning cannot always be deduced from the literal meaning of constituent words. A key feature of idioms that is central to this paper is their peculiar mixture of fixedness and variability, which poses challenges for their retrieval from large corpora using traditional search approaches. These challenges hinder insights into idiom usage, affecting users who are conducting linguistic research as well as those involved in language education. To facilitate access to idiom examples taken from real-world contexts, we introduce an information retrieval system designed specifically for idioms. Given a search query that represents an idiom, typically in its canonical form, the system expands it automatically to account for the most common types of idiom variation including inflection, open slots, adjectival or adverbial modification and passivisation. As a by-product of query expansion, other types of idiom variation captured include derivation, compounding, negation, distribution across multiple clauses as well as other unforeseen types of variation. The system was implemented on top of Elasticsearch, an open-source, distributed, scalable, real-time search engine. Flexible retrieval of idioms is supported by a combination of linguistic pre-processing of the search queries, their translation into a set of query clauses written in a query language called Query DSL, and analysis, an indexing process that involves tokenisation and normalisation. Our system outperformed the phrase search in terms of recall and outperformed the keyword search in terms of precision. Out of the three, our approach was found to provide the best balance between precision and recall. By providing a fast and easy way of finding idioms in large corpora, our approach can facilitate further developments in fields such as linguistics, language education and natural language processing.
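
To make the translation of an idiom into Query DSL clauses more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the helper name `idiom_query`, the field name `text`, and the choice of a `span_near` query over `span_term` clauses are assumptions made for illustration. It also assumes that the field was indexed with an analyzer that lowercases and stems tokens, and that the idiom's content words are supplied in that same normalised form.

```python
import json


def idiom_query(field, lemmas, slop=3, in_order=False):
    """Build a Query DSL body that finds the given normalised terms near each other.

    Hypothetical helper: `field`, `slop` and `in_order` are illustrative choices.
    - slop allows intervening words (open slots, adjectival or adverbial modifiers).
    - in_order=False tolerates reordering, as in passivised variants.
    """
    # span_term is a term-level query, so each lemma must match an indexed token.
    clauses = [{"span_term": {field: lemma}} for lemma in lemmas]
    return {
        "query": {
            "span_near": {
                "clauses": clauses,
                "slop": slop,
                "in_order": in_order,
            }
        }
    }


# Content words of "spill the beans" in an assumed stemmed form.
body = idiom_query("text", ["spill", "bean"], slop=3)
print(json.dumps(body, indent=2))
```

Under these assumptions, the resulting body could match variants such as "spilled all the beans" or "the beans were spilled", which a strict phrase search would miss, while still being more selective than an unordered keyword search.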

Digital documents now play a major role in every area of the web, since most information has been digitised. Search engines use queries to retrieve this information. The query plays a major role in an information retrieval system, because it determines which relevant and non-relevant documents are retrieved. Query expansion techniques can improve the performance of an information retrieval system. Our proposed query expansion technique is based on Word Sense Disambiguation (WSD), which finds the correct sense of an ambiguous word in the regional language Telugu. In query expansion, if an added query term is ambiguous, the accuracy of the retrieved documents will be very low. To avoid this, the proposed method uses WSD, a technique from Natural Language Processing (NLP) and Artificial Intelligence (AI). WSD improves the accuracy of the information retrieval system.


Ontologies provide a structured way of describing knowledge. They are usually repositories of concepts and the relations between them, so using them in information retrieval seems to be a reasonable goal. The main objective of this report is to provide an efficient means of moving from keyword-based to concept-based information retrieval, utilizing ontologies for conceptual definitions [1]. In this paper, we present the skeleton of such an IR system, which works on a collection of domain-specific documents and exploits a domain-specific ontology to increase the overall number of relevant documents retrieved. In this system, a user enters a query from which the meaningful concepts are extracted; using these concepts and the domain ontology, query expansion is performed. We propose a system that matches the query terms in the ontology/schema graph and exploits the surrounding knowledge to derive an enhanced query. The enhanced query is passed to the underlying basic keyword search system, LUCENE [2]. In this approach, we try to make use of more ontological knowledge than IS-A and HAS-A relationships and synonyms for information retrieval.
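
The expansion step described above can be sketched as follows. This is a toy illustration rather than the system from the abstract: the ontology contents, the relation names, the helper `expand_query`, and the boost value of 0.5 are all assumptions. The output uses the `OR` operator and `^` boost notation of the Lucene classic query parser, so that terms added from the ontology count for less than the user's own terms.

```python
# Toy domain ontology (hypothetical): concept -> list of (relation, neighbour).
ONTOLOGY = {
    "laptop": [
        ("is_a", "computer"),
        ("has_part", "battery"),
        ("synonym", "notebook"),
    ],
    "battery": [("is_a", "power source")],
}


def expand_query(query, ontology, boost=0.5):
    """Expand a keyword query with ontology neighbours of each query term."""
    original = query.lower().split()
    added = []
    for term in original:
        for _relation, neighbour in ontology.get(term, []):
            # Skip neighbours the user already typed or that were already added.
            if neighbour not in original and neighbour not in added:
                added.append(neighbour)

    def fmt(term):
        # Multi-word concepts become quoted phrases.
        return f'"{term}"' if " " in term else term

    parts = [fmt(t) for t in original]
    parts += [f"{fmt(t)}^{boost}" for t in added]
    return " OR ".join(parts)


print(expand_query("laptop battery", ONTOLOGY))
# laptop OR battery OR computer^0.5 OR notebook^0.5 OR "power source"^0.5
```

A fuller version would follow selected relation types to a bounded depth in the ontology graph; this sketch only looks at direct neighbours.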


Author(s): Jiangning Wu, Hiroki Tanioka, Shizhu Wang, Donghua Pan, Kenichi Yamamoto, ...

Author(s): Bilel Elayeb, Ibrahim Bounhas, Oussama Ben Khiroun, Fabrice Evrard, Narjès Bellamine-BenSaoud

This paper presents a new possibilistic information retrieval system using semantic query expansion. The work concerns query expansion strategies based on external linguistic resources, in this case the French dictionary “Le Grand Robert”. First, the authors model the dictionary as a graph and compute similarities between query terms by exploiting the circuits in the graph. Second, possibility theory is applied through a double relevance measure (possibility and necessity) between the articles of the dictionary and the query terms. Third, these two approaches are combined using two different aggregation methods. The authors also build on an existing approach for reweighting query terms in the possibilistic matching model to improve the expansion process. To assess and compare the approaches, the authors performed experiments on the standard ‘LeMonde94’ test collection.


2018
Author(s): Fabiano Tavares Da Silva, José Everardo Bessa Maia

This article presents Luppar, an Information Retrieval tool for closed collections of documents, which uses a local distributional semantic model associated with each corpus. The system performs automatic query expansion using a combination of the distributional semantic model and local context analysis, and it supports relevance feedback. The performance of the system was evaluated on databases from different domains, and the results were equal to or better than those published in the literature.


2017, Vol 13 (3), pp. 57-78
Author(s): Jagendra Singh, Rakesh Kumar

Query expansion (QE) is an effective method for enhancing the efficiency of information retrieval systems. In this work, we try to capture the limitations of the pseudo-feedback based QE approach and propose a hybrid approach that enhances the efficiency of feedback based QE by combining corpus-based information, contextual information, and semantic knowledge about the query terms. First, this paper explores the use of different corpus-based lexical co-occurrence approaches to select an optimal combination of query terms from a pool of terms obtained using pseudo-feedback based QE. Next, we explore a semantic similarity approach based on word2vec for ranking the QE terms obtained from the top pseudo-feedback documents. Finally, we combine co-occurrence statistics, contextual window statistics, and semantic similarity based approaches to select the best expansion terms for query reformulation. The experiments were performed on the FIRE ad-hoc and TREC-3 benchmark datasets. The experimental results show a significant improvement over the baseline method.
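
As a rough illustration of how such signals might be combined, the sketch below ranks candidate expansion terms by a weighted sum of a co-occurrence score and an embedding cosine similarity. It is not the authors' formula: the linear weighting with `alpha`, the function `rank_candidates`, and the toy co-occurrence scores and two-dimensional vectors (standing in for word2vec embeddings) are all assumptions, and the contextual window statistics mentioned above are left out for brevity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query_terms, candidates, cooccurrence, vectors, alpha=0.5):
    """Rank candidates by alpha * co-occurrence + (1 - alpha) * semantic similarity.

    Both signals are averaged over the query terms (an illustrative choice).
    """
    scores = {}
    for cand in candidates:
        co = sum(cooccurrence.get((q, cand), 0.0) for q in query_terms)
        sem = sum(cosine(vectors[q], vectors[cand]) for q in query_terms)
        scores[cand] = (alpha * co + (1 - alpha) * sem) / len(query_terms)
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): normalised co-occurrence scores and tiny embeddings.
COOCCURRENCE = {("car", "automobile"): 0.8, ("car", "banana"): 0.1}
VECTORS = {"car": [1.0, 0.0], "automobile": [0.9, 0.1], "banana": [0.0, 1.0]}

ranking = rank_candidates(["car"], ["banana", "automobile"], COOCCURRENCE, VECTORS)
print(ranking)  # ['automobile', 'banana']
```

In practice the candidate pool would come from the top pseudo-feedback documents, and the top-ranked terms would be appended to the original query.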


2001, Vol 7 (2), pp. 117-142
Author(s): David Elworthy, Tony Rose, Amanda Clare, Aaron Kotcheff

ANVIL is an information retrieval system using natural language processing techniques, intended for retrieval of captioned images. It extracts dependency structures from the image captions and user queries, and then applies a high accuracy matching algorithm which recursively explores the dependency structures to determine their similarity. A further algorithm allows additional contextual information to be extracted following a successful match, with the intention of helping users understand and organise the retrieval results. ANVIL was developed to high engineering standards, and as well as looking at the research aspects of the system, we also look at some of the design and development issues. English and Japanese versions of the system have been developed.

