Word-embedding-based query expansion: Incorporating Deep Averaging Networks in Arabic document retrieval

2021 ◽  
pp. 016555152110406
Author(s):  
Yasir Hadi Farhan ◽  
Shahrul Azman Mohd Noah ◽  
Masnizah Mohd ◽  
Jaffar Atwan

One of the main issues associated with search engines is the query–document vocabulary mismatch problem, a long-standing problem in Information Retrieval (IR). This problem occurs when a user query does not match the content of stored documents, and it affects most search tasks. Automatic query expansion (AQE) is one of the most common approaches used to address this problem. Various AQE techniques have been proposed; these mainly involve finding synonyms or related words for the query terms. Word embedding (WE) is one of the methods that are currently receiving significant attention. Most of the existing AQE techniques focus on expanding the individual query terms rather the entire query during the expansion process, and this can lead to query drift if poor expansion terms are selected. In this article, we introduce Deep Averaging Networks (DANs), an architecture that feeds the average of the WE vectors produced by the Word2Vec toolkit for the terms in a query through several linear neural network layers. This average vector is assumed to represent the meaning of the query as a whole and can be used to find expansion terms that are relevant to the complete query. We explore the potential of DANs for AQE in Arabic document retrieval. We experiment with using DANs for AQE in the classic probabilistic BM25 model as well as for two recent expansion strategies: Embedding-Based Query Expansion approach (EQE1) and Prospect-Guided Query Expansion Strategy (V2Q). Although DANs did not improve all outcomes when used in the BM25 model, it outperformed all baselines when incorporated into the EQE1 and V2Q expansion strategies.

2015 ◽  
Vol 2015 ◽  
pp. 1-13 ◽  
Author(s):  
Jagendra Singh ◽  
Aditi Sharan

Pseudo-Relevance Feedback (PRF) is a well-known method of query expansion for improving the performance of information retrieval systems. All the terms of PRF documents are not important for expanding the user query. Therefore selection of proper expansion term is very important for improving system performance. Individual query expansion terms selection methods have been widely investigated for improving its performance. Every individual expansion term selection method has its own weaknesses and strengths. To overcome the weaknesses and to utilize the strengths of the individual method, we used multiple terms selection methods together. In this paper, first the possibility of improving the overall performance using individual query expansion terms selection methods has been explored. Second, Borda count rank aggregation approach is used for combining multiple query expansion terms selection methods. Third, the semantic similarity approach is used to select semantically similar terms with the query after applying Borda count ranks combining approach. Our experimental results demonstrated that our proposed approaches achieved a significant improvement over individual terms selection method and related state-of-the-art methods.


2014 ◽  
Vol 28 (4) ◽  
pp. 344-359 ◽  
Author(s):  
Gyeong June Hahm ◽  
Mun Yong Yi ◽  
Jae Hyun Lee ◽  
Hyo Won Suh

2008 ◽  
Author(s):  
Makoto Terao ◽  
Takafumi Koshinaka ◽  
Shinichi Ando ◽  
Ryosuke Isotani ◽  
Akitoshi Okumura

2020 ◽  
Vol 125 (3) ◽  
pp. 3017-3046 ◽  
Author(s):  
André Greiner-Petter ◽  
Abdou Youssef ◽  
Terry Ruas ◽  
Bruce R. Miller ◽  
Moritz Schubotz ◽  
...  

AbstractWord embedding, which represents individual words with semantically fixed-length vectors, has made it possible to successfully apply deep learning to natural language processing tasks such as semantic role-modeling, question answering, and machine translation. As math text consists of natural text, as well as math expressions that similarly exhibit linear correlation and contextual characteristics, word embedding techniques can also be applied to math documents. However, while mathematics is a precise and accurate science, it is usually expressed through imprecise and less accurate descriptions, contributing to the relative dearth of machine learning applications for information retrieval in this domain. Generally, mathematical documents communicate their knowledge with an ambiguous, context-dependent, and non-formal language. Given recent advances in word embedding, it is worthwhile to explore their use and effectiveness in math information retrieval tasks, such as math language processing and semantic knowledge extraction. In this paper, we explore math embedding by testing it on several different scenarios, namely, (1) math-term similarity, (2) analogy, (3) numerical concept-modeling based on the centroid of the keywords that characterize a concept, (4) math search using query expansions, and (5) semantic extraction, i.e., extracting descriptive phrases for math expressions. Due to the lack of benchmarks, our investigations were performed using the arXiv collection of STEM documents and carefully selected illustrations on the Digital Library of Mathematical Functions (DLMF: NIST digital library of mathematical functions. Release 1.0.20 of 2018-09-1, 2018). Our results show that math embedding holds much promise for similarity, analogy, and search tasks. However, we also observed the need for more robust math embedding approaches. Moreover, we explore and discuss fundamental issues that we believe thwart the progress in mathematical information retrieval in the direction of machine learning.


Author(s):  
Qian Gao ◽  
◽  
Young Im Cho ◽  

This paper proposes a multi-agent query refinement approach to realize personalized query expansion effective for academic paper retrieval in a Big Data environment. First, we use Hadoop as a platform to develop a formalized model to represent different types of large caches of data in order to analyze and process Big Data efficiently. Second, we use a client agent to verify user identities and monitor whether a device is ready to run a query-expanded task. We then use a query expansion agent to determine the domain that the initial query belongs to by applying a knowledgebased query expansion strategy and comprehensively considering users’ interests according to the intelligent devices they use by implementing a user-device-based query expansion strategy and a weighted query expansion strategy in order to obtain the optimized query expansion set. We compare our method with the conceptual retrieval method as well as other two lexical methods for query expansion, and we prove that our method has better average recall and average precision ratios.


Author(s):  
Weizhong Zhu ◽  
Xuheng Xu ◽  
Xiaohua Hu ◽  
Il-Yeol Song ◽  
R.B. Allen

2015 ◽  
Vol 6 (3) ◽  
Author(s):  
Resti Ludviani ◽  
Khadijah F. Hayati ◽  
Agus Zainal Arifin ◽  
Diana Purwitasari

Abstract. An appropriate selection term for expanding a query is very important in query expansion. Therefore, term selection optimization is added to improve query expansion performance on document retrieval system. This study proposes a new approach named Term Relatedness to Query-Entropy based (TRQE) to optimize weight in query expansion by considering semantic and statistic aspects from relevance evaluation of pseudo feedback to improve document retrieval performance. The proposed method has 3 main modules, they are relevace feedback, pseudo feedback, and document retrieval. TRQE is implemented in pseudo feedback module to optimize weighting term in query expansion. The evaluation result shows that TRQE can retrieve document with the highest result at precission of 100% and recall of 22,22%. TRQE for weighting optimization of query expansion is proven to improve retrieval document.     Keywords: TRQE, query expansion, term weighting, term relatedness to query, relevance feedback Abstrak..Pemilihan term yang tepat untuk memperluas queri merupakan hal yang penting pada query expansion. Oleh karena itu, perlu dilakukan optimasi penentuan term yang sesuai sehingga mampu meningkatkan performa query expansion pada system temu kembali dokumen. Penelitian ini mengajukan metode Term Relatedness to Query-Entropy based (TRQE), sebuah metode untuk mengoptimasi pembobotan pada query expansion dengan memperhatikan aspek semantic dan statistic dari penilaian relevansi suatu pseudo feedback sehingga mampu meningkatkan performa temukembali dokumen. Metode yang diusulkan memiliki 3 modul utama yaitu relevan feedback, pseudo feedback, dan document retrieval. TRQE diimplementasikan pada modul pseudo feedback untuk optimasi pembobotan term pada ekspansi query. Evaluasi hasil uji coba menunjukkan bahwa metode TRQE dapat melakukan temukembali dokumen dengan hasil terbaik pada precision  100% dan recall sebesar 22,22%.Metode TRQE untuk optimasi pembobotan pada query expansion terbukti memberikan pengaruh untuk meningkatkan relevansi pencarian dokumen.Kata Kunci: TRQE, ekspansi query, pembobotan term, term relatedness to query, relevance feedback


Sign in / Sign up

Export Citation Format

Share Document