A New Hybrid Document Clustering for PRF-Based Automatic Query Expansion Approach for Effective IR

2020 ◽  
Vol 16 (3) ◽  
pp. 73-95
Author(s):  
Yogesh Gupta ◽  
Ashish Saini

Automatic query expansion (AQE) is an effective measure to improve information retrieval performance by including additional terms in a user query. The pseudo relevance feedback (PRF) method employed for AQE so far has suffered from a major problem of query drift. Therefore, keeping it in view, a new hybrid document clustering for PRF based AQE approach is proposed in the present article. In this, Fuzzy logic and Particle Swarm Optimization (PSO) are used to construct document clusters. Further, a new and effective hybrid PSO and Fuzzy logic-based term weighting approach is followed to find more suitable additional query terms using a weighted score of four IR evidences which is considered maximized. Moreover, a combined semantic filtering method along with query terms re-weighting algorithms are also used to remove noisy or irrelevant terms semantically. The performance of the presented approaches in this article is tested and compared with other approaches on three benchmark data sets. The comparative analysis of all the tested approaches illustrates the superior performance of the proposed approach.

2015 ◽  
Vol 6 (3) ◽  
Author(s):  
Resti Ludviani ◽  
Khadijah F. Hayati ◽  
Agus Zainal Arifin ◽  
Diana Purwitasari

Abstract. An appropriate selection term for expanding a query is very important in query expansion. Therefore, term selection optimization is added to improve query expansion performance on document retrieval system. This study proposes a new approach named Term Relatedness to Query-Entropy based (TRQE) to optimize weight in query expansion by considering semantic and statistic aspects from relevance evaluation of pseudo feedback to improve document retrieval performance. The proposed method has 3 main modules, they are relevace feedback, pseudo feedback, and document retrieval. TRQE is implemented in pseudo feedback module to optimize weighting term in query expansion. The evaluation result shows that TRQE can retrieve document with the highest result at precission of 100% and recall of 22,22%. TRQE for weighting optimization of query expansion is proven to improve retrieval document.     Keywords: TRQE, query expansion, term weighting, term relatedness to query, relevance feedback Abstrak..Pemilihan term yang tepat untuk memperluas queri merupakan hal yang penting pada query expansion. Oleh karena itu, perlu dilakukan optimasi penentuan term yang sesuai sehingga mampu meningkatkan performa query expansion pada system temu kembali dokumen. Penelitian ini mengajukan metode Term Relatedness to Query-Entropy based (TRQE), sebuah metode untuk mengoptimasi pembobotan pada query expansion dengan memperhatikan aspek semantic dan statistic dari penilaian relevansi suatu pseudo feedback sehingga mampu meningkatkan performa temukembali dokumen. Metode yang diusulkan memiliki 3 modul utama yaitu relevan feedback, pseudo feedback, dan document retrieval. TRQE diimplementasikan pada modul pseudo feedback untuk optimasi pembobotan term pada ekspansi query. Evaluasi hasil uji coba menunjukkan bahwa metode TRQE dapat melakukan temukembali dokumen dengan hasil terbaik pada precision  100% dan recall sebesar 22,22%.Metode TRQE untuk optimasi pembobotan pada query expansion terbukti memberikan pengaruh untuk meningkatkan relevansi pencarian dokumen.Kata Kunci: TRQE, ekspansi query, pembobotan term, term relatedness to query, relevance feedback


Entropy ◽  
2020 ◽  
Vol 23 (1) ◽  
pp. 62
Author(s):  
Zhengwei Liu ◽  
Fukang Zhu

The thinning operators play an important role in the analysis of integer-valued autoregressive models, and the most widely used is the binomial thinning. Inspired by the theory about extended Pascal triangles, a new thinning operator named extended binomial is introduced, which is a general case of the binomial thinning. Compared to the binomial thinning operator, the extended binomial thinning operator has two parameters and is more flexible in modeling. Based on the proposed operator, a new integer-valued autoregressive model is introduced, which can accurately and flexibly capture the dispersed features of counting time series. Two-step conditional least squares (CLS) estimation is investigated for the innovation-free case and the conditional maximum likelihood estimation is also discussed. We have also obtained the asymptotic property of the two-step CLS estimator. Finally, three overdispersed or underdispersed real data sets are considered to illustrate a superior performance of the proposed model.


Sensors ◽  
2021 ◽  
Vol 21 (10) ◽  
pp. 3373
Author(s):  
Ludek Cicmanec

The main objective of this paper is to describe a building process of a model predicting the soil strength at unpaved airport surfaces (unpaved runways, safety areas in runway proximity, runway strips, and runway end safety areas). The reason for building this model is to partially substitute frequent and meticulous inspections of an airport movement area comprising the bearing strength evaluation and provide an efficient tool to organize surface maintenance. Since the process of building such a model is complex for a physical model, it is anticipated that it might be addressed by a statistical model instead. Therefore, fuzzy logic (FL) and artificial neural network (ANN) capabilities are investigated and compared with linear regression function (LRF). Large data sets comprising the bearing strength and meteorological characteristics are applied to train the likely model variations to be subsequently compared with the application of standard statistical quantitative parameters. All the models prove that the inclusion of antecedent soil strength as an additional model input has an immense impact on the increase in model accuracy. Although the M7 model out of the ANN group displays the best performance, the M3 model is considered for practical implications being less complicated and having fewer inputs. In general, both the ANN and FL models outperform the LRF models well in all the categories. The FL models perform almost equally as well as the ANN but with slightly decreased accuracy.


2018 ◽  
Vol 2018 ◽  
pp. 1-19 ◽  
Author(s):  
Phet Aimtongkham ◽  
Tri Gia Nguyen ◽  
Chakchai So-In

Network congestion is a key challenge in resource-constrained networks, particularly those with limited bandwidth to accommodate high-volume data transmission, which causes unfavorable quality of service, including effects such as packet loss and low throughput. This challenge is crucial in wireless sensor networks (WSNs) with restrictions and constraints, including limited computing power, memory, and transmission due to self-contained batteries, which limit sensor node lifetime. Determining a path to avoid congested routes can prolong the network. Thus, we present a path determination architecture for WSNs that takes congestion into account. The architecture is divided into 3 stages, excluding the final criteria for path determination: (1) initial path construction in a top-down hierarchical structure, (2) path derivation with energy-aware assisted routing, and (3) congestion prediction using exponential smoothing. With several factors, such as hop count, remaining energy, buffer occupancy, and forwarding rate, we apply fuzzy logic systems to determine proper weights among those factors in addition to optimizing the weight over the membership functions using a bat algorithm. The simulation results indicate the superior performance of the proposed method in terms of high throughput, low packet loss, balancing the overall energy consumption, and prolonging the network lifetime compared to state-of-the-art protocols.


Kursor ◽  
2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ivanda Zevi Amalia ◽  
Akbar Noto Ponco Bimantoro ◽  
Agus Zainal Arifin ◽  
Maryamah Faisol ◽  
Rarasmaya Indraswari ◽  
...  

In general, hadith consists of isnad and matan (content). Matan can be separated into several components for example a story, main content, and some additional information. Other texts besides main content, such as isnad and story can interfere the retrieval process of relevant documents because most users typically use simple queries. Thus, in this paper, we proposed a Named Entity Recognition (NER) component weighting model in improving the Indonesian hadith retrieval system. We did 3 test scenarios, the first scenario (S1) did not separate the hadith into several components, the second scenario (S2) separated the hadith into 2 components, isnad and matan, and the third scenario separated the hadith into 4 components, isnad, background story, content, and additional information. From the experimental results, it is found that the TF-IDF with rocchio algorithm in query expansion outperforms DocVec. Also, separation and weighting of the hadith components affect the retrieval performance because isnad can be considered as noise in a query. Separation of 2 separate components had the best overall results in general although 4 separate components showed better results in some cases with precision up to 100% and 70% recall.


2016 ◽  
Vol 6 (2) ◽  
pp. 41-65 ◽  
Author(s):  
Sheetal A. Takale ◽  
Prakash J. Kulkarni ◽  
Sahil K. Shah

Information available on the internet is huge, diverse and dynamic. Current Search Engine is doing the task of intelligent help to the users of the internet. For a query, it provides a listing of best matching or relevant web pages. However, information for the query is often spread across multiple pages which are returned by the search engine. This degrades the quality of search results. So, the search engines are drowning in information, but starving for knowledge. Here, we present a query focused extractive summarization of search engine results. We propose a two level summarization process: identification of relevant theme clusters, and selection of top ranking sentences to form summarized result for user query. A new approach to semantic similarity computation using semantic roles and semantic meaning is proposed. Document clustering is effectively achieved by application of MDL principle and sentence clustering and ranking is done by using SNMF. Experiments conducted demonstrate the effectiveness of system in semantic text understanding, document clustering and summarization.


2019 ◽  
Vol 21 (5) ◽  
pp. 1581-1595 ◽  
Author(s):  
Xinlei Zhao ◽  
Shuang Wu ◽  
Nan Fang ◽  
Xiao Sun ◽  
Jue Fan

Abstract Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adapted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some lights on potential direction of future improvement on classification tools.


Author(s):  
Ahmed Abbache ◽  
Farid Meziane ◽  
Ghalem Belalem ◽  
Fatma Zohra Belkredim

Query expansion is the process of adding additional relevant terms to the original queries to improve the performance of information retrieval systems. However, previous studies showed that automatic query expansion using WordNet do not lead to an improvement in the performance. One of the main challenges of query expansion is the selection of appropriate terms. In this paper, the authors review this problem using Arabic WordNet and Association Rules within the context of Arabic Language. The results obtained confirmed that with an appropriate selection method, the authors are able to exploit Arabic WordNet to improve the retrieval performance. Their empirical results on a sub-corpus from the Xinhua collection showed that their automatic selection method has achieved a significant performance improvement in terms of MAP and recall and a better precision with the first top retrieved documents.


Sign in / Sign up

Export Citation Format

Share Document