State-of-the-Art Survey on Web Search

Name ambiguity, which arises because many people share the same name, often degrades the performance of information integration, document retrieval and web search. In academic data analysis, author name ambiguity usually reduces the quality of the analysis. To solve this problem, the author name disambiguation task divides the documents related to an author name reference into several parts, each associated with a real-life person. Existing methods usually use either attributes of documents or relationships between documents and co-authors. However, feature extraction methods based on attributes make models inflexible, while solutions based on relationship graph networks ignore the information contained in the features. In this paper, we propose a novel name disambiguation model based on representation learning which incorporates both attributes and relationships. Experiments on a public real dataset demonstrate the effectiveness of our model and show that our solution is superior to several state-of-the-art graph-based methods. We also increase the interpretability of our method through information theory and show that the analysis could be helpful for model selection and training progress.


Performance Driven Development Framework for Web Applications

Global Journal of Enterprise Information System ◽

10.18311/gjeis/2017/15870 ◽

2017 ◽

Vol 9 (1) ◽

pp. 75

Author(s):

K. S. Shailesh ◽

P. V. Suresh

Keyword(s):

Customer Loyalty ◽

Performance Optimization ◽

Web Search ◽

Web Applications ◽

State Of The Art ◽

End User ◽

Web Performance ◽

Performance Engineering ◽

Development Framework ◽

Performance Patterns

The performance of web applications is of paramount importance as it can impact end-user experience and business revenue. Web Performance Optimization (WPO) deals with front-end performance engineering. Web performance affects customer loyalty, SEO, web search ranking, site traffic, repeat visitors and overall online revenue. In this paper we have conducted a survey of state-of-the-art tools, techniques and methodologies for various aspects of web performance optimization. We have identified key web performance patterns and proposed a novel web performance driven development framework. We have elaborated on various techniques related to the different phases of the web performance driven development framework.


Semantic expansion of search queries

10.32920/ryerson.14654316.v1 ◽

2021 ◽

Author(s):

Andisheh Keikha

Keyword(s):

Relevance Feedback ◽

Query Expansion ◽

Web Search ◽

State Of The Art ◽

Feature Learning ◽

Correct Interpretation ◽

Search Queries ◽

Term Selection ◽

Semantic Expansion ◽

Pseudo Relevance Feedback

One of the major challenges in Web search pertains to the correct interpretation of users' intent. Query expansion is one of the well-known approaches for determining the intent of the user by addressing the vocabulary mismatch problem. A limitation of current query expansion approaches is that the relations between the query words and the expanded words are limited. In this thesis, we capture users' intent through query expansion. We build on earlier work in the area by adopting a pseudo-relevance feedback approach; however, we advance the state of the art by proposing an approach for feature learning within the process of query expansion. In our work, we specifically consider the Wikipedia corpus as the feedback collection space and identify the best features within this context for term selection in two supervised and unsupervised models. We compare our work with state-of-the-art query expansion techniques, the results of which show promising robustness and improved precision.
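The general idea of pseudo-relevance feedback mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's method: it uses raw term frequency for term selection where the thesis learns features, and the function name `expand_query` and its parameters are hypothetical. It assumes the top-ranked documents of an initial retrieval are relevant and borrows their most frequent new terms.

```python
from collections import Counter


def expand_query(query, ranked_docs, top_k=2, n_terms=2, stopwords=frozenset()):
    """Return the query terms plus the n_terms most frequent new terms
    found in the top_k documents of an initial ranking (assumed relevant)."""
    query_terms = query.lower().split()
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        for term in doc.lower().split():
            # Only count terms that would add something new to the query.
            if term not in query_terms and term not in stopwords:
                counts[term] += 1
    # Sort by descending count, then alphabetically, so ties are deterministic.
    candidates = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return query_terms + [term for term, _ in candidates[:n_terms]]


# Hypothetical toy ranking for the ambiguous query "jaguar".
docs = [
    "jaguar car speed engine",
    "jaguar car dealer engine",
    "jaguar cat habitat",
]
expanded = expand_query("jaguar", docs, top_k=2, n_terms=2)
```

Because only the two top-ranked documents are used as feedback, the expansion leans toward the automotive sense of the query, which also shows how the approach can drift when the initial ranking is off.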


Literature Retrieval and Mining in Bioinformatics: State of the Art and Challenges

Advances in Bioinformatics ◽

10.1155/2012/573846 ◽

2012 ◽

Vol 2012 ◽

pp. 1-10 ◽

Cited By ~ 11

Author(s):

Andrea Manconi ◽

Eloisa Vargiu ◽

Giuliano Armano ◽

Luciano Milanesi

Keyword(s):

Information Retrieval ◽

Web Search ◽

State Of The Art ◽

Information Access ◽

Daily Basis ◽

Huge Amount ◽

Dominant Form ◽

Web Search Engine ◽

The World ◽

E Mail

The world has changed greatly in terms of communicating, acquiring, and storing information. Hundreds of millions of people are involved in information retrieval tasks on a daily basis, in particular while using a Web search engine or searching their e-mail, making information retrieval the dominant form of information access, overtaking traditional database-style searching. How to handle this huge amount of information has now become a challenging issue. In this paper, after recalling the main topics concerning information retrieval, we present a survey on the main works on literature retrieval and mining in bioinformatics. While claiming that information retrieval approaches are useful in bioinformatics tasks, we discuss some challenges aimed at showing the effectiveness of these approaches applied therein.


Are Topics Interesting or Not? An LDA-based Topic-graph Probabilistic Model for Web Search Personalization

ACM Transactions on Information Systems ◽

10.1145/3476106 ◽

2022 ◽

Vol 40 (3) ◽

pp. 1-24

Author(s):

Jiashu Zhao ◽

Jimmy Xiangji Huang ◽

Hongbo Deng ◽

Yi Chang ◽

Long Xia

Keyword(s):

Probabilistic Model ◽

Large Scale ◽

Web Search ◽

Latent Dirichlet Allocation ◽

State Of The Art ◽

User Profile ◽

New Approach ◽

Latent Topic ◽

Search History ◽

Search Logs

In this article, we propose a Latent Dirichlet Allocation (LDA)-based topic-graph probabilistic personalization model for Web search. This model represents a user graph in a latent topic graph and simultaneously estimates the probabilities that the user is interested in the topics, as well as the probabilities that the user is not interested in the topics. For a given query issued by the user, the webpages that are more relevant to the interesting topics are promoted, and the webpages more relevant to the non-interesting topics are penalized. In particular, we simulate a user’s search intent by building two profiles: a positive user profile for the probabilities that the user is interested in the topics and a corresponding negative user profile for the probabilities that the user is not interested in the topics. The profiles are estimated based on the user’s search logs. A clicked webpage is assumed to include interesting topics. A skipped (viewed but not clicked) webpage is assumed to cover some topics that are not interesting to the user. Such estimations are performed in the latent topic space generated by LDA. Moreover, a new approach is proposed to estimate the correlation between a given query and the user’s search history so as to determine how much personalization should be considered for the query. We compare our proposed models with several strong baselines including state-of-the-art personalization approaches. Experiments conducted on a large-scale real user search log collection illustrate the effectiveness of the proposed models.
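The promote-and-penalize idea described in this abstract can be sketched in Python under strong simplifying assumptions. This is not the article's probabilistic model: it assumes topic distributions and both profiles are already available as plain dictionaries, and it combines them linearly. The names `personalized_score`, `rerank` and `weight` are hypothetical, with `weight` loosely standing in for how much personalization is applied to a query.

```python
def personalized_score(base_score, doc_topics, positive, negative, weight=0.5):
    """Add a bonus for topics in the positive profile and a penalty for
    topics in the negative profile, each scaled by the page's topic mass."""
    interest = sum(p * positive.get(t, 0.0) for t, p in doc_topics.items())
    disinterest = sum(p * negative.get(t, 0.0) for t, p in doc_topics.items())
    return base_score + weight * (interest - disinterest)


def rerank(results, positive, negative, weight=0.5):
    """results is a list of (doc_id, base_score, doc_topics) tuples.
    Returns the doc ids ordered by personalized score, best first."""
    scored = [
        (personalized_score(score, topics, positive, negative, weight), doc_id)
        for doc_id, score, topics in results
    ]
    return [doc_id for _, doc_id in sorted(scored, key=lambda pair: -pair[0])]


# Hypothetical toy data: the user tends to click cooking pages and skip sports pages.
results = [
    ("a", 1.0, {"sports": 1.0}),
    ("b", 0.9, {"cooking": 1.0}),
]
positive_profile = {"cooking": 0.8}
negative_profile = {"sports": 0.6}
order = rerank(results, positive_profile, negative_profile, weight=0.5)
```

Setting `weight` to 0 reproduces the unpersonalized ranking, which mirrors the abstract's point that the amount of personalization should depend on the query.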


Time-aware query suggestion diversification for temporally ambiguous queries

The Electronic Library ◽

10.1108/el-12-2019-0296 ◽

2020 ◽

Vol 38 (4) ◽

pp. 725-744

Author(s):

Xiaojuan Zhang ◽

Xixi Jiang ◽

Jiewen Qin

Keyword(s):

Digital Libraries ◽

Web Search ◽

State Of The Art ◽

Experimental Information ◽

Query Suggestion ◽

High Coverage ◽

Query Log ◽

Content Type ◽

Search Tasks ◽

Time Aware

Purpose: The purpose of this study is to generate diversified results for temporally ambiguous queries, ensuring that the candidate queries have a high coverage of subtopics derived from different temporal periods. Design/methodology/approach: Two novel time-aware query suggestion diversification models are developed by integrating the semantic and temporal information involved in queries into two state-of-the-art explicit diversification algorithms (i.e. IA-select and xQuaD), respectively, and then specifying the components on which these two models rely. Most importantly, the study first explores how to explicitly determine query subtopics for each unique query from the query log or clicked documents and then models the subtopics into query suggestion diversification. A discussion of how to mine the temporal intent behind a query from the query log follows. Finally, to verify the effectiveness of the proposal, experiments on a real-world query log are conducted. Findings: Preliminary experiments demonstrate that the proposed method can significantly outperform the existing state-of-the-art methods in terms of producing candidate query suggestions for temporally ambiguous queries. Originality/value: This study reports the first attempt to generate query suggestions indicating diverse time points of interest for temporally ambiguous (input) queries. The research will be useful in enhancing users’ search experience by helping them to formulate accurate queries for their search tasks. In addition, the approaches investigated in the paper are general enough to be used in many domains, such as experimental information retrieval systems, Web search engines, document archives and digital libraries.
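For readers unfamiliar with xQuaD, one of the two explicit diversification algorithms named in this abstract, the following Python sketch shows the standard greedy selection applied to candidate query suggestions. It does not include the time-aware components the study adds. The function name `xquad`, the toy probabilities and the use of years as subtopics are assumptions made for illustration.

```python
def xquad(candidates, relevance, subtopic_weight, coverage, k, lam=0.5):
    """Greedy xQuaD selection.

    relevance[c]        relevance of candidate c to the input query
    subtopic_weight[s]  importance of subtopic s for the input query
    coverage[c][s]      how well candidate c covers subtopic s
    lam                 trade-off between relevance (0) and diversity (1)
    """
    selected = []
    # not_covered[s]: probability that subtopic s is still unsatisfied
    # by the candidates selected so far.
    not_covered = {s: 1.0 for s in subtopic_weight}
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            diversity = sum(
                subtopic_weight[s] * coverage[c].get(s, 0.0) * not_covered[s]
                for s in subtopic_weight
            )
            return (1 - lam) * relevance[c] + lam * diversity

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        # Subtopics covered by the chosen candidate count for less afterwards.
        for s in subtopic_weight:
            not_covered[s] *= 1.0 - coverage[best].get(s, 0.0)
    return selected


# Hypothetical suggestions for a temporally ambiguous query; subtopics are years.
candidates = ["q1", "q2", "q3"]
relevance = {"q1": 0.9, "q2": 0.8, "q3": 0.5}
subtopic_weight = {"2008": 0.5, "2012": 0.5}
coverage = {"q1": {"2008": 1.0}, "q2": {"2008": 1.0}, "q3": {"2012": 1.0}}
picked = xquad(candidates, relevance, subtopic_weight, coverage, k=2, lam=0.5)
```

In this toy case the second pick is the less relevant `q3`, because `q1` has already covered the 2008 subtopic and `q2` adds nothing new; with `lam=0` the selection falls back to pure relevance order.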


Query Rewriting Using Monolingual Statistical Machine Translation

Computational Linguistics ◽

10.1162/coli_a_00010 ◽

2010 ◽

Vol 36 (3) ◽

pp. 569-582 ◽

Cited By ~ 23

Author(s):

Stefan Riezler ◽

Yi Liu

Keyword(s):

Machine Translation ◽

Query Expansion ◽

Web Search ◽

Query Language ◽

State Of The Art ◽

Statistical Machine Translation ◽

Target Language ◽

Translation Model ◽

User Query ◽

Query Logs

Long queries often suffer from low recall in Web search due to conjunctive term matching. The chances of matching words in relevant documents can be increased by rewriting query terms into new terms with similar statistical properties. We present a comparison of approaches that deploy user query logs to learn rewrites of query terms into terms from the document space. We show that the best results are achieved by adopting the perspective of bridging the “lexical chasm” between queries and documents by translating from a source language of user queries into a target language of Web documents. We train a state-of-the-art statistical machine translation model on query-snippet pairs from user query logs, and extract expansion terms from the query rewrites produced by the monolingual translation system. We show in an extrinsic evaluation in a real-world Web search task that the combination of a query-to-snippet translation model with a query language model achieves improved contextual query expansion compared to a state-of-the-art query expansion model that is trained on the same query log data.


Easy Web Search Results Clustering: When Baselines Can Reach State-of-the-Art Algorithms

10.3115/v1/e14-4001 ◽

2014 ◽

Cited By ~ 2

Author(s):

Jose G. Moreno ◽

Gaël Dias

Keyword(s):

Web Search ◽

State Of The Art ◽

Search Results ◽

Search Results Clustering


Unsupervised learning of semantic representation for documents with the law of total probability

Natural Language Engineering ◽

10.1017/s1351324917000420 ◽

2017 ◽

Vol 24 (4) ◽

pp. 491-522 ◽

Cited By ~ 2

Author(s):

YANG WEI ◽

JINMAO WEI ◽

ZHENGLU YANG

Keyword(s):

Unsupervised Learning ◽

Text Analysis ◽

Web Search ◽

Semantic Information ◽

State Of The Art ◽

Semantic Representation ◽

Document Clustering ◽

Total Probability ◽

Document Summarization ◽

The Law

The semantic information of documents needs to be represented because it is the basis for many applications, such as document summarization, web search, and text analysis. Although many studies have explored this problem by enriching document vectors with the relatedness of the words involved, the performance remains far from satisfactory because the physical boundaries of documents hinder the evaluation of the relatedness between words. To address this problem, we propose an effective approach to further infer the implicit relatedness between words via their common related words. To avoid overestimation of the implicit relatedness, we restrict the inference in terms of the marginal probabilities of the words based on the law of total probability. The proposed method measures the relatedness between words, which is confirmed theoretically and experimentally. Thorough evaluation on real datasets illustrates that significant improvement on document clustering has been achieved with the proposed method compared with state-of-the-art methods.


Ranking Web Search Results Exploiting Wikipedia

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213016500184 ◽

2016 ◽

Vol 25 (03) ◽

pp. 1650018 ◽

Cited By ~ 1

Author(s):

Andreas Kanavos ◽

Christos Makris ◽

Yannis Plegas ◽

Evangelos Theodoridis

Keyword(s):

Web Search ◽

State Of The Art ◽

Web Pages ◽

Ranking Methods ◽

Data Mining Technique ◽

Topic Identification ◽

Probabilistic Network ◽

Current State ◽

The Web ◽

Selection Of

It is widely known that search engines are the dominant tools for finding information on the web. In most cases, these engines return web page references in a global ranking that takes into account either the importance of the web site or the relevance of the web pages to the identified topic. In this paper, we focus on the problem of determining distinct thematic groups in the results that existing web search engines provide. We additionally address the problem of dynamically adapting their ranking according to user selections, incorporating user judgments as implicitly registered in their selection of relevant documents. Our system exploits a state-of-the-art semantic web data mining technique that identifies semantic entities of Wikipedia for grouping the result set into different topic groups, according to the various meanings of the provided query. Moreover, we propose a novel probabilistic network scheme that employs the aforementioned topic identification method in order to modify the ranking of results as users select documents. We evaluated our implemented prototype in extensive experiments on the ClueWeb09 dataset using the TREC 2009, 2010, 2011 and 2012 Web Tracks, where we observed improved retrieval performance compared to current state-of-the-art re-ranking methods.
