Clarifying Ambiguous Keywords with Personal Word Embeddings for Personalized Search

2022 ◽  
Vol 40 (3) ◽  
pp. 1-29
Author(s):  
Jing Yao ◽  
Zhicheng Dou ◽  
Ji-Rong Wen

Personalized search tailors the document ranking list for each individual user based on her interests and query intent, so as to better satisfy the user's information need. Many personalized search models have been proposed. They first build a user interest profile from the user's search history and then re-rank documents by the personalized matching scores between the created profile and candidate documents. In this article, we approach the personalized search problem from an alternative perspective: clarifying the user's intention behind the current query. Natural language contains many ambiguous words, such as "Apple," and people with different knowledge backgrounds and interests understand these words in personalized ways. We therefore propose a personalized search model with personal word embeddings for each individual user, which mainly capture the word meanings the user already knows and thus reflect the user's interests. To learn high-quality personal word embeddings, we design a pre-training model that captures both the textual information in the query log and the user-interest information contained in the click-through data, represented as a graph structure. With personal word embeddings, we obtain personalized, context-aware representations of the query and documents. Furthermore, we employ the current session as short-term search context to dynamically disambiguate the current query. Finally, we use a matching model to calculate the matching score between the personalized query and document representations for ranking. Experimental results on two large-scale query logs show that our model significantly outperforms state-of-the-art personalization models.
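
The final ranking step described above can be sketched in a few lines: score each candidate document by the similarity between query and document representations built from a user's personal word embeddings. The toy two-dimensional embeddings, helper names, and mean-pooling choice below are illustrative assumptions, not the paper's architecture.

```python
import math

def embed(words, personal_emb):
    """Average the user's personal embeddings of the given words."""
    vecs = [personal_emb[w] for w in words if w in personal_emb]
    if not vecs:
        return []
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs, personal_emb):
    """Re-rank candidate documents by personalized matching score."""
    q = embed(query.split(), personal_emb)
    scored = [(cosine(q, embed(d.split(), personal_emb)), d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]

# Toy personal embeddings for a user interested in technology: for this user
# "apple" sits near "iphone", so the tech document ranks first.
emb = {"apple": [0.9, 0.1], "iphone": [0.8, 0.2],
       "fruit": [0.1, 0.9], "pie": [0.2, 0.8]}
print(rank("apple", ["iphone release", "fruit pie recipe"], emb))
```

A user with fruit-oriented embeddings would get the opposite ordering for the same query, which is the disambiguation effect the abstract describes.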

2020 ◽  
Vol 13 (2) ◽  
pp. 240-247 ◽  
Author(s):  
Bilal Hawashin ◽  
Darah Aqel ◽  
Shadi Alzubi ◽  
Mohammad Elbes

Background: Recommender systems use user interests to provide recommendations that better match users' actual interests and behavior. Methods: This work aims to improve recommender systems by discovering hidden user interests from the existing ones. Interest expansion contributes to the accuracy of recommender systems by inferring additional user interests from the given ones. Two expansion methods are proposed: expanding interests with a correlated-interests extractor, and expanding interests using word embeddings. Results: Experiments show that such expansion is effective in terms of both accuracy and execution time. Conclusion: Expanding user interests therefore proves a promising step toward improving recommender system performance.
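
The second expansion method above can be sketched as a nearest-neighbour lookup in an embedding space: any vocabulary word whose vector is close enough to a known interest is added. The toy vectors and the similarity threshold are illustrative assumptions, not the paper's values.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def expand_interests(interests, vocab_emb, threshold=0.8):
    """Add vocabulary words whose embedding is close to any known interest."""
    expanded = set(interests)
    for word, vec in vocab_emb.items():
        if word in expanded:
            continue
        if any(cosine(vec, vocab_emb[i]) >= threshold for i in interests):
            expanded.add(word)
    return expanded

# "soccer" is near "football" in this toy space, so it is discovered;
# "cooking" is orthogonal and stays out.
vocab = {"football": [1.0, 0.0], "soccer": [0.95, 0.1], "cooking": [0.0, 1.0]}
print(expand_interests(["football"], vocab))
```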


2021 ◽  
Vol 55 (1) ◽  
pp. 1-2
Author(s):  
Bhaskar Mitra

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, differ from those in these other application areas. A common form of IR involves ranking documents---or short passages---in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling the relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms---such as a person's name or a product model number---not seen during training, and should avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, retrieval involves extremely large collections---such as the document index of a commercial Web search engine---containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve efficiently from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions in a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018].
Our key contribution towards improving the effectiveness of deep ranking models is the Duet principle [Mitra et al., 2017], which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of the query and document. To retrieve efficiently from large collections, we develop a framework that incorporates query term independence [Mitra et al., 2019] into arbitrary deep models, enabling large-scale precomputation and the use of an inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
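
The core idea of the Duet principle can be sketched as a linear combination of an exact-term-match signal and a similarity over learned latent representations. The mixing weight, the toy latent vectors, and the specific scoring functions below are illustrative assumptions; the actual Duet model learns both sub-networks jointly.

```python
def exact_match_score(query_terms, doc_terms):
    """Fraction of query terms that appear verbatim in the document."""
    doc = set(doc_terms)
    return sum(t in doc for t in query_terms) / len(query_terms)

def latent_score(q_vec, d_vec):
    """Dot product between latent query and document representations."""
    return sum(x * y for x, y in zip(q_vec, d_vec))

def duet_score(query_terms, doc_terms, q_vec, d_vec, alpha=0.5):
    """Blend the lexical and latent evidence with a fixed weight alpha."""
    return (alpha * exact_match_score(query_terms, doc_terms)
            + (1 - alpha) * latent_score(q_vec, d_vec))

# A rare term ("xps-9310") can only be credited by the lexical half, while
# the latent half rewards semantic similarity.
print(duet_score(["dell", "xps-9310"], ["dell", "laptop"], [1.0, 0.0], [0.9, 0.1]))
```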


2021 ◽  
Vol 15 (3) ◽  
pp. 1-31
Author(s):  
Haida Zhang ◽  
Zengfeng Huang ◽  
Xuemin Lin ◽  
Zhe Lin ◽  
Wenjie Zhang ◽  
...  

Driven by many real applications, we study the problem of seeded graph matching. Given two graphs G1 and G2 and a small set S of pre-matched node pairs (u, v) with u in G1 and v in G2, the problem is to identify a matching between G1 and G2, grown from S, such that each pair in the matching corresponds to the same underlying entity. Recent studies on efficient and effective seeded graph matching have drawn a great deal of attention, and many popular methods are largely based on exploiting the similarity between local structures to identify matching pairs. While these recent techniques work provably well on random graphs, their accuracy is low on many real networks. In this work, we propose to utilize higher-order neighboring information to improve matching accuracy and efficiency. We thus propose a new seeded graph matching framework that employs Personalized PageRank (PPR) to quantify the matching score of each node pair. To further boost matching accuracy, we propose a novel postponing strategy, which delays the selection of pairs that have competitors with similar matching scores; we show that this postponing strategy significantly improves matching accuracy. To scale matching to large graphs, we also propose efficient approximation techniques based on algorithms for computing PPR heavy hitters. Our comprehensive experimental studies on large-scale real datasets demonstrate that, compared with state-of-the-art approaches, our framework not only increases both precision and recall by a significant margin but also achieves speed-ups of more than an order of magnitude.
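
The PPR-based scoring idea can be sketched as follows: compute a seed-restarted PPR vector in each graph by power iteration, then score a cross-graph pair by how similar the PPR mass it receives is in both graphs. The adjacency-list graphs, the absolute-difference pair score, and all parameter values are illustrative assumptions, not the paper's exact formulation.

```python
def ppr(adj, seed, alpha=0.15, iters=50):
    """Personalized PageRank restarting at `seed`, via power iteration."""
    nodes = list(adj)
    p = {v: 0.0 for v in nodes}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: (alpha if v == seed else 0.0) for v in nodes}
        for u in nodes:
            if not adj[u]:
                continue
            share = (1 - alpha) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p

def matching_score(ppr1, ppr2, u, v):
    """Toy pair score: seeds should assign similar PPR mass to a true match."""
    return -abs(ppr1[u] - ppr2[v])

# Two isomorphic toy graphs with seed pair ("a", "x"):
g1 = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
g2 = {"x": ["y", "z"], "y": ["x"], "z": ["x"]}
p1, p2 = ppr(g1, "a"), ppr(g2, "x")
# "b" and "y" play the same structural role relative to the seeds, so the
# correct pair ("b", "y") outscores the wrong pair ("b", "x"):
print(matching_score(p1, p2, "b", "y") > matching_score(p1, p2, "b", "x"))
```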


2017 ◽  
Vol 45 (3) ◽  
pp. 130-138 ◽  
Author(s):  
Basit Shahzad ◽  
Ikramullah Lali ◽  
M. Saqib Nawaz ◽  
Waqar Aslam ◽  
Raza Mustafa ◽  
...  

Purpose: Twitter users’ generated data, known as tweets, are now used not only for communication and opinion sharing but are also considered an important source for trendsetting, future prediction, recommendation systems and marketing. Using network features in tweet modeling and applying data mining and deep learning techniques to tweets are gaining more and more interest. Design/methodology/approach: In this paper, user interests are discovered from Twitter Trends using a modeling approach based on network-based text data (tweets). First, the popular trends are collected and stored in separate documents. These data are then pre-processed and labeled into their respective categories. The data are then modeled, and the user interest for each trending topic is calculated from the positive tweets in that trend together with the average retweet and favorite counts. Findings: The proposed approach can be used to infer users’ topics of interest on Twitter and to categorize them. A support vector machine can be used for training and validation, and positive tweets can be further analyzed to find user posting patterns. There is a positive correlation between Twitter and Google data. Practical implications: The results can be used in the development of information filtering and prediction systems, especially personalized recommendation systems. Social implications: The Twitter microblogging platform offers content posting and sharing to billions of internet users worldwide, so this work has significant socioeconomic impacts. Originality/value: This study shows how Twitter network structure features can be exploited to discover user interests from tweets. Further, a positive correlation of Twitter Trends with Google Trends is reported, which validates the correctness of the authors’ approach.
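
The per-trend interest score described above can be sketched as a combination of the share of positive tweets with the average retweet and favourite counts. The specific weighting below is a hypothetical choice for illustration, not the paper's formula.

```python
def interest_score(tweets):
    """tweets: list of dicts with 'sentiment' (+1/-1), 'retweets', 'favorites'."""
    if not tweets:
        return 0.0
    positive = sum(1 for t in tweets if t["sentiment"] > 0) / len(tweets)
    avg_rt = sum(t["retweets"] for t in tweets) / len(tweets)
    avg_fav = sum(t["favorites"] for t in tweets) / len(tweets)
    # Scale engagement by the positive-sentiment share of the trend.
    return positive * (1 + avg_rt + avg_fav)

# A toy trend with one positive, well-shared tweet and one negative tweet:
trend = [
    {"sentiment": 1, "retweets": 4, "favorites": 6},
    {"sentiment": -1, "retweets": 0, "favorites": 0},
]
print(interest_score(trend))  # 0.5 * (1 + 2 + 3) = 3.0
```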


Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Tao Chen ◽  
Mingfen Wu ◽  
Hexi Li

The automatic extraction of meaningful relations from biomedical literature or clinical records is crucial in various biomedical applications. Most current deep learning approaches for medical relation extraction require large-scale training data to prevent overfitting of the training model. We propose using a pre-trained model and a fine-tuning technique to improve these approaches without additional time-consuming human labeling. Firstly, we describe the architecture of Bidirectional Encoder Representations from Transformers (BERT), an approach for pre-training a model on large-scale unstructured text. We then combine BERT with a one-dimensional convolutional neural network (1d-CNN) to fine-tune the pre-trained model for relation extraction. Extensive experiments on three datasets, namely the BioCreative V chemical disease relation corpus, the traditional Chinese medicine literature corpus and the i2b2 2012 temporal relation challenge corpus, show that the proposed approach achieves state-of-the-art results (relative improvements of 22.2%, 7.77%, and 38.5% in F1 score, respectively, compared with a traditional 1d-CNN classifier). The source code is available at https://github.com/chentao1999/MedicalRelationExtraction.
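
The 1d-CNN head can be illustrated in miniature: slide a filter over (assumed precomputed) BERT token embeddings and max-pool the responses into a feature for the relation classifier. The dimensions and weights below are toy values; in practice this layer sits on top of the BERT encoder and is trained end to end.

```python
def conv1d_maxpool(token_vecs, filt):
    """Slide `filt` (width x dim) over the token vectors; max-pool over time."""
    width, dim = len(filt), len(filt[0])
    responses = []
    for i in range(len(token_vecs) - width + 1):
        window = token_vecs[i:i + width]
        responses.append(sum(window[j][d] * filt[j][d]
                             for j in range(width) for d in range(dim)))
    return max(responses)

# Three 2-dimensional "BERT" token embeddings and one filter of width 2:
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filt = [[1.0, 0.0], [0.0, 1.0]]
print(conv1d_maxpool(tokens, filt))  # → 2.0 (strongest window response)
```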


2013 ◽  
Vol 303-306 ◽  
pp. 1420-1425
Author(s):  
Qiang Pu ◽  
Ahmed Lbath ◽  
Da Qing He

Mobile personalized web search has been introduced to distinguish the different personal search interests of mobile users. We first take the user's location information into account to perform a geographic query expansion, and then present an approach to personalizing web search for mobile users within a language modeling framework. We estimate a mixed user model from both activated ontological topic-model-based feedback and a user interest model, and use it to re-rank the results of the geographic query expansion. Experiments show that the language-model-based re-ranking method is effective in placing more relevant documents among the top retrieved results for mobile users. The main improvement comes from considering geographic information, ontological topic information, and user interests together to find more relevant documents that satisfy users' personal information needs.
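
Re-ranking in a language modeling framework can be sketched as scoring a query by its log-likelihood under an interpolated model that mixes document, topic-feedback, and user-interest language models. The mixture weights, smoothing constant, and toy unigram models below are illustrative assumptions.

```python
import math

def mixed_prob(word, doc_lm, topic_lm, user_lm, lambdas=(0.6, 0.2, 0.2)):
    """Interpolate three unigram models; 1e-6 is a crude smoothing floor."""
    ld, lt, lu = lambdas
    return (ld * doc_lm.get(word, 1e-6)
            + lt * topic_lm.get(word, 1e-6)
            + lu * user_lm.get(word, 1e-6))

def score(query_words, doc_lm, topic_lm, user_lm):
    """Query log-likelihood under the mixed model (higher is better)."""
    return sum(math.log(mixed_prob(w, doc_lm, topic_lm, user_lm))
               for w in query_words)

# Toy unigram models: the document, the activated topic feedback, and the
# user interest model each contribute probability mass to "restaurant".
doc = {"restaurant": 0.3, "paris": 0.2}
topic = {"food": 0.5, "restaurant": 0.2}
user = {"vegetarian": 0.4, "restaurant": 0.1}
print(score(["restaurant", "paris"], doc, topic, user))
```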


Entropy ◽  
2020 ◽  
Vol 22 (10) ◽  
pp. 1168
Author(s):  
Min Zhang ◽  
Guohua Geng ◽  
Sheng Zeng ◽  
Huaping Jia

Knowledge graph completion can make knowledge graphs more complete, which is a meaningful research topic. However, the existing methods do not make full use of entity semantic information. Another challenge is that a deep model requires large-scale manually labelled data, which greatly increases manual labour. In order to alleviate the scarcity of labelled data in the field of cultural relics and capture the rich semantic information of entities, this paper proposes a model based on Bidirectional Encoder Representations from Transformers (BERT) with entity-type information for the knowledge graph completion of the Chinese texts of cultural relics. In this work, the knowledge graph completion task is treated as a classification task: the entities, relations and entity-type information are integrated as a textual sequence, and Chinese characters are used as the token unit, with the input representation constructed by summing token, segment and position embeddings. A small amount of labelled data is used to pre-train the model, and a large amount of unlabelled data is then used to fine-tune the pre-trained model. The experiment results show that the BERT-KGC model with entity-type information can enrich the semantic information of the entities, reduce the degree of ambiguity of the entities and relations to some degree, and achieve more effective performance than the baselines in triple classification, link prediction and relation prediction tasks using 35% of the labelled data of cultural relics.
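
The input construction described above can be sketched as serializing a triple plus entity-type information into one textual sequence for a BERT-style classifier. The marker strings follow the usual BERT conventions, but the exact template and the sample triple (the Sword of Goujian, unearthed in Hubei) are assumptions for illustration.

```python
def triple_to_sequence(head, relation, tail, head_type, tail_type):
    """Serialize (head, relation, tail) and entity types into one sequence."""
    parts = [head, head_type, relation, tail, tail_type]
    return "[CLS] " + " [SEP] ".join(parts) + " [SEP]"

# Hypothetical cultural-relic triple: entity types are 文物 (cultural relic)
# and 地点 (location); the relation is 出土于 (unearthed in).
seq = triple_to_sequence("越王勾践剑", "出土于", "湖北省", "文物", "地点")
print(seq)
```

In the full model, each Chinese character of this sequence becomes one token whose representation is the sum of its token, segment, and position embeddings.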


2020 ◽  
pp. 1-51
Author(s):  
Ivan Vulić ◽  
Simon Baker ◽  
Edoardo Maria Ponti ◽  
Ulla Petti ◽  
Ira Leviant ◽  
...  

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions—the public release of the Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning—available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
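
Benchmarks like Multi-SimLex are typically scored by the Spearman correlation between model similarities and human similarity ratings over the concept pairs. A dependency-free sketch (using the untied-rank formula, no tie handling) with toy scores:

```python
def rankdata(xs):
    """1-based ranks of the values in xs (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return ranks

def spearman(a, b):
    """Spearman's rho for untied data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

human = [9.1, 7.4, 2.0, 0.5]   # toy gold similarity ratings for 4 word pairs
model = [0.8, 0.9, 0.3, 0.1]   # toy model cosine similarities
print(spearman(human, model))  # → 0.8
```

Real evaluations use a tie-aware implementation (e.g., scipy.stats.spearmanr) over the full 1,888 pairs per language.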


2013 ◽  
Vol 380-384 ◽  
pp. 1959-1962
Author(s):  
Dong Liu ◽  
Quan Yuan Wu

Nowadays, more and more people use microblogs to share information; consequently, mining the behavior features of microblog users is very valuable. In this paper, we propose a user interest mining framework. After data pre-processing, the vector space model (VSM) is used to generate the feature vectors of the tweet sets. Then, k-bit binaries, called the interest hash-value and the continuous interest hash-value, are generated using the Simhash algorithm. User interests and their change patterns can be mined by analyzing the Hamming distance sequences between adjacent hash-values. Using the Sina microblog as the experimental platform, a series of experiments demonstrates the effectiveness of the algorithms.
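
The Simhash fingerprinting step above can be sketched as follows: hash each feature into k bits, accumulate signed per-bit counts, keep the sign as the fingerprint, and compare fingerprints by Hamming distance. The use of an md5-based hash and k = 16 are illustrative choices, not the paper's configuration.

```python
import hashlib

def simhash(features, k=16):
    """k-bit Simhash fingerprint of a feature set."""
    counts = [0] * k
    for f in features:
        h = int(hashlib.md5(f.encode()).hexdigest(), 16)
        for i in range(k):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(k) if counts[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Fingerprints of a user's interests on two adjacent days: a small Hamming
# distance in the sequence indicates stable interests, a jump signals change.
day1 = simhash(["football", "music", "travel"])
day2 = simhash(["football", "music", "cooking"])
print(hamming(day1, day2))
```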

