Screening for Pancreatic Adenocarcinoma Using Signals From Web Search Logs: Feasibility Study and Results

Introduction: People’s online activities can yield clues about their emerging health conditions. We performed an intensive study to explore the feasibility of using anonymized Web query logs to screen for the emergence of pancreatic adenocarcinoma. The methods used statistical analyses of large-scale anonymized search logs considering the symptom queries from millions of people, with the potential application of warning individual searchers about the value of seeking attention from health care professionals. Methods: We identified searchers in logs of online search activity who issued special queries that are suggestive of a recent diagnosis of pancreatic adenocarcinoma. We then went back many months before these landmark queries were made, to examine patterns of symptoms, which were expressed as searches about concerning symptoms. We built statistical classifiers that predicted the future appearance of the landmark queries based on patterns of signals seen in search logs. Results: We found that signals about patterns of queries in search logs can predict the future appearance of queries that are highly suggestive of a diagnosis of pancreatic adenocarcinoma. We showed specifically that we can identify 5% to 15% of cases, while preserving extremely low false-positive rates (0.00001 to 0.0001). Conclusion: Signals in search logs show the possibilities of predicting a forthcoming diagnosis of pancreatic adenocarcinoma from combinations of subtle temporal signals revealed in the queries of searchers.
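The operating points quoted in the Results (a share of cases identified at a capped false-positive rate) can be illustrated with a minimal sketch. It assumes a classifier has already produced a score per searcher. The function name `recall_at_fpr` and its inputs are hypothetical and are not taken from the study.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Return (threshold, recall, fpr) for the lowest score threshold
    whose false-positive rate stays at or below max_fpr.

    scores: classifier scores, higher means more suspicious (assumed).
    labels: 1 for searchers who later issued a landmark query, else 0.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    best = (float("inf"), 0.0, 0.0)  # default: flag nobody
    # Lowering the threshold can only add flagged searchers, so the
    # false-positive rate never decreases and we can stop at the first breach.
    for t in sorted(set(scores), reverse=True):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fpr = fp / negatives
        if fpr > max_fpr:
            break
        best = (t, tp / positives, fpr)
    return best
```

At the very low false-positive rates reported in the abstract, a held-out set would need on the order of hundreds of thousands of negative examples or more for the estimate to be meaningful.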

A Utility-Theoretic Approach to Privacy in Online Services

Journal of Artificial Intelligence Research, 10.1613/jair.3089, 2010, Vol 39, pp. 633-662, Cited By ~ 22

Author(s): A. Krause, E. Horvitz

Keyword(s): Large Scale, Web Search, Personal Information, Personal Data, Theoretic Approach, Search Activity, Efficient Manner, Special Knowledge, Limit Access

Online offerings such as web search, news portals, and e-commerce applications face the challenge of providing high-quality service to a large, heterogeneous user base. Recent efforts have highlighted the potential to improve performance by introducing methods to personalize services based on special knowledge about users and their context. For example, a user's demographics, location, and past search and browsing may be useful in enhancing the results offered in response to web search queries. However, reasonable concerns about privacy held by users, providers, and government agencies acting on behalf of citizens may limit access by services to such information. We introduce and explore an economics of privacy in personalization, where people can opt to share personal information, in a standing or on-demand manner, in return for expected enhancements in the quality of an online service. We focus on the example of web search and formulate realistic objective functions for search efficacy and privacy. We demonstrate how we can find a provably near-optimal optimization of the utility-privacy tradeoff in an efficient manner. We evaluate our methodology on data drawn from a log of the search activity of volunteer participants. We separately assess users' preferences about privacy and utility via a large-scale survey, aimed at eliciting preferences about people's willingness to trade the sharing of personal data in return for gains in search efficiency. We show that a significant level of personalization can be achieved using a relatively small amount of information about users.

Processing and Analysis of Search Query Logs in Chinese

Handbook of Research on Web Log Analysis, 10.4018/978-1-59904-974-8.ch019, 2011, pp. 378-388, Cited By ~ 1

Author(s): Michael Chau, Yan Lu, Xiao Fang, Christopher C. Yang

Keyword(s): World Wide, Web Search, Searching Behavior, Web Searching, Search Queries, Web Search Engine, The World, Query Logs, Search Logs, The Web

More non-English content is now available on the World Wide Web, and the number of non-English users on the Web is increasing. While it is important to understand the Web searching behavior of these non-English users, many previous studies on Web query logs have focused on analyzing English search logs, and their results may not be directly applicable to other languages. In this chapter we discuss some methods and techniques that can be used to analyze search queries in Chinese. We also show an example of applying our methods to a Chinese Web search engine and report some interesting findings.

Are Topics Interesting or Not? An LDA-based Topic-graph Probabilistic Model for Web Search Personalization

ACM Transactions on Information Systems, 10.1145/3476106, 2022, Vol 40 (3), pp. 1-24

Author(s): Jiashu Zhao, Jimmy Xiangji Huang, Hongbo Deng, Yi Chang, Long Xia

Keyword(s): Probabilistic Model, Large Scale, Web Search, Latent Dirichlet Allocation, State Of The Art, User Profile, New Approach, Latent Topic, Search History, Search Logs

In this article, we propose a Latent Dirichlet Allocation (LDA)-based topic-graph probabilistic personalization model for Web search. This model represents a user graph in a latent topic graph and simultaneously estimates the probabilities that the user is interested in the topics, as well as the probabilities that the user is not interested in the topics. For a given query issued by the user, the webpages that are more relevant to the interesting topics are promoted, and the webpages that are more relevant to the non-interesting topics are penalized. In particular, we simulate a user’s search intent by building two profiles: a positive user profile for the probabilities that the user is interested in the topics and a corresponding negative user profile for the probabilities that the user is not interested in the topics. The profiles are estimated based on the user’s search logs. A clicked webpage is assumed to include interesting topics. A skipped (viewed but not clicked) webpage is assumed to cover some topics that are not interesting to the user. Such estimations are performed in the latent topic space generated by LDA. Moreover, a new approach is proposed to estimate the correlation between a given query and the user’s search history so as to determine how much personalization should be considered for the query. We compare our proposed models with several strong baselines including state-of-the-art personalization approaches. Experiments conducted on a large-scale real user search log collection illustrate the effectiveness of the proposed models.
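The promote-and-penalize idea can be sketched as a simple re-scoring step. This is only an illustration under stated assumptions and is not the article's probabilistic model: the linear combination, the function name `personalized_score`, and the `weight` parameter (standing in for the query-history correlation) are all hypothetical.

```python
def personalized_score(base_score, doc_topics, interested, not_interested, weight):
    """Adjust a page's base ranking score with two user profiles.

    doc_topics:     topic distribution of the page (e.g. from LDA), assumed given.
    interested:     per-topic probability that the user is interested.
    not_interested: per-topic probability that the user is not interested.
    weight:         how much personalization to apply to this query
                    (hypothetical stand-in for the query-history correlation).
    """
    # Overlap of the page with topics the user likes pushes the score up.
    promote = sum(p * d for p, d in zip(interested, doc_topics))
    # Overlap with topics the user skips pulls the score down.
    penalize = sum(p * d for p, d in zip(not_interested, doc_topics))
    return base_score + weight * (promote - penalize)
```

With `weight` set to zero the function returns the unpersonalized score, which mirrors the idea that some queries should receive little or no personalization.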

AOL4PS: A Large-Scale Dataset for Personalized Search

Data Intelligence, 10.1162/dint_a_00104, 2021, pp. 1-17

Author(s): Qian Guo, Wei Chen, Huaiyu Wan

Keyword(s): Data Processing, Large Scale, Web Search, Personalized Search, Search Methods, Search Models, Large Scale Dataset, Query Logs, Commercial Search Engine, Public Datasets

Personalized search is a promising way to improve the quality of web search, and it has attracted much attention from both academic and industrial communities. Much of the current related research is based on commercial search engine data, which cannot be released publicly for reasons such as privacy protection and information security. This leads to a serious lack of accessible public datasets in this field. The few available datasets, though released to the public, have not become widely used in academia because they are complex to process. The lack of datasets, together with the difficulties of data processing, has hindered fair comparison and evaluation of personalized search models. In this paper, we construct AOL4PS, a large-scale dataset for evaluating personalized search methods, collected and processed from AOL query logs. We present the complete and detailed data processing and construction process. Specifically, to address the challenges of processing time and storage space demands brought by massive data volumes, we optimized the process of dataset construction and proposed an improved BM25 algorithm. Experiments are performed on AOL4PS with some classic and state-of-the-art personalized search methods, and the experimental results demonstrate that AOL4PS can measure the effect of personalized search models. AOL4PS is publicly available at http://github.com/wanhuaiyu/AOL4PS.
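For orientation, the textbook Okapi BM25 scoring function is sketched below. This is the standard formulation with a non-negative IDF variant, not the improved algorithm the paper proposes, whose details are not given in the abstract. The function name `bm25_scores` and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25.

    query: list of query terms.
    docs:  non-empty list of documents, each a list of tokens.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents need more matches.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that share no term with the query score exactly zero, so they can be skipped entirely when candidates are drawn from an inverted index.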

Improving Entity Recommendation with Search Log and Multi-Task Learning

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 10.24963/ijcai.2018/571, 2018, Cited By ~ 5

Author(s): Jizhou Huang, Wei Zhang, Yaming Sun, Haifeng Wang, Ting Liu

Keyword(s): Search Engine, Large Scale, Web Search, Context Information, Context Aware, Time Step, Current Time, Task Learning, Web Search Engine, Search Logs

Entity recommendation, providing search users with an improved experience by assisting them in finding related entities for a given query, has become an indispensable feature of today's Web search engines. Existing studies typically only consider the query issued at the current time step while ignoring the in-session preceding queries. Thus, they typically fail to handle ambiguous queries such as "apple" because the model cannot tell which apple (company or fruit) is meant. In this work, we believe that the in-session contexts convey valuable evidence that can facilitate the semantic modeling of queries, and we take that into consideration for entity recommendation. Furthermore, in order to better model the semantics of queries, we learn the model in a multi-task learning setting where the query representation is shared across entity recommendation and context-aware ranking. We evaluate our approach using large-scale, real-world search logs of a widely used commercial Web search engine. The experimental results show that incorporating context information significantly improves entity recommendation, and that learning the model in a multi-task learning setting can bring further improvements.

Neural methods for effective, efficient, and exposure-aware information retrieval

ACM SIGIR Forum, 10.1145/3476415.3476434, 2021, Vol 55 (1), pp. 1-2

Author(s): Bhaskar Mitra

Keyword(s): Information Retrieval, Language Processing, Large Scale, Web Search, Real Life, Inverted Index, Information Need, Product Model, Performance Improvements, Deep Model

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents (or short passages) in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms not seen during training, such as a person's name or a product model number, and should avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, the retrieval involves extremely large collections, such as the document index of a commercial Web search engine, containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to efficiently retrieve from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018].
Our key contribution towards improving the effectiveness of deep ranking models is developing the Duet principle [Mitra et al., 2017], which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework to incorporate query term independence [Mitra et al., 2019] into any arbitrary deep model, which enables large-scale precomputation and the use of an inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
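The link between query term independence and inverted-index retrieval can be sketched as follows. This is a minimal illustration assuming that a document's score for a query is the sum of per-term scores computed offline. The helper names `build_index` and `retrieve` are hypothetical, `term_score` stands in for whatever model produces the per-term scores, and scoring only terms that occur in the document is a simplification made here.

```python
from collections import defaultdict


def build_index(docs, term_score):
    """Precompute a score for every (term, document) pair that co-occurs.

    docs:       mapping of document id to a list of tokens.
    term_score: callable (term, tokens) -> float; a placeholder for a
                learned model evaluated offline.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term in set(tokens):
            index[term][doc_id] = term_score(term, tokens)
    return index


def retrieve(query, index, top_k=10):
    """Rank documents by summing the precomputed scores of the query terms."""
    totals = defaultdict(float)
    for term in query:
        # Only the postings of the query terms are touched at query time.
        for doc_id, score in index.get(term, {}).items():
            totals[doc_id] += score
    # Sort by descending score, breaking ties by document id.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Because each term contributes independently, the expensive scoring happens once at indexing time and a query costs only a few postings-list lookups and additions.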

The Matter of Chance: Auditing Web Search Results Related to the 2020 U.S. Presidential Primary Elections Across Six Search Engines

Social Science Computer Review, 10.1177/08944393211006863, 2021, pp. 089443932110068

Author(s): Aleksandra Urman, Mykola Makhortykh, Roberto Ulloa

Keyword(s): Search Engine, Search Engines, Large Scale, Web Search, Primary Elections, Virtual Agents, Search Results, Presidential Primary, Large Scale Analysis, Algorithmic Information

We examine how six search engines filter and rank information in relation to queries on the 2020 U.S. presidential primary elections under default (that is, nonpersonalized) conditions. For that, we utilize an algorithmic auditing methodology that uses virtual agents to conduct large-scale analysis of algorithmic information curation in a controlled environment. Specifically, we look at the text search results for the queries “us elections,” “donald trump,” “joe biden,” and “bernie sanders” on Google, Baidu, Bing, DuckDuckGo, Yahoo, and Yandex during the 2020 primaries. Our findings indicate substantial differences in the search results between search engines and multiple discrepancies within the results generated for different agents using the same search engine. This highlights that whether users see certain information is decided by chance due to the inherent randomization of search results. We also find that some search engines prioritize different categories of information sources with respect to specific candidates. These observations demonstrate that algorithmic curation of political information can create information inequalities between search engine users even under nonpersonalized conditions. Such inequalities are particularly troubling considering that search results are highly trusted by the public and can shift the opinions of undecided voters, as demonstrated by previous research.

Identifying comparable entities with indirectly associative relations and word embeddings from web search logs

Decision Support Systems, 10.1016/j.dss.2020.113465, 2020, pp. 113465

Author(s): Liye Wang, Jin Zhang, Guoqing Chen, Dandan Qiao

Keyword(s): Web Search, Word Embeddings, Search Logs

Part-of-speech tagging for web search queries using a large-scale web corpus

Proceedings of the Symposium on Applied Computing - SAC '17, 10.1145/3019612.3019694, 2017, Cited By ~ 1

Author(s): Atsushi Keyaki, Jun Miyazaki

Keyword(s): Large Scale, Web Search, Search Queries, Part Of Speech Tagging, Part Of Speech, Speech Tagging

Using Participatory Spatial Tools to Unravel Community Perceptions of Land-Use Dynamics in a Mine-Expanding Landscape in Ghana

Environmental Management, 10.1007/s00267-021-01494-7, 2021

Author(s): Jane J. Aggrey, Mirjam A. F. Ros-Tonen, Kwabena O. Asubonteng

Keyword(s): Land Use, Adverse Effects, Large Scale, Sub Saharan Africa, Small Scale, Eastern Region, Food Crop, Mosaic Landscape, Rural Landscapes, The Future

Artisanal and small-scale mining (ASM) in sub-Saharan Africa creates considerable dynamics in rural landscapes. Many studies have addressed the adverse effects of mining, but few use participatory spatial tools to assess the effects on land use. Hence, this paper takes an actor perspective to analyze how communities in a mixed farming-mining area in Ghana’s Eastern Region perceive the spatial dynamics of ASM and its effects on land for farming and food production from past (1986) to present (2018) and toward the future (2035). Participatory maps show how participants visualize the transformation of food-crop areas into small- and large-scale mining, tree crops, and settlement in all the communities between 1986 and 2018 and foresee these trends continuing in the future (2035). Participants also observe how a mosaic landscape shifts toward a segregated landscape, with simultaneous fragmentation of their farming land due to ASM. Further segregation is expected in the future, with attribution to the expansion of settlements being an unexpected outcome. Although participants expect adverse effects on the future availability of food-crop land, no firm conclusions can be drawn about the anticipated effect on food availability. The paper argues that, if responsibly applied and used to reveal community perspectives and concerns about landscape dynamics, participatory mapping can help raise awareness of the need for collective action and contribute to more inclusive landscape governance. These findings contribute to debates on the operationalization of integrated and inclusive landscape approaches and governance, particularly in areas with pervasive impacts of ASM.
