information retrieval Latest Research Papers

Building Cultural Heritage Reference Collections from Social Media through Pooling Strategies: The Case of 2020’s Tensions Over Race and Heritage

Journal on Computing and Cultural Heritage ◽

10.1145/3477604 ◽

2022 ◽

Vol 15 (1) ◽

pp. 1-13

Author(s):

David Otero ◽

Patricia Martin-Rodilla ◽

Javier Parapar

Keyword(s):

Social Networks ◽

Social Media ◽

Information Retrieval ◽

Cultural Heritage ◽

Digital Archives ◽

Retrieval Method ◽

Left Behind ◽

Heritage Studies ◽

Reference Collection ◽

Pooling Strategies

Social networks constitute a valuable source for documenting heritage constitution processes or obtaining a real-time snapshot of a cultural heritage research topic. Many heritage researchers use social networks as a social thermometer to study these processes, creating for this purpose collections that constitute born-digital archives that are potentially reusable, searchable, and of interest to other researchers or citizens. However, retrieval and archiving techniques used in social networks within heritage studies are still semi-manual, which makes them time-consuming and hinders the reproducibility, evaluation, and opening-up of the collections created. By combining Information Retrieval strategies with emerging archival techniques, some of these weaknesses can be left behind. Specifically, pooling is a well-known Information Retrieval method to extract a sample of documents from an entire document set (posts, in the case of social network information), obtaining the most complete and unbiased set of relevant documents on a given topic. Using this approach, researchers could create a reference collection while avoiding annotating the entire corpus of documents or posts retrieved. This is especially useful in social media due to the large number of topics treated by the same user or in the same thread or post. We present a platform for applying pooling strategies combined with expert judgment to create cultural heritage reference collections from social networks in a customisable, reproducible, documented, and shareable way. The platform is validated by building a reference collection from a social network about the recent attacks on patrimonial entities motivated by anti-racist protests. This reference collection and the results obtained from its preliminary study are available for use. This real application has allowed us to validate the platform and the pooling strategies for creating reference collections in heritage studies from social networks.
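
As a rough illustration of the pooling idea described in this abstract, the sketch below implements classic depth-k pooling: the top k posts from each ranked run are merged into one pool, and only that pool is sent to the expert assessors. The abstract does not say which pooling strategies the platform implements, so the function name `depth_k_pool`, the run names, and the post identifiers are all hypothetical.

```python
from typing import Dict, List, Set


def depth_k_pool(runs: Dict[str, List[str]], k: int) -> Set[str]:
    """Return the union of the top-k document ids from every ranked run.

    `runs` maps a (hypothetical) system name to its ranking, best first.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    pool: Set[str] = set()
    for ranking in runs.values():
        # Only the k highest-ranked posts of each run enter the pool.
        pool.update(ranking[:k])
    return pool


# Two made-up runs over the same set of posts.
runs = {
    "run_a": ["p3", "p1", "p7", "p2"],
    "run_b": ["p1", "p4", "p3", "p9"],
}

# Only these posts would be judged, rather than the whole corpus.
print(sorted(depth_k_pool(runs, k=2)))
```

With k=2 the assessors would judge three posts instead of the six distinct posts retrieved across both runs, which is the saving in annotation effort that the abstract points to.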

A Comprehensive Guideline for Bengali Sentiment Annotation

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3474363 ◽

2022 ◽

Vol 21 (2) ◽

pp. 1-19

Author(s):

Md. Saddam Hossain Mukta ◽

Md. Adnanul Islam ◽

Faisal Ahamed Khan ◽

Afjal Hossain ◽

Shuvanon Razik ◽

...

Keyword(s):

Data Mining ◽

Information Retrieval ◽

Sentiment Analysis ◽

Computational Linguistics ◽

Language Processing ◽

Web Mining ◽

English Language ◽

Research Work ◽

Bengali Language

Sentiment Analysis (SA) is a Natural Language Processing (NLP) and an Information Extraction (IE) task that primarily aims to obtain the writer’s feelings, expressed as positive or negative, by analyzing a large number of documents. SA is also widely studied in the fields of data mining, web mining, text mining, and information retrieval. The fundamental task in sentiment analysis is to classify the polarity of a given content as Positive, Negative, or Neutral. Although extensive research has been conducted in this area of computational linguistics, most of the research work has been carried out in the context of the English language. However, Bengali sentiment expression has varying degrees of sentiment labels, which can be plausibly distinct from the English language. Therefore, it is important that sentiment assessment of the Bengali language be developed and executed properly. In sentiment analysis, the prediction potential of an automatic model is completely dependent on the quality of dataset annotation. Bengali sentiment annotation is a challenging task due to the diversified structures (syntax) of the language and its different degrees of innate sentiments (i.e., weakly and strongly positive/negative sentiments). Thus, in this article, we propose a novel and precise guideline for researchers, linguistic experts, and referees to annotate Bengali sentences accurately, with a view to building effective datasets for automatic sentiment prediction.

EMD Based Semantic User Similarity using Past Travel Histories

Journal of Cases on Information Technology ◽

10.4018/jcit.20220801oa04 ◽

2022 ◽

Vol 24 (3) ◽

pp. 0-0

Keyword(s):

Information Retrieval ◽

Mobile Devices ◽

Semantic Similarity ◽

Similarity Measure ◽

Similarity Measures ◽

Cost Effective ◽

Semantic Similarity Measure ◽

User Similarity ◽

Percentage Improvement ◽

The Cost

The low cost and easy availability of handheld mobile devices and the ubiquity of location acquisition services such as GPS and GSM networks have made it easy to log and share the location histories of mobile users. This work aims to find semantic user similarity using their past travel histories. Applications of the semantic similarity measure can be found in tourism-related recommender systems and information retrieval. The paper presents an Earth Mover’s Distance (EMD) based semantic user similarity measure using users' GPS logs. The similarity measure is applied and evaluated on the GPS dataset of 182 users collected from April 2007 to August 2012 by Microsoft's GeoLife project. The proposed similarity measure is compared with conventional similarity measures used in the literature, such as Jaccard, Dice, and Pearson's correlation. The percentage improvement of the EMD-based approach over existing approaches is 10.70% in terms of average RMSE and 5.73% in terms of average MAE.
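
For readers unfamiliar with the baseline measures named in this abstract, the following is a minimal sketch of the Jaccard and Dice coefficients applied to two users' sets of visited place categories. The abstract does not describe how the paper represents a travel history, so treating each user as a set of category labels is an assumption made here for illustration, and the category names are invented.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient: 2 * |A intersect B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical place categories visited by two users.
user_u = {"museum", "park", "cafe"}
user_v = {"park", "cafe", "stadium"}

print(jaccard(user_u, user_v))  # 2 shared out of 4 distinct categories
print(dice(user_u, user_v))     # 2 * 2 shared out of 3 + 3 categories
```

Both coefficients only count exact overlap between the two sets; they ignore how often a place was visited and how closely related two different categories are.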

Opportunities and Challenges in Code Search Tools

ACM Computing Surveys ◽

10.1145/3480027 ◽

2022 ◽

Vol 54 (9) ◽

pp. 1-40

Author(s):

Chao Liu ◽

Xin Xia ◽

David Lo ◽

Cuiyun Gao ◽

Xiaohu Yang ◽

...

Keyword(s):

Information Retrieval ◽

Deep Learning ◽

Software Engineering ◽

Software Development ◽

Large Scale ◽

Research Trends ◽

Publication Trends ◽

Code Search ◽

Search Tasks ◽

Efficiency And Effectiveness

Code search is a core software engineering task. Effective code search tools can help developers substantially improve their software development efficiency and effectiveness. In recent years, many code search studies have leveraged different techniques, such as deep learning and information retrieval approaches, to retrieve expected code from a large-scale codebase. However, there is a lack of a comprehensive comparative summary of existing code search approaches. To understand the research trends in existing code search studies, we systematically reviewed 81 relevant studies. We investigated the publication trends of code search studies, analyzed key components, such as codebase, query, and modeling technique used to build code search tools, and classified existing tools into focusing on supporting seven different search tasks. Based on our findings, we identified a set of outstanding challenges in existing studies and a research roadmap for future code search research.

EMD-Based Semantic User Similarity Using Past Travel Histories

Journal of Cases on Information Technology ◽

10.4018/jcit.20220701.oa2 ◽

2022 ◽

Vol 24 (3) ◽

pp. 1-17

Author(s):

Sunita Tiwari ◽

Saroj Kaushik

Keyword(s):

Information Retrieval ◽

Mobile Devices ◽

Semantic Similarity ◽

Similarity Measure ◽

Similarity Measures ◽

Cost Effective ◽

Semantic Similarity Measure ◽

User Similarity ◽

Percentage Improvement ◽

The Cost

The low cost and easy availability of handheld mobile devices and the ubiquity of location acquisition services such as GPS and GSM networks have made it easy to log and share the location histories of mobile users. This work aims to find semantic user similarity using their past travel histories. Applications of the semantic similarity measure can be found in tourism-related recommender systems and information retrieval. The paper presents an Earth Mover’s Distance (EMD) based semantic user similarity measure using users' GPS logs. The similarity measure is applied and evaluated on the GPS dataset of 182 users collected from April 2007 to August 2012 by Microsoft's GeoLife project. The proposed similarity measure is compared with conventional similarity measures used in the literature, such as Jaccard, Dice, and Pearson's correlation. The percentage improvement of the EMD-based approach over existing approaches is 10.70% in terms of average RMSE and 5.73% in terms of average MAE.
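
To give a feel for what an Earth Mover's Distance measures, here is a minimal sketch for the simplest case: two histograms over the same ordered bins with unit spacing between neighbouring bins. In that one-dimensional case the EMD equals the sum of absolute differences between the two cumulative distributions. This is a simplification chosen for illustration; the abstract does not state the ground distance or the histogram layout the paper uses, and the function name `emd_1d` is hypothetical.

```python
from itertools import accumulate
from typing import Sequence


def emd_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """EMD between two histograms on the same ordered, unit-spaced bins.

    Both histograms are normalised to total mass 1 before comparison.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    mass_p, mass_q = sum(p), sum(q)
    if mass_p <= 0 or mass_q <= 0:
        raise ValueError("histograms must have positive total mass")
    diffs = (a / mass_p - b / mass_q for a, b in zip(p, q))
    # The running sum is the mass that must still be carried past each bin;
    # summing its absolute value gives the total work.
    return sum(abs(c) for c in accumulate(diffs))


# All mass in the first bin versus all mass in the third bin:
# one unit of mass has to travel two bins.
print(emd_1d([1, 0, 0], [0, 0, 1]))
```

Unlike set-overlap measures, the result grows with how far mass has to move, so two users whose visits fall in neighbouring bins come out closer than two users whose visits fall in distant bins.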

Semantic Information Retrieval on Medical Texts

ACM Computing Surveys ◽

10.1145/3462476 ◽

2022 ◽

Vol 54 (7) ◽

pp. 1-38

Author(s):

Lynda Tamine ◽

Lorraine Goeuriot

Keyword(s):

Information Retrieval ◽

Health Informatics ◽

Medical Information ◽

State Of The Art ◽

Lessons Learned ◽

Semantic Search ◽

Future Research ◽

Cross Model ◽

Wide Range ◽

Search Systems

The explosive growth and widespread accessibility of medical information on the Internet have led to a surge of research activity in a wide range of scientific communities including health informatics and information retrieval (IR). One of the common concerns of this research, across these disciplines, is how to design either clinical decision support systems or medical search engines capable of providing adequate support for both novices (e.g., patients and their next-of-kin) and experts (e.g., physicians, clinicians) tackling complex tasks (e.g., search for diagnosis, search for a treatment). However, despite the significant multi-disciplinary research advances, current medical search systems exhibit low levels of performance. This survey provides an overview of the state of the art in the disciplines of IR and health informatics, and bridging these disciplines shows how semantic search techniques can facilitate medical IR. First, we will give a broad picture of semantic search and medical IR and then highlight the major scientific challenges. Second, focusing on the semantic gap challenge, we will discuss representative state-of-the-art work related to feature-based as well as semantic-based representation and matching models that support medical search systems. In addition to seminal works, we will present recent works that rely on research advancements in deep learning. Third, we make a thorough cross-model analysis and provide some findings and lessons learned. Finally, we discuss some open issues and possible promising directions for future research trends.

Question answering method for infrastructure damage information retrieval from textual data using bidirectional encoder representations from transformers

Automation in Construction ◽

10.1016/j.autcon.2021.104061 ◽

2022 ◽

Vol 134 ◽

pp. 104061

Author(s):

Yohan Kim ◽

Seongdeok Bang ◽

Jiu Sohn ◽

Hyoungkwan Kim

Keyword(s):

Information Retrieval ◽

Question Answering ◽

Textual Data

An automatic query expansion based on hybrid CMO-COOT algorithm for optimized information retrieval

The Journal of Supercomputing ◽

10.1007/s11227-021-04171-y ◽

2022 ◽

Author(s):

Abdullah Saleh Alqahtani ◽

P. Saravanan ◽

M. Maheswari ◽

Sami Alshmrany

Keyword(s):

Information Retrieval ◽

Query Expansion

The speed of information propagation in the scientific network distorts biomedical research

PeerJ ◽

10.7717/peerj.12764 ◽

2022 ◽

Vol 10 ◽

pp. e12764

Author(s):

Raul Rodriguez-Esteban

Keyword(s):

Information Retrieval ◽

Scientific Literature ◽

Negative Impact ◽

Scientific Progress ◽

Chemical Compounds ◽

Scientific Work ◽

Scientific Fact ◽

Path Dependent ◽

Scientific Facts ◽

Do So

Delays in the propagation of scientific discoveries across scientific communities have been an oft-maligned feature of scientific research for introducing a bias towards knowledge that is produced within a scientist’s closest community. The vastness of the scientific literature has been commonly blamed for this phenomenon, despite recent improvements in information retrieval and text mining. Its actual negative impact on scientific progress, however, has never been quantified. This analysis attempts to do so by exploring its effects on biomedical discovery, particularly in the discovery of relations between diseases, genes and chemical compounds. Results indicate that the probability that two scientific facts will enable the discovery of a new fact depends on how far apart these two facts were originally within the scientific landscape. In particular, the probability decreases exponentially with the citation distance. Thus, the direction of scientific progress is distorted based on the location in which each scientific fact is published, representing a path-dependent bias in which originally closely-located discoveries drive the sequence of future discoveries. To counter this bias, scientists should open the scope of their scientific work with modern information retrieval and extraction approaches.
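
The exponential relationship stated in this abstract can be written compactly. The symbols below are assumptions introduced for illustration and do not come from the paper: $d$ is the citation distance between two facts, $P(d)$ the probability that they enable a new discovery, $P_0$ the probability at distance zero, and $\lambda > 0$ a decay rate.

```latex
\[
  P(d) = P_0 \, e^{-\lambda d}, \qquad \lambda > 0,
  \qquad\text{so that}\qquad
  \frac{P(d+1)}{P(d)} = e^{-\lambda}.
\]
```

Under this form, each additional citation step multiplies the probability by the same constant factor $e^{-\lambda} < 1$, which is what an exponential decrease means.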

Measurement of clustering effectiveness for document collections

Information Retrieval ◽

10.1007/s10791-021-09401-8 ◽

2022 ◽

Author(s):

Meng Yuan ◽

Justin Zobel ◽

Pauline Lin

Keyword(s):

Information Retrieval ◽

Measurement Techniques ◽

High Dimensionality ◽

Clustering Methods ◽

Clustering Method ◽

Similar Material ◽

Document Collections ◽

Clustering Techniques

Clustering of the contents of a document corpus is used to create sub-corpora with the intention that they are expected to consist of documents that are related to each other. However, while clustering is used in a variety of ways in document applications such as information retrieval, and a range of methods have been applied to the task, there has been relatively little exploration of how well it works in practice. Indeed, given the high dimensionality of the data it is possible that clustering may not always produce meaningful outcomes. In this paper we use a well-known clustering method to explore a variety of techniques, existing and novel, to measure clustering effectiveness. Results with our new, extrinsic techniques based on relevance judgements or retrieved documents demonstrate that retrieval-based information can be used to assess the quality of clustering, and also show that clustering can succeed to some extent at gathering together similar material. Further, they show that intrinsic clustering techniques that have been shown to be informative in other domains do not work for information retrieval. Whether clustering is sufficiently effective to have a significant impact on practical retrieval is unclear, but, as the results show, our measurement techniques can effectively distinguish between clustering methods.
