UTILIZING LOCAL CONTEXT FOR EFFECTIVE INFORMATION RETRIEVAL

Our research focuses on the use of local context through relation matching to improve retrieval effectiveness. An information retrieval (IR) model that integrates relation and keyword matching has been used in this work. The model takes advantage of any existing relational similarity between documents and query to improve retrieval effectiveness. It ranks a document higher when the query concepts are involved in relationships similar to those in the query than when they are related differently. A conceptual graph (CG) representation has been used to capture relationships between concepts. A simplified form of graph matching has been used to keep our model computationally tractable. Structural variations have been captured during matching through simple heuristics. Four different CG similarity measures have been proposed and used to evaluate the performance of our model. We observed a maximum improvement of 7.37% in precision with the second CG similarity measure. The document collection used in this study is CACM-3204. The CG similarity measure we propose is simple, flexible and scalable, and can find application in many IR-related tasks such as information filtering, information extraction, question answering and document summarization.

A new similarity measure for vector space models in text classification and information retrieval

Journal of Information Science ◽

10.1177/0165551520968055 ◽

2020 ◽

pp. 016555152096805

Author(s):

Mete Eminagaoglu

Keyword(s):

Information Retrieval ◽

Vector Space ◽

Similarity Measure ◽

Text Classification ◽

Pearson Correlation ◽

Clustering Algorithms ◽

Similarity Measures ◽

Manhattan Distance ◽

Vector Space Models ◽

Classification Information

Various models, methodologies and algorithms are available today for document classification, information retrieval and other text mining applications and systems. Among them are vector space–based models, which have distance metrics or similarity measures at their core. The vector space–based model is one of the fast and simple alternatives for processing textual data; however, its accuracy, precision and reliability still need significant improvement. In this study, a new similarity measure is proposed, which can be used effectively with vector space models and related algorithms such as k-nearest neighbours (k-NN) and Rocchio, as well as some clustering algorithms such as K-means. The proposed similarity measure is tested on several benchmark data sets in Turkish and English, and the results are compared with standard metrics such as Euclidean distance, Manhattan distance, Chebyshev distance, Canberra distance, Bray–Curtis dissimilarity, Pearson correlation coefficient and Cosine similarity. The promising results obtained show that the proposed similarity measure could be used as an alternative within all suitable algorithms and models for information retrieval, document clustering and text classification.
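
For orientation, the standard baseline measures named in this abstract can be sketched in a few lines of Python. This is only an illustration of the textbook definitions on dense term-weight vectors; it does not reproduce the new measure proposed in the paper, which the abstract does not define. The function names and the example vectors are assumptions made for this sketch.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block (L1) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute difference over all components."""
    return max(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    """Canberra distance; components where both values are zero are skipped."""
    return sum(abs(x - y) / (abs(x) + abs(y)) for x, y in zip(a, b) if x or y)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity; defined here as 0.0 when both vectors are all zero."""
    denom = sum(abs(x + y) for x, y in zip(a, b))
    return sum(abs(x - y) for x, y in zip(a, b)) / denom if denom else 0.0


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pearson(a, b):
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


# Hypothetical term-frequency vectors for two short documents.
doc_a = [2, 0, 1, 3]
doc_b = [1, 1, 0, 3]
for measure in (euclidean, manhattan, chebyshev, canberra,
                bray_curtis, cosine_similarity, pearson):
    print(f"{measure.__name__}: {measure(doc_a, doc_b):.4f}")
```

Note that the first five functions are distances (smaller means more alike), while cosine and Pearson are similarities (larger means more alike), so they cannot be swapped into an algorithm without adjusting the ranking direction.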

The NarrativeQA Reading Comprehension Challenge

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00023 ◽

2018 ◽

Vol 6 ◽

pp. 317-328 ◽

Cited By ~ 24

Author(s):

Tomáš Kočiský ◽

Jonathan Schwarz ◽

Phil Blunsom ◽

Chris Dyer ◽

Karl Moritz Hermann ◽

...

Keyword(s):

Information Retrieval ◽

Reading Comprehension ◽

Pattern Matching ◽

Question Answering ◽

Local Context ◽

Artificial Agents ◽

Learning To Read ◽

Term Frequency ◽

Context Similarity

Reading comprehension (RC)—in contrast to information retrieval—requires integrating information and reasoning about events, entities, and their relations across a full document. Question answering is conventionally used to assess RC ability, in both artificial agents and children learning to read. However, existing RC datasets and tasks are dominated by questions that can be solved by selecting answers using superficial information (e.g., local context similarity or global term frequency); they thus fail to test for the essential integrative aspect of RC. To encourage progress on deeper comprehension of language, we present a new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts. These tasks are designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience. We show that although humans solve the tasks easily, standard RC models struggle on the tasks presented here. We provide an analysis of the dataset and the challenges it presents.

A Comparison of Similarity Measures for Text Documents

Journal of Information & Knowledge Management ◽

10.1142/s0219649208001889 ◽

2008 ◽

Vol 07 (01) ◽

pp. 1-8 ◽

Cited By ~ 11

Author(s):

Shanmugasundaram Hariharan ◽

Rengaramanujam Srinivasan

Keyword(s):

Information Retrieval ◽

Question Answering ◽

Document Clustering ◽

Similarity Measures ◽

Text Documents ◽

Document Similarity

Similarity is an important and widely used concept in many applications such as Document Summarisation, Question Answering, Information Retrieval, Document Clustering and Categorisation. This paper presents a comparison of various similarity measures for comparing the content of text documents. We have attempted to find the measure best suited to assessing document similarity for newspaper reports.

Probability model of sensitive similarity measures in information retrieval

International Journal of Advanced Robotic Systems ◽

10.1177/1729881420901425 ◽

2020 ◽

Vol 17 (1) ◽

pp. 172988142090142

Author(s):

Xiaolong Gu ◽

Jie Zhang

Keyword(s):

Information Retrieval ◽

Similarity Measure ◽

Probability Model ◽

Rapid Development ◽

Similarity Measures ◽

Data Sets ◽

Retrieval Model ◽

Scale Parameters ◽

Internet Age ◽

Realistic Problem

In today’s Internet age, large volumes of data are stored and used, and information retrieval technology is needed to sort through them in daily life; yet retrieval results are often inaccurate. An information retrieval model is an important framework and method for fast, complete, and accurate retrieval of the information users need. With the rapid development of information technology, great changes have taken place in people’s production and life, and various information network technologies are widely used. The resulting flow of information is growing explosively, and user requirements for information retrieval are rising accordingly. How to achieve personalized information retrieval over a large amount of mixed information, so that retrieval technology returns effective results, has become a realistic problem worth exploring. In this article, the application of a probability model based on a sensitive similarity measure in the information retrieval model is analyzed, and a similarity measure algorithm based on spectral clustering is proposed. By improving the similarity measure, the sensitivity to scale parameters is overcome and the retrieval precision is improved. To demonstrate the advantages of the proposed algorithm, this article compares it with the Ng-Jordan-Weiss (NJW) and deep sparse subspace clustering (DSSC) algorithms. The experimental results show that the proposed algorithm outperforms NJW and DSSC on different data sets under different evaluation indicators (Rand and F-measure).
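
The two evaluation indicators named at the end of this abstract can be illustrated with a small pair-counting sketch. The Rand index below follows its standard definition; the F-measure is shown in its pairwise form, which is one common variant and is an assumption here, since the abstract does not say which formulation the authors used. Function names and labels are hypothetical.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Count item pairs by whether each labelling puts them in the same cluster."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1  # grouped together by the clustering, but not in the reference
        elif same_true:
            fn += 1  # grouped together in the reference, but split by the clustering
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which the two labellings agree."""
    tp, fp, fn, tn = pair_counts(labels_true, labels_pred)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 1.0


def pairwise_f_measure(labels_true, labels_pred):
    """Harmonic mean of pairwise precision and recall (assumed variant)."""
    tp, fp, fn, _ = pair_counts(labels_true, labels_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


reference = [0, 0, 1, 1]
clustering = [0, 0, 1, 2]  # splits the second reference cluster in two
print(rand_index(reference, clustering))
print(pairwise_f_measure(reference, clustering))
```

Both scores depend only on which items are grouped together, so renaming the cluster labels does not change them.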

EMD Based Semantic User Similarity using Past Travel Histories

Journal of Cases on Information Technology ◽

10.4018/jcit.20220801oa04 ◽

2022 ◽

Vol 24 (3) ◽

pp. 0-0

Keyword(s):

Information Retrieval ◽

Mobile Devices ◽

Semantic Similarity ◽

Similarity Measure ◽

Similarity Measures ◽

Cost Effective ◽

Semantic Similarity Measure ◽

User Similarity ◽

Percentage Improvement ◽

The Cost

The low cost and easy availability of handheld mobile devices and the ubiquity of location acquisition services such as GPS and GSM networks have made it convenient to log and share the location histories of mobile users. This work aims to find semantic user similarity using their past travel histories. The semantic similarity measure can be applied in tourism-related recommender systems and information retrieval. The paper presents an Earth Mover’s Distance (EMD) based semantic user similarity measure using users' GPS logs. The similarity measure is applied and evaluated on the GPS dataset of 182 users collected from April 2007 to August 2012 by Microsoft's GeoLife project. The proposed similarity measure is compared with conventional similarity measures used in the literature, such as Jaccard, Dice, and Pearson's correlation. The EMD-based approach improves on existing approaches by 10.70% in average RMSE and 5.73% in average MAE.
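
As a rough illustration of the measures compared in this abstract, the sketch below computes Earth Mover's Distance for the simplest case, two histograms over the same ordered, unit-spaced bins, alongside the Jaccard and Dice set similarities. It is not the paper's method: the abstract does not give the semantic ground distance between locations, so the one-dimensional bin layout, the function names, and the example data are all assumptions made for illustration.

```python
def emd_1d(p, q):
    """EMD between two histograms over the same ordered, unit-spaced bins.

    Both histograms are normalised to total mass 1, so the result is the
    minimum amount of mass times distance (in bins) needed to turn p into q.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p == 0 or total_q == 0:
        raise ValueError("histograms must contain some mass")
    carried = 0.0  # surplus mass carried from earlier bins into the next one
    work = 0.0
    for x, y in zip(p, q):
        carried += x / total_p - y / total_q
        work += abs(carried)
    return work


def jaccard(a, b):
    """Size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """Twice the intersection size over the sum of the set sizes."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


# Hypothetical visit counts per location category for two users.
user_a = [4, 1, 0]
user_b = [0, 1, 4]
print(emd_1d(user_a, user_b))

# Hypothetical sets of visited places for the set-based baselines.
places_a = {"museum", "park", "cafe"}
places_b = {"park", "cafe", "stadium"}
print(jaccard(places_a, places_b), dice(places_a, places_b))
```

Unlike Jaccard and Dice, which only register exact overlap, EMD accounts for how far apart the non-matching mass lies, which is the property that makes it attractive when a notion of distance between categories is available.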

EMD-Based Semantic User Similarity Using Past Travel Histories

Journal of Cases on Information Technology ◽

10.4018/jcit.20220701.oa2 ◽

2022 ◽

Vol 24 (3) ◽

pp. 1-17

Author(s):

Sunita Tiwari ◽

Saroj Kaushik

Keyword(s):

Information Retrieval ◽

Mobile Devices ◽

Semantic Similarity ◽

Similarity Measure ◽

Similarity Measures ◽

Cost Effective ◽

Semantic Similarity Measure ◽

User Similarity ◽

Percentage Improvement ◽

The Cost

A New Similarity Measure for Document Classification and Text Mining

KnE Social Sciences ◽

10.18502/kss.v4i1.5999 ◽

2020 ◽

Author(s):

Mete Eminağaoğlu ◽

Yılmaz Gökşen

Keyword(s):

Knowledge Management ◽

Information Retrieval ◽

Text Mining ◽

Similarity Measure ◽

Pearson Correlation ◽

Similarity Measures ◽

Content Management ◽

Document Classification ◽

Classification Systems ◽

Textual Data

Accurate, efficient and fast processing of textual data and classification of electronic documents has become a key factor in knowledge management and related businesses in today’s world. Text mining, information retrieval, and document classification systems have a strong positive impact on digital libraries and electronic content management, e-marketing, electronic archives, customer relationship management, decision support systems, and the detection of copyright infringement and plagiarism, all of which directly affect economics, businesses, and organizations. In this study, we propose a new similarity measure that can be used with the k-nearest neighbors (k-NN) and Rocchio algorithms, which are among the well-known algorithms for document classification, information retrieval, and other text mining purposes. We have tested our novel similarity measure on some structured textual data sets and compared the results with standard distance metrics and similarity measures such as Cosine similarity, Euclidean distance, and Pearson correlation coefficient. We have obtained promising results, which show that the proposed similarity measure could be used as an alternative within all suitable algorithms, methods, and models for text mining, document classification, and relevant knowledge management systems.

Keywords: text mining, document classification, similarity measures, k-NN, Rocchio algorithm
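
To show where a similarity measure plugs into one of the algorithms named here, the following is a minimal sketch of Rocchio-style (nearest-centroid) classification. Cosine similarity is used as the stand-in measure because the abstract does not define the proposed one; the `similarity` parameter, the function names, and the toy data are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(vectors, labels):
    """Average the training vectors of each class into one centroid per class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(centroids, vec, similarity=cosine_similarity):
    """Return the class whose centroid is most similar to the document vector."""
    return max(centroids, key=lambda label: similarity(centroids[label], vec))


# Hypothetical term counts over the vocabulary: goal, match, stock, market.
train_vectors = [[3, 2, 0, 0], [2, 3, 0, 1], [0, 0, 3, 2], [0, 1, 2, 3]]
train_labels = ["sports", "sports", "finance", "finance"]
centroids = train_centroids(train_vectors, train_labels)

print(classify(centroids, [1, 2, 0, 0]))
print(classify(centroids, [0, 0, 1, 1]))
```

Because `classify` takes the measure as an argument, any function that returns larger values for more similar vectors could be passed in its place, which is the kind of substitution the abstract has in mind.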

Model for Evaluation of Information Retrieval Effectiveness within Semantic Web Concept

Proceedings of Universities ELECTRONICS ◽

10.24151/1561-5405-2018-23-3-308-312 ◽

2018 ◽

Vol 23 (3) ◽

pp. 308-312

Author(s):

V.V. Sliusar ◽

Keyword(s):

Information Retrieval ◽

Semantic Web ◽

Retrieval Effectiveness

MATHURA (MBI) - A NOVEL IMPUTATION MEASURE FOR IMPUTATION OF MISSING VALUES IN MEDICAL DATASETS

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813666191216123352 ◽

2019 ◽

Vol 13 ◽

Author(s):

B. Mathura Bai ◽

N. Mangathayaru ◽

B. Padmaja Rani ◽

Shadi Aljawarneh

Keyword(s):

Similarity Measure ◽

Medical Records ◽

Missing Values ◽

Similarity Measures ◽

Common Problems ◽

Experiment Analysis

Missing attribute values are one of the most common problems faced when mining medical datasets. Estimating missing values is a major challenge in the pre-processing of datasets. A wrong estimate of a missing attribute value can lead to inefficient and improper classification, resulting in lower classifier accuracy. Similarity measures play a key role during the imputation process. The use of an appropriate similarity measure can help to achieve better imputation and improved classification accuracy. This paper proposes a novel imputation measure for finding the similarity between missing and non-missing instances in medical datasets. Experiments are carried out by applying both the proposed imputation technique and popular existing benchmark imputation techniques. Classification is carried out using the KNN, J48, SMO and RBFN classifiers. Experimental analysis showed that after imputing medical records with the proposed technique, the classification accuracies reported by the KNN, J48 and SMO classifiers improved compared to the existing benchmark imputation techniques.
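
The role a similarity measure plays in imputation can be sketched with a generic nearest-neighbour imputer. This is not the measure proposed in the paper, which the abstract does not specify: Euclidean distance over the observed attributes is used purely as a stand-in, and the function name, the `k` parameter, and the example records are assumptions.

```python
import math


def knn_impute(record, complete_records, k=2):
    """Fill the None entries of `record` from its k most similar complete records.

    Similarity is measured with Euclidean distance over the attributes that
    are present in `record`; each missing value becomes the mean of the
    neighbours' values for that attribute.
    """
    if not complete_records:
        raise ValueError("at least one complete record is required")
    observed = [i for i, value in enumerate(record) if value is not None]

    def distance(other):
        return math.sqrt(sum((record[i] - other[i]) ** 2 for i in observed))

    neighbours = sorted(complete_records, key=distance)[:k]
    return [
        value if value is not None
        else sum(n[i] for n in neighbours) / len(neighbours)
        for i, value in enumerate(record)
    ]


# Hypothetical patient records with three numeric attributes.
complete = [
    [1.0, 2.0, 10.0],
    [1.0, 2.5, 12.0],
    [9.0, 9.0, 50.0],
]
incomplete = [1.0, 2.0, None]
print(knn_impute(incomplete, complete, k=2))
```

Swapping the `distance` function for a different measure changes which records count as neighbours, and therefore the imputed value, which is why the choice of measure affects downstream classification accuracy.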

Similarity Measure Approaches Applied in Text Document Clustering for Information Retrieval

2020 Sixth International Conference on Parallel, Distributed and Grid Computing (PDGC) ◽

10.1109/pdgc50313.2020.9315851 ◽

2020 ◽

Author(s):

Naveen Kumar ◽

Sanjay Kumar Yadav ◽

Divakar Singh Yadav

Keyword(s):

Information Retrieval ◽

Similarity Measure ◽

Document Clustering ◽

Text Document
