A New Similarity Measure for Document Classification and Text Mining

KnE Social Sciences ◽

10.18502/kss.v4i1.5999 ◽

2020 ◽

Author(s):

Mete Eminağaoğlu ◽

Yılmaz Gökşen

Keyword(s):

Knowledge Management ◽

Information Retrieval ◽

Text Mining ◽

Similarity Measure ◽

Pearson Correlation ◽

Similarity Measures ◽

Content Management ◽

Document Classification ◽

Classification Systems ◽

Textual Data

Accurate, efficient, and fast processing of textual data and classification of electronic documents have become key factors in knowledge management and related businesses. Text mining, information retrieval, and document classification systems have a strong positive impact on digital libraries and electronic content management, e-marketing, electronic archives, customer relationship management, decision support systems, and copyright infringement and plagiarism detection, all of which affect economics, businesses, and organizations. In this study, we propose a new similarity measure that can be used with the k-nearest neighbors (k-NN) and Rocchio algorithms, two well-known algorithms for document classification, information retrieval, and other text mining purposes. We tested the new similarity measure on several structured textual data sets and compared the results with standard distance metrics and similarity measures such as cosine similarity, Euclidean distance, and the Pearson correlation coefficient. The results are promising and suggest that the proposed similarity measure could be used as an alternative within suitable algorithms, methods, and models for text mining, document classification, and relevant knowledge management systems.

Keywords: text mining, document classification, similarity measures, k-NN, Rocchio algorithm
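The baseline measures named in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' proposed measure (which the abstract does not define): it implements cosine similarity, Euclidean distance, and Pearson correlation over term-count vectors and plugs them into a simple k-NN vote. The toy vectors, the labels, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # convention chosen here for constant vectors
    return cov / math.sqrt(var_a * var_b)


def knn_classify(query, training, k, similarity):
    """Majority vote among the k training vectors most similar to query.

    training is a list of (vector, label) pairs; similarity returns a
    score where larger means more alike.
    """
    ranked = sorted(training, key=lambda item: similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical term-count vectors over a four-word vocabulary.
TRAINING = [
    ([3, 0, 1, 0], "sports"),
    ([2, 1, 0, 0], "sports"),
    ([0, 2, 0, 3], "finance"),
    ([0, 1, 1, 2], "finance"),
]
QUERY = [1, 0, 0, 0]

print(knn_classify(QUERY, TRAINING, 3, cosine_similarity))
```

A distance such as Euclidean can be used with the same `knn_classify` by negating it, for example `lambda a, b: -euclidean_distance(a, b)`, so that smaller distances rank first.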

A new similarity measure for vector space models in text classification and information retrieval

Journal of Information Science ◽

10.1177/0165551520968055 ◽

2020 ◽

pp. 016555152096805

Author(s):

Mete Eminagaoglu

Keyword(s):

Information Retrieval ◽

Vector Space ◽

Similarity Measure ◽

Text Classification ◽

Pearson Correlation ◽

Clustering Algorithms ◽

Similarity Measures ◽

Manhattan Distance ◽

Vector Space Models ◽

Classification Information

Various models, methodologies, and algorithms are available today for document classification, information retrieval, and other text mining applications and systems. Among them are vector space–based models, which have distance metrics or similarity measures at their core. Vector space–based models are a fast and simple alternative for processing textual data; however, their accuracy, precision, and reliability still need significant improvement. In this study, a new similarity measure is proposed that can be used effectively with vector space models and related algorithms such as k-nearest neighbours (k-NN) and Rocchio, as well as clustering algorithms such as K-means. The proposed similarity measure is tested on universal benchmark data sets in Turkish and English, and the results are compared with standard metrics such as Euclidean distance, Manhattan distance, Chebyshev distance, Canberra distance, Bray–Curtis dissimilarity, the Pearson correlation coefficient, and cosine similarity. The results are promising and show that the proposed similarity measure could be used as an alternative within suitable algorithms and models for information retrieval, document clustering, and text classification.
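For readers less familiar with some of the comparison metrics listed in this abstract, the sketch below gives their standard textbook definitions for two equal-length vectors. It does not reproduce the proposed measure, and the handling of 0/0 terms in the Canberra distance (treated as contributing zero) is a convention assumed here.

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev_distance(a, b):
    """Largest absolute coordinate difference (L-infinity distance)."""
    return max(abs(x - y) for x, y in zip(a, b))


def canberra_distance(a, b):
    """Sum of |x - y| / (|x| + |y|); terms where both are zero are skipped."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom != 0:
            total += abs(x - y) / denom
    return total


def bray_curtis_dissimilarity(a, b):
    """Sum of |x - y| divided by the sum of |x + y|."""
    denom = sum(abs(x + y) for x, y in zip(a, b))
    if denom == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    return sum(abs(x - y) for x, y in zip(a, b)) / denom


# Two hypothetical term-count vectors.
doc_a = [1, 0, 2]
doc_b = [3, 0, 1]

print(manhattan_distance(doc_a, doc_b))         # 3
print(chebyshev_distance(doc_a, doc_b))         # 2
print(canberra_distance(doc_a, doc_b))          # 2/4 + 1/3
print(bray_curtis_dissimilarity(doc_a, doc_b))  # 3/7
```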

UTILIZING LOCAL CONTEXT FOR EFFECTIVE INFORMATION RETRIEVAL

International Journal of Information Technology & Decision Making ◽

10.1142/s0219622008002788 ◽

2008 ◽

Vol 07 (01) ◽

pp. 5-21 ◽

Cited By ~ 4

Author(s):

TANVEER J. SIDDIQUI ◽

UMA SHANKER TIWARY

Keyword(s):

Information Retrieval ◽

Similarity Measure ◽

Question Answering ◽

Graph Matching ◽

Information Filtering ◽

Similarity Measures ◽

Local Context ◽

Structural Variations ◽

Retrieval Effectiveness ◽

Relational Similarity

Our research focuses on the use of local context through relation matching to improve retrieval effectiveness. An information retrieval (IR) model that integrates relation and keyword matching has been used in this work. The model takes advantage of any existing relational similarity between documents and the query to improve retrieval effectiveness. It ranks a document higher when the query concepts are involved in relationships similar to those in the query than when they are related differently. A conceptual graph (CG) representation has been used to capture the relationships between concepts. A simplified form of graph matching has been used to keep the model computationally tractable. Structural variations have been captured during matching through simple heuristics. Four different CG similarity measures have been proposed and used to evaluate the performance of the model. We observed a maximum improvement of 7.37% in precision with the second CG similarity measure. The document collection used in this study is CACM-3204. The proposed CG similarity measure is simple, flexible, and scalable, and can be applied to many IR-related tasks such as information filtering, information extraction, question answering, and document summarization.

A New Semantic Similarity Measure Based On Ontology for Movie Rate Prediction

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.c4442.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 6756-6762

Keyword(s):

Semantic Similarity ◽

Similarity Measure ◽

Experimental Evaluation ◽

Pearson Correlation ◽

Similarity Measures ◽

Similarity Score ◽

Cosine Similarity ◽

Semantic Similarity Measure ◽

Rate Prediction ◽

Target User

A recommendation algorithm comprises two important steps: 1) predicting rates and 2) recommendation. Rate prediction is a cumulative function of the similarity score between two movies and the rating history of those movies by other users. There are various methods for rate prediction, such as the weighted sum method, regression, and deviation-based methods. All of these methods rely on finding items similar to those previously viewed or rated by the target user, on the assumption that a user tends to give similar ratings to similar items. The similarities can be computed using various measures such as Euclidean distance, cosine similarity, adjusted cosine similarity, Pearson correlation, and Jaccard similarity. All of these well-known approaches calculate the similarity score between two movies using simple rating-based data; hence, they cannot accurately model the rating behavior of the user. In this paper, we show that the accuracy of rate prediction can be enhanced by incorporating ontological domain knowledge into the similarity computation. This paper introduces a new ontological semantic similarity measure between two movies. For experimental evaluation, the performance of the proposed approach is compared with two existing approaches, 1) adjusted cosine similarity (ACS) and 2) the Weighted Slope One (WSO) algorithm, in terms of two performance measures: 1) execution time and 2) mean absolute error (MAE). The open-source MovieLens (ml-1m) dataset is used for experimental evaluation. As our results show, the ontological semantic similarity measure improves rate prediction performance compared with the existing well-known approaches.
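The weighted sum method and the MAE metric mentioned in this abstract can be illustrated as follows. This is a minimal sketch under assumed inputs: the item names, ratings, and similarity scores are hypothetical, and the similarity scores are taken as given rather than computed with the paper's ontological measure.

```python
def predict_rating(target_ratings, similarities):
    """Weighted sum prediction for one candidate item.

    target_ratings maps items the target user has rated to their ratings;
    similarities maps those items to their similarity with the candidate.
    Returns None when no rated item has a nonzero similarity.
    """
    shared = [item for item in target_ratings if item in similarities]
    weight_total = sum(abs(similarities[item]) for item in shared)
    if weight_total == 0:
        return None
    weighted = sum(similarities[item] * target_ratings[item] for item in shared)
    return weighted / weight_total


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical data: the user rated movies A and B; we predict movie C.
user_ratings = {"A": 5, "B": 3}
similarity_to_c = {"A": 0.8, "B": 0.2}

print(predict_rating(user_ratings, similarity_to_c))  # (0.8*5 + 0.2*3) / 1.0
print(mean_absolute_error([4, 3], [5, 3]))            # 0.5
```

Because the prediction is a similarity-weighted average, any improvement in the similarity scores feeds directly into the predicted rating, which is where a different similarity measure would take effect.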

Graph-based Representation for Sentence Similarity Measure: A Comparative Analysis

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.14.11149 ◽

2018 ◽

Vol 7 (2.14) ◽

pp. 32

Author(s):

Siti Sakira Kamaruddin ◽

Yuhanis Yusof ◽

Nur Azzah Abu Bakar ◽

Mohamed Ahmed Tayie ◽

Ghaith Abdulsattar A.Jabbar Alkubaisi

Keyword(s):

Comparative Analysis ◽

Similarity Measure ◽

Semantic Analysis ◽

Similarity Measures ◽

Jaccard Similarity ◽

Inverse Document Frequency ◽

Document Frequency ◽

Sentence Level ◽

Textual Data ◽

Document Level

Textual data are a rich source of knowledge; hence, sentence comparison has become one of the important tasks in text mining. Most previous work on text comparison has been performed at the document level, and research suggests that comparing text at the sentence level is a non-trivial problem. One reason is that two sentences can convey the same meaning with totally dissimilar words. This paper presents the results of a comparative analysis of three representation schemes, i.e. term frequency–inverse document frequency, Latent Semantic Analysis, and graph-based representation, using three similarity measures, i.e. cosine, Dice coefficient, and Jaccard similarity, to compare the similarity of sentences. The results reveal that the graph-based representation and the Jaccard similarity measure outperform the others in terms of precision, recall, and F-measure.
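Two of the similarity measures compared in this abstract, Jaccard and the Dice coefficient, are easy to show on sentences treated as sets of words. The sketch below assumes simple lowercase whitespace tokenization, which is an illustrative choice and not the representation used in the paper; the example sentences are made up.

```python
def tokenize(sentence):
    """Lowercase a sentence and split it into a set of distinct words."""
    return set(sentence.lower().split())


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def dice_coefficient(a, b):
    """Twice the intersection size divided by the sum of the set sizes."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / total


first = tokenize("The cat sat on the mat")
second = tokenize("The cat lay on the rug")

# Shared words: the, cat, on (3); union has 7 distinct words.
print(jaccard_similarity(first, second))  # 3/7
print(dice_coefficient(first, second))    # 6/10
```

Both scores depend only on word overlap, which illustrates the difficulty the abstract mentions: two sentences with the same meaning but different words would score close to zero.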

A New Hybrid Improved Method for Measuring Concept Semantic Similarity in WordNet

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/4/1 ◽

2019 ◽

Vol 17 (4) ◽

pp. 433-439

Author(s):

Xiaogang Zhang ◽

Shouqian Sun ◽

Kejun Zhang

Keyword(s):

Artificial Intelligence ◽

Knowledge Management ◽

Information Retrieval ◽

Natural Language Processing ◽

Semantic Similarity ◽

Language Processing ◽

Similarity Measures ◽

Improved Method ◽

Hierarchical Tree ◽

Semantic Computation

Computing semantic similarity between concepts is an important issue in natural language processing, artificial intelligence, information retrieval, and knowledge management. The measure of concept similarity is a foundation of semantic computation. In this paper, we analyze typical semantic similarity measures and note that Wu and Palmer’s measure does not distinguish between the similarities from one node to different nodes of the same level. We then combine the advantages of path-based and IC-based measures and propose a new hybrid method for measuring semantic similarity. Tests on a fragment of the WordNet hierarchical tree demonstrate that the proposed method accurately distinguishes the similarities from one node to different nodes of the same level and overcomes the shortcoming of Wu and Palmer’s measure.
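The shortcoming of Wu and Palmer's measure described in this abstract can be reproduced on a small example. The sketch below uses the standard Wu and Palmer formula, 2 * depth(lcs) / (depth(a) + depth(b)), on a hypothetical six-node taxonomy that stands in for a WordNet fragment; the tree, the node names, and the choice of root depth 1 are assumptions for illustration. The proposed hybrid method itself is not specified in the abstract and is not implemented here.

```python
# Hypothetical taxonomy: each node maps to its parent; "entity" is the root.
PARENT = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
    "flower": "plant",
}


def path_to_root(node):
    """List of nodes from the given node up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Number of nodes on the path to the root (the root has depth 1)."""
    return len(path_to_root(node))


def least_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")


def wu_palmer(a, b):
    """Wu and Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = least_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("dog", "cat"))     # lcs is "animal": 2*2 / (3+3)
print(wu_palmer("dog", "tree"))    # lcs is "entity": 2*1 / (3+3)
print(wu_palmer("dog", "flower"))  # same value as dog vs. tree
```

Because the formula depends only on depths, `dog` is equally similar to `tree` and to `flower`, which sit at the same level under the same subsumer. This is the kind of tie that a method combining path-based and IC-based information aims to break.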

Probability model of sensitive similarity measures in information retrieval

International Journal of Advanced Robotic Systems ◽

10.1177/1729881420901425 ◽

2020 ◽

Vol 17 (1) ◽

pp. 172988142090142

Author(s):

Xiaolong Gu ◽

Jie Zhang

Keyword(s):

Information Retrieval ◽

Similarity Measure ◽

Probability Model ◽

Rapid Development ◽

Similarity Measures ◽

Data Sets ◽

Retrieval Model ◽

Scale Parameters ◽

Internet Age ◽

Realistic Problem

In today’s Internet age, large amounts of data are stored and used. In daily life, information retrieval technology is used to sort through these data, yet retrieval results are often inaccurate. An information retrieval model is an important framework and method for fast, complete, and accurate information retrieval for users. With the rapid development of information technology, great changes have taken place in people’s production and life, and information network technologies are widely used. The resulting flow of information is growing explosively, and user requirements for information retrieval are rising. How to perform personalized information retrieval over a large amount of mixed information, so that retrieval technology returns effective results, has become a realistic problem worth exploring. In this article, the application of a probability model based on a sensitive similarity measure in an information retrieval model is analyzed, and a similarity measure algorithm based on spectral clustering is proposed. By improving the similarity measure, the sensitivity to scale parameters is overcome and retrieval precision is improved. To demonstrate the advantages of the proposed algorithm, it is compared with the Ng–Jordan–Weiss (NJW) and deep sparse subspace clustering (DSSC) algorithms. The experimental results show that the proposed algorithm outperforms the NJW and DSSC algorithms on different data sets under different evaluation indicators (Rand index and F-measure).
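Two ideas in this abstract lend themselves to a short illustration: the scale-parameter sensitivity of the Gaussian affinity commonly used in spectral clustering such as NJW, and the Rand index used for evaluation. The sketch below shows the standard Gaussian affinity exp(-d^2 / (2 * sigma^2)) and a pairwise Rand index. It is not the improved measure proposed in the article, and the sample points and labelings are hypothetical.

```python
import math
from itertools import combinations


def gaussian_affinity(x, y, sigma):
    """Gaussian (RBF) affinity between two points with scale sigma."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two clusterings agree.

    A pair agrees when both clusterings put the two points together,
    or both put them apart.
    """
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agreements += 1
    return agreements / len(pairs)


# The same pair of points looks close or unrelated depending on sigma.
p, q = (0.0, 0.0), (1.0, 0.0)
print(gaussian_affinity(p, q, sigma=1.0))  # exp(-0.5)
print(gaussian_affinity(p, q, sigma=0.1))  # exp(-50), effectively zero

# Cluster labels are arbitrary names, so a relabeled clustering scores 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))  # 0.5
```

The two affinity values differ by many orders of magnitude for the same pair of points, which is the sensitivity to the scale parameter that the article sets out to reduce.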

E-Government Documents and Data Clustering

Advances in Electronic Government, Digital Divide, and Regional Development - Handbook of Research on Democratic Strategies and Citizen-Centered E-Government Services ◽

10.4018/978-1-4666-7266-6.ch010 ◽

2015 ◽

pp. 164-191

Author(s):

Goran Šimić

Keyword(s):

Information Retrieval ◽

Data Clustering ◽

Similarity Measures ◽

Information Resources ◽

Specific Information ◽

Advanced Search ◽

Government Documents ◽

Textual Data ◽

The Cost ◽

Text Similarity Measures

This chapter is about documents and data clustering as a process of preparing the information resources stored in the e-government systems for advanced search. These resources are mainly represented as textual data stored as field values in the databases or located as documents in file repositories. Due to their growth in number, search for some specific information takes more time. Different techniques are used for this purpose. Most of them include information retrieval based on a variety of text similarity measures. The cost of such processing depends on preparation of resources for searching. Clustering represents the most commonly used technique for such a purpose, and this fact is the basic motive for this chapter.

EMD Based Semantic User Similarity using Past Travel Histories

Journal of Cases on Information Technology ◽

10.4018/jcit.20220801oa04 ◽

2022 ◽

Vol 24 (3) ◽

pp. 0-0

Keyword(s):

Information Retrieval ◽

Mobile Devices ◽

Semantic Similarity ◽

Similarity Measure ◽

Similarity Measures ◽

Cost Effective ◽

Semantic Similarity Measure ◽

User Similarity ◽

Percentage Improvement ◽

The Cost

The cost-effective and easy availability of handheld mobile devices and the ubiquity of location acquisition services such as GPS and GSM networks have made it easy to log and share the location histories of mobile users. This work aims to find semantic user similarity using past travel histories. The semantic similarity measure can be applied in tourism-related recommender systems and information retrieval. The paper presents an Earth Mover’s Distance (EMD) based semantic user similarity measure using users' GPS logs. The similarity measure is applied and evaluated on the GPS dataset of 182 users collected from April 2007 to August 2012 by Microsoft's GeoLife project. The proposed similarity measure is compared with conventional similarity measures used in the literature such as Jaccard, Dice, and Pearson’s correlation. The percentage improvement of the EMD-based approach over existing approaches is 10.70% in terms of average RMSE and 5.73% in terms of average MAE.

A Combinative Similarity Computing Measure for Collaborative Filtering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.347-350.2919 ◽

2013 ◽

Vol 347-350 ◽

pp. 2919-2925 ◽

Cited By ~ 2

Author(s):

Lin Guo ◽

Qin Ke Peng

Keyword(s):

Collaborative Filtering ◽

Similarity Measure ◽

Pearson Correlation ◽

Similarity Measures ◽

Cosine Similarity ◽

Computation Complexity ◽

Satisfactory Performance ◽

Similarity Method

The similarity method is the key to the user-based collaborative filtering recommendation algorithm. Traditional similarity measures, including cosine similarity, adjusted cosine similarity, and Pearson correlation similarity, have advantages such as being simple, easy, and fast, but on sparse datasets they may lead to poor recommendation quality. In this article, we first study how the recommendation quality of each of the three similarity methods changes across datasets of different sparsity, and then propose a combinative similarity measure that takes the items users have co-rated into account. Compared with the three algorithms, our method shows satisfactory performance with the same computational complexity.
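The abstract does not give the formula of the combinative measure, so the following is only one plausible way to make a user similarity depend on co-rated items, offered as an assumption and not as the authors' method. It computes Pearson correlation over the items both users rated and scales it by the fraction of co-rated items (a Jaccard-style weight), so that a high correlation based on very little overlap counts for less. The user profiles are hypothetical.

```python
import math


def pearson_on_corated(ratings_u, ratings_v):
    """Pearson correlation over items rated by both users.

    ratings_u and ratings_v map item ids to ratings. Means are taken
    over the co-rated items only (an assumption of this sketch).
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    cov = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)


def corated_weight(ratings_u, ratings_v):
    """Share of co-rated items among all items rated by either user."""
    union = set(ratings_u) | set(ratings_v)
    if not union:
        return 0.0
    return len(set(ratings_u) & set(ratings_v)) / len(union)


def combined_similarity(ratings_u, ratings_v):
    """Hypothetical combination: Pearson scaled by the co-rated share."""
    return pearson_on_corated(ratings_u, ratings_v) * corated_weight(ratings_u, ratings_v)


# Hypothetical user profiles: three co-rated items out of four in total.
alice = {"a": 5, "b": 3, "c": 4}
bob = {"a": 4, "b": 2, "c": 3, "d": 5}

print(pearson_on_corated(alice, bob))   # 1.0, ratings move together
print(corated_weight(alice, bob))       # 3/4
print(combined_similarity(alice, bob))  # 0.75
```

The extra weight adds only a set intersection and union per user pair, so the overall cost stays in the same order as plain Pearson correlation.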
