Using term similarity measures for classifying short document data

Recent research shows that an ontology used as background knowledge can improve document clustering quality through its concept hierarchy. Previous studies use term semantic similarity as an important means of incorporating domain knowledge into the clustering process, for example in clustering initialization and term re-weighting. However, few studies have examined how different types of term similarity measures affect clustering performance in a given domain. In this article, we conduct a comparative study of how different term semantic similarity measures, including path-based, information-content-based, and feature-based measures, affect document clustering. Term re-weighting of the document vector is an important method for integrating a domain ontology into the clustering process. Specifically, the weight of a term is augmented by the weights of its co-occurring concepts. Spherical k-means is used to evaluate document vector re-weighting on two real-world datasets: Disease10 and OHSUMED23. Experimental results on nine different semantic measures show that: (1) no single type of similarity measure significantly outperforms the others; (2) several similarity measures perform more stably than the others; (3) term re-weighting has a positive effect on medical document clustering, but the effect may not be significant when documents contain few terms.


Medical Document Clustering Using Ontology-Based Term Similarity Measures

Medical Informatics ◽

10.4018/978-1-60566-050-9.ch169 ◽

2011 ◽

pp. 2232-2243

Author(s):

Xiaodan Zhang ◽

Liping Jing ◽

Xiaohua Hu ◽

Michael Ng ◽

Jiali Xia ◽

...

Keyword(s):

Semantic Similarity ◽

Domain Knowledge ◽

Document Clustering ◽

Similarity Measures ◽

Concept Hierarchy ◽

Term Similarity ◽

Feature Based ◽

Document Vector ◽

Real World Datasets ◽

Medical Document

Recent research shows that an ontology used as background knowledge can improve document clustering quality through its concept hierarchy. Previous studies use term semantic similarity as an important means of incorporating domain knowledge into the clustering process, for example in clustering initialization and term re-weighting. However, few studies have examined how different types of term similarity measures affect clustering performance in a given domain. In this article, we conduct a comparative study of how different term semantic similarity measures, including path-based, information-content-based, and feature-based measures, affect document clustering. Term re-weighting of the document vector is an important method for integrating a domain ontology into the clustering process. Specifically, the weight of a term is augmented by the weights of its co-occurring concepts. Spherical k-means is used to evaluate document vector re-weighting on two real-world datasets: Disease10 and OHSUMED23. Experimental results on nine different semantic measures show that: (1) no single type of similarity measure significantly outperforms the others; (2) several similarity measures perform more stably than the others; (3) term re-weighting has a positive effect on medical document clustering, but the effect may not be significant when documents contain few terms.
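
The re-weighting idea in the abstract above can be illustrated with a small sketch. It is not the authors' exact formulation: the additive rule, the function names `reweight` and `normalize`, and the toy similarity matrix are all assumptions made for illustration. Each term already present in a document has its weight increased by the similarity-scaled weights of the other terms in the same document, and the result is scaled to unit length, which is the input form spherical k-means expects.

```python
import numpy as np

def reweight(doc, sim):
    """Augment each present term's weight with similarity-scaled
    weights of the other terms in the same document (assumed rule)."""
    doc = np.asarray(doc, dtype=float)
    sim = np.asarray(sim, dtype=float)
    # Zero the diagonal so a term does not boost itself.
    off_diag = sim - np.diag(np.diag(sim))
    boost = off_diag @ doc
    # Only terms that already occur in the document are re-weighted.
    return np.where(doc > 0, doc + boost, 0.0)

def normalize(vec):
    """Scale a vector to unit L2 length (left unchanged if all zeros)."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Hypothetical 3-term vocabulary with made-up pairwise similarities.
sim = [[1.0, 0.5, 0.2],
       [0.5, 1.0, 0.8],
       [0.2, 0.8, 1.0]]
doc = [1.0, 2.0, 0.0]

weighted = reweight(doc, sim)
unit = normalize(weighted)
```

In this toy case the first two terms reinforce each other, while the absent third term stays at zero. A document with very few terms has few co-occurring concepts to draw on, which is consistent with the abstract's remark that the benefit may be small for short documents.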


Using term similarity measures for classifying short document data

International Journal of Computational Intelligence Studies ◽

10.1504/ijcistudies.2021.115430 ◽

2021 ◽

Vol 10 (2/3) ◽

pp. 181

Author(s):

Hirohisa Seki ◽

Shuhei Toriyama

Keyword(s):

Similarity Measures ◽

Term Similarity


Deep fusion of multiple term-similarity measures for biomedical passage retrieval

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-179887 ◽

2020 ◽

pp. 1-10

Author(s):

Andrés Rosso-Mateus ◽

Manuel Montes-y-Gómez ◽

Paolo Rosso ◽

Fabio A. González

Keyword(s):

Similarity Measures ◽

Passage Retrieval ◽

Term Similarity


A Topology-Based Metric for Measuring Term Similarity in the Gene Ontology

Advances in Bioinformatics ◽

10.1155/2012/975783 ◽

2012 ◽

Vol 2012 ◽

pp. 1-17 ◽

Cited By ~ 27

Author(s):

Gaston K. Mazandu ◽

Nicola J. Mulder

Keyword(s):

Gene Ontology ◽

Protein Function ◽

Protein Function Prediction ◽

Similarity Measures ◽

Biological Knowledge ◽

Online Tool ◽

Protein Protein Interaction ◽

Or Groups ◽

Term Similarity ◽

Go Terms

The wide coverage and biological relevance of the Gene Ontology (GO), confirmed through its successful use in protein function prediction, have led to the growth in its popularity. In order to exploit the extent of biological knowledge that GO offers in describing genes or groups of genes, there is a need for an efficient, scalable similarity measure for GO terms and GO-annotated proteins. While several GO similarity measures exist, none adequately addresses all issues surrounding the design and usage of the ontology. We introduce a new metric for measuring the distance between two GO terms using the intrinsic topology of the GO-DAG, thus enabling the measurement of functional similarities between proteins based on their GO annotations. We assess the performance of this metric using a ROC analysis on human protein-protein interaction datasets and correlation coefficient analysis on the selected set of protein pairs from the CESSM online tool. This metric achieves good performance compared to the existing annotation-based GO measures. We used this new metric to assess functional similarity between orthologues, and show that it is effective at determining whether orthologues are annotated with similar functions and identifying cases where annotation is inconsistent between orthologues.


A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering

Advances in Databases: Concepts, Systems and Applications - Lecture Notes in Computer Science ◽

10.1007/978-3-540-71703-4_12 ◽

2007 ◽

pp. 115-126 ◽

Cited By ~ 37

Author(s):

Xiaodan Zhang ◽

Liping Jing ◽

Xiaohua Hu ◽

Michael Ng ◽

Xiaohua Zhou

Keyword(s):

Comparative Study ◽

Document Clustering ◽

Similarity Measures ◽

Term Similarity


Medical Document Clustering Using Ontology-Based Term Similarity Measures

International Journal of Data Warehousing and Mining ◽

10.4018/jdwm.2008010104 ◽

2008 ◽

Vol 4 (1) ◽

pp. 62-73 ◽

Cited By ~ 26

Author(s):

Xiaodan Zhang ◽

Liping Jing ◽

Xiaohua Hu ◽

Michael Ng ◽

Jiali Xia ◽

...

Keyword(s):

Document Clustering ◽

Similarity Measures ◽

Term Similarity ◽

Medical Document


Evaluating topology-based metrics for GO term similarity measures

2013 IEEE International Conference on Bioinformatics and Biomedicine ◽

10.1109/bibm.2013.6732457 ◽

2013 ◽

Cited By ~ 2

Author(s):

Jong Cheol Jeong ◽

Xue-wen Chen

Keyword(s):

Similarity Measures ◽

Term Similarity


Semantic Similarity Analysis on Knowledge Based and Prediction Based Models

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.f3783.049620 ◽

2020 ◽

Vol 9 (6) ◽

pp. 477-481

Keyword(s):

Semantic Information ◽

Similarity Measures ◽

Similarity Analysis ◽

Document Similarity ◽

Knowledge Based ◽

Sentence Similarity ◽

Semantic Models ◽

Term Similarity ◽

Proximity Measures ◽

Semantic Similarity Analysis

The similarity between two synsets or concepts is a numerical measure of the degree to which the two objects are alike; a similarity measure expresses how close two synsets or concepts are. Similarity and dissimilarity are both covered by the term proximity, and proximity measures are defined to take values in the interval [0, 1]. Text similarity comprises term similarity, sentence similarity, and document similarity. Term similarity measures the similarity between individual tokens and words, sentence similarity is the similarity between two or more sentences, and document similarity measures the similarity between two or more corpora. This paper compares knowledge-based, distribution-based, and prediction-based semantic models and shows how knowledge-based methods capture information and how prediction-based methods preserve semantic information.
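
As a minimal illustration of a knowledge-based measure with values in [0, 1], the sketch below scores two concepts by the length of the path between them in a taxonomy, using 1 / (1 + path length). The taxonomy, its concept names, and the function names are hypothetical and are not taken from the paper.

```python
# Hypothetical toy taxonomy: each concept maps to its parent (None = root).
TAXONOMY = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "influenza": "infection",
    "pneumonia": "infection",
}

def path_to_root(concept):
    """Return the chain of concepts from `concept` up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = TAXONOMY[concept]
    return path

def path_similarity(a, b):
    """Similarity in (0, 1]: 1 / (1 + edges on the path joining a and b)."""
    steps_from_b = {c: i for i, c in enumerate(path_to_root(b))}
    for steps_from_a, c in enumerate(path_to_root(a)):
        if c in steps_from_b:
            # c is the lowest common ancestor of a and b.
            return 1.0 / (1 + steps_from_a + steps_from_b[c])
    return 0.0  # no common ancestor (cannot happen in a single-rooted tree)

same = path_similarity("influenza", "influenza")
siblings = path_similarity("influenza", "pneumonia")
cousins = path_similarity("influenza", "cancer")
```

Identical concepts score 1, and the score shrinks as the connecting path grows, so sibling concepts score higher than concepts that only meet at the root.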
