Using term similarity measures for classifying short document data

Author(s):  
Shuhei Toriyama ◽  
Hirohisa Seki
Author(s):  
Zhang Xiaodan ◽  
Jing Liping ◽  
Hu Xiaohua ◽  
Ng Michael ◽  
Xia Jiali ◽  
...  

Recent research shows that ontology as background knowledge can improve document clustering quality with its concept hierarchy knowledge. Previous studies take term semantic similarity as an important measure to incorporate domain knowledge into clustering process such as clustering initialization and term re-weighting. However, not many studies have been focused on how different types of term similarity measures affect the clustering performance for a certain domain. In this article, we conduct a comparative study on how different term semantic similarity measures including path-based, informationcontent- based and feature-based similarity measure affect document clustering. Term re-weighting of document vector is an important method to integrate domain ontology to clustering process. In detail, the weight of a term is augmented by the weights of its co-occurred concepts. Spherical k-means are used for evaluate document vector re-weighting on two real-world datasets: Disease10 and OHSUMED23. Experimental results on nine different semantic measures have shown that: (1) there is no certain type of similarity measures that significantly outperforms the others; (2) Several similarity measures have rather more stable performance than the others; (3) term re-weighting has positive effects on medical document clustering, but might not be significant when documents are short of terms.


2011 ◽  
pp. 2232-2243
Author(s):  
Xiaodan Zhang ◽  
Liping Jing ◽  
Xiaohua Hu ◽  
Michael Ng ◽  
Jiali Xia ◽  
...  

Recent research shows that ontology as background knowledge can improve document clustering quality with its concept hierarchy knowledge. Previous studies take term semantic similarity as an important measure to incorporate domain knowledge into clustering process such as clustering initialization and term re-weighting. However, not many studies have been focused on how different types of term similarity measures affect the clustering performance for a certain domain. In this article, we conduct a comparative study on how different term semantic similarity measures including path-based, information-content- based and feature-based similarity measure affect document clustering. Term re-weighting of document vector is an important method to integrate domain ontology to clustering process. In detail, the weight of a term is augmented by the weights of its co-occurred concepts. Spherical k-means are used for evaluate document vector reweighting on two real-world datasets: Disease10 and OHSUMED23. Experimental results on nine different semantic measures have shown that: (1) there is no certain type of similarity measures that significantly outperforms the others; (2) Several similarity measures have rather more stable performance than the others; (3) term re-weighting has positive effects on medical document clustering, but might not be significant when documents are short of terms.


2020 ◽  
pp. 1-10
Author(s):  
Andrés Rosso-Mateus ◽  
Manuel Montes-y-Gómez ◽  
Paolo Rosso ◽  
Fabio A. González

2012 ◽  
Vol 2012 ◽  
pp. 1-17 ◽  
Author(s):  
Gaston K. Mazandu ◽  
Nicola J. Mulder

The wide coverage and biological relevance of the Gene Ontology (GO), confirmed through its successful use in protein function prediction, have led to the growth in its popularity. In order to exploit the extent of biological knowledge that GO offers in describing genes or groups of genes, there is a need for an efficient, scalable similarity measure for GO terms and GO-annotated proteins. While several GO similarity measures exist, none adequately addresses all issues surrounding the design and usage of the ontology. We introduce a new metric for measuring the distance between two GO terms using the intrinsic topology of the GO-DAG, thus enabling the measurement of functional similarities between proteins based on their GO annotations. We assess the performance of this metric using a ROC analysis on human protein-protein interaction datasets and correlation coefficient analysis on the selected set of protein pairs from the CESSM online tool. This metric achieves good performance compared to the existing annotation-based GO measures. We used this new metric to assess functional similarity between orthologues, and show that it is effective at determining whether orthologues are annotated with similar functions and identifying cases where annotation is inconsistent between orthologues.


2008 ◽  
Vol 4 (1) ◽  
pp. 62-73 ◽  
Author(s):  
Xiaodan Zhang ◽  
Liping Jing ◽  
Xiaohua Hu ◽  
Michael Ng ◽  
Jiali Xia Jiangxi ◽  
...  

The similarity between two synsets or concepts is a numeral measure of the degree to which the two objects are alike or not and the similarity measures say the degree of closeness between two synsets or concepts. The similarity or dissimilarity represented by the term proximity. Proximity measures are defined to have values in the interval [0, 1]. Term Similarity, Sentence similarity and Document similarity are the areas of text similarity. Term similarity measures used to measure the similarity between individual tokens and words, Sentence similarity is the similarity between two or more sentences and Document similarity used to measure the similarity between two or more corpora. This paper is the study between Knowledge based, Distribution based and prediction based semantic models and shows how knowledge based methods capturing information and prediction based methods preserving semantic information.


Sign in / Sign up

Export Citation Format

Share Document