A Document Clustering Approach Using Shared Nearest Neighbour Affinity, TF-IDF and Angular Similarity

In this article, we present a graph-based knowledge representation for clustering biomedical digital library literature. We develop an efficient clustering method that identifies the ontology-enriched k highest-density term subgraphs, which capture the core semantic relationship information about each document cluster. The distance between each document and each of the k term graph clusters is calculated, and the document is then assigned to the closest term cluster. Extensive experimental results on two PubMed document sets (Disease10 and OHSUMED23) show that our approach is comparable to spherical k-means. The contributions of our approach are the following: (1) we provide two corpus-level graph representations to improve document clustering, a term co-occurrence graph and an abstract-title graph; (2) we develop an efficient and effective document clustering algorithm that identifies k distinguishable class-specific core term subgraphs using the terms' global and local importance information; and (3) the identified term clusters give a meaningful explanation for the document clustering results.


A Novel Ant-Based Clustering Approach for Document Clustering

Information Retrieval Technology - Lecture Notes in Computer Science ◽ 10.1007/11880592_43 ◽ 2006 ◽ pp. 537-544

Cited By ~ 10

Author(s): Yulan He ◽ Siu Cheung Hui ◽ Yongxiang Sim

Keyword(s): Document Clustering ◽ Clustering Approach


Document Clustering Approach for Meta Search Engine

IOP Conference Series Materials Science and Engineering ◽ 10.1088/1757-899x/225/1/012291 ◽ 2017 ◽ Vol 225 ◽ pp. 012291

Cited By ~ 2

Author(s): Naresh Kumar

Keyword(s): Search Engine ◽ Document Clustering ◽ Meta Search Engine ◽ Clustering Approach ◽ Meta Search


Data Mining K-Means Document Clustering using TFIDF and Word Frequency Count

International Journal of Recent Technology and Engineering - 2 ◽ 10.35940/ijrte.b1718.078219 ◽ 2019 ◽ Vol 8 (2) ◽ pp. 2542-2549

Keyword(s): Data Mining ◽ Feature Vector ◽ Rapid Development ◽ Document Clustering ◽ Similarity Measures ◽ Second Step ◽ Text Data ◽ Text Document ◽ Proposed Model ◽ Clustering Approach

With the rapid development of the World Wide Web, the number of documents in use is increasing quickly, producing gigabytes of text to process. Indexing and retrieving the required text documents calls for efficient algorithms that achieve good accuracy. The algorithms available in the field of data mining also continue to produce new innovations, which has increased researchers' interest in developing models for text data mining. The proposed model is a two-step text document clustering approach based on the K-Means algorithm. The first step is pre-processing and the second step is clustering. For pre-processing, the method tokenizes the text. The distinct words are identified, and their frequencies of occurrence and TF-IDF weights are calculated separately to form the document feature vectors. In the clustering phase, the feature vectors are clustered with the K-Means algorithm under various similarity measures.
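The two steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the regular-expression tokenizer, the smoothed IDF formula, the choice of cosine similarity as the single similarity measure, and the names `tokenize`, `tfidf_vectors`, `cosine`, `centroid`, `kmeans` and the `seeds` parameter are all assumptions made for illustration. Because documents are assigned by cosine similarity, the loop behaves like a spherical variant of K-Means.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Step 1 (pre-processing): lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse {term: weight} TF-IDF feature vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each distinct word occurs.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vectors.append({
            t: (f / total) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, f in c.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * \
        math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, k, seeds=None, iterations=20, rng_seed=0):
    """Step 2 (clustering): assign each vector to its most similar centroid."""
    if seeds is None:
        # Hypothetical default: pick k distinct documents as initial centroids.
        seeds = random.Random(rng_seed).sample(range(len(vectors)), k)
    centroids = [dict(vectors[i]) for i in seeds]
    labels = None
    for _ in range(iterations):
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = centroid(members)
    return labels


docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "bonds yield less than stocks",
]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2, seeds=[0, 2])
print(labels)
```

Swapping `cosine` for another measure, or replacing the TF-IDF weights with raw word counts, would be one way to compare the feature vectors and similarity measures the abstract mentions.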
