Data Mining K-Means Document Clustering using TFIDF and Word Frequency Count

With the rapid development of the World Wide Web, the number of documents in use is growing quickly, which means gigabytes of text must be processed. Efficient algorithms are needed to index and retrieve the required text documents with good accuracy. Algorithms from the field of data mining also offer a variety of innovations, and this has increased researchers' interest in developing models for text data mining. The proposed model is a two-step text document clustering approach based on the K-Means algorithm. The first step is pre-processing and the second is clustering. In pre-processing, the method tokenizes each document, identifies the distinct words, and calculates both their frequency of occurrence and their TF-IDF weights, so that a document feature vector is formed separately for each weighting. In the clustering phase, the feature vectors are clustered with the K-Means algorithm using various similarity measures.
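The two steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the TF-IDF formula (raw count times log(N / df)), the choice of the first k documents as initial centroids, and the use of cosine similarity as the default measure are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vectors(docs, use_tfidf=True):
    """Return (vocabulary, vectors) with one feature vector per document."""
    token_lists = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each distinct word.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        if use_tfidf:
            # Assumed weighting: raw term count times log(N / df).
            vec = [counts[t] * math.log(n_docs / df[t]) for t in vocab]
        else:
            # Plain word frequency count.
            vec = [float(counts[t]) for t in vocab]
        vectors.append(vec)
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def kmeans(vectors, k, similarity=cosine, iterations=20):
    """K-means that assigns each vector to its most similar centroid."""
    # Assumed initialization: the first k documents act as centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            max(range(k), key=lambda c: similarity(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus (hypothetical data, for illustration only).
docs = [
    "cats and dogs are pets",
    "stocks and bonds are investments",
    "dogs and cats chase mice",
    "bonds and stocks carry risk",
]
vocab, vectors = build_vectors(docs, use_tfidf=True)
labels = kmeans(vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

Passing `use_tfidf=False` builds the word frequency vectors instead, and a different function can be supplied as `similarity` to compare measures on the same feature vectors.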

Interactive Document Clustering System Based on Coordinated Multiple Views

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2016.p0139 ◽

2016 ◽

Vol 20 (1) ◽

pp. 139-145 ◽

Cited By ~ 3

Author(s):

Yasufumi Takama ◽

Takuma Tonegawa

Keyword(s):

Data Mining ◽

Document Clustering ◽

Document Retrieval ◽

Prototype System ◽

Multiple Views ◽

Multiple Objects ◽

Text Data ◽

Text Data Mining ◽

Specific Object ◽

Document Collection

This paper proposes an interactive document clustering system designed around the concept of CMV (coordinated multiple views). Interactive document clustering lets a user obtain a set of document groups from a document collection in an interactive manner. It is expected to be useful for tasks such as text mining and document retrieval. Because the result of document clustering consists of multiple kinds of objects, such as clusters (document groups), documents, and words, each of them should be presented to users in a different way. Based on this consideration, the proposed system employs multiple views, each designed for a specific object such as a document or a keyword. A prototype system is implemented on TETDM (Total Environment for Text Data Mining), an environment for developing text data mining tools, which was chosen because it provides a mechanism for coordination between modules. The proposed system classifies the information to be presented into four levels: clusters, document, bag of words, and word, each of which is displayed in a different view. Experimental results with test participants show the effectiveness of the proposed system.

Efficient text document clustering with new similarity measures

International Journal of Business Intelligence and Data Mining ◽

10.1504/ijbidm.2021.111741 ◽

2021 ◽

Vol 18 (1) ◽

pp. 49

Author(s):

R. Lakshmi ◽

S. Baskar

Keyword(s):

Document Clustering ◽

Similarity Measures ◽

Text Document

An Improved Document Clustering Approach with Multi-Viewpoint Based on Different Similarity Measures

2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS) ◽

10.1109/iccons.2018.8663209 ◽

2018 ◽

Author(s):

Aniali Gunta ◽

Rahul Dubey

Keyword(s):

Document Clustering ◽

Similarity Measures ◽

Clustering Approach

Text Document Clustering using K-Means and Dbscan by using Machine Learning

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a2040.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 6327-6330

Keyword(s):

Machine Learning ◽

Social Networking ◽

Social Networking Sites ◽

Document Clustering ◽

Similar Type ◽

Data Sets ◽

Text Data ◽

Text Document ◽

Self Organized ◽

Density Based Clustering

With the growth of today's world, the volume of text data is also increasing, created by media such as social networking sites, the web, and other information sources. Clustering is an important part of data mining. It is the procedure of dividing a large body of text so that texts of a similar type fall into the same group. Clustering is used in many applications, such as medicine, biology, and signal processing. Traditional clustering algorithms include hierarchical clustering, density-based clustering, and self-organizing map clustering. Using K-means and DBSCAN, we are able to cluster documents. DBSCAN, as one clustering method, is compared against a number of standards. The data sets are evaluated automatically with DBSCAN and K-means, which shows the clustering power of each on the data. A document may consist of multiple topics. Document clustering relies on descriptors and their extraction. Descriptors are the expressions used to describe the content inside a cluster.
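For readers unfamiliar with the density-based method named here, the following is a minimal sketch of the standard DBSCAN procedure, not the paper's code. The toy 2-D points stand in for document feature vectors, and the `eps` and `min_pts` values and the Euclidean distance are assumptions chosen for the example. Unlike K-means, DBSCAN does not take the number of clusters as input and can label isolated points as noise.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```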

Hybrid Fruit-Fly Optimization Algorithm with K-Means for Text Document Clustering

Mathematics ◽

10.3390/math9161929 ◽

2021 ◽

Vol 9 (16) ◽

pp. 1929

Author(s):

Timea Bezdan ◽

Catalin Stoean ◽

Ahmed Al Naamany ◽

Nebojsa Bacanin ◽

Tarik A. Rashid ◽

...

Keyword(s):

Optimization Algorithm ◽

Document Clustering ◽

Fruit Fly ◽

Text Clustering ◽

Relevant Information ◽

Fruit Fly Optimization Algorithm ◽

Hybrid Swarm ◽

Text Data ◽

Fruit Fly Optimization ◽

Text Document

The fast-growing Internet results in massive amounts of text data. Due to the large volume of the unstructured format of text data, extracting relevant information and its analysis becomes very challenging. Text document clustering is a text-mining process that partitions the set of text-based documents into mutually exclusive clusters in such a way that documents within the same group are similar to each other, while documents from different clusters differ based on the content. One of the biggest challenges in text clustering is partitioning the collection of text data by measuring the relevance of the content in the documents. Addressing this issue, in this work a hybrid swarm intelligence algorithm with a K-means algorithm is proposed for text clustering. First, the hybrid fruit-fly optimization algorithm is tested on ten unconstrained CEC2019 benchmark functions. Next, the proposed method is evaluated on six standard benchmark text datasets. The experimental evaluation on the unconstrained functions, as well as on text-based documents, indicated that the proposed approach is robust and superior to other state-of-the-art methods.

Modified Cosine Similarity Measure based Data Classification in Data Mining

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e9754.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 649-654

Keyword(s):

Machine Learning ◽

Data Mining ◽

Similarity Measure ◽

Dominant Role ◽

Similarity Measures ◽

Data Classification ◽

Cosine Similarity ◽

Machine Learning Techniques ◽

Text Data ◽

Cosine Similarity Measure

Text data analytics has become an integral part of World Wide Web data management and of the Internet-based applications that are growing rapidly all over the world. E-commerce applications are growing exponentially in business, and competitors in e-commerce are increasingly using machine learning techniques to predict business-related operations, with the aim of increasing product sales as much as possible. The use of similarity measures is unavoidable in modern day-to-day applications. Cosine similarity plays a dominant role in text data mining applications such as text classification, clustering, querying, and searching. This paper proposes a modified clustering-based cosine similarity measure, called MCS, for data classification. The method is verified experimentally on many UCI machine learning datasets involving categorical attributes. It produces more accurate classification results in the majority of the experiments conducted on those datasets.
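For reference, the standard cosine similarity that measures of this kind build on is given below. The abstract does not define the MCS modification, so only the unmodified measure is shown; the symbols a and b and the small worked example are illustrative assumptions.

```latex
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
\]

% Worked example with a = (1, 0) and b = (1, 1):
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{1 \cdot 1 + 0 \cdot 1}{\sqrt{1} \, \sqrt{2}}
  = \frac{1}{\sqrt{2}} \approx 0.707
\]
```

A value of 1 means the two vectors point in the same direction, and 0 means they share no weighted terms.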
