web document clustering
Recently Published Documents


Total documents: 79 (five years: 3)
H-index: 11 (five years: 0)

With the rapid growth of web documents on the WWW, it is becoming difficult to organize, analyze and present these documents efficiently. Web search engines return many documents for a given query, some relevant to the topic and some not. Web search is usually performed using only features extracted from the web page text. HTML tags with particular meanings have been found to improve the efficiency of information retrieval systems. However, organizing documents in a way that improves search without additional cost or complexity remains a great challenge. Clustering can play an important role in organizing such a large number of documents into several groups. However, due to limitations in existing clustering techniques, researchers have begun using metaheuristic algorithms for the document clustering problem. In this paper, we present a document clustering method that uses HTML tags and metaheuristic approaches. A hybrid PSO+ACO+K-means algorithm is used to cluster the documents. The results of the proposed approach are analyzed on the WebKB dataset.
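To illustrate how HTML tags might feed into a document representation before clustering, here is a minimal sketch of tag-weighted term frequencies. It is not the method from the abstract: the tag weights, the `TagWeightedCounter` class and the `tag_weighted_tf` function are hypothetical names and values chosen for illustration, and the PSO+ACO+K-means clustering step itself is not shown.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Hypothetical tag weights: illustrative values, not taken from the paper.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}
DEFAULT_WEIGHT = 1.0


class TagWeightedCounter(HTMLParser):
    """Accumulates term counts, scaled by the weight of the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.counts = Counter()  # term -> weighted frequency

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the heaviest weight among all enclosing tags.
        weight = max(
            [TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack],
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.counts[term] += weight


def tag_weighted_tf(html):
    """Return a dict mapping each term to its tag-weighted frequency."""
    parser = TagWeightedCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)
```

The resulting dictionaries could then be turned into vectors and passed to any clustering algorithm. Taking the maximum weight over the enclosing tags is one possible design choice; multiplying or summing the weights would be equally plausible.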


The problem of web document clustering has been well studied. Web documents have been grouped based on various features, such as textual, topical and semantic features. A number of approaches have been discussed earlier for the clustering of web documents. However, these methods do not produce promising results for web document clustering. To overcome this, an efficient hierarchical semantic relational coverage based approach is presented in this paper. The method extracts the features of a web document by preprocessing it. The extracted features are used to compute the semantic relational coverage measure at different levels. As the documents are grouped in a hierarchical manner, the method estimates the relational coverage measure at each level of the cluster hierarchy. Based on the semantic relational measure at different levels, the method estimates the topical semantic support measure. Using these two measures, the method computes the class weight. The estimated class weight is used to perform document clustering. The proposed method improves the performance of document clustering and reduces the false classification ratio.


2017, Vol 16 (01), pp. 1750004
Author(s): Hanan Al-Mofareji, Mahmoud Kamel, Mohamed Y. Dahab

Organizing web information is an important part of finding information in the easiest and most efficient way. We present a new method for web document clustering called WeDoCWT, which exploits the discrete wavelet transform and term signals to improve the document representation. We studied different methods of document segmentation for constructing the term signals. We used two datasets, UW-CAN and WebKB, to evaluate the proposed method. The experimental results indicated that dividing the documents into fixed segments is preferable to dividing them into logical segments based on HTML features, because web pages do not share the same structure. The mean TF-IDF reduction technique gives the best results in most cases. WeDoCWT achieves a better F-measure than most of the previous approaches described in the literature. We used the Munkres assignment algorithm to assign each produced cluster to an original class in order to evaluate the clustering results.
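The idea of a term signal over fixed segments, followed by a discrete wavelet transform, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the WeDoCWT implementation: the abstract does not name a wavelet family, so the Haar wavelet is assumed here, and `term_signal`, `haar_dwt` and the `num_segments` parameter are hypothetical names.

```python
import math


def term_signal(tokens, term, num_segments=8):
    """Count occurrences of `term` in each of `num_segments` fixed-size segments."""
    signal = [0] * num_segments
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok == term:
            # Map token position i to its segment index in [0, num_segments).
            signal[i * num_segments // n] += 1
    return signal


def haar_dwt(signal):
    """Full orthonormal Haar decomposition; len(signal) must be a power of two.

    Returns [approximation, coarsest detail, ..., finest details].
    """
    coeffs = []
    approx = [float(x) for x in signal]
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        coeffs = detail + coeffs  # coarser levels go in front
    return approx + coeffs
```

For example, `haar_dwt(term_signal(tokens, "web", num_segments=4))` yields one coefficient vector per term, capturing both how often the term occurs and where in the document it is concentrated.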

