Web Page Clustering using Heuristic Search in the Web Graph

Author(s):  
Ron Bekkerman ◽  
Shlomo Zilberstein ◽  
James Allan
Information ◽  
2018 ◽  
Vol 9 (9) ◽  
pp. 228 ◽  
Author(s):  
Zuping Zhang ◽  
Jing Zhao ◽  
Xiping Yan

Web page clustering is an important technology for sorting network resources. By extracting features from Web pages and clustering the pages by similarity, a large amount of Web page information can be organized effectively. In this paper, after describing the extraction of Web feature words, methods for calculating the weights of feature words are studied in depth. Taking Web pages as objects and Web feature words as attributes, a formal context is constructed for formal concept analysis. An algorithm for constructing a concept lattice based on cross data links is proposed and successfully applied. This method can be used to cluster Web pages using the concept lattice hierarchy. Experimental results indicate that the proposed algorithm outperforms previous competitors in both time consumption and clustering quality.
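
The formal context described in this abstract (Web pages as objects, feature words as attributes) can be illustrated with a minimal sketch. This is not the paper's cross-data-link algorithm: it enumerates formal concepts naively by closing every attribute subset, which is only practical for toy inputs. The page names and feature words are hypothetical.

```python
from itertools import combinations

# Hypothetical formal context: pages (objects) -> feature words (attributes).
context = {
    "page1": {"python", "tutorial"},
    "page2": {"python", "news"},
    "page3": {"python", "tutorial", "video"},
}

def extent(attrs, ctx):
    """Objects that carry every attribute in attrs."""
    return frozenset(o for o, a in ctx.items() if attrs <= a)

def intent(objs, ctx):
    """Attributes shared by every object in objs."""
    all_attrs = set().union(*ctx.values())
    return frozenset(a for a in all_attrs if all(a in ctx[o] for o in objs))

def concepts(ctx):
    """Naive enumeration of (extent, intent) pairs; exponential in the attribute count."""
    all_attrs = sorted(set().union(*ctx.values()))
    found = set()
    for r in range(len(all_attrs) + 1):
        for combo in combinations(all_attrs, r):
            ext = extent(set(combo), ctx)
            found.add((ext, intent(ext, ctx)))
    return found

if __name__ == "__main__":
    # Print concepts from the largest group of pages to the smallest.
    for ext, itt in sorted(concepts(context), key=lambda c: -len(c[0])):
        print(sorted(ext), sorted(itt))
```

Each concept pairs a group of pages with the feature words they all share, so the concepts ordered by extent size form the hierarchy that a lattice-based clustering would read clusters from.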


Author(s):  
Ben Choi

Web mining aims at searching, organizing, and extracting information on the Web, whereas search engines focus on searching. The next stage of Web mining is the organization of Web contents, which will in turn facilitate the extraction of useful information from the Web. This chapter focuses on organizing Web contents. Since the majority of Web contents are stored in the form of Web pages, the chapter concentrates on techniques for automatically organizing Web pages into categories. Various artificial intelligence techniques have been used; the most successful are classification and clustering, and this chapter focuses on clustering. Clustering is well suited to Web mining because it automatically organizes Web pages into categories, each of which contains Web pages with similar contents. However, one problem in clustering is the lack of general methods for automatically determining the number of categories or clusters. For the Web domain, no such method suitable for Web page clustering has existed until now. To address this problem, this chapter describes a method to discover a constant factor that characterizes the Web domain and proposes a new method for automatically determining the number of clusters in Web page datasets. The chapter also proposes a new bi-directional hierarchical clustering algorithm, which arranges individual Web pages into clusters and then arranges the clusters into larger clusters, and so on, until the average inter-cluster similarity approaches the constant factor. Together, the constant factor and the algorithm provide a new clustering system suitable for mining the Web.
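
The stopping rule described in this abstract, merging until inter-cluster similarity reaches a fixed value, can be sketched roughly as follows. This is ordinary bottom-up average-linkage clustering with a similarity threshold, not the chapter's bi-directional algorithm, and the threshold of 0.5 is a hypothetical stand-in for the constant factor, whose value the abstract does not give. The page vectors are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_similarity(c1, c2, vectors):
    """Average pairwise similarity between the members of two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def cluster(vectors, threshold):
    """Merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = avg_similarity(clusters[a], clusters[b], vectors)
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

if __name__ == "__main__":
    # Hypothetical term-weight vectors for four pages.
    pages = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(cluster(pages, threshold=0.5))
```

With a threshold of this kind the number of clusters is not chosen in advance; it falls out of where the merging stops.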


2014 ◽  
Vol 667 ◽  
pp. 277-285 ◽  
Author(s):  
Fang Chen ◽  
Yan Hui Zhou

With the rapid development of the Internet, tag technology has been widely used on various sites. Brief text labels on network resources make it much more convenient for people to access massive amounts of data. Social tagging allows users to tag network objects with any word and to share these tags; because it is simple and flexible to use, it has become a popular application. However, problems remain, such as noisy tags, the lack of usage criteria, and sparse distribution. The sparsity of tags in particular seriously limits their application in the semantic analysis of web pages. This paper uses a user-related tag expansion method to overcome this problem and, at the same time, uses the topic model LDA to model web tags, mine the latent topics of large-scale web pages, and obtain the topic distribution of the text for text clustering analysis. The experimental results show that, compared with traditional clustering algorithms, the LDA-based clustering of web tags achieves a marked improvement.


2018 ◽  
Vol 7 (4.10) ◽  
pp. 566
Author(s):  
B Jaganathan ◽  
Kalyani Desikan

In today's era of computer technology, users want not only the most relevant data but also want it as quickly as possible. Hence, ranking web pages becomes a crucial task. The purpose of this research is to find a centrality measure that can be used in place of the original PageRank. In this article, the concept of a Laplacian centrality measure for directed web graphs is introduced to identify web page ranks. The original PageRank and the Laplacian centrality based page rank are compared. Kendall's τ correlation coefficient is used to measure the correlation between the original PageRank and the Laplacian centrality based page rank.
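
The comparison described in this abstract, two rankings of the same pages scored against each other with Kendall's correlation coefficient, can be sketched as below. The sketch makes several assumptions: PageRank is computed by plain power iteration with a damping factor of 0.85, the tau-a variant is used (ties contribute nothing), and in-degree serves as a stand-in for the second score. It is not the Laplacian centrality from the article, whose definition for directed graphs is not given here. The toy graph is hypothetical.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power-iteration PageRank; links maps each node to the nodes it points to."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links[v] or nodes  # a dangling node spreads its rank evenly
            share = damping * rank[v] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

if __name__ == "__main__":
    # Hypothetical directed web graph.
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    pr = pagerank(links)
    indeg = {v: sum(v in out for out in links.values()) for v in links}
    nodes = sorted(links)
    print(kendall_tau([pr[v] for v in nodes], [indeg[v] for v in nodes]))
```

A value near 1 would mean the alternative score orders pages almost exactly as PageRank does, which is the kind of evidence the abstract describes for substituting one measure for the other.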


2011 ◽  
Vol 219-220 ◽  
pp. 98-102
Author(s):  
Kai Xi Xie ◽  
Ting Gui Chen

In this paper, we combine web mining and fuzzy clustering and present the concept of a web fuzzy clustering processing model and its application. We also briefly introduce the web fuzzy direct clustering method. Web fuzzy clustering can be used for web user clustering and web page clustering in web usage mining.


2020 ◽  
pp. 151-156
Author(s):  
A. P. Korablev ◽  
N. S. Liksakova ◽  
D. M. Mirin ◽  
D. G. Oreshkin ◽  
P. G. Efimov

A new species list of plants and lichens of Russia and neighboring countries has been developed for Turboveg for Windows, a program for the storage and management of phytosociological data (relevés) that is widely used all around the world (Hennekens, Schaminée, 2001; Hennekens, 2015). The species list is built upon the database of the Russian website Plantarium (Plantarium…: [site]), which contains a species atlas and an illustrated online handbook of plants and lichens. The nomenclature used on Plantarium was originally based on the following sources: vascular plants: S. K. Cherepanov (1995) with additions; mosses: «Flora of mosses of Russia» (Proect...: [site]); liverworts and hornworts: A. D. Potemkin and E. V. Sofronova (2009); lichens: «Spisok…» G. P. Urbanavichyus ed. (2010); other sources (Plantarium...: [site]). The new species list, currently the most comprehensive in Turboveg format for Russia, has 89 501 entries, including 4627 genus taxa, compared with the old one, which had 32 020 entries (taxa) and only 253 synonyms. There are 84 805 species and subspecies taxa in the list, 37 760 (44.7 %) of which are accepted, while the others are synonyms. Their distribution by groups of organisms and divisions is shown in the Table. The large number of synonyms in the new list and its adaptation to work with the Russian literature will greatly facilitate the entry of old relevé data. The way the new list was compiled, its structure, and the possibilities for checking taxonomic lists against Internet resources are considered. The files of the species list for Turboveg 2 and Turboveg 3 and the technique for associating existing databases with the new species list (in Russian) are available on the web page https://www.binran.ru/resursy/informatsionnyye-resursy/tekuschie-proekty/species_list_russia/.


2010 ◽  
Vol 30 (3) ◽  
pp. 818-820
Author(s):  
Rui LI ◽  
Jun-yu ZENG ◽  
Si-wang ZHOU

2009 ◽  
Author(s):  
Mirko Luca Lobina ◽  
Davide Mula
Keyword(s):  
Web Page ◽  

2021 ◽  
Vol 13 (2) ◽  
pp. 50
Author(s):  
Hamed Z. Jahromi ◽  
Declan Delaney ◽  
Andrew Hines

Content is a key influencing factor in Web Quality of Experience (QoE) estimation. A web user’s satisfaction can be influenced by how long it takes to render and visualize the visible parts of the web page in the browser. This is referred to as the Above-the-fold (ATF) time. SpeedIndex (SI) has been widely used to estimate the perceived loading speed of ATF content and as a proxy metric for Web QoE estimation. Web application developers have been actively introducing innovative interactive features, such as animated and multimedia content, aiming to capture users’ attention and improve the functionality and utility of web applications. However, the literature shows that, for websites with animated content, the ATF time estimated using state-of-the-art metrics may not accurately match the completed ATF time as perceived by users. This study introduces a new metric, Plausibly Complete Time (PCT), that estimates ATF time as perceived by users for websites with and without animations. PCT can be integrated with SI and web QoE models. The accuracy of the proposed metric is evaluated on two publicly available datasets. The proposed metric shows a high positive Spearman’s correlation (rs=0.89) with the perceived ATF time reported by users for websites with and without animated content. This study demonstrates that using PCT as a KPI in QoE estimation models can improve the robustness of QoE estimation in comparison to using the state-of-the-art ATF time metric. Furthermore, experimental results show that estimating SI using PCT improves the robustness of SI for websites with animated content. PCT estimation allows web application designers to identify where poor design has significantly increased ATF time and to refactor their implementation before it impacts end-user experience.

