Semantic Based Weighted Web Session Clustering Using Adapted K-Means and Hierarchical Agglomerative Algorithms

Journal of Web Engineering ◽

10.13052/jwe1540-9589.2125 ◽

2022 ◽

Author(s):

Sowmya HK ◽

R. J. Anandhi

Keyword(s):

Clustering Algorithms ◽

Threshold Value ◽

Semantic Distance ◽

Web Usage Mining ◽

Identification Algorithm ◽

Agglomerative Clustering ◽

Dissimilarity Matrix ◽

Identification Methods ◽

Web Usage ◽

Stay Time

The World Wide Web hosts a vast number of pages and URLs that supply users with a great amount of content. In an era of rapidly growing information, analysing users' browsing behaviour is an important task. Web usage mining techniques are applied to web server logs to analyse user behaviour. Identifying user sessions is one of the key and most demanding tasks in the pre-processing stage of web usage mining. This paper highlights two important shortcomings of existing session identification methods such as Time based and Referrer based sessionization. The first concerns the comparison of the current request's referrer field with the URL of the previous request. The second concerns session creation: requests are split into new sessions or merged into a single session depending on the threshold values for page stay time and session time. The authors therefore developed an enhanced semantic distance based session identification algorithm that tackles these issues of traditional session identification methods. The enhanced semantic based method has an accuracy of 84 percent, which is higher than that of the Time based and Time-Referrer based session identification approaches. The authors also used adapted K-Means and Hierarchical Agglomerative clustering algorithms to improve the prediction of user browsing patterns. Clusters were found using a weighted dissimilarity matrix, which is calculated from two key parameters: page weight and session weight. The Dunn Index and Davies-Bouldin Index are then used to evaluate the clusters. Experimental results show that purer and more accurate session clusters are formed when the adapted clustering algorithms are applied to the weighted sessions rather than to the sessions obtained from traditional sessionization algorithms. The accuracy of the semantic session clusters is also higher than that of clusters of sessions obtained using traditional sessionization.
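To make the threshold issue concrete, the following is a minimal sketch of the traditional Time based sessionization that the abstract treats as a baseline, not the authors' semantic algorithm. The function name, the `(timestamp, url)` record shape, and the default limits of 600 and 1800 seconds are illustrative assumptions; the abstract does not state which threshold values were used.

```python
from typing import List, Tuple

# Hypothetical record shape: (timestamp in seconds, requested URL).
Request = Tuple[float, str]


def time_based_sessions(
    requests: List[Request],
    page_stay_limit: float = 600.0,   # assumed page stay threshold
    session_limit: float = 1800.0,    # assumed total session threshold
) -> List[List[Request]]:
    """Split one user's requests into sessions using two time thresholds."""
    sessions: List[List[Request]] = []
    current: List[Request] = []
    for ts, url in sorted(requests):
        if current:
            gap = ts - current[-1][0]       # time spent on the previous page
            duration = ts - current[0][0]   # time since the session started
            # Exceeding either threshold closes the current session.
            if gap > page_stay_limit or duration > session_limit:
                sessions.append(current)
                current = []
        current.append((ts, url))
    if current:
        sessions.append(current)
    return sessions


log = [(0.0, "/home"), (100.0, "/news"), (1000.0, "/shop"), (1100.0, "/cart")]
sessions = time_based_sessions(log)
```

In this sketch the 900 second gap between `/news` and `/shop` exceeds the assumed page stay limit, so the log is cut into two sessions even if the user was still engaged in the same task. That sensitivity to the chosen thresholds is the kind of problem the abstract describes.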


A MapReduce-Based User Identification Algorithm in Web Usage Mining

International Journal of Information Technology and Web Engineering ◽

10.4018/ijitwe.2018040102 ◽

2018 ◽

Vol 13 (2) ◽

pp. 11-23 ◽

Cited By ~ 1

Author(s):

Mitali Srivastava ◽

Rakhi Garg ◽

P.K. Mishra

Keyword(s):

Web Usage Mining ◽

Identification Algorithm ◽

Proxy Server ◽

User Identification ◽

Ip Address ◽

Identification Methods ◽

The Third ◽

Web Usage ◽

Challenging Tasks ◽

Effectiveness And Efficiency

This article contends that in the booming era of information, analysing users' navigation behaviour is an important task. User identification is considered one of the important and challenging tasks in the data preprocessing phase of the Web usage mining process. Three important issues with the reactive strategies of User identification methods need attention: the first is handling the shared IP address problem in a proxy server environment, the second is distinguishing users from Web robots, and the third is processing huge datasets efficiently. In this article, the authors develop a MapReduce-based User identification algorithm that addresses these three issues. Moreover, an experiment on a real web server log shows the effectiveness and efficiency of the developed algorithm.
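The general shape of a MapReduce-style user identification pass can be sketched in plain Python as below. This is an illustration under stated assumptions, not the article's algorithm: the abstract does not say which heuristics are used, so keying users on the (IP address, user agent) pair, flagging robots by a `/robots.txt` request or a marker in the agent string, and all function names here are hypothetical choices. The map, shuffle, and reduce phases are simulated in memory rather than run on a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical log record: (ip address, user agent, requested URL).
Record = Tuple[str, str, str]
Key = Tuple[str, str]

# Assumed substrings that suggest an automated client.
ROBOT_MARKERS = ("bot", "crawler", "spider")


def map_phase(record: Record) -> Iterator[Tuple[Key, str]]:
    """Emit ((ip, agent), url) so users behind one proxy IP are separated."""
    ip, agent, url = record
    yield (ip, agent), url


def reduce_phase(key: Key, urls: List[str]) -> Dict[str, object]:
    """Summarise one candidate user and flag likely Web robots."""
    ip, agent = key
    is_robot = "/robots.txt" in urls or any(
        marker in agent.lower() for marker in ROBOT_MARKERS
    )
    return {"ip": ip, "agent": agent, "requests": len(urls), "robot": is_robot}


def identify_users(records: List[Record]) -> List[Dict[str, object]]:
    """Simulate map, shuffle (group by key), and reduce in memory."""
    groups: Dict[Key, List[str]] = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return [reduce_phase(key, urls) for key, urls in sorted(groups.items())]


records = [
    ("1.1.1.1", "Mozilla", "/a"),
    ("1.1.1.1", "Chrome", "/b"),
    ("2.2.2.2", "Googlebot", "/robots.txt"),
    ("1.1.1.1", "Mozilla", "/c"),
]
users = identify_users(records)
humans = [u for u in users if not u["robot"]]
```

Because each key is processed independently in the reduce step, the same logic could in principle be distributed across workers, which is what makes the MapReduce model attractive for very large logs.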


Research on Improved Clustering Algorithm on Web Usage Mining Based on Scientific Analysis of Web Materials

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.63-64.863 ◽

2011 ◽

Vol 63-64 ◽

pp. 863-867 ◽

Cited By ~ 1

Author(s):

Bin Li ◽

Jin Yang ◽

Cai Ming Liu ◽

Jian Dong Zhang ◽

Yan Zhang

Keyword(s):

Clustering Algorithm ◽

Hamming Distance ◽

Clustering Algorithms ◽

Distance Matrix ◽

Threshold Value ◽

Web Usage Mining ◽

Web Usage ◽

User Clustering ◽

Similar Index ◽

Browsing Behavior

Clustering analysis is an important method for studying Web users' browsing behavior and identifying potential customers in Web usage mining. The traditional user clustering algorithms are not very accurate. In this paper, we give two improved user clustering algorithms, which are based on the associated matrix of users' hits while browsing a website. From this matrix, an improved Hamming distance matrix is generated by defining the minimum norm or the generalized relative Hamming distance between any two vectors. Clusters of similar users are then obtained by setting a threshold value. In the last step of our algorithm, the clustering results are confirmed by defining the clustering's Similar Index and setting up a sub-algorithm. Finally, test examples show that the new algorithms are more accurate than the old one, and real log data shows that the improved algorithms are practical.
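The distance-then-threshold idea can be illustrated with a short sketch. It uses the ordinary relative Hamming distance (the fraction of positions in which two hit vectors differ), not the paper's improved or generalized variant, whose definition is not given in the abstract. Grouping users by connected components over the thresholded distances is also an assumption made for illustration, as are the function names and the sample data.

```python
from typing import List


def relative_hamming(u: List[int], v: List[int]) -> float:
    """Fraction of pages on which two users' hit vectors disagree."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def threshold_clusters(hits: List[List[int]], threshold: float) -> List[int]:
    """Label users so that any two within `threshold` share a cluster.

    Clusters are the connected components of the graph that links
    users whose relative Hamming distance is at most `threshold`.
    """
    n = len(hits)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:  # depth-first search over nearby users
            i = stack.pop()
            for j in range(n):
                close = relative_hamming(hits[i], hits[j]) <= threshold
                if labels[j] == -1 and close:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Rows are users, columns are pages; 1 means the user hit the page.
hits = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
labels = threshold_clusters(hits, threshold=0.25)
```

With this sample matrix and a threshold of 0.25, the first two users differ on one page out of four and fall into one cluster, as do the last two. Raising the threshold merges clusters, which is why the choice of threshold value matters in this family of methods.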
