SeqPAM

Author(s):  
Pradeep Kumar ◽  
Raju S. Bapi ◽  
P. Radha Krishna

With the growth in the number of web users and the necessity of making information available on the web, the problem of web personalization has become critical and popular. Developers are trying to customize a web site to the needs of specific users with the help of knowledge acquired from user navigational behavior. Since user page visits are intrinsically sequential in nature, efficient clustering algorithms for sequential data are needed. In this paper, we introduce a similarity preserving function called the sequence and set similarity measure (S3M) that captures both the order of occurrence of page visits and the content of pages. We conducted pilot experiments comparing the results of PAM, a standard clustering algorithm, with two similarity measures: Cosine and S3M. The goodness of the clusters resulting from both measures was computed using a cluster validation technique based on average Levenshtein distance. Results on the pilot dataset established the effectiveness of S3M for sequential data. Based on these results, we proposed a new clustering algorithm, SeqPAM, for clustering sequential data. We tested the new algorithm on two datasets, namely the cti and msnbc datasets, and provided recommendations for web personalization based on the clusters obtained from SeqPAM for the msnbc dataset.
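
As a rough illustration of the kind of measure S3M is, the sketch below blends an order-sensitive component (longest-common-subsequence ratio) with a content component (Jaccard coefficient over visited pages) using a weight p. The exact weighting and normalization used in the paper are not reproduced here, and the example sessions are invented.

```python
# A minimal sketch of an S3M-style similarity for page-visit sequences.
# Sequence similarity is the longest-common-subsequence (LCS) ratio and
# set similarity is the Jaccard coefficient, blended by a weight p; the
# paper's exact formulation may differ.

def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def s3m(seq_a, seq_b, p=0.5):
    """Weighted blend of order similarity (LCS ratio) and content similarity (Jaccard)."""
    order_sim = lcs_length(seq_a, seq_b) / max(len(seq_a), len(seq_b))
    set_a, set_b = set(seq_a), set(seq_b)
    content_sim = len(set_a & set_b) / len(set_a | set_b)
    return p * order_sim + (1 - p) * content_sim

# Example: two user sessions over page identifiers
print(s3m(["home", "news", "sports", "weather"],
          ["home", "sports", "news"]))
```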

2008 ◽  
pp. 3285-3307
Author(s):  
Pradeep Kumar ◽  
Raju S. Bapi ◽  
P. Radha Krishna

With the growth in the number of Web users and the necessity of making information available on the Web, the problem of Web personalization has become critical and popular. Developers are trying to customize a Web site to the needs of specific users with the help of knowledge acquired from user navigational behavior. Since user page visits are intrinsically sequential in nature, efficient clustering algorithms for sequential data are needed. In this chapter, we introduce a similarity preserving function called the sequence and set similarity measure (S3M) that captures both the order of occurrence of page visits and the content of pages. We conducted pilot experiments comparing the results of PAM, a standard clustering algorithm, with two similarity measures: Cosine and S3M. The goodness of the clusters resulting from both measures was computed using a cluster validation technique based on average Levenshtein distance. Results on the pilot dataset established the effectiveness of S3M for sequential data. Based on these results, we proposed a new clustering algorithm, SeqPAM, for clustering sequential data. We tested the new algorithm on two datasets, namely the cti and msnbc datasets, and provided recommendations for Web personalization based on the clusters obtained from SeqPAM for the msnbc dataset.
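
The cluster-validation step mentioned above can be pictured with a short sketch: the average pairwise Levenshtein distance within each cluster, where lower values indicate tighter clusters of sessions. The exact validation index used in the chapter may be defined differently, and the toy clusters are illustrative.

```python
# Rough sketch of cluster validation via average intra-cluster Levenshtein
# distance; the chapter's index may be normalised or contrasted against
# inter-cluster distances, which this sketch omits.

from itertools import combinations

def levenshtein(a, b):
    """Classic edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def avg_intra_cluster_distance(clusters):
    """Mean pairwise Levenshtein distance inside each cluster (lower = tighter)."""
    scores = {}
    for label, members in clusters.items():
        pairs = list(combinations(members, 2))
        scores[label] = (sum(levenshtein(a, b) for a, b in pairs) / len(pairs)
                         if pairs else 0.0)
    return scores

clusters = {0: [["home", "news"], ["home", "news", "sports"]],
            1: [["login", "mail"], ["login", "mail", "logout"]]}
print(avg_intra_cluster_distance(clusters))
```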


2010 ◽  
Vol 6 (4) ◽  
pp. 16-32 ◽  
Author(s):  
Pradeep Kumar ◽  
Bapi S. Raju ◽  
P. Radha Krishna

In many data mining applications, both classification and clustering algorithms require a distance/similarity measure. The central problem in similarity-based clustering and classification of sequential data is deciding on an appropriate similarity metric. Existing metrics such as Euclidean, Jaccard, and Cosine do not exploit the sequential nature of the data explicitly. In this paper, the authors propose a similarity preserving function called the Sequence and Set Similarity Measure (S3M) that captures both the order of occurrence of items in sequences and the constituent items of the sequences. The authors demonstrate the usefulness of the proposed measure for classification and clustering tasks. Experiments were conducted on benchmark datasets, namely DARPA’98 and msnbc, for the classification task in intrusion detection and the clustering task in the web mining domain. The results show the usefulness of the proposed measure.
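
To indicate how such a sequence similarity can drive the classification task, the sketch below plugs a stand-in similarity (plain Jaccard over the items of each sequence, not S3M itself) into scikit-learn's k-nearest-neighbour classifier through its precomputed-distance interface. The sequences and labels are invented for illustration.

```python
# Illustrative sketch: custom sequence similarity + kNN via precomputed distances.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def set_similarity(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def distance_matrix(rows, cols):
    """Distance = 1 - similarity, computed between two lists of sequences."""
    return np.array([[1.0 - set_similarity(r, c) for c in cols] for r in rows])

train = [["login", "mail"], ["home", "news"], ["login", "mail", "logout"]]
labels = ["user", "visitor", "user"]
test = [["home", "news", "sports"]]

knn = KNeighborsClassifier(n_neighbors=1, metric="precomputed")
knn.fit(distance_matrix(train, train), labels)    # train-vs-train distances
print(knn.predict(distance_matrix(test, train)))  # test-vs-train distances
```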


Author(s):  
Pradeep Kumar ◽  
Bapi S. Raju ◽  
P. Radha Krishna

In many data mining applications, both classification and clustering algorithms require a distance/similarity measure. The central problem in similarity-based clustering and classification of sequential data is deciding on an appropriate similarity metric. Existing metrics such as Euclidean, Jaccard, and Cosine do not exploit the sequential nature of the data explicitly. In this chapter, the authors propose a similarity preserving function called the Sequence and Set Similarity Measure (S3M) that captures both the order of occurrence of items in sequences and the constituent items of the sequences. The authors demonstrate the usefulness of the proposed measure for classification and clustering tasks. Experiments were conducted on benchmark datasets, namely DARPA’98 and msnbc, for the classification task in intrusion detection and the clustering task in the web mining domain. The results show the usefulness of the proposed measure.
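
For the clustering side, a generic k-medoids loop over a precomputed distance matrix gives the flavour of medoid-based clustering driven by a custom similarity. This is not the SeqPAM algorithm itself, only a simplified sketch, and the toy distance matrix is invented.

```python
# Simplified k-medoids over a precomputed distance matrix (not full PAM/SeqPAM).

import numpy as np

def k_medoids(dist, k, max_iter=100, seed=0):
    """Cluster items given a square distance matrix; returns (medoids, labels)."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(dist[:, medoids], axis=1)             # assignment step
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            costs = dist[np.ix_(members, members)].sum(axis=1)   # total within-cluster distance
            new_medoids[c] = members[np.argmin(costs)]           # best medoid for this cluster
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels

# Toy distance matrix (4 sequences forming two obvious groups)
dist = np.array([[0.0, 0.2, 0.9, 0.8],
                 [0.2, 0.0, 0.85, 0.9],
                 [0.9, 0.85, 0.0, 0.1],
                 [0.8, 0.9, 0.1, 0.0]])
print(k_medoids(dist, k=2))
```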


Author(s):  
Mohana Priya K ◽  
Pooja Ragavi S ◽  
Krishna Priya G

Clustering is the process of grouping objects into subsets that have meaning in the context of a particular problem. It does not rely on predefined classes and is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used in different applications. Sentence clustering is one of the best clustering techniques. A hierarchical clustering algorithm is applied at multiple levels for accuracy. A POS tagger and the Porter stemmer are used for tagging and stemming. The WordNet dictionary is utilized for determining similarity by invoking the Jiang-Conrath and Cosine similarity measures. Grouping is performed with respect to the highest similarity measure value against a mean threshold. This paper incorporates many parameters for finding the similarity between words. In order to identify disambiguated words, sense identification is performed for the adjectives and a comparison is carried out. The SemCor and machine learning datasets are employed. Compared with previous results for WSD, our work shows a substantial improvement, achieving 91.2%.
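
The WordNet-based word similarity step can be sketched with NLTK's Jiang-Conrath implementation, as below. The information-content corpus, the restriction to noun senses, and the word pairs are illustrative choices rather than the paper's exact configuration.

```python
# Word-pair similarity via WordNet's Jiang-Conrath measure in NLTK.
# Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')

from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')   # information-content statistics

def jcn_word_similarity(word_a, word_b):
    """Best Jiang-Conrath score over the noun senses of the two words."""
    best = 0.0
    for syn_a in wn.synsets(word_a, pos=wn.NOUN):
        for syn_b in wn.synsets(word_b, pos=wn.NOUN):
            best = max(best, syn_a.jcn_similarity(syn_b, brown_ic))
    return best

print(jcn_word_similarity("car", "truck"))
print(jcn_word_similarity("car", "banana"))
```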


2020 ◽  
Vol 28 (4) ◽  
pp. 531-561 ◽  
Author(s):  
Andrew Lensen ◽  
Bing Xue ◽  
Mengjie Zhang

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.
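
A compressed sketch of the clustering side of this approach: a candidate similarity function (hand-written here, standing in for one evolved by genetic programming over a selected feature subset) links sufficiently similar instances in a graph whose connected components become the clusters. The features, threshold, and data are illustrative.

```python
# Graph-based clustering driven by a pluggable similarity function.

import numpy as np

def candidate_similarity(x, y):
    """A GP individual would be an expression tree; this stand-in combines
    two selected features with simple arithmetic."""
    return 1.0 / (1.0 + abs(x[0] - y[0]) + 0.5 * abs(x[2] - y[2]))

def graph_clustering(data, sim, threshold=0.5):
    n = len(data)
    adjacency = [[j for j in range(n) if j != i and sim(data[i], data[j]) >= threshold]
                 for i in range(n)]
    labels, current = [-1] * n, 0
    for start in range(n):                  # connected components via DFS
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if labels[node] == -1:
                labels[node] = current
                stack.extend(adjacency[node])
        current += 1
    return labels

data = np.array([[0.1, 5.0, 0.2], [0.15, 9.0, 0.25],
                 [2.0, 1.0, 2.1], [2.1, 7.0, 2.0]])
print(graph_clustering(data, candidate_similarity, threshold=0.6))
```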


2013 ◽  
Vol 380-384 ◽  
pp. 1488-1494
Author(s):  
Wang Wei ◽  
Jin Yue Peng

In the research and development of intelligent systems, clustering analysis is a very important problem. Based on the new direct clustering algorithm presented in this paper, which uses similarity measures of Vague sets as its evaluation criterion, the Vague direct clustering method is analyzed using different similarity measures of Vague sets. The experimental results show that the direct clustering method based on the similarity of Vague sets is effective, and that the direct clustering methods based on different similarity measures of Vague sets are basically the same, differing only in the steps of clustering. Different algorithms should therefore be selected according to different conditions in actual applications.
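
For concreteness, the sketch below implements one simple similarity measure between vague sets, comparing the score t - f element-wise and averaging. It is only a representative of the family of measures compared above, not a specific one from the paper.

```python
# A simple element-wise similarity between two vague sets over the same
# universe. Each element is given by truth-membership t and false-membership f
# (so its vague value is the interval [t, 1 - f]); other measures in the
# literature weight t and f differently.

def vague_similarity(A, B):
    """A, B: lists of (t, f) pairs over the same universe, with t + f <= 1."""
    assert len(A) == len(B)
    total = 0.0
    for (t_a, f_a), (t_b, f_b) in zip(A, B):
        total += 1.0 - abs((t_a - f_a) - (t_b - f_b)) / 2.0
    return total / len(A)

A = [(0.6, 0.2), (0.3, 0.5), (0.8, 0.1)]
B = [(0.5, 0.3), (0.4, 0.4), (0.7, 0.2)]
print(vague_similarity(A, B))   # value in [0, 1]; 1 means identical scores
```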


2020 ◽  
Author(s):  
Andrew Lensen ◽  
Bing Xue ◽  
Mengjie Zhang

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.


2021 ◽  
Author(s):  
Athira T M ◽  
Sunil Jacob John ◽  
Harish Garg

A Pythagorean fuzzy set (PFS) is a generalization of the intuitionistic fuzzy set that can represent situations where the sum of the membership and non-membership values exceeds one. Adding parameterization to PFS, we obtain a structure named the Pythagorean fuzzy soft set (PFSS). It has a higher capacity to deal with vagueness, as it captures the structures of both a PFS and a soft set. Several practical situations demand a measure of similarity between two structures whose sum of membership and non-membership values exceeds one. There are no existing tools to measure the similarity between PFSSs, and this paper puts forward similarity measures for PFSSs. An axiomatic definition of a similarity measure for PFSSs is proposed and certain expressions for similarity measures are introduced. Further, some theorems that express the properties of similarity measures are proved. A comparative study of the proposed expressions for similarity measures is carried out. Also, a clustering algorithm based on PFSSs is introduced by utilizing the proposed similarity measure.
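
As an illustration of the kind of expression involved (not one of the measures proposed in the paper), the sketch below computes a simple normalised-distance-based similarity between two Pythagorean fuzzy soft sets stored as parameterized lists of (membership, non-membership) pairs.

```python
# Illustrative similarity between two Pythagorean fuzzy soft sets, each stored
# as {parameter: [(mu, nu), ...]} over a common universe with mu**2 + nu**2 <= 1.
# The measure below averages a normalised distance on the squared grades; the
# paper's expressions and their axiomatic properties differ.

def pfss_similarity(F, G):
    assert F.keys() == G.keys()
    total, count = 0.0, 0
    for e in F:
        for (mu_f, nu_f), (mu_g, nu_g) in zip(F[e], G[e]):
            d = 0.5 * (abs(mu_f**2 - mu_g**2) + abs(nu_f**2 - nu_g**2))
            total += 1.0 - d
            count += 1
    return total / count

F = {"cost":    [(0.9, 0.3), (0.6, 0.5)],
     "quality": [(0.7, 0.4), (0.8, 0.2)]}
G = {"cost":    [(0.8, 0.4), (0.5, 0.6)],
     "quality": [(0.7, 0.5), (0.9, 0.1)]}
print(pfss_similarity(F, G))
```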


Author(s):  
Subhanshu Goyal ◽  
Sushil Kumar ◽  
M. A. Zaveri ◽  
A. K. Shukla

In recent times, graph-based spectral clustering algorithms have received immense attention in many areas such as data mining, object recognition, and image analysis and processing. The commonly used similarity measure in these clustering algorithms is the Gaussian kernel function, which uses a sensitive scaling parameter and, when applied to the segmentation of noise-contaminated images, leads to unsatisfactory performance because it neglects spatial pixel information. The present work introduces a novel framework for spectral clustering that embodies local spatial information and a fuzzy-based similarity measure to tackle the above-mentioned issues. In our approach, we first filter the noise components from the original image by using spatial and gray-level information. The similarity matrix is then constructed by employing a similarity measure that takes into account the fuzzy c-partition matrix and the vectors of the cluster centers obtained by the fuzzy c-means clustering algorithm. In the last step, the spectral clustering technique is applied to the derived similarity matrix to obtain the desired segmentation result. Experimental results on the segmentation of synthetic and Berkeley benchmark images with noise demonstrate the effectiveness and robustness of the proposed method, giving it an edge over the clustering-based segmentation methods reported in the literature.
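
A condensed sketch of the final two stages described above: build a similarity matrix from a fuzzy c-partition and run spectral clustering on it with scikit-learn. The membership matrix here comes from a single fuzzy-c-means membership update against fixed toy centres; the paper's spatial filtering and centre-based terms in the similarity are omitted.

```python
# Fuzzy-membership-based affinity matrix + spectral clustering (simplified).

import numpy as np
from sklearn.cluster import SpectralClustering

data = np.array([[0.1], [0.2], [0.15], [0.9], [0.85], [0.95]])   # toy gray levels
centres = np.array([[0.15], [0.9]])
m = 2.0                                                          # fuzzifier

# Standard FCM membership update: u[i, k] proportional to 1 / d(x_i, c_k)^(2/(m-1))
dist = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2) + 1e-9
inv = dist ** (-2.0 / (m - 1.0))
U = inv / inv.sum(axis=1, keepdims=True)                         # rows sum to 1

S = U @ U.T                                                      # membership-based affinity
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
print(labels)
```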


2021 ◽  
Vol 7 ◽  
pp. e641
Author(s):  
Hassan I. Abdalla ◽  
Ali A. Amer

In Information Retrieval (IR), Data Mining (DM), and Machine Learning (ML), similarity measures have been widely used for text clustering and classification. The similarity measure is the cornerstone upon which the performance of most DM and ML algorithms depends. Yet the search in the literature for an effective and efficient similarity measure is still ongoing. Some recently proposed similarity measures are effective, but have complex designs and suffer from inefficiencies. This work, therefore, develops an effective and efficient similarity measure of simple design for text-based applications. The measure developed in this work is driven by Boolean logic algebra basics (BLAB-SM) and aims at reaching the desired accuracy at the fastest run time compared to recently developed state-of-the-art measures. Using the term frequency-inverse document frequency (TF-IDF) scheme, the K-nearest neighbor (KNN) classifier, and the K-means clustering algorithm, a comprehensive evaluation is presented. The evaluation compares BLAB-SM against seven similarity measures on two popular datasets, Reuters-21 and Web-KB. The experimental results illustrate that BLAB-SM is not only more efficient but also significantly more effective than state-of-the-art similarity measures on both classification and clustering tasks.
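
The evaluation pipeline described here can be skeletonised with scikit-learn as below: TF-IDF vectorisation followed by KNN classification and K-means clustering. BLAB-SM itself is not reproduced; cosine similarity stands in for it, and the tiny corpus is invented for illustration.

```python
# TF-IDF + KNN + K-means evaluation skeleton with a placeholder similarity.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

docs = ["stock markets fell sharply today",
        "the central bank raised interest rates",
        "the team won the championship game",
        "injury forces star player to retire"]
labels = ["finance", "finance", "sports", "sports"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Classification: KNN with cosine distance as the stand-in similarity
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine").fit(X, labels)
print(knn.predict(vec.transform(["quarterly interest rate decision"])))

# Clustering: K-means on the same TF-IDF vectors
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```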

