Genetic Programming for Evolving Similarity Functions Tailored to Clustering Algorithms

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.
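As a rough illustration of the idea in the abstract above, the sketch below evaluates one hand-written candidate similarity function, represented as an expression tree, and uses it for a simple graph-based clustering. Everything specific here is an assumption made for illustration and not the authors' method: the tuple-based tree encoding, the function set in `OPS`, the choice to score pairs as a dissimilarity (lower means more alike), the fixed `threshold`, and the use of connected components as clusters. The evolutionary search itself (population, crossover, mutation, fitness) is omitted.

```python
import operator
from itertools import combinations

# Hypothetical function set for combining features (feature construction).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul,
       "max": max, "min": min}


def evaluate(tree, a, b):
    """Score a pair of instances with an expression tree (lower = more alike).

    A tree is one of:
      ("feat", i)        -> absolute difference of feature i (feature selection)
      ("const", c)       -> a constant
      (op, left, right)  -> OPS[op] applied to the two subtree values
    """
    kind = tree[0]
    if kind == "feat":
        return abs(a[tree[1]] - b[tree[1]])
    if kind == "const":
        return tree[1]
    return OPS[kind](evaluate(tree[1], a, b), evaluate(tree[2], a, b))


def cluster(data, tree, threshold):
    """Link every pair scoring below `threshold`; clusters are the
    connected components of the resulting graph (found with union-find)."""
    parent = list(range(len(data)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(data)), 2):
        if evaluate(tree, data[i], data[j]) < threshold:
            parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(data))]


# Toy data: feature 1 is noise; the tree below uses only features 0 and 2.
data = [(0.0, 9.0, 0.0), (0.1, -5.0, 0.2), (5.0, 9.0, 5.0), (5.1, -5.0, 5.2)]
tree = ("add", ("feat", 0), ("mul", ("const", 0.5), ("feat", 2)))
labels = cluster(data, tree, threshold=1.0)
print(labels)
```

Because the tree never reads feature 1, the large differences in that noisy feature do not affect the grouping, whereas a Euclidean distance over all three features would be dominated by them. In a genetic programming setting, trees like this one would be the individuals being evolved.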


Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

10.26686/wgtn.13058777 ◽

2020 ◽

Author(s):

Andrew Lensen ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Similarity Measures ◽

Small Subset ◽

Similarity Functions ◽

New Approach ◽

Performance Improvements ◽

Consistent Performance ◽

High Dimensional Datasets



Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

10.26686/wgtn.13058777.v1 ◽

2020 ◽

Author(s):

Andrew Lensen ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Similarity Measures ◽

Small Subset ◽

Similarity Functions ◽

New Approach ◽

Performance Improvements ◽

Consistent Performance ◽

High Dimensional Datasets



Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

10.26686/wgtn.12493814 ◽

2020 ◽

Author(s):

Andrew Lensen ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Similarity Measures ◽

Small Subset ◽

Similarity Functions ◽

New Approach ◽

Performance Improvements ◽

Consistent Performance ◽

High Dimensional Datasets



Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

10.26686/wgtn.12493814.v1 ◽

2020 ◽

Author(s):

Andrew Lensen ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Similarity Measures ◽

Small Subset ◽

Similarity Functions ◽

New Approach ◽

Performance Improvements ◽

Consistent Performance ◽

High Dimensional Datasets



Learning similarity functions for binary strings via genetic programming

2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS) ◽

10.1109/icacsis.2016.7872773 ◽

2016 ◽

Author(s):

Muhammad Syahid Pebriadi ◽

Vektor Dewanto ◽

Wisnu Ananta Kusuma ◽

Farit Mochamad Afendi ◽

Rudi Heryanto

Keyword(s):

Genetic Programming ◽

Similarity Functions ◽

Binary Strings


Remote Sensing Image Classification Using Genetic-Programming-Based Time Series Similarity Functions

IEEE Geoscience and Remote Sensing Letters ◽

10.1109/lgrs.2017.2719033 ◽

2017 ◽

Vol 14 (9) ◽

pp. 1499-1503 ◽

Cited By ~ 6

Author(s):

Alexandre E. Almeida ◽

Ricardo da S. Torres

Keyword(s):

Remote Sensing ◽

Time Series ◽

Genetic Programming ◽

Image Classification ◽

Remote Sensing Image ◽

Similarity Functions ◽

Remote Sensing Image Classification


Handling WSD using Hierarchical Clustering Algorithm with sentences

International Journal of Scientific Research in Science Engineering and Technology ◽

10.32628/ijsrset1841120 ◽

2018 ◽

pp. 83-88

Author(s):

Mohana Priya K ◽

Pooja Ragavi S ◽

Krishna Priya G

Keyword(s):

Hierarchical Clustering ◽

Similarity Measure ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Cosine Similarity Measure ◽

Hierarchical Clustering Algorithm ◽

Multiple Levels ◽

Pos Tagger ◽

Sentence Clustering ◽

The Right

Clustering is the process of grouping objects into subsets that are meaningful in the context of a particular problem. It does not rely on predefined classes, and it is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed, and the choice among them depends on the application. Sentence clustering is one of the best clustering techniques. A hierarchical clustering algorithm is applied at multiple levels for accuracy. A POS tagger and the Porter stemmer are used for tagging. The WordNet dictionary is used to determine similarity by invoking the Jiang-Conrath and cosine similarity measures. Grouping is performed with respect to the highest similarity value, with a mean threshold. This paper incorporates many parameters for finding the similarity between words. To identify the disambiguated words, sense identification is performed for the adjectives and a comparison is made. The SemCor and machine learning datasets are employed. Compared with previous results for WSD, our work shows a considerable improvement, reaching 91.2%.
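The cosine similarity measure and the mean threshold mentioned above can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: sentences are represented as plain word-count vectors, the threshold is the mean of all pairwise similarities, and the helper names `cosine_similarity` and `pairs_above_mean` are hypothetical. The WordNet-based Jiang-Conrath measure, POS tagging, and stemming are left out.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(s1, s2):
    """Cosine of the angle between the word-count vectors of two sentences."""
    a, b = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def pairs_above_mean(sentences):
    """Return the mean pairwise similarity and the index pairs that exceed it.

    Assumes at least two sentences, so that there is at least one pair.
    """
    sims = {(i, j): cosine_similarity(sentences[i], sentences[j])
            for i, j in combinations(range(len(sentences)), 2)}
    mean = sum(sims.values()) / len(sims)
    return mean, [pair for pair, s in sims.items() if s > mean]


sentences = [
    "the bank approved the loan",
    "the bank rejected the loan",
    "we sat on the river bank",
]
mean, grouped = pairs_above_mean(sentences)
print(round(mean, 3), grouped)
```

Here the two financial uses of "bank" score above the mean and are grouped, while the river sentence stays apart. A hierarchical scheme could repeat this step on the merged groups at each level.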
