A Preference Model on Adaptive Affinity Propagation

Author(s):  
Rina Refianti ◽  
Achmad Benny Mutiara ◽  
Asep Juarna ◽  
Adang Suhendra

In recent years, two new data clustering algorithms have been proposed. One of them is Affinity Propagation (AP). AP is a new data clustering technique that uses iterative message passing and considers all data points as potential exemplars. Two important inputs of AP are a similarity matrix (SM) of the data and the parameter "preference" p. Although the original AP algorithm has shown much success in data clustering, it still suffers from one limitation: it is not easy to determine the value of the parameter "preference" p that results in an optimal clustering solution. To resolve this limitation, we propose a new model of the parameter "preference" p, namely one based on the similarity distribution. Given the SM and p, the Modified Adaptive AP (MAAP) procedure is run. The MAAP procedure omits the adaptive p-scanning algorithm of the original Adaptive AP (AAP) procedure. Experimental results on random non-partition and partition data sets show that (i) the proposed algorithm, MAAP-DDP, is slower than the original AP for random non-partition data sets, and (ii) for random 4-partition data sets and real data sets the proposed algorithm succeeds in identifying clusters matching the number of the data sets' true labels, with execution times comparable to those of the original AP. Besides that, the MAAP-DDP algorithm proves more feasible and effective than the original AAP procedure.
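The paper's distribution-based preference model (MAAP-DDP) is not reproduced here, but the role of p is easy to demonstrate. Below is a minimal scikit-learn sketch that derives p from the similarity distribution; the median of the similarities is the common default heuristic and stands in for the paper's model.

```python
# Sketch: Affinity Propagation with a preference p derived from the
# similarity distribution (median heuristic; not the paper's exact model).
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Similarity matrix: negative squared Euclidean distance, as in the
# original AP formulation.
S = -np.square(X[:, None, :] - X[None, :, :]).sum(axis=-1)

# Preference modeled from the similarity distribution; lower values of p
# yield fewer clusters, the median is the usual default.
p = np.median(S)

ap = AffinityPropagation(affinity="precomputed", preference=p, random_state=0)
labels = ap.fit_predict(S)
print("clusters found:", len(ap.cluster_centers_indices_))
```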

2021 ◽  
Vol 8 (10) ◽  
pp. 43-50
Author(s):  
Truong et al.

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose the attribute with fewer values and the leaf node with more objects, leading to undesirable clustering results. To overcome these shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.
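For readers unfamiliar with the roughness measure that MMR minimizes, here is a small illustrative sketch using the standard rough-set definitions (roughness = 1 - |lower approximation| / |upper approximation|, averaged over the equivalence classes of an attribute). The attribute names and data are made up, and IMMR's refinements are not modeled.

```python
# Mean roughness of one categorical attribute with respect to another,
# in the rough-set sense used by MMR-style algorithms (illustrative only).
from collections import defaultdict

def partition(rows, attr):
    """Group row indices by their value of `attr` (equivalence classes)."""
    groups = defaultdict(set)
    for i, row in enumerate(rows):
        groups[row[attr]].add(i)
    return list(groups.values())

def roughness(rows, a, b):
    """Mean roughness of attribute a with respect to b."""
    blocks_b = partition(rows, b)
    scores = []
    for X in partition(rows, a):
        lower = sum(len(Y) for Y in blocks_b if Y <= X)  # classes inside X
        upper = sum(len(Y) for Y in blocks_b if Y & X)   # classes touching X
        scores.append(1 - lower / upper)
    return sum(scores) / len(scores)

rows = [
    {"color": "red",  "shape": "round"},
    {"color": "red",  "shape": "round"},
    {"color": "blue", "shape": "square"},
    {"color": "blue", "shape": "round"},
]
# An MMR-style algorithm splits on the attribute with minimum mean roughness.
print(roughness(rows, "color", "shape"))
```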


2011 ◽  
Vol 268-270 ◽  
pp. 811-816
Author(s):  
Yong Zhou ◽  
Yan Xing

Affinity Propagation (AP) is a new clustering algorithm based on the similarity matrix between pairs of data points; messages are exchanged between data points until a clustering result emerges. It is efficient and fast, and it can handle clustering on large data sets. However, traditional Affinity Propagation has several limitations. This paper introduces Affinity Propagation, analyzes its advantages and limitations in depth, and focuses on improvements to the algorithm: improving the similarity matrix, adjusting the preference and the damping factor, and combining it with other algorithms. Finally, it discusses the further development of Affinity Propagation.
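As a concrete reference for the quantities being adjusted, here is a compact NumPy sketch of AP's responsibility/availability message updates with a damping factor lam, following the standard update rules; it is an illustrative implementation, not tuned production code.

```python
# Minimal Affinity Propagation message passing with damping.
import numpy as np

def affinity_propagation(S, lam=0.5, iters=200):
    """S: (n, n) similarity matrix with preferences on the diagonal."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities  a(i, k)
    rows = np.arange(n)
    for _ in range(iters):
        # r(i,k) <- s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        AS = A + S
        top = AS.argmax(axis=1)
        first = AS[rows, top]
        AS[rows, top] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, top] = S[rows, top] - second
        R = lam * R + (1 - lam) * R_new          # damped update
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        col = Rp.sum(axis=0)
        A_new = np.minimum(0, col[None, :] - Rp)
        np.fill_diagonal(A_new, col - R.diagonal())
        A = lam * A + (1 - lam) * A_new          # damped update
    return (A + R).argmax(axis=1)                # exemplar index per point
```

Raising lam toward 1 slows the updates but suppresses the oscillations that adjusting the preference can otherwise trigger.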


Author(s):  
B. K. Tripathy ◽  
Hari Seetha ◽  
M. N. Murty

Data clustering plays a very important role in data mining, machine learning, and image processing. As modern databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed in this direction. These include fuzzy c-means, rough c-means, intuitionistic fuzzy c-means, and hybrid-model variants such as rough fuzzy c-means and rough intuitionistic fuzzy c-means. We also find many variants that improve these algorithms in different directions, such as their kernelised versions, possibilistic versions, and possibilistic kernelised versions. However, none of the above algorithms is effective on big data, for various reasons, so researchers have been trying for the past few years to improve them so that they can be applied to cluster big data. Such algorithms are still relatively few in comparison to those for data sets of reasonable size. Our aim in this chapter is to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms that can be developed further.
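As a point of reference for the family discussed above, here is a compact sketch of the standard fuzzy c-means updates in their textbook form; the rough, intuitionistic, and hybrid variants modify the membership model around this same loop.

```python
# Standard fuzzy c-means: alternate center and membership updates.
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)            # fuzzy memberships per point
    for _ in range(iters):
        W = U ** m                               # m is the fuzzifier
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))           # u_ik ∝ d_ik^(-2/(m-1))
        U /= U.sum(axis=1, keepdims=True)        # renormalize rows
    return centers, U
```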


Author(s):  
UREERAT WATTANACHON ◽  
CHIDCHANOK LURSINSAP

Existing clustering algorithms, such as single-link clustering, k-means, CURE, and CSM, are designed to find clusters based on predefined parameters specified by users. These algorithms may fail if the choice of parameters is inappropriate for the data set being clustered. Most of them work very well only for compact and hyper-spherical clusters. In this paper, a new hybrid clustering algorithm called Self-Partition and Self-Merging (SPSM) is proposed. The SPSM algorithm partitions the input data set into several subclusters in the first phase and then removes noisy data in the second phase. In the third phase, the normal subclusters are iteratively merged into larger clusters based on inter-cluster and intra-cluster distance criteria. From the experimental results, the SPSM algorithm handles noisy data sets very effectively and can cluster data sets of arbitrary shapes and differing densities. Several color-image examples show the versatility of the proposed method, with comparisons to results described in the literature for the same images. The computational complexity of the SPSM algorithm is O(N²), where N is the number of data points.
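The abstract does not spell out the exact merge criteria, so the following is only a hypothetical sketch of a phase-three style test: two subclusters merge when the gap between them is comparable to their internal spread. The threshold alpha is a made-up parameter, not something from the paper.

```python
# Hypothetical subcluster merge test: inter-cluster gap vs. intra-cluster spread.
import numpy as np

def intra_distance(C):
    """Mean pairwise distance within a subcluster (assumes len(C) >= 2)."""
    d = np.linalg.norm(C[:, None] - C[None], axis=2)
    return d.sum() / (len(C) * (len(C) - 1))

def inter_distance(C1, C2):
    """Smallest gap between two subclusters (single-link style)."""
    return np.linalg.norm(C1[:, None] - C2[None], axis=2).min()

def should_merge(C1, C2, alpha=1.5):
    # alpha is an illustrative threshold; merge when the gap is small
    # relative to the looser of the two internal spreads.
    return inter_distance(C1, C2) <= alpha * max(intra_distance(C1),
                                                 intra_distance(C2))
```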


2012 ◽  
Vol 2 (1) ◽  
Author(s):  
Mohammad Taherdangkoo ◽  
Mehran Yazdi ◽  
Mohammad Bagheri

There are many ways to divide data sets into clusters. One of the most popular data clustering algorithms is the K-means algorithm, which uses a distance criterion to measure data similarity. To apply it, we must know the number of classes (K) in advance and choose K data points as an initial set to run the algorithm. However, the choice of initial points is a major problem in this algorithm, as it may cause the algorithm to converge to a local minimum. Other data clustering algorithms have been proposed to overcome this problem, among them the Genetic Algorithm (GA), Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), and Artificial Bee Colony (ABC) algorithms. In this paper, we employ the Stem Cells Optimization algorithm for data clustering. The algorithm is inspired by the behavior of natural stem cells in the human body. We developed a new data clustering method based on this optimization scheme, which offers advantages such as a high convergence rate and an easy implementation process. It also avoids local minima in an intelligent manner. Experimental results obtained by using the new algorithm on different well-known test data sets, compared with those obtained using the other methods mentioned, demonstrate the better accuracy and higher speed of the new algorithm.
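The initialization problem is easy to reproduce: with a single random initialization per run, K-means can settle in different local minima, visible as different final inertia values across seeds. A small scikit-learn demonstration (not the paper's method):

```python
# K-means sensitivity to initial points: compare final inertia across seeds.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, cluster_std=2.0, random_state=1)
for seed in range(3):
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed)
    km.fit(X)
    print(f"seed={seed}  final inertia={km.inertia_:.1f}")
```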


2017 ◽  
Author(s):  
Herbert J. Bernstein ◽  
Lawrence C. Andrews ◽  
James Foadi ◽  
Martin R. Fuchs ◽  
Jean Jakoncic ◽  
...  

KAMO and Blend provide particularly effective tools to automatically manage the merging of large numbers of data sets from serial crystallography. The requirement for manual intervention in the process can be reduced by extending Blend to support additional clustering options that increase the sensitivity to differences in unit cell parameters and allow for clustering of nearly complete data sets on the basis of intensity or amplitude differences. If the data sets are already sufficiently complete, one applies KAMO once, just for reflections. If starting from incomplete data sets, one applies KAMO twice, first using cell parameters. In this step either the simple cell-vector distance of the original Blend or the more sensitive NCDist is used to find clusters to merge, in order to achieve sufficient completeness to allow intensities or amplitudes to be compared. One then uses KAMO again, using the correlation between the reflections at the common HKLs, to merge clusters in a way sensitive to structural differences that may not perturb the cell parameters enough to form meaningful clusters.

Many groups have developed effective clustering algorithms that use a measurable physical parameter from each diffraction still or wedge to cluster the data into categories, which can then be merged to, hopefully, yield the electron density from a single protein isoform. What is striking about many of these physical parameters is that they are largely independent of one another. Consequently, it should be possible to greatly improve the efficacy of data clustering software by using a multi-stage partitioning strategy. Here, we have demonstrated one possible approach to multi-stage data clustering. Our strategy was to use unit-cell clustering until the merged data were sufficiently complete, and then to use intensity-based clustering. We have demonstrated that, using this strategy, we were able to accurately cluster data sets from crystals that had subtle differences.
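A toy sketch of the two-stage strategy, using SciPy's hierarchical clustering: stage one groups data sets by unit-cell parameters, stage two refines each group by intensity correlation at common HKLs. The arrays, thresholds, and metrics here are illustrative stand-ins, not KAMO's or Blend's internals.

```python
# Two-stage clustering: unit-cell parameters first, then intensity correlation.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

cells = np.random.rand(40, 6)      # (a, b, c, alpha, beta, gamma) per data set
intens = np.random.rand(40, 200)   # intensities at 200 common HKLs

# Stage 1: group data sets with similar cells (plain Euclidean here; Blend
# can instead use the more sensitive NCDist).
stage1 = fcluster(linkage(pdist(cells), method="ward"),
                  t=3, criterion="maxclust")

# Stage 2: within each cell cluster, split by 1 - Pearson correlation of
# the intensities at the common HKLs.
for c in np.unique(stage1):
    idx = np.where(stage1 == c)[0]
    if len(idx) < 3:
        continue                                   # too few to subdivide
    d = pdist(intens[idx], metric="correlation")   # 1 - r
    stage2 = fcluster(linkage(d, method="average"),
                      t=0.3, criterion="distance") # illustrative cutoff
    print(f"cell cluster {c}: intensity subclusters -> {stage2}")
```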


2019 ◽  
Vol 8 (3) ◽  
pp. 5630-5634

In artificial intelligence applications such as biomedicine and bioinformatics, data clustering is an important and complex task that arises in different situations. Prototype-based clustering is a reasonable and simple approach to describing and evaluating data that can be treated as a non-vertical representation of relational data. Because of the barycentric space present in prototype clustering, maintaining and updating the cluster structure as data points change is still a challenging task for biomedical relational data. In this paper we therefore propose a Novel Optimized Evidential C-Medoids (NOEC) algorithm, which belongs to the family of prototype-based clustering approaches, for updating and computing proximities in medical relational data. We use an Ant Colony Optimization approach to provide similarity services over different features for relational updates of clustered medical data. We evaluate our approach on several synthetic biomedical relational data sets. Experimental results show that the proposed approach gives better and more efficient results than the compared methods in terms of accuracy and time when processing medical relational data sets.
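NOEC layers evidential (belief-function) memberships and the Ant Colony Optimization component on top of the prototype idea; the minimal baseline it extends is plain c-medoids on a dissimilarity matrix, sketched below under those simplifying assumptions.

```python
# Plain c-medoids on relational (pairwise dissimilarity) data: the
# prototype baseline NOEC builds on; evidential masses and ACO omitted.
import numpy as np

def c_medoids(D, c=3, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=c, replace=False)
    labels = np.argmin(D[:, medoids], axis=1)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)   # assign to nearest medoid
        new = medoids.copy()
        for k in range(c):
            members = np.where(labels == k)[0]
            if len(members) == 0:
                continue                            # keep medoid of empty cluster
            within = D[np.ix_(members, members)].sum(axis=1)
            new[k] = members[np.argmin(within)]     # most central member
        if np.array_equal(new, medoids):
            break                                   # converged
        medoids = new
    return medoids, labels

X = np.random.rand(60, 4)
D = np.linalg.norm(X[:, None] - X[None], axis=2)    # relational input
medoids, labels = c_medoids(D)
```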


2019 ◽  
Vol 16 (2) ◽  
pp. 469-489 ◽  
Author(s):  
Piotr Lasek ◽  
Jarek Gryz

In this paper we present our ic-NBC and ic-DBSCAN algorithms for data clustering with constraints. The algorithms are based on the density-based clustering algorithms NBC and DBSCAN, but allow users to incorporate background knowledge into the clustering process by means of instance constraints. Knowledge about anticipated groups can be supplied by specifying so-called must-link and cannot-link relationships between objects or points; these relationships are then incorporated into the clustering process. In the proposed algorithms this is achieved by properly merging the resulting clusters and by introducing a new notion of deferred points, which are temporarily excluded from clustering and later assigned to clusters based on their involvement in cannot-link relationships. To examine the algorithms, we carried out a number of experiments on benchmark data sets, testing the efficiency and the quality of the results, and we also measured the efficiency of the algorithms against their original versions. The experiments show that the introduction of instance constraints improves the quality of both algorithms, while the efficiency is only insignificantly reduced, owing to the extra computation related to the introduced constraints.
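The following toy sketch shows the constraint bookkeeping idea only: merging clusters joined by a must-link pair and deferring one endpoint of a violated cannot-link pair. It is not the authors' ic-NBC/ic-DBSCAN logic, which integrates the constraints into the density-based expansion itself.

```python
# Toy post-processing with instance constraints on top of plain DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

def apply_constraints(labels, must_link, cannot_link):
    labels = labels.copy()
    # Union clusters connected by a must-link relationship.
    for i, j in must_link:
        a, b = labels[i], labels[j]
        if a != -1 and b != -1 and a != b:
            labels[labels == b] = a
    # Defer one side of each violated cannot-link pair (mark unassigned).
    for i, j in cannot_link:
        if labels[i] != -1 and labels[i] == labels[j]:
            labels[j] = -1   # deferred point, to be reassigned later
    return labels

X = np.random.rand(100, 2)
base = DBSCAN(eps=0.1, min_samples=4).fit_predict(X)
refined = apply_constraints(base, must_link=[(0, 1)], cannot_link=[(2, 3)])
```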


2014 ◽  
Vol 574 ◽  
pp. 728-733
Author(s):  
Shu Xia Lu ◽  
Cai Hong Jiao ◽  
Le Tong ◽  
Yang Fan Zhou

The Core Vector Machine (CVM) can handle large data sets by finding the minimum enclosing ball (MEB), but one drawback is that CVM is very sensitive to outliers. To tackle this problem, we propose a novel Position-Regularized Core Vector Machine (PCVM). In the proposed PCVM, the data points are regularized by assigning position-based weights. Experimental results on several benchmark data sets show that the performance of PCVM is much better than that of CVM.
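The abstract does not give the exact weighting scheme, so here is only a hypothetical sketch of position-based weighting: points far from the data centre receive smaller weights, shrinking the influence of outliers on the enclosing ball.

```python
# Hypothetical position-based weighting; the decay form is illustrative.
import numpy as np

def position_weights(X):
    centre = X.mean(axis=0)
    d = np.linalg.norm(X - centre, axis=1)
    return np.exp(-d / (d.mean() + 1e-12))   # in (0, 1], decays with distance

X = np.vstack([np.random.randn(100, 2), [[8.0, 8.0]]])  # one outlier
w = position_weights(X)
print("outlier weight:", w[-1], "median weight:", np.median(w))
```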


2014 ◽  
Vol 40 (2) ◽  
pp. 269-310 ◽  
Author(s):  
Yanir Seroussi ◽  
Ingrid Zukerman ◽  
Fabian Bohnert

Authorship attribution deals with identifying the authors of anonymous texts. Traditionally, research in this field has focused on formal texts, such as essays and novels, but recently more attention has been given to texts generated by on-line users, such as e-mails and blogs. Authorship attribution of such on-line texts is a more challenging task than traditional authorship attribution, because such texts tend to be short, and the number of candidate authors is often larger than in traditional settings. We address this challenge by using topic models to obtain author representations. In addition to exploring novel ways of applying two popular topic models to this task, we test our new model that projects authors and documents to two disjoint topic spaces. Utilizing our model in authorship attribution yields state-of-the-art performance on several data sets, containing either formal texts written by a few authors or informal texts generated by tens to thousands of on-line users. We also present experimental results that demonstrate the applicability of topical author representations to two other problems: inferring the sentiment polarity of texts, and predicting the ratings that users would give to items such as movies.
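A minimal sketch of the general idea of topic-based author representations, using standard LDA from scikit-learn: represent each author by the mean topic distribution of their documents and attribute new text to the nearest profile. This is an illustrative pipeline, not the paper's disjoint-topic-space model, and the tiny corpus is made up.

```python
# Topic-based author profiles for attribution (illustrative pipeline).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the ship sailed at dawn", "we deployed the new model",
        "dawn broke over the harbour", "training the model converged"]
authors = ["ann", "bob", "ann", "bob"]

vec = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(vec.fit_transform(docs))   # document-topic mixtures

# Author profile = mean topic distribution over that author's documents.
profiles = {a: theta[np.array(authors) == a].mean(axis=0)
            for a in set(authors)}

test = lda.transform(vec.transform(["the model sailed"]))[0]
guess = min(profiles, key=lambda a: np.linalg.norm(profiles[a] - test))
print("attributed to:", guess)
```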

