Overview of Different Data Clustering Algorithms for Static and Dynamic Data Sets

Author(s):  
Johnsymol Joy ◽  
B. K. Tripathy ◽  
Hari Seetha ◽  
M. N. Murty

Data clustering plays a very important role in data mining, machine learning, and image processing. As modern-day databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed. These include fuzzy c-means, rough c-means, and intuitionistic fuzzy c-means, as well as algorithms such as rough fuzzy c-means and rough intuitionistic fuzzy c-means, which are based on hybrid models. There are also many variants that improve these algorithms in different directions, such as their kernelised versions, possibilistic versions, and possibilistic kernelised versions. However, for various reasons, these algorithms are not effective on big data. So, researchers have been trying for the past few years to improve these algorithms so that they can be applied to cluster big data; such algorithms are still relatively few in comparison to those for datasets of reasonable size. Our aim in this chapter is to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms which can be developed further.
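The c-means family named above shares one core loop: alternate between updating soft memberships and recomputing membership-weighted centroids. A minimal NumPy sketch of plain fuzzy c-means follows; the function name, defaults, and initialization are illustrative, not taken from the chapter:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: alternate membership and centroid updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # random initial membership matrix; each row sums to 1
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # centroids weighted by memberships raised to the fuzzifier m
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # distance of every point to every centroid (guard against zero)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # standard membership update: inverse-distance ratio rule
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U
```

Rough and intuitionistic variants replace the membership update with lower/upper approximations or hesitation degrees, but keep this alternating structure.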


2017 ◽  
Author(s):  
Herbert J. Bernstein ◽  
Lawrence C. Andrews ◽  
James Foadi ◽  
Martin R. Fuchs ◽  
Jean Jakoncic ◽  
...  

KAMO and Blend provide particularly effective tools to automatically manage the merging of large numbers of data sets from serial crystallography. The requirement for manual intervention in the process can be reduced by extending Blend to support additional clustering options that increase the sensitivity to differences in unit cell parameters and allow nearly complete datasets to be clustered on the basis of intensity or amplitude differences. If the datasets are already sufficiently complete, KAMO is applied once, using only the reflections. If starting from incomplete datasets, KAMO is applied twice, first using cell parameters. In this step either the simple cell-vector distance of the original Blend is used, or the more sensitive NCDist, to find clusters to merge so as to achieve sufficient completeness for intensities or amplitudes to be compared. KAMO is then used again, with the correlation between the reflections at the common HKLs, to merge clusters in a way that is sensitive to structural differences which may not perturb the cell parameters enough to produce meaningful clusters.

Many groups have developed effective clustering algorithms that use a measurable physical parameter from each diffraction still or wedge to cluster the data into categories, which can then be merged to, hopefully, yield the electron density of a single protein isoform. What is striking about many of these physical parameters is that they are largely independent of one another. Consequently, it should be possible to greatly improve the efficacy of data clustering software by using a multi-stage partitioning strategy. Here, we have demonstrated one possible approach to multi-stage data clustering. Our strategy was to use unit-cell clustering until the merged data were sufficiently complete, and then to use intensity-based clustering. We have demonstrated that, using this strategy, we were able to accurately cluster data sets from crystals that had subtle differences.
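The two-stage strategy can be sketched in a few lines: cluster on unit-cell parameters first, then merge cell clusters whose averaged intensities correlate. This toy version uses plain Euclidean distance on cell parameters as a stand-in for Blend's cell metric or NCDist, and mean intensity profiles in place of proper scaling and merging; the function name and thresholds are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def two_stage_cluster(cells, intensities, cell_thresh, cc_thresh):
    """Stage 1: hierarchical clustering on unit-cell parameters (Euclidean
    distance standing in for Blend's cell metric or NCDist).
    Stage 2: merge stage-1 clusters whose mean intensity vectors are
    highly correlated at the common reflections."""
    # stage 1: cut the cell-parameter dendrogram at cell_thresh
    Z = linkage(cells, method="ward")
    stage1 = fcluster(Z, t=cell_thresh, criterion="distance")
    # mean intensity profile per stage-1 cluster (crude proxy for merging)
    ids = np.unique(stage1)
    profiles = np.array([intensities[stage1 == i].mean(axis=0) for i in ids])
    # stage 2: cluster the profiles by correlation distance (1 - CC)
    if len(ids) > 1:
        Z2 = linkage(pdist(profiles, metric="correlation"), method="average")
        stage2 = fcluster(Z2, t=1.0 - cc_thresh, criterion="distance")
    else:
        stage2 = np.array([1])
    # map each dataset to its final merged cluster
    final = {i: s for i, s in zip(ids, stage2)}
    return np.array([final[i] for i in stage1])
```

In the real workflow the stage-2 comparison is done on merged, scaled intensities at common HKLs rather than raw mean profiles, but the partition-then-merge flow is the same.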


2019 ◽  
Vol 16 (2) ◽  
pp. 469-489 ◽  
Author(s):  
Piotr Lasek ◽  
Jarek Gryz

In this paper we present our ic-NBC and ic-DBSCAN algorithms for data clustering with constraints. The algorithms are based on the density-based clustering algorithms NBC and DBSCAN, but allow users to incorporate background knowledge into the clustering process by means of instance constraints. Knowledge about anticipated groups can be applied by specifying so-called must-link and cannot-link relationships between objects or points. These relationships are then incorporated into the clustering process. In the proposed algorithms this is achieved by properly merging resulting clusters and by introducing a new notion of deferred points, which are temporarily excluded from clustering and assigned to clusters based on their involvement in cannot-link relationships. To examine the algorithms, we have carried out a number of experiments on benchmark data sets, testing the efficiency and quality of the results, and we have also measured the efficiency of the algorithms against their original versions. The experiments show that the introduction of instance constraints improves the quality of both algorithms, while efficiency is reduced only insignificantly, owing to the extra computation related to the introduced constraints.
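The idea of layering instance constraints on a density-based clusterer can be roughed out as follows: run plain DBSCAN, merge clusters joined by must-link pairs, and mark as deferred any point whose assignment violates a cannot-link pair. This is a simplified illustration of the concept, not the authors' ic-DBSCAN; scikit-learn's DBSCAN and the label conventions below are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def constrained_dbscan(X, must_link, cannot_link, eps=0.5, min_samples=3):
    """Sketch of instance-constrained density clustering: plain DBSCAN,
    then must-link merging and cannot-link deferral as post-processing."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    # union-find over cluster ids to honour must-link constraints
    parent = {c: c for c in set(labels) if c != -1}
    def find(c):
        while parent[c] != c:
            c = parent[c]
        return c
    for a, b in must_link:
        if labels[a] != -1 and labels[b] != -1:
            parent[find(labels[a])] = find(labels[b])
    labels = np.array([find(c) if c != -1 else -1 for c in labels])
    # defer the second point of any cannot-link pair left in one cluster
    deferred = [b for a, b in cannot_link
                if labels[a] != -1 and labels[a] == labels[b]]
    labels[deferred] = -2          # -2 marks deferred points
    return labels
```

In the paper the deferred points are later reassigned to clusters in a way consistent with their cannot-link relationships; here they are simply flagged.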


Author(s):  
Rina Refianti ◽  
Achmad Benny Mutiara ◽  
Asep Juarna ◽  
Adang Suhendra

In recent years, two new data clustering algorithms have been proposed. One of them is Affinity Propagation (AP). AP is a data clustering technique that uses iterative message passing and considers all data points as potential exemplars. Two important inputs of AP are a similarity matrix (SM) of the data and the parameter ”preference” p. Although the original AP algorithm has shown much success in data clustering, it still suffers from one limitation: it is not easy to determine the value of the parameter ”preference” p that results in an optimal clustering solution. To resolve this limitation, we propose a new model of the parameter ”preference” p, i.e., it is modeled based on the similarity distribution. Having the SM and p, the Modified Adaptive AP (MAAP) procedure is run; MAAP means that we omit the adaptive p-scanning algorithm of the original Adaptive AP (AAP) procedure. Experimental results on random non-partition and partition data sets show that (i) the proposed algorithm, MAAP-DDP, is slower than original AP for the random non-partition dataset, and (ii) for the random 4-partition dataset and the real datasets, the proposed algorithm succeeds in identifying clusters in accordance with the datasets’ true labels, with execution times comparable to those of original AP. Besides that, the MAAP-DDP algorithm proves more feasible and effective than the original AAP procedure.


Author(s):  
Wenyuan Li ◽  
Wee Keong Ng

With the rapid growth of the World Wide Web and the capacity of digital data storage, tremendous amounts of data are generated daily from business and engineering to the Internet and science. The Internet, financial real-time data, hyperspectral imagery, and DNA microarrays are just a few of the common sources that feed torrential streams of data into scientific and business databases worldwide. Compared to statistical data sets with small size and low dimensionality, such unprecedentedly high-volume, high-dimensional complex data challenges traditional clustering techniques. To meet these challenges, many new clustering algorithms have been proposed in the area of data mining (Han & Kamber, 2001). Spectral techniques have proven useful and effective in a variety of data mining and information retrieval applications where massive amounts of real-life data are available (Deerwester et al., 1990; Kleinberg, 1998; Lawrence et al., 1999; Azar et al., 2001). In recent years, a class of promising and increasingly popular approaches, spectral methods, has been proposed in the context of the clustering task (Shi & Malik, 2000; Kannan et al., 2000; Meila & Shi, 2001; Ng et al., 2001). Spectral methods are an attractive approach to the clustering problem for the following reasons:
• Spectral approaches to the clustering problem offer the potential for dramatic improvements in efficiency and accuracy relative to traditional iterative or greedy algorithms, and they do not intrinsically suffer from the problem of local optima.
• Numerical methods for spectral computations are extremely mature and well understood, allowing clustering algorithms to benefit from a long history of implementation efficiencies in other fields (Golub & Loan, 1996).
• Components in spectral methods have a naturally close relationship with graphs (Chung, 1997). This characteristic provides an intuitive and semantic understanding of the elements in spectral methods.
This is important when the data is graph-based, such as the links of the WWW, or can be converted to graphs. In this paper, we systematically discuss applications of spectral methods to data clustering.
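The standard spectral clustering pipeline (in the style of Ng et al., 2001) is short enough to sketch in full: build a Gaussian affinity graph, embed the points with the top eigenvectors of the normalized affinity matrix, then run k-means in the embedded space. This is a generic textbook sketch, not code from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(X, k, sigma=1.0):
    """Textbook spectral clustering: Gaussian affinity graph, spectral
    embedding from the normalized affinity, k-means on the embedding."""
    # Gaussian (RBF) affinity matrix with zero self-affinity
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # symmetric normalization D^{-1/2} A D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    M = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # top-k eigenvectors of M form the spectral embedding
    vals, vecs = np.linalg.eigh(M)
    U = vecs[:, -k:]
    # normalize rows to unit length, then cluster with k-means
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```

The eigen-decomposition is where the mature numerical machinery mentioned above comes in; the affinity matrix is exactly the weighted graph that gives the method its semantic interpretation.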


2019 ◽  
Vol 34 (S1) ◽  
pp. S36-S41
Author(s):  
M. Bortolotti ◽  
L. Lutterotti ◽  
E. Borovin ◽  
D. Martorelli

X-ray diffraction-X-ray fluorescence (XRD-XRF) data sets obtained from surface scans of synthetic samples have been analysed by means of different data clustering algorithms, with the aim of proposing a methodology for automatic crystallographic and chemical classification of surfaces. Three data clustering strategies have been evaluated, namely hierarchical, k-means, and density-based clustering; all of them have been applied to the distance matrices calculated from the single XRD and XRF data sets as well as to the combined distance matrix. Classification performance is reported for each strategy, both in numerical form as the corrected Rand index and as a visual reconstruction of the surface maps. Hierarchical and k-means clustering offered comparable results, depending on both sample complexity and data quality. When applied to XRF data collected on a two-phase test sample, both algorithms yielded Rand index values above 0.8, whereas XRD data collected on the same sample gave values around 0.5; application to the combined distance matrix improved the correlation to about 0.9. In the case of a more complex multi-phase sample, it was also found that classification performance strongly depends on both data quality and signal contrast between different regions; again, the adoption of the combined dissimilarity matrix improved classification performance.
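The combined-distance-matrix idea can be sketched directly: normalize each modality's distance matrix so neither signal dominates, sum them, and cut a hierarchical dendrogram. The normalization and linkage choices below are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def combined_cluster(xrd, xrf, n_clusters):
    """Hierarchical clustering of surface-scan points on a combined
    XRD+XRF dissimilarity: scale each condensed distance matrix to
    [0, 1], sum them, and cut the dendrogram at n_clusters."""
    D_xrd, D_xrf = pdist(xrd), pdist(xrf)
    D = D_xrd / D_xrd.max() + D_xrf / D_xrf.max()
    Z = linkage(D, method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Classification quality against a known phase map can then be scored with `sklearn.metrics.adjusted_rand_score`, the corrected-for-chance Rand index used in the paper.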


2012 ◽  
Vol 256-259 ◽  
pp. 2935-2938
Author(s):  
Xin Xiao

An improved resource-limited artificial immune algorithm is proposed and applied to incremental data clustering. The algorithm adopts an improved stimulation-level function, allowing the system to distribute resources more reasonably, and introduces an immune response model that simulates the initial and secondary responses. Simulation experiments on UCI data sets, and comparison with influential artificial-immune-based incremental clustering algorithms, show that the algorithm is effective, can extract data characteristics, and improves clustering accuracy and the data compression rate.
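As a rough illustration of how artificial-immune-network incremental clustering works in general (a generic sketch, not the paper's improved algorithm): each arriving point acts as an antigen that either stimulates a sufficiently close memory cell, which absorbs it and gains resources, or spawns a new memory cell. All names and constants here are illustrative:

```python
import numpy as np

def immune_incremental(X, threshold):
    """Generic artificial-immune-style incremental clustering sketch:
    antigens (points) are absorbed by the nearest memory cell when the
    stimulation (inverse distance) is high enough; otherwise they create
    a new cell. Cells track a resource count as they absorb antigens."""
    cells = []                      # list of [centre, resources]
    for x in X:
        if cells:
            d = [np.linalg.norm(x - c[0]) for c in cells]
            i = int(np.argmin(d))
            if d[i] < threshold:
                # stimulated cell: drift toward the antigen, gain resource
                cells[i][0] = 0.9 * cells[i][0] + 0.1 * x
                cells[i][1] += 1.0
                continue
        cells.append([x.astype(float), 1.0])
    return [c[0] for c in cells]
```

Because each point is processed once against the current memory cells, new data can be folded in without reclustering from scratch, which is the appeal of the incremental setting; the resource counts give the compression described in the abstract (many points summarized by few cells).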


2019 ◽  
Vol 8 (2S11) ◽  
pp. 3687-3693

Clustering is a type of mining process in which a data set is categorized into various subclasses. Clustering is essential in classification, grouping, exploratory pattern analysis, image segmentation, and decision making. Big data refers to very large data sets that are examined computationally to reveal patterns and associations, often relating to human behavior and interactions. Big data is very important for many organisations, but in some cases it is complex to store and time-consuming to process. One way of overcoming these issues is to develop new clustering methods, although these still suffer from large complexity. Data mining is a technique by which useful information is extracted, but traditional data mining models cannot be applied to big data because of its inherent complexity. The main scope here is to introduce an overview of data clustering approaches for big data and to describe some of the related work. This survey concentrates on research into clustering algorithms that work on the elements of big data, and gives a short overview of clustering algorithms grouped under partitioning, hierarchical, grid-based, and model-based approaches. Clustering is a major data mining task used for analyzing big data; we also discuss the problems of applying clustering patterns to big data and the new issues that big data raises.
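As one concrete example of how partitioning methods are adapted to large data sets, mini-batch k-means updates centroids from small random batches instead of full passes over the data, trading a little accuracy for large speedups. The scikit-learn call below is a generic illustration, not an algorithm from this survey:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# 15,000 points in 8 dimensions, drawn from three well-separated blobs
X = np.vstack([rng.normal(c, 0.5, (5000, 8)) for c in (0.0, 5.0, 10.0)])

# each centroid update touches only batch_size points, so memory and
# per-iteration cost stay bounded regardless of the data set size
mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=5,
                      random_state=0).fit(X)
labels = mbk.labels_
```

Hierarchical, grid-based, and model-based families have analogous scalable variants (e.g. sampling-based linkage, summary grids, and stochastic EM), all built on the same idea of avoiding repeated full passes over the data.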


Author(s):  
Yuancheng Li ◽  
Yaqi Cui ◽  
Xiaolong Zhang

Background: Advanced Metering Infrastructure (AMI) for the smart grid is growing rapidly, resulting in exponential growth of the data collected and transmitted by these devices. Clustering these data can give the electricity company a better understanding of the personalized and differentiated needs of its users.
Objective: Existing clustering algorithms for processing such data generally suffer from problems such as insufficient data utilization, high computational complexity, and low accuracy of behavior recognition.
Methods: In order to improve clustering accuracy, this paper proposes a new clustering method based on the electrical behavior of the user. Starting from an analysis of user load characteristics, user electricity data samples were constructed. The daily load characteristic curve was extracted through an improved extreme learning machine clustering algorithm and effective index criteria. Moreover, clustering analysis was carried out for different users from industrial, commercial, and residential areas. The improved extreme learning machine algorithm, called Unsupervised Extreme Learning Machine (US-ELM), is an extension and improvement of the original Extreme Learning Machine (ELM) that realizes an unsupervised clustering task on the basis of the original ELM.
Results: Experiments on four different data sets, implemented in MATLAB, compared the method with other commonly used clustering algorithms. The experimental results show that the US-ELM algorithm has higher accuracy in processing power data.
Conclusion: The unsupervised ELM algorithm can greatly reduce time consumption and improve the effectiveness of clustering.
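A compact sketch of the US-ELM idea, under simplifying assumptions: a random, untrained sigmoid hidden layer provides the ELM feature map, the embedding comes from a graph-Laplacian generalized eigenproblem over the hidden-layer outputs, and k-means runs on the embedding. The exact objective, regularization, and parameters below are illustrative, not the authors':

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def us_elm_cluster(X, k, n_hidden=50, lam=0.1, seed=0):
    """US-ELM-style clustering sketch: random ELM hidden layer, spectral
    embedding from a Laplacian-regularized eigenproblem, then k-means."""
    rng = np.random.default_rng(seed)
    # random, untrained sigmoid hidden layer (the ELM part)
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    # graph Laplacian of a symmetrized kNN graph captures manifold structure
    A = kneighbors_graph(X, 10, include_self=False).toarray()
    A = np.maximum(A, A.T)
    L = np.diag(A.sum(axis=1)) - A
    # smallest generalized eigenvectors of (H'LH) v = g (I + lam H'H) v
    Am = H.T @ L @ H
    Bm = np.eye(n_hidden) + lam * (H.T @ H)
    vals, vecs = eigh(Am, Bm)
    # skip the trivial smoothest direction, keep the next k for embedding
    E = H @ vecs[:, 1:k + 1]
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(E)
```

Because the hidden weights are random rather than trained, the only heavy step is one eigen-decomposition of an n_hidden-sized matrix, which is where the claimed reduction in time consumption comes from.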


Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 786
Author(s):  
Yenny Villuendas-Rey ◽  
Eley Barroso-Cubas ◽  
Oscar Camacho-Nieto ◽  
Cornelio Yáñez-Márquez

Swarm intelligence has appeared as an active field for solving numerous machine-learning tasks. In this paper, we address the problem of clustering data with missing values, where the patterns are described by mixed (or hybrid) features. We introduce a generic modification to three swarm intelligence algorithms (Artificial Bee Colony, Firefly Algorithm, and Novel Bat Algorithm). We experimentally obtain adequate values of the parameters for these three modified algorithms, with the purpose of applying them to the clustering task. We also provide an unbiased comparison among several metaheuristic-based clustering algorithms, concluding that the clusters obtained by our proposals are highly representative of the “natural structure” of the data.
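The common way swarm algorithms are applied to clustering is worth making concrete: each agent encodes a full set of k centroids, fitness is the within-cluster squared error, and agents drift toward the current best solution with random exploration. The sketch below is a generic stand-in for ABC, FA, and NBA (not any one of them), and omits the paper's mixed-feature and missing-value handling:

```python
import numpy as np

def swarm_cluster(X, k, n_agents=20, n_iter=100, seed=0):
    """Generic swarm-style clustering: agents are candidate centroid
    sets, moved toward the best agent with Gaussian exploration and
    greedy acceptance of improving moves."""
    rng = np.random.default_rng(seed)

    def fitness(c):
        # within-cluster sum of squared distances for centroid set c
        d = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d.min(axis=1).sum()

    # initialise every agent with k randomly chosen data points as centroids
    agents = X[rng.choice(len(X), size=(n_agents, k))].astype(float)
    best = min(agents, key=fitness).copy()
    span = X.max(axis=0) - X.min(axis=0)
    for _ in range(n_iter):
        for i in range(n_agents):
            # attraction toward the best agent plus Gaussian exploration
            cand = agents[i] + 0.5 * (best - agents[i]) \
                   + rng.normal(0.0, 0.05, best.shape) * span
            if fitness(cand) < fitness(agents[i]):   # greedy acceptance
                agents[i] = cand
        best = min(agents, key=fitness).copy()
    labels = ((X[:, None, :] - best[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return labels, best
```

The three algorithms in the paper differ mainly in how the attraction and exploration terms are computed; handling mixed features and missing values additionally requires a dissimilarity that tolerates both, which is what the authors' generic modification supplies.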

