A FAST k-MEANS IMPLEMENTATION USING CORESETS

2008 ◽  
Vol 18 (06) ◽  
pp. 605-625 ◽  
Author(s):  
GEREON FRAHLING ◽  
CHRISTIAN SOHLER

In this paper we develop an efficient implementation of a k-means clustering algorithm. The algorithm is based on a combination of Lloyd's algorithm with random swapping of centers to avoid local minima, an approach proposed by Mount et al. [30]. The novel feature of our algorithm is the use of coresets to speed up the computation. A coreset is a small weighted set of points that approximates the original point set with respect to the considered problem. We use the coreset construction described in [12]. Our algorithm first computes a solution on a very small coreset. Then, in each iteration, the previous solution is used as a starting solution on a refined, i.e. larger, coreset. To evaluate the performance of our algorithm we compare it with the algorithm KMHybrid [30] on typical 3D data sets for an image compression application and on artificially created instances. Our data sets consist of 300,000 to 4.9 million points. Our algorithm outperforms KMHybrid on most of these input instances. Additionally, the quality of the solutions computed by our algorithm varies significantly less than that of KMHybrid. We conclude that the use of coresets has two effects: first, it can speed up algorithms significantly; second, in variants of Lloyd's algorithm, it reduces the dependency on the starting solution and thus makes the algorithm more stable. Finally, we propose the use of coresets as a heuristic to approximate the average silhouette coefficient of clusterings. The average silhouette coefficient is a measure of the quality of a clustering that is independent of the number of clusters k; hence, it can be used to compare the quality of clusterings for different values of k. To show the applicability of our approach we computed clusterings and approximate average silhouette coefficients for k = 1,…,100 on our input instances and discuss the performance of our algorithm in detail.
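The coarse-to-fine loop the abstract describes can be sketched in a few lines. The sketch below substitutes a simple lightweight-coreset-style sampler for the grid construction of [12], and the names (`build_coreset`, `coreset_kmeans`) are illustrative, not the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_coreset(X, m, rng):
    """Sample a weighted coreset of size m. Simple sensitivity-style
    (lightweight-coreset) sampling, standing in for the grid-based
    construction of [12] used in the paper."""
    d2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    p = 0.5 / len(X) + 0.5 * d2 / d2.sum()   # sampling probabilities
    idx = rng.choice(len(X), size=m, p=p)
    return X[idx], 1.0 / (m * p[idx])        # points and weights

def coreset_kmeans(X, k, sizes=(1000, 4000, 16000), seed=0):
    """Solve k-means on successively larger coresets, warm-starting
    each round with the previous centers (coarse-to-fine refinement)."""
    rng = np.random.default_rng(seed)
    centers = "k-means++"                    # first round: standard seeding
    for m in sizes:
        C, w = build_coreset(X, min(m, len(X)), rng)
        km = KMeans(n_clusters=k, init=centers, n_init=1).fit(
            C, sample_weight=w)
        centers = km.cluster_centers_        # warm start for next round
    return centers
```

Warm-starting each round with the previous centers is what makes the refinement cheap: most Lloyd iterations happen on the smallest coresets, and later rounds only polish the solution.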

2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis, and visualization of mixed data (continuous/binary). The weights and prototypes are learned simultaneously, ensuring an optimized clustering of the data: the higher a variable's weight, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process over the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public repository: a handwritten digit data set, the Zoo data set, and three other mixed data sets. The results show a good quality of the topological ordering and homogeneous clustering.
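A minimal sketch of the core idea, simultaneous learning of prototypes and variable weights, follows. It assumes continuous variables only and a Gaussian grid neighborhood, so it is a simplified stand-in for the paper's mixed-data (continuous/binary) update rules:

```python
import numpy as np

def train_weighted_som(X, grid=(5, 5), epochs=20, lr=0.5, sigma=1.0, seed=0):
    """Toy weighted SOM: prototypes and per-variable weights are updated
    together, so informative (low-dispersion) variables gain influence
    on the winner search. A simplified stand-in for the paper's rule."""
    rng = np.random.default_rng(seed)
    n_units, d = grid[0] * grid[1], X.shape[1]
    W = rng.normal(size=(n_units, d))              # prototypes
    alpha = np.ones(d) / d                         # variable weights
    pos = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    for _ in range(epochs):
        for x in rng.permutation(X):
            # weighted distance picks the best-matching unit (BMU)
            bmu = np.argmin(((W - x) ** 2 * alpha).sum(axis=1))
            h = np.exp(-((pos - pos[bmu]) ** 2).sum(axis=1) / (2 * sigma**2))
            W += lr * h[:, None] * (x - W)         # prototype update
            # variables with low neighborhood dispersion get larger weights
            disp = ((W - x) ** 2 * h[:, None]).sum(axis=0) + 1e-12
            alpha = (1.0 / disp) / (1.0 / disp).sum()
    return W, alpha
```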


2016 ◽  
Vol 16 (6) ◽  
pp. 27-42 ◽  
Author(s):  
Minghan Yang ◽  
Xuedong Gao ◽  
Ling Li

Although the Clustering Algorithm Based on Sparse Feature Vector (CABOSFV) and its related algorithms are efficient for high-dimensional sparse data clustering, they have several imperfections, such as subjective parameter designation and order sensitivity of the clustering process, which ultimately degrade the time complexity and quality of the algorithm. This paper proposes a parameter adjustment method for Bidirectional CABOSFV for optimization purposes. By optimizing the Parameter Vector (PV) and Parameter Selection Vector (PSV) with clustering validity as the objective function, an improved Bidirectional CABOSFV algorithm using simulated annealing is proposed, which circumvents the need for initial parameter determination. Experiments on UCI data sets show that the proposed algorithm, which can perform multi-adjustment clustering, achieves higher accuracy than single-adjustment clustering, along with decreased time complexity through iterations.
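The annealing layer is independent of CABOSFV's internals, so it can be sketched generically. In the sketch below, `validity` and `neighbor` are placeholder callbacks for a clustering-validity objective and a PV perturbation; neither is from the paper:

```python
import numpy as np

def anneal_parameters(pv0, validity, neighbor, T0=1.0, cooling=0.95,
                      steps=200, seed=0):
    """Generic simulated-annealing loop over a parameter vector (PV),
    standing in for the paper's Bidirectional CABOSFV adjustment.
    `validity(pv)` scores the clustering the parameters induce (higher
    is better); `neighbor(pv, rng)` proposes a perturbed PV."""
    rng = np.random.default_rng(seed)
    pv, f = pv0, validity(pv0)
    best, f_best = pv, f
    T = T0
    for _ in range(steps):
        cand = neighbor(pv, rng)
        fc = validity(cand)
        # accept improvements always, worse moves with Boltzmann probability
        if fc > f or rng.random() < np.exp((fc - f) / T):
            pv, f = cand, fc
            if f > f_best:
                best, f_best = pv, f
        T *= cooling                     # geometric cooling schedule
    return best, f_best
```

Because worse PVs are sometimes accepted at high temperature, the search can escape the locally optimal parameter settings that a single manual designation would be stuck with.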


2016 ◽  
Vol 25 (3) ◽  
pp. 431-440 ◽  
Author(s):  
Archana Purwar ◽  
Sandeep Kumar Singh

The quality of data is an important issue in data mining: the validity of mining algorithms is reduced if the data are not of good quality. Data quality can be assessed in terms of missing values (MV) as well as noise present in the data set. Various imputation techniques have been studied for MV, but little attention has been given to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel density-based imputation technique (DBSCANI), built on density-based clustering, to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel et al. groups objects according to their density in spatial databases: high-density regions are known as clusters, and low-density regions correspond to the noise objects in the data set. Extensive experiments were performed on the Iris data set from the life-science domain and on Jain's (2D) data set from the shape data sets. The performance of the proposed method is evaluated using root mean square error (RMSE) and compared with the existing K-means imputation (KMI). Results show that our method is more noise-resistant than KMI on the data sets under study.
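A minimal sketch of the density-based imputation idea using scikit-learn's DBSCAN follows; it clusters rows on their fully observed columns and fills each hole from the donor cluster, which is a simplification of DBSCANI, not the paper's exact procedure:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_impute(X, eps=0.5, min_samples=5):
    """Impute NaNs by clustering rows on their observed columns with
    DBSCAN, then filling each hole with its cluster's column mean.
    Noise rows (label -1) fall back to the global column mean.
    Assumes at least one fully observed column; a simplified sketch
    of the DBSCANI idea, not the paper's exact method."""
    X = np.asarray(X, dtype=float)
    complete_cols = ~np.isnan(X).any(axis=0)       # columns with no holes
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        X[:, complete_cols])
    out = X.copy()
    col_mean = np.nanmean(X, axis=0)               # global fallback
    for i, j in zip(*np.where(np.isnan(X))):
        donors = X[(labels == labels[i]) & (labels != -1), j]
        donors = donors[~np.isnan(donors)]
        out[i, j] = donors.mean() if donors.size else col_mean[j]
    return out
```

Routing noise rows to the global fallback rather than to a cluster mean is what gives the density-based scheme its noise resistance relative to KMI, which forces every row into some cluster.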


2011 ◽  
Vol 20 (01) ◽  
pp. 139-177 ◽  
Author(s):  
YAN ZHOU ◽  
OLEKSANDR GRYGORASH ◽  
THOMAS F. HAIN

We propose two clustering algorithms based on Euclidean minimum spanning trees: one k-constrained, the other unconstrained. Our k-constrained clustering algorithm produces a k-partition of a set of points for any given k: it constructs a minimum spanning tree of a set of representative points and removes edges that satisfy a predefined criterion, repeating the process until k clusters are produced. Our unconstrained clustering algorithm partitions a point set into a group of clusters by maximally reducing the overall standard deviation of the edges in the Euclidean minimum spanning tree constructed from the given point set, without prescribing the number of clusters. We present experimental results comparing our proposed algorithms with k-means, X-means, CURE, Chameleon, and the Expectation-Maximization (EM) algorithm on both artificial data and benchmark data from the UCI repository. We also apply our algorithms to image color clustering and compare them with the standard minimum spanning tree clustering algorithm as well as CURE, Chameleon, and X-means.
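For the k-constrained variant, a compact sketch using SciPy is shown below; it uses the classic criterion of cutting the k-1 longest MST edges, whereas the paper's removal criterion and its use of representative points are more refined:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def emst_kcluster(X, k):
    """k-constrained EMST clustering sketch: build the Euclidean minimum
    spanning tree, cut its k-1 longest edges, and report the connected
    components as clusters. Cutting longest edges is the classic
    criterion; the paper's removal rule is more refined."""
    D = squareform(pdist(X))                      # pairwise distances
    mst = minimum_spanning_tree(D).tocoo()        # n-1 weighted edges
    keep = np.ones(mst.nnz, dtype=bool)
    keep[np.argsort(mst.data)[::-1][:k - 1]] = False   # drop longest edges
    pruned = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=mst.shape)
    n_comp, labels = connected_components(pruned, directed=False)
    return labels                                 # n_comp == k components
```

Removing k-1 edges from a spanning tree always yields exactly k components, which is why the edge-removal loop terminates with a k-partition.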


2018 ◽  
Vol 2018 ◽  
pp. 1-8 ◽  
Author(s):  
Yaping Li ◽  
Zhiwei Ni ◽  
Feifei Jin ◽  
Jingming Li ◽  
Fenggang Li

As an important data analysis method in data mining, clustering analysis has been researched extensively and in depth. To address the limitation that the K-means clustering algorithm is sensitive to the distribution of the initial cluster centers, the Glowworm Swarm Optimization (GSO) algorithm is introduced to solve clustering problems. Firstly, this paper introduces the basic ideas of the GSO algorithm, the K-means algorithm, and the good-point set, and analyzes the feasibility of combining them for clustering optimization. Next, it designs a clustering method based on an improved GSO algorithm with a good-point set, which combines the GSO algorithm with the classical K-means algorithm: the improved GSO algorithm searches the data object space and provides initial cluster centers for the K-means algorithm, thus obtaining better clustering results. The major improvement to the GSO algorithm is to optimize the initial distribution of the glowworm swarm by introducing the theory and method of the good-point set. Finally, the new clustering algorithm is applied to UCI data sets of different categories and sizes for clustering tests. The advantages of the improved clustering algorithm in terms of sum of squared errors (SSE), clustering accuracy, and robustness are demonstrated through comparison and analysis.
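The good-point-set initialization can be sketched independently of the full GSO loop (luciferin update, probabilistic neighbor movement), which is omitted here. The construction below is one common variant of the good-point set; the paper may use a different one:

```python
import numpy as np

def good_point_set(n, d):
    """One common good-point-set construction: take the smallest prime
    p >= 2d + 3, set r_j = 2*cos(2*pi*j / p), and let the i-th point be
    frac(i * r). The points are low-discrepancy, i.e. spread evenly
    over the unit cube [0, 1)^d."""
    p = 2 * d + 3
    while any(p % q == 0 for q in range(2, int(p ** 0.5) + 1)):
        p += 1                                     # smallest prime >= 2d+3
    r = 2 * np.cos(2 * np.pi * np.arange(1, d + 1) / p)
    i = np.arange(1, n + 1)[:, None]
    return np.mod(i * r, 1.0)                      # n points in [0, 1)^d

def init_glowworms(X, n_glowworms):
    """Scale a good-point set to the data's bounding box, giving an
    evenly spread initial glowworm swarm (and hence better candidate
    initial centers for K-means) than uniform random placement."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return lo + good_point_set(n_glowworms, X.shape[1]) * (hi - lo)
```

The even spread is the point: with uniform random initialization, glowworms can cluster by chance and leave regions of the data space unexplored, which is exactly the sensitivity the paper sets out to reduce.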


2013 ◽  
Vol 3 (2) ◽  
pp. 58-77
Author(s):  
Marlene Goncalves ◽  
Maria-Esther Vidal

Criteria that induce a Skyline naturally represent a user's preference conditions, useful to discard irrelevant data in large datasets. However, in the presence of high-dimensional Skyline spaces, the size of the Skyline can still be very large, making it infeasible for users to process this set of points. To identify the best points among the Skyline, the Top-k Skyline approach has been proposed. Top-k Skyline uses discriminatory criteria to induce a total order on the points that comprise the Skyline, and recognizes the best or top-k points based on these criteria. In this article the authors model queries as multi-dimensional points that represent bounds of VPT (Vertically Partitioned Table) property values, and datasets as sets of multi-dimensional points; the problem is to locate the k best tuples in the dataset whose distance to the query is minimized. A tuple is among the k best tuples whenever there is no other tuple that is better in all dimensions and closer to the query point, i.e., the k best tuples correspond to the k nearest points to the query that are incomparable, or belong to the skyline. The authors name these tuples the k nearest neighbors in the skyline. The authors propose a hybrid approach that combines Skyline and Top-k solutions and develop two algorithms: TKSI and k-NNSkyline. The proposed algorithms identify, among the skyline tuples, the k ones with the lowest values of the distance metric, i.e., the k nearest neighbors to the multi-dimensional query that are incomparable. Empirically, the authors study the performance and quality of TKSI and k-NNSkyline. Their experimental results show that TKSI is able to speed up the computation of the Top-k Skyline by at least 50% with respect to state-of-the-art solutions whenever k is smaller than the size of the Skyline. Additionally, the results suggest that k-NNSkyline outperforms existing solutions by up to three orders of magnitude.
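A naive sketch of the k-nearest-neighbors-in-the-skyline semantics is given below, using an O(n^2) skyline test under minimization; TKSI and k-NNSkyline compute the same answer far more efficiently:

```python
import numpy as np

def skyline(P):
    """Naive O(n^2) skyline under minimization: keep the points not
    dominated by any other (dominated = worse-or-equal in every
    dimension and strictly worse in at least one)."""
    P = np.asarray(P, dtype=float)
    keep = []
    for i, p in enumerate(P):
        dominated = any((q <= p).all() and (q < p).any()
                        for j, q in enumerate(P) if j != i)
        if not dominated:
            keep.append(i)
    return np.array(keep)

def knn_skyline(P, query, k):
    """k nearest neighbors in the skyline: among the incomparable
    (Pareto-optimal) tuples, return the k with the smallest Euclidean
    distance to the query point. A naive sketch of the semantics behind
    TKSI / k-NNSkyline, not the paper's optimized algorithms."""
    P = np.asarray(P, dtype=float)
    idx = skyline(P)
    dist = np.linalg.norm(P[idx] - np.asarray(query, float), axis=1)
    return idx[np.argsort(dist)[:k]]
```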


2018 ◽  
Vol 37 (1) ◽  
pp. 71
Author(s):  
Eduardo Sant'Ana Da Silva ◽  
Anderson Santos ◽  
Helio Pedrini

Surface approximation plays an important role in several application fields, such as computer-aided design, computer graphics, remote sensing, computer vision, robotics, architecture, and manufacturing. A common problem in these areas is to develop efficient methods for generating, processing, analyzing, and visualizing large amounts of 3D data. Triangular meshes constitute a flexible representation of sampled points that are not regularly distributed in space, such that the model can be adaptively adjusted to the data density. The choice of metrics for building the triangular meshes is crucial for producing high-quality models. This paper proposes and evaluates different measures to incrementally refine a Delaunay triangular mesh for image surface approximation until either a certain accuracy is obtained or a maximum number of iterations is reached. Experiments on several data sets are performed to compare the quality of the resulting meshes.
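A greedy refinement loop of the kind the paper evaluates can be sketched with SciPy's Delaunay machinery via piecewise-linear interpolation; the error measure below is plain absolute intensity error, just one of the several measures a paper like this would compare:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def refine_mesh(img, tol=5.0, max_iter=500):
    """Greedy Delaunay refinement sketch for image surface approximation:
    start from the four image corners, repeatedly insert the pixel where
    the piecewise-linear approximation errs most, and stop when the
    maximum error drops below `tol` or `max_iter` insertions are reached.
    Rebuilds the triangulation each step for clarity; a real
    implementation would update it incrementally."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    values = img.ravel().astype(float)
    pts = [(0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1)]  # corners
    vals = [float(img[y, x]) for x, y in pts]
    for _ in range(max_iter):
        interp = LinearNDInterpolator(np.array(pts, dtype=float), vals)
        err = np.abs(interp(coords) - values)
        worst = np.nanargmax(err)              # pixel with largest error
        if err[worst] < tol:
            break                              # accuracy target reached
        x, y = coords[worst]
        pts.append((x, y))
        vals.append(float(img[int(y), int(x)]))
    return np.array(pts), np.array(vals)
```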


Author(s):  
Douglas L. Dorset

The quantitative use of electron diffraction intensity data for the determination of crystal structures represents the pioneering achievement in the electron crystallography of organic molecules, an effort largely begun by B. K. Vainshtein and his co-workers. However, despite numerous representative structure analyses yielding results consistent with X-ray determinations, this entire effort was viewed with considerable mistrust by many crystallographers. This was no doubt due to the rather high crystallographic R-factors reported for some structures and, more importantly, the failure to convince many skeptics that the measured intensity data were adequate for ab initio structure determinations. We have recently demonstrated the utility of these data sets for structure analyses by direct phase determination based on the probabilistic estimate of three- and four-phase structure invariant sums. Examples include the structure of diketopiperazine using Vainshtein's 3D data, a similar 3D analysis of the room-temperature structure of thiourea, and a zonal determination of the urea structure, the latter also based on data collected by the Moscow group.
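As a toy illustration of the three-phase (Sigma-2) invariants mentioned above: for strong normalized reflections, phi_h is approximately phi_k + phi_(h-k), with reliability growing with |E_h E_k E_(h-k)|. The sketch below takes the weighted circular mean over already-phased pairs; it illustrates the relation only and is not the authors' phasing procedure:

```python
import numpy as np

def triplet_phase_estimate(E, phases, h, n_atoms):
    """Toy Sigma-2 estimate of a phase from three-phase structure
    invariants: phi_h is approximated by the A-weighted circular mean
    of phi_k + phi_(h-k) over pairs of already-phased reflections,
    with reliability A = (2 / sqrt(N)) |E_h E_k E_(h-k)|. `E` maps
    Miller-index tuples (including h) to normalized structure-factor
    magnitudes; `phases` maps known reflections to phases in radians."""
    z = 0j
    for k, phi_k in phases.items():
        hk = tuple(h_i - k_i for h_i, k_i in zip(h, k))
        if hk in phases and hk in E and k in E:
            A = 2.0 / np.sqrt(n_atoms) * abs(E[h] * E[k] * E[hk])
            z += A * np.exp(1j * (phi_k + phases[hk]))  # weighted vote
    return np.angle(z)        # consensus phase estimate for reflection h
```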


2003 ◽  
Vol 42 (05) ◽  
pp. 215-219
Author(s):  
G. Platsch ◽  
A. Schwarz ◽  
K. Schmiedehausen ◽  
B. Tomandl ◽  
W. Huk ◽  
...  

Summary: Aim: Although the fusion of images from different modalities may improve diagnostic accuracy, it is rarely used in clinical routine work due to logistic problems. Therefore, we evaluated the performance of, and the time needed for, fusing MRI and SPECT images using semiautomated dedicated software. Patients, material and method: In 32 patients, regional cerebral blood flow was measured using 99mTc ethyl cysteinate dimer (ECD) and the three-headed SPECT camera MultiSPECT 3. MRI scans of the brain were performed using either a 0.2 T Open or a 1.5 T Sonata scanner. Twelve of the MRI data sets were acquired using a 3D T1-weighted MPRAGE sequence, 20 with a 2D acquisition technique and different echo sequences. Image fusion was performed on a Syngo workstation using an entropy-minimizing algorithm, operated by an experienced user of the software. The fusion results were classified. We measured the time needed for the automated fusion procedure and, where necessary, that for manual realignment after automated but insufficient fusion. Results: The mean time of the automated fusion procedure was 123 s; it was significantly shorter for the 2D than for the 3D MRI data sets. For four of the 2D data sets and two of the 3D data sets an optimal fit was reached using the automated approach. The remaining 26 data sets required manual correction. The sum of the time required for automated fusion and that needed for manual correction averaged 320 s (50-886 s). Conclusion: The fusion of 3D MRI data sets took significantly longer than that of the 2D MRI data. The automated fusion tool delivered an optimal fit in 20% of cases; in 80%, manual correction was necessary. Nevertheless, each of the 32 SPECT data sets could be merged with the corresponding MRI data in less than 15 min, which seems acceptable for clinical routine use.
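The quantity such an entropy-minimizing alignment drives down can be sketched directly: the joint entropy of the two images' intensity histogram (equivalently, mutual information is maximized). The sketch below computes the objective only; the rigid-transform optimizer of the Syngo tool is not reproduced:

```python
import numpy as np

def joint_entropy(a, b, bins=32):
    """Joint Shannon entropy of two aligned images' intensities.
    Registration by entropy minimization searches for the transform of
    `b` that minimizes this value; well-aligned images have a sharply
    peaked joint histogram and hence low joint entropy."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                                  # avoid log(0)
    return -(p * np.log2(p)).sum()

def mutual_information(a, b, bins=32):
    """MI(a, b) = H(a) + H(b) - H(a, b); higher means better alignment."""
    def entropy(x):
        h, _ = np.histogram(x.ravel(), bins=bins)
        p = h / h.sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    return entropy(a) + entropy(b) - joint_entropy(a, b, bins)
```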


2018 ◽  
Vol 26 (2) ◽  
pp. 131-143
Author(s):  
Marlinawati Marlinawati ◽  
Dewi Kusuma Wardani

The purpose of this research is to examine the influence of the quality of human resources, the utilization of information technology, and the internal control system on the timeliness of village government financial reporting in Gunungkidul Regency. This is causative research. The population is the village governments in Gunungkidul Regency, especially in Gedangsari subdistrict; the respondents were village heads and village apparatus. We used a questionnaire to collect data and multiple regression with SPSS version 16.0 to analyze it. We find that the quality of human resources and the internal control system have a positive influence on the timeliness of village government financial reporting. On the other hand, the utilization of information technology does not influence the timeliness of village government financial reporting. This implies that the quality of human resources and the internal control system can speed up the preparation of village government financial reporting.

