A FAST k-MEANS IMPLEMENTATION USING CORESETS
In this paper we develop an efficient implementation of a k-means clustering algorithm. The algorithm combines Lloyd's algorithm with random swapping of centers to avoid local minima, an approach proposed by Mount [30]. The novel feature of our algorithm is the use of coresets to speed it up. A coreset is a small weighted set of points that approximates the original point set with respect to the considered problem. We use a coreset construction described in [12]. Our algorithm first computes a solution on a very small coreset. Then, in each iteration, the previous solution is used as the starting solution on a refined, i.e. larger, coreset. To evaluate the performance of our algorithm we compare it with the algorithm KMHybrid [30] on typical 3D data sets for an image compression application and on artificially created instances. Our data sets consist of 300,000 to 4.9 million points. Our algorithm outperforms KMHybrid on most of these input instances. Additionally, the quality of the solutions computed by our algorithm varies significantly less than that of KMHybrid. We conclude that the use of coresets has two effects. First, it can speed up algorithms significantly. Second, in variants of Lloyd's algorithm, it reduces the dependency on the starting solution and thus makes the algorithm more stable. Finally, we propose the use of coresets as a heuristic to approximate the average silhouette coefficient of clusterings. The average silhouette coefficient is a measure of the quality of a clustering that is independent of the number of clusters k. Hence, it can be used to compare the quality of clusterings for different values of k. To show the applicability of our approach we computed clusterings and approximate average silhouette coefficients for k = 1, …, 100 for our input instances, and we discuss the performance of our algorithm in detail.
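The coarse-to-fine idea summarized above (solve on a very small weighted point set, then reuse the solution as the starting point on a larger one) can be illustrated with a minimal sketch. It rests on loudly stated assumptions: a uniform random sample with weight n/m per point stands in for a coreset, which is not the construction of [12]; the random swapping of centers is omitted; and the function names and the parameters `start_size`, `max_size`, and `iters` are hypothetical and chosen only for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost(points, weights, centers):
    """Weighted k-means cost: each point pays weight * squared distance to its nearest center."""
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def weighted_lloyd(points, weights, centers, iters=10):
    """Run `iters` Lloyd iterations on weighted points, starting from `centers`."""
    centers = [tuple(c) for c in centers]
    dim = len(points[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in centers]
        mass = [0.0] * len(centers)
        # Assignment step: add each weighted point to its nearest center.
        for p, w in zip(points, weights):
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            mass[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Update step: weighted centroid; an empty cluster keeps its old center.
        centers = [
            tuple(x / mass[j] for x in sums[j]) if mass[j] > 0 else centers[j]
            for j in range(len(centers))
        ]
    return centers


def coarse_to_fine_kmeans(points, k, start_size=50, max_size=800, seed=0):
    """Refine centers on weighted samples of doubling size.

    ASSUMPTION: a uniform sample with weight n/m per point is only a
    stand-in for a coreset; it is not the construction used in the paper.
    """
    rng = random.Random(seed)
    n = len(points)
    centers = rng.sample(points, k)  # arbitrary starting solution
    size = start_size
    while True:
        m = min(size, n)
        sample = rng.sample(points, m)
        weights = [n / m] * m  # each sampled point represents n/m input points
        # Warm start: the previous solution seeds Lloyd on the larger sample.
        centers = weighted_lloyd(sample, weights, centers)
        if size >= max_size or m == n:
            return centers
        size *= 2
```

Calling `coarse_to_fine_kmeans(points, k)` on a list of equal-length tuples returns k centers as tuples. Most Lloyd iterations then run on small samples, which is where the speed-up would come from in a scheme of this kind; the quality of the result still depends on how well the weighted sample represents the input.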