Speeding-up the prototype based kernel k-means clustering method for large data sets

Non-parametric methods like the nearest neighbor classifier (NNC) and Parzen-window based density estimation (Duda, Hart & Stork, 2000) are more general than parametric methods because they make no assumptions about the form of the probability distribution. They also perform well in practice, especially with large data sets. These methods, either explicitly or implicitly, estimate the probability density at a given point in a feature space by counting the number of points that fall in a small region around that point. Popular classifiers that use this approach are the NNC and its variants, such as the k-nearest neighbor classifier (k-NNC) (Duda, Hart & Stork, 2000), while DBSCAN is a popular density based clustering method that uses it (Han & Kamber, 2001). The asymptotic error rate of NNC is less than twice the Bayes error (Cover & Hart, 1967), and DBSCAN can find arbitrarily shaped clusters while also detecting noisy outliers (Ester, Kriegel & Xu, 1996). The most prominent difficulty in applying non-parametric methods to large data sets is their computational burden. The space and classification time complexities of NNC and k-NNC are O(n), where n is the training set size. The time complexity of DBSCAN is O(n^2). So these methods do not scale to large data sets. Some remedies that reduce this burden are as follows. (1) Reduce the training set size by editing techniques that eliminate training patterns which are redundant in some sense (Dasarathy, 1991). For example, the condensed NNC (Hart, 1968) is of this type. (2) Use only a few selected prototypes from the data set. For example, the Leaders-subleaders method and the l-DBSCAN method are of this type (Vijaya, Murthy & Subramanian, 2004; Viswanath & Rajwala, 2006). These two remedies can reduce the computational burden, but they can also degrade the performance of the method.
Using enriched prototypes can improve the performance, as done by Asharaf & Murthy (2003), where the prototypes are derived using adaptive rough fuzzy set theory, and by Suresh Babu & Viswanath (2007), where the prototypes are used along with their relative weights. Using a few selected prototypes can reduce the computational burden. Prototypes can be derived by a clustering method such as the leaders method (Spath, 1980) or the k-means method (Jain, Dubes, & Chen, 1987), which finds a partition of the data set where each block (cluster) of the partition is represented by a prototype called a leader, a centroid, etc. But these prototypes cannot be used to estimate the probability density, since the density information present in the data set is lost while deriving them. The chapter proposes a modified leader clustering method, called the counted-leader method, which derives the leaders and also preserves the crucial density information in the form of a count that can be used in estimating the densities. The chapter presents a fast and efficient nearest prototype based classifier called the counted k-nearest leader classifier (ck-NLC), which performs on par with the conventional k-NNC but is considerably faster. The chapter also presents a density based clustering method called l-DBSCAN, which is shown to be a faster and scalable version of DBSCAN (Viswanath & Rajwala, 2006). Formally, under some assumptions, it is shown that the number of leaders is upper-bounded by a constant that is independent of the data set size and of the distribution from which the data set is drawn.
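The idea of keeping a count with each leader can be illustrated with a minimal sketch. This is not the chapter's implementation: the function names `counted_leaders` and `classify`, the distance threshold `tau`, the per-class grouping of leaders, and the count-weighted vote among the k nearest leaders are all assumptions made here for illustration.

```python
import math
from collections import Counter


def counted_leaders(points, labels, tau):
    """Single-pass leader clustering that also records follower counts.

    Assumption: a point joins the first existing leader of the same class
    that lies within distance tau; otherwise it becomes a new leader.
    Returns a list of [leader_point, label, count] entries.
    """
    leaders = []
    for x, y in zip(points, labels):
        for leader in leaders:
            if leader[1] == y and math.dist(leader[0], x) <= tau:
                leader[2] += 1  # keep the density information as a count
                break
        else:
            # No leader is close enough, so x starts a new cluster.
            leaders.append([x, y, 1])
    return leaders


def classify(leaders, x, k):
    """Hypothetical count-weighted vote among the k nearest leaders."""
    nearest = sorted(leaders, key=lambda ld: math.dist(ld[0], x))[:k]
    votes = Counter()
    for _, label, count in nearest:
        votes[label] += count
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (0.05, 0.05)]
    lbl = ["a", "a", "a", "b", "b", "a"]
    model = counted_leaders(pts, lbl, tau=0.5)
    print(len(model), "leaders for", len(pts), "points")
    print(classify(model, (4.8, 4.9), k=1))
```

Because classification only scans the leaders, its cost depends on the number of leaders rather than on n, while the counts retain how many training points each leader stands for.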

Scaling up Kernel Grower Clustering Method for Large Data Sets via Core-sets

Acta Automatica Sinica, 2008, Vol 34 (3), pp. 376-382
DOI: 10.3724/sp.j.1004.2008.00376
Cited By ~ 10

Author(s): Liang CHANG, Xiao-Ming DENG, Sui-Wu ZHENG, Yong-Qing WANG

Keyword(s): Large Data, Scaling Up, Large Data Sets, Data Sets, Clustering Method, Core Sets

A fast approximate kernel k-means clustering method for large data sets

2011 IEEE Recent Advances in Intelligent Computational Systems, 2011
DOI: 10.1109/raics.2011.6069372
Cited By ~ 3

Author(s): T. Hitendra Sarma, P. Viswanath, B. Eswara Reddy

Keyword(s): Large Data, Large Data Sets, Data Sets, Clustering Method

Grid-clustering: an efficient hierarchical clustering method for very large data sets

Proceedings of 13th International Conference on Pattern Recognition, 1996
DOI: 10.1109/icpr.1996.546732
Cited By ~ 63

Author(s): E. Schikuta

Keyword(s): Hierarchical Clustering, Large Data, Large Data Sets, Data Sets, Clustering Method

Clustering Mixed Incomplete Data

Heuristic and Optimization for Knowledge Discovery, 2011, pp. 89-106
DOI: 10.4018/978-1-930708-26-6.ch006

Author(s): Jose Ruiz-Shulcloper, Guillermo Sanchez-Diaz, Mongi A. Abidi

Keyword(s): Pattern Recognition, Incomplete Data, Clustering Algorithms, Large Data, Unsupervised Classification, Large Data Sets, Data Sets, Classification Problems, Clustering Method, Combinatorial Pattern

In this chapter, we present the possibilities of Logical Combinatorial Pattern Recognition (LCPR) tools for clustering large and very large mixed incomplete data (MID) sets. Our starting point is that large and very large data sets with complex structures exist in practice. Our research is directed towards applying the methods, techniques and, in general, the philosophy of LCPR to the solution of supervised and unsupervised classification problems. We introduce the GLC and DGLC clustering algorithms and the GLC+ clustering method for processing large and very large mixed incomplete data sets.

An Efficient and Fast Parzen-Window Density Based Clustering Method for Large Data Sets

2008 First International Conference on Emerging Trends in Engineering and Technology, 2008
DOI: 10.1109/icetet.2008.166
Cited By ~ 2

Author(s): V. Suresh Babu, P. Viswanath

Keyword(s): Large Data, Large Data Sets, Data Sets, Clustering Method, Parzen Window, Density Based Clustering

An Agglomerative Clustering Method for Large Data Sets

International Journal of Computer Applications, 2014, Vol 92 (14), pp. 1-7
DOI: 10.5120/16074-4952
Cited By ~ 6

Author(s): Omar Kettani, Faycal Ramdani, Benaissa Tadili

Keyword(s): Large Data, Large Data Sets, Data Sets, Agglomerative Clustering, Clustering Method

A Fast Spectral Clustering Method Based on Growing Vector Quantization for Large Data Sets

Advanced Data Mining and Applications - Lecture Notes in Computer Science, 2013, pp. 25-33
DOI: 10.1007/978-3-642-53917-6_3

Author(s): Xiujun Wang, Xiao Zheng, Feng Qin, Baohua Zhao

Keyword(s): Vector Quantization, Spectral Clustering, Large Data, Large Data Sets, Data Sets, Clustering Method, Spectral Clustering Method

Rough-DBSCAN: A fast hybrid density based clustering method for large data sets

Pattern Recognition Letters, 2009, Vol 30 (16), pp. 1477-1488
DOI: 10.1016/j.patrec.2009.08.008
Cited By ~ 69

Author(s): P. Viswanath, V. Suresh Babu

Keyword(s): Large Data, Large Data Sets, Data Sets, Clustering Method, Density Based Clustering

An example of spectrum imaging used for comparison of EELS quantitative analysis techniques on Al-Li

Proceedings, annual meeting, Electron Microscopy Society of America, 1991, Vol 49, pp. 726-727
DOI: 10.1017/s042482010008794x

Author(s): John A. Hunt

Keyword(s): Quantitative Analysis, Large Data, Difference Spectrum, Large Data Sets, Foil Thickness, Data Sets, Analysis Techniques, Spectrum Imaging, Normal Spectrum, Electron Energy Loss

Spectrum-imaging is a useful technique for comparing different processing methods on very large data sets which are identical for each method. This paper is concerned with comparing methods of electron energy-loss spectroscopy (EELS) quantitative analysis on the Al-Li system. The spectrum-image analyzed here was obtained from an Al-10at%Li foil aged to produce δ' precipitates that can span the foil thickness. Two 1024-channel EELS spectra offset in energy by 1 eV were recorded and stored at each pixel in the 80 × 80 spectrum-image (25 Mbytes). An energy range of 39-89 eV (20 channels/eV) is represented. During processing, the spectra are either subtracted to create an artifact-corrected difference spectrum, or the energy offset is numerically removed and the spectra are added to create a normal spectrum. The spectrum-images are processed into 2D floating-point images using methods and software described in [1].
