Nearest Neighbor-Based Clustering Algorithm for Large Data Sets

Author(s):  
Yadav Pankaj Kumar ◽  
Sriniwas Pandey ◽  
Mamata Samal ◽  
Mohanty Sraban Kumar
2011 ◽  
Vol 268-270 ◽  
pp. 811-816
Author(s):  
Yong Zhou ◽  
Yan Xing

Affinity Propagation (AP) is a new clustering algorithm based on the similarity matrix between pairs of data points, in which messages are exchanged between data points until a clustering result emerges. It is efficient and fast, and it can handle clustering on large data sets. However, traditional Affinity Propagation has several limitations. This paper introduces Affinity Propagation, analyzes its advantages and limitations in depth, and focuses on improvements to the algorithm: improving the similarity matrix, adjusting the preference and the damping factor, and combining it with other algorithms. Finally, it discusses the development of Affinity Propagation.
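A minimal sketch of the two tuning knobs the paper focuses on, the preference and the damping factor, using scikit-learn's AffinityPropagation; the synthetic data and parameter values are illustrative assumptions, not the paper's experiments:

```python
# Illustrative tuning of AP's preference and damping with scikit-learn.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# A lower (more negative) preference yields fewer exemplars/clusters;
# damping in [0.5, 1) trades convergence speed for message stability.
ap = AffinityPropagation(damping=0.9, preference=-50, random_state=0)
labels = ap.fit_predict(X)
print("clusters found:", len(ap.cluster_centers_indices_))
```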


Author(s):  
V. Suresh Babu ◽  
P. Viswanath ◽  
Narasimha M. Murty

Non-parametric methods like the nearest neighbor classifier (NNC) and Parzen-window based density estimation (Duda, Hart & Stork, 2000) are more general than parametric methods because they make no assumptions about the form of the probability distribution. Further, they show good performance in practice with large data sets. These methods estimate, either explicitly or implicitly, the probability density at a given point in a feature space by counting the number of points that fall in a small region around it. Popular classifiers which use this approach are the NNC and its variants like the k-nearest neighbor classifier (k-NNC) (Duda, Hart & Stork, 2000), while DBSCAN is a popular density-based clustering method (Han & Kamber, 2001) built on the same idea. The asymptotic error rate of the NNC is less than twice the Bayes error (Cover & Hart, 1967), and DBSCAN can find arbitrarily shaped clusters along with noisy outlier detection (Ester, Kriegel & Xu, 1996).

The most prominent difficulty in applying non-parametric methods to large data sets is their computational burden. The space and classification-time complexities of the NNC and k-NNC are O(n), where n is the training set size, and the time complexity of DBSCAN is O(n²). So these methods are not scalable to large data sets. Some remedies to reduce this burden are as follows. (1) Reduce the training set size by editing techniques that eliminate training patterns which are redundant in some sense (Dasarathy, 1991); the condensed NNC (Hart, 1968) is of this type. (2) Use only a few selected prototypes from the data set; the Leaders-Subleaders method and the l-DBSCAN method are of this type (Vijaya, Murthy & Subramanian, 2004; Viswanath & Rajwala, 2006). These two remedies can reduce the computational burden, but they can also degrade the method's performance. Using enriched prototypes can improve performance, as done in (Asharaf & Murthy, 2003), where the prototypes are derived using adaptive rough fuzzy set theory, and in (Suresh Babu & Viswanath, 2007), where the prototypes are used along with their relative weights.

Prototypes can be derived by employing a clustering method like the leaders method (Spath, 1980) or the k-means method (Jain, Dubes & Chen, 1987), which find a partition of the data set in which each block (cluster) is represented by a prototype called a leader, centroid, etc. But these prototypes cannot be used to estimate the probability density, since the density information present in the data set is lost while deriving them. This chapter proposes a modified leader clustering method called the counted-leader method, which, along with deriving the leaders, preserves the crucial density information in the form of a count that can be used to estimate densities. The chapter presents a fast and efficient nearest-prototype classifier called the counted k-nearest leader classifier (ck-NLC), which is on par with the conventional k-NNC but considerably faster. The chapter also presents a density-based clustering method called l-DBSCAN, which is shown to be a faster and scalable version of DBSCAN (Viswanath & Rajwala, 2006). Formally, under some assumptions, it is shown that the number of leaders is upper-bounded by a constant that is independent of the data set size and of the distribution from which the data are drawn.
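As a rough illustration of the counted-leader idea, the sketch below makes a single pass over the data, keeps one representative (leader) per tau-ball, and records how many points each leader absorbed, so the density information survives prototype selection; the threshold tau and the Euclidean metric are illustrative assumptions, not the chapter's exact parameterization:

```python
# One-pass counted-leaders sketch: each leader carries a count usable
# later for density estimation (e.g., count-weighted k nearest leaders).
import numpy as np

def counted_leaders(X, tau):
    leaders, counts = [], []
    for x in X:
        for i, l in enumerate(leaders):
            if np.linalg.norm(x - l) <= tau:
                counts[i] += 1      # point absorbed: density info preserved
                break
        else:
            leaders.append(x)       # no leader within tau: x becomes a leader
            counts.append(1)
    return np.array(leaders), np.array(counts)

X = np.random.rand(5000, 2)
L, c = counted_leaders(X, tau=0.05)
print(len(L), "leaders summarize", c.sum(), "points")
```

In a ck-NLC-style classifier, the counts would weight the k nearest leaders instead of treating every prototype as a single point.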


2016 ◽  
pp. 1220-1243
Author(s):  
Ilias K. Savvas ◽  
Georgia N. Sofianidou ◽  
M-Tahar Kechadi

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets; HDFS is a distributed file system that provides high-throughput access for data-driven applications, and MapReduce is a software framework for distributed computation over large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique demonstrate its efficiency; thus, HDFS and MapReduce can be applied to big data with very promising results.
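A minimal, framework-free sketch of one K-means iteration expressed in MapReduce style may help fix the idea: the map phase emits (nearest-centroid, point) pairs and the reduce phase averages each group into a new centroid. In a real Hadoop Streaming job these would be separate processes exchanging key-value lines; the plain functions below are an illustrative simplification, not the authors' implementation:

```python
# One MapReduce-style K-means round, simulated with plain functions.
import numpy as np
from collections import defaultdict

def map_phase(points, centroids):
    # Emit (nearest-centroid-index, point) pairs.
    for p in points:
        k = int(np.argmin(np.linalg.norm(centroids - p, axis=1)))
        yield k, p

def reduce_phase(pairs, n_clusters, old_centroids):
    # Group points by centroid key and average each group.
    groups = defaultdict(list)
    for k, p in pairs:
        groups[k].append(p)
    return np.array([np.mean(groups[k], axis=0) if groups[k] else old_centroids[k]
                     for k in range(n_clusters)])

X = np.random.rand(1000, 2)
centroids = X[np.random.choice(len(X), 3, replace=False)]
for _ in range(10):   # driver loop: one MapReduce round per iteration
    centroids = reduce_phase(map_phase(X, centroids), 3, centroids)
print(centroids)
```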


2016 ◽  
Vol 2016 ◽  
pp. 1-9 ◽  
Author(s):  
Janusz Dudczyk

More advanced recognition methods, which can recognize particular copies of radars of the same type, are called identification. The identification of radar devices is a more specialized task which requires methods based on the analysis of distinctive features. These features are extracted from the signals emitted by the identified devices. Such a process is called Specific Emitter Identification (SEI). Identifying radar emission sources with classic techniques based on the statistical analysis of basic measurable signal parameters, such as Radio Frequency, Amplitude, Pulse Width, or Pulse Repetition Interval, is not sufficient for SEI problems. This paper presents a method of hierarchical data clustering used in the process of radar identification. The Hierarchical Agglomerative Clustering Algorithm (HACA), based on the Generalized Agglomerative Scheme (GAS), implemented and used in the research is parameterized, so that results can be compared. The clustering results are presented as dendrograms. The grouping and identification results obtained with HACA are compared with other SEI methods in order to assess their usefulness and effectiveness for systems of the ESM/ELINT class.
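A hedged sketch of agglomerative clustering on emitter feature vectors, with the result rendered as a dendrogram as in the paper; the synthetic feature matrix and the average-linkage choice are assumptions for illustration, not the paper's exact parameterization of HACA/GAS:

```python
# Agglomerative clustering of emitter features, visualized as a dendrogram.
# Real SEI features would come from measured pulse parameters
# (RF, amplitude, PW, PRI) plus finer distinctive descriptors.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
features = rng.normal(size=(20, 4))          # 20 intercepts x 4 features

Z = linkage(features, method="average", metric="euclidean")
dendrogram(Z, labels=[f"emitter {i}" for i in range(20)])
plt.tight_layout()
plt.show()
```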


Author(s):  
M. EMRE CELEBI ◽  
HASSAN A. KINGRAVI

K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. Many of these methods, however, have superlinear complexity in the number of data points, making them impractical for large data sets. On the other hand, linear methods are often random and/or order-sensitive, which renders their results unrepeatable. Recently, Su and Dy proposed two highly successful hierarchical initialization methods named Var-Part and PCA-Part that are not only linear, but also deterministic (nonrandom) and order-invariant. In this paper, we propose a discriminant analysis based approach that addresses a common deficiency of these two methods. Experiments on a large and diverse collection of data sets from the UCI machine learning repository demonstrate that Var-Part and PCA-Part are highly competitive with one of the best random initialization methods to date, i.e., k-means++, and that the proposed approach significantly improves the performance of both hierarchical methods.
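A small sketch of the initialization sensitivity discussed above: single-run K-means with random seeding versus k-means++ on the same data. Var-Part and PCA-Part are not available in scikit-learn, so k-means++ stands in here as the strong baseline the paper compares against; the data set and cluster count are illustrative:

```python
# Compare the spread of final inertia across seeds for two initializers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=8, random_state=1)

for init in ("random", "k-means++"):
    inertias = [KMeans(n_clusters=8, init=init, n_init=1,
                       random_state=s).fit(X).inertia_ for s in range(10)]
    print(f"{init:10s} inertia: mean={np.mean(inertias):.1f} "
          f"std={np.std(inertias):.1f}")
```

A high standard deviation under random seeding, versus a low one under k-means++, is exactly the unrepeatability the abstract attributes to random linear methods.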


2019 ◽  
Vol 12 (2) ◽  
pp. 140
Author(s):  
Retsi Firda Maulina ◽  
Anik Djuraidah ◽  
Anang Kurnia

Poverty is a complex and multidimensional problem, which makes it a development priority. Applications of poverty modeling to discrete data are still few, as are applications of the Bayesian paradigm. The Bayes method is a parameter estimation approach that utilizes initial (prior) information together with sample information, so it can provide predictions of higher accuracy than classical methods. Bayesian inference using the INLA approach provides faster computation than MCMC and makes it possible to use large data sets. This study models poverty in Java using the Bayesian spatial probit with the INLA approach and three weighting matrices, namely K-Nearest Neighbor (KNN), Inverse Distance, and Exponential Distance. The results showed that the best model for poverty analysis in Java, the Bayesian SAR probit with INLA and the KNN weighting matrix, produced the highest classification accuracy, with a specificity of 85.45%, a sensitivity of 93.75%, and an accuracy of 89.92%.
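A hedged sketch of how the three spatial weighting matrices compared in the study (KNN, inverse distance, exponential distance) might be constructed from regional centroid coordinates; the random coordinates, the row standardization, and the decay parameter are illustrative assumptions, not the paper's exact specification:

```python
# Three candidate spatial weight matrices built from pairwise distances.
import numpy as np
from scipy.spatial.distance import cdist

coords = np.random.rand(30, 2)          # centroids of 30 hypothetical regions
D = cdist(coords, coords)

def knn_weights(D, k=5):
    W = np.zeros_like(D)
    for i, row in enumerate(D):
        W[i, np.argsort(row)[1:k + 1]] = 1.0   # skip self at index 0
    return W / W.sum(axis=1, keepdims=True)

def inverse_distance_weights(D):
    with np.errstate(divide="ignore"):
        W = np.where(D > 0, 1.0 / D, 0.0)      # zero weight on the diagonal
    return W / W.sum(axis=1, keepdims=True)

def exponential_distance_weights(D, alpha=1.0):
    W = np.exp(-alpha * D)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(axis=1, keepdims=True)
```

Each row-standardized matrix W would then enter the SAR probit term (e.g., rho * W * y_latent) when the model is fitted with INLA.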

