The Influence of Hubness on NN-Descent

2019
Vol 28 (06)
pp. 1960002
Author(s):
Brankica Bratić
Michael E. Houle
Vladimir Kurbalija
Vincent Oria
Miloš Radovanović

The K-nearest neighbor graph (K-NNG) is a data structure used by many machine-learning algorithms. Naive computation of the K-NNG has quadratic time complexity, which in many cases is not efficient enough, creating the need for fast and accurate approximation algorithms. NN-Descent is one such algorithm that is highly efficient, but it has a major drawback: its K-NNG approximations are accurate only on data of low intrinsic dimensionality. This paper presents an experimental analysis of this behavior and investigates possible solutions. Experimental results show that there is a link between the performance of NN-Descent and the phenomenon of hubness, defined as the tendency of intrinsically high-dimensional data to contain hubs – points with large in-degrees in the K-NNG. First, we explain how the presence of the hubness phenomenon causes poor NN-Descent performance. In light of that, we propose four NN-Descent variants to alleviate the observed negative influence of hubs. By evaluating the proposed approaches on several real and synthetic data sets, we conclude that our approaches are more accurate, but often at the cost of higher scan rates.
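As a minimal illustration of the hubness phenomenon described above (a sketch, not the authors' code), the snippet below builds the exact K-NNG by brute force and inspects its in-degree distribution; on intrinsically high-dimensional data, a few points (hubs) accumulate in-degrees far above the mean, which is always exactly K.

```python
import numpy as np

def knn_graph(X, K):
    """Naive O(n^2) K-NNG: for each point, the indices of its K nearest neighbors."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                 # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :K]

def in_degrees(nn):
    """In-degree of each point in the K-NNG; hubs are points with large in-degree."""
    return np.bincount(nn.ravel(), minlength=nn.shape[0])

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 100))   # i.i.d. Gaussian: intrinsically high-dimensional
deg = in_degrees(knn_graph(X, K=10))
print("mean in-degree:", deg.mean())  # always exactly K
print("max in-degree:", deg.max())    # hubs push this far above K in high dimensions
```

The skewness of this in-degree distribution is the usual quantitative measure of hubness.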

2015
pp. 125-138
Author(s):
I. V. Goncharenko

In this article we propose a new method of non-hierarchical cluster analysis based on the k-nearest-neighbor graph and discuss it with respect to vegetation classification. The method of k-nearest neighbor (k-NN) classification was originally developed in 1951 (Fix, Hodges, 1951). Later, the term "k-NN graph" and several k-NN clustering algorithms appeared (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an "excessive" graph, a so-called hypergraph, and then truncate it to subgraphs by partitioning and coarsening the hypergraph. We developed a different strategy: "upward" clustering, which assembles one cluster after another sequentially. To date, graph-based cluster analysis has not been considered for the classification of vegetation datasets.
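The "upward" assembly procedure is only outlined above, so the following is a generic stand-in, not Goncharenko's algorithm: it clusters by linking two points only when each lies among the other's k nearest neighbors (a mutual k-NN graph) and reading off connected components.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def mutual_knn_clusters(X, k):
    """Generic k-NN-graph clustering: keep only mutual k-NN edges,
    then treat each connected component as a cluster."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                 # exclude self
    nn = np.argsort(d, axis=1)[:, :k]                           # k-NN lists
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), nn.ravel()] = True            # directed k-NN edges
    A = A & A.T                                                 # keep only mutual edges
    _, labels = connected_components(csr_matrix(A), directed=False)
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(np.unique(mutual_knn_clusters(X, k=5)))  # two separated blobs -> two main labels
```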


2018
Vol 74
pp. 1-14
Author(s):
Yikun Qin
Zhu Liang Yu
Chang-Dong Wang
Zhenghui Gu
Yuanqing Li

Author(s):  
Bao Bing-Kun
Yan Shuicheng

Graph-based learning provides a useful approach for modeling data in image annotation problems. In this chapter, the authors introduce how to construct a region-based graph to annotate large-scale multi-label images. It is well recognized that analysis at the semantic region level may greatly improve image annotation performance compared to analysis at the whole-image level. However, the region-level approach increases the data scale by several orders of magnitude and poses new challenges to most existing algorithms. To this end, each image is first encoded as a Bag-of-Regions based on multiple image segmentations. Then, all image regions are assembled into a large k-nearest-neighbor graph using an efficient Locality-Sensitive Hashing (LSH) method. Finally, a sparse and region-aware image-based graph is fed into the multi-label extension of the entropic graph-regularized semi-supervised learning algorithm (Subramanya & Bilmes, 2009). In combination, these steps naturally yield the capability to handle large-scale datasets. Extensive experiments on the NUS-WIDE (260k images) and COREL-5k datasets validate the effectiveness and efficiency of the framework for region-aware and scalable multi-label propagation.
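The following sketch illustrates the general idea of LSH-accelerated k-NN graph construction, not the chapter's exact pipeline; the hash family (random hyperplanes), the n_bits parameter, and the use of a single hash table are all assumptions made for brevity.

```python
import numpy as np
from collections import defaultdict

def lsh_knn_graph(X, k, n_bits=8, seed=0):
    """Approximate k-NN graph: hash points into buckets with random-projection
    LSH, then run exact k-NN search only within each bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))                # random hyperplanes
    codes = (X @ planes > 0).astype(int) @ (1 << np.arange(n_bits))   # hash code per point
    buckets = defaultdict(list)
    for i, c in enumerate(codes):
        buckets[int(c)].append(i)
    nn = [[] for _ in range(len(X))]
    for idx in buckets.values():                 # search only within one bucket
        pts = X[idx]
        d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)              # exclude self
        order = np.argsort(d, axis=1)[:, :k]
        for row, i in enumerate(idx):
            nn[i] = [idx[j] for j in order[row] if idx[j] != i]
    return nn                                    # quality depends on bucket sizes

X = np.random.default_rng(2).standard_normal((2000, 64))
graph = lsh_knn_graph(X, k=10)
print(sum(len(v) for v in graph) / len(graph))   # average neighbors found per point
```

Restricting distance computations to each bucket replaces the O(n^2) all-pairs search with many small searches; production systems typically use several hash tables to recover neighbors split across bucket boundaries.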


Author(s):  
Amit Saxena
John Wang

This paper presents a two-phase scheme that selects a reduced number of features from a dataset using a Genetic Algorithm (GA) and tests the classification accuracy (CA) of the dataset with the reduced feature set. In the first phase of the proposed work, an unsupervised approach to selecting a subset of features is applied. The GA is used to stochastically select a reduced number of features, with the Sammon error as the fitness function; different subsets of features are obtained. In the second phase, each reduced feature set is used to test the CA of the dataset. The CA of a dataset is validated using the supervised k-nearest neighbor (k-NN) algorithm. The novelty of the proposed scheme is that each reduced feature set obtained in the first phase is investigated for CA using k-NN classification with different Minkowski metrics, i.e., non-Euclidean norms, instead of the conventional Euclidean norm (L2). Final results are presented with extensive simulations on seven real and one synthetic data sets. The investigation reveals that using different norms produces better CA, and hence offers scope for better feature subset selection.
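The second-phase comparison can be reproduced in spirit with scikit-learn, whose KNeighborsClassifier exposes the Minkowski order p directly; the snippet below is illustrative only, using load_iris and its first two features as a stand-in for a GA-selected feature subset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]                        # stand-in for a GA-selected reduced feature set
for p in (1, 2, 3, 4):              # p=2 is the conventional Euclidean norm (L2)
    knn = KNeighborsClassifier(n_neighbors=5, p=p)   # Minkowski metric of order p
    acc = cross_val_score(knn, X, y, cv=5).mean()    # 5-fold cross-validated CA
    print(f"L{p} norm: CA = {acc:.3f}")
```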

