Bounds for discrepancies in the Hamming space

2021 ◽  
pp. 101552
Author(s):  
Alexander Barg ◽  
Maxim Skriganov
Author(s):  
Zhansheng Jiang ◽  
Lingxi Xie ◽  
Xiaotie Deng ◽  
Weiwei Xu ◽  
Jingdong Wang

2021 ◽  
pp. 772-784
Author(s):  
Lifang Wu ◽  
Yukun Chen ◽  
Wenjin Hu ◽  
Ge Shi

Author(s):  
Vladimir Mic ◽  
Pavel Zezula

This chapter focuses on data searching, which is nowadays mostly based on similarity. Similarity search is challenging due to its computational complexity and the fact that similarity is subjective and context-dependent. The authors assume the metric space model of similarity, defined by a domain of objects and a metric function that measures the dissimilarity of object pairs. The volume of contemporary data is large, so the time efficiency of similarity query execution is essential. This chapter investigates transformations of the metric space to the Hamming space that decrease the memory and computational cost of the search. Various challenges of similarity search with sketches in the Hamming space are addressed, including the definition of the sketching transformation and efficient search algorithms that exploit sketches to speed up searching. The indexing of the Hamming space and a heuristic to facilitate the selection of a suitable sketching technique for a given application are also considered.
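To make the sketching idea concrete, here is a minimal Python sketch of one common transformation of this kind: random-hyperplane sketching of a Euclidean domain into the Hamming space. The function names and parameters are illustrative assumptions; the chapter's own sketching techniques may differ.

```python
import numpy as np

def build_sketcher(dim, n_bits, seed=0):
    """Random-hyperplane sketching: one illustrative way to map a
    metric (here Euclidean) space into the Hamming space."""
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((n_bits, dim))
    def sketch(x):
        # One bit per hyperplane: which side of the hyperplane x falls on.
        return (hyperplanes @ x > 0).astype(np.uint8)
    return sketch

def hamming(a, b):
    # Number of differing bits; a cheap proxy for the original metric.
    return int(np.count_nonzero(a != b))

# Usage: sketches let us rank and filter candidates cheaply before any
# exact (expensive) metric evaluations.
sketch = build_sketcher(dim=128, n_bits=64)
rng = np.random.default_rng(1)
query = rng.standard_normal(128)
data = rng.standard_normal((1000, 128))
sk_q = sketch(query)
candidates = sorted(range(len(data)),
                    key=lambda i: hamming(sk_q, sketch(data[i])))[:10]
```

The design trade-off is the usual one: more bits give a sketch that better preserves the original neighborhood structure, at the cost of memory and Hamming-distance evaluation time.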


Author(s):  
Jianbin Qin ◽  
Chuan Xiao ◽  
Yaoshu Wang ◽  
Wei Wang ◽  
Xuemin Lin ◽  
...  

2020 ◽  
Vol 21 (S18) ◽  
Author(s):  
David S. Campo ◽  
Yury Khudyakov

Abstract

Background: In molecular epidemiology, comparison of intra-host viral variants among infected persons is frequently used for tracing transmissions in the human population and detecting viral infection outbreaks. Application of Ultra-Deep Sequencing (UDS) immensely increases the sensitivity of transmission detection but brings considerable computational challenges when comparing all pairs of sequences. We developed a new population comparison method based on convex hulls in Hamming space. We applied this method to a large set of UDS samples obtained from unrelated cases infected with hepatitis C virus (HCV) and compared its performance with three previously published methods.

Results: The convex hull in Hamming space is a data structure that provides information on: (1) the average Hamming distance within a set; (2) the average Hamming distance between two sets; (3) the closeness centrality of each sequence; and (4) the lower and upper bounds of all pairwise distances among the members of two sets. This filtering strategy rapidly and correctly removes 96.2% of all pairwise HCV sample comparisons, outperforming all previous methods. The convex hull distance (CHD) algorithm showed variable performance depending on the sequence heterogeneity of the studied populations in real and simulated datasets, suggesting the possibility of using clustering methods to improve performance. To address this issue, we developed a new clustering algorithm, k-hulls, that reduces the heterogeneity of the convex hull. This efficient algorithm is an extension of the k-means algorithm and can be used with any type of categorical data. It is 6.8 times more accurate than k-modes, a previously developed clustering algorithm for categorical data.

Conclusions: CHD is a fast and efficient filtering strategy that massively reduces the computational burden of pairwise comparison among large samples of sequences, thus aiding the calculation of transmission links among infected individuals using threshold-based methods. In addition, the convex hull efficiently yields important summary metrics for intra-host viral populations.
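The filtering idea can be illustrated with a simplified Python sketch: summarize each sequence set by a per-position consensus and a Hamming radius, then use the triangle inequality to bound all pairwise distances between two sets, skipping the full all-pairs comparison when the lower bound already exceeds the transmission threshold. This consensus-plus-radius summary is an assumption made for illustration, not the paper's exact convex hull construction.

```python
def hamming(a, b):
    # Hamming distance between two equal-length sequences.
    return sum(x != y for x, y in zip(a, b))

def summarize(seqs):
    """Summarize a set of equal-length sequences by a per-position
    majority consensus and the maximum Hamming radius around it."""
    consensus = "".join(max(set(col), key=col.count) for col in zip(*seqs))
    radius = max(hamming(s, consensus) for s in seqs)
    return consensus, radius

def bounds(sum_a, sum_b):
    """Triangle-inequality bounds on every pairwise distance between
    a member of set A and a member of set B."""
    (ca, ra), (cb, rb) = sum_a, sum_b
    d = hamming(ca, cb)
    return max(0, d - ra - rb), d + ra + rb

# Usage: if even the most similar possible pair of sequences is farther
# apart than the threshold, the sample pair can be filtered out without
# computing any of the individual pairwise distances.
A = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
B = ["TGCATGCA", "TGCATGCT", "TGCTTGCA"]
lo, hi = bounds(summarize(A), summarize(B))
threshold = 3
if lo > threshold:
    print("filtered: no pair can be within the threshold")
```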


2014 ◽  
Vol 651-653 ◽  
pp. 2168-2171
Author(s):  
Qin Zhen Guo ◽  
Zhi Zeng ◽  
Shu Wu Zhang ◽  
Yuan Zhang ◽  
Gui Xuan Zhang

Hashing, which maps data into binary codes in Hamming space, has attracted increasing attention for approximate nearest neighbor search due to its high efficiency and reduced storage cost. K-means hashing (KH) is a novel hashing method that first quantizes the data by codewords and then uses the indices of the codewords to encode the data. However, in KH, only the codewords are updated to minimize the quantization error and the affinity error, while the indices of the codewords remain unchanged after initialization. In this paper, we propose an optimized k-means hashing (OKH) method to encode data by binary codes. In our method, we simultaneously optimize the codewords and their indices to minimize the quantization error and the affinity error. Our OKH method can find both the optimal codewords and the optimal indices, and the resulting binary codes in Hamming space better preserve the original neighborhood structure of the data. Moreover, OKH can be generalized to a product space. Extensive experiments have verified the superiority of OKH over KH and other state-of-the-art hashing methods.
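As a rough illustration of the encoding step that KH and OKH share, the Python sketch below runs plain k-means and encodes each point by the binary index of its nearest codeword, so that Hamming distance between codes stands in for distance in the original space. All names are illustrative, and the index optimization that distinguishes OKH is only noted in a comment, not implemented.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k codewords."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest codeword, then recenter.
        assign = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                C[j] = X[assign == j].mean(0)
    return C

def encode(X, C):
    """Encode each point by the binary index of its nearest codeword.
    With k = 2**b codewords, each code is a b-bit Hamming-space point."""
    idx = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
    b = int(np.log2(len(C)))
    return ((idx[:, None] >> np.arange(b)) & 1).astype(np.uint8)

# Usage: 16 codewords yield 4-bit codes. In KH only the codewords are
# refined; OKH would additionally permute the index-to-codeword mapping
# so Hamming distance between codes tracks codeword distance more closely.
X = np.random.default_rng(1).standard_normal((500, 16))
C = kmeans(X, k=16)
codes = encode(X, C)
```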

