Theoretical analysis on pruning nearest neighbor candidates by locality sensitive hashing

Author(s):  
T Mutohy ◽  
M Iwamura ◽  
K Kise
Author(s):  
Bao Bing-Kun ◽  
Yan Shuicheng

Graph-based learning provides a useful approach for modeling data in image annotation problems. In this chapter, the authors introduce how to construct a region-based graph to annotate large-scale multi-label images. It has been well recognized that analysis at the semantic region level may greatly improve image annotation performance compared to analysis at the whole-image level. However, the region-level approach increases the data scale by several orders of magnitude and poses new challenges to most existing algorithms. To this end, each image is first encoded as a Bag-of-Regions based on multiple image segmentations. Then, all image regions are assembled into a large k-nearest-neighbor graph using an efficient Locality Sensitive Hashing (LSH) method. Finally, a sparse and region-aware image-based graph is fed into the multi-label extension of the entropic graph regularized semi-supervised learning algorithm (Subramanya & Bilmes, 2009). In combination, these steps naturally yield the capability to handle large-scale datasets. Extensive experiments on the NUS-WIDE (260k images) and COREL-5k datasets validate the effectiveness and efficiency of the framework for region-aware and scalable multi-label propagation.
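To make the graph-construction step concrete, the following is a minimal single-machine sketch of building an approximate k-nearest-neighbor graph with random-projection (sign) LSH. It is not the authors' pipeline: the Bag-of-Regions descriptors are assumed to be precomputed, and the numbers of hyperplanes and tables are illustrative defaults.

```python
import numpy as np
from collections import defaultdict

def lsh_knn_graph(features, k=10, n_planes=12, n_tables=4, seed=0):
    """Approximate kNN graph via random-projection (sign) LSH.

    features : (n, d) array of region descriptors (assumed precomputed).
    Returns a dict mapping each region index to up to k nearest
    candidates found among points sharing a hash bucket with it.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape

    # Build several hash tables; each table hashes a vector to the
    # sign pattern of its projections onto random hyperplanes.
    tables = []
    for _ in range(n_tables):
        planes = rng.standard_normal((n_planes, d))
        codes = (features @ planes.T) > 0            # (n, n_planes) booleans
        buckets = defaultdict(list)
        for i, code in enumerate(codes):
            buckets[code.tobytes()].append(i)
        tables.append((codes, buckets))

    # Rank only the candidates that collide with point i in at least
    # one table, instead of scanning all n points.
    neighbors = {}
    for i in range(n):
        cand = set()
        for codes, buckets in tables:
            cand.update(buckets[codes[i].tobytes()])
        cand.discard(i)
        if cand:
            cand = np.fromiter(cand, dtype=int)
            dists = np.linalg.norm(features[cand] - features[i], axis=1)
            neighbors[i] = cand[np.argsort(dists)[:k]].tolist()
        else:
            neighbors[i] = []
    return neighbors
```

Increasing n_planes makes buckets smaller (faster queries, lower recall), while adding more tables recovers recall at the cost of more candidates per query.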


Author(s):  
Sikha Bagui ◽  
Arup Kumar Mondal ◽  
Subhash Bagui

In this work the authors present a parallel k-nearest-neighbor (kNN) algorithm that uses locality sensitive hashing to preprocess the data before it is classified with kNN in Hadoop's MapReduce framework, and compare it with the sequential (conventional) implementation. Using locality sensitive hashing's similarity measure with kNN, the iterative procedure to classify a data object is performed within a hash bucket rather than over the whole data set, greatly reducing the computation time needed for classification. Several experiments showed that the parallel implementation outperformed the sequential implementation on very large datasets. The study also experimented with a few map-side and reduce-side optimization features for the parallel implementation and presented some optimal map-side and reduce-side parameters. Among the map-side parameters, the block size and input split size were varied; among the reduce-side parameters, the number of planes was varied, and their effects were studied.
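The bucket-restricted search idea can be sketched outside Hadoop as follows; the MapReduce partitioning and the block-size, input-split and plane-count tuning discussed above are not reproduced, and the function and parameter names are illustrative.

```python
import numpy as np
from collections import Counter, defaultdict

def signed_projection_code(x, planes):
    """Hash a vector to a bit pattern via random hyperplanes (sign LSH)."""
    return tuple((planes @ x) > 0)

def lsh_knn_classify(train_X, train_y, test_X, k=5, n_planes=8, seed=0):
    """Classify each test point with kNN restricted to its LSH bucket.

    A query falls back to a full scan only when its bucket is empty,
    so a typical query touches a small fraction of the training set.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, train_X.shape[1]))

    buckets = defaultdict(list)
    for i, x in enumerate(train_X):
        buckets[signed_projection_code(x, planes)].append(i)

    predictions = []
    for x in test_X:
        idx = buckets.get(signed_projection_code(x, planes), [])
        if not idx:                      # empty bucket: fall back to all points
            idx = list(range(len(train_X)))
        idx = np.asarray(idx)
        dists = np.linalg.norm(train_X[idx] - x, axis=1)
        nearest = idx[np.argsort(dists)[:k]]
        predictions.append(Counter(train_y[j] for j in nearest).most_common(1)[0][0])
    return predictions
```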


Author(s):  
Ercan Canhasi

Text modeling and sentence selection are the fundamental steps of a typical extractive document summarization algorithm. The common text modeling method connects a pair of sentences based on their similarity. Even though it can effectively represent the sentence similarity graph of the given document(s), its big drawback is a large time complexity of $O(n^2)$, where n is the number of sentences. This quadratic time complexity makes it impractical for large documents. In this paper we propose fast approximation algorithms for text modeling and sentence selection. Our text modeling algorithm reduces the time complexity to near-linear time by rapidly finding the most similar sentences to form the sentence similarity graph. To do so, we utilize Locality-Sensitive Hashing, a fast algorithm for approximate nearest neighbor search. For the sentence selection step we propose a simple, memory-access-efficient node ranking method based on sequentially scanning only the neighborhood arrays. Experimentally, we show that sacrificing a rather small percentage of recall and precision in the quality of the produced summary reduces the quadratic time complexity to sub-linear. We see great potential for the proposed method in text summarization for mobile devices and in big text data summarization for the Internet of Things on the cloud. In our experiments, besides evaluating the presented method on the standard general and query-based multi-document summarization tasks, we also tested it on a few alternative summarization tasks, including general and query-based, timeline, and comparative summarization.
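A rough sketch of the near-linear graph construction, using MinHash signatures with LSH banding over word sets, plus a degree-based stand-in for the node ranking step; the tokenization, banding parameters and ranking below are simplifications, not the paper's exact method.

```python
import re
from collections import defaultdict

def minhash_signature(tokens, n_hashes=64):
    """MinHash signature of a token set, simulating permutations with
    salted built-in hashing (stable within one interpreter run)."""
    return [min(hash((seed, t)) for t in tokens) for seed in range(n_hashes)]

def sentence_graph(sentences, n_hashes=64, bands=16):
    """Approximate sentence similarity graph via MinHash + LSH banding.

    Sentences whose signatures agree on at least one band are linked,
    so each sentence is compared only against likely near-duplicates
    instead of all n sentences.
    """
    rows = n_hashes // bands
    token_sets = [set(re.findall(r"\w+", s.lower())) or {"<empty>"} for s in sentences]
    sigs = [minhash_signature(t, n_hashes) for t in token_sets]

    neighbors = defaultdict(set)
    for b in range(bands):
        buckets = defaultdict(list)
        for i, sig in enumerate(sigs):
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(i)
        for members in buckets.values():
            for i in members:
                neighbors[i].update(j for j in members if j != i)
    return neighbors

def rank_sentences(neighbors, n):
    """Cheap node ranking: score each sentence by the size of its
    neighborhood array, which requires only sequential scans."""
    return sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)
```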


2017 ◽  
Author(s):  
Khanh Dao Duc ◽  
Zain H. Saleem ◽  
Yun S. Song

The Totally Asymmetric Exclusion Process (TASEP) is a classical stochastic model for describing the transport of interacting particles, such as ribosomes moving along the mRNA during translation. Although this model has been widely studied in the past, the extent of collision between particles and the average distance between a particle and its nearest neighbor have not been quantified explicitly. We provide here a theoretical analysis of such quantities via the distribution of isolated particles. In the classical form of the model, in which each particle occupies only a single site, we obtain an exact analytic solution using the Matrix Ansatz. We then employ a refined mean-field approach to extend the analysis to a generalized TASEP with particles of arbitrary size. Our theoretical study has direct applications in mRNA translation and the interpretation of experimental ribosome profiling data. In particular, our analysis of data from S. cerevisiae suggests a potential bias against the detection of nearby ribosomes with a gap distance of less than ~3 codons, which leads to some ambiguity in estimating the initiation rate and protein production flux for a substantial fraction of genes. Despite such ambiguity, however, we demonstrate theoretically that the interference rate associated with collisions can be robustly estimated, and show that approximately 1% of translating ribosomes are obstructed.
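The exact Matrix Ansatz computation is beyond a short snippet, but a back-of-envelope Monte Carlo simulation of the single-site TASEP with open boundaries illustrates the kind of quantity discussed, namely the fraction of particles whose forward neighbor site is occupied. The rates, lattice size and measurement scheme below are arbitrary choices, not those of the paper.

```python
import numpy as np

def simulate_tasep(L=100, alpha=0.3, beta=0.3, sweeps=5000, burn_in=1000, seed=0):
    """Random-sequential-update TASEP with open boundaries.

    Returns the steady-state density and the fraction of particles whose
    right-hand neighbor site is occupied (blocked particles), a crude
    proxy for the collision statistics discussed above.
    """
    rng = np.random.default_rng(seed)
    lattice = np.zeros(L, dtype=np.int8)
    density, blocked = [], []

    for sweep in range(sweeps):
        for _ in range(L + 1):
            site = rng.integers(-1, L)                   # -1 = entry reservoir
            if site == -1:
                if lattice[0] == 0 and rng.random() < alpha:
                    lattice[0] = 1                       # injection at the left
            elif site == L - 1:
                if lattice[-1] == 1 and rng.random() < beta:
                    lattice[-1] = 0                      # exit at the right
            elif lattice[site] == 1 and lattice[site + 1] == 0:
                lattice[site], lattice[site + 1] = 0, 1  # hop to the right

        if sweep >= burn_in:
            occupied = np.flatnonzero(lattice[:-1])
            if occupied.size:
                blocked.append(lattice[occupied + 1].mean())
            density.append(lattice.mean())

    return np.mean(density), np.mean(blocked)
```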


2014 ◽  
Vol 23 (8) ◽  
pp. 080203 ◽  
Author(s):  
Ying-Hua Lu ◽  
Ting-Huai Ma ◽  
Shui-Ming Zhong ◽  
Jie Cao ◽  
Xin Wang ◽  
...  

Semantic Web ◽  
2020 ◽  
Vol 11 (5) ◽  
pp. 735-750 ◽  
Author(s):  
Carlos Badenes-Olmedo ◽  
José Luis Redondo-García ◽  
Oscar Corcho

Searching for similar documents and exploring the major themes covered across groups of documents are common activities when browsing collections of scientific papers. This manual, knowledge-intensive task can become less tedious, and even lead to unexpected relevant findings, if unsupervised algorithms are applied to help researchers. Most text mining algorithms represent documents in a common feature space that abstracts them away from the specific sequence of words used in them. Probabilistic Topic Models reduce that feature space by annotating documents with thematic information. Over this low-dimensional latent space, some locality-sensitive hashing algorithms have been proposed to perform document similarity search. However, thematic information gets hidden behind hash codes, preventing thematic exploration and limiting the ability of topics to explain content-based similarities. This paper presents a novel hashing algorithm based on approximate nearest-neighbor techniques that uses hierarchical sets of topics as hash codes. It not only performs efficient similarity searches, but also allows those queries to be extended with thematic restrictions that explain the similarity score in terms of the most relevant topics. Extensive evaluations on both scientific and industrial text datasets validate the proposed algorithm in terms of accuracy and efficiency.
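As a simplified illustration of hashing documents by hierarchical sets of their most relevant topics (not the algorithm published in the paper), a document's topic distribution can be thresholded at several levels and documents bucketed level by level; the thresholds and toy topic distributions below are placeholders, and the topic model itself is assumed to be trained elsewhere.

```python
from collections import defaultdict

def topic_hash(topic_dist, levels=(0.3, 0.1, 0.02)):
    """Hierarchical topic-set hash code for one document.

    topic_dist maps topic id -> probability (from a trained topic model).
    Each level keeps the topics above a threshold, so coarser levels
    contain only the dominant topics.
    """
    return tuple(
        frozenset(t for t, p in topic_dist.items() if p >= threshold)
        for threshold in levels
    )

def candidates_per_level(doc_topics, levels=(0.3, 0.1, 0.02)):
    """Bucket documents level by level: coarse levels yield broad
    candidate sets, finer levels yield tighter, more explainable ones."""
    per_level = []
    for lvl in range(len(levels)):
        buckets = defaultdict(list)
        for doc_id, dist in doc_topics.items():
            buckets[topic_hash(dist, levels)[lvl]].append(doc_id)
        per_level.append(dict(buckets))
    return per_level

# Toy usage with made-up topic distributions:
doc_topics = {
    "paper_a": {0: 0.55, 3: 0.25, 7: 0.05},
    "paper_b": {0: 0.50, 3: 0.30, 9: 0.04},
    "paper_c": {2: 0.70, 5: 0.20},
}
for lvl, buckets in enumerate(candidates_per_level(doc_topics)):
    print(f"level {lvl}:", {tuple(sorted(code)): docs for code, docs in buckets.items()})
```

Because the hash code is itself a set of topics, documents retrieved from a shared bucket come with an immediate thematic explanation of why they match.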


1999 ◽  
Vol 06 (05) ◽  
pp. 699-704 ◽  
Author(s):  
K. YASUTANI ◽  
M. KABURAGI ◽  
M. KANG

The structures of adsorbate-induced row-type alignments on the FCC(110) surface are analyzed using the two-dimensional Blume–Emery–Griffiths (BEG) model with nearest-neighbor (NN) and next-nearest-neighbor (NNN) interactions. The ground-state phase diagram over the whole range of interactions is determined by the energy comparison method. Comparing the results of the ground-state analysis with the experimentally observed structures of O/Rh(110) and O/Pd(110), we determine the interaction regimes for these systems. From the interaction regime thus determined, we propose a model structure for the c(2 × 6) phase of O/Pd(110).
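For reference, one common convention for the spin-1 BEG Hamiltonian extended with next-nearest-neighbor exchange is (the sign and parameter conventions used in the paper may differ):

$$ H = -J_{1}\sum_{\langle i,j \rangle} S_i S_j \;-\; J_{2}\sum_{\langle\langle i,j \rangle\rangle} S_i S_j \;-\; K\sum_{\langle i,j \rangle} S_i^{2} S_j^{2} \;+\; \Delta\sum_{i} S_i^{2}, \qquad S_i \in \{-1, 0, +1\}, $$

where $\langle i,j \rangle$ and $\langle\langle i,j \rangle\rangle$ run over NN and NNN pairs; the ground-state diagram then follows from comparing the energies of candidate ordered configurations at given couplings.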


2018 ◽  
Author(s):  
Daniel Probst ◽  
Jean-Louis Reymond

Background: Among the various molecular fingerprints available to describe small organic molecules, ECFP4 (extended connectivity fingerprint, up to four bonds) performs best in benchmarking drug analog recovery studies, as it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high-dimensional representations (≥1,024D) to perform well, which causes ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC to run very slowly due to the curse of dimensionality.

Results: Herein we report a new fingerprint, called MHFP6 (MinHash fingerprint, up to six bonds), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore, MHFP6 outperforms ECFP4 in approximate nearest neighbor searches by two orders of magnitude in terms of speed, while decreasing the error rate.

Conclusion: MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (https://github.com/reymond-group/mhfp).
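A minimal MinHash sketch over a set of substructure strings illustrates the principle MHFP6 builds on; it does not use RDKit and does not reproduce the actual MHFP6 encoding (see the linked repository for that), and the example substructure SMILES are placeholders.

```python
import hashlib

def minhash(strings, n_permutations=128):
    """MinHash signature of a set of strings.

    Each 'permutation' is simulated by salting a stable hash; the
    signature stores the minimum hash value per salt. Two signatures
    agree position by position with probability equal to the Jaccard
    similarity of the underlying sets.
    """
    return [
        min(int.from_bytes(hashlib.sha1(f"{salt}|{s}".encode()).digest()[:8], "big")
            for s in strings)
        for salt in range(n_permutations)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Approximate Jaccard similarity from two MinHash signatures."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Toy usage with made-up circular-substructure strings for two molecules:
mol_a = {"C", "CC", "CCO", "c1ccccc1", "CO"}
mol_b = {"C", "CC", "CCN", "c1ccccc1", "CN"}
print(estimated_jaccard(minhash(mol_a), minhash(mol_b)))
```

Because the signature positions behave like independent Jaccard estimators, they can be fed directly into banding-style LSH schemes for approximate nearest neighbor search.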


2017 ◽  
Author(s):  
Debajyoti Sinha ◽  
Akhilesh Kumar ◽  
Himanshu Kumar ◽  
Sanghamitra Bandyopadhyay ◽  
Debarka Sengupta

Droplet-based single-cell transcriptomics has recently enabled parallel screening of tens of thousands of single cells. Clustering methods that scale to such high-dimensional data without compromising accuracy are scarce. We exploit Locality Sensitive Hashing, an approximate nearest neighbor search technique, to develop a de novo clustering algorithm for large-scale single-cell data. On a number of real datasets, dropClust outperformed the existing best-practice methods in terms of execution time, clustering accuracy and detectability of minor cell sub-types.

