A Novel Accuracy and Similarity Search Structure Based on Parallel Bloom Filters

2016 ◽  
Vol 2016 ◽  
pp. 1-12
Author(s):  
Chunyan Shuai ◽  
Hengcheng Yang ◽  
Xin Ouyang ◽  
Siqi Li ◽  
Zheng Chen

In high-dimensional spaces, accurate and similarity search at low computing and storage cost remain difficult research problems, and there is an inherent trade-off between efficiency and accuracy. In this paper, we propose a new structure, Similar-PBF-PHT, to represent the items of a high-dimensional set and to retrieve both accurate (exact) and similar items. Similar-PBF-PHT consists of three parts: parallel Bloom filters (PBFs), parallel hash tables (PHTs), and a bit matrix. Experiments show that Similar-PBF-PHT is effective for membership queries and K-nearest-neighbor (K-NN) search. For accurate queries, it achieves a low hit false positive probability (FPP) at acceptable memory cost. For K-NN queries, the average overall ratio and rank-i ratio under the Hamming distance are accurate, and the ratios under the Euclidean distance are acceptable. Retrieving accurate and similar items costs CPU time rather than I/O operations, and the structure can handle diverse data formats, not only numerical values.
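As a rough illustration of the kind of structure the abstract describes, the sketch below (an assumption-laden toy, not the authors' Similar-PBF-PHT) builds one Bloom filter per dimension and declares a vector a member only when every dimension hits its own filter; the class `BloomFilter` and the helper `membership` are hypothetical names introduced here.

```python
import hashlib

class BloomFilter:
    """Simple Bloom filter: k hash positions over an m-bit array."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):
        return all(self.bits[p] for p in self._positions(item))


def membership(filters, vector):
    # A vector counts as a member only if every dimension is found
    # in its own per-dimension filter (false positives remain possible).
    return all(f.query(v) for f, v in zip(filters, vector))


# One filter per dimension of a 3-dimensional item set.
filters = [BloomFilter() for _ in range(3)]
item = (4.2, "red", 17)
for f, v in zip(filters, item):
    f.add(v)

print(membership(filters, item))              # True
print(membership(filters, (0.0, "blue", 9)))  # almost certainly False
```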

Named Data Networking (NDN) is a fast-growing architecture proposed as an alternative to the existing IP architecture. NDN allows users to request data identified by a unique name, without any information about the hosting entity. NDN supports in-network caching of contents, multi-path forwarding, and data security. In NDN, packet-forwarding decisions are driven by lookup operations on the content names of NDN packets. An NDN node maintains a set of routing tables that aid forwarding decisions. Forwarding an NDN packet depends on looking up these tables and performing Longest Prefix Matching (LPM) against them. NDN names are unbounded and of variable length. These features, together with large and dynamic NDN tables, pose several challenges, including increased memory requirements and delayed lookup operations. To this end, there is a need for an efficient data structure that supports fast lookup operations with low memory overhead. Several lookup techniques have been proposed in this direction. Traversing trie structures is slow, since every level of the trie requires a memory access. Hash tables incur additional hash computations on names and suffer from collisions. Bloom filters suffer from false positives and do not support deletions. Improving the performance of these structures can lead to a better lookup solution. This survey paper explores different lookup structures for NDN networks; performance is measured with respect to lookup rate and memory efficiency.
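To make the lookup problem concrete, here is a minimal hash-table-based LPM sketch over "/"-delimited NDN-style names; it probes from the full name down to the shortest prefix, which illustrates why hash-based schemes pay one hash computation (and potential collision handling) per probed prefix length. The `fib` dictionary and `longest_prefix_match` function are illustrative assumptions, not taken from any particular surveyed scheme.

```python
def longest_prefix_match(fib, name):
    # Probe the hash table from the full name down to the shortest
    # prefix; each probe costs one hash computation on a name prefix.
    components = name.strip("/").split("/")
    for end in range(len(components), 0, -1):
        prefix = "/" + "/".join(components[:end])
        if prefix in fib:
            return prefix, fib[prefix]
    return None, None


# Toy FIB mapping name prefixes to outgoing faces.
fib = {"/edu": 1, "/edu/univ": 2, "/edu/univ/videos": 3}

print(longest_prefix_match(fib, "/edu/univ/videos/lecture1.mp4"))  # ('/edu/univ/videos', 3)
print(longest_prefix_match(fib, "/com/example/data"))              # (None, None)
```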


2020 ◽  
Vol 34 (07) ◽  
pp. 12346-12353
Author(s):  
Zhenyu Weng ◽  
Yuesheng Zhu

Binary codes are widely used to represent data because of their small storage footprint and efficient computation. However, they suffer from an ambiguity problem: many binary codes share the same Hamming distance to a query. To alleviate this problem, weighted binary codes assign a different weight to each bit and compare codes by the weighted Hamming distance. Until now, querying weighted binary codes efficiently has remained an open issue. In this paper, we propose a new method to rank weighted binary codes and efficiently return the nearest weighted binary codes of a query. In our method, based on multi-index hash tables, two algorithms, a table bucket finding algorithm and a table merging algorithm, are proposed to select the nearest weighted binary codes of the query in a non-exhaustive and accurate way. The proposed algorithms are justified by proving their theoretical properties. Experiments on three large-scale datasets validate both the search efficiency and the search accuracy of our method. In particular, when the number of weighted binary codes reaches one billion, our method is more than 1000 times faster than a linear scan.
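For reference, the weighted Hamming distance being ranked can be written in a few lines; the exhaustive `rank_codes` baseline below is the linear scan that the paper's multi-index bucket-finding and table-merging algorithms are designed to avoid, and all identifiers in this sketch are hypothetical.

```python
import numpy as np

def weighted_hamming(query_bits, code_bits, weights):
    # Sum the per-bit weights wherever the query and the code disagree.
    return float(np.sum(weights * (query_bits != code_bits)))

def rank_codes(query_bits, database, weights, k=3):
    # Exhaustive baseline: score every code, keep the k smallest distances.
    scored = [(weighted_hamming(query_bits, code, weights), i)
              for i, code in enumerate(database)]
    return sorted(scored)[:k]

rng = np.random.default_rng(0)
weights = rng.random(32)                          # one weight per bit
database = rng.integers(0, 2, size=(1000, 32))    # 1000 32-bit binary codes
query = rng.integers(0, 2, size=32)

print(rank_codes(query, database, weights, k=3))  # [(dist, index), ...]
```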


2016 ◽  
Author(s):  
Edmund Hart ◽  
Pauline Barmby ◽  
David LeBauer ◽  
François Michonneau ◽  
Sarah Mount ◽  
...  

Data is the central currency of science, but the nature of scientific data has changed dramatically with the rapid pace of technology. This change has led to the development of a wide variety of data formats, dataset sizes, data complexity, data use cases, and data sharing practices. Improvements in high throughput DNA sequencing, sustained institutional support for large sensor networks, and sky surveys with large-format digital cameras have created massive quantities of data. At the same time, the combination of increasingly diverse research teams and data aggregation in portals (e.g. for biodiversity data, GBIF or iDigBio) necessitates increased coordination among data collectors and institutions. As a consequence, “data” can now mean anything from petabytes of information stored in professionally-maintained databases, through spreadsheets on a single computer, to hand-written tables in lab notebooks on shelves. All remain important, but data curation practices must continue to keep pace with the changes brought about by new forms and practices of data collection and storage.


2019 ◽  
Vol 52 (3) ◽  
pp. 633-646 ◽  
Author(s):  
Soohyung Joo ◽  
Christie Peters

This study assesses the needs of researchers for data-related assistance and investigates their research data management behavior. A survey was conducted, and 186 valid responses were collected from faculty, researchers, and graduate students across different disciplines at a research university. The services for which researchers perceive the greatest need include assistance with quantitative analysis and data visualization. Overall, the need for data-related assistance is relatively higher among health scientists, while humanities researchers demonstrate the lowest need. This study also investigated the data formats used, data documentation and storage practices, and data-sharing behavior of researchers. We found that researchers rarely use metadata standards, but rely more on a standard file-naming scheme. As to data sharing, respondents are likely to share their data personally upon request or as supplementary materials to journal publications. The findings of this study will be useful for planning user-centered research data services in academic libraries.


2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Wei Li ◽  
Kun Huang ◽  
Dafang Zhang ◽  
Zheng Qin

Bloom filters are space-efficient randomized data structures for fast membership queries that allow false positives. Counting Bloom Filters (CBFs) perform the same operations on dynamic sets that can be updated via insertions and deletions. CBFs have been extensively used in MapReduce to accelerate large-scale data processing on large clusters by reducing the volume of datasets. The false positive probability of a CBF should be made as low as possible so that more redundant data can be filtered out. In this paper, we propose a multilevel optimization approach to building an Accurate Counting Bloom Filter (ACBF) that reduces the false positive probability. ACBF is constructed by partitioning the counter vector into multiple levels. We propose an optimized ACBF that maximizes the first-level size in order to minimize the false positive probability while maintaining the same functionality as a CBF. Simulation results show that the optimized ACBF reduces the false positive probability by up to 98.4% at the same memory consumption compared to a CBF. We also implement ACBFs in MapReduce to speed up the reduce-side join. Experiments on realistic datasets show that, compared to a CBF, ACBF reduces the false positive probability by 72.3% and the map outputs by 33.9%, and improves the join execution time by 20%.
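A minimal counting Bloom filter sketch (generic, not the multilevel ACBF itself) shows the basic counter-vector operations that ACBF optimizes: insertions and deletions increment and decrement counters, and a query succeeds only if every probed counter is non-zero. All names here are illustrative.

```python
import hashlib

class CountingBloomFilter:
    """Counters instead of bits, so deletions are supported."""
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def delete(self, item):
        for p in self._positions(item):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def query(self, item):
        # Possibly present if every probed counter is non-zero.
        return all(self.counters[p] > 0 for p in self._positions(item))


cbf = CountingBloomFilter()
cbf.insert("record-42")
print(cbf.query("record-42"))   # True
cbf.delete("record-42")
print(cbf.query("record-42"))   # False, barring counter collisions
```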


2014 ◽  
Vol 644-650 ◽  
pp. 3365-3370
Author(s):  
Zhen Hong Guo ◽  
Lin Li ◽  
Qing Wang ◽  
Meng Lin ◽  
Rui Pan

With the rapid development of the Internet, the number of firewall rules keeps increasing. This enormous quantity of rules challenges the performance of packet classification, which has already become a bottleneck in firewalls. This paper proposes a fast, multi-dimensional packet classification algorithm based on BSOL (Binary Search On Leaves), named FMPC (Fast Multi-dimensional Packet Classification). Unlike BSOL, FMPC cuts all dimensions at the same time to decompose rule spaces and stores the leaf spaces in hash tables; FMPC builds a Bloom filter for every hash table and stores the filters in embedded SRAM. When classifying a packet, FMPC queries the Bloom filters in parallel and, according to the results, decides which hash tables to visit. Algorithm analysis and simulation results show that the average number of hash-table lookups per classified packet is 1, which is much smaller than that of BSOL, and that in the worst case the number of hash-table lookups is O(log(wmax + 1)), which is also smaller than that of BSOL in a multi-dimensional setting, where wmax is the length, in bits, of the longest dimension.
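The control flow described above (query small on-chip Bloom filters in parallel, then probe only the hash tables whose filters report a possible match) can be sketched as follows; the toy string-prefix keys stand in for FMPC's real multi-dimensional cuts, and every identifier here is an assumption made for illustration.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter guarding one hash table ("leaf space")."""
    def __init__(self, m=4096, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def maybe(self, key):
        return all(self.bits[p] for p in self._positions(key))


# Hypothetical leaf spaces: one hash table per cut length, each
# guarded by a Bloom filter kept in fast (on-chip) memory.
tables = {4: {}, 8: {}}
filters = {length: Bloom() for length in tables}

def add_rule(length, key, action):
    tables[length][key] = action
    filters[length].add(key)

def classify(header):
    # Query all Bloom filters first; probe a (slow) hash table only
    # when its filter reports a possible match.
    for length in sorted(tables, reverse=True):
        key = header[:length]              # toy stand-in for field cuts
        if filters[length].maybe(key) and key in tables[length]:
            return tables[length][key]
    return "default"

add_rule(4, "10.0", "drop")          # coarse rule
add_rule(8, "10.0.1.2", "accept")    # finer rule

print(classify("10.0.1.2:443"))      # 'accept'
print(classify("10.0.9.9:80"))       # 'drop'
print(classify("192.168.1.1"))       # 'default'
```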


2016 ◽  
Vol 13 (Supplement 1) ◽  
pp. 72-86
Author(s):  
Guo Zhang ◽  
Jianhui Zhang ◽  
Binqiang Wang ◽  
Zhen Zhang
