A Novel Accuracy and Similarity Search Structure Based on Parallel Bloom Filters

2016 ◽  
Vol 2016 ◽  
pp. 1-12
Author(s):  
Chunyan Shuai ◽  
Hengcheng Yang ◽  
Xin Ouyang ◽  
Siqi Li ◽  
Zheng Chen

In high-dimensional spaces, accurate and similarity search at low computing and storage cost remain difficult research problems, and there is an inherent trade-off between efficiency and accuracy. In this paper, we propose a new structure, Similar-PBF-PHT, to represent the items of a high-dimensional set and to retrieve both accurate (exact) and similar items. Similar-PBF-PHT consists of three parts: parallel Bloom filters (PBFs), parallel hash tables (PHTs), and a bit matrix. Experiments show that Similar-PBF-PHT is effective for membership queries and K-nearest-neighbor (K-NN) search. For accurate queries, it achieves a low hit false positive probability (FPP) at acceptable memory cost. For K-NN queries, the average overall ratio and rank-i ratio under the Hamming distance are accurate, and the ratios under the Euclidean distance are acceptable. Retrieving accurate and similar items costs CPU time rather than I/O operations, and the structure can handle diverse data formats, not only numerical values.
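As a rough illustration of the kind of structure the abstract describes, the sketch below (an assumption-laden toy, not the authors' Similar-PBF-PHT) builds one Bloom filter per dimension and declares a vector a member only when every dimension hits its own filter; the class `BloomFilter` and the helper `membership` are hypothetical names introduced here.

```python
import hashlib

class BloomFilter:
    """Simple Bloom filter: k hash positions over an m-bit array."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):
        return all(self.bits[p] for p in self._positions(item))


def membership(filters, vector):
    # A vector counts as a member only if every dimension is found
    # in its own per-dimension filter (false positives remain possible).
    return all(f.query(v) for f, v in zip(filters, vector))


# One filter per dimension of a 3-dimensional item set.
filters = [BloomFilter() for _ in range(3)]
item = (4.2, "red", 17)
for f, v in zip(filters, item):
    f.add(v)

print(membership(filters, item))              # True
print(membership(filters, (0.0, "blue", 9)))  # almost certainly False
```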

Named Data Networking (NDN) is a fast-growing architecture proposed as an alternative to the existing IP architecture. NDN allows users to request data identified by a unique name, without any information about the hosting entity. NDN supports in-network caching of contents, multi-path forwarding, and data security. In NDN, packet-forwarding decisions are driven by lookup operations on the content names of NDN packets. An NDN node maintains a set of routing tables that aid forwarding decisions. Forwarding an NDN packet depends on looking up these tables and performing Longest Prefix Matching (LPM) against them. NDN names are unbounded and of variable length. These features, together with large and dynamic NDN tables, pose several challenges, including increased memory requirements and delayed lookup operations. To this end, there is a need for an efficient data structure that supports fast lookup operations with low memory overhead. Several lookup techniques have been proposed in this direction. Traversing trie structures is slow, since every level of the trie requires a memory access. Hash tables incur additional hash computations on names and suffer from collisions. Bloom filters suffer from false positives and do not support deletions. Improving the performance of these structures can lead to a better lookup solution. This survey paper explores different lookup structures for NDN networks; performance is measured with respect to lookup rate and memory efficiency.
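To make the lookup problem concrete, here is a minimal hash-table-based LPM sketch over "/"-delimited NDN-style names; it probes from the full name down to the shortest prefix, which illustrates why hash-based schemes pay one hash computation (and potential collision handling) per probed prefix length. The `fib` dictionary and `longest_prefix_match` function are illustrative assumptions, not taken from any particular surveyed scheme.

```python
def longest_prefix_match(fib, name):
    # Probe the hash table from the full name down to the shortest
    # prefix; each probe costs one hash computation on a name prefix.
    components = name.strip("/").split("/")
    for end in range(len(components), 0, -1):
        prefix = "/" + "/".join(components[:end])
        if prefix in fib:
            return prefix, fib[prefix]
    return None, None


# Toy FIB mapping name prefixes to outgoing faces.
fib = {"/edu": 1, "/edu/univ": 2, "/edu/univ/videos": 3}

print(longest_prefix_match(fib, "/edu/univ/videos/lecture1.mp4"))  # ('/edu/univ/videos', 3)
print(longest_prefix_match(fib, "/com/example/data"))              # (None, None)
```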


2020 ◽  
Vol 34 (07) ◽  
pp. 12346-12353
Author(s):  
Zhenyu Weng ◽  
Yuesheng Zhu

Binary codes are widely used to represent data because of their small storage footprint and efficient computation. However, they suffer from an ambiguity problem: many binary codes share the same Hamming distance to a query. To alleviate this problem, weighted binary codes assign a different weight to each bit and compare codes by the weighted Hamming distance. Until now, querying weighted binary codes efficiently has remained an open issue. In this paper, we propose a new method to rank weighted binary codes and efficiently return the nearest weighted binary codes of a query. In our method, based on multi-index hash tables, two algorithms, a table bucket finding algorithm and a table merging algorithm, are proposed to select the nearest weighted binary codes of the query in a non-exhaustive and accurate way. The proposed algorithms are justified by proving their theoretical properties. Experiments on three large-scale datasets validate both the search efficiency and the search accuracy of our method. In particular, when the number of weighted binary codes reaches one billion, our method is more than 1000 times faster than a linear scan.
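For reference, the weighted Hamming distance being ranked can be written in a few lines; the exhaustive `rank_codes` baseline below is the linear scan that the paper's multi-index bucket-finding and table-merging algorithms are designed to avoid, and all identifiers in this sketch are hypothetical.

```python
import numpy as np

def weighted_hamming(query_bits, code_bits, weights):
    # Sum the per-bit weights wherever the query and the code disagree.
    return float(np.sum(weights * (query_bits != code_bits)))

def rank_codes(query_bits, database, weights, k=3):
    # Exhaustive baseline: score every code, keep the k smallest distances.
    scored = [(weighted_hamming(query_bits, code, weights), i)
              for i, code in enumerate(database)]
    return sorted(scored)[:k]

rng = np.random.default_rng(0)
weights = rng.random(32)                          # one weight per bit
database = rng.integers(0, 2, size=(1000, 32))    # 1000 32-bit binary codes
query = rng.integers(0, 2, size=32)

print(rank_codes(query, database, weights, k=3))  # [(dist, index), ...]
```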


2016 ◽  
Author(s):  
Edmund Hart ◽  
Pauline Barmby ◽  
David LeBauer ◽  
François Michonneau ◽  
Sarah Mount ◽  
...  

Data is the central currency of science, but the nature of scientific data has changed dramatically with the rapid pace of technology. This change has led to the development of a wide variety of data formats, dataset sizes, data complexity, data use cases, and data sharing practices. Improvements in high throughput DNA sequencing, sustained institutional support for large sensor networks, and sky surveys with large-format digital cameras have created massive quantities of data. At the same time, the combination of increasingly diverse research teams and data aggregation in portals (e.g. for biodiversity data, GBIF or iDigBio) necessitates increased coordination among data collectors and institutions. As a consequence, “data” can now mean anything from petabytes of information stored in professionally-maintained databases, through spreadsheets on a single computer, to hand-written tables in lab notebooks on shelves. All remain important, but data curation practices must continue to keep pace with the changes brought about by new forms and practices of data collection and storage.


2019 ◽  
Vol 52 (3) ◽  
pp. 633-646 ◽  
Author(s):  
Soohyung Joo ◽  
Christie Peters

This study assesses the needs of researchers for data-related assistance and investigates their research data management behavior. A survey was conducted, and 186 valid responses were collected from faculty, researchers, and graduate students across different disciplines at a research university. The services for which researchers perceive the greatest need include assistance with quantitative analysis and data visualization. Overall, the need for data-related assistance is relatively higher among health scientists, while humanities researchers demonstrate the lowest need. This study also investigated the data formats used, data documentation and storage practices, and data-sharing behavior of researchers. We found that researchers rarely use metadata standards, but rely more on a standard file-naming scheme. As to data sharing, respondents are likely to share their data personally upon request or as supplementary materials to journal publications. The findings of this study will be useful for planning user-centered research data services in academic libraries.


2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Wei Li ◽  
Kun Huang ◽  
Dafang Zhang ◽  
Zheng Qin

Bloom filters are space-efficient randomized data structures for fast membership queries that allow false positives. Counting Bloom Filters (CBFs) perform the same operations on dynamic sets that can be updated via insertions and deletions. CBFs have been extensively used in MapReduce to accelerate large-scale data processing on large clusters by reducing the volume of datasets. The false positive probability of a CBF should be made as low as possible so that more redundant data can be filtered out. In this paper, we propose a multilevel optimization approach to building an Accurate Counting Bloom Filter (ACBF) that reduces the false positive probability. ACBF is constructed by partitioning the counter vector into multiple levels. We propose an optimized ACBF that maximizes the first-level size in order to minimize the false positive probability while maintaining the same functionality as a CBF. Simulation results show that the optimized ACBF reduces the false positive probability by up to 98.4% at the same memory consumption compared to a CBF. We also implement ACBFs in MapReduce to speed up the reduce-side join. Experiments on realistic datasets show that, compared to a CBF, ACBF reduces the false positive probability by 72.3% and the map outputs by 33.9%, and improves the join execution time by 20%.
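A minimal counting Bloom filter sketch (generic, not the multilevel ACBF itself) shows the basic counter-vector operations that ACBF optimizes: insertions and deletions increment and decrement counters, and a query succeeds only if every probed counter is non-zero. All names here are illustrative.

```python
import hashlib

class CountingBloomFilter:
    """Counters instead of bits, so deletions are supported."""
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def delete(self, item):
        for p in self._positions(item):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def query(self, item):
        # Possibly present if every probed counter is non-zero.
        return all(self.counters[p] > 0 for p in self._positions(item))


cbf = CountingBloomFilter()
cbf.insert("record-42")
print(cbf.query("record-42"))   # True
cbf.delete("record-42")
print(cbf.query("record-42"))   # False, barring counter collisions
```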


2014 ◽  
Vol 644-650 ◽  
pp. 3365-3370
Author(s):  
Zhen Hong Guo ◽  
Lin Li ◽  
Qing Wang ◽  
Meng Lin ◽  
Rui Pan

With the rapid development of the Internet, the number of firewall rules keeps increasing. This enormous quantity of rules challenges the performance of packet classification, which has already become a bottleneck in firewalls. This paper proposes a fast, multi-dimensional packet classification algorithm based on BSOL (Binary Search On Leaves), named FMPC (Fast Multi-dimensional Packet Classification). Unlike BSOL, FMPC cuts all dimensions at the same time to decompose rule spaces and stores the leaf spaces in hash tables; FMPC builds a Bloom filter for every hash table and stores the filters in embedded SRAM. When classifying a packet, FMPC queries the Bloom filters in parallel and, according to the results, decides which hash tables to visit. Algorithm analysis and simulation results show that the average number of hash-table lookups per classified packet is 1, which is much smaller than that of BSOL, and that in the worst case the number of hash-table lookups is O(log(wmax + 1)), which is also smaller than that of BSOL in a multi-dimensional setting, where wmax is the length, in bits, of the longest dimension.
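The control flow described above (query small on-chip Bloom filters in parallel, then probe only the hash tables whose filters report a possible match) can be sketched as follows; the toy string-prefix keys stand in for FMPC's real multi-dimensional cuts, and every identifier here is an assumption made for illustration.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter guarding one hash table ("leaf space")."""
    def __init__(self, m=4096, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def maybe(self, key):
        return all(self.bits[p] for p in self._positions(key))


# Hypothetical leaf spaces: one hash table per cut length, each
# guarded by a Bloom filter kept in fast (on-chip) memory.
tables = {4: {}, 8: {}}
filters = {length: Bloom() for length in tables}

def add_rule(length, key, action):
    tables[length][key] = action
    filters[length].add(key)

def classify(header):
    # Query all Bloom filters first; probe a (slow) hash table only
    # when its filter reports a possible match.
    for length in sorted(tables, reverse=True):
        key = header[:length]              # toy stand-in for field cuts
        if filters[length].maybe(key) and key in tables[length]:
            return tables[length][key]
    return "default"

add_rule(4, "10.0", "drop")          # coarse rule
add_rule(8, "10.0.1.2", "accept")    # finer rule

print(classify("10.0.1.2:443"))      # 'accept'
print(classify("10.0.9.9:80"))       # 'drop'
print(classify("192.168.1.1"))       # 'default'
```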


2016 ◽  
Vol 13 (Supplement 1) ◽  
pp. 72-86
Author(s):  
Guo Zhang ◽  
Jianhui Zhang ◽  
Binqiang Wang ◽  
Zhen Zhang
