A Multi-attribute Data Structure with Parallel Bloom Filters for Network Services

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Although longer subsequences unlock longer genomic features for assembly, the associated increase in compute resources limits the practicability of DBG compared with other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, sequence specificity increases with gap length. Further, we note that Bloom filters would be suitable for implicitly storing spaced seeds while tolerating sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
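The paired-subsequence form of a spaced seed can be illustrated with a minimal sketch. The function name `paired_kmers` and the parameters `k` and `gap` are hypothetical names chosen for illustration, and the abstract does not specify how the authors extract or store the pairs; this only shows two k-mers separated by a fixed distance.

```python
def paired_kmers(read, k, gap):
    """Yield (left, right) k-mer pairs separated by a fixed gap of `gap` bases.

    Illustrative only: `k` and `gap` are assumed parameters, not values
    taken from the abstract.
    """
    span = 2 * k + gap  # total footprint of one spaced seed on the read
    for i in range(len(read) - span + 1):
        left = read[i:i + k]
        right = read[i + k + gap:i + span]
        yield left, right


pairs = list(paired_kmers("ACGTACGTAC", k=3, gap=2))
print(pairs)  # [('ACG', 'CGT'), ('CGT', 'GTA'), ('GTA', 'TAC')]
```

Each pair could then be hashed as a single key into a Bloom filter, so the seeds are stored implicitly rather than in an explicit table.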


Study on cadastral basic attribute data structure based on man-land relationship

10.1117/12.838684 ◽

2009 ◽

Author(s):

Changgen Zhan ◽

Yaolin Liu

Keyword(s):

Data Structure ◽

Attribute Data


Efficient Set Similarity Join on Multi-Attribute Data Using Lightweight Filters

Journal of Information and Data Management ◽

10.5753/jidm.2021.1969 ◽

2021 ◽

Vol 12 (3) ◽

Author(s):

Leonardo Andrade Ribeiro ◽

Felipe Ferreira Borges ◽

Diego Oliveira

Keyword(s):

Data Structure ◽

Processing Time ◽

Cost Model ◽

Similarity Join ◽

Attribute Data ◽

Join Algorithms ◽

Filtering Technique ◽

Alternative Approaches ◽

Similarity Joins ◽

Single Set

We consider the problem of efficiently answering set similarity joins on multi-attribute data. Traditional set similarity join algorithms assume string data represented by a single set and, thus, miss the opportunity to exploit predicates over multiple attributes to reduce the number of similarity computations. In this article, we present a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then instantiate this framework with a lightweight filtering technique based on a simple, yet effective data structure, for which exact and probabilistic implementations are evaluated. In this context, we devise a cost model to identify the best attribute ordering to reduce processing time. Moreover, alternative approaches are also investigated and a new algorithm combining key ideas from previous work is introduced. Finally, we present a thorough experimental evaluation, which demonstrates that our main proposal is efficient and significantly outperforms competing algorithms.
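The idea of using predicates over several attributes to skip similarity computations can be sketched as follows. This is a naive nested-loop join written for illustration, not the framework or filter from the article; the record layout, attribute names, and thresholds are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def multi_attribute_join(records, thresholds):
    """Return pairs of record indices that satisfy every attribute predicate.

    `thresholds` is an ordered list of (attribute, minimum similarity).
    Attributes are checked in that order, and a pair is dropped as soon as
    one predicate fails, so later similarities are never computed for it.
    """
    results, computations = [], 0
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            for attribute, minimum in thresholds:
                computations += 1
                if jaccard(records[i][attribute], records[j][attribute]) < minimum:
                    break
            else:  # no predicate failed
                results.append((i, j))
    return results, computations


# Hypothetical records with two set-valued attributes.
records = [
    {"title": {"bloom", "filter", "index"}, "authors": {"lee", "kim"}},
    {"title": {"bloom", "filter", "join"}, "authors": {"lee", "kim"}},
    {"title": {"graph", "search"}, "authors": {"lee", "kim"}},
]
matches, computations = multi_attribute_join(
    records, [("title", 0.5), ("authors", 0.8)]
)
print(matches, computations)  # [(0, 1)] 4
```

With the selective `title` predicate first, only 4 similarity computations are needed instead of 6, which is the kind of saving an attribute-ordering cost model aims to maximize.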


A neural data structure for novelty detection

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1814448115 ◽

2018 ◽

Vol 115 (51) ◽

pp. 13093-13098 ◽

Cited By ~ 7

Author(s):

Sanjoy Dasgupta ◽

Timothy C. Sheehan ◽

Charles F. Stevens ◽

Saket Navlakha

Keyword(s):

Data Structure ◽

Computer Science ◽

Novelty Detection ◽

Bloom Filter ◽

Fruit Fly ◽

Fruit Flies ◽

Bloom Filters ◽

Neural Data ◽

Biological Problem ◽

Computational Systems

Novelty detection is a fundamental biological problem that organisms must solve to determine whether a given stimulus departs from those previously experienced. In computer science, this problem is solved efficiently using a data structure called a Bloom filter. We found that the fruit fly olfactory circuit evolved a variant of a Bloom filter to assess the novelty of odors. Compared with a traditional Bloom filter, the fly adjusts novelty responses based on two additional features: the similarity of an odor to previously experienced odors and the time elapsed since the odor was last experienced. We elaborate and validate a framework to predict novelty responses of fruit flies to given pairs of odors. We also translate insights from the fly circuit to develop a class of distance- and time-sensitive Bloom filters that outperform prior filters when evaluated on several biological and computational datasets. Overall, our work illuminates the algorithmic basis of an important neurobiological problem and offers strategies for novelty detection in computational systems.
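For readers unfamiliar with the computer-science side, a traditional Bloom filter used for novelty detection might look like the minimal sketch below. The sizes, the SHA-256-based hashing scheme, and the helper `is_novel` are illustrative assumptions; the distance- and time-sensitive variants described in the abstract are not implemented here.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def __contains__(self, item):
        return all(self.bits[position] for position in self._positions(item))


def is_novel(bloom, stimulus):
    """Report whether the stimulus looks new, then remember it."""
    novel = stimulus not in bloom
    bloom.add(stimulus)
    return novel


odors = BloomFilter()
first = is_novel(odors, "banana")   # True: nothing stored yet
second = is_novel(odors, "banana")  # False: already experienced
```

A filter like this gives a binary answer only; it has no notion of how similar a new odor is to stored ones or how long ago an odor was seen, which are the two features the abstract attributes to the fly circuit.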


Multilevel Bloom Filters for P2P Flows Identification Based on Cluster Analysis in Wireless Mesh Network

Discrete Dynamics in Nature and Society ◽

10.1155/2015/801934 ◽

2015 ◽

Vol 2015 ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Xia-an Bi ◽

Xiaohui Wang ◽

Luyun Xu ◽

Sheng Chen ◽

Hong Liu

Keyword(s):

Data Structure ◽

Mesh Networks ◽

Level Structure ◽

Traffic Monitoring ◽

Mesh Network ◽

Bloom Filters ◽

Identification Algorithm ◽

Wireless Mesh ◽

Network Operation ◽

P2p Applications

With the development of wireless mesh networks and distributed computing, many new P2P services have been deployed, enriching Internet content and applications. The rapid growth of P2P flows puts great pressure on regular network operation, so effective identification and management of P2P flows has become increasingly urgent. In this paper, we build a multilevel Bloom filter data structure to identify P2P flows, based on a study of the locality characteristics of those flows. Each level of the structure stores a different number of P2P flow rules. We adjust the parameters of the Bloom filter data structure according to the characteristic values of the P2P flows. A search traverses the structure from the first level to the last. Our method addresses drawbacks of traditional algorithms. The simulation results demonstrate that our algorithm effectively enhances the performance of P2P flow identification. We then deploy our flow identification algorithm in the traffic monitoring sensors of the network traffic monitoring system at the export link of a campus network. In this real environment, the experimental results demonstrate that our algorithm identifies P2P flows quickly and accurately; it is therefore suitable for actual deployment.
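The level-by-level search can be sketched as below. This is a minimal illustration under stated assumptions: the class names, level sizes, hashing scheme, and rule strings are hypothetical, and the abstract does not say how rules are assigned to levels or how parameters are tuned.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; sizes used below are illustrative, not tuned."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        return all(self.bits[position] for position in self._positions(key))


class MultilevelBloomFilter:
    """Levels are searched in order; each level holds its own set of rules."""

    def __init__(self, level_sizes, num_hashes=3):
        self.levels = [BloomFilter(size, num_hashes) for size in level_sizes]

    def add_rule(self, level, rule):
        self.levels[level].add(rule)

    def lookup(self, flow_signature):
        """Return the index of the first matching level, or None."""
        for index, level in enumerate(self.levels):
            if level.might_contain(flow_signature):
                return index
        return None


# Hypothetical rule strings, placed on two levels of different sizes.
flows = MultilevelBloomFilter(level_sizes=[2048, 4096])
flows.add_rule(0, "bittorrent-handshake")
flows.add_rule(1, "edonkey-hello")
print(flows.lookup("bittorrent-handshake"))  # 0
```

Placing the most frequently matched rules on the first level means most lookups stop early, which is one plausible way to exploit the locality of P2P flows.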


Creating a Concurrent Overflowing Bloom Filter

10.14293/s2199-1006.1.sor-.ppf4wcp.v1 ◽

2019 ◽

Author(s):

Alex Berliner ◽

Brian Estes ◽

Ebin Scaria

Keyword(s):

Data Structure ◽

Recent Literature ◽

Bloom Filter ◽

Bloom Filters ◽

Probabilistic Data ◽

Additional Element ◽

Marginal Value ◽

The Creation ◽

Probabilistic Data Structure

Bloom filters are an efficient probabilistic data structure used to verify membership of an element inside of a set. There is diminishing marginal value for inserting each additional element into a Bloom filter, and so steps must be taken to maintain scalability. One such option is to create a secondary hash set for a particular hash set in a Bloom filter that has become full, known as an overflow area. At this time, there are no implementations of a Bloom filter that implement this overflow system while maintaining concurrency. In this paper, we demonstrate the creation of a concurrent overflow system for Bloom filters. We use the base Bloom filter presented in recent literature and replace their method of dynamically resizing the Bloom filters with our overflow table implementation, as outlined in one of their suggested areas for future exploration. We then compare the results of our Bloom filter with those from the previously mentioned implementation as well as a standard Bloom filter.


Succinct range filters

Communications of the ACM ◽

10.1145/3450262 ◽

2021 ◽

Vol 64 (4) ◽

pp. 166-173

Author(s):

Huanchen Zhang ◽

Hyeontaek Lim ◽

Viktor Leis ◽

David G. Andersen ◽

Michael Kaminsky ◽

...

Keyword(s):

Information Theory ◽

Data Structure ◽

State Of The Art ◽

Bloom Filters ◽

Range Queries ◽

Database Storage

We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries, such as range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the performance of state-of-the-art order-preserving indexes, while consuming only 10 bits per trie node, a space close to the minimum required by information theory. Our experiments show that SuRF speeds up range queries in a widely used database storage engine by up to 5×.


Balanced counting Bloom filters: a space-efficient synoptic data structure for a high-performance network

IET Communications ◽

10.1049/iet-com.2011.0961 ◽

2012 ◽

Vol 6 (15) ◽

pp. 2259-2266 ◽

Cited By ~ 2

Author(s):

Z. Zhang ◽

J. Liu ◽

B.Q. Wang

Keyword(s):

Data Structure ◽

High Performance ◽

Bloom Filters ◽

Counting Bloom Filters


Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

10.1101/2020.10.08.330985 ◽

2020 ◽

Author(s):

Enrico Seiler ◽

Svenja Mehringer ◽

Mitra Darvish ◽

Etienne Turc ◽

Knut Reinert

Keyword(s):

Data Structure ◽

Nucleotide Sequences ◽

Bloom Filters ◽

Secondary Memory ◽

Set Membership ◽

Effective Use

We present Raptor, a tool for approximately searching many queries in large collections of nucleotide sequences. In comparison with similar tools like Mantis and COBS, Raptor is 12-144 times faster and uses up to 30 times less memory. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the Interleaved Bloom Filters (IBF) as a set membership data structure, and probabilistic thresholding for minimizers. Our approach allows compression and a partitioning of the IBF to enable the effective use of secondary memory.
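Winnowing minimizers can be illustrated with a short sketch: within every window of `w` consecutive k-mers, only the smallest k-mer is kept as a representative. The version below orders k-mers lexicographically and ignores reverse complements, which is a simplifying assumption; it is not Raptor's implementation.

```python
def minimizers(sequence, k, w):
    """Return sorted (position, k-mer) minimizers of `sequence`.

    For each window of w consecutive k-mers, the lexicographically smallest
    k-mer is selected (leftmost on ties). Consecutive windows often share
    the same minimizer, so far fewer k-mers are kept than exist.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return sorted(selected)


representatives = minimizers("GATTACA", k=3, w=3)
print(representatives)  # [(1, 'ATT'), (4, 'ACA')]
```

Here five k-mers reduce to two representatives, so only those two would need to be inserted into a membership structure.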


findere: fast and precise approximate membership query

10.1101/2021.05.31.446182 ◽

2021 ◽

Author(s):

Lucas Robidou ◽

Pierre Peterlongo

Keyword(s):

Data Structure ◽

False Positive ◽

False Positive Rate ◽

False Negative ◽

Bloom Filters ◽

Membership Query ◽

Simple Strategy ◽

Large Sets ◽

Positive Rate ◽

Speed Up

Approximate membership query (AMQ) structures such as Cuckoo filters or Bloom filters are widely used for representing large sets of elements. Their lightweight space usage explains their success, mainly because they are the only way to scale to hundreds of billions or trillions of elements. However, they suffer by nature from unavoidable false-positive calls that bias downstream analyses of methods using these data structures. In this work we propose a simple strategy and its implementation for reducing the false-positive rate of any AMQ data structure indexing k-mers (words of length k). The method we propose, called findere, speeds up queries by a factor of two and decreases the false-positive rate by two orders of magnitude. This is achieved on the fly at query time, without modifying the original indexing data structure, without generating false-negative calls, and with no memory overhead. With no drawback, this method, as simple as it is effective, reduces either the false-positive rate or the space required to represent a set given a user-defined false-positive rate.
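The abstract does not spell out the strategy, so the sketch below is only one way a query-time check of this kind could work. It assumes the AMQ indexes shorter words of length `s` and that a k-mer is reported present only when all of its overlapping s-mers are found; `s`, the function name, and the use of a Python set as a stand-in AMQ are assumptions for illustration.

```python
def query_kmers(sequence, k, s, amq):
    """Report, for each k-mer of `sequence`, whether all its s-mers hit `amq`.

    `amq` is any object supporting `in`; a real AMQ may return false
    positives for individual s-mers. An isolated false-positive s-mer is
    unlikely to be flanked by enough other hits to cover a whole k-mer.
    """
    z = k - s  # each k-mer spans z + 1 consecutive s-mers
    hits = [sequence[i:i + s] in amq for i in range(len(sequence) - s + 1)]
    return [all(hits[i:i + z + 1]) for i in range(len(sequence) - k + 1)]


# Stand-in AMQ: the s-mers of "ACGTAC" plus "TTT", which plays the role of
# a false positive that a real filter might produce.
simulated_amq = {"ACG", "CGT", "GTA", "TAC", "TTT"}
calls = query_kmers("ACGTTTA", k=4, s=3, amq=simulated_amq)
print(calls)  # [True, False, False, False]
```

The spurious hit on `TTT` does not produce a positive k-mer call because its neighbouring s-mers `GTT` and `TTA` are absent.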
