Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

Mapping Intimacies ◽

10.1101/2020.10.08.330985 ◽

2020 ◽

Author(s):

Enrico Seiler ◽

Svenja Mehringer ◽

Mitra Darvish ◽

Etienne Turc ◽

Knut Reinert

Keyword(s):

Data Structure ◽

Nucleotide Sequences ◽

Bloom Filters ◽

Secondary Memory ◽

Set Membership ◽

Effective Use

We present Raptor, a tool for approximately searching many queries in large collections of nucleotide sequences. In comparison with similar tools like Mantis and COBS, Raptor is 12-144 times faster and uses up to 30 times less memory. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the Interleaved Bloom Filters (IBF) as a set membership data structure, and probabilistic thresholding for minimizers. Our approach allows compression and a partitioning of the IBF to enable the effective use of secondary memory.
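The two building blocks named in this abstract, winnowing minimizers and an interleaved Bloom filter, can be illustrated with a minimal Python sketch. This is not Raptor's implementation: the names `kmer_hash`, `minimizers`, `InterleavedBloomFilter`, and `query_bins` are hypothetical, the hash function (BLAKE2b) is an arbitrary choice, and a fixed hit fraction stands in for the probabilistic thresholding the abstract mentions.

```python
import hashlib


def kmer_hash(kmer: str, seed: int = 0) -> int:
    """64-bit hash of a k-mer; `seed` selects an independent hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minimizers(seq: str, k: int, w: int) -> set[str]:
    """Keep, from every window of w consecutive k-mers, the one with the smallest hash."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(max(1, len(kmers) - w + 1)):
        window = kmers[start:start + w]
        if window:
            selected.add(min(window, key=kmer_hash))
    return selected


class InterleavedBloomFilter:
    """One Bloom filter per bin, stored interleaved: rows[i] holds bit i of every bin."""

    def __init__(self, bins: int, bits_per_bin: int, num_hashes: int = 2):
        self.bins = bins
        self.size = bits_per_bin
        self.num_hashes = num_hashes
        self.rows = [0] * bits_per_bin  # each int is a bitmask over all bins

    def _positions(self, kmer: str) -> list[int]:
        # Seeds 1..num_hashes, so seed 0 stays reserved for minimizer ordering.
        return [kmer_hash(kmer, s) % self.size for s in range(1, self.num_hashes + 1)]

    def insert(self, kmer: str, bin_id: int) -> None:
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_id

    def bins_containing(self, kmer: str) -> list[int]:
        # AND-ing the rows answers the membership query for all bins at once.
        mask = (1 << self.bins) - 1
        for pos in self._positions(kmer):
            mask &= self.rows[pos]
        return [b for b in range(self.bins) if (mask >> b) & 1]


def query_bins(ibf, query: str, k: int, w: int, fraction: float = 0.7) -> list[int]:
    """Report bins matching at least `fraction` of the query's minimizers (assumed threshold)."""
    mins = minimizers(query, k, w)
    counts = [0] * ibf.bins
    for m in mins:
        for b in ibf.bins_containing(m):
            counts[b] += 1
    needed = fraction * len(mins)
    return [b for b, c in enumerate(counts) if mins and c >= needed]
```

To use the sketch, insert the minimizers of each reference sequence into its own bin, then call `query_bins` with the same `k` and `w`. A query that is an exact substring of a reference (and spans at least `w` k-mers) always reports that reference's bin, while other bins may appear as Bloom filter false positives.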


Spaced Seed Data Structures for De Novo Assembly

International Journal of Genomics ◽

10.1155/2015/196591 ◽

2015 ◽

Vol 2015 ◽

pp. 1-8 ◽

Cited By ~ 3

Author(s):

Inanç Birol ◽

Justin Chu ◽

Hamid Mohamadi ◽

Shaun D. Jackman ◽

Karthika Raghavan ◽

...

Keyword(s):

Data Structure ◽

Data Structures ◽

De Novo ◽

Bloom Filters ◽

De Bruijn Graph ◽

Sequence Specificity ◽

Sequencing Errors ◽

Spaced Seeds ◽

Read Error Correction ◽

Seed Data

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, the associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
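The idea of storing paired subsequences implicitly in a Bloom filter can be sketched as follows. This is an illustrative Python sketch and not the authors' data structure: `spaced_seeds`, `encode`, and `BloomFilter` are hypothetical names, and the seed shape (two k-mers separated by a fixed gap of ignored bases) is assumed from the abstract's description.

```python
import hashlib


def spaced_seeds(seq: str, k: int, gap: int) -> list[tuple[str, str]]:
    """Pairs of k-mers separated by `gap` bases that are ignored ("don't care" positions)."""
    span = 2 * k + gap
    return [
        (seq[i:i + k], seq[i + k + gap:i + span])
        for i in range(len(seq) - span + 1)
    ]


def encode(seed: tuple[str, str]) -> str:
    """Serialize a seed pair into a single string key for hashing."""
    return seed[0] + "|" + seed[1]


class BloomFilter:
    """Plain Bloom filter over string keys; it stores no keys, only bits."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item))
```

Because the bases inside the gap never enter the key, a substitution that falls in the gap leaves the seed unchanged, which is one way to read the abstract's remark about tolerance to sequencing errors.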


A Multi-attribute Data Structure with Parallel Bloom Filters for Network Services

High Performance Computing - HiPC 2006 - Lecture Notes in Computer Science ◽

10.1007/11945918_30 ◽

2006 ◽

pp. 277-288 ◽

Cited By ~ 9

Author(s):

Yu Hua ◽

Bin Xiao

Keyword(s):

Data Structure ◽

Bloom Filters ◽

Network Services ◽

Attribute Data


DSACL+-tree: A Dynamic Data Structure for Similarity Search in Secondary Memory

10.1007/978-3-642-32153-5_9 ◽

2012 ◽

pp. 116-131 ◽

Cited By ~ 2

Author(s):

Luis Britos ◽

A. Marcela Printista ◽

Nora Reyes

Keyword(s):

Data Structure ◽

Similarity Search ◽

Secondary Memory ◽

Dynamic Data ◽

Dynamic Data Structure


Oblivious Data Structure for Secure Multiple-Set Membership Testing

10.1007/978-3-030-87571-8_26 ◽

2021 ◽

pp. 299-310

Author(s):

Qin Jiang ◽

Yanjun An ◽

Yong Qi ◽

Hai Fang

Keyword(s):

Data Structure ◽

Set Membership


Fast Dynamic Multiple-Set Membership Testing Using Combinatorial Bloom Filters

IEEE/ACM Transactions on Networking ◽

10.1109/tnet.2011.2173351 ◽

2012 ◽

Vol 20 (1) ◽

pp. 295-304 ◽

Cited By ~ 25

Author(s):

Fang Hao ◽

Murali Kodialam ◽

T. V. Lakshman ◽

Haoyu Song

Keyword(s):

Bloom Filters ◽

Set Membership ◽

Fast Dynamic


A neural data structure for novelty detection

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1814448115 ◽

2018 ◽

Vol 115 (51) ◽

pp. 13093-13098 ◽

Cited By ~ 7

Author(s):

Sanjoy Dasgupta ◽

Timothy C. Sheehan ◽

Charles F. Stevens ◽

Saket Navlakha

Keyword(s):

Data Structure ◽

Computer Science ◽

Novelty Detection ◽

Bloom Filter ◽

Fruit Fly ◽

Fruit Flies ◽

Bloom Filters ◽

Neural Data ◽

Biological Problem ◽

Computational Systems

Novelty detection is a fundamental biological problem that organisms must solve to determine whether a given stimulus departs from those previously experienced. In computer science, this problem is solved efficiently using a data structure called a Bloom filter. We found that the fruit fly olfactory circuit evolved a variant of a Bloom filter to assess the novelty of odors. Compared with a traditional Bloom filter, the fly adjusts novelty responses based on two additional features: the similarity of an odor to previously experienced odors and the time elapsed since the odor was last experienced. We elaborate and validate a framework to predict novelty responses of fruit flies to given pairs of odors. We also translate insights from the fly circuit to develop a class of distance- and time-sensitive Bloom filters that outperform prior filters when evaluated on several biological and computational datasets. Overall, our work illuminates the algorithmic basis of an important neurobiological problem and offers strategies for novelty detection in computational systems.


Multilevel Bloom Filters for P2P Flows Identification Based on Cluster Analysis in Wireless Mesh Network

Discrete Dynamics in Nature and Society ◽

10.1155/2015/801934 ◽

2015 ◽

Vol 2015 ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Xia-an Bi ◽

Xiaohui Wang ◽

Luyun Xu ◽

Sheng Chen ◽

Hong Liu

Keyword(s):

Data Structure ◽

Mesh Networks ◽

Level Structure ◽

Traffic Monitoring ◽

Mesh Network ◽

Bloom Filters ◽

Identification Algorithm ◽

Wireless Mesh ◽

Network Operation ◽

P2p Applications

With the development of wireless mesh networks and distributed computing, many new P2P services have been deployed, enriching Internet content and applications. The rapid growth of P2P flows puts great pressure on regular network operation, so effective identification and management of P2P application flows has become increasingly urgent. In this paper, we build a multilevel Bloom filter data structure to identify P2P flows, based on a study of their locality characteristics. Each level stores a different number of P2P flow rules. We adjust the parameters of the Bloom filter data structure according to the characteristic values of the P2P flows. A search traverses the levels from first to last. Compared with traditional algorithms, our method addresses the drawbacks of previous schemes. Simulation results demonstrate that our algorithm effectively improves the performance of P2P flow identification. We then deploy our flow identification algorithm in the traffic monitoring sensors of the network traffic monitoring system at the export link of a campus network. In this real environment, experimental results demonstrate that our algorithm identifies P2P flows quickly and accurately; it is therefore suitable for actual deployment.


Creating a Concurrent Overflowing Bloom Filter

10.14293/s2199-1006.1.sor-.ppf4wcp.v1 ◽

2019 ◽

Author(s):

Alex Berliner ◽

Brian Estes ◽

Ebin Scaria

Keyword(s):

Data Structure ◽

Recent Literature ◽

Bloom Filter ◽

Bloom Filters ◽

Probabilistic Data ◽

Additional Element ◽

Marginal Value ◽

The Creation ◽

Probabilistic Data Structure

Bloom filters are an efficient probabilistic data structure used to verify membership of an element inside of a set. There is diminishing marginal value for inserting each additional element into a Bloom filter, and so steps must be taken to maintain scalability. One such option is to create a secondary hash set for a particular hash set in a Bloom filter that has become full, known as an overflow area. At this time, there are no implementations of a Bloom filter that implement this overflow system while maintaining concurrency. In this paper, we demonstrate the creation of a concurrent overflow system for Bloom filters. We use the base Bloom filter presented in recent literature and replace their method of dynamically resizing the Bloom filters with our overflow table implementation, as outlined in one of their suggested areas for future exploration. We then compare the results of our Bloom filter with those from the previously mentioned implementation as well as a standard Bloom filter.
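The scalability concern raised here follows from the standard approximation of a Bloom filter's false-positive rate, (1 - e^(-kn/m))^k for m bits, k hash functions, and n inserted elements. The short Python sketch below evaluates that textbook formula; it is not taken from the paper, and the function name `false_positive_rate` is hypothetical.

```python
import math


def false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k of a Bloom filter's false-positive rate."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def rates_while_filling(num_bits: int, num_hashes: int, steps: list[int]) -> list[float]:
    """False-positive rate after each element count in `steps`, for a fixed-size filter."""
    return [false_positive_rate(num_bits, num_hashes, n) for n in steps]
```

For a fixed number of bits the rate rises toward 1 as elements are added, which is why a filter that keeps growing needs either resizing or some form of overflow storage.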


A data structures for purpose of the BIM-based Life Cycle Assessment: A review and theoretical background

IOP Conference Series Materials Science and Engineering ◽

10.1088/1757-899x/1209/1/012001 ◽

2021 ◽

Vol 1209 (1) ◽

pp. 012001

Author(s):

M Brandtner ◽

V Venkrbec

Keyword(s):

Life Cycle Assessment ◽

Life Cycle ◽

Data Structure ◽

Theoretical Background ◽

System Construction ◽

Building Information Model ◽

Building Information ◽

Specific Material ◽

Effective Use ◽

Whole Life Cycle

The article deals with the data structure for Life Cycle Assessment (LCA) of buildings using the Building Information Model (BIM). The construction industry produces a significant amount of waste, while landfill capacity is almost exhausted. It is therefore necessary to make effective use of materials that have already been used and have the potential to be reused. LCA is a method that can be used to demonstrate the suitability of proposed materials, structures or buildings in terms of their whole life cycle and its environmental impact. In addition to geometry, BIM includes an information part. This data can be used for life cycle inventory (LCI) and then for the assessment itself. The aim of the article is to analyse previous approaches and define which data structure needs to be obtained from the BIM model for the LCI of a specific material. The proposed methodology for data recognition and selection is based on the data structure of a non-graphical database called SNIM, which was developed for the Czech construction environment. The article also covers the theoretical background of the newly developed classification system Construction Classification International (CCI).


Succinct range filters

Communications of the ACM ◽

10.1145/3450262 ◽

2021 ◽

Vol 64 (4) ◽

pp. 166-173

Author(s):

Huanchen Zhang ◽

Hyeontaek Lim ◽

Viktor Leis ◽

David G. Andersen ◽

Michael Kaminsky ◽

...

Keyword(s):

Information Theory ◽

Data Structure ◽

State Of The Art ◽

Bloom Filters ◽

Range Queries ◽

Database Storage

We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries, such as range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the performance of state-of-the-art order-preserving indexes, while consuming only 10 bits per trie node, a space close to the minimum required by information theory. Our experiments show that SuRF speeds up range queries in a widely used database storage engine by up to 5×.
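What an approximate range filter offers over a Bloom filter can be shown with a deliberately simplified Python sketch. It does not implement SuRF or the Fast Succinct Trie: it swaps in a sorted list of truncated key prefixes, and the class name `PrefixRangeFilter` and its methods are hypothetical. Like any such filter it may answer "maybe" for keys or ranges that are absent, but it never misses a stored key.

```python
from bisect import bisect_left


class PrefixRangeFilter:
    """Approximate membership and range filter built from truncated, sorted key prefixes."""

    def __init__(self, keys: list[str], prefix_len: int):
        self.prefix_len = prefix_len
        # Truncation loses information (source of false positives) but keeps key order.
        self.prefixes = sorted({key[:prefix_len] for key in keys})

    def may_contain(self, key: str) -> bool:
        """Single-key lookup: False means the key is definitely absent."""
        prefix = key[:self.prefix_len]
        i = bisect_left(self.prefixes, prefix)
        return i < len(self.prefixes) and self.prefixes[i] == prefix

    def may_contain_range(self, low: str, high: str) -> bool:
        """Range lookup over [low, high]: False means no stored key lies in the range."""
        lo = low[:self.prefix_len]
        hi = high[:self.prefix_len]
        i = bisect_left(self.prefixes, lo)
        return i < len(self.prefixes) and self.prefixes[i] <= hi
```

The range query is safe because truncating strings to a fixed length preserves their lexicographic order: any stored key between `low` and `high` has a prefix between the truncated bounds. A storage engine could consult such a filter before a disk read and skip the read whenever the answer is False.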
