Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

2020
Author(s): Enrico Seiler, Svenja Mehringer, Mitra Darvish, Etienne Turc, Knut Reinert

We present Raptor, a tool for approximately searching many queries in large collections of nucleotide sequences. In comparison with similar tools like Mantis and COBS, Raptor is 12 to 144 times faster and uses up to 30 times less memory. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the Interleaved Bloom Filter (IBF) as a set membership data structure, and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory.
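The combination of minimizers and an interleaved Bloom filter described above can be illustrated with a minimal Python sketch. It is not Raptor's implementation: the names `kmer_hash`, `minimizers`, `InterleavedBloomFilter`, `build_index` and `search` are hypothetical, the hash function is an arbitrary choice, and the fixed `min_fraction` cutoff is a simple stand-in for the probabilistic thresholding mentioned in the abstract.

```python
import hashlib


def kmer_hash(kmer, seed=0):
    """64-bit hash of a k-mer; the seed selects one of several hash functions."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minimizers(seq, k, w):
    """Pick the k-mer with the smallest hash in every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    picked = set()
    for start in range(max(1, len(kmers) - w + 1)):
        picked.add(min(kmers[start:start + w], key=kmer_hash))
    return picked


class InterleavedBloomFilter:
    """One Bloom filter per bin, stored so that bit position p of all bins sits together."""

    def __init__(self, bins, bits_per_bin, hashes=2):
        self.bins = bins
        self.size = bits_per_bin
        self.hashes = hashes
        # rows[p] is an integer whose bit b is set if bin b has position p set.
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer):
        return [kmer_hash(kmer, seed) % self.size for seed in range(self.hashes)]

    def insert(self, kmer, bin_id):
        for p in self._positions(kmer):
            self.rows[p] |= 1 << bin_id

    def query(self, kmer):
        """Return a bitmask of the bins that may contain the k-mer."""
        mask = (1 << self.bins) - 1
        for p in self._positions(kmer):
            mask &= self.rows[p]
        return mask


def build_index(references, k, w, bits_per_bin=1024):
    """Insert the minimizers of each reference sequence into its own bin."""
    ibf = InterleavedBloomFilter(len(references), bits_per_bin)
    for bin_id, ref in enumerate(references):
        for m in minimizers(ref, k, w):
            ibf.insert(m, bin_id)
    return ibf


def search(ibf, query, k, w, min_fraction=0.8):
    """Report bins in which at least min_fraction of the query minimizers are found."""
    mins = minimizers(query, k, w)
    counts = [0] * ibf.bins
    for m in mins:
        hits = ibf.query(m)
        for b in range(ibf.bins):
            counts[b] += (hits >> b) & 1
    needed = min_fraction * len(mins)
    return [b for b, c in enumerate(counts) if mins and c >= needed]
```

A single `query` call answers the membership question for all bins at once, which is the property that makes the interleaved layout attractive. As with any Bloom filter, reported bins may include false positives.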

2015, Vol 2015, pp. 1-8
Author(s): Inanç Birol, Justin Chu, Hamid Mohamadi, Shaun D. Jackman, Karthika Raghavan, ...

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Although longer subsequences unlock longer genomic features for assembly, the associated increase in compute resources limits the practicability of DBG compared with other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, sequence specificity increases with gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
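As a rough illustration of storing paired subsequences in a Bloom filter, the following Python sketch extracts two k-mers separated by a fixed gap and inserts each pair as a single key. The class `BloomFilter`, the function `paired_seeds`, the `|` separator and all parameter values are assumptions made for this example, not the designs proposed in the paper.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over string keys (illustrative only)."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def paired_seeds(read, k, gap):
    """Yield spaced seeds: two k-mers separated by a fixed gap of ignored bases."""
    span = 2 * k + gap
    for i in range(len(read) - span + 1):
        left = read[i:i + k]
        right = read[i + k + gap:i + span]
        yield left + "|" + right


def index_reads(reads, k, gap, size_bits=4096, num_hashes=3):
    """Store every spaced seed of every read in one Bloom filter."""
    bf = BloomFilter(size_bits, num_hashes)
    for read in reads:
        for seed in paired_seeds(read, k, gap):
            bf.add(seed)
    return bf
```

Because the gap positions are not part of the key, a sequencing error that falls inside the gap leaves the seed unchanged, which is one way to read the tolerance to errors noted in the abstract.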


2021, pp. 299-310
Author(s): Qin Jiang, Yanjun An, Yong Qi, Hai Fang

2012, Vol 20 (1), pp. 295-304
Author(s): Fang Hao, Murali Kodialam, T. V. Lakshman, Haoyu Song

2018, Vol 115 (51), pp. 13093-13098
Author(s): Sanjoy Dasgupta, Timothy C. Sheehan, Charles F. Stevens, Saket Navlakha

Novelty detection is a fundamental biological problem that organisms must solve to determine whether a given stimulus departs from those previously experienced. In computer science, this problem is solved efficiently using a data structure called a Bloom filter. We found that the fruit fly olfactory circuit evolved a variant of a Bloom filter to assess the novelty of odors. Compared with a traditional Bloom filter, the fly adjusts novelty responses based on two additional features: the similarity of an odor to previously experienced odors and the time elapsed since the odor was last experienced. We elaborate and validate a framework to predict novelty responses of fruit flies to given pairs of odors. We also translate insights from the fly circuit to develop a class of distance- and time-sensitive Bloom filters that outperform prior filters when evaluated on several biological and computational datasets. Overall, our work illuminates the algorithmic basis of an important neurobiological problem and offers strategies for novelty detection in computational systems.


2015, Vol 2015, pp. 1-9
Author(s): Xia-an Bi, Xiaohui Wang, Luyun Xu, Sheng Chen, Hong Liu

With the development of wireless mesh networks and distributed computing, many new P2P services have been deployed, enriching Internet content and applications. The rapid growth of P2P flows puts great pressure on regular network operation, so effective identification and management of P2P flows has become increasingly urgent. In this paper, we study the locality characteristics of P2P flows and build a multilevel Bloom filter data structure to identify them. Each level stores a different number of P2P flow rules, and we adjust the parameters of the Bloom filters according to the characteristic values of the P2P flows. A search traverses the structure from the first level to the last. Compared with traditional algorithms, our method addresses the drawbacks of previous schemes. Simulation results demonstrate that our algorithm effectively enhances the performance of P2P flow identification. We then deploy our flow identification algorithm in the traffic monitoring sensors of the network traffic monitoring system at the export link of a campus network. In this real environment, the experimental results demonstrate that our algorithm identifies P2P flows with high speed and accuracy; therefore, it is suitable for actual deployment.
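The level-by-level lookup can be pictured with a small Python sketch. It is only one reading of the abstract: the class names, the per-level sizes, and the use of an address-and-port string as the flow key are all hypothetical, and the paper's rule format and parameter tuning are not reproduced here.

```python
import hashlib


class BloomFilter:
    """Compact Bloom filter that uses a Python integer as its bit array."""

    def __init__(self, size_bits, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Prefix the key with a seed to obtain several hash positions.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def __contains__(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class MultilevelFilter:
    """Several Bloom filters of different sizes, searched from first to last."""

    def __init__(self, level_sizes):
        self.levels = [BloomFilter(size) for size in level_sizes]

    def add_rule(self, level, flow_key):
        self.levels[level].add(flow_key)

    def classify(self, flow_key):
        """Return the index of the first level that matches, or None if no level does."""
        for index, level in enumerate(self.levels):
            if flow_key in level:
                return index
        return None
```

Placing the most frequently matched rules in the first, smaller level would let most lookups stop early, which is how locality could reduce the average search cost. A match at any level may still be a false positive.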


Author(s): Alex Berliner, Brian Estes, Ebin Scaria

Bloom filters are an efficient probabilistic data structure used to verify membership of an element inside of a set. There is diminishing marginal value for inserting each additional element into a Bloom filter, and so steps must be taken to maintain scalability. One such option is to create a secondary hash set for a particular hash set in a Bloom filter that has become full, known as an overflow area. At this time, there are no implementations of a Bloom filter that implement this overflow system while maintaining concurrency. In this paper, we demonstrate the creation of a concurrent overflow system for Bloom filters. We use the base Bloom filter presented in recent literature and replace their method of dynamically resizing the Bloom filters with our overflow table implementation, as outlined in one of their suggested areas for future exploration. We then compare the results of our Bloom filter with those from the previously mentioned implementation as well as a standard Bloom filter.
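The diminishing value of each additional insertion can be made concrete with the standard approximation for the false positive rate of a Bloom filter with m bits, k hash functions and n inserted elements, (1 - e^(-kn/m))^k. The Python sketch below evaluates it; the function names are hypothetical, and using a maximum acceptable rate as the criterion for when a filter counts as full is an assumption for illustration, not the rule used in the paper.

```python
import math


def false_positive_rate(m_bits, k_hashes, n_items):
    """Standard approximation of the Bloom filter false positive rate."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def items_until(m_bits, k_hashes, max_fpr):
    """Largest number of items for which the estimated rate stays at or below max_fpr."""
    if not 0.0 < max_fpr < 1.0:
        raise ValueError("max_fpr must be strictly between 0 and 1")
    n = 0
    while false_positive_rate(m_bits, k_hashes, n + 1) <= max_fpr:
        n += 1
    return n
```

Once `items_until` is exceeded, further insertions into the same bit array push the error rate past the chosen bound, which is the point at which redirecting new elements to a secondary structure such as an overflow area becomes attractive.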


2021, Vol 1209 (1), pp. 012001
Author(s): M Brandtner, V Venkrbec

The article deals with the data structure needed for Life Cycle Assessment (LCA) of buildings using the Building Information Model (BIM). The construction industry produces a significant amount of waste, while the capacities of landfills are almost exhausted. It is therefore necessary to make effective use of materials that have already been used and have the potential to be reused. LCA is a method that can be used to demonstrate the suitability of proposed materials, structures or buildings in terms of their whole life cycle and its environmental impact. In addition to geometry, BIM includes an information part. This data can be used for life cycle inventory (LCI) and then for the assessment itself. The aim of the article is to analyse previous approaches and define which data structure needs to be obtained from the BIM model for the LCI of a specific material. The proposed methodology of data recognition and selection is based on the data structure of a non-graphical database called SNIM, which was developed for the Czech construction environment. The article also covers the theoretical background of the newly developed classification system Construction Classification International (CCI).


2021, Vol 64 (4), pp. 166-173
Author(s): Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G. Andersen, Michael Kaminsky, ...

We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries, such as range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the performance of state-of-the-art order-preserving indexes, while consuming only 10 bits per trie node, a space close to the minimum required by information theory. Our experiments show that SuRF speeds up range queries in a widely used database storage engine by up to 5×.

