AllSome Sequence Bloom Trees

Abstract: The ubiquity of next-generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples in which a transcript of interest is potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing tree construction time by 52.7% and query time by 39-85%, at the price of up to 3x memory consumption during queries. Notably, it can process a batch of 198,074 queries in under 8 hours (compared to around two days previously) and a whole set of k-mers from a sequencing experiment (about 27 million k-mers) in under 11 minutes.
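
For orientation, the query idea behind the original Sequence Bloom Tree (not the AllSome variant proposed here) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: every class and function name (`Bloom`, `Node`, `leaf`, `union`, `query`), the filter size, the hash scheme, and the threshold `theta` are hypothetical choices made for the sketch, not the authors' implementation. Each leaf holds a Bloom filter of one sample's k-mers, each internal node holds the union of its children, and a query descends only into subtrees where at least a fraction `theta` of its k-mers appear to be present.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter stored in a Python int (sizes are illustrative)."""
    def __init__(self, m=4096, k=3, bits=0):
        self.m, self.k, self.bits = m, k, bits

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

class Node:
    def __init__(self, bloom, name=None, children=()):
        self.bloom, self.name, self.children = bloom, name, list(children)

def leaf(name, seq, k):
    """Build a leaf whose filter contains the k-mers of one sample."""
    bloom = Bloom()
    for x in kmers(seq, k):
        bloom.add(x)
    return Node(bloom, name=name)

def union(left, right):
    """Build an internal node whose filter is the bitwise OR of its children."""
    merged = Bloom(bits=left.bloom.bits | right.bloom.bits)
    return Node(merged, children=(left, right))

def query(node, query_kmers, theta):
    """Return names of leaves where >= theta of the query k-mers appear present."""
    hits = sum(1 for x in query_kmers if x in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: no leaf below this node can reach the threshold
    if not node.children:
        return [node.name]
    return [name for child in node.children
            for name in query(child, query_kmers, theta)]

# Usage: two toy "samples" and one query sequence.
root = union(leaf("s1", "ACGTACGTGG", 4), leaf("s2", "TTTTTTTTTT", 4))
matches = query(root, kmers("ACGTACG", 4), theta=0.8)
```

Because Bloom filters admit false positives, the reported samples are candidates that may contain the query, which matches the abstract's wording of a transcript being "potentially expressed".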


A Scalable Algorithm for Constructing Frequent Pattern Tree

International Journal of Intelligent Information Technologies, 2014, Vol 10 (1), pp. 42-56. DOI: 10.4018/ijiit.2014010103. Cited by: ~3

Author(s): Zailani Abdullah, Tutut Herawan, A. Noraziah, Mustafa Mat Deris

Keyword(s): Data Structure, Frequent Pattern, Frequent Patterns, Scalable Algorithm, Tree Construction, Frequent Pattern Tree, Support Threshold, Benchmark Datasets, Tree Data, Tree Data Structure

Frequent Pattern Tree (FP-Tree) is a compact data structure for representing frequent itemsets. Constructing the FP-Tree is an essential step before frequent pattern mining. However, few efforts have focused specifically on constructing the FP-Tree from anything other than the original database. Typical FP-Tree construction requires prior knowledge of the support threshold as well as two database scans: the first to build and sort the frequent patterns, and the second to build the prefix paths. Scanning the database twice is therefore a major limitation of FP-Tree construction. This paper proposes the scalable Trie Transformation Technique Algorithm (T3A), which converts our predefined tree data structure, the Disorder Support Trie Itemset (DOSTrieIT), into an FP-Tree. Experimental results on two UCI benchmark datasets show that the proposed T3A generates the FP-Tree up to three orders of magnitude faster than the benchmarked FP-Growth.
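
The conventional two-scan construction that the abstract describes as the baseline can be sketched as follows. This is a minimal illustration of the standard FP-Tree build, not of T3A or DOSTrieIT; the names `FPNode` and `build_fp_tree`, the toy transactions, and the alphabetical tie-breaking rule are assumptions made for the example.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count the support of every item and keep the frequent ones.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_support}

    root = FPNode(None)
    # Scan 2: insert each transaction's frequent items as a prefix path,
    # ordered by descending support (ties broken alphabetically here).
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent

# Usage: four toy transactions with a support threshold of 2.
transactions = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]
root, frequent = build_fp_tree(transactions, min_support=2)
```

The support threshold has to be known before the first scan finishes, and the sorted order from that scan is needed before any path can be inserted, which is why the standard build cannot avoid reading the database twice.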


Improved representation of sequence Bloom trees

2018. DOI: 10.1101/501452. Cited by: ~4

Author(s): Robert S. Harris, Paul Medvedev

Keyword(s): Data Structure, Biological Databases, Rna Seq, End User, Indexing Methods, Fundamental Part, Sequence Read Archive, Source Program, Free Open Source, Generation Sequencing

Abstract: Algorithmic solutions to index and search biological databases are a fundamental part of bioinformatics, providing underlying components to many end-user tools. Inexpensive next-generation sequencing has filled publicly available databases such as the Sequence Read Archive beyond the capacity of traditional indexing methods. Recently, the Sequence Bloom Tree (SBT) and its derivatives were proposed as a way to efficiently index such data for queries about transcript presence. We build on the SBT framework to construct the HowDe-SBT data structure, which uses a novel partitioning of information to reduce the construction and query time as well as the size of the index. We evaluate HowDe-SBT both by proving theoretical bounds on its performance and by using real RNA-seq data. Compared to previous SBT methods, HowDe-SBT can construct the index in less than 36% of the time and with 39% less space, and can answer small-batch queries at least five times faster. HowDe-SBT is available as a free open source program at https://github.com/medvedevgroup/HowDeSBT.
