Bedtk: Finding Interval Overlap with Implicit Interval Tree


Bioinformatics ◽

10.1093/bioinformatics/btaa827 ◽

2020 ◽

Author(s):

Heng Li ◽

Jiazhen Rong

Keyword(s):

Data Structure ◽

Source Code ◽

Interval Tree

Abstract. Summary: We present bedtk, a new toolkit for manipulating genomic intervals in the BED format. It supports sorting, merging, intersection, subtraction and the calculation of the breadth of coverage. Bedtk uses an implicit interval tree, a data structure for fast interval overlap queries. It is several to tens of times faster than existing tools and tends to use less memory. Availability and implementation: The source code is available at https://github.com/lh3/bedtk.
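
To make the idea of an interval tree without explicit child pointers concrete, here is a minimal Python sketch. It is not bedtk's implementation and does not reproduce its array layout: it assumes half-open BED-style intervals with non-negative coordinates, stores them in a sorted array, and treats the midpoint of each sub-range as the subtree root. The class name `ImplicitIntervalTree` and its methods are hypothetical, chosen for illustration only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED


class ImplicitIntervalTree:
    """Augmented interval tree kept in a sorted array (no child pointers).

    Illustrative assumption: the subtree over ivs[lo:hi] is rooted at the
    midpoint, and max_end[mid] holds the largest end in that subtree.
    """

    def __init__(self, intervals: List[Interval]) -> None:
        self.ivs = sorted(intervals)
        self.max_end = [0] * len(self.ivs)
        self._build(0, len(self.ivs))

    def _build(self, lo: int, hi: int) -> int:
        if lo >= hi:
            return -1  # sentinel below any valid (non-negative) coordinate
        mid = (lo + hi) // 2
        m = max(self.ivs[mid][1], self._build(lo, mid), self._build(mid + 1, hi))
        self.max_end[mid] = m
        return m

    def overlap(self, start: int, end: int) -> List[Interval]:
        """Return all stored intervals overlapping [start, end)."""
        out: List[Interval] = []
        stack = [(0, len(self.ivs))]
        while stack:
            lo, hi = stack.pop()
            if lo >= hi:
                continue
            mid = (lo + hi) // 2
            if self.max_end[mid] <= start:
                continue  # every interval in this subtree ends too early
            stack.append((lo, mid))  # left subtree may still overlap
            s, e = self.ivs[mid]
            if s < end:
                if e > start:
                    out.append((s, e))
                stack.append((mid + 1, hi))  # right subtree starts at >= s
        return sorted(out)


tree = ImplicitIntervalTree([(10, 20), (15, 30), (40, 50), (5, 8)])
hits = tree.overlap(18, 41)
```

The two pruning tests (the subtree's maximum end and the root's start) are what keep a query from visiting the whole array; the only memory used beyond the sorted intervals is one extra integer per interval.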


A space and time-efficient index for the compacted colored de Bruijn graph

10.1101/191874 ◽

2017 ◽

Cited By ~ 3

Author(s):

Fatemeh Almodaresi ◽

Hirak Sarkar ◽

Rob Patro

Keyword(s):

Data Structure ◽

Pattern Search ◽

De Bruijn Graph ◽

Existing Structures ◽

Link Type ◽

Reference Information ◽

De Bruijn ◽

Colored De Bruijn Graph ◽

Asymptotically Efficient ◽

Fast Access

Abstract: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing, carefully organizing our data structure, and making use of succinct representations where applicable, our data structure provides practically fast k-mer lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme built on the same underlying representation, which provides the ability to trade off k-mer query speed for a reduction in the de Bruijn graph index size. We believe this representation strikes a desirable balance between speed and space usage, and it will allow for fast search on large reference sequences. Pufferfish is developed in C++11, is open source (GPL v3), and is available at https://github.com/COMBINE-lab/Pufferfish. The scripts used to generate the results in this manuscript are available at https://github.com/COMBINE-lab/pufferfish_experiments.


Decoupling alignment strategy from feature quantification using a standard alignment incidence data structure

10.1101/2021.02.16.431379 ◽

2021 ◽

Author(s):

Kwangbom Choi ◽

Matthew J. Vincent ◽

Gary A. Churchill

Keyword(s):

Gene Expression ◽

Data Structure ◽

Genomic Feature ◽

File Format ◽

Read Alignment ◽

Incidence Data ◽

Link Type ◽

Generic Data ◽

Alignment Strategy ◽

Standard Alignment

Abstract. Summary: The abundance of a genomic feature, such as gene expression, is often estimated from the observed total number of alignment incidences in the targeted genome regions. We introduce a generic data structure and associated file format for alignment incidence data so that method developers can create novel pipelines comprising models, each optimal for read alignment, post-alignment QC, and quantification across multiple sequencing modalities. Availability and implementation: alntools software is freely available at https://github.com/churchill-lab/alntools under the MIT license.


A Data Structure on Interval Graphs and Its Applications

Journal of Circuits System and Computers ◽

10.1142/s0218126697000127 ◽

1997 ◽

Vol 07 (03) ◽

pp. 165-175 ◽

Cited By ~ 10

Author(s):

Madhumangal Pal ◽

G. P. Bhattacharjee

Keyword(s):

Data Structure ◽

Data Structures ◽

Interval Graph ◽

Interval Graphs ◽

Point Of View ◽

Interval Tree

In this paper, a new data structure, the interval tree (IT), is introduced for an interval graph. Some important properties of IT are studied from the algorithmic point of view. It has many advantages compared to the data structures commonly used to solve problems on interval graphs. Using the properties of IT, the following problems are solved on interval graphs: (i) shortest distances between any two vertices, and (ii) the diameter of the graph.
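
For readers unfamiliar with the two problems named in this abstract, the following Python sketch states them as a brute-force baseline. It does not implement the paper's IT structure. It assumes closed intervals and a connected graph, builds the interval graph by pairwise comparison, and uses breadth-first search. All function names are hypothetical.

```python
from collections import deque
from typing import List, Tuple


def interval_graph(intervals: List[Tuple[int, int]]) -> List[List[int]]:
    """Adjacency lists: vertices i and j are adjacent if their closed intervals meet."""
    n = len(intervals)
    adj: List[List[int]] = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            (a, b), (c, d) = intervals[i], intervals[j]
            if a <= d and c <= b:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def bfs_distances(adj: List[List[int]], src: int) -> List[int]:
    """Shortest hop counts from src; -1 marks unreachable vertices."""
    dist = [-1] * len(adj)
    dist[src] = 0
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj: List[List[int]]) -> int:
    """Largest shortest-path distance; assumes the graph is connected."""
    return max(max(bfs_distances(adj, s)) for s in range(len(adj)))


adj = interval_graph([(1, 3), (2, 5), (4, 7), (6, 9)])
dists_from_first = bfs_distances(adj, 0)
diam = diameter(adj)
```

This baseline takes quadratic time just to build the graph, which is the kind of cost a structure specialised for interval graphs is meant to avoid.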


Query co-planning for shared execution in key-value stores

Journal of Information and Data Management ◽

10.5753/jidm.2021.1946 ◽

2021 ◽

Vol 12 (5) ◽

Author(s):

Josué Ttito ◽

Renato Marroquín ◽

Sergio Lifschitz ◽

Lewis McGibbney ◽

José Talavera

Keyword(s):

Data Structure ◽

Data Structures ◽

Data Model ◽

Range Queries ◽

Model Data ◽

Data Movement ◽

Segment Tree ◽

Interval Tree ◽

Simple Interface ◽

Arbitrary Objects

Key-value stores offer a straightforward yet powerful data model. Data is modeled as key-value pairs, where values can be arbitrary objects that are written and read using their associated key. In addition to their simple interface, such data stores also provide read operations such as full and range scans. However, because the interface is so simple, optimizing data accesses becomes challenging. This work aims to enable the shared execution of concurrent range and point queries on key-value stores, thereby reducing the overall data movement when executing a complete workload. To accomplish this, we analyze different possible data structures and propose our variation of a segment tree, the Updatable Interval Tree. Our data structure helps us co-plan and co-execute multiple range queries together and reduces redundant work. This results in more efficient workload execution and increased overall throughput, as we show in our evaluation.
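
The general idea of shared execution can be illustrated with a small Python sketch. It is not the Updatable Interval Tree from the paper. It assumes an in-memory dictionary as the store and half-open integer key ranges, merges overlapping query ranges into as few scans as possible, and hands each scanned key to every query that asked for it. The names `co_plan` and `co_execute` are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # half-open key range [lo, hi)


def co_plan(queries: List[Range]) -> List[Range]:
    """Merge overlapping or touching query ranges into a minimal list of scans."""
    merged: List[Range] = []
    for lo, hi in sorted(queries):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def co_execute(store: Dict[int, str], queries: List[Range]):
    """Run the merged scans once and route each key to every matching query."""
    keys = sorted(store)
    results: List[List[Tuple[int, str]]] = [[] for _ in queries]
    scanned = 0  # number of keys read from the store
    for lo, hi in co_plan(queries):
        i = bisect_left(keys, lo)
        while i < len(keys) and keys[i] < hi:
            k = keys[i]
            scanned += 1
            for q, (qlo, qhi) in enumerate(queries):
                if qlo <= k < qhi:
                    results[q].append((k, store[k]))
            i += 1
    return results, scanned


store = {k: f"v{k}" for k in range(10)}
queries = [(2, 6), (4, 8), (4, 5)]
results, scanned = co_execute(store, queries)
```

In this example the three queries collapse into a single scan over keys 2 to 7, so the store is read six times instead of the nine reads that three independent scans would need. The linear loop that routes keys to queries is the part a tree over the query ranges would be expected to speed up.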


Biobtree: A tool to search, map and visualize bioinformatics identifiers and special keywords

F1000Research ◽

10.12688/f1000research.17927.1 ◽

2019 ◽

Vol 8 ◽

pp. 145

Author(s):

Tamer Gur

Keyword(s):

Data Structure ◽

Web Services ◽

Software Tools ◽

Web Interface ◽

Bioinformatics Tool ◽

Link Type ◽

Executable File ◽

Tree Data ◽

Tree Data Structure ◽

Gene Symbols

Due to their nature, bioinformatics datasets are often closely related to each other. For this reason, search, mapping and visualization of these relations are often performed manually or programmatically via identifiers or special keywords such as gene symbols. Although various tools exist for these situations, the growing volume of bioinformatics datasets and the emergence of new software tools and approaches motivate new solutions. To provide a new tool for these cases, I present the Biobtree bioinformatics tool. Biobtree fetches and indexes identifiers and special keywords with their related identifiers from supported datasets, optionally together with user pre-defined datasets, and provides a web interface, web services and a single uniform database output based directly on a B+ tree data structure. Biobtree can handle billions of identifiers and runs via a single executable file with no installation or dependencies required. It also aims to provide a relatively small codebase for easy maintenance, addition of new features and extension to larger datasets. Biobtree is available to download from GitHub.


COMPACT INTERVAL TREES: A DATA STRUCTURE FOR CONVEX HULLS

International Journal of Computational Geometry & Applications ◽

10.1142/s0218195991000025 ◽

1991 ◽

Vol 01 (01) ◽

pp. 1-22 ◽

Cited By ~ 19

Author(s):

LEONIDAS GUIBAS ◽

JOHN HERSHBERGER ◽

JACK SNOEYINK

Keyword(s):

Data Structure ◽

Convex Hull ◽

Compact Interval ◽

Common Tangent ◽

Convex Polygons ◽

Fixed Set ◽

Interval Tree ◽

Interval Trees ◽

Arrangements Of Lines ◽

The Common

In this paper, we investigate the problem of finding the common tangents of two convex polygons that intersect in two (unknown) points. First, we give a Θ(log² n) bound for algorithms that store the polygons in independent arrays. Second, we show how to beat the lower bound if the vertices of the convex polygons are drawn from a fixed set of n points. We introduce a data structure called a compact interval tree that supports common tangent computations, as well as the standard binary-search-based queries, in O(log n) time apiece. Third, we apply compact interval trees to solve the subpath hull query problem: given a simple path, preprocess it so that we can find the convex hull of a query subpath quickly. With O(n log n) preprocessing, we can assemble a compact interval tree that represents the convex hull of a query subpath in O(log n) time. In order to represent arrangements of lines implicitly, Edelsbrunner et al. used a less efficient structure, called bridge trees, to solve the subpath hull query problem. Our compact interval trees improve their results by a factor of O(log n). Thus, the present paper replaces the paper on bridge trees referred to by Edelsbrunner et al.


epicontacts: Handling, visualisation and analysis of epidemiological contacts

F1000Research ◽

10.12688/f1000research.14492.1 ◽

2018 ◽

Vol 7 ◽

pp. 566 ◽

Cited By ~ 3

Author(s):

VP Nagraj ◽

Nistara Randhawa ◽

Finlay Campbell ◽

Thomas Crellen ◽

Bertrand Sudre ◽

...

Keyword(s):

Data Structure ◽

Network Analysis ◽

R Package ◽

Contact Tracing ◽

Single Object ◽

Line List ◽

Analysis Techniques ◽

Unique Data ◽

Link Type ◽

Interactive Visualisation

Epidemiological outbreak data is often captured in line list and contact format to facilitate contact tracing for outbreak control. epicontacts is an R package that provides a unique data structure for combining these data into a single object in order to facilitate more efficient visualisation and analysis. The package incorporates interactive visualisation functionality as well as network analysis techniques. Originally developed as part of the Hackout3 event, it is now developed, maintained and featured as part of the R Epidemics Consortium (RECON). The package is available for download from the Comprehensive R Archive Network (CRAN) and GitHub.


GTC: a novel attempt to maintenance of huge genome collections compressed

10.1101/131649 ◽

2017 ◽

Author(s):

Agnieszka Danek ◽

Sebastian Deorowicz

Keyword(s):

Genetic Variation ◽

Data Structure ◽

Compression Ratio ◽

Link Type ◽

Variation Data ◽

Compressed Data

Abstract. Results: We present GTC, a novel compressed data structure for representation of huge collections of genetic variation data. GTC significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest publicly available database, of about 60 thousand haplotypes at about 40 million SNPs, can be stored in less than 4 Gbytes, while the queries related to variants are answered in a fraction of a second. Availability: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/


epicontacts: Handling, visualisation and analysis of epidemiological contacts

F1000Research ◽

10.12688/f1000research.14492.2 ◽

2018 ◽

Vol 7 ◽

pp. 566 ◽

Cited By ~ 4

Author(s):

VP Nagraj ◽

Nistara Randhawa ◽

Finlay Campbell ◽

Thomas Crellen ◽

Bertrand Sudre ◽

...

Keyword(s):

Data Structure ◽

Network Analysis ◽

R Package ◽

Contact Tracing ◽

Single Object ◽

Line List ◽

Analysis Techniques ◽

Unique Data ◽

Link Type ◽

Interactive Visualisation

Epidemiological outbreak data is often captured in line list and contact format to facilitate contact tracing for outbreak control. epicontacts is an R package that provides a unique data structure for combining these data into a single object in order to facilitate more efficient visualisation and analysis. The package incorporates interactive visualisation functionality as well as network analysis techniques. Originally developed as part of the Hackout3 event, it is now developed, maintained and featured as part of the R Epidemics Consortium (RECON). The package is available for download from the Comprehensive R Archive Network (CRAN) and GitHub.
