Bedtk: Finding Interval Overlap with Implicit Interval Tree

2020 ◽  
Author(s):  
Heng Li ◽  
Jiazhen Rong

Abstract Summary We present bedtk, a new toolkit for manipulating genomic intervals in the BED format. It supports sorting, merging, intersection, subtraction and the calculation of the breadth of coverage. Bedtk employs implicit interval tree, a new data structure for fast interval overlap queries. It is several to tens of times faster than existing tools and tends to use less memory. Availability https://github.com/lh3/[email protected], [email protected]

Author(s):  
Heng Li ◽  
Jiazhen Rong

Abstract Summary We present bedtk, a new toolkit for manipulating genomic intervals in the BED format. It supports sorting, merging, intersection, subtraction and the calculation of the breadth of coverage. Bedtk uses implicit interval tree, a data structure for fast interval overlap queries. It is several to tens of times faster than existing tools and tends to use less memory. Availability and implementation The source code is available at https://github.com/lh3/bedtk.
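
To make the idea of an interval overlap query concrete, here is a minimal Python sketch of an augmented interval tree stored implicitly in a sorted array. It is an illustration under stated assumptions, not bedtk's implementation: the function names `build` and `overlaps` are hypothetical, the tree is rooted at the midpoint of each subrange rather than using bedtk's actual array layout, and intervals are half-open `[start, end)` as in BED.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED


def build(intervals: List[Interval]) -> Tuple[List[Interval], List[int]]:
    """Sort intervals by start and record the maximum end of each implicit subtree."""
    ivs = sorted(intervals)
    max_end = [0] * len(ivs)

    def fill(lo: int, hi: int) -> int:
        # The subtree covering ivs[lo:hi] is rooted at its middle element,
        # so no child pointers are stored anywhere.
        if lo >= hi:
            return -1  # empty subtree; coordinates are assumed non-negative
        mid = (lo + hi) // 2
        best = max(ivs[mid][1], fill(lo, mid), fill(mid + 1, hi))
        max_end[mid] = best
        return best

    fill(0, len(ivs))
    return ivs, max_end


def overlaps(ivs: List[Interval], max_end: List[int],
             start: int, end: int) -> List[Interval]:
    """Return stored intervals overlapping the query [start, end), sorted by start."""
    out: List[Interval] = []

    def visit(lo: int, hi: int) -> None:
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        if max_end[mid] <= start:
            return  # nothing in this subtree ends after the query starts
        visit(lo, mid)
        s, e = ivs[mid]
        if s < end:
            if e > start:
                out.append(ivs[mid])
            # Starts in the right subtree are >= s, so they may still be < end.
            visit(mid + 1, hi)
        # Otherwise s >= end, and every start to the right is at least as large.

    visit(0, len(ivs))
    return out
```

With `ivs, max_end = build(...)` computed once, each call to `overlaps` prunes any subtree whose maximum end lies at or before the query start, which is what keeps queries fast when few intervals overlap.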


2017 ◽  
Author(s):  
Fatemeh Almodaresi ◽  
Hirak Sarkar ◽  
Rob Patro

Abstract We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing, carefully organizing our data structure, and making use of succinct representations where applicable, our data structure provides practically fast k-mer lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme built on the same underlying representation, which provides the ability to trade off k-mer query speed for a reduction in the de Bruijn graph index size. We believe this representation strikes a desirable balance between speed and space usage, and it will allow for fast search on large reference sequences. Pufferfish is developed in C++11, is open source (GPL v3), and is available at https://github.com/COMBINE-lab/Pufferfish. The scripts used to generate the results in this manuscript are available at https://github.com/COMBINE-lab/pufferfish_experiments.


2021 ◽  
Author(s):  
Kwangbom Choi ◽  
Matthew J. Vincent ◽  
Gary A. Churchill

Abstract Summary The abundance of a genomic feature such as gene expression is often estimated from the observed total number of alignment incidences in the targeted genome regions. We introduce a generic data structure and associated file format for alignment incidence data so that method developers can create novel pipelines comprising models, each optimal for read alignment, post-alignment QC, and quantification across multiple sequencing modalities. Availability and Implementation alntools software is freely available at https://github.com/churchill-lab/alntools under MIT [email protected] or [email protected]


1997 ◽  
Vol 07 (03) ◽  
pp. 165-175 ◽  
Author(s):  
Madhumangal Pal ◽  
G. P. Bhattacharjee

In this paper, a new data structure, the interval tree (IT), is introduced for an interval graph. Some important properties of the IT are studied from the algorithmic point of view. It has many advantages over the data structures commonly used to solve problems on interval graphs. Using the properties of the IT, the following problems are solved on interval graphs: (i) shortest distances between any two vertices, and (ii) the diameter of the graph.


2021 ◽  
Vol 12 (5) ◽  
Author(s):  
Josué Ttito ◽  
Renato Marroquín ◽  
Sergio Lifschitz ◽  
Lewis McGibbney ◽  
José Talavera

Key-value stores offer a straightforward yet powerful data model. Data is modeled as key-value pairs, where values can be arbitrary objects that are written and read using their associated key. In addition to their simple interface, such data stores also provide read operations such as full and range scans. However, due to the simplicity of this interface, optimizing data accesses becomes challenging. This work aims to enable the shared execution of concurrent range and point queries on key-value stores, thus reducing the overall data movement when executing a complete workload. To accomplish this, we analyze different possible data structures and propose our variation of a segment tree, the Updatable Interval Tree. Our data structure helps us co-plan and co-execute multiple range queries together and reduces redundant work. This results in more efficient workload execution and increased overall throughput, as we show in our evaluation.
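
As a rough illustration of what shared execution of range queries can mean, the Python sketch below groups overlapping key ranges, scans each merged range once, and routes every row to the queries that asked for it. It is only a sketch under stated assumptions and not the Updatable Interval Tree from the abstract: the store is modeled as an in-memory list of `(key, value)` pairs sorted by key, queries are half-open `(lo, hi)` ranges with `lo < hi`, and the name `shared_range_scan` is hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Row = Tuple[int, str]
Query = Tuple[int, int]  # half-open key range [lo, hi)


def shared_range_scan(store: List[Row],
                      queries: List[Query]) -> Tuple[Dict[int, List[Row]], int]:
    """Answer all range queries while reading each needed row only once.

    Returns (results keyed by query index, number of rows read).
    """
    keys = [k for k, _ in store]
    results: Dict[int, List[Row]] = {i: [] for i in range(len(queries))}

    # Group queries whose ranges overlap into one merged scan range.
    order = sorted(range(len(queries)), key=lambda i: queries[i][0])
    groups: List[list] = []  # each entry: [lo, hi, member query indices]
    for i in order:
        lo, hi = queries[i]
        if groups and lo < groups[-1][1]:
            groups[-1][1] = max(groups[-1][1], hi)
            groups[-1][2].append(i)
        else:
            groups.append([lo, hi, [i]])

    rows_read = 0
    for lo, hi, members in groups:
        pos = bisect_left(keys, lo)  # seek to the first key >= lo
        while pos < len(keys) and keys[pos] < hi:
            rows_read += 1
            for i in members:  # hand the row to every query that covers it
                q_lo, q_hi = queries[i]
                if q_lo <= keys[pos] < q_hi:
                    results[i].append(store[pos])
            pos += 1
    return results, rows_read
```

For the queries `(2, 6)` and `(4, 8)` over integer keys 0 to 9, the merged scan reads keys 2 through 7 once each, whereas running the two scans independently would read keys 4 and 5 twice.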


F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 145
Author(s):  
Tamer Gur

Due to their nature, bioinformatics datasets are often closely related to each other. For this reason, search, mapping and visualization of these relations are often performed manually or programmatically via identifiers or special keywords such as gene symbols. Although various tools exist for these situations, the growing volume of bioinformatics datasets and the emergence of new software tools and approaches motivate new solutions. To provide a new tool for these cases, I present the Biobtree bioinformatics tool. Biobtree effectively fetches and indexes identifiers and special keywords with their related identifiers from supported datasets, optionally with user pre-defined datasets, and provides a web interface, web services and a single uniform database output based directly on a B+ tree data structure. Biobtree can handle billions of identifiers and runs via a single executable file with no installation or dependencies required. It also aims to provide a relatively small codebase for easy maintenance, addition of new features and extension to larger datasets. Biobtree is available to download from GitHub.


Author(s):  
LEONIDAS GUIBAS ◽  
JOHN HERSHBERGER ◽  
JACK SNOEYINK

In this paper, we investigate the problem of finding the common tangents of two convex polygons that intersect in two (unknown) points. First, we give a Θ(log² n) bound for algorithms that store the polygons in independent arrays. Second, we show how to beat the lower bound if the vertices of the convex polygons are drawn from a fixed set of n points. We introduce a data structure called a compact interval tree that supports common tangent computations, as well as the standard binary-search-based queries, in O(log n) time apiece. Third, we apply compact interval trees to solve the subpath hull query problem: given a simple path, preprocess it so that we can find the convex hull of a query subpath quickly. With O(n log n) preprocessing, we can assemble a compact interval tree that represents the convex hull of a query subpath in O(log n) time. In order to represent arrangements of lines implicitly, Edelsbrunner et al. used a less efficient structure, called bridge trees, to solve the subpath hull query problem. Our compact interval trees improve their results by a factor of O(log n). Thus, the present paper replaces the paper on bridge trees referred to by Edelsbrunner et al.


F1000Research ◽  
2018 ◽  
Vol 7 ◽  
pp. 566 ◽  
Author(s):  
VP Nagraj ◽  
Nistara Randhawa ◽  
Finlay Campbell ◽  
Thomas Crellen ◽  
Bertrand Sudre ◽  
...  

Epidemiological outbreak data is often captured in line list and contact format to facilitate contact tracing for outbreak control. epicontacts is an R package that provides a unique data structure for combining these data into a single object in order to facilitate more efficient visualisation and analysis. The package incorporates interactive visualisation functionality as well as network analysis techniques. Originally developed as part of the Hackout3 event, it is now developed, maintained and featured as part of the R Epidemics Consortium (RECON). The package is available for download from the Comprehensive R Archive Network (CRAN) and GitHub.


2017 ◽  
Author(s):  
Agnieszka Danek ◽  
Sebastian Deorowicz

Abstract Motivation Results We present GTC, a novel compressed data structure for representation of huge collections of genetic variation data. GTC significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest publicly available database, of about 60 thousand haplotypes at about 40 million SNPs, can be stored in less than 4 Gbytes, while the queries related to variants are answered in a fraction of a second. Availability GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/[email protected]

