Augmented Interval List: a novel data structure for efficient genomic interval search

Jianglin Feng; Aakrosh Ratan; Nathan C Sheffield

doi:10.1093/bioinformatics/btz407

Augmented Interval List: a novel data structure for efficient genomic interval search

Bioinformatics ◽

10.1093/bioinformatics/btz407 ◽

2019 ◽

Vol 35 (23) ◽

pp. 4907-4911 ◽

Cited By ~ 8

Author(s):

Jianglin Feng ◽

Aakrosh Ratan ◽

Nathan C Sheffield

Keyword(s):

Data Structure ◽

High Performance ◽

Genomic Analysis ◽

Genomic Data ◽

Interval Data ◽

Supplementary Information ◽

Genomic Interval ◽

Interval Trees ◽

Running Maximum ◽

Scalable Methods

Abstract Motivation Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary. Results We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log2N+n+m), where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval-trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory-usage for AIList is 4–60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis. Availability and implementation An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Augmented Interval List: a novel data structure for efficient genomic interval search

10.1101/593657 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jianglin Feng ◽

Aakrosh Ratan ◽

Nathan C. Sheffield

Keyword(s):

Data Structure ◽

High Performance ◽

Genomic Analysis ◽

Genomic Data ◽

Interval Data ◽

Genomic Interval ◽

Interval Trees ◽

Running Maximum ◽

Genomic Data Analysis ◽

Scalable Methods

AbstractMotivationGenomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary.ResultsWe present a new data structure, the augmented interval list (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log2N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R, and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5 - 18 times faster than standard high-performance code based on augmented interval-trees (AITree), nested containment lists (NCList), or R-trees (BEDTools). For large datasets, the memory-usage for AIList is 4% - 60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis.AvailabilityAn implementation of the AIList data structure with both construction and search algorithms is available at code.databio.org/AIList.

Download Full-text

plyranges: A grammar of genomic data transformation

10.1101/327841 ◽

2018 ◽

Author(s):

Stuart Lee ◽

Dianne Cook ◽

Michael Lawrence

Keyword(s):

Data Structure ◽

High Throughput ◽

Genomic Data ◽

Data Transformation ◽

Interval Data ◽

Bioconductor Package ◽

Coherent Interface ◽

Bioconductor Project ◽

Genomic Interval ◽

Integrative Level

The Bioconductor project provides many interoperable data abstractions for analyzing high-throughput genomics experiments; however implementing a typical genomic workflow with Bioconductor requires learning these abstractions and understanding them at an integrative level. This places a large cognitive burden on the user, especially for non-programmers. To reduce this burden we have created a grammar of genomic data transformation that operates on a single, central Bioconductor data structure, GRanges, which naturally represents genomic intervals and their associated measurements. The grammar defines verbs for performing actions on and between genomic interval data through a simplified, coherent interface to existing Bioconductor infrastructure, resulting in fluent analysis workflows. We have implemented this grammar as an R/Bioconductor package called plyranges.

Download Full-text

IGD: high-performance search for large-scale genomic interval datasets

Bioinformatics ◽

10.1093/bioinformatics/btaa1062 ◽

2020 ◽

Author(s):

Jianglin Feng ◽

Nathan C Sheffield

Keyword(s):

High Performance ◽

Large Scale ◽

Interval Data ◽

Scale Analysis ◽

Genome Database ◽

Genomic Interval ◽

Critical Resource ◽

Genomic Regions ◽

Genome Projects

Abstract Summary Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions. Availability https://github.com/databio/IGD

Download Full-text

IGD: high-performance search for large-scale genomic interval datasets

10.1101/2020.06.08.139758 ◽

2020 ◽

Cited By ~ 1

Author(s):

Jianglin Feng ◽

Nathan C. Sheffield

Keyword(s):

High Performance ◽

Large Scale ◽

Interval Data ◽

Scale Analysis ◽

Genome Database ◽

Link Type ◽

Genomic Interval ◽

Critical Resource ◽

Genomic Regions ◽

Genome Projects

SummaryDatabases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions.Availabilityhttps://github.com/databio/IGD

Download Full-text

Genozip - A Universal Extensible Genomic Data Compressor

Bioinformatics ◽

10.1093/bioinformatics/btab102 ◽

2021 ◽

Author(s):

Divon Lan ◽

Ray Tobler ◽

Yassine Souilmi ◽

Bastien Llamas

Keyword(s):

High Performance ◽

Genomic Data ◽

General Purpose ◽

Supplementary Information ◽

Data Types ◽

Data Formats ◽

Development Framework ◽

File Formats ◽

Genomics Research ◽

Core Capabilities

Abstract We present Genozip, a universal and fully featured compression software for genomic data. Genozip is designed to be a general-purpose software and a development framework for genomic compression by providing five core capabilities – universality (support for all common genomic file formats), high compression ratios, speed, feature-richness, and extensibility. Genozip delivers high-performance compression for widely-used genomic data formats in genomics research, namely FASTQ, SAM/BAM/CRAM, VCF, GVF, FASTA, PHYLIP, and 23andMe formats. Our test results show that Genozip is fast and achieves greatly improved compression ratios, even when the files are already compressed. Further, Genozip is architected with a separation of the Genozip Framework from file-format-specific Segmenters and data-type-specific Codecs. With this, we intend for Genozip to be a general-purpose compression platform where researchers can implement compression for additional file formats, as well as new codecs for data types or fields within files, in the future. We anticipate that this will ultimately increase the visibility and adoption of these algorithms by the user community, thereby accelerating further innovation in this space. Availability: Genozip is written in C. The code is open-source and available on GitHub (https://github.com/divonlan/genozip). The package is free for non-commercial use. It is distributed as a Docker container on DockerHub and through the conda package manager. Genozip is tested on Linux, Mac, and Windows. Supplementary information: Supplementary data are available at Bioinformatics online.

Download Full-text

hts-nim: scripting high-performance genomic analyses

10.1101/261735 ◽

2018 ◽

Author(s):

Brent S. Pedersen ◽

Aaron R. Quinlan

Keyword(s):

High Performance ◽

Genomic Data ◽

Supplementary Information ◽

Supplementary Data ◽

Scripting Languages ◽

Link Type ◽

Custom Software ◽

Genomic Analyses ◽

Biological Insight ◽

Supplementary Material

AbstractMotivationExtracting biological insight from genomic data inevitably requires custom software. In many cases, this is accomplished with scripting languages, owing to their accessibility and brevity. Unfortunately, the ease of scripting languages typically comes at a substantial performance cost that is especially acute with the scale of modern genomics datasets.ResultsWe present hts-nim, a high-performance library written in the Nim programming language that provides a simple, scripting-like syntax without sacrificing performance.Availabilityhts-nim is available at https://github.com/brentp/hts-nim and the example tools are at https://github.com/brentp/hts-nim-tools both under the MIT [email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

GWASpro: a high-performance genome-wide association analysis server

Bioinformatics ◽

10.1093/bioinformatics/bty989 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2512-2514 ◽

Cited By ~ 4

Author(s):

Bongsong Kim ◽

Xinbin Dai ◽

Wenchao Zhang ◽

Zhaohong Zhuang ◽

Darlene L Sanchez ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Linear Mixed Model ◽

Association Studies ◽

Learning Curves ◽

Experimental Designs ◽

Genome Wide Association ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Genome Wide

Abstract Summary We present GWASpro, a high-performance web server for the analyses of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data, coupled with complex replicated experimental designs such as found in plant science investigations and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, by which complex experimental designs that may include replications, treatments, locations and times, can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data that may consist of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators. Availability and implementation GWASpro is freely available at https://bioinfo.noble.org/GWASPRO. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

TFEA.ChIP: a tool kit for transcription factor binding site enrichment analysis capitalizing on ChIP-seq datasets

Bioinformatics ◽

10.1093/bioinformatics/btz573 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5339-5340 ◽

Cited By ~ 8

Author(s):

Laura Puente-Santamaria ◽

Wyeth W Wasserman ◽

Luis del Peso

Keyword(s):

Genomic Analysis ◽

Enrichment Analysis ◽

R Package ◽

Supplementary Information ◽

Web Based ◽

Factor Binding Site ◽

Gene Sets ◽

Transcription Regulators ◽

Computational Identification ◽

On Chip

Abstract Summary The computational identification of the transcription factors (TFs) [more generally, transcription regulators, (TR)] responsible for the co-regulation of a specific set of genes is a common problem found in genomic analysis. Herein, we describe TFEA.ChIP, a tool that makes use of ChIP-seq datasets to estimate and visualize TR enrichment in gene lists representing transcriptional profiles. We validated TFEA.ChIP using a wide variety of gene sets representing signatures of genetic and chemical perturbations as input and found that the relevant TR was correctly identified in 126 of a total of 174 analyzed. Comparison with other TR enrichment tools demonstrates that TFEA.ChIP is an highly customizable package with an outstanding performance. Availability and implementation TFEA.ChIP is implemented as an R package available at Bioconductor https://www.bioconductor.org/packages/devel/bioc/html/TFEA.ChIP.html and github https://github.com/LauraPS1/TFEA.ChIP_downloads. A web-based GUI to the package is also available at https://www.iib.uam.es/TFEA.ChIP/ Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

TAIJI: approaching experimental replicates-level accuracy for drug synergy prediction

Bioinformatics ◽

10.1093/bioinformatics/bty955 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2338-2339 ◽

Cited By ~ 5

Author(s):

Hongyang Li ◽

Shuai Hu ◽

Nouri Neamati ◽

Yuanfang Guan

Keyword(s):

Drug Combination ◽

High Performance ◽

Synergistic Effects ◽

Field Performance ◽

Supplementary Information ◽

Drug Synergism ◽

High Throughput Drug Screening ◽

Current State ◽

Candidate Drug ◽

High Prediction

Abstract Motivation Combination therapy is widely used in cancer treatment to overcome drug resistance. High-throughput drug screening is the standard approach to study the drug combination effects, yet it becomes impractical when the number of drugs under consideration is large. Therefore, accurate and fast computational tools for predicting drug synergistic effects are needed to guide experimental design for developing candidate drug pairs. Results Here, we present TAIJI, a high-performance software for fast and accurate prediction of drug synergism. It is based on the winning algorithm in the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge, which is a unique platform to unbiasedly evaluate the performance of current state-of-the-art methods, and includes 160 team-based submission methods. When tested across a broad spectrum of 85 different cancer cell lines and 1089 drug combinations, TAIJI achieved a high prediction correlation (0.53), approaching the accuracy level of experimental replicates (0.56). The runtime is at the scale of minutes to achieve this state-of-the-field performance. Availability and implementation TAIJI is freely available on GitHub (https://github.com/GuanLab/TAIJI). It is functional with built-in Perl and Python. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

PINSPlus: a tool for tumor subtype discovery in integrated genomic data

Bioinformatics ◽

10.1093/bioinformatics/bty1049 ◽

2018 ◽

Vol 35 (16) ◽

pp. 2843-2846 ◽

Cited By ~ 15

Author(s):

Hung Nguyen ◽

Sangam Shrestha ◽

Sorin Draghici ◽

Tin Nguyen

Keyword(s):

Personal Computer ◽

Genomic Data ◽

Supplementary Information ◽

Omics Data ◽

Tumor Subtype ◽

Supplementary Data ◽

Significant Survival ◽

Survival Differences

Abstract Summary Since cancer is a heterogeneous disease, tumor subtyping is crucial for improved treatment and prognosis. We have developed a subtype discovery tool, called PINSPlus, that is: (i) robust against noise and unstable quantitative assays, (ii) able to integrate multiple types of omics data in a single analysis and (iii) dramatically superior to established approaches in identifying known subtypes and novel subgroups with significant survival differences. Our validation on 12,158 samples from 44 datasets shows that PINSPlus vastly outperforms other approaches. The software is easy-to-use and can partition hundreds of patients in a few minutes on a personal computer. Availability and implementation The package is available at https://cran.r-project.org/package=PINSPlus. Data and R script used in this manuscript are available at https://bioinformatics.cse.unr.edu/software/PINSPlus/. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text