Augmented Interval List: a novel data structure for efficient genomic interval search

2019 ◽  
Vol 35 (23) ◽  
pp. 4907-4911 ◽  
Author(s):  
Jianglin Feng ◽  
Aakrosh Ratan ◽  
Nathan C Sheffield

Abstract Motivation: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary. Results: We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂ N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory usage for AIList is 4–60% of that of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis. Availability and implementation: An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org. Supplementary information: Supplementary data are available at Bioinformatics online.
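The construction and query steps described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the parameters `lookahead`, `max_covered`, `max_components` and `min_len` are assumptions introduced here, as is the rule used to decide which intervals are moved to the next sublist (an interval is extracted when it contains the ends of at least `max_covered` of the next `lookahead` intervals). Intervals are treated as half-open `(start, end)` pairs.

```python
from bisect import bisect_left
from itertools import accumulate


def build_ailist(intervals, lookahead=20, max_covered=10,
                 max_components=10, min_len=64):
    """Build a list of components, each a tuple (starts, ends, max_ends).

    All parameter names and defaults are illustrative assumptions.
    """
    remaining = sorted(intervals)  # step 1: sort by start coordinate
    components = []
    while remaining:
        is_last = (len(components) == max_components - 1
                   or len(remaining) <= min_len)
        if is_last:
            keep, extracted = remaining, []
        else:
            # step 2: move long intervals that cover many of their
            # successors into the next sublist, flattening this one
            keep, extracted = [], []
            for i, (s, e) in enumerate(remaining):
                window = remaining[i + 1:i + 1 + lookahead]
                covered = sum(1 for _, e2 in window if e2 <= e)
                (extracted if covered >= max_covered else keep).append((s, e))
        starts = [s for s, _ in keep]
        ends = [e for _, e in keep]
        # step 3: augment the sublist with the running maximum end
        max_ends = list(accumulate(ends, max))
        components.append((starts, ends, max_ends))
        remaining = extracted
    return components


def query_ailist(components, q_start, q_end):
    """Return all stored intervals overlapping [q_start, q_end)."""
    hits = []
    for starts, ends, max_ends in components:
        # binary search: rightmost interval whose start is below q_end
        i = bisect_left(starts, q_end) - 1
        # scan left; stop once no earlier interval can reach q_start
        while i >= 0 and max_ends[i] > q_start:
            if ends[i] > q_start:
                hits.append((starts[i], ends[i]))
            i -= 1
    return hits


example = [(1, 100), (2, 5), (3, 6), (4, 7), (10, 12), (11, 50), (20, 25), (30, 31)]
print(sorted(query_ailist(build_ailist(example), 4, 6)))
# [(1, 100), (2, 5), (3, 6), (4, 7)]
```

The binary search accounts for the log₂ N term in the stated query time, and the backward scan accounts for the n reported overlaps plus the m extra comparisons. Flattening the sublists is what keeps m small, because the running maximum end then stops the scan soon after the last true overlap.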

2019 ◽  
Author(s):  
Jianglin Feng ◽  
Aakrosh Ratan ◽  
Nathan C. Sheffield

Abstract Motivation: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary. Results: We present a new data structure, the augmented interval list (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂ N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R, and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval trees (AITree), nested containment lists (NCList), or R-trees (BEDTools). For large datasets, the memory usage for AIList is 4–60% of that of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis. Availability: An implementation of the AIList data structure with both construction and search algorithms is available at code.databio.org/AIList.


2018 ◽  
Author(s):  
Stuart Lee ◽  
Dianne Cook ◽  
Michael Lawrence

The Bioconductor project provides many interoperable data abstractions for analyzing high-throughput genomics experiments; however, implementing a typical genomic workflow with Bioconductor requires learning these abstractions and understanding how they fit together. This places a large cognitive burden on the user, especially for non-programmers. To reduce this burden, we have created a grammar of genomic data transformation that operates on a single, central Bioconductor data structure, GRanges, which naturally represents genomic intervals and their associated measurements. The grammar defines verbs for performing actions on and between genomic interval data through a simplified, coherent interface to existing Bioconductor infrastructure, resulting in fluent analysis workflows. We have implemented this grammar as an R/Bioconductor package called plyranges.


Author(s):  
Jianglin Feng ◽  
Nathan C Sheffield

Abstract Summary: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions. Availability: https://github.com/databio/IGD
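The abstract does not spell out how the linear binning works, so the following is only a generic sketch of fixed-width binning for interval search, not IGD's actual method or file format. The function names and the bin size of 16384 bp are assumptions chosen for illustration. Each interval is stored in every bin it touches, and a query inspects only the bins it overlaps.

```python
from collections import defaultdict

BIN_SIZE = 16384  # assumed bin width in base pairs, for illustration only


def build_bins(intervals, bin_size=BIN_SIZE):
    """Index half-open (chrom, start, end) intervals by fixed-width bin."""
    bins = defaultdict(list)
    for chrom, start, end in intervals:
        # register the interval in every bin it overlaps
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins[(chrom, b)].append((start, end))
    return bins


def query_bins(bins, chrom, q_start, q_end, bin_size=BIN_SIZE):
    """Return the distinct intervals on chrom overlapping [q_start, q_end)."""
    hits = set()  # an interval spanning several bins is reported once
    for b in range(q_start // bin_size, (q_end - 1) // bin_size + 1):
        for start, end in bins.get((chrom, b), ()):
            if start < q_end and end > q_start:
                hits.add((start, end))
    return sorted(hits)
```

Because the bin index is computed by integer division, locating candidate intervals needs no tree traversal. The cost is that long intervals are stored more than once, and that this simplified version collapses identical intervals into a single hit.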


Author(s):  
Jianglin Feng ◽  
Nathan C. Sheffield

Summary: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions. Availability: https://github.com/databio/IGD


Author(s):  
Divon Lan ◽  
Ray Tobler ◽  
Yassine Souilmi ◽  
Bastien Llamas

Abstract We present Genozip, a universal and fully featured compression software for genomic data. Genozip is designed to be a general-purpose software and a development framework for genomic compression by providing five core capabilities – universality (support for all common genomic file formats), high compression ratios, speed, feature-richness, and extensibility. Genozip delivers high-performance compression for widely-used genomic data formats in genomics research, namely FASTQ, SAM/BAM/CRAM, VCF, GVF, FASTA, PHYLIP, and 23andMe formats. Our test results show that Genozip is fast and achieves greatly improved compression ratios, even when the files are already compressed. Further, Genozip is architected with a separation of the Genozip Framework from file-format-specific Segmenters and data-type-specific Codecs. With this, we intend for Genozip to be a general-purpose compression platform where researchers can implement compression for additional file formats, as well as new codecs for data types or fields within files, in the future. We anticipate that this will ultimately increase the visibility and adoption of these algorithms by the user community, thereby accelerating further innovation in this space. Availability: Genozip is written in C. The code is open-source and available on GitHub (https://github.com/divonlan/genozip). The package is free for non-commercial use. It is distributed as a Docker container on DockerHub and through the conda package manager. Genozip is tested on Linux, Mac, and Windows. Supplementary information: Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Brent S. Pedersen ◽  
Aaron R. Quinlan

Abstract Motivation: Extracting biological insight from genomic data inevitably requires custom software. In many cases, this is accomplished with scripting languages, owing to their accessibility and brevity. Unfortunately, the ease of scripting languages typically comes at a substantial performance cost that is especially acute with the scale of modern genomics datasets. Results: We present hts-nim, a high-performance library written in the Nim programming language that provides a simple, scripting-like syntax without sacrificing performance. Availability: hts-nim is available at https://github.com/brentp/hts-nim and the example tools are at https://github.com/brentp/hts-nim-tools, both under the MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 35 (14) ◽  
pp. 2512-2514 ◽  
Author(s):  
Bongsong Kim ◽  
Xinbin Dai ◽  
Wenchao Zhang ◽  
Zhaohong Zhuang ◽  
Darlene L Sanchez ◽  
...  

Abstract Summary: We present GWASpro, a high-performance web server for the analysis of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data coupled with complex replicated experimental designs, such as those found in plant science investigations, and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, by which complex experimental designs that may include replications, treatments, locations and times can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data that may consist of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators. Availability and implementation: GWASpro is freely available at https://bioinfo.noble.org/GWASPRO. Supplementary information: Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (24) ◽  
pp. 5339-5340 ◽  
Author(s):  
Laura Puente-Santamaria ◽  
Wyeth W Wasserman ◽  
Luis del Peso

Abstract Summary: The computational identification of the transcription factors (TFs) [more generally, transcription regulators (TRs)] responsible for the co-regulation of a specific set of genes is a common problem in genomic analysis. Herein, we describe TFEA.ChIP, a tool that makes use of ChIP-seq datasets to estimate and visualize TR enrichment in gene lists representing transcriptional profiles. We validated TFEA.ChIP using a wide variety of gene sets representing signatures of genetic and chemical perturbations as input and found that the relevant TR was correctly identified in 126 of the 174 gene sets analyzed. Comparison with other TR enrichment tools demonstrates that TFEA.ChIP is a highly customizable package with outstanding performance. Availability and implementation: TFEA.ChIP is implemented as an R package available at Bioconductor https://www.bioconductor.org/packages/devel/bioc/html/TFEA.ChIP.html and GitHub https://github.com/LauraPS1/TFEA.ChIP_downloads. A web-based GUI to the package is also available at https://www.iib.uam.es/TFEA.ChIP/. Supplementary information: Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 35 (13) ◽  
pp. 2338-2339 ◽  
Author(s):  
Hongyang Li ◽  
Shuai Hu ◽  
Nouri Neamati ◽  
Yuanfang Guan

Abstract Motivation: Combination therapy is widely used in cancer treatment to overcome drug resistance. High-throughput drug screening is the standard approach to study the drug combination effects, yet it becomes impractical when the number of drugs under consideration is large. Therefore, accurate and fast computational tools for predicting drug synergistic effects are needed to guide experimental design for developing candidate drug pairs. Results: Here, we present TAIJI, a high-performance software for fast and accurate prediction of drug synergism. It is based on the winning algorithm in the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge, which is a unique platform to unbiasedly evaluate the performance of current state-of-the-art methods, and includes 160 team-based submission methods. When tested across a broad spectrum of 85 different cancer cell lines and 1089 drug combinations, TAIJI achieved a high prediction correlation (0.53), approaching the accuracy level of experimental replicates (0.56). The runtime is at the scale of minutes to achieve this state-of-the-field performance. Availability and implementation: TAIJI is freely available on GitHub (https://github.com/GuanLab/TAIJI). It is functional with built-in Perl and Python. Supplementary information: Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 35 (16) ◽  
pp. 2843-2846 ◽  
Author(s):  
Hung Nguyen ◽  
Sangam Shrestha ◽  
Sorin Draghici ◽  
Tin Nguyen

Abstract Summary: Since cancer is a heterogeneous disease, tumor subtyping is crucial for improved treatment and prognosis. We have developed a subtype discovery tool, called PINSPlus, that is: (i) robust against noise and unstable quantitative assays, (ii) able to integrate multiple types of omics data in a single analysis and (iii) dramatically superior to established approaches in identifying known subtypes and novel subgroups with significant survival differences. Our validation on 12,158 samples from 44 datasets shows that PINSPlus vastly outperforms other approaches. The software is easy to use and can partition hundreds of patients in a few minutes on a personal computer. Availability and implementation: The package is available at https://cran.r-project.org/package=PINSPlus. Data and R script used in this manuscript are available at https://bioinformatics.cse.unr.edu/software/PINSPlus/. Supplementary information: Supplementary data are available at Bioinformatics online.

