hts-nim: scripting high-performance genomic analyses

2018 ◽  
Vol 34 (19) ◽  
pp. 3387-3389 ◽  
Author(s):  
Brent S Pedersen ◽  
Aaron R Quinlan

Abstract
Motivation: Extracting biological insight from genomic data inevitably requires custom software. In many cases, this is accomplished with scripting languages, owing to their accessibility and brevity. Unfortunately, the ease of scripting languages typically comes at a substantial performance cost that is especially acute with the scale of modern genomics datasets.
Results: We present hts-nim, a high-performance library written in the Nim programming language that provides a simple, scripting-like syntax without sacrificing performance.
Availability and implementation: hts-nim is available at https://github.com/brentp/hts-nim and the example tools are at https://github.com/brentp/hts-nim-tools, both under the MIT license.
Supplementary information: Supplementary data are available at Bioinformatics online.
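For a sense of the short, custom analyses the abstract describes, here is a comparable snippet sketched in Python with pysam (a Python wrapper around htslib, the same C library that hts-nim wraps) rather than in Nim; the file name, region and quality cutoff are placeholder assumptions, not from the paper:

    # Illustration only: count well-mapped reads in a region of an indexed BAM.
    import pysam

    bam = pysam.AlignmentFile("example.bam", "rb")   # requires example.bam.bai
    count = sum(
        1
        for read in bam.fetch("chr1", 1_000_000, 2_000_000)
        if read.mapping_quality >= 30
    )
    print(f"reads with MAPQ >= 30: {count}")
    bam.close()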


2021 ◽  
Author(s):  
Anob M Chakrabarti ◽  
Charlotte Capitanchik ◽  
Jernej Ule ◽  
Nicholas Luscombe

CLIP technologies are now widely used to study RNA–protein interactions, and many datasets are publicly available. An important first step in CLIP data exploration is the visual inspection and assessment of processed genomic data on selected genes or regions, and the comparison of signal either across conditions within a particular project or against publicly available data. However, the output files produced by data processing pipelines, or the preprocessed files available from data repositories, are often not suitable for direct comparison and usually need further processing. Furthermore, to derive biological insight it is usually necessary to visualise CLIP signal alongside other data, such as annotations or orthogonal functional genomic data (e.g. RNA-seq). We have developed a simple but powerful command-line tool, clipplotr, which facilitates these visual comparative and integrative analyses, with normalisation and smoothing options for CLIP data and the ability to show the results alongside reference annotation tracks and functional genomic data. Input data can be supplied to clipplotr in a range of file formats, and it outputs a publication-quality figure. It is written in R and can run independently on a laptop or be integrated into computational workflows on a high-performance cluster. Releases, source code and documentation are freely available at: https://github.com/ulelab/clipplotr.
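As a conceptual Python sketch of the two track-processing steps named above, normalisation and smoothing (clipplotr itself is written in R; the function below and its parameters are hypothetical illustrations, not clipplotr's API):

    # Hypothetical illustration, not clipplotr's implementation: scale a
    # per-nucleotide crosslink track to counts-per-million, then smooth it
    # with a rolling mean so tracks from different libraries are comparable.
    import numpy as np

    def normalise_and_smooth(coverage, library_size, window=50):
        cpm = coverage * 1e6 / library_size           # library-size normalisation
        kernel = np.ones(window) / window             # uniform smoothing kernel
        return np.convolve(cpm, kernel, mode="same")  # rolling mean, same length

    # Toy track: a small crosslink cluster in a 10-million-read library.
    track = np.zeros(500)
    track[240:260] = 5
    smoothed = normalise_and_smooth(track, library_size=10_000_000)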


2018 ◽  
Author(s):  
Jérémie Decouchant ◽  
Maria Fernandes ◽  
Marcus Völp ◽  
Francisco M Couto ◽  
Paulo Esteves-Veríssimo

Abstract Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks if not protected to the highest standards. In this article, we argue that post-alignment privacy is not enough and that data should be automatically protected as early as possible in the genomics workflow, ideally immediately after the data are produced. We show that a previous approach for filtering short reads cannot extend to long reads, and we present a novel filtering approach that classifies raw genomic data (i.e., whose location and content are not yet determined) into privacy-sensitive (i.e., more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows the fine-grained and automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be applied to reads of any length, making it usable with any current or future sequencing technology. The filter is accurate, in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (fewer than 10 nucleotides remain undetected per genome, instead of 100,000 in previous works). It produces far fewer false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56%, at a 2% mutation rate). Finally, practical experiments demonstrate high performance, both in terms of throughput and memory consumption.
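To make the classification idea concrete, here is a minimal Python sketch of a length-agnostic filter: a read is flagged privacy-sensitive if any of its k-mers appears in a dictionary built over known sensitive sites. The dictionary contents and k are assumptions, and unlike the paper's filter, an exact lookup like this does not tolerate sequencing errors:

    # Toy sketch, not the paper's algorithm: exact k-mer lookup against a
    # dictionary of sequences spanning known sensitive nucleotides.
    K = 20  # assumed k-mer length

    def build_sensitive_kmers(sensitive_sequences, k=K):
        # Collect every k-mer of the sequences that span sensitive sites.
        kmers = set()
        for seq in sensitive_sequences:
            for i in range(len(seq) - k + 1):
                kmers.add(seq[i:i + k])
        return kmers

    def is_sensitive(read, kmers, k=K):
        # Works for a read of any length, short or long.
        return any(read[i:i + k] in kmers
                   for i in range(len(read) - k + 1))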


2019 ◽  
Vol 35 (23) ◽  
pp. 4907-4911 ◽  
Author(s):  
Jianglin Feng ◽  
Aakrosh Ratan ◽  
Nathan C Sheffield

Abstract
Motivation: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary.
Results: We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R, and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory usage of AIList is 4–60% of that of other methods. The AIList data structure therefore provides a significantly improved fundamental operation for highly scalable genomic data analysis.
Availability and implementation: An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org.
Supplementary information: Supplementary data are available at Bioinformatics online.
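A minimal Python sketch of the construction and query just described, using a single sublist (the full AIList additionally decomposes the sorted list into a few approximately flattened sublists, which is what keeps the extra-comparison count m small):

    # Simplified single-sublist AIList: sort by start, keep a running maximum
    # of interval ends, binary-search the query, then scan backwards until the
    # running maximum proves no earlier interval can still overlap.
    import bisect

    class AugmentedIntervalList:
        def __init__(self, intervals):        # intervals: (start, end), half-open
            self.ivs = sorted(intervals)
            self.starts = [s for s, _ in self.ivs]
            self.max_end, running = [], float("-inf")
            for _, e in self.ivs:
                running = max(running, e)
                self.max_end.append(running)

        def overlaps(self, qstart, qend):
            hits = []
            hi = bisect.bisect_left(self.starts, qend)  # starts >= qend can't overlap
            for i in range(hi - 1, -1, -1):
                if self.max_end[i] <= qstart:           # nothing earlier reaches qstart
                    break
                if self.ivs[i][1] > qstart:
                    hits.append(self.ivs[i])
            return hits

    ail = AugmentedIntervalList([(1, 10), (5, 8), (20, 30)])
    print(ail.overlaps(6, 25))    # [(20, 30), (5, 8), (1, 10)]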


Author(s):  
Connor D Harris ◽  
Ellis L Torrance ◽  
Kasie Raymann ◽  
Louis-Marie Bobay

Abstract The core genome represents the set of genes shared by all, or nearly all, strains of a given population or species of prokaryotes. Inferring the core genome is integral to many genomic analyses; however, most methods rely on comparing all pairs of genomes, a step that is becoming increasingly difficult given the massive accumulation of genomic data. Here, we present CoreCruncher, a program that robustly and rapidly constructs core genomes across hundreds or thousands of genomes. CoreCruncher does not compute all pairwise genome comparisons; instead, it uses a heuristic based on the distributions of identity scores to classify sequences as orthologs or paralogs/xenologs. Although it is much faster than current methods, our results indicate that it is also more conservative than other tools and less sensitive to the presence of paralogs and xenologs. CoreCruncher is freely available from https://github.com/lbobay/CoreCruncher. It is written in Python 3.7 and also runs on Python 2.7 without modification. It requires the Python library NumPy and either USEARCH or BLAST. Certain options require the programs MUSCLE or MAFFT.
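As a toy Python illustration of a distribution-based heuristic of this kind (not CoreCruncher's actual algorithm; the outlier rule and threshold are assumptions):

    # Hypothetical sketch: within one putative gene family, treat sequences
    # whose identity to the reference falls far below the bulk of the
    # distribution as likely paralogs/xenologs rather than orthologs.
    from statistics import mean, stdev

    def split_family(identities, n_sigma=3.0):
        # identities: {genome_id: percent identity to the reference sequence}
        values = list(identities.values())            # assumes >= 2 genomes
        cutoff = mean(values) - n_sigma * stdev(values)
        orthologs = {g for g, pid in identities.items() if pid >= cutoff}
        return orthologs, set(identities) - orthologs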


2015 ◽  
Vol 2015 ◽  
pp. 1-13
Author(s):  
Stergios Papadimitriou ◽  
Kirsten Schwark ◽  
Seferina Mavroudi ◽  
Kostas Theofilatos ◽  
Spiridon Likothanasis

ScalaLab and GroovyLab are both MATLAB-like environments for the Java Virtual Machine. ScalaLab is based on the Scala programming language and GroovyLab on the Groovy programming language. They present similar user interfaces and functionality to the user, and they share the same set of Java scientific libraries and native-code libraries. From the programmer's point of view, though, they differ significantly. This paper compares some aspects of the two environments and highlights some of the strengths and weaknesses of Scala versus Groovy for scientific computing. The discussion also examines the dilemma of dynamic versus static typing for scientific programming. The performance of the Java platform continues to improve at a fast pace; today, Java can effectively support demanding high-performance computing and scales well on multicore platforms. Thus, both systems can challenge the performance of traditional C/C++/Fortran scientific code with an easier-to-use and more productive programming environment.


2002 ◽  
Vol 12 (02) ◽  
pp. 193-210 ◽  
Author(s):  
CHRISTOPH A. HERRMANN ◽  
CHRISTIAN LENGAUER

Metaprogramming is a paradigm for enhancing a general-purpose programming language with features catering for a special-purpose application domain, without the need to reimplement the language. In a staged compilation, the special-purpose features are translated and optimised by a domain-specific preprocessor, which hands over to the general-purpose compiler for translation of the domain-independent part of the program. The domain we work in is high-performance parallel computing. We use metaprogramming to enhance the functional language Haskell with features for the efficient, parallel implementation of certain computational patterns, called skeletons.
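For readers unfamiliar with the term, a skeleton packages a recurring parallel pattern behind a simple interface; the paper implements such patterns via metaprogramming in Haskell, so the sketch below is only a plain Python analogue of the simplest skeleton, a parallel map:

    # Python analogue of the 'map' skeleton: the caller supplies only a
    # sequential function; the pattern hides work distribution and collection.
    from multiprocessing import Pool

    def parmap(f, xs, workers=4):
        with Pool(workers) as pool:
            return pool.map(f, xs)

    def square(x):
        return x * x

    if __name__ == "__main__":
        print(parmap(square, range(10)))  # [0, 1, 4, 9, ..., 81]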

