A Python-based optimization framework for high-performance genomics

Exponentially growing next-generation sequencing data requires high-performance tools and algorithms. Nevertheless, the implementation of high-performance computational genomics software is inaccessible to many scientists because it requires extensive knowledge of low-level software optimization techniques, forcing scientists to resort to high-level software alternatives that are less efficient. Here, we introduce Seq, a Python-based optimization framework that combines the power and usability of high-level languages like Python with the performance of low-level languages like C or C++. Seq allows for shorter, simpler code, is readily usable by a novice programmer, and obtains significant performance improvements over existing languages and frameworks. We showcase and evaluate Seq by implementing seven standard, widely used applications from all stages of the genomics analysis pipeline, including genome index construction, finding maximal exact matches, long-read alignment and haplotype phasing, and demonstrate that its implementations are up to an order of magnitude faster than existing hand-optimized implementations, with just a fraction of the code. By enabling researchers of all backgrounds to easily implement high-performance analysis tools, Seq further opens the door to the democratization and scalability of computational genomics.
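As a rough illustration of what a task such as genome index construction involves, the following plain-Python sketch builds a k-mer position index. It is not Seq code and does not use Seq's API; the function name `build_kmer_index` and the toy sequence are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map every k-mer in `sequence` to the list of positions where it starts.

    Hypothetical helper for illustration only; not part of Seq.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    index = defaultdict(list)
    # Slide a window of width k across the sequence and record each start.
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return dict(index)


if __name__ == "__main__":
    idx = build_kmer_index("ACGTACGT", 4)
    print(idx["ACGT"])  # start positions of the k-mer "ACGT"
```

A dictionary of lists keeps the sketch short; a production index would likely use a more compact encoding of k-mers and positions.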

High-Level Parallel Ant Colony Optimization with Algorithmic Skeletons

International Journal of Parallel Programming ◽

10.1007/s10766-021-00714-1 ◽

2021 ◽

Author(s):

Breno A. de Melo Menezes ◽

Nina Herrmann ◽

Herbert Kuchen ◽

Fernando Buarque de Lima Neto

Keyword(s):

Ant Colony Optimization ◽

High Performance ◽

Optimization Problems ◽

Programming Model ◽

Parallel Implementation ◽

Ant Colony ◽

Algorithmic Skeletons ◽

Low Level ◽

Programming Patterns ◽

High Level

Parallel implementations of swarm intelligence algorithms such as ant colony optimization (ACO) have been widely used to shorten execution time when solving complex optimization problems. When targeting a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of Algorithmic Skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that are later converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain-specific language Musket can cope with the development of a parallel implementation of ACO and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high-performance code with execution times similar to those of low-level implementations.
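The three patterns named in the abstract (map, fold and zip) can be sketched in sequential Python. This is not Musket syntax and performs no parallelization; the function names and the pheromone-update example (the commonly used ACO form, new value = (1 - rho) * old value + deposit) are assumptions chosen for illustration.

```python
from functools import reduce


def evaporate(pheromone, rho):
    """map: apply the same evaporation step to every edge independently."""
    return list(map(lambda tau: (1.0 - rho) * tau, pheromone))


def deposit(pheromone, delta):
    """zip: combine two equally long sequences element by element."""
    return [tau + d for tau, d in zip(pheromone, delta)]


def tour_length(edge_lengths):
    """fold: reduce a sequence to a single value with a binary operator."""
    return reduce(lambda acc, length: acc + length, edge_lengths, 0.0)


if __name__ == "__main__":
    pheromone = evaporate([1.0, 2.0, 4.0], rho=0.5)
    pheromone = deposit(pheromone, [0.5, 0.0, 1.0])
    print(pheromone)
    print(tour_length([3.0, 4.0, 5.0]))
```

Because each pattern states what is computed per element rather than how the loop runs, a skeleton framework is free to execute the same operations in parallel.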

Implementation of Scientific Computing Applications on the Cell Broadband Engine

Scientific Programming ◽

10.1155/2009/589561 ◽

2009 ◽

Vol 17 (1-2) ◽

pp. 135-151 ◽

Cited By ~ 6

Author(s):

Guochun Shi ◽

Volodymyr V. Kindratenko ◽

Ivan S. Ufimtsev ◽

Todd J. Martinez ◽

James C. Phillips ◽

...

Keyword(s):

High Performance ◽

Scientific Computing ◽

Lessons Learned ◽

Optimization Techniques ◽

Cell Processor ◽

Intrinsic Properties ◽

Cell Broadband Engine ◽

Performance Improvements ◽

Cell Architecture ◽

Practical Recommendations

The Cell Broadband Engine architecture is a revolutionary processor architecture well suited for many scientific codes. This paper reports on an effort to implement several traditional high-performance scientific computing applications on the Cell Broadband Engine processor, including molecular dynamics, quantum chromodynamics and quantum chemistry codes. The paper discusses data and code restructuring strategies necessary to adapt the applications to the intrinsic properties of the Cell processor and demonstrates performance improvements achieved on the Cell architecture. It concludes with the lessons learned and provides practical recommendations on optimization techniques that are believed to be most appropriate.

PathoQC: Computationally Efficient Read Preprocessing and Quality Control for High-Throughput Sequencing Data Sets

Cancer Informatics ◽

10.4137/cin.s13890 ◽

2014 ◽

Vol 13s1 ◽

pp. CIN.S13890 ◽

Cited By ~ 1

Author(s):

Changjin Hong ◽

Solaiappan Manimaran ◽

William Evan Johnson

Keyword(s):

Quality Control ◽

High Throughput ◽

High Performance ◽

High Throughput Sequencing ◽

Next Generation Sequencing Data ◽

Data Sets ◽

Sequencing Data ◽

Computationally Efficient ◽

High Throughput Sequencing Data ◽

Downstream Analysis

Quality control and read preprocessing are critical steps in the analysis of data sets generated from high-throughput genomic screens. In the most extreme cases, improper preprocessing can negatively affect downstream analyses and may lead to incorrect biological conclusions. Here, we present PathoQC, a streamlined toolkit that seamlessly combines the benefits of several popular quality control software approaches for preprocessing next-generation sequencing data. PathoQC provides a variety of quality control options appropriate for most high-throughput sequencing applications. PathoQC is primarily developed as a module in the PathoScope software suite for metagenomic analysis. However, PathoQC is also available as an open-source Python module that can run as a stand-alone application or can be easily integrated into any bioinformatics workflow. PathoQC achieves high performance by supporting parallel computation and is an effective tool that removes technical sequencing artifacts and facilitates robust downstream analysis. The PathoQC software package is available at http://sourceforge.net/projects/PathoScope/.

Apache Nemo: A Framework for Optimizing Distributed Data Processing

ACM Transactions on Computer Systems ◽

10.1145/3468144 ◽

2020 ◽

Vol 38 (3-4) ◽

pp. 1-31

Author(s):

Won Wook Song ◽

Youngseok Yang ◽

Jeongyoon Eo ◽

Jangho Seo ◽

Joo Yeon Kim ◽

...

Keyword(s):

Data Processing ◽

High Performance ◽

Programming Model ◽

Compiler Optimization ◽

Ease Of Use ◽

Distributed Data ◽

Performance Improvements ◽

Distributed Data Processing ◽

Fine Control ◽

High Level

Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project.

EM3DANI: A Julia package for fully anisotropic 3D forward modeling of electromagnetic data

Geophysics ◽

10.1190/geo2020-0489.1 ◽

2021 ◽

pp. 1-45

Author(s):

Ronghua Peng ◽

Bo Han ◽

Yajun Liu ◽

Xiangyun Hu

Keyword(s):

High Performance ◽

Three Dimensional ◽

Anisotropic Media ◽

Forward Modeling ◽

Hydrocarbon Exploration ◽

Third Party ◽

Low Level ◽

Numerical Computing ◽

Computationally Intensive ◽

High Level

Forward modeling is vital for three-dimensional (3D) inversion and interpretation of electromagnetic (EM) data in anisotropic media, which is one of the major challenges in the field of EM geophysics. However, there are few freely available 3D codes that are capable of modeling EM responses in fully anisotropic media. Moreover, most of the existing 3D EM codes are written in low-level languages such as C and Fortran, making them difficult to read, maintain and extend. Taking advantage of recent progress in computer technology and numerical methods, we have developed an open-source package for forward modeling of frequency-domain EM fields in a fully 3D anisotropic earth (EM3DANI) using the Julia language, a relatively young, high-level programming language with a focus on high performance. Based on a mimetic finite-volume (MFV) discretization of the governing equations, the modeling algorithm is expressed in an abstract form in terms of matrices and vectors and thus can be easily implemented in any high-level language commonly used for numerical computing. Existing libraries written in low-level languages can be easily integrated into Julia code without the so-called two-language problem. We have therefore exploited several mature third-party packages to deal with the computationally intensive parts of the forward modeling, which guarantees high stability and efficiency. We have elaborated the structure of the package, paying special attention to code usability, readability and extendability, while striving to retain versatility and high performance. The effectiveness of the code is demonstrated through two 1D synthetic examples for magnetotellurics (MT) and controlled-source electromagnetics (CSEM) problems, respectively. High accuracy and efficiency can be achieved for both 1D examples. We further present a 3D example mimicking a marine CSEM survey scenario for hydrocarbon exploration. The simulation results indicate that the effect of the anisotropy on forward responses is significant, and can be comparable to that of the target reservoir.

Scientific Computing on the Itanium® Processor

Scientific Programming ◽

10.1155/2002/193478 ◽

2002 ◽

Vol 10 (4) ◽

pp. 329-337 ◽

Cited By ~ 2

Author(s):

Bruce Greer ◽

John Harrison ◽

Greg Henry ◽

Wei Li ◽

Peter Tang

Keyword(s):

Linear Algebra ◽

High Performance ◽

Scientific Computing ◽

Peak Performance ◽

Low Level ◽

Architectural Features ◽

Transcendental Functions ◽

High Performance Computer ◽

High Level ◽

Engineering Computing

The 64-bit Intel® Itanium® architecture is designed for high-performance scientific and enterprise computing, and the Itanium processor is its first silicon implementation. Features such as extensive arithmetic support, predication, speculation, and explicit parallelism can be used to provide a sound infrastructure for supercomputing. A large number of high-performance computer companies are offering Itanium®-based systems, some capable of peak performance exceeding 50 GFLOPS. In this paper we give an overview of the most relevant architectural features and provide illustrations of how these features are used in both low-level and high-level support for scientific and engineering computing, including transcendental functions and linear algebra kernels.

AutoMap is a high performance homozygosity mapping tool using next-generation sequencing data

Nature Communications ◽

10.1038/s41467-020-20584-4 ◽

2021 ◽

Vol 12 (1) ◽

Cited By ~ 2

Author(s):

Mathieu Quinodoz ◽

Virginie G. Peter ◽

Nicola Bedoni ◽

Béryl Royer Bertrand ◽

Katarina Cisarova ◽

...

Keyword(s):

Molecular Diagnosis ◽

High Performance ◽

Homozygosity Mapping ◽

Medical Genetics ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Dna Microsatellites ◽

Isolated Populations ◽

Mapping Tool ◽

Research Activities

Homozygosity mapping is a powerful method for identifying mutations in patients with recessive conditions, especially in consanguineous families or isolated populations. Historically, it has been used in conjunction with genotypes from highly polymorphic markers, such as DNA microsatellites or common SNPs. Traditional software performs rather poorly with data from Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS), which are now extensively used in medical genetics. We develop AutoMap, a tool that is both web-based and downloadable, to perform homozygosity mapping directly on VCF (Variant Call Format) calls from WES or WGS projects. Following a training step on WES data from 26 consanguineous families and a validation procedure on a matched cohort, our method shows higher overall performance when compared with eight existing tools. Most importantly, when tested on real cases with negative molecular diagnosis from an internal set, AutoMap detects three gene-disease and multiple variant-disease associations that were previously unrecognized, projecting clear benefits for both molecular diagnosis and research activities in medical genetics.
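The basic idea of homozygosity mapping on genotype calls can be sketched as a scan for runs of consecutive homozygous variants. The sketch below is not AutoMap's algorithm; the function name `homozygous_runs`, the `min_variants` threshold, and the rule that a heterozygous or missing call ends a run are all simplifying assumptions for illustration.

```python
def homozygous_runs(calls, min_variants=3):
    """Return (start, end, n_variants) for runs of homozygous calls.

    `calls` is a list of (position, genotype) pairs for one chromosome,
    sorted by position, with VCF-style genotypes such as "0/1" or "1|1".
    Hypothetical helper; heterozygous or missing calls end the current run.
    """
    runs = []
    start = last = None
    count = 0
    for pos, genotype in calls:
        alleles = genotype.replace("|", "/").split("/")
        is_hom = (
            len(alleles) == 2
            and alleles[0] == alleles[1]
            and alleles[0] != "."
        )
        if is_hom:
            if start is None:
                start, count = pos, 0
            last = pos
            count += 1
        else:
            if start is not None and count >= min_variants:
                runs.append((start, last, count))
            start = None
    # Close a run that extends to the last call.
    if start is not None and count >= min_variants:
        runs.append((start, last, count))
    return runs


if __name__ == "__main__":
    example = [(100, "1/1"), (200, "1/1"), (300, "0/0"),
               (400, "0/1"), (500, "1/1"), (600, "1|1")]
    print(homozygous_runs(example, min_variants=3))
```

Real tools typically tolerate occasional heterozygous calls caused by sequencing errors and apply length and quality filters, which this sketch omits.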

easyfm: An easy software suite for file manipulation of Next Generation Sequencing data on desktops

10.1101/2021.09.29.462291 ◽

2021 ◽

Author(s):

Hyungtaek Jung ◽

Brendan Jeon ◽

Daniel Ortiz-Barrientos

Keyword(s):

Next Generation Sequencing ◽

High Performance ◽

Biological Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

File Formats ◽

Biological Data Analysis ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Storing and manipulating Next Generation Sequencing (NGS) file formats is an essential but difficult task in biological data analysis. The easyfm (easy file manipulation) toolkit (https://github.com/TaekAndBrendan/easyfm) makes manipulating commonly used NGS files more accessible to biologists. It enables them to perform end-to-end reproducible data analyses using a free standalone desktop application (available on Windows, Mac and Linux). Unlike existing tools (e.g. Galaxy), the Graphical User Interface (GUI)-based easyfm is not dependent on any high-performance computing (HPC) system and can be operated without an internet connection. This specific benefit allows easyfm to seamlessly integrate visual and interactive representations of NGS files, supporting a wider scope of bioinformatics applications in the life sciences.

A novel PCR error correction algorithm for cell-free DNA next generation sequencing data using high performance computing

European Journal of Cancer ◽

10.1016/s0959-8049(16)61660-x ◽

2016 ◽

Vol 61 ◽

pp. S186

Author(s):

C.S. Kim ◽

S. Gulati ◽

M. Ayub ◽

D.G. Rothwell ◽

S. Mohan ◽

...

Keyword(s):

Next Generation Sequencing ◽

Error Correction ◽

High Performance ◽

Next Generation Sequencing Data ◽

Correction Algorithm ◽

Sequencing Data ◽

Free Dna ◽

Error Correction Algorithm ◽

Performance Computing ◽

Generation Sequencing

Low-level variant calling for non-matched samples using a position-based and nucleotide-specific approach

BMC Bioinformatics ◽

10.1186/s12859-021-04090-y ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jeffrey N. Dudley ◽

Celine S. Hong ◽

Marwan A. Hawari ◽

Jasmine Shwetar ◽

...

Keyword(s):

Next Generation Sequencing ◽

Somatic Mosaicism ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Low Level ◽

Pathogenic Variants ◽

Segmental Overgrowth ◽

Generation Sequencing

Background: The widespread use of next-generation sequencing has identified an important role for somatic mosaicism in many diseases. However, detecting low-level mosaic variants from next-generation sequencing data remains challenging. Results: Here, we present a method for Position-Based Variant Identification (PBVI) that uses empirically derived distributions of alternate nucleotides from a control dataset. We modeled this approach on 11 segmental overgrowth genes. We show that this method improves detection of single nucleotide mosaic variants of 0.01–0.05 variant allele fraction compared to other low-level variant callers. At depths of 600× and 1200×, we observed >85% and >95% sensitivity, respectively. In a cohort of 26 individuals with somatic overgrowth disorders, PBVI showed improved signal-to-noise, identifying pathogenic variants in 17 individuals. Conclusion: PBVI can facilitate identification of low-level mosaic variants, thus increasing the utility of next-generation sequencing data for research and diagnostic purposes.
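To make the quantities in the abstract concrete, the sketch below computes a variant allele fraction (alternate reads divided by depth) and compares it with control values observed at the same position and nucleotide. The mean-plus-z-standard-deviations threshold is an assumed stand-in, not the statistic PBVI uses, and both function names are hypothetical.

```python
from statistics import mean, stdev


def variant_allele_fraction(alt_reads, depth):
    """Fraction of reads at a position that carry the alternate nucleotide."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_reads / depth


def exceeds_control(sample_vaf, control_vafs, z=3.0):
    """Flag a sample whose VAF lies above mean + z * SD of the controls.

    Assumed illustrative rule; needs at least two control values.
    """
    threshold = mean(control_vafs) + z * stdev(control_vafs)
    return sample_vaf > threshold


if __name__ == "__main__":
    controls = [0.001, 0.002, 0.001, 0.002]  # made-up control VAFs
    vaf = variant_allele_fraction(alt_reads=12, depth=1200)
    print(vaf, exceeds_control(vaf, controls))
```

Comparing against controls per position and per nucleotide reflects the idea that background error rates differ between sites, so a single global VAF cutoff would be too permissive at noisy positions and too strict at clean ones.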
