Tersect: a set theoretical utility for exploring sequence variant data

2019 ◽  
Author(s):  
Tomasz J Kurowski ◽  
Fady Mohareb

Abstract Summary Comparing genomic features among a large panel of individuals of the same species is now a core part of many bioinformatics analyses. This typically involves a series of complex set theoretical expressions to compare, intersect and extract symmetric differences between individuals within a large set of genotypes. Several publicly available tools are capable of performing such tasks; however, due to the sheer number of variants being queried, these tasks can be computationally expensive, with runtimes ranging from a few minutes up to several hours depending on the dataset size. This makes existing tools unsuitable for interactive data querying or for use within genomic data visualization platforms such as genome browsers. Tersect is a lightweight, high-performance command-line utility which interprets and applies flexible set theoretical expressions to sets of sequence variant data. Thanks to its highly optimized storage and indexing algorithms for variant data, it can be used both for interactive data exploration and as part of a larger pipeline. Availability and implementation Tersect was implemented in C and released under the MIT license. Tersect is freely available at https://github.com/tomkurowski/tersect. Supplementary information Supplementary data are available at Bioinformatics online.
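To make the idea of set theoretical expressions over variant sets concrete, the sketch below shows the kind of operation Tersect performs, expressed in plain Python rather than Tersect's actual storage format or command-line syntax. Variants are represented as (chrom, pos, ref, alt) tuples, and the sample file names are hypothetical.

```python
# Illustrative sketch only: variant sets as Python sets of
# (chrom, pos, ref, alt) tuples; not Tersect's indexing scheme or CLI.

def parse_vcf_variants(path):
    """Collect (chrom, pos, ref, alt) tuples from a plain-text VCF file."""
    variants = set()
    with open(path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            variants.add((chrom, int(pos), ref, alt))
    return variants

# Hypothetical sample files, used purely for illustration.
a = parse_vcf_variants("sample_A.vcf")
b = parse_vcf_variants("sample_B.vcf")

shared   = a & b   # variants present in both samples (intersection)
unique_a = a - b   # variants private to sample A (difference)
either   = a ^ b   # variants in exactly one sample (symmetric difference)

print(len(shared), len(unique_a), len(either))
```

Tersect applies the same algebra, but over compressed, indexed variant stores so that such expressions return in interactive time even for thousands of genotypes.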

2016 ◽  
Author(s):  
Arnald Alonso ◽  
Brittany N. Lasseigne ◽  
Kelly Williams ◽  
Josh Nielsen ◽  
Ryne C. Ramaker ◽  
...  

Abstract Summary The wide range of RNA-seq applications and their high computational needs require the development of pipelines that orchestrate the entire workflow and optimize usage of available computational resources. We present aRNApipe, a project-oriented pipeline for processing RNA-seq data in high-performance cluster environments. aRNApipe is highly modular and can be easily migrated to any high-performance computing (HPC) environment. The applications currently included in aRNApipe cover the essential RNA-seq primary analyses, including quality control metrics, transcript alignment, count generation, transcript fusion identification, alternative splicing and sequence variant calling. aRNApipe is project-oriented and dynamic, so users can easily update analyses to include or exclude samples or enable additional processing modules. Workflow parameters are set using a single configuration file that provides centralized tracking of all analytical processes. Finally, aRNApipe incorporates interactive web reports for sample tracking and a tool for managing the genome assemblies available to perform an analysis. Availability and documentation https://github.com/HudsonAlpha/aRNAPipe; DOI: 10.5281/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.
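As a rough illustration of the single-configuration-file idea, the sketch below reads a hypothetical project file and decides which processing modules to run. The section and key names are invented for illustration and do not reflect aRNApipe's actual configuration format.

```python
# Hypothetical illustration of one centralized project configuration driving
# module selection; key names are invented, not aRNApipe's real format.
import configparser

config = configparser.ConfigParser()
config.read("project.cfg")  # hypothetical project configuration file

modules = {
    "quality_control":  config.getboolean("modules", "quality_control", fallback=True),
    "alignment":        config.getboolean("modules", "alignment", fallback=True),
    "counting":         config.getboolean("modules", "counting", fallback=True),
    "fusion_detection": config.getboolean("modules", "fusion_detection", fallback=False),
    "variant_calling":  config.getboolean("modules", "variant_calling", fallback=False),
}

for name, enabled in modules.items():
    print(f"{name}: {'run' if enabled else 'skip'}")
```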


2018 ◽  
Vol 35 (14) ◽  
pp. 2512-2514 ◽  
Author(s):  
Bongsong Kim ◽  
Xinbin Dai ◽  
Wenchao Zhang ◽  
Zhaohong Zhuang ◽  
Darlene L Sanchez ◽  
...  

Abstract Summary We present GWASpro, a high-performance web server for the analysis of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data coupled with complex replicated experimental designs, such as those found in plant science investigations, and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, by which complex experimental designs that may include replications, treatments, locations and times can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data consisting of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators. Availability and implementation GWASpro is freely available at https://bioinfo.noble.org/GWASPRO. Supplementary information Supplementary data are available at Bioinformatics online.
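The design-matrix idea can be sketched in a few lines: for a replicated experiment, the fixed-effect part of the mixed model y = Xb + Zu + e is built by dummy-coding the design factors. The data, column names and coding below are a minimal assumption-laden sketch, not GWASpro's internal implementation.

```python
# Minimal sketch of building a fixed-effect design matrix for a replicated
# field design; data and column names are invented for illustration.
import pandas as pd

pheno = pd.DataFrame({
    "line":      ["L1", "L1", "L2", "L2", "L3", "L3"],
    "location":  ["A",  "B",  "A",  "B",  "A",  "B"],
    "treatment": ["ctl", "drt", "ctl", "drt", "ctl", "drt"],
    "yield":     [5.1, 4.3, 6.0, 4.9, 5.5, 4.7],
})

# Intercept plus dummy-coded location and treatment effects, as they would
# enter the linear mixed model y = Xb + Zu + e.
X = pd.get_dummies(pheno[["location", "treatment"]], drop_first=True, dtype=float)
X.insert(0, "intercept", 1.0)
print(X)
```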


Author(s):  
Martin Schreiber ◽  
Pedro S Peixoto ◽  
Terry Haut ◽  
Beth Wingate

This paper presents, discusses and analyses a massively parallel-in-time solver for linear oscillatory partial differential equations, which is a key numerical component for evolving weather, ocean, climate and seismic models. The time parallelization in this solver allows us to significantly exceed the computing resources used by parallelization-in-space methods, resulting in a correspondingly significant reduction in wall-clock time. One of the major difficulties in achieving Exascale performance for weather prediction is that the strong scaling limit – the parallel performance for a fixed problem size with an increasing number of processors – saturates. A main avenue to circumvent this problem is to introduce new numerical techniques that take advantage of time parallelism. In this paper, we use a time-parallel approximation that retains the frequency information of oscillatory problems. This approximation is based on (a) reformulating the original problem into a large set of independent terms and (b) solving each of these terms independently of one another, which can now be accomplished on a large number of high-performance computing resources. Our experiments were conducted on up to 3586 cores, for problem sizes whose parallelization-in-space scalability is already limited on a single node. With the parallelization-in-time approach we obtain reductions in time-to-solution of 118.3× for spectral methods and 1503.0× for finite-difference methods. A performance model, developed and calibrated for this approach, gives the scalability limitations a priori and allows us to extrapolate the performance of the method towards large-scale systems. This work has the potential to contribute a basic building block for parallelization-in-time approaches, with possible major implications in applied areas modelling oscillatory-dominated problems.
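A toy sketch of the underlying principle (not the paper's actual rational-approximation scheme): for a small linear oscillatory system u' = Lu with skew-symmetric L, the solution splits into independent eigenmodes, and each mode can be advanced in time on its own, which is exactly the kind of independent-term structure that can be distributed across many compute nodes.

```python
# Toy illustration only: split a linear oscillatory problem u' = L u into
# independent terms that could be advanced in parallel, then recombine.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n))
L = A - A.T                      # skew-symmetric -> purely oscillatory modes
u0 = rng.standard_normal(n)
t = 1.0

# Decompose into independent eigenmodes (the "large set of independent terms").
lam, V = np.linalg.eig(L)
coeffs = np.linalg.solve(V, u0.astype(complex))

# Each term is advanced independently of the others; in a parallel-in-time
# setting these per-term solves would be distributed across compute resources.
advanced = np.array([np.exp(lam[k] * t) * coeffs[k] for k in range(n)])
u_t = (V @ advanced).real

# Cross-check against the full matrix exponential on this small example.
assert np.allclose(u_t, expm(L * t) @ u0)
print(u_t)
```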


2018 ◽  
Vol 35 (13) ◽  
pp. 2338-2339 ◽  
Author(s):  
Hongyang Li ◽  
Shuai Hu ◽  
Nouri Neamati ◽  
Yuanfang Guan

Abstract Motivation Combination therapy is widely used in cancer treatment to overcome drug resistance. High-throughput drug screening is the standard approach to study drug combination effects, yet it becomes impractical when the number of drugs under consideration is large. Therefore, accurate and fast computational tools for predicting drug synergistic effects are needed to guide experimental design for developing candidate drug pairs. Results Here, we present TAIJI, high-performance software for fast and accurate prediction of drug synergism. It is based on the winning algorithm in the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge, a unique platform for unbiased evaluation of current state-of-the-art methods that included 160 team-based submissions. When tested across a broad spectrum of 85 different cancer cell lines and 1089 drug combinations, TAIJI achieved a high prediction correlation (0.53), approaching the accuracy level of experimental replicates (0.56). TAIJI achieves this state-of-the-field performance with a runtime on the scale of minutes. Availability and implementation TAIJI is freely available on GitHub (https://github.com/GuanLab/TAIJI). It is functional with built-in Perl and Python. Supplementary information Supplementary data are available at Bioinformatics online.
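The evaluation metric quoted above is a correlation between predicted and observed synergy scores. The sketch below computes a Pearson correlation on randomly generated placeholder scores, simply to show what the 0.53 and 0.56 figures measure; it is not DREAM Challenge data or TAIJI's scoring code.

```python
# Sketch of the evaluation metric: Pearson correlation between predicted and
# observed synergy scores. Arrays are random placeholders, not real data.
import numpy as np

rng = np.random.default_rng(1)
observed = rng.normal(size=1089)                        # one score per drug pair
predicted = 0.6 * observed + rng.normal(scale=0.8, size=1089)

r = np.corrcoef(observed, predicted)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```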


2012 ◽  
Vol 51 (05) ◽  
pp. 441-448 ◽  
Author(s):  
P. F. Neher ◽  
I. Reicht ◽  
T. van Bruggen ◽  
C. Goch ◽  
M. Reisert ◽  
...  

Summary Background: Diffusion MRI provides a unique window on brain anatomy and insights into aspects of tissue structure in living humans that could not be studied previously. There is a major effort in this rapidly evolving field of research to develop the algorithmic tools necessary to cope with the complexity of the datasets. Objectives: This work illustrates our strategy, which encompasses the development of a modularized and open software tool for data processing, visualization and interactive exploration in diffusion imaging research, and aims at reinforcing sustainable evaluation and progress in the field. Methods: In this paper, the usability and capabilities of a new application and toolkit component of the Medical Imaging Interaction Toolkit (MITK, www.mitk.org), MITK-DI, are demonstrated using in-vivo datasets. Results: MITK-DI provides a comprehensive software framework for high-performance data processing, analysis and interactive data exploration, which is designed in a modular, extensible fashion (using CTK) and in adherence to widely accepted coding standards (e.g. ITK, VTK). MITK-DI is available both as an open source software development toolkit and as a ready-to-use installable application. Conclusions: The open source release of the modular MITK-DI tools will increase verifiability and comparability within the research community and will also be an important step towards bringing many of the current techniques closer to clinical application.


1999 ◽  
Vol 4 (S1) ◽  
pp. 357-362
Author(s):  
C. Wetzel ◽  
T. Takeuchi ◽  
H. Amano ◽  
I. Akasaki

Identification of the electronic band structure in AlInGaN heterostructures is the key issue in high-performance light emitter and switching devices. In a large set of device-typical GaInN/GaN multiple quantum well samples of variable composition, a clear correspondence is found between transitions in photoreflection, electroreflection and photoluminescence. The effective band offset across the GaN/GaInN/GaN piezoelectric heterointerface is identified, and electric fields of 0.23–0.90 MV/cm are derived directly. In the bias voltage dependence, a level splitting within the well is observed, accompanied by the quantum confined Stark effect. We furthermore find a direct correspondence of luminescence bands with reflectance features. This indicates the dominating role of piezoelectric fields in the band structure of such typical strained layers.
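To give a sense of scale for the quoted fields, the short calculation below estimates the first-order potential drop e·F·d across a quantum well. The 3 nm well width is an assumption made purely for illustration and is not taken from the paper.

```python
# Back-of-the-envelope illustration of the quoted piezoelectric fields:
# potential drop e*F*d across a well of assumed width (3 nm is an assumption).
well_width_cm = 3e-7                      # 3 nm expressed in cm (assumed)
for field_mv_per_cm in (0.23, 0.90):
    field_v_per_cm = field_mv_per_cm * 1e6
    drop_ev = field_v_per_cm * well_width_cm   # e * F * d, in eV per electron
    print(f"{field_mv_per_cm:.2f} MV/cm -> {drop_ev * 1000:.0f} meV across the well")
```

Drops of tens to hundreds of meV across a single well are comparable to the transition energies involved, which is why the piezoelectric field dominates the observed band structure and Stark shifts.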


Author(s):  
Frédéric Lemoine ◽  
Luc Blassel ◽  
Jakub Voznica ◽  
Olivier Gascuel

Abstract Motivation The first cases of the COVID-19 pandemic emerged in December 2019. Until the end of February 2020, the number of available genomes was below 1,000, and their multiple alignment was easily achieved using standard approaches. Subsequently, the availability of genomes has grown dramatically. Moreover, some genomes are of low quality, with sequencing/assembly errors, making accurate re-alignment of all genomes nearly impossible on a daily basis. A more efficient, yet accurate, approach was clearly required to pursue all subsequent bioinformatics analyses of this crucial data. Results hCoV-19 genomes are highly conserved, with very few indels and no recombination. This makes the profile HMM approach particularly well suited to align new genomes, add them to an existing alignment and filter problematic ones. Using a core of ∼2,500 high-quality genomes, we estimated a profile using HMMER and implemented this profile in COVID-Align, a user-friendly interface to be used online or standalone via Docker. The alignment of 1,000 genomes requires less than 20 min on our cluster. Moreover, COVID-Align provides summary statistics which can be used to determine the sequencing quality and evolutionary novelty of input genomes (e.g. number of new mutations and indels). Availability https://covalign.pasteur.cloud, hub.docker.com/r/evolbioinfo/[email protected], [email protected] Supplementary information Supplementary information is available at Bioinformatics online.
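The profile-HMM workflow described here can be sketched with the standard HMMER command-line tools, wrapped in Python. The file names below are placeholders, and the invocation is kept to the basic hmmbuild/hmmalign options rather than COVID-Align's exact pipeline; HMMER must be installed for this to run.

```python
# Sketch of a profile-HMM alignment workflow with HMMER (hmmbuild/hmmalign);
# file names are placeholders, not COVID-Align's actual inputs.
import subprocess

# Build a profile HMM from a trusted core alignment of high-quality genomes.
subprocess.run(
    ["hmmbuild", "--dna", "core_profile.hmm", "core_alignment.fasta"],
    check=True,
)

# Align newly submitted genomes against the profile; output in Stockholm format.
subprocess.run(
    ["hmmalign", "--dna", "-o", "new_genomes.sto", "core_profile.hmm", "new_genomes.fasta"],
    check=True,
)
```

Because each new genome is aligned to a fixed profile rather than re-aligned against all other genomes, the cost of adding genomes grows linearly with their number, which is what makes daily updates feasible.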


2019 ◽  
Vol 35 (18) ◽  
pp. 3453-3460 ◽  
Author(s):  
Anastasia Tyryshkina ◽  
Nate Coraor ◽  
Anton Nekrutenko

Abstract Motivation One of the many technical challenges that arise when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation lead to an inefficient use of computational infrastructure. Over-allocation locks resources that could otherwise be used for other analyses. Under-allocation causes job failure and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation. Results Here we introduce the Galaxy job run dataset and test popular machine learning models on the task of resource usage prediction. We include three popular forest models – the extra trees regressor, the gradient boosting regressor and the random forest regressor – and find that random forests perform best on the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs. Quantile regression forests are more accurate in their predictions and grant the ability to improve performance by changing the confidence of the estimates. However, the sizes of the confidence intervals are variable and cannot be absolutely constrained. Random forest classifiers address this problem by providing control over the size of the prediction intervals with an accuracy comparable to that of the regressor. We show that estimating the memory requirements of a job is possible using the same methods, which, as far as we know, has not been done before. Such estimation can be highly beneficial for accurate resource allocation. Availability and implementation Source code is available at https://github.com/atyryshkina/algorithm-performance-analysis, implemented in Python. Supplementary information Supplementary data are available at Bioinformatics online.
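A minimal sketch of the prediction task, assuming synthetic stand-in features (input size and a tool identifier) rather than the actual Galaxy job run dataset or the authors' published pipeline: fit a random forest regressor to historical runtimes and score it on held-out jobs.

```python
# Minimal sketch of runtime prediction with a random forest; features and
# targets are synthetic stand-ins, not the Galaxy job run dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_jobs = 5000
input_size_gb = rng.exponential(scale=2.0, size=n_jobs)
tool_id = rng.integers(0, 10, size=n_jobs)
runtime_min = 3.0 * input_size_gb + tool_id + rng.normal(scale=1.0, size=n_jobs)

X = np.column_stack([input_size_gb, tool_id])
X_train, X_test, y_train, y_test = train_test_split(X, runtime_min, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out jobs:", round(model.score(X_test, y_test), 3))
```

The same setup carries over to memory prediction by swapping the target, and to walltime selection by replacing the regressor with a quantile regression forest or a classifier over runtime bins, as the abstract describes.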


Author(s):  
Ronghui You ◽  
Yuxuan Liu ◽  
Hiroshi Mamitsuka ◽  
Shanfeng Zhu

Abstract Motivation With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: it (i) uses Learning To Rank, which is time-consuming, (ii) can capture only certain pre-defined sections in full text and (iii) ignores the whole MEDLINE database. Results We propose a computationally lighter, full-text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible with respect to section organization in full text. BERTMeSH combines two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which lets BERTMeSH capture the deep semantics of full text, and (ii) a transfer learning strategy that uses both full text in PubMed Central (PMC) and titles and abstracts (without full text) in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, on 20 K test articles from PMC, BERTMeSH achieved a micro F-measure of 69.2%, which was 6.3% higher than FullMeSH, with the difference being statistically significant. Also, predicting the 20 K test articles took 5 min with BERTMeSH, compared with more than 10 h with FullMeSH, demonstrating the computational efficiency of BERTMeSH. Supplementary information Supplementary data are available at Bioinformatics online.
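The core modeling idea, a BERT encoder feeding a multi-label sigmoid head over the MeSH vocabulary, can be sketched as below. The model name, the untrained linear head and the vocabulary size are illustrative assumptions; this is not the published BERTMeSH architecture or its trained weights.

```python
# Schematic sketch of a BERT encoder with a multi-label sigmoid head for
# MeSH indexing; the pretrained model, head and vocabulary size are
# illustrative placeholders, not the BERTMeSH release.
import torch
from transformers import AutoModel, AutoTokenizer

NUM_MESH_TERMS = 29_000  # rough size of the MeSH descriptor vocabulary (placeholder)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, NUM_MESH_TERMS)

text = "Example full-text passage from a PMC article ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state[:, 0]   # [CLS] representation
    probs = torch.sigmoid(classifier(hidden))            # one probability per MeSH term

top = torch.topk(probs[0], k=5).indices
print("Top candidate MeSH term indices:", top.tolist())
```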


2019 ◽  
Vol 35 (23) ◽  
pp. 4907-4911 ◽  
Author(s):  
Jianglin Feng ◽  
Aakrosh Ratan ◽  
Nathan C Sheffield

Abstract Motivation Genomic data are frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary. Results We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and finally augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory usage of AIList is 4–60% of that of other methods. The AIList data structure therefore provides a significantly improved fundamental operation for highly scalable genomic data analysis. Availability and implementation An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org. Supplementary information Supplementary data are available at Bioinformatics online.
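A simplified, single-component sketch of the augmented-list idea: intervals are sorted by start, each position is augmented with the running maximum end, and a query binary-searches the starts and scans left only while the running maximum says an overlap is still possible. The real AIList additionally decomposes the sorted list into a few flattened sublists; that step is omitted here for brevity.

```python
# Simplified single-component AIList-style query over half-open [start, end)
# intervals; the decomposition into sublists used by the real AIList is omitted.
from bisect import bisect_left

def build(intervals):
    """intervals: list of (start, end). Returns sorted list plus running max of end."""
    srt = sorted(intervals)
    run_max, cur = [], float("-inf")
    for _, end in srt:
        cur = max(cur, end)
        run_max.append(cur)
    return srt, run_max

def query(srt, run_max, qs, qe):
    """Enumerate intervals overlapping the query [qs, qe)."""
    hits = []
    starts = [s for s, _ in srt]
    i = bisect_left(starts, qe) - 1        # rightmost interval with start < qe
    while i >= 0 and run_max[i] > qs:      # stop once no earlier interval can reach qs
        if srt[i][1] > qs:
            hits.append(srt[i])
        i -= 1
    return hits

srt, run_max = build([(1, 5), (3, 9), (10, 12), (2, 20)])
print(query(srt, run_max, 4, 11))   # -> [(10, 12), (3, 9), (2, 20), (1, 5)]
```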

