Vargas: heuristic-free alignment for assessing linear and graph read aligners

Bioinformatics ◽ 10.1093/bioinformatics/btaa265 ◽ 2020 ◽ Vol 36 (12) ◽ pp. 3712-3718

Author(s): Charlotte A Darby ◽ Ravi Gaddipati ◽ Michael C Schatz ◽ Ben Langmead

Keyword(s): Alignment Accuracy ◽ Supplementary Information ◽ Local Alignment ◽ Maximum Speed ◽ Command Line ◽ Scoring Functions ◽ Exact Match ◽ Large Numbers ◽ Computationally Intensive ◽ Optimal Alignments

Abstract Motivation: Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Results: Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these ‘gold standard’ Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM and vg to align more reads correctly. Availability and implementation: Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.
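To make "heuristic-free" concrete, the sketch below fills the full dynamic-programming matrix for local alignment with affine gap penalties (the classic Smith-Waterman recurrence with Gotoh's affine-gap extension), so no candidate location is skipped. It is only an illustration in Python against a linear reference, not the Vargas implementation, which is written in C++, vectorized, and also handles graphs. The function name and the default scoring values are assumptions chosen for the example.

```python
NEG_INF = float("-inf")


def local_affine_score(read, ref, match=2, mismatch=-6, gap_open=-5, gap_extend=-3):
    """Return the best local alignment score of read against ref.

    Hypothetical helper for illustration. A gap of length k costs
    gap_open + k * gap_extend. Every cell of the matrix is computed,
    so the returned score is optimal under this scoring function.
    """
    n = len(ref)
    h_prev = [0] * (n + 1)        # best score ending at (i-1, j)
    f_prev = [NEG_INF] * (n + 1)  # best score ending in a vertical gap at (i-1, j)
    best = 0
    for i in range(1, len(read) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [NEG_INF] * (n + 1)
        e = NEG_INF               # best score ending in a horizontal gap in row i
        for j in range(1, n + 1):
            # Extend an existing gap or open a new one, in each direction.
            e = max(e + gap_extend, h_cur[j - 1] + gap_open + gap_extend)
            f_cur[j] = max(f_prev[j] + gap_extend, h_prev[j] + gap_open + gap_extend)
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            # The 0 term lets a local alignment start anywhere.
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

Each inner-loop iteration is one "cell update", the unit in which the abstract reports throughput. A heuristic aligner can be checked against such a score: if its reported alignment scores lower, it missed the optimum.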

Alview: Portable Software for Viewing Sequence Reads in BAM Formatted Files

Cancer Informatics ◽ 10.4137/cin.s26470 ◽ 2015 ◽ Vol 14 ◽ pp. CIN.S26470 ◽ Cited By ~ 2

Author(s): Richard P. Finney ◽ Qing-Rong Chen ◽ Cu V. Nguyen ◽ Chih Hao Hsu ◽ Chunhua Yan ◽ ...

Keyword(s): Graphical User Interface ◽ Reference Genome ◽ Source Code ◽ Software Tool ◽ Command Line ◽ Sequencing Data ◽ Genome Data ◽ Command Line Tool ◽ Portable Software ◽ Microsoft Windows

The name Alview is a contraction of the term Alignment Viewer. Alview is a software tool, compiled to native architecture, for visualizing the alignment of sequencing data. Inputs are files of short-read sequences aligned to a reference genome in the SAM/BAM format and files containing reference genome data. Outputs are visualizations of these aligned short reads. Alview is written in portable C with optional graphical user interface (GUI) code written in C, C++, and Objective-C. The application can run in three different ways: as a web server, as a command line tool, or as a native GUI program. Alview is compatible with Microsoft Windows, Linux, and Apple OS X. It is available as a web demo at https://cgwb.nci.nih.gov/cgi-bin/alview. The source code and Windows/Mac/Linux executables are available via https://github.com/NCIP/alview.

idCOV: a pipeline for quick clade identification of SARS-CoV-2 isolates

10.1101/2020.10.08.330456 ◽ 2020

Author(s): Xun Zhu ◽ Ti-Cheng Chang ◽ Richard Webby ◽ Gang Wu

Keyword(s): Personal Computer ◽ Source Code ◽ Command Line ◽ Sequencing Data ◽ Link Type ◽ Public Dataset ◽ Virus Isolates

Abstract idCOV is a phylogenetic pipeline for quickly identifying the clades of SARS-CoV-2 virus isolates from raw sequencing data based on a selected clade-defining marker list. Using a public dataset, we show that idCOV can make calls equivalent to those annotated by Nextstrain.org on all three common clade systems, working directly from user-uploaded FastQ files. Web and equivalent command-line interfaces are available. It can be deployed in any Linux environment, including personal computers, HPC systems and the cloud. The source code is available at https://github.com/xz-stjude/idcov. Installation documentation can be found at https://github.com/xz-stjude/idcov/blob/master/README.md.

Key Concepts and Definitions of Open Source Communities

Encyclopedia of Networked and Virtual Organizations ◽ 10.4018/978-1-59904-885-7.ch099 ◽ 2010 ◽ pp. 753-760

Author(s): Ruben van Wendel de Joode ◽ Sebastian Spaeth

Keyword(s): Software Development ◽ Open Source ◽ Open Source Software ◽ Online Communities ◽ Source Code ◽ Professional Organizations ◽ Large Numbers ◽ Key Concepts ◽ Open Source Communities ◽ Do So

Most open source software is developed in online communities. These communities are typically referred to as “open source software communities” or “OSS communities.” In OSS communities, the source code, which is the human-readable part of software, is treated as something that is open and that should be downloadable and modifiable by anyone who wishes to do so. The availability of the source code has enabled a practice of decentralized software development in which large numbers of people contribute time and effort. Communities like Linux and Apache, for instance, have been able to connect thousands of individual programmers and professional organizations (although most project communities remain relatively small). These people and organizations are not confined to certain geographical places; on the contrary, they come from literally all continents and they interact and collaborate virtually.

Scoring functions for drug-effect similarity

Briefings in Bioinformatics ◽ 10.1093/bib/bbaa072 ◽ 2020 ◽ Cited By ~ 1

Author(s): Stephan Struckmann ◽ Mathias Ernst ◽ Sarah Fischer ◽ Nancy Mah ◽ Georg Fuellen ◽ ...

Keyword(s): Cell Line ◽ Drug Effect ◽ Pearson Correlation ◽ Source Code ◽ New Drugs ◽ Supplementary Information ◽ Disease Genes ◽ Scoring Functions ◽ Profile Changes

Abstract Motivation: The difficulty of finding new drugs and bringing them to the market has led to an increased interest in finding new applications for known compounds. Biological samples from many disease contexts have been extensively profiled by transcriptomics, and, intuitively, this motivates a search for compounds with a reversing effect on the expression of characteristic disease genes. However, disease effects may be cell line-specific and also depend on other factors, such as genetics and environment. Transcription profile changes between healthy and diseased cells relate in complex ways to profile changes gathered from cell lines upon stimulation with a drug. Despite these differences, we expect that there will be some similarity in the gene regulatory networks at play in both situations. The challenge is to match transcriptomes for both diseases and drugs alike, even though the exact molecular pathology/pharmacogenomics may not be known. Results: We substitute the challenge of matching a drug effect to a disease effect with the challenge of matching a drug effect to the effect of the same drug at another concentration or in another cell line. This is well-defined, reproducible in vitro and in silico and extendable with external data. Based on the Connectivity Map (CMap) dataset, we combined 26 different similarity scores with six different heuristics to reduce the number of genes in the model. Such gene filters may also utilize external knowledge, e.g. from biological networks. We found that no similarity score always outperforms all others for all drugs, but the Pearson correlation finds the same drug with the highest reliability. Results are improved by filtering for highly expressed genes and, to a lesser degree, for genes with large fold changes. A network-based reduction of contributing transcripts, here implemented by the FocusHeuristics, was also beneficial. We found no drop in prediction accuracy when reducing the whole transcriptome to the set of 1000 landmark genes of the CMap’s successor project Library of Integrated Network-based Cellular Signatures. All source code to re-analyze and extend the CMap data, the source code of heuristics, filters and their evaluation are available to propel the development of new methods for drug repurposing. Availability: https://bitbucket.org/ibima/moldrugeffectsdb Contact: [email protected] Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
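The matching task described above (score a query drug profile against reference profiles, optionally after a gene filter, and check whether the same drug ranks first) can be sketched in a few lines. This is a minimal illustration using Pearson correlation only. The function names, the `keep` parameter, and the toy profiles are hypothetical and are not taken from the authors' code or from the CMap data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant profile has no defined correlation; treat it as 0 here.
    return cov / (sx * sy) if sx and sy else 0.0


def rank_by_similarity(query, references, keep=None):
    """Rank reference profile names by Pearson similarity to the query.

    keep is an optional list of gene indices, standing in for a gene
    filter such as "highly expressed genes only" (an assumption made
    for this example).
    """
    idx = list(range(len(query))) if keep is None else list(keep)
    q = [query[i] for i in idx]
    scores = {
        name: pearson(q, [profile[i] for i in idx])
        for name, profile in references.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy fold-change profiles over four genes (made-up numbers).
query = [1.0, 2.0, 3.0, 4.0]
references = {
    "same_drug_other_dose": [2.0, 4.0, 6.0, 8.0],
    "unrelated_drug": [4.0, 3.0, 2.0, 1.0],
}
ranking = rank_by_similarity(query, references)
```

Passing a different `keep` list changes which genes contribute to the score, which is how the effect of a gene filter on the ranking could be compared in a setup like this.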

aCLImatise: automated generation of tool definitions for bioinformatics workflows

Bioinformatics ◽ 10.1093/bioinformatics/btaa1033 ◽ 2020

Author(s): Michael Milton ◽ Natalie Thorne

Keyword(s): Source Code ◽ Supplementary Information ◽ Command Line ◽ Supplementary Data ◽ Automated Generation ◽ Base Camp ◽ Python Package ◽ Bioinformatics Workflow ◽ Bioinformatics Workflows

Abstract Summary: aCLImatise is a utility for automatically generating tool definitions compatible with bioinformatics workflow languages, by parsing command-line help output. aCLImatise also has an associated database called the aCLImatise Base Camp, which provides thousands of pre-computed tool definitions. Availability and implementation: The latest aCLImatise source code is available within a GitHub organisation, under the GPL-3.0 license: https://github.com/aCLImatise. In particular, documentation for the aCLImatise Python package is available at https://aclimatise.github.io/CliHelpParser/, and the aCLImatise Base Camp is available at https://aclimatise.github.io/BaseCamp/. Supplementary information: Supplementary data are available at Bioinformatics online.

Asymptomatic giardiasis-more prevalent in refugees than in native inhabitants of the city of Nis, Serbia

Open Medicine ◽ 10.2478/s11536-008-0013-2 ◽ 2008 ◽ Vol 3 (2) ◽ pp. 203-206

Author(s): Natasa Miladinovic-Tasic ◽ Suzana Tasic ◽ Ivana Kranjcic-Zec ◽ Gordana Tasic ◽ Aleksandar Tasic ◽ ...

Keyword(s): Microscopic Examination ◽ Gold Standard ◽ Parasitic Infection ◽ Giardia Duodenalis ◽ Population Group ◽ Concentration Technique ◽ Large Numbers ◽ Stool Samples ◽ Statistical Survey ◽ The City

Abstract Giardiasis is a parasitic infection of the digestive tract, most commonly occurring in closed communities such as schools, kindergartens, prisons, and campuses. The civil war in the former Yugoslav republics and in Kosovo caused a large number of refugees to take shelter in the territory of Serbia. Such large numbers of refugees could be accommodated only in collective centers. Our aim was to examine the differences in the prevalence of asymptomatic giardiasis among 122 refugees from the former Yugoslav republics who lived in the collective centers in Nis, Serbia, and 241 native Nis inhabitants. Conventional microscopic examination (CME) of three stool samples, with or without a concentration technique, and the enzyme immunoassay (EIA) method were used. The CME method of three stool samples is considered the gold standard in our statistical survey. Asymptomatic giardiasis was found in 7 refugees (5.7%) using the EIA method, while using the CME (3 samples) Giardia duodenalis (G. duodenalis) was detected in 6 persons (4.9%). Using the EIA method and the CME (3 samples), G. duodenalis was detected in only 1 person (0.4%) in the population group of native inhabitants. Asymptomatic giardiasis was more prevalent in the population group of refugees accommodated in collective centers than in native inhabitants in the Nis municipality, Serbia.

The Design of Eman, an Experiment Manager

Prague Bulletin of Mathematical Linguistics ◽ 10.2478/pralin-2013-0003 ◽ 2013 ◽ Vol 99 (1) ◽ pp. 39-56 ◽ Cited By ~ 1

Author(s): Ondřej Bojar ◽ Aleš Tamchyna

Keyword(s): Machine Translation ◽ Command Line ◽ Tips And Tricks ◽ Field Of Study ◽ The Core ◽ Large Numbers ◽ Experiment Management ◽ Computational Research ◽ The Many ◽ High Level

Abstract We present eman, a tool for managing large numbers of computational experiments. Over the years of our research in machine translation (MT), we have collected a couple of ideas for efficient experimenting. We believe these ideas are generally applicable in (computational) research of any field. We incorporated them into eman in order to make them available in a command-line Unix environment. The aim of this article is to highlight the core of the many ideas. We hope the text can serve as a collection of experiment management tips and tricks for anyone, regardless of their field of study or the computer platform they use. The specific examples we provide in eman’s current syntax are less important, but they allow us to use concrete terms. The article thus also fills a gap in the eman documentation by providing a high-level overview.

CAFE 5 models variation in evolutionary rates among gene families

Bioinformatics ◽ 10.1093/bioinformatics/btaa1022 ◽ 2020

Author(s): Fábio K Mendes ◽ Dan Vanderpool ◽ Ben Fulton ◽ Matthew W Hahn

Keyword(s): Software Package ◽ Computational Analysis ◽ Source Code ◽ Gene Families ◽ Gene Family Evolution ◽ Supplementary Information ◽ Rate Variation ◽ Gene Gain ◽ Command Line ◽ Gains And Losses

Abstract Motivation: Genome sequencing projects have revealed frequent gains and losses of genes between species. Previous versions of our software, Computational Analysis of gene Family Evolution (CAFE), have allowed researchers to estimate parameters of gene gain and loss across a phylogenetic tree. However, the underlying model assumed that all gene families had the same rate of evolution, despite evidence suggesting a large amount of variation in rates among families. Results: Here, we present CAFE 5, a completely re-written software package with numerous performance and user-interface enhancements over previous versions. These include improved support for multithreading, the explicit modeling of rate variation among families using gamma-distributed rate categories, and command-line arguments that preclude the use of accessory scripts. Availability and implementation: CAFE 5 source code, documentation, test data and a detailed manual with examples are freely available at https://github.com/hahnlab/CAFE5/releases. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Knot_pull—python package for biopolymer smoothing and knot detection

Bioinformatics ◽ 10.1093/bioinformatics/btz644 ◽ 2019 ◽ Cited By ~ 1

Author(s): Aleksandra I Jarmolinska ◽ Anna Gambin ◽ Joanna I Sulkowska

Keyword(s): Learning Curve ◽ Source Code ◽ Supplementary Information ◽ Command Line ◽ Supplementary Data ◽ Steep Learning Curve ◽ Independent Source ◽ Python Package

Abstract Summary: The biggest hurdle in studying topology in biopolymers is the steep learning curve for actually seeing the knots in structure visualization. Knot_pull is a command line utility designed to simplify this process: it presents the user with a smoothing trajectory for provided structures (any number and length of protein, RNA or chromatin chains in PDB, CIF or XYZ format), and calculates the knot type (including presence of any links, and slipknots when a subchain is specified). Availability and implementation: Knot_pull works under Python >=2.7 and is system independent. Source code and documentation are available at http://github.com/dzarmola/knot_pull under the GNU GPL license and also include a wrapper script for PyMOL for easier visualization. Examples of smoothing trajectories can be found at: https://www.youtube.com/watch?v=IzSGDfc1vAY. Supplementary information: Supplementary data are available at Bioinformatics online.
