Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification

Oliver Schwengers; Lukas Jelonek; Marius Alfred Dieckmann; Sebastian Beyvers; Jochen Blom; Alexander Goesmann

doi:10.1099/mgen.0.000685

Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification

Microbial Genomics ◽

10.1099/mgen.0.000685 ◽

2021 ◽

Vol 7 (11) ◽

Author(s):

Oliver Schwengers ◽

Lukas Jelonek ◽

Marius Alfred Dieckmann ◽

Sebastian Beyvers ◽

Jochen Blom ◽

...

Keyword(s):

Software Tool ◽

Software Tools ◽

Command Line ◽

Bacterial Genomes ◽

Functional Annotations ◽

Link Type ◽

Small Proteins ◽

Alignment Free ◽

Sequence Identification ◽

Downstream Analysis

Command-line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command-line software pipelines heavily depend on taxon-specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command-line software tool for the robust, taxon-independent, thorough and, nonetheless, fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross-references. Annotation results are exported in GFF3 and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files, as well as comprehensive JSON files, facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command-line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenomic-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references, whilst providing comparable wall-clock runtimes. Bakta is implemented in Python 3 and runs on MacOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta. An accompanying web version is available at https://bakta.computational.bio.
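The abstract does not say how the alignment-free identification works. One common way to identify a sequence without alignment is an exact lookup by hash digest, and the following minimal Python sketch assumes that approach. The digest function (MD5), the normalization rules, the example sequences and the identifiers are all hypothetical and chosen only for illustration; they are not taken from Bakta's code or databases.

```python
import hashlib
from typing import Optional


def sequence_digest(protein: str) -> str:
    """Return a hex digest of a normalized amino acid sequence.

    Normalization (assumed here): strip whitespace, upper-case,
    and drop a trailing stop-codon asterisk.
    """
    normalized = protein.strip().upper().rstrip("*")
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def build_index(known: dict[str, str]) -> dict[str, str]:
    """Map digest -> identifier for a set of known protein sequences."""
    return {sequence_digest(seq): ident for seq, ident in known.items()}


def identify(protein: str, index: dict[str, str]) -> Optional[str]:
    """Look up a protein by exact digest match; None if it is unknown."""
    return index.get(sequence_digest(protein))


# Hypothetical reference entries (sequence -> made-up identifier).
index = build_index({
    "MKTAYIAKQR": "example-id-1",
    "MSLLTEVETP": "example-id-2",
})

hit = identify("mktayiakqr*", index)   # same sequence, different formatting
miss = identify("MKTAYIAKQK", index)   # one residue differs, so no exact match
```

A lookup of this kind costs one hash computation and one dictionary access per sequence, which is why it can be much faster than alignment. The trade-off is that only exact matches are found, so sequences without a hit still need a slower similarity search.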


Bakta: Rapid & standardized annotation of bacterial genomes via alignment-free sequence identification

10.1101/2021.09.02.458689 ◽

2021 ◽

Author(s):

Oliver Schwengers ◽

Lukas Jelonek ◽

Marius Dieckmann ◽

Sebastian Beyvers ◽

Jochen Blom ◽

...

Keyword(s):

Software Tool ◽

Software Tools ◽

Command Line ◽

Bacterial Genomes ◽

Functional Annotations ◽

Small Proteins ◽

Alignment Free ◽

Sequence Identification ◽

Identification Approach ◽

Downstream Analysis

Abstract: Command line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command line software pipelines heavily depend on taxon specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command line software tool for the robust, taxon-independent, thorough and nonetheless fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross references. Annotation results are exported in GFF3 and INSDC-compliant flat files as well as comprehensive JSON files facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenomic-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references whilst providing comparable wall clock runtimes. Bakta is implemented in Python 3 and runs on MacOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta. An accompanying web version is available at https://bakta.computational.bio.


UPS-indel: a Universal Positioning System for Indels

10.1101/133553 ◽

2017 ◽

Cited By ~ 3

Author(s):

Mohammad Shabbir Hasan ◽

Xiaowei Wu ◽

Layne T. Watson ◽

Zhiyi Li ◽

Liqing Zhang

Keyword(s):

State Of The Art ◽

Online Version ◽

Positioning System ◽

Command Line ◽

Human Chromosomes ◽

Link Type ◽

Indel Calling ◽

Downstream Analysis ◽

Command Line Version ◽

New System

Abstract. Background: Indels, though differing in allele sequence and position, are biologically equivalent when they lead to the same altered sequences. Storing biologically equivalent indels as distinct entries in databases causes data redundancy, and may mislead downstream analysis and interpretations. About 10% of the human indels stored in dbSNP are redundant. It is thus desirable to have a unified system for identifying and representing equivalent indels in publicly available databases. Moreover, a unified system is also desirable to compare the indel calling results produced by different tools. This paper describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system, which also can be used to compare indel calling results produced by different tools. Results: UPS-indel identifies nearly 15% of the indels in dbSNP (version 142) as redundant across all human chromosomes, higher than previously reported. When applied to COSMIC coding and noncoding indel datasets, UPS-indel identifies nearly 29% and 13% of the indels as redundant, respectively. Comparing the performance of UPS-indel with the existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants shows that UPS-indel is able to identify 456,352 more redundant indels in dbSNP, 2,118 more in the COSMIC coding dataset, and 553 more in the COSMIC noncoding indel dataset, in addition to the ones reported jointly by these tools. Moreover, comparing UPS-indel to other state-of-the-art approaches for indel call set comparison demonstrates that UPS-indel is clearly superior to other approaches in finding indels in common among call sets. Conclusions: UPS-indel is theoretically proven to find all equivalent indels, and is thus exhaustive. UPS-indel is written in C++ and the command line version is freely available to download at http://ups-indel.sourceforge.net. The online version of UPS-indel is available at http://bench.cs.vt.edu/ups-indel/.
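The definition of equivalence used above (two indels are equivalent when they lead to the same altered sequence) can be checked directly on a small example. The Python sketch below is a brute-force illustration of that definition only; it is not the UPS-indel algorithm, and the function names, the 0-based coordinate convention and the toy reference sequence are assumptions made for this example.

```python
def apply_indel(reference: str, pos: int, ref_allele: str, alt_allele: str) -> str:
    """Replace ref_allele with alt_allele at 0-based position pos."""
    if reference[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("reference allele does not match the reference sequence")
    return reference[:pos] + alt_allele + reference[pos + len(ref_allele):]


def are_equivalent(reference: str, indel_a: tuple, indel_b: tuple) -> bool:
    """Two indels are equivalent if they yield the same altered sequence."""
    return apply_indel(reference, *indel_a) == apply_indel(reference, *indel_b)


# Toy reference containing a short CA repeat.
reference = "GCACACAT"

# Three deletions written as (position, deleted allele, inserted allele).
del_ca_at_1 = (1, "CA", "")
del_ca_at_3 = (3, "CA", "")
del_ac_at_2 = (2, "AC", "")   # different allele sequence and position
del_at_at_6 = (6, "AT", "")   # outside the repeat

same_position_shift = are_equivalent(reference, del_ca_at_1, del_ca_at_3)
same_different_allele = are_equivalent(reference, del_ca_at_1, del_ac_at_2)
different = are_equivalent(reference, del_ca_at_1, del_at_at_6)
```

Inside the repeat, all three deletions produce GCACAT, so a database that stores them as separate entries holds redundant records. Comparing altered sequences pairwise like this does not scale to whole databases, which is the motivation for a coordinate system in which equivalent indels receive the same representation.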


TRTools: a toolkit for genome-wide analysis of tandem repeats

10.1101/2020.03.17.996033 ◽

2020 ◽

Cited By ~ 1

Author(s):

Nima Mousavi ◽

Jonathan Margoliash ◽

Neha Pusarla ◽

Shubham Saini ◽

Richard Yanicky ◽

...

Keyword(s):

Quality Control ◽

Tandem Repeats ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Genome Wide Analysis ◽

Link Type ◽

Genome Wide ◽

Wide Range ◽

Downstream Analysis

Abstract. Summary: A rich set of tools has recently been developed for performing genome-wide genotyping of tandem repeats (TRs). However, standardized tools for downstream analysis of these results are lacking. To facilitate TR analysis applications, we present TRTools, a Python library and a suite of command-line tools for filtering, merging, and quality control of TR genotype files. TRTools utilizes an internal harmonization module, making it compatible with outputs from a wide range of TR genotypers. Availability: TRTools is freely available at https://github.com/gymreklab/[email protected] Supplementary information: Supplementary data are available at bioRxiv.


BuddySuite: Command-line toolkits for manipulating sequences, alignments, and phylogenetic trees

10.1101/040675 ◽

2016 ◽

Author(s):

Stephen R. Bond ◽

Karl E. Keat ◽

Sofia N. Barreira ◽

Andreas D. Baxevanis

Keyword(s):

Sequence Alignment ◽

Phylogenetic Trees ◽

Phylogenetic Reconstruction ◽

General Purpose ◽

Command Line ◽

Link Type ◽

File Formats ◽

Downstream Analysis ◽

Python Package ◽

Common Sequence

Abstract: The ability to manipulate sequence, alignment, and phylogenetic tree files has become an increasingly important skill in the life sciences, whether to generate summary information or to prepare data for further downstream analysis. The command line can be an extremely powerful environment for interacting with these resources, but only if the user has the appropriate general-purpose tools on hand. BuddySuite is a collection of four independent yet interrelated command-line toolkits that facilitate each step in the workflow of sequence discovery, curation, alignment, and phylogenetic reconstruction. Most common sequence, alignment, and tree file formats are automatically detected and parsed, and over 100 tools have been implemented for manipulating these data. The project has been engineered to easily accommodate the addition of new tools, is written in the popular programming language Python, and is hosted on the Python Package Index and GitHub to maximize accessibility. Documentation for each BuddySuite tool, including usage examples, is available at http://tiny.cc/buddysuite_wiki. All software is open source and freely available through http://research.nhgri.nih.gov/software/BuddySuite.


plasmidSPAdes: Assembling Plasmids from Whole Genome Sequencing Data

10.1101/048942 ◽

2016 ◽

Cited By ~ 15

Author(s):

Dmitry Antipov ◽

Nolan Hartwick ◽

Max Shen ◽

Mikhail Raiko ◽

Alla Lapidus ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Software Tool ◽

Software Tools ◽

Whole Genome Sequencing Data ◽

Antibiotics Resistance ◽

Whole Genome ◽

Sequencing Data ◽

Bacterial Genomes ◽

Specialized Software

Abstract. Motivation: Plasmids are stably maintained extra-chromosomal genetic elements that replicate independently from the host cell’s chromosomes. Although plasmids harbor biomedically important genes (such as genes involved in virulence and antibiotics resistance), there is a shortage of specialized software tools for extracting and assembling plasmid data from whole genome sequencing projects. Results: We present the plasmidSPAdes algorithm and software tool for assembling plasmids from whole genome sequencing data and benchmark its performance on a diverse set of bacterial genomes. Availability and implementation: plasmidSPAdes is publicly available at http://spades.bioinf.spbau.ru/plasmidSPAdes/[email protected]


Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics

BMC Biology ◽

10.1186/s12915-020-00938-6 ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Congyu Lu ◽

Zheng Zhang ◽

Zena Cai ◽

Zhaozhong Zhu ◽

Ye Qiu ◽

...

Keyword(s):

Prediction Accuracy ◽

Functional Characterization ◽

Gaussian Model ◽

Software Tool ◽

Biological Properties ◽

Rapid Identification ◽

Taxonomic Assignment ◽

Genus Level ◽

Alignment Free ◽

Archaeal Viruses

Abstract. Background: Viruses are ubiquitous biological entities, estimated to be the largest reservoirs of unexplored genetic diversity on Earth. Full functional characterization and annotation of newly discovered viruses requires tools that determine the taxonomic assignment, host range, and biological properties of the virus. Here we focus on prokaryotic viruses, which include phages and archaeal viruses, and for which identifying the viral host is an essential step in characterizing the virus, as the virus relies on the host for survival. Currently, the viral host is determined either by culturing the virus, which is low-throughput, time-consuming, and expensive, or by computationally predicting the viral hosts, which needs improvement in both accuracy and usability. Here we develop a Gaussian model to predict hosts for prokaryotic viruses with better performance than previous computational methods. Results: We present here Prokaryotic virus Host Predictor (PHP), a software tool using a Gaussian model to predict hosts for prokaryotic viruses, using the differences of k-mer frequencies between viral and host genomic sequences as features. PHP gave a host prediction accuracy of 34% (genus level) on the VirHostMatcher benchmark dataset and a host prediction accuracy of 35% (genus level) on a new dataset containing 671 viruses and 60,105 prokaryotic genomes. The prediction accuracy exceeded that of two alignment-free methods (VirHostMatcher and WIsH, 28–34%, genus level). PHP also substantially outperformed these two alignment-free methods (24–38% vs 18–20%, genus level) when predicting hosts for prokaryotic viruses which cannot be predicted by the BLAST-based or the CRISPR-spacer-based methods alone. Requiring a minimal score for making predictions (thresholding) and taking the consensus of the top 30 predictions further improved the host prediction accuracy of PHP. Conclusions: The Prokaryotic virus Host Predictor software tool provides an intuitive and user-friendly API for the Gaussian model described herein. This work will facilitate the rapid identification of hosts for newly identified prokaryotic viruses in metagenomic studies.
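The feature described above, the difference of k-mer frequencies between a viral and a host sequence, can be sketched in a few lines of Python. This is only an illustration of the feature computation: the value of k, the handling of ambiguous bases, the function names and the toy sequences are assumptions, and the Gaussian model that PHP fits on top of such features is not reproduced here.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int = 4) -> list[float]:
    """Relative frequency of every A/C/G/T k-mer, in lexicographic order.

    k-mers containing other characters (e.g. N) are ignored.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def frequency_difference(virus: str, host: str, k: int = 4) -> list[float]:
    """Element-wise difference of k-mer frequencies (virus minus host)."""
    v = kmer_frequencies(virus, k)
    h = kmer_frequencies(host, k)
    return [a - b for a, b in zip(v, h)]


# Toy sequences, far shorter than real genomes.
virus = "ACGTACGTTTGACCA"
host = "ACGTTCGTAAGACCT"
features = frequency_difference(virus, host, k=2)
```

With k = 2 the feature vector has 4^2 = 16 entries. A vector of zeros means the two sequences have identical k-mer composition, so a model built on these features can treat small differences as evidence for a virus-host match.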


Effiziente Produktionsgestaltung*/Efficient production design - Development of a software tool for a process- and competence-oriented decision support

wt Werkstattstechnik online ◽

10.37544/1436-4980-2016-07-08-78 ◽

2016 ◽

Vol 106 (07-08) ◽

pp. 544-549

Author(s):

V. K. Bellmann ◽

P. Nyhuis

Keyword(s):

Decision Support ◽

Software Tool ◽

Software Tools ◽

Efficient Production ◽

Design Development ◽

Huge Amount

Companies apply process-improving and competence-increasing methods to maintain their competitiveness. However, the large number of existing methods impedes an application-oriented selection. Thus, a software tool is needed that considers individual objectives as well as the requirements for a successful application of the methods. This paper describes the development of a software tool for target-oriented decision support.


Alview: Portable Software for Viewing Sequence Reads in BAM Formatted Files

Cancer Informatics ◽

10.4137/cin.s26470 ◽

2015 ◽

Vol 14 ◽

pp. CIN.S26470 ◽

Cited By ~ 2

Author(s):

Richard P. Finney ◽

Qing-Rong Chen ◽

Cu V. Nguyen ◽

Chih Hao Hsu ◽

Chunhua Yan ◽

...

Keyword(s):

Graphical User Interface ◽

Reference Genome ◽

Source Code ◽

Software Tool ◽

Command Line ◽

Sequencing Data ◽

Genome Data ◽

Command Line Tool ◽

Portable Software ◽

Microsoft Windows

The name Alview is a contraction of the term Alignment Viewer. Alview is a software tool, compiled to native architecture, for visualizing the alignment of sequencing data. Inputs are files of short-read sequences aligned to a reference genome in the SAM/BAM format and files containing reference genome data. Outputs are visualizations of these aligned short reads. Alview is written in portable C with optional graphical user interface (GUI) code written in C, C++, and Objective-C. The application can run in three different ways: as a web server, as a command line tool, or as a native GUI program. Alview is compatible with Microsoft Windows, Linux, and Apple OS X. It is available as a web demo at https://cgwb.nci.nih.gov/cgi-bin/alview. The source code and Windows/Mac/Linux executables are available via https://github.com/NCIP/alview.


SkewIT: The Skew Index Test for large-scale GC Skew analysis of bacterial genomes

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008439 ◽

2020 ◽

Vol 16 (12) ◽

pp. e1008439

Author(s):

Jennifer Lu ◽

Steven L. Salzberg

Keyword(s):

Large Scale ◽

Analysis Tool ◽

Index Test ◽

Bacterial Genomes ◽

Phylogenetic Groups ◽

Bacterial Phyla ◽

Link Type ◽

Gc Skew ◽

A Genome ◽

Web App

GC skew is a phenomenon observed in many bacterial genomes, wherein the two replication strands of the same chromosome contain different proportions of guanine and cytosine nucleotides. Here we demonstrate that this phenomenon, which was first discovered in the mid-1990s, can be used today as an analysis tool for the 15,000+ complete bacterial genomes in NCBI’s RefSeq library. In order to analyze all 15,000+ genomes, we introduce a new method, SkewIT (Skew Index Test), that calculates a single metric representing the degree of GC skew for a genome. Using this metric, we demonstrate how GC skew patterns are conserved within certain bacterial phyla, e.g. Firmicutes, but show different patterns in other phylogenetic groups such as Actinobacteria. We also discovered that outlier values of SkewIT highlight potential bacterial mis-assemblies. Using our newly defined metric, we identify multiple mis-assembled chromosomal sequences in previously published complete bacterial genomes. We provide a SkewIT web app (https://jenniferlu717.shinyapps.io/SkewIT/) that calculates SkewI for any user-provided bacterial sequence. The web app also provides an interactive interface for the data generated in this paper, allowing users to further investigate the SkewI values and thresholds of the RefSeq-97 complete bacterial genomes. Individual scripts for analysis of bacterial genomes are provided in the following repository: https://github.com/jenniferlu717/SkewIT.
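GC skew for a stretch of sequence is conventionally computed as (G - C) / (G + C), where G and C are the counts of guanine and cytosine. The Python sketch below computes this value over non-overlapping windows. It illustrates the underlying quantity only: the abstract does not define how SkewI condenses the skew into a single metric, so that step is not attempted, and the function names, window size and toy sequence are assumptions.

```python
from typing import Optional


def gc_skew(window: str) -> float:
    """Return (G - C) / (G + C) for one window; 0.0 if it has no G or C."""
    g = window.count("G")
    c = window.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


def windowed_gc_skew(sequence: str, window_size: int,
                     step: Optional[int] = None) -> list[float]:
    """GC skew for each full window; step defaults to the window size."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    step = window_size if step is None else step
    seq = sequence.upper()
    return [
        gc_skew(seq[start:start + window_size])
        for start in range(0, len(seq) - window_size + 1, step)
    ]


# Toy sequence: a G-rich half followed by a C-rich half.
skews = windowed_gc_skew("GGGGATCCCCAT", window_size=6)
```

In this toy sequence the sign of the skew flips between the two windows. In a typical bacterial chromosome a comparable sign change marks the switch between replication strands, which is why an unexpected skew pattern can point to a mis-assembly.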


idCOV: a pipeline for quick clade identification of SARS-CoV-2 isolates

10.1101/2020.10.08.330456 ◽

2020 ◽

Author(s):

Xun Zhu ◽

Ti-Cheng Chang ◽

Richard Webby ◽

Gang Wu

Keyword(s):

Personal Computer ◽

Source Code ◽

Command Line ◽

Sequencing Data ◽

Link Type ◽

Public Dataset ◽

Virus Isolates

Abstract: idCOV is a phylogenetic pipeline for quickly identifying the clades of SARS-CoV-2 virus isolates from raw sequencing data based on a selected clade-defining marker list. Using a public dataset, we show that idCOV can make calls equivalent to those annotated by Nextstrain.org on all three common clade systems, working directly from user-uploaded FastQ files. Web and equivalent command-line interfaces are available. It can be deployed on any Linux environment, including personal computers, HPC systems and the cloud. The source code is available at https://github.com/xz-stjude/idcov. Installation documentation can be found at https://github.com/xz-stjude/idcov/blob/master/README.md.
