ARBitR: an overlap-aware genome assembly scaffolder for linked reads

Bioinformatics ◽

10.1093/bioinformatics/btaa975 ◽

2020 ◽

Author(s):

Markus Hiltunen ◽

Martin Ryberg ◽

Hanna Johannesson

Keyword(s):

Genome Assembly ◽

General Public ◽

Source Code ◽

Draft Genome ◽

Supplementary Information ◽

Genomic Sequencing ◽

Supplementary Data ◽

Genome Assemblies ◽

General Public License

Abstract Summary Linked genomic sequencing reads contain information that can be used to join sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences with a gap between them, not considering potential overlaps of contigs. We developed ARBitR to create scaffolds where overlaps are taken into account and show that it can accurately recreate regions where draft assemblies are broken. Availability and implementation ARBitR is written and implemented in Python3 for Unix-based operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License v3. Supplementary information Supplementary data are available at Bioinformatics online.

ARBitR: An overlap-aware genome assembly scaffolder for linked reads

10.1101/2020.04.29.065847 ◽

2020 ◽

Author(s):

Markus Hiltunen ◽

Martin Ryberg ◽

Hanna Johannesson

Keyword(s):

Genome Assembly ◽

General Public ◽

Source Code ◽

Draft Genome ◽

Supplementary Information ◽

Ltr Retrotransposons ◽

Sequencing Data ◽

Long Read ◽

Genome Assemblies ◽

General Public License

Abstract 10X Genomics Chromium linked reads contain information that can be used to link sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences together with a gap between them, not considering potential contig overlaps. Such overlaps can be particularly prominent in genome drafts assembled from long-read sequencing data where an overlap-layout-consensus (OLC) algorithm has been used. Ignoring overlapping contig ends may result in genes and other features being incomplete or fragmented in the resulting scaffolds. We developed the application ARBitR to generate scaffolds from genome drafts using 10X Chromium data, with a focus on minimizing the number of gaps in resulting scaffolds by incorporating an OLC step to resolve junctions between linked contigs. We tested the performance of ARBitR on three published and simulated datasets and compared it to the previously published tools ARCS and ARKS. The results revealed that ARBitR performed similarly in terms of contiguity statistics, and the advantage of the overlapping step was shown by fewer long and short variants in ARBitR-produced scaffolds, in addition to a higher proportion of completely assembled LTR retrotransposons. We expect ARBitR to have broad applicability in genome assembly projects that utilize 10X Chromium linked reads. Availability and implementation ARBitR is written and implemented in Python3 for Unix-like operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License. Supplementary information available online.
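
The difference between gap-based and overlap-aware joining can be illustrated with a minimal sketch. This is not ARBitR's implementation: it assumes exact suffix-prefix matches between two already ordered and oriented contigs, and the function names, the `min_len` threshold and the `gap` length are hypothetical choices made for illustration only.

```python
def suffix_prefix_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Return the length of the longest exact suffix of `left` that is
    also a prefix of `right`, or 0 if none reaches `min_len`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def join_contigs(left: str, right: str, min_len: int = 3, gap: int = 10) -> str:
    """Merge two linked contigs on their overlap when one is found;
    otherwise fall back to a gap of `gap` N characters (hypothetical default)."""
    size = suffix_prefix_overlap(left, right, min_len)
    if size:
        # Overlap found: keep the shared bases only once.
        return left + right[size:]
    # No overlap: conventional scaffolding with an unknown-sequence gap.
    return left + "N" * gap + right


merged = join_contigs("ACGTTGCA", "TGCAGGTC")   # contig ends share TGCA
gapped = join_contigs("ACGTTGCA", "CCCCGGTC")   # no shared end sequence
```

A real scaffolder would need to tolerate mismatches and indels in the overlap, which exact string matching does not.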

GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database

Bioinformatics ◽

10.1093/bioinformatics/btz848 ◽

2019 ◽

Cited By ~ 161

Author(s):

Pierre-Alain Chaumeil ◽

Aaron J Mussig ◽

Philip Hugenholtz ◽

Donovan H Parks

Keyword(s):

General Public ◽

Source Code ◽

Supplementary Information ◽

Supplementary Data ◽

Computationally Efficient ◽

Taxonomic Assignments ◽

General Public License

Abstract Summary The GTDB Toolkit (GTDB-Tk) provides objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB). GTDB-Tk is computationally efficient and able to classify thousands of draft genomes in parallel. Here we demonstrate the accuracy of the GTDB-Tk taxonomic assignments by evaluating its performance on a phylogenetically diverse set of 10,156 bacterial and archaeal metagenome-assembled genomes. Availability GTDB-Tk is implemented in Python and licensed under the GNU General Public License v3.0. Source code and documentation are available at: https://github.com/ecogenomics/gtdbtk Supplementary information Supplementary data are available at Bioinformatics online.

SVIM-asm: Structural variant detection from haploid and diploid genome assemblies

10.1101/2020.10.27.356907 ◽

2020 ◽

Author(s):

David Heller ◽

Martin Vingron

Keyword(s):

Genetic Information ◽

Source Code ◽

Supplementary Information ◽

Supplementary Data ◽

Diploid Genome ◽

Insertions And Deletions ◽

Structural Variant ◽

Sequencing Technologies ◽

Variant Detection ◽

Genome Assemblies

Abstract Motivation With the availability of new sequencing technologies, the generation of haplotype-resolved genome assemblies up to chromosome scale has become feasible. These assemblies capture the complete genetic information of both parental haplotypes, increase structural variant (SV) calling sensitivity and enable direct genotyping and phasing of SVs. Yet, existing SV callers are designed for haploid genome assemblies only, do not support genotyping or detect only a limited set of SV classes. Results We introduce our method SVIM-asm for the detection and genotyping of six common classes of SVs from haploid and diploid genome assemblies. Compared against the only other existing SV caller for diploid assemblies, DipCall, SVIM-asm detects more SV classes and reached higher F1 scores for the detection of insertions and deletions on two recently published assemblies of the HG002 individual. Availability and Implementation SVIM-asm has been implemented in Python and can be easily installed via bioconda. Its source code is available at github.com/eldariont/[email protected] Supplementary information Supplementary data are available online.

A fast and memory-efficient implementation of the transfer bootstrap

Bioinformatics ◽

10.1093/bioinformatics/btz874 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2280-2281 ◽

Cited By ~ 2

Author(s):

Sarah Lutteropp ◽

Alexey M Kozlov ◽

Alexandros Stamatakis

Keyword(s):

General Public ◽

Efficient Implementation ◽

Supplementary Information ◽

Bootstrap Support ◽

Supplementary Data ◽

Original Algorithm ◽

Parallel Version ◽

Branch Support ◽

General Public License ◽

Memory Efficient

Abstract Motivation Recently, Lemoine et al. suggested the transfer bootstrap expectation (TBE) branch support metric as an alternative to classical phylogenetic bootstrap support for taxon-rich datasets. However, the original TBE implementation in the booster tool is compute- and memory-intensive. Results We developed a fast and memory-efficient TBE implementation. We improve upon the original algorithm by Lemoine et al. via several algorithmic and technical optimizations. On empirical as well as on random tree sets with varying taxon counts, our implementation is up to 480 times faster than booster. Furthermore, it only requires memory that is linear in the number of taxa, which leads to 10× to 40× memory savings compared with booster. Availability and implementation Our implementation has been partially integrated into pll-modules and RAxML-NG and is available under the GNU Affero General Public License v3.0 at https://github.com/ddarriba/pll-modules and https://github.com/amkozlov/raxml-ng. The parallel version that also computes additional TBE-related statistics is available at: https://github.com/lutteropp/raxml-ng/tree/tbe. Supplementary information Supplementary data are available at Bioinformatics online.

polyDFEv2.0: testing for invariance of the distribution of fitness effects within and across species

Bioinformatics ◽

10.1093/bioinformatics/bty1060 ◽

2019 ◽

Vol 35 (16) ◽

pp. 2868-2869 ◽

Cited By ~ 10

Author(s):

Paula Tataru ◽

Thomas Bataillon

Keyword(s):

Source Code ◽

Likelihood Ratio Tests ◽

Supplementary Information ◽

Supplementary Data ◽

Post Processing ◽

Fitness Effects ◽

Site Frequency Spectrum ◽

Genomic Regions ◽

General Public License ◽

R Functions

Abstract Summary The distribution of fitness effects (DFE) of mutations can be inferred from site frequency spectrum (SFS) data. There is mounting interest in determining whether distinct genomic regions and/or species share a common DFE, or whether evidence exists for differences among them. polyDFEv2.0 fits multiple SFS datasets at once and provides likelihood ratio tests for DFE invariance across datasets. Simulations show that testing for DFE invariance across genomic regions within a species requires models accounting for distinct sources of heterogeneity (chance and genuine difference in DFE) underlying differences in SFS data in these regions. Not accounting for this will result in the spurious detection of DFE differences. Availability and Implementation polyDFEv2.0 is implemented in C and is accompanied by a series of R functions that facilitate post-processing of the output. It is available as source code and compiled binaries under a GNU General Public License v3.0 from https://github.com/paula-tataru/polyDFE. Supplementary information Supplementary data are available at Bioinformatics online.
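
As a generic illustration of a likelihood ratio test of the kind mentioned above, the sketch below compares a null model (shared DFE) with an alternative (separate DFEs). It does not use polyDFE's code or output format. It assumes the two nested models differ by exactly one free parameter, so the statistic is compared against a chi-square distribution with one degree of freedom, and the log-likelihood values are invented for the example.

```python
import math


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Likelihood ratio statistic 2 * (lnL_alt - lnL_null) for nested models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square distribution with 1 degree
    of freedom, using the identity P(X > x) = erfc(sqrt(x / 2))."""
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


# Invented log-likelihoods, for illustration only.
stat = lrt_statistic(lnl_null=-1205.3, lnl_alt=-1202.1)
p_value = chi2_sf_df1(stat)
reject_invariance = p_value < 0.05
```

With more than one extra parameter in the alternative model, the degrees of freedom change and a general chi-square tail function would be needed instead.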

SVIM-asm: Structural variant detection from haploid and diploid genome assemblies

Bioinformatics ◽

10.1093/bioinformatics/btaa1034 ◽

2020 ◽

Author(s):

David Heller ◽

Martin Vingron

Keyword(s):

Genetic Information ◽

Source Code ◽

Supplementary Information ◽

Supplementary Data ◽

Diploid Genome ◽

Insertions And Deletions ◽

Structural Variant ◽

Sequencing Technologies ◽

Variant Detection ◽

Genome Assemblies

Abstract Motivation With the availability of new sequencing technologies, the generation of haplotype-resolved genome assemblies up to chromosome scale has become feasible. These assemblies capture the complete genetic information of both parental haplotypes, increase structural variant (SV) calling sensitivity and enable direct genotyping and phasing of SVs. Yet, existing SV callers are designed for haploid genome assemblies only, do not support genotyping or detect only a limited set of SV classes. Results We introduce our method SVIM-asm for the detection and genotyping of six common classes of SVs from haploid and diploid genome assemblies. Compared against the only other existing SV caller for diploid assemblies, DipCall, SVIM-asm detects more SV classes and reached higher F1 scores for the detection of insertions and deletions on two recently published assemblies of the HG002 individual. Availability and Implementation SVIM-asm has been implemented in Python and can be easily installed via bioconda. Its source code is available at github.com/eldariont/svim-asm. Supplementary information Supplementary data are available at Bioinformatics online.
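
For readers unfamiliar with the F1 score used in this comparison, a small worked sketch follows. The counts are invented for illustration and are not results from SVIM-asm or DipCall.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall for a set of variant calls."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2.0 * precision * recall / (precision + recall)


# Invented counts: 90 correct calls, 10 spurious calls, 30 missed variants.
score = f1_score(true_pos=90, false_pos=10, false_neg=30)
```

Here precision is 0.9 and recall is 0.75, so the F1 score falls between the two and closer to the lower value.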

AGOUTI: improving genome assembly and annotation using transcriptome data

10.1101/033019 ◽

2015 ◽

Cited By ~ 1

Author(s):

Simo V. Zhang ◽

Luting Zhuo ◽

Matthew W. Hahn

Keyword(s):

Genome Assembly ◽

Gene Annotation ◽

Supplementary Information ◽

Gene Identification ◽

Supplementary Data ◽

Transcriptome Data ◽

Rna Seq ◽

Separate Gene ◽

Gene Models ◽

Genome Assemblies

Abstract Summary Current genome assemblies consist of thousands of contigs. These incomplete and fragmented assemblies lead to errors in gene identification, such that single genes spread across multiple contigs are annotated as separate gene models. We present AGOUTI (Annotated Genome Optimization Using Transcriptome Information), a tool that uses RNA-seq data to simultaneously combine contigs into scaffolds and fragmented gene models into single models. We show that AGOUTI improves both the contiguity of genome assemblies and the accuracy of gene annotation, providing updated versions of each as output. Availability The software is implemented in Python and is available from github.com/svm-zhang/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.

KEC: unique sequence search by K-mer exclusion

Bioinformatics ◽

10.1093/bioinformatics/btab196 ◽

2021 ◽

Author(s):

Pavel Beran ◽

Dagmar Stehlíková ◽

Stephen P Cohen ◽

Vladislav Čurn

Keyword(s):

Amino Acid ◽

Nucleic Acid ◽

Source Code ◽

Unique Sequence ◽

Supplementary Information ◽

Supplementary Data ◽

Laptop Computers ◽

Sequence Search ◽

Target Sequences ◽

Cross Reference

Abstract Summary Searching for amino acid or nucleic acid sequences unique to one organism may be challenging depending on the size of the available datasets. K-mer elimination by cross-reference (KEC) allows users to quickly and easily find unique sequences by providing target and non-target sequences. Due to its speed, it can be used for datasets of genomic size and can be run on desktop or laptop computers with modest specifications. Availability and implementation KEC is freely available for non-commercial purposes. Source code and executable binary files compiled for Linux, Mac and Windows can be downloaded from https://github.com/berybox/KEC. Supplementary information Supplementary data are available at Bioinformatics online.
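
The general idea of k-mer exclusion can be sketched in a few lines. This is an assumption-laden illustration, not KEC's own code: the function names are hypothetical, all k-mers are held in memory as Python sets, and reverse complements are ignored.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All substrings of length k in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(targets: Iterable[str], non_targets: Iterable[str], k: int) -> Set[str]:
    """K-mers found in any target sequence and in no non-target sequence."""
    candidates: Set[str] = set()
    for seq in targets:
        candidates |= kmers(seq, k)
    for seq in non_targets:
        # Exclude every k-mer that also occurs in a non-target sequence.
        candidates -= kmers(seq, k)
    return candidates


found = unique_kmers(targets=["ACGTTT"], non_targets=["GGACGT"], k=3)
```

In this toy input the k-mers ACG and CGT occur in both sequences and are excluded, leaving only those specific to the target.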

BioCommons: a robust java library for RNA structural bioinformatics

Bioinformatics ◽

10.1093/bioinformatics/btab069 ◽

2021 ◽

Author(s):

Tomasz Zok

Keyword(s):

Source Code ◽

Structural Bioinformatics ◽

Supplementary Information ◽

Supplementary Data ◽

Bioinformatic Tools ◽

Data Formats ◽

Central Repository ◽

Diverse Data ◽

2D And 3D ◽

Java Library

Abstract Motivation Biomolecular structures come in multiple representations and diverse data formats. Their incompatibility with the requirements of data analysis programs significantly hinders the analytics and the creation of new structure-oriented bioinformatic tools. Therefore, the need for robust libraries of data processing functions is still growing. Results BioCommons is an open-source Java library for structural bioinformatics. It contains many functions working with the 2D and 3D structures of biomolecules, with a particular emphasis on RNA. Availability and implementation The library is available in Maven Central Repository and its source code is hosted on GitHub: https://github.com/tzok/BioCommons Supplementary information Supplementary data are available at Bioinformatics online.

BIOLITMAP: a web-based geolocated, temporal and thematic visualization of the evolution of bioinformatics publications

Bioinformatics ◽

10.1093/bioinformatics/bty967 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2518-2520

Author(s):

Adrián Bazaga ◽

Alfonso Valencia ◽

María-José Rementeria

Keyword(s):

General Public ◽

Fast Growth ◽

Supplementary Information ◽

Supplementary Data ◽

Web Based ◽

Research Publications

Abstract Motivation The fast growth of bioinformatics makes it significantly harder to assess the contribution and the geographical and thematic distribution of research publications. Results To help researchers, grant agencies and the general public assess the progress in bioinformatics, we have developed BIOLITMAP, a web-based geolocation system that allows an easy and sensible exploration of the publications by institution, year and topic. Availability and implementation BIOLITMAP is available at http://socialanalytics.bsc.es/biolitmap and the sources have been deposited at https://github.com/inab/BIOLITMAP. Supplementary information Supplementary data are available at Bioinformatics online.
