Robust high throughput prokaryote de novo assembly and improvement pipeline for Illumina data

Mapping Intimacies ◽

10.1101/052688 ◽

2016 ◽

Cited By ~ 10

Author(s):

Andrew J. Page ◽

Nishadi De Silva ◽

Martin Hunt ◽

Michael A. Quail ◽

Julian Parkhill ◽

...

Keyword(s):

Open Source ◽

High Throughput ◽

De Novo ◽

Bacterial Genome ◽

List Type ◽

Genome Diversity ◽

Link Type ◽

Open Source License ◽

Public Data ◽

Genome Assemblies

ABSTRACT The rapidly falling cost of bacterial genome sequencing has led to its routine use in large-scale microbial analysis. Though mapping approaches can be used to find differences relative to a reference, many bacteria are subject to constant evolutionary pressures resulting in events such as the loss and gain of mobile genetic elements, horizontal gene transfer through recombination, and genomic rearrangements. De novo assembly is the reconstruction of the underlying genome sequence, an essential step to understanding bacterial genome diversity. Here we present a high-throughput bacterial assembly and improvement pipeline that has been used to generate nearly 20,000 draft genome assemblies in public databases. We demonstrate its performance on a public data set of 9,404 genomes. We find all the genes used in MLST schema present in 99.6% of assembled genomes. When tested on low, neutral and high GC organisms, more than 94% of genes were present and completely intact. The pipeline has proven to be scalable and robust with a wide variety of datasets without requiring human intervention. All of the software is available on GitHub under the GNU GPL open source license.

DATA SUMMARY The assembly pipeline software is available from GitHub under the GNU GPL open source license (https://github.com/sanger-pathogens/vr-codebase). The assembly improvement software is available from GitHub under the GNU GPL open source license (https://github.com/sanger-pathogens/assembly_improvement). Accession numbers for 9,404 assemblies are provided in the supplementary material. The Bordetella pertussis sample has sample accession ERS1058649, sequencing reads accession ERR1274624 and assembly accessions FJMX01000001-FJMX01000249. The Salmonella enterica subsp. enterica serovar Pullorum sample has sample accession ERS1058652, sequencing reads accession ERR1274625 and assembly accessions FJMV01000001-FJMV01000026. The Staphylococcus aureus sample has sample accession ERS1058648, sequencing reads accession ERR1274626 and assembly accessions FJMW01000001-FJMW01000040. I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. ☑

IMPACT STATEMENT The pipeline described in this paper has been used to assemble and annotate 30% of all bacterial genome assemblies in GenBank (18,080 out of 59,536, accessed 16/2/16). The automated generation of de novo assemblies is a critical step to explore bacterial genome diversity. MLST genes are found in 99.6% of cases, making it at least as good as existing typing methods. In the test genomes we present, more than 94% of genes are correctly assembled into intact reading frames.

RaGOO: fast and accurate reference-guided scaffolding of draft genomes

Genome Biology ◽

10.1186/s13059-019-1829-6 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 56

Author(s):

Michael Alonge ◽

Sebastian Soyk ◽

Srividya Ramakrishnan ◽

Xingang Wang ◽

Sara Goodwin ◽

...

Keyword(s):

Arabidopsis Thaliana ◽

Open Source ◽

Genome Analysis ◽

De Novo ◽

Structural Variants ◽

Tomato Genome ◽

Pan Genome ◽

Link Type ◽

Genome Assemblies

Abstract We present RaGOO, a reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in minutes. After the pseudomolecules are constructed, RaGOO identifies structural variants, including those spanning sequencing gaps. We show that RaGOO accurately orders and orients 3 de novo tomato genome assemblies, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open source at https://github.com/malonge/RaGOO.
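The idea of reference-guided ordering and orienting can be illustrated with a small sketch. This is a generic simplification and not RaGOO's actual algorithm: it assumes alignments have already been reduced to hypothetical tuples of (contig, reference chromosome, reference start, strand, aligned length), assigns each contig to the chromosome it covers most, orients it by majority strand, and sorts contigs by their leftmost reference position.

```python
from collections import defaultdict


def order_and_orient(alignments):
    """Place contigs on reference chromosomes.

    alignments: iterable of (contig, ref_chrom, ref_start, strand, aln_len)
    tuples. The tuple layout is an assumption made for this sketch.
    Returns {ref_chrom: [(contig, orientation), ...]} in reference order.
    """
    hits_by_contig = defaultdict(list)
    for contig, chrom, start, strand, length in alignments:
        hits_by_contig[contig].append((chrom, start, strand, length))

    placements = defaultdict(list)
    for contig, hits in hits_by_contig.items():
        # Pick the chromosome with the most aligned bases for this contig.
        covered = defaultdict(int)
        for chrom, _, _, length in hits:
            covered[chrom] += length
        best = max(covered, key=covered.get)

        # Orient by whichever strand carries more aligned bases.
        on_best = [h for h in hits if h[0] == best]
        plus = sum(length for _, _, strand, length in on_best if strand == "+")
        minus = sum(length for _, _, strand, length in on_best if strand == "-")
        orientation = "+" if plus >= minus else "-"

        # Use the leftmost alignment start as the contig's position.
        position = min(start for _, start, _, _ in on_best)
        placements[best].append((position, contig, orientation))

    return {
        chrom: [(contig, orient) for _, contig, orient in sorted(items)]
        for chrom, items in placements.items()
    }


example = [
    ("c1", "chr1", 5000, "+", 900),
    ("c2", "chr1", 100, "-", 800),
    ("c2", "chr2", 50, "+", 100),
    ("c3", "chr2", 10, "+", 500),
]
layout = order_and_orient(example)
```

In this toy input, contig c2 aligns mostly to chr1 on the minus strand, so it is placed first on chr1 in reverse orientation, ahead of c1.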

chewBBACA: A complete suite for gene-by-gene schema creation and strain identification

10.1101/173146 ◽

2017 ◽

Cited By ~ 5

Author(s):

Mickael Silva ◽

Miguel Machado ◽

Diogo N. Silva ◽

Mirko Rossi ◽

Jacob Moran-Gilad ◽

...

Keyword(s):

Open Source ◽

Core Genome ◽

Bacterial Species ◽

Outbreak Detection ◽

Strain Identification ◽

List Type ◽

Whole Genome ◽

Link Type ◽

The Creation ◽

Allele Calling

ABSTRACT Gene-by-gene approaches are becoming increasingly popular in bacterial genomic epidemiology and outbreak detection. However, there is a lack of open-source scalable software for schema definition and allele calling for these methodologies. The chewBBACA suite was designed to assist users in the creation and evaluation of novel whole-genome or core-genome gene-by-gene typing schemas and subsequent allele calling in bacterial strains of interest. The software can run on a laptop or on high-performance clusters, making it useful for both small laboratories and large reference centers. chewBBACA is available at https://github.com/B-UMMI/chewBBACA or as a docker image at https://hub.docker.com/r/ummidock/chewbbaca/.

DATA SUMMARY Assembled genomes used for the tutorial were downloaded from NCBI in August 2016 by selecting those submitted as Streptococcus agalactiae taxon or sub-taxa. All the assemblies have been deposited as a zip file in FigShare (https://figshare.com/s/9cbe1d422805db54cd52), where a file with the original ftp link for each NCBI directory is also available. Code for the chewBBACA suite is available at https://github.com/B-UMMI/chewBBACA while the tutorial example is found at https://github.com/B-UMMI/chewBBACA_tutorial. I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. ⊠

IMPACT STATEMENT The chewBBACA software offers a computational solution for the creation, evaluation and use of whole genome (wg) and core genome (cg) multilocus sequence typing (MLST) schemas. It allows researchers to develop wg/cgMLST schemes for any bacterial species from a set of genomes of interest. The alleles identified by chewBBACA correspond to potential coding sequences, possibly offering insights into the correspondence between the genetic variability identified and phenotypic variability. The software performs allele calling in a matter of seconds to minutes per strain on a laptop but is easily scalable for the analysis of large datasets of hundreds of thousands of strains using multiprocessing options. The chewBBACA software thus provides an efficient and freely available open source solution for gene-by-gene methods. Moreover, the ability to perform these tasks locally is desirable when the submission of raw data to a central repository or web services is hindered by data protection policies or ethical or legal concerns.
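To make the notion of gene-by-gene allele calling concrete, here is a deliberately minimal sketch based on exact sequence matching. It is an illustration only and does not reproduce chewBBACA's method or its command-line interface; the schema layout (locus name mapped to a dictionary of sequence to allele number) and the function name are assumptions made for this example.

```python
def call_alleles(schema, strain_cds):
    """Return an allelic profile for one strain.

    schema: {locus: {allele_sequence: allele_id}} (hypothetical layout);
            it is updated in place when a novel allele is seen.
    strain_cds: {locus: coding_sequence} found in the strain's genome.
    """
    profile = {}
    for locus, known_alleles in schema.items():
        sequence = strain_cds.get(locus)
        if sequence is None:
            # Locus not found in this strain.
            profile[locus] = None
            continue
        if sequence not in known_alleles:
            # Novel allele: assign the next free identifier.
            known_alleles[sequence] = len(known_alleles) + 1
        profile[locus] = known_alleles[sequence]
    return profile


schema = {
    "geneA": {"ATGAAATAA": 1},
    "geneB": {"ATGCCCTAA": 1},
}
strain_1 = call_alleles(schema, {"geneA": "ATGAAATAA", "geneB": "ATGCCGTAA"})
strain_2 = call_alleles(schema, {"geneA": "ATGAAATAA"})
```

Two strains with identical profiles at every locus would be indistinguishable under the schema, while each differing locus counts as one allelic difference.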

Balance Trees Reveal Microbial Niche Differentiation

mSystems ◽

10.1128/msystems.00162-16 ◽

2017 ◽

Vol 2 (1) ◽

Cited By ~ 129

Author(s):

James T. Morton ◽

Jon Sanders ◽

Robert A. Quinn ◽

Daniel McDonald ◽

Antonio Gonzalez ◽

...

Keyword(s):

16S rRNA ◽

16S rRNA Gene ◽

Open Source ◽

Niche Differentiation ◽

Difficult Problem ◽

Individual Species ◽

rRNA Gene ◽

Link Type ◽

Open Source License ◽

Gene Data

ABSTRACT Advances in sequencing technologies have enabled novel insights into microbial niche differentiation, from analyzing environmental samples to understanding human diseases and informing dietary studies. However, identifying the microbial taxa that differentiate these samples can be challenging. These issues stem from the compositional nature of 16S rRNA gene data (or, more generally, taxon or functional gene data); the changes in the relative abundance of one taxon influence the apparent abundances of the others. Here we acknowledge that inferring properties of individual bacteria is a difficult problem and instead introduce the concept of balances to infer meaningful properties of subcommunities, rather than properties of individual species. We show that balances can yield insights about niche differentiation across multiple microbial environments, including soil environments and lung sputum. These techniques have the potential to reshape how we carry out future ecological analyses aimed at revealing differences in relative taxonomic abundances across different samples.

IMPORTANCE By explicitly accounting for the compositional nature of 16S rRNA gene data through the concept of balances, balance trees yield novel biological insights into niche differentiation. The software to perform this analysis is available under an open-source license and can be obtained at https://github.com/biocore/gneiss.
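A balance can be written as a scaled log ratio of geometric means between two groups of taxa. The sketch below uses the standard isometric log-ratio form from compositional data analysis; it is a stand-alone illustration with hypothetical function names, not code from gneiss, and it assumes all abundances are strictly positive (zeros would need a pseudocount or similar treatment first).

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def balance(numerator, denominator):
    """Log-ratio balance between two groups of taxon abundances.

    b = sqrt(r * s / (r + s)) * ln(g(numerator) / g(denominator)),
    where r and s are the group sizes and g is the geometric mean.
    """
    r, s = len(numerator), len(denominator)
    scale = math.sqrt(r * s / (r + s))
    return scale * math.log(geometric_mean(numerator) / geometric_mean(denominator))


# Proportions and raw counts with the same ratios give the same balance.
from_proportions = balance([0.2, 0.2], [0.3, 0.3])
from_counts = balance([2, 2], [3, 3])
```

Because only ratios enter the calculation, multiplying every abundance in a sample by the same constant leaves the balance unchanged, which is why the value does not depend on sequencing depth.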

Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies

10.1101/2020.03.15.992941 ◽

2020 ◽

Cited By ~ 15

Author(s):

Arang Rhie ◽

Brian P. Walenz ◽

Sergey Koren ◽

Adam M. Phillippy

Keyword(s):

De Novo ◽

High Accuracy ◽

Link Type ◽

Base Level ◽

Project Home Page ◽

Set Operations ◽

Assembly Evaluation ◽

Long Read ◽

Genome Assemblies ◽

Reference Genomes

Abstract Recent long-read assemblies often exceed the quality and completeness of available reference genomes, making validation challenging. Here we present Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. For trios, Merqury can also evaluate haplotype-specific accuracy, completeness, phase block continuity, and switch errors. Multiple visualizations, such as k-mer spectrum plots, can be generated for evaluation. We demonstrate on both human and plant genomes that Merqury is a fast and robust method for assembly validation.

Availability of data and material
Project name: Merqury
Project home page: https://github.com/marbl/merqury, https://github.com/marbl/meryl
Archived version: https://github.com/marbl/merqury/releases/tag/v1.0
Operating system(s): Platform independent
Programming language: C++, Java, Perl
Other requirements: gcc 4.8 or higher, java 1.6 or higher
License: Public domain (see https://github.com/marbl/merqury/blob/master/README.license)
Any restrictions to use by non-academics: No restrictions applied
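The k-mer comparison described above can be sketched with plain Python sets. This toy version is not Merqury's implementation: it ignores k-mer multiplicity, reverse complements, and the filtering of low-frequency read k-mers. The per-base error formula used here, P = (shared / total)^(1/k) with QV = -10 log10(1 - P), is stated as an assumption about how such an estimate can be derived, and the function names are hypothetical.

```python
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_stats(assembly, reads, k):
    """Estimate completeness, per-base error, and QV from k-mer sets."""
    assembly_kmers = kmers(assembly, k)
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)

    shared = assembly_kmers & read_kmers
    # Completeness: fraction of read k-mers recovered in the assembly.
    completeness = len(shared) / len(read_kmers)
    # Probability that a base is correct, assuming a k-mer is supported
    # by the reads only if all k of its bases are correct.
    p_correct = (len(shared) / len(assembly_kmers)) ** (1 / k)
    error = 1 - p_correct
    qv = float("inf") if error == 0 else -10 * math.log10(error)
    return completeness, error, qv


reads = ["ACGTACG", "GTACGTA"]
completeness, error, qv = kmer_stats("ACGTACGGA", reads, k=3)
perfect = kmer_stats("ACGTACG", reads, k=3)
```

In the first call the assembly ends in two k-mers that no read supports, so the error estimate is above zero; in the second, every assembly k-mer is supported and the estimated error is zero.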

SpacePHARER: Sensitive identification of phages from CRISPR spacers in prokaryotic hosts

10.1101/2020.05.15.090266 ◽

2020 ◽

Cited By ~ 1

Author(s):

R. Zhang ◽

M. Mirdita ◽

E. Levy Karin ◽

C. Norroy ◽

C. Galiez ◽

...

Keyword(s):

Open Source ◽

Protein Level ◽

De Novo ◽

False Positives ◽

Metagenomic Data ◽

Command Line ◽

Link Type ◽

Host Relationships ◽

User Friendly

Summary SpacePHARER (CRISPR Spacer Phage-Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage-host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete phage genomes.

Availability and implementation SpacePHARER is available as open-source (GPLv3), user-friendly command-line software for Linux and macOS at spacepharer.soedinglab.org.

Metassembler: Merging and optimizing de novo genome assemblies

10.1101/016352 ◽

2015 ◽

Author(s):

Alejandro Hernandez Wences ◽

Michael Schatz

Keyword(s):

Open Source ◽

Genome Assembly ◽

De Novo ◽

A Genome ◽

Genome Assemblies ◽

Multiple Algorithms

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for metassembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.

NASQAR: A web-based platform for high-throughput sequencing data analysis and visualization

10.1101/709980 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ayman Yousif ◽

Nizar Drou ◽

Jillian Rowe ◽

Mohammed Khalfan ◽

Kristin C Gunsalus

Keyword(s):

New York ◽

Data Analysis ◽

Open Source ◽

High Throughput ◽

High Throughput Sequencing ◽

Web Applications ◽

RNA-Seq ◽

Sequencing Data ◽

Web Based ◽

Link Type

Abstract

Background As high-throughput sequencing applications continue to evolve, the rapid growth in quantity and variety of sequence-based data calls for the development of new software libraries and tools for data analysis and visualization. Often, effective use of these tools requires computational skills beyond those of many researchers. To ease this computational barrier, we have created a dynamic web-based platform, NASQAR (Nucleic Acid SeQuence Analysis Resource).

Results NASQAR offers a collection of custom and publicly available open-source web applications that make extensive use of a variety of R packages to provide interactive data analysis and visualization. The platform is publicly accessible at http://nasqar.abudhabi.nyu.edu/. Open-source code is on GitHub at https://github.com/nasqar/NASQAR, and the system is also available as a Docker image at https://hub.docker.com/r/aymanm/nasqarall. NASQAR is a collaboration between the core bioinformatics teams of the NYU Abu Dhabi and NYU New York Centers for Genomics and Systems Biology.

Conclusions NASQAR empowers non-programming experts with a versatile and intuitive toolbox to easily and efficiently explore, analyze, and visualize their transcriptomics data interactively. Popular tools for a variety of applications are currently available, including Transcriptome Data Preprocessing, RNA-seq Analysis (including Single-cell RNA-seq), Metagenomics, and Gene Enrichment.

AGEpy: a Python package for computational biology

10.1101/450890 ◽

2018 ◽

Cited By ~ 1

Author(s):

Franziska Metge ◽

Robert Sehlke ◽

Jorge Boucas

Keyword(s):

Computational Biology ◽

Open Source ◽

High Throughput ◽

Biological Data ◽

Command Line ◽

High Throughput Analysis ◽

Throughput Analysis ◽

Link Type ◽

Biological Meaning ◽

Python Package

Abstract

Summary: AGEpy is a Python package focused on the transformation of interpretable data into biological meaning. It is designed to support high-throughput analysis of pre-processed biological data using either local Python-based processing or Python-based API calls to local or remote servers. In this application note we describe its different Python modules as well as its command-line accessible tools aDiff, abed, blasto, david, and obo2tsv.

Availability: The open source AGEpy Python package is freely available at: https://github.com/mpg-age-bioinformatics/AGEpy.

Contact: [email protected]

Heat*seq: an interactive web tool for high-throughput sequencing experiment comparison with public data

10.1101/049254 ◽

2016 ◽

Author(s):

Guillaume Devailly ◽

Anna Mantsoki ◽

Anagha Joshi

Keyword(s):

High Throughput ◽

Web Application ◽

High Throughput Sequencing ◽

Single Gene ◽

Public Domain ◽

Web Tool ◽

The Public ◽

Link Type ◽

A Genome ◽

Public Data

Summary Better protocols and decreasing costs have made high-throughput sequencing experiments accessible even to small experimental laboratories. However, comparing one or a few experiments generated by an individual lab to the vast amount of relevant data freely available in the public domain can be limited by a lack of bioinformatics expertise. Though several tools, including genome browsers, allow such comparison at a single gene level, they do not provide a genome-wide view. We developed Heat*seq, a web tool that allows genome-scale comparison of high-throughput experiments (ChIP-seq, RNA-seq and CAGE) provided by a user to the data in the public domain. Heat*seq currently contains over 12,000 experiments across diverse tissue and cell types in human, mouse and Drosophila. Heat*seq displays interactive correlation heatmaps, with an ability to dynamically subset datasets to contextualise user experiments. High-quality figures and tables are produced and can be downloaded in multiple formats.

Availability Web application: www.heatstarseq.roslin.ed.ac.uk/. Source code: https://github.com/[email protected];[email protected]
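The comparison behind a correlation heatmap can be sketched as correlating one user experiment against each public experiment over the same set of genomic features. The abstract does not say which correlation measure Heat*seq uses, so Pearson correlation is an assumption here, as are the function names and the toy data; this is not the tool's code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlate_with_public(user_signal, public_experiments):
    """Correlate a user experiment with each public experiment.

    user_signal: per-feature values (for example, per-gene expression).
    public_experiments: {experiment_name: per-feature values} measured
    over the same features in the same order.
    """
    return {
        name: pearson(user_signal, values)
        for name, values in public_experiments.items()
    }


user = [1.0, 2.0, 3.0, 4.0]
public = {"liver": [2.0, 4.0, 6.0, 8.0], "brain": [4.0, 3.0, 2.0, 1.0]}
row = correlate_with_public(user, public)
```

Each value in `row` would correspond to one cell in a heatmap row for the user's experiment; a constant signal would make the denominator zero and needs separate handling.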

Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline

10.1101/657890 ◽

2019 ◽

Cited By ~ 7

Author(s):

Shujun Ou ◽

Weija Su ◽

Yi Liao ◽

Kapeel Chougule ◽

Doreen Ware ◽

...

Keyword(s):

Transposable Elements ◽

Transposable Element ◽

Open Source ◽

Performance Metrics ◽

De Novo ◽

Relative Performance ◽

Sequencing Technology ◽

High Quality ◽

Link Type ◽

Assembly Algorithms

Abstract Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and allow for annotation of TEs. There are numerous methods for each class of elements with unknown relative performance metrics. We benchmarked existing programs based on a curated library of rice TEs. Using the most robust programs, we created a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a condensed TE library for annotations of structurally intact and fragmented elements. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.