KAT: A K-mer Analysis Toolkit to quality control NGS datasets and genome assemblies

ABSTRACT Motivation De novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilised by assemblers, provides useful insights that can inform the assembly process and result in better assemblies. Results We present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT’s ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies. Availability KAT is available under the GPLv3 license at: https://github.com/TGAC/[email protected] Information Supplementary Information (SI) is available at Bioinformatics online. In addition, the software documentation is available online at: http://kat.readthedocs.io/en/latest/.
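The pairwise k-mer comparison described in the KAT abstract can be illustrated with a small sketch. It tabulates, for every distinct k-mer, how often it occurs in the reads against how often it occurs in the assembly. The function names here are hypothetical and this is not KAT's implementation or command-line interface; it only shows the idea under the assumption of exact, canonical k-mer counting held in memory.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield canonical k-mers: the smaller of each k-mer and its reverse complement."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
            yield min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def count_kmers(seqs, k):
    """Count canonical k-mers over an iterable of sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(canonical_kmers(seq, k))
    return counts


def comparison_matrix(read_counts, asm_counts):
    """Map (read multiplicity, assembly copy number) to the number of distinct k-mers."""
    matrix = Counter()
    for kmer in read_counts.keys() | asm_counts.keys():
        matrix[(read_counts.get(kmer, 0), asm_counts.get(kmer, 0))] += 1
    return matrix
```

In such a table, k-mers with read multiplicity above zero and assembly copy number zero are read content missing from the assembly, while k-mers with zero read support that appear in the assembly are candidates for assembly errors or contamination.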

multiPhATE: bioinformatics pipeline for functional annotation of phage isolates

10.1101/551010 ◽

2019 ◽

Author(s):

Carol L. Ecale Zhou ◽

Stephanie Malfatti ◽

Jeffrey Kimbrel ◽

Casandra Philipson ◽

Katelyn McNair ◽

...

Keyword(s):

De Novo ◽

Third Party ◽

Supplementary Information ◽

Modular Construction ◽

Bioinformatics Pipeline ◽

Annotation Pipeline ◽

Phage Gene ◽

Software Documentation ◽

Link Type ◽

Multiple Processors

ABSTRACT Summary To address the need for improved phage annotation tools that scale, we created an automated throughput annotation pipeline: multiple-genome Phage Annotation Toolkit and Evaluator (multiPhATE). multiPhATE is a throughput pipeline driver that invokes an annotation pipeline (PhATE) across a user-specified set of phage genomes. This tool incorporates a de novo phage gene-calling algorithm and assigns putative functions to gene calls using protein-, virus-, and phage-centric databases. multiPhATE’s modular construction allows the user to implement all or any portion of the analyses by acquiring local instances of the desired databases and specifying the desired analyses in a configuration file. We demonstrate multiPhATE by annotating two newly sequenced Yersinia pestis phage genomes. Within multiPhATE, the PhATE processing pipeline can be readily implemented across multiple processors, making it adaptable for throughput sequencing projects. Software documentation assists the user in configuring the system. Availability and implementation multiPhATE was implemented in Python 3.7, and runs as a command-line code under Linux or Unix. multiPhATE is freely available under an open-source BSD3 license from https://github.com/carolzhou/multiPhATE. Instructions for acquiring the databases and third-party codes used by multiPhATE are included in the distribution README file. Users may report bugs by submitting to the github issues page associated with the multiPhATE [email protected] or [email protected] information Data generated during the current study are included as supplementary files available for download at https://github.com/carolzhou/PhATE_docs.

ntEdit: scalable genome sequence polishing

Bioinformatics ◽

10.1093/bioinformatics/btz400 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4430-4432 ◽

Cited By ~ 8

Author(s):

René L Warren ◽

Lauren Coombe ◽

Hamid Mohamadi ◽

Jessica Zhang ◽

Barry Jaquish ◽

...

Keyword(s):

Human Genome ◽

Genome Sequence ◽

Sequence Data ◽

Bloom Filter ◽

Supplementary Information ◽

Routine Practice ◽

High Coverage ◽

Illumina Sequence ◽

Genome Assemblies ◽

Indel Rate

Abstract Motivation In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. Results We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 min, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30–40 min on those sequences. We show how ntEdit ran in <2 h 20 min to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. Availability and implementation https://github.com/bcgsc/ntedit Supplementary information Supplementary data are available at Bioinformatics online.
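To make the phrase "Bloom filter-based" concrete, the sketch below stores read k-mers in a Bloom filter and then flags positions in a draft sequence whose k-mers are not supported by the reads. This is a simplified illustration, not ntEdit's algorithm: the class and function names, the filter size, the number of hash functions and the use of BLAKE2 hashing are all assumptions made for the example, and only the forward strand is considered.

```python
import hashlib


class KmerBloomFilter:
    """Fixed-size Bloom filter: no false negatives, a small rate of false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # One salted hash per hash function (an illustrative choice of hash family).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, kmer):
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(kmer))


def bloom_from_reads(reads, k, num_bits=1 << 16, num_hashes=3):
    """Insert every k-mer of every read into a new filter."""
    bloom = KmerBloomFilter(num_bits, num_hashes)
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])
    return bloom


def unsupported_positions(draft, bloom, k):
    """Start positions of draft k-mers absent from the read filter (candidate error sites)."""
    return [i for i in range(len(draft) - k + 1) if draft[i:i + k] not in bloom]
```

Because the filter keeps only bits rather than the k-mers themselves, memory stays fixed regardless of how many reads are inserted, which is the property that makes this kind of structure attractive for large genomes.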

multiPhATE: bioinformatics pipeline for functional annotation of phage isolates

Bioinformatics ◽

10.1093/bioinformatics/btz258 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4402-4404 ◽

Cited By ~ 10

Author(s):

Carol L Ecale Zhou ◽

Stephanie Malfatti ◽

Jeffrey Kimbrel ◽

Casandra Philipson ◽

Katelyn McNair ◽

...

Keyword(s):

De Novo ◽

Third Party ◽

Supplementary Information ◽

Modular Construction ◽

Bioinformatics Pipeline ◽

Annotation Pipeline ◽

Phage Gene ◽

Software Documentation ◽

Multiple Processors ◽

Calling Algorithm

Abstract Summary To address the need for improved phage annotation tools that scale, we created an automated throughput annotation pipeline: multiple-genome Phage Annotation Toolkit and Evaluator (multiPhATE). multiPhATE is a throughput pipeline driver that invokes an annotation pipeline (PhATE) across a user-specified set of phage genomes. This tool incorporates a de novo phage gene calling algorithm and assigns putative functions to gene calls using protein-, virus- and phage-centric databases. multiPhATE’s modular construction allows the user to implement all or any portion of the analyses by acquiring local instances of the desired databases and specifying the desired analyses in a configuration file. We demonstrate multiPhATE by annotating two newly sequenced Yersinia pestis phage genomes. Within multiPhATE, the PhATE processing pipeline can be readily implemented across multiple processors, making it adaptable for throughput sequencing projects. Software documentation assists the user in configuring the system. Availability and implementation multiPhATE was implemented in Python 3.7, and runs as a command-line code under Linux or Unix. multiPhATE is freely available under an open-source BSD3 license from https://github.com/carolzhou/multiPhATE. Instructions for acquiring the databases and third-party codes used by multiPhATE are included in the distribution README file. Users may report bugs by submitting to the github issues page associated with the multiPhATE distribution. Supplementary information Supplementary data are available at Bioinformatics online.

Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies

10.1101/2020.03.15.992941 ◽

2020 ◽

Cited By ~ 15

Author(s):

Arang Rhie ◽

Brian P. Walenz ◽

Sergey Koren ◽

Adam M. Phillippy

Keyword(s):

De Novo ◽

High Accuracy ◽

Link Type ◽

Base Level ◽

Project Home Page ◽

Set Operations ◽

Assembly Evaluation ◽

Long Read ◽

Genome Assemblies ◽

Reference Genomes

Abstract Recent long-read assemblies often exceed the quality and completeness of available reference genomes, making validation challenging. Here we present Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. For trios, Merqury can also evaluate haplotype-specific accuracy, completeness, phase block continuity, and switch errors. Multiple visualizations, such as k-mer spectrum plots, can be generated for evaluation. We demonstrate on both human and plant genomes that Merqury is a fast and robust method for assembly validation.
Availability of data and material
Project name: Merqury
Project home page: https://github.com/marbl/merqury, https://github.com/marbl/meryl
Archived version: https://github.com/marbl/merqury/releases/tag/v1.0
Operating system(s): Platform independent
Programming language: C++, Java, Perl
Other requirements: gcc 4.8 or higher, java 1.6 or higher
License: Public domain (see https://github.com/marbl/merqury/blob/master/README.license)
Any restrictions to use by non-academics: No restrictions applied
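The k-mer set operations mentioned in the abstract can be sketched with plain Python sets. Completeness is taken here as the fraction of read k-mers recovered in the assembly, and base accuracy is derived from assembly k-mers with no read support using a k-mer survival model, in which a k-mer is error-free only if all k of its bases are correct. The function names are hypothetical, distinct forward-strand k-mers are used for simplicity, and whether this matches Merqury's exact computation is an assumption.

```python
import math


def kmer_set(seqs, k):
    """Collect the distinct k-mers (forward strand only) of an iterable of sequences."""
    return {seq[i:i + k] for seq in seqs for i in range(len(seq) - k + 1)}


def completeness(asm_kmers, read_kmers):
    """Fraction of read k-mers that are also present in the assembly."""
    return len(asm_kmers & read_kmers) / len(read_kmers)


def consensus_qv(asm_kmers, read_kmers, k):
    """Phred-scaled accuracy estimated from assembly k-mers lacking read support.

    Assumed model: P(base correct) = (1 - asm_only / total) ** (1 / k).
    """
    asm_only = len(asm_kmers - read_kmers)
    if asm_only == 0:
        return math.inf  # no unsupported k-mers observed
    p_correct = (1 - asm_only / len(asm_kmers)) ** (1 / k)
    return -10 * math.log10(1 - p_correct)
```

In practice the read k-mer set would first be filtered to exclude low-count k-mers that likely stem from sequencing errors; that step is omitted here.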

RaGOO: fast and accurate reference-guided scaffolding of draft genomes

Genome Biology ◽

10.1186/s13059-019-1829-6 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 56

Author(s):

Michael Alonge ◽

Sebastian Soyk ◽

Srividya Ramakrishnan ◽

Xingang Wang ◽

Sara Goodwin ◽

...

Keyword(s):

Arabidopsis Thaliana ◽

Open Source ◽

Genome Analysis ◽

De Novo ◽

Structural Variants ◽

Tomato Genome ◽

Pan Genome ◽

Link Type ◽

Genome Assemblies

Abstract We present RaGOO, a reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in minutes. After the pseudomolecules are constructed, RaGOO identifies structural variants, including those spanning sequencing gaps. We show that RaGOO accurately orders and orients 3 de novo tomato genome assemblies, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open source at https://github.com/malonge/RaGOO.
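Reference-guided ordering and orienting can be illustrated with a generic heuristic: assign each contig to the reference chromosome it covers most, orient it by the strand carrying most of that coverage, and sort contigs by where they land on the reference. The abstract does not spell out RaGOO's rules, so the `Alignment` record, the function name and the tie-breaking choices below are assumptions for illustration; in a real workflow the alignments would come from an aligner such as Minimap2.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """One contig-to-reference alignment (hypothetical minimal record)."""
    contig: str
    ref_chrom: str
    ref_start: int
    ref_end: int
    strand: str  # "+" or "-"


def order_and_orient(alignments):
    """Return {chromosome: [(contig, strand), ...]} ordered along the reference."""
    by_contig = defaultdict(list)
    for aln in alignments:
        by_contig[aln.contig].append(aln)

    placements = defaultdict(list)
    for contig, alns in by_contig.items():
        # Choose the chromosome with the largest total aligned length.
        covered = defaultdict(int)
        for aln in alns:
            covered[aln.ref_chrom] += aln.ref_end - aln.ref_start
        chrom = max(covered, key=covered.get)

        # Orient by the strand holding the majority of the aligned bases.
        on_chrom = [aln for aln in alns if aln.ref_chrom == chrom]
        plus = sum(a.ref_end - a.ref_start for a in on_chrom if a.strand == "+")
        strand = "+" if plus >= covered[chrom] - plus else "-"

        # Position the contig at the midpoint of its longest alignment.
        longest = max(on_chrom, key=lambda a: a.ref_end - a.ref_start)
        midpoint = (longest.ref_start + longest.ref_end) / 2
        placements[chrom].append((midpoint, contig, strand))

    return {
        chrom: [(contig, strand) for _, contig, strand in sorted(items)]
        for chrom, items in placements.items()
    }
```

Concatenating the ordered, oriented contigs with gap padding between them would then yield one pseudomolecule per reference chromosome.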

Varstation: a complete and efficient tool to support NGS data analysis

10.1101/833582 ◽

2019 ◽

Author(s):

ACO Faria ◽

MP Caraciolo ◽

RM Minillo ◽

TF Almeida ◽

SM Pereira ◽

...

Keyword(s):

Genetic Variation ◽

Data Analysis ◽

Supplementary Information ◽

Human Genetic Variation ◽

Supplementary Data ◽

Efficient Tool ◽

Link Type ◽

Data Processor ◽

Ngs Data Analysis ◽

Ngs Data

Abstract Summary Varstation is a cloud-based NGS data processor and analyzer for human genetic variation. This resource provides a customizable, centralized, safe and clinically validated environment aiming to improve and optimize the flow of NGS analyses and reports related to clinical and research genetics. Availability and implementation Varstation is freely available at http://varstation.com, for academic [email protected] information Supplementary data are available at Bioinformatics online.

Easily phylotyping E. coli via the EzClermont web app and command-line tool

10.1101/317610 ◽

2018 ◽

Cited By ~ 3

Author(s):

Nicholas R. Waters ◽

Florence Abram ◽

Fiona Brennan ◽

Ashleigh Holmes ◽

Leighton Pritchard

Keyword(s):

Supplementary Information ◽

Validation Dataset ◽

Command Line ◽

E Coli ◽

Link Type ◽

Command Line Tool ◽

Pcr Method ◽

Web App ◽

Local Use ◽

Genome Assemblies

Summary The Clermont PCR method of phylotyping Escherichia coli has remained a useful classification scheme despite the proliferation of higher-resolution sequence typing schemes. We have implemented an in silico Clermont PCR method as both a web app and a command-line tool to allow researchers to apply this phylotyping scheme to genome assemblies easily. Availability and Implementation EzClermont is available as a web app at http://www.ezclermont.org. For local use, EzClermont can be installed with pip or installed from the source code at https://github.com/nickp60/ezclermont. All analysis was done with version [email protected], [email protected] information Table S1: test dataset; S2: validation dataset; S3: results.
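An in silico PCR amounts to locating a primer pair in a sequence and reporting the product that would be amplified between them. The sketch below does this with exact string matching on one strand. The function names and the `max_len` cut-off are hypothetical, any primers passed in are the caller's own, and EzClermont's actual matching rules (for example, tolerance of mismatches) are not described in the abstract, so none are assumed here.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def in_silico_pcr(template, fwd_primer, rev_primer, max_len=3000):
    """Return predicted amplicon lengths for exact primer matches on the given strand.

    The reverse primer anneals to the opposite strand, so on the template it
    appears as its reverse complement downstream of the forward primer.
    """
    template = template.upper()
    fwd = fwd_primer.upper()
    rev_site = reverse_complement(rev_primer.upper())

    amplicons = []
    start = template.find(fwd)
    while start != -1:
        end = template.find(rev_site, start + len(fwd))
        if end != -1:
            length = end + len(rev_site) - start
            if length <= max_len:
                amplicons.append(length)
        start = template.find(fwd, start + 1)
    return amplicons
```

A phylotyping call would then follow from which primer pairs yield a product of the expected size; a fuller version would also scan the reverse complement of each contig.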

Detection and assembly of novel sequence insertions using Linked-Read technology

10.1101/551028 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dmitry Meleshko ◽

Patrick Marks ◽

Stephen Williams ◽

Iman Hajirasouliha

Keyword(s):

Dna Sequences ◽

De Novo Assembly ◽

De Novo ◽

Supplementary Information ◽

Computational Techniques ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Link Type ◽

Long Read

Abstract Motivation Emerging Linked-Read (aka read-cloud) technologies such as the 10x Genomics Chromium system have great potential for accurate detection and phasing of large-scale human genome structural variations (SVs). By leveraging the long-range information encoded in Linked-Read sequencing, computational techniques are able to detect and characterize complex structural variations that were previously undetectable by short-read methods. However, there is no available Linked-Read method for detection and assembly of novel sequence insertions, DNA sequences present in a given sequenced sample but missing in the reference genome, without requiring whole genome de novo assembly. In this paper, we propose a novel integrated alignment-based and local-assembly-based algorithm, Novel-X, that effectively uses the barcode information encoded in Linked-Read sequencing datasets to improve detection of such events without the need for whole genome de novo assembly. We evaluated our method on two haploid human genomes, CHM1 and CHM13, sequenced on the 10x Genomics Chromium system. These genomes have also recently been characterized with high-coverage PacBio long reads. We also tested our method on NA12878, the well-known HapMap CEPH diploid genome, and the child genome in a Yoruba trio (NA19240), which was recently studied on multiple sequencing platforms. Detecting insertion events is very challenging using short reads, and the only viable available solution is long-read sequencing (e.g. PacBio or ONT). Our experiments, however, show that Novel-X finds many insertions that cannot be found by state-of-the-art tools using short-read sequencing data but are present in PacBio data. Since Linked-Read sequencing is significantly cheaper than long-read sequencing, our method using Linked-Reads enables routine large-scale screenings of sequenced genomes for novel sequence insertions. Availability Software is freely available at https://github.com/1dayac/[email protected] information Supplementary data are available at https://github.com/1dayac/novel_insertions_supplementary

A high-quality de novo genome assembly based on nanopore sequencing of a wild-caught coconut rhinoceros beetle (Oryctes rhinoceros)

10.1101/2021.09.12.459717 ◽

2021 ◽

Author(s):

Igor Filipović ◽

Gordana Rašić ◽

James Hereward ◽

Maria Gharuka ◽

Gregor J Devine ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Nuclear Genome ◽

Assembly Process ◽

Structural Annotation ◽

High Quality ◽

Oryctes Rhinoceros ◽

Rhinoceros Beetle ◽

Long Read ◽

Genome Assemblies

Background: An optimal starting point for relating genome function to organismal biology is a high-quality nuclear genome assembly, and long-read sequencing is revolutionizing the production of this genomic resource in insects. Despite this, nuclear genome assemblies have been under-represented for agricultural insect pests, particularly from the order Coleoptera. Here we present a de novo genome assembly and structural annotation for the coconut rhinoceros beetle, Oryctes rhinoceros (Coleoptera: Scarabaeidae), based on Oxford Nanopore Technologies (ONT) long-read data generated from a wild-caught female, as well as the assembly process that also led to the recovery of the complete circular genome assemblies of the beetle's mitochondrial genome and that of the biocontrol agent, Oryctes rhinoceros nudivirus (OrNV). As an invasive pest of palm trees, O. rhinoceros is undergoing an expansion in its range across the Pacific Islands, requiring new approaches to management that may include strategies facilitated by genome assembly and annotation. Results: High-quality DNA isolated from an adult female was used to create four ONT libraries that were sequenced using four MinION flow cells, producing a total of 27.2 Gb of high-quality long-read sequences. We employed an iterative assembly process and polishing with one lane of high-accuracy Illumina reads, obtaining a final size of the assembly of 377.36 Mb that had high contiguity (fragment N50 length = 12 Mb) and accuracy, as evidenced by the exceptionally high completeness of the benchmarked set of conserved single-copy orthologous genes (BUSCO completeness = 99.11%). These quality metrics place our assembly as the most complete of the published Coleopteran genomes. 
The structural annotation of the nuclear genome assembly contained a highly accurate set of 16,371 protein-coding genes showing BUSCO completeness of 92.09%, as well as the expected number of non-coding RNAs and the number and structure of paralogous genes in a gene family like Sigma GST. Conclusions: The genomic resources produced in this study form a foundation for further functional genetic research and management programs that may inform the control and surveillance of O. rhinoceros populations, and we demonstrate the efficacy of de novo genome assembly using long-read ONT data from a single field-caught insect.
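The contiguity figure quoted in the abstract (fragment N50) has a simple definition: the length L such that fragments of length L or longer together contain at least half of the assembled bases. A minimal sketch of that calculation follows; the function name is illustrative and the input is assumed to be a plain list of fragment lengths.

```python
def n50(lengths):
    """Return the N50 of a collection of fragment lengths."""
    if not lengths:
        raise ValueError("n50 requires at least one fragment length")
    total = sum(lengths)
    running = 0
    # Walk from the longest fragment down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

For fragments of lengths 2, 3, 4, 5 and 6 (20 bases in total), the two longest fragments already cover 11 bases, so the N50 is 5.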

Plasmid Profiler: Comparative Analysis of Plasmid Content in WGS Data

10.1101/121350 ◽

2017 ◽

Cited By ~ 2

Author(s):

Adrian Zetner ◽

Jennifer Cabral ◽

Laura Mataseje ◽

Natalie C Knox ◽

Philip Mabon ◽

...

Keyword(s):

Comparative Analysis ◽

De Novo ◽

Sequence Data ◽

Health Agency ◽

R Package ◽

Whole Genome Sequence ◽

Reference Sequence ◽

Supplementary Information ◽

Plasmid Content ◽

Link Type

Abstract Summary Comparative analysis of bacterial plasmids from whole genome sequence (WGS) data generated from short read sequencing is challenging. This is due to the difficulty in identifying contigs harbouring plasmid sequence data, and further difficulty in assembling such contigs into a full plasmid. As such, few software programs and bioinformatics pipelines exist to perform comprehensive comparative analyses of plasmids within and amongst sequenced isolates. To address this gap, we have developed Plasmid Profiler, a pipeline to perform comparative plasmid content analysis without the need for de novo assembly. The pipeline is designed to rapidly identify plasmid sequences by mapping reads to a plasmid reference sequence database. Predicted plasmid sequences are then annotated with their incompatibility group, if known. The pipeline allows users to query plasmids for genes or regions of interest and visualize results as an interactive heat map. Availability and Implementation Plasmid Profiler is freely available software released under the Apache 2.0 open source software license.
A stand-alone version of the entire Plasmid Profiler pipeline is available as a Docker container at https://hub.docker.com/r/phacnml/plasmidprofiler_0_1_6/.
The conda recipe for the Plasmid R package is available at: https://anaconda.org/bioconda/r-plasmidprofiler
The custom Plasmid Profiler R package is also available as a CRAN package at https://cran.r-project.org/web/packages/Plasmidprofiler/index.html
Galaxy tools associated with the pipeline are available as a Galaxy tool suite at https://toolshed.g2.bx.psu.edu/repository?repository_id=55e082200d16a504
The source code is available at: https://github.com/phac-nml/plasmidprofiler
The Galaxy implementation is available at: https://github.com/phac-nml/plasmidprofiler-galaxy
Contact Email: [email protected]: National Microbiology Laboratory, Public Health Agency of Canada, 1015 Arlington Street, Winnipeg, Manitoba, Canada
Supplementary information Documentation: http://plasmid-profiler.readthedocs.io/en/latest/
