multiPhATE: bioinformatics pipeline for functional annotation of phage isolates

Mapping Intimacies ◽

10.1101/551010 ◽

2019 ◽

Author(s):

Carol L. Ecale Zhou ◽

Stephanie Malfatti ◽

Jeffrey Kimbrel ◽

Casandra Philipson ◽

Katelyn McNair ◽

...

Keyword(s):

De Novo ◽

Third Party ◽

Supplementary Information ◽

Modular Construction ◽

Bioinformatics Pipeline ◽

Annotation Pipeline ◽

Phage Gene ◽

Software Documentation ◽

Link Type ◽

Multiple Processors

ABSTRACTSummaryTo address the need for improved phage annotation tools that scale, we created an automated throughput annotation pipeline: multiple-genome Phage Annotation Toolkit and Evaluator (multiPhATE). multiPhATE is a throughput pipeline driver that invokes an annotation pipeline (PhATE) across a user-specified set of phage genomes. This tool incorporates a de novo phage gene-calling algorithm and assigns putative functions to gene calls using protein-, virus-, and phage-centric databases. multiPhATE’s modular construction allows the user to implement all or any portion of the analyses by acquiring local instances of the desired databases and specifying the desired analyses in a configuration file. We demonstrate multiPhATE by annotating two newly sequenced Yersinia pestis phage genomes. Within multiPhATE, the PhATE processing pipeline can be readily implemented across multiple processors, making it adaptable for throughput sequencing projects. Software documentation assists the user in configuring the system.Availability and implementationmultiPhATE was implemented in Python 3.7, and runs as a command-line code under Linux or Unix. multiPhATE is freely available under an open-source BSD3 license from https://github.com/carolzhou/multiPhATE. Instructions for acquiring the databases and third-party codes used by multiPhATE are included in the distribution README file. Users may report bugs by submitting to the github issues page associated with the multiPhATE [email protected] or [email protected] informationData generated during the current study are included as supplementary files available for download at https://github.com/carolzhou/PhATE_docs.

multiPhATE: bioinformatics pipeline for functional annotation of phage isolates

Bioinformatics ◽

10.1093/bioinformatics/btz258 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4402-4404 ◽

Cited By ~ 10

Author(s):

Carol L Ecale Zhou ◽

Stephanie Malfatti ◽

Jeffrey Kimbrel ◽

Casandra Philipson ◽

Katelyn McNair ◽

...

Keyword(s):

De Novo ◽

Third Party ◽

Supplementary Information ◽

Modular Construction ◽

Bioinformatics Pipeline ◽

Annotation Pipeline ◽

Phage Gene ◽

Software Documentation ◽

Multiple Processors ◽

Calling Algorithm

Abstract Summary To address the need for improved phage annotation tools that scale, we created an automated throughput annotation pipeline: multiple-genome Phage Annotation Toolkit and Evaluator (multiPhATE). multiPhATE is a throughput pipeline driver that invokes an annotation pipeline (PhATE) across a user-specified set of phage genomes. This tool incorporates a de novo phage gene calling algorithm and assigns putative functions to gene calls using protein-, virus- and phage-centric databases. multiPhATE’s modular construction allows the user to implement all or any portion of the analyses by acquiring local instances of the desired databases and specifying the desired analyses in a configuration file. We demonstrate multiPhATE by annotating two newly sequenced Yersinia pestis phage genomes. Within multiPhATE, the PhATE processing pipeline can be readily implemented across multiple processors, making it adaptable for throughput sequencing projects. Software documentation assists the user in configuring the system. Availability and implementation multiPhATE was implemented in Python 3.7, and runs as a command-line code under Linux or Unix. multiPhATE is freely available under an open-source BSD3 license from https://github.com/carolzhou/multiPhATE. Instructions for acquiring the databases and third-party codes used by multiPhATE are included in the distribution README file. Users may report bugs by submitting to the github issues page associated with the multiPhATE distribution. Supplementary information Supplementary data are available at Bioinformatics online.

KAT: A K-mer Analysis Toolkit to quality control NGS datasets and genome assemblies

10.1101/064733 ◽

2016 ◽

Cited By ~ 6

Author(s):

Daniel Mapleson ◽

Gonzalo Garcia Accinelli ◽

George Kettleborough ◽

Jonathan Wright ◽

Bernardo J. Clavijo

Keyword(s):

Quality Control ◽

De Novo ◽

Pairwise Comparison ◽

Supplementary Information ◽

Assembly Process ◽

High Coverage ◽

Software Documentation ◽

Link Type ◽

Genome Assemblies ◽

Ngs Data

ABSTRACTMotivationDe novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data beneﬁts from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilised by assemblers, provides useful insights that can inform the assembly process and result in better assemblies.ResultsWe present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT’s ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies.AvailabilityKAT is available under the GPLv3 license at: https://github.com/TGAC/[email protected] InformationSupplementary Information (SI) is available at Bioinformatics online. In addition, the software documentation is available online at: http://kat.readthedocs.io/en/latest/.

epiGBS2: an improved protocol and automated snakemake workflow for highly multiplexed reduced representation bisulfite sequencing

10.1101/2020.06.23.137091 ◽

2020 ◽

Author(s):

Fleur Gawehns ◽

Maarten Postuma ◽

Thomas P. van Gurp ◽

Niels C. A. M. Wagemaker ◽

Samar Fatma ◽

...

Keyword(s):

Cytosine Methylation ◽

Bisulfite Sequencing ◽

De Novo ◽

Bioinformatics Pipeline ◽

Reduced Representation ◽

Link Type ◽

Time Investment ◽

Reduced Representation Bisulfite Sequencing ◽

Wide Range ◽

Laboratory Protocol

AbstractepiGBS is an existing reduced representation bisulfite sequencing method to determine cytosine methylation and genetic polymorphisms de novo. Here, we present epiGBS2, an improved epiGBS laboratory protocol and user-friendly bioinformatics pipeline for a wide range of species with or without reference genome. epiGBS2 decreases costs and time investment and increases user-friendliness and reproducibility. The library protocol was adjusted to allow for a flexible choice of restriction enzymes and a double digest. Instead of fully methylated adapters, semi-methylated adapters are now used. The bioinformatics pipeline was improved in speed and integrated in the snakemake workflow management system, which now makes the pipeline easy to execute, modular, and parameter settings flexible. We also provide a detailed description of the laboratory protocol, an extensive manual of the bioinformatics pipeline, which is publicly accessible on github (https://github.com/nioo-knaw/epiGBS2) and zenodo (https://doi.org/10.5281/zenodo.3819996), and example output.

Detection and assembly of novel sequence insertions using Linked-Read technology

10.1101/551028 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dmitry Meleshko ◽

Patrick Marks ◽

Stephen Williams ◽

Iman Hajirasouliha

Keyword(s):

Dna Sequences ◽

De Novo Assembly ◽

De Novo ◽

Supplementary Information ◽

Computational Techniques ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Link Type ◽

Long Read

AbstractMotivationEmerging Linked-Read (aka read-cloud) technologies such as the 10x Genomics Chromium system have great potential for accurate detection and phasing of largescale human genome structural variations (SVs). By leveraging the long-range information encoded in Linked-Read sequencing, computational techniques are able to detect and characterize complex structural variations that are previously undetectable by short-read methods. However, there is no available Linked-Read method for detection and assembly of novel sequence insertions, DNA sequences present in a given sequenced sample but missing in the reference genome, without requiring whole genome de novo assembly. In this paper, we propose a novel integrated alignment-based and local-assembly-based algorithm, Novel-X, that effectively uses the barcode information encoded in Linked-Read sequencing datasets to improve detection of such events without the need of whole genome de novo assembly. We evaluated our method on two haploid human genomes, CHM1 and CHM13, sequenced on the 10x Genomics Chromium system. These genomes have been also characterized with high coverage PacBio long-reads recently. We also tested our method on NA12878, the wellknown HapMap CEPH diploid genome and the child genome in a Yoruba trio (NA19240) which was recently studied on multiple sequencing platforms. Detecting insertion events is very challenging using short reads and the only viable available solution is by long-read sequencing (e.g. PabBio or ONT). Our experiments, however, show that Novel-X finds many insertions that cannot be found by state of the art tools using short-read sequencing data but present in PacBio data. Since Linked-Read sequencing is significantly cheaper than long-read sequencing, our method using Linked-Reads enables routine large-scale screenings of sequenced genomes for novel sequence insertions.AvailabilitySoftware is freely available at https://github.com/1dayac/[email protected] informationSupplementary data are available at https://github.com/1dayac/novel_insertions_supplementary

Plasmid Profiler: Comparative Analysis of Plasmid Content in WGS Data

10.1101/121350 ◽

2017 ◽

Cited By ~ 2

Author(s):

Adrian Zetner ◽

Jennifer Cabral ◽

Laura Mataseje ◽

Natalie C Knox ◽

Philip Mabon ◽

...

Keyword(s):

Comparative Analysis ◽

De Novo ◽

Sequence Data ◽

Health Agency ◽

R Package ◽

Whole Genome Sequence ◽

Reference Sequence ◽

Supplementary Information ◽

Plasmid Content ◽

Link Type

AbstractSummaryComparative analysis of bacterial plasmids from whole genome sequence (WGS) data generated from short read sequencing is challenging. This is due to the difficulty in identifying contigs harbouring plasmid sequence data, and further difficulty in assembling such contigs into a full plasmid. As such, few software programs and bioinformatics pipelines exist to perform comprehensive comparative analyses of plasmids within and amongst sequenced isolates. To address this gap, we have developed Plasmid Profiler, a pipeline to perform comparative plasmid content analysis without the need forde novoassembly. The pipeline is designed to rapidly identify plasmid sequences by mapping reads to a plasmid reference sequence database. Predicted plasmid sequences are then annotated with their incompatibility group, if known. The pipeline allows users to query plasmids for genes or regions of interest and visualize results as an interactive heat map.Availability and ImplementationPlasmid Profiler is freely available software released under the Apache 2.0 open source software license. A stand-alone version of the entire Plasmid Profiler pipeline is available as a Docker container athttps://hub.docker.com/r/phacnml/plasmidprofiler_0_1_6/.The conda recipe for the Plasmid R package is available at:https://anaconda.org/bioconda/r-plasmidprofilerThe custom Plasmid Profiler R package is also available as a CRAN package athttps://cran.r-project.org/web/packages/Plasmidprofiler/index.htmlGalaxy tools associated with the pipeline are available as a Galaxy tool suite athttps://toolshed.g2.bx.psu.edu/repository?repository_id=55e082200d16a504The source code is available at:https://github.com/phac-nml/plasmidprofilerThe Galaxy implementation is available at:https://github.com/phac-nml/plasmidprofiler-galaxyContactEmail:[email protected]: National Microbiology Laboratory, Public Health Agency of Canada, 1015 Arlington Street, Winnipeg, Manitoba, CanadaSupplementary informationDocumentation:http://plasmid-profiler.readthedocs.io/en/latest/

ssbio: A Python Framework for Structural Systems Biology

10.1101/165506 ◽

2017 ◽

Cited By ~ 1

Author(s):

Nathan Mih ◽

Elizabeth Brunk ◽

Ke Chen ◽

Edward Catoiu ◽

Anand Sastry ◽

...

Keyword(s):

Structural Information ◽

Protein Structures ◽

Structural Data ◽

Third Party ◽

Supplementary Information ◽

Link Type ◽

Protein Properties ◽

Scale Network ◽

Structural Systems Biology ◽

Genome Scale

AbstractSummaryWorking with protein structures at the genome-scale has been challenging in a variety of ways. Here, we present ssbio, a Python package that provides a framework to easily work with structural information in the context of genome-scale network reconstructions, which can contain thousands of individual proteins. The ssbio package provides an automated pipeline to construct high quality genome-scale models with protein structures (GEM-PROs), wrappers to popular third-party programs to compute associated protein properties, and methods to visualize and annotate structures directly in Jupyter notebooks, thus lowering the barrier of linking 3D structural data with established systems workflows.Availability and Implementationssbio is implemented in Python and available to download under the MIT license at http://github.com/SBRG/ssbio. Documentation and Jupyter notebook tutorials are available at http://ssbio.readthedocs.io/en/latest/. Interactive notebooks can be launched using Binder at https://mybinder.org/v2/gh/SBRG/ssbio/[email protected] InformationSupplementary data are available at Bioinformatics online.

HAMAP rules as SPARQL A portable annotation pipeline for genomes and proteomes

10.1101/615294 ◽

2019 ◽

Author(s):

Jerven Bolleman ◽

Eduoard de Castro ◽

Delphine Baratin ◽

Sebastien Gehant ◽

Beatrice A. Cuche ◽

...

Keyword(s):

Data Quality ◽

Protein Sequences ◽

Cost Effective ◽

Supplementary Information ◽

Biological Knowledge ◽

Supplementary Data ◽

Annotation Pipeline ◽

Link Type ◽

Supplementary Material

AbstractMotivationGenome and proteome annotation pipelines are generally custom built and therefore not easily reusable by other groups, which leads to duplication of effort, increased costs, and suboptimal results. One cost-effective way to increase the data quality in public databases is to encourage the adoption of annotation standards and technological solutions that enable the sharing of biological knowledge and tools for genome and proteome annotation.ResultsWe have translated the rules of our HAMAP proteome annotation pipeline to queries in the W3C standard SPARQL 1.1 syntax and applied them with two off-the-shelf SPARQL engines to UniProtKB/Swiss-Prot protein sequences described in RDF format. This approach is applicable to any genome or proteome annotation pipeline and greatly simplifies their reuse.AvailabilityHAMAP SPARQL rules and documentation are freely available for download from the HAMAP FTP site ftp://ftp.expasy.org/databases/hamap/hamapsparql.tar.gz under a CC-BY-ND 4.0 license. The annotations generated by the rules are under the CC-BY 4.0 [email protected] informationSupplementary data are included at the end of this document.

CSMD: a computational subtraction-based microbiome discovery pipeline for species-level characterization of clinical metagenomic samples

Bioinformatics ◽

10.1093/bioinformatics/btz790 ◽

2019 ◽

Author(s):

Yu Liu ◽

Paul W Bible ◽

Bin Zou ◽

Qiaoxing Liang ◽

Cong Dong ◽

...

Keyword(s):

Species Level ◽

Supplementary Information ◽

Clinical Samples ◽

Plasma Dna ◽

Bioinformatics Pipeline ◽

Microbial Identification ◽

Reference Databases ◽

Microbial Dna ◽

Higher Sensitivity

Abstract Motivation Microbiome analyses of clinical samples with low microbial biomass are challenging because of the very small quantities of microbial DNA relative to the human host, ubiquitous contaminating DNA in sequencing experiments and the large and rapidly growing microbial reference databases. Results We present computational subtraction-based microbiome discovery (CSMD), a bioinformatics pipeline specifically developed to generate accurate species-level microbiome profiles for clinical samples with low microbial loads. CSMD applies strategies for the maximal elimination of host sequences with minimal loss of microbial signal and effectively detects microorganisms present in the sample with minimal false positives using a stepwise convergent solution. CSMD was benchmarked in a comparative evaluation with other classic tools on previously published well-characterized datasets. It showed higher sensitivity and specificity in host sequence removal and higher specificity in microbial identification, which led to more accurate abundance estimation. All these features are integrated into a free and easy-to-use tool. Additionally, CSMD applied to cell-free plasma DNA showed that microbial diversity within these samples is substantially broader than previously believed. Availability and implementation CSMD is freely available at https://github.com/liuyu8721/csmd. Supplementary information Supplementary data are available at Bioinformatics online.

MUM&Co: accurate detection of all SV types through whole-genome alignment

Bioinformatics ◽

10.1093/bioinformatics/btaa115 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3242-3243 ◽

Cited By ~ 2

Author(s):

Samuel O’Donnell ◽

Gilles Fischer

Keyword(s):

De Novo ◽

Supplementary Information ◽

Genome Alignment ◽

Whole Genome ◽

Structural Variations ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Human Genomes ◽

Whole Genome Alignment ◽

Primary Output

Abstract Summary MUM&Co is a single bash script to detect structural variations (SVs) utilizing whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50 bp. Its versatility depends upon the WGA and therefore benefits from contiguous de-novo assemblies generated by third generation sequencing technologies. Benchmarked against five WGA SV-calling tools, MUM&Co outperforms all tools on simulated SVs in yeast, plant and human genomes and performs similarly in two real human datasets. Additionally, MUM&Co is particularly unique in its ability to find inversions in both simulated and real datasets. Lastly, MUM&Co’s primary output is an intuitive tabulated file containing a list of SVs with only necessary genomic details. Availability and implementation https://github.com/SAMtoBAM/MUMandCo. Supplementary information Supplementary data are available at Bioinformatics online.

AStrap: identification of alternative splicing from transcript sequences without a reference genome

Bioinformatics ◽

10.1093/bioinformatics/bty1008 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2654-2656 ◽

Cited By ~ 5

Author(s):

Guoli Ji ◽

Wenbin Ye ◽

Yaru Su ◽

Moliang Chen ◽

Guangzao Huang ◽

...

Keyword(s):

Machine Learning ◽

Alternative Splicing ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Supplementary Information ◽

Model Organisms ◽

Sequencing Data ◽

Extensive Evaluation ◽

Reference Genomes

Abstract Summary Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity, however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify much more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information Supplementary data are available at Bioinformatics online.