bioinformatics workflows Latest Research Papers

Challenges in Bioinformatics Workflows for Processing Microbiome Omics Data at Scale

Frontiers in Bioinformatics ◽

10.3389/fbinf.2021.826370 ◽

2022 ◽

Vol 1 ◽

Author(s):

Bin Hu ◽

Shane Canon ◽

Emiley A. Eloe-Fadrosh ◽

Anubhav ◽

Michal Babinski ◽

...

Keyword(s):

Open Access ◽

Omics Data ◽

Metabolomic Data ◽

Or Groups ◽

Microbiome Research ◽

Descriptive Approach ◽

Dynamics And Function ◽

And Function ◽

Microbiome Data ◽

Bioinformatics Workflows

The nascent field of microbiome science is transitioning from a descriptive approach of cataloging taxa and functions present in an environment to applying multi-omics methods to investigate microbiome dynamics and function. A large number of new tools and algorithms have been designed and used for very specific purposes on samples collected by individual investigators or groups. While these developments have been quite instructive, the ability to compare microbiome data generated by many groups of researchers is impeded by the lack of standardized application of bioinformatics methods. Additionally, there are few examples of broad bioinformatics workflows that can process metagenome, metatranscriptome, metaproteome and metabolomic data at scale, and no central hub that allows processing, or provides varied omics data that are findable, accessible, interoperable and reusable (FAIR). Here, we review some of the challenges that exist in analyzing omics data within the microbiome research sphere, and provide context on how the National Microbiome Data Collaborative has adopted a standardized and open access approach to address such challenges.

Resource Prediction Service for Efficient Execution of Bioinformatics Workflows in Federated Cloud with Machine Learning

10.1109/bibm52615.2021.9669152 ◽

2021 ◽

Author(s):

Matheus Sobrinho ◽

Michel Rosa ◽

Waldeyr Silva ◽

Aleteia Araujo

Keyword(s):

Machine Learning ◽

Bioinformatics Workflows ◽

Resource Prediction ◽

Efficient Execution

BioProv - A provenance library for bioinformatics workflows

The Journal of Open Source Software ◽

10.21105/joss.03622 ◽

2021 ◽

Vol 6 (67) ◽

pp. 3622

Author(s):

Vinícius Salazar ◽

João Cavalcante ◽

Daniel de Oliveira ◽

Fabiano Thompson ◽

Marta Mattoso

Keyword(s):

Bioinformatics Workflows

Validation strategy of a bioinformatics whole genome sequencing workflow for Shiga toxin-producing Escherichia coli using a reference collection extensively characterized with conventional methods

Microbial Genomics ◽

10.1099/mgen.0.000531 ◽

2021 ◽

Author(s):

Bert Bogaerts ◽

Stéphanie Nouws ◽

Bavo Verhaegen ◽

Sarah Denayer ◽

Julien Van Braekel ◽

...

Keyword(s):

Escherichia Coli ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Shiga Toxin ◽

Complete Characterization ◽

Whole Genome ◽

Read Mapping ◽

Validation Strategy ◽

Bioinformatics Workflows

Whole genome sequencing (WGS) enables complete characterization of bacterial pathogenic isolates at single nucleotide resolution, making it the ultimate tool for routine surveillance and outbreak investigation. The lack of standardization, and the variation regarding bioinformatics workflows and parameters, however, complicates interoperability among (inter)national laboratories. We present a validation strategy applied to a bioinformatics workflow for Illumina data that performs complete characterization of Shiga toxin-producing Escherichia coli (STEC) isolates including antimicrobial resistance prediction, virulence gene detection, serotype prediction, plasmid replicon detection and sequence typing. The workflow supports three commonly used bioinformatics approaches for the detection of genes and alleles: alignment with blast+, kmer-based read mapping with KMA, and direct read mapping with SRST2. A collection of 131 STEC isolates collected from food and human sources, extensively characterized with conventional molecular methods, was used as a validation dataset. Using a validation strategy specifically adapted to WGS, we demonstrated high performance with repeatability, reproducibility, accuracy, precision, sensitivity and specificity above 95% for the majority of all assays. The WGS workflow is publicly available as a ‘push-button’ pipeline at https://galaxy.sciensano.be. Our validation strategy and accompanying reference dataset consisting of both conventional and WGS data can be used for characterizing the performance of various bioinformatics workflows and assays, facilitating interoperability between laboratories with different WGS and bioinformatics set-ups.
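
As a rough illustration of four of the performance metrics named in this abstract, the sketch below computes accuracy, precision, sensitivity and specificity from confusion-matrix counts. It uses the common textbook definitions, which may differ from the definitions used in the paper; the class name, function names and example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of an assay's calls compared against a reference method."""
    tp: int  # detected by the workflow and by the reference method
    fp: int  # detected by the workflow only
    tn: int  # absent according to both
    fn: int  # missed by the workflow, present according to the reference


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def performance_metrics(c: ConfusionCounts) -> Dict[str, Optional[float]]:
    """Textbook definitions; an assumption, not the paper's exact protocol."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": _ratio(c.tp + c.tn, total),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
    }


# Invented counts, for illustration only.
example = ConfusionCounts(tp=90, fp=2, tn=100, fn=8)
metrics = performance_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Repeatability and reproducibility are not covered here, since they compare repeated runs of the same sample rather than calls against a reference.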

Interpretable detection of novel human viruses from genome sequencing data

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab004 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Jakub M Bartoszewicz ◽

Anja Seidel ◽

Bernhard Y Renard

Keyword(s):

Error Rates ◽

Regions Of Interest ◽

Sequencing Data ◽

Learning Skills ◽

New Approach ◽

Nucleotide Resolution ◽

Human Viruses ◽

Reliable Methods ◽

Generation Sequencing ◽

Bioinformatics Workflows

Abstract Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.

XenoCell: classification of cellular barcodes in single cell experiments from xenograft samples

BMC Medical Genomics ◽

10.1186/s12920-021-00872-8 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Stefano Cheloni ◽

Roman Hillje ◽

Lucilla Luzi ◽

Pier Giuseppe Pelicci ◽

Elena Gatti

Keyword(s):

Single Cell ◽

Open Source ◽

Cell Line ◽

Mixed Species ◽

Single Cell Sequencing ◽

Sequencing Technologies ◽

Cell Experiment ◽

Reliable Classification ◽

Bioinformatics Workflows

Abstract Background: Single-cell sequencing technologies provide unprecedented opportunities to deconvolve the genomic, transcriptomic or epigenomic heterogeneity of complex biological systems. Their application to samples from xenografts of patient-derived biopsies (PDX), however, is limited by the presence of cells originating from both the host and the graft in the analysed samples; in bioinformatics workflows, it is still a challenge to discriminate between host and graft sequence reads obtained in a single-cell experiment. Results: We have developed XenoCell, the first stand-alone pre-processing tool that performs fast and reliable classification of host and graft cellular barcodes from single-cell sequencing experiments. We show its application on a mixed-species 50:50 cell line experiment from the 10× Genomics platform, and on a publicly available PDX dataset obtained by Drop-Seq. Conclusions: XenoCell accurately dissects sequence reads from any host and graft combination of species as well as from a broad range of single-cell experiments and platforms. It is open source and available at https://gitlab.com/XenoCell/XenoCell.

Bioinformatics workflows for clinical applications in precision oncology

Seminars in Cancer Biology ◽

10.1016/j.semcancer.2020.12.020 ◽

2021 ◽

Author(s):

Natalie Jäger

Keyword(s):

Clinical Applications ◽

Precision Oncology ◽

Bioinformatics Workflows

aCLImatise: automated generation of tool definitions for bioinformatics workflows

Bioinformatics ◽

10.1093/bioinformatics/btaa1033 ◽

2020 ◽

Author(s):

Michael Milton ◽

Natalie Thorne

Keyword(s):

Source Code ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Automated Generation ◽

Base Camp ◽

Python Package ◽

Bioinformatics Workflow ◽

Bioinformatics Workflows

Abstract Summary: aCLImatise is a utility for automatically generating tool definitions compatible with bioinformatics workflow languages, by parsing command-line help output. aCLImatise also has an associated database called the aCLImatise Base Camp, which provides thousands of pre-computed tool definitions. Availability and implementation: The latest aCLImatise source code is available within a GitHub organisation, under the GPL-3.0 license: https://github.com/aCLImatise. In particular, documentation for the aCLImatise Python package is available at https://aclimatise.github.io/CliHelpParser/, and the aCLImatise Base Camp is available at https://aclimatise.github.io/BaseCamp/. Supplementary information: Supplementary data are available at Bioinformatics online.

Towards end-to-end disease prediction from raw metagenomic data

10.1101/2020.10.29.360297 ◽

2020 ◽

Author(s):

Maxence Queyrel ◽

Edi Prifti ◽

Jean-Daniel Zucker

Keyword(s):

Dna Sequences ◽

Real Life ◽

Multiple Instance Learning ◽

Disease Classification ◽

Metagenomic Data ◽

Numerical Representation ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

End To End ◽

Bioinformatics Workflows

Abstract Analysis of the human microbiome using metagenomic sequencing data has demonstrated high ability in discriminating various human diseases. Raw metagenomic sequencing data require multiple complex and computationally heavy bioinformatics steps prior to data analysis. Such data contain millions of short sequences read from the fragmented DNA sequences and are stored as fastq files. Conventional processing pipelines consist of multiple steps including quality control, filtering, and alignment of sequences against genomic catalogs (genes, species, taxonomic levels, functional pathways, etc.). These pipelines are complex to use, time consuming and rely on a large number of parameters that often introduce variability and impact the estimation of the microbiome elements. Recent studies have demonstrated that training Deep Neural Networks directly from raw sequencing data is a promising approach to bypass some of the challenges associated with mainstream bioinformatics pipelines. Most of these methods use the concept of word and sentence embeddings that create a meaningful and numerical representation of DNA sequences, while extracting features and reducing the dimensionality of the data. In this paper we present an end-to-end approach that classifies patients into disease groups directly from raw metagenomic reads: metagenome2vec. This approach is composed of four steps: (i) generating a vocabulary of k-mers and learning their numerical embeddings; (ii) learning DNA sequence (read) embeddings; (iii) identifying the genome from which the sequence is most likely to come; and (iv) training a multiple instance learning classifier which predicts the phenotype based on the vector representation of the raw data. An attention mechanism is applied in the network so that the model can be interpreted, assigning a weight to the influence of the prediction for each genome.
Using two public real-life datasets as well as a simulated one, we demonstrated that this original approach reached very high performances, comparable with the state-of-the-art methods applied directly on processed data through mainstream bioinformatics workflows. These results are encouraging for this proof-of-concept work. We believe that with further dedication, the DNN models have the potential to surpass mainstream bioinformatics workflows in disease classification tasks.
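
To make the first half of step (i) more concrete, here is a minimal sketch of how reads could be split into overlapping k-mers and collected into an indexed vocabulary. It is not the metagenome2vec implementation: the function names, the choice to skip k-mers containing ambiguous bases, and the frequency-ordered indexing are all assumptions made for illustration, and the embedding-learning part of the step is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

VALID_BASES = frozenset("ACGT")


def kmers(read: str, k: int) -> Iterator[str]:
    """Yield overlapping k-mers of a read, skipping any with ambiguous bases."""
    read = read.upper()
    for start in range(len(read) - k + 1):
        kmer = read[start:start + k]
        if VALID_BASES.issuperset(kmer):
            yield kmer


def build_vocabulary(reads: Iterable[str], k: int) -> Dict[str, int]:
    """Map each observed k-mer to an integer index, most frequent first.

    Hypothetical design choice: indices follow descending frequency so that
    common k-mers get small identifiers, as is usual for word vocabularies.
    """
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return {kmer: index for index, (kmer, _) in enumerate(counts.most_common())}


# Toy reads, invented for illustration; real input would come from fastq files.
toy_reads = ["ACGTAC", "CGTN"]
vocabulary = build_vocabulary(toy_reads, k=3)
print(vocabulary)
```

In an embedding workflow, each read would then be turned into a sequence of these indices before being passed to a model that learns k-mer vectors.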

Interpretable detection of novel human viruses from genome sequencing data

10.1101/2020.01.29.925354 ◽

2020 ◽

Cited By ~ 5

Author(s):

Jakub M. Bartoszewicz ◽

Anja Seidel ◽

Bernhard Y. Renard

Keyword(s):

Error Rates ◽

Regions Of Interest ◽

Sequencing Data ◽

Learning Skills ◽

New Approach ◽

Nucleotide Resolution ◽

Human Viruses ◽

Reliable Methods ◽

Generation Sequencing ◽

Bioinformatics Workflows

Abstract Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.
