The DOE JGI Metagenome Workflow

2020 ◽  
Author(s):  
Alicia Clum ◽  
Marcel Huntemann ◽  
Brian Bushnell ◽  
Brian Foster ◽  
Bryce Foster ◽  
...  

ABSTRACT The DOE JGI Metagenome Workflow performs metagenome data processing, including assembly, structural, functional, and taxonomic annotation, and binning of metagenomic datasets that are subsequently included into the Integrated Microbial Genomes and Microbiomes (IMG/M) comparative analysis system (I. Chen, K. Chu, K. Palaniappan, M. Pillay, A. Ratner, J. Huang, M. Huntemann, N. Varghese, J. White, R. Seshadri, et al., Nucleic Acids Research, 2019) and provided for download via the Joint Genome Institute (JGI) Data Portal (https://genome.jgi.doe.gov/portal/). This workflow scales to run on thousands of metagenome samples per year, which can vary by the complexity of microbial communities and sequencing depth. Here we describe the different tools, databases, and parameters used at different steps of the workflow, to help with interpretation of metagenome data available in IMG and to enable researchers to apply this workflow to their own data. We use 20 publicly available sediment metagenomes to illustrate the computing requirements for the different steps and highlight the typical results of data processing. The workflow modules for read filtering and metagenome assembly are available as a Workflow Description Language (WDL) file (https://code.jgi.doe.gov/BFoster/jgi_meta_wdl.git). The workflow modules for annotation and binning are provided as a service to the user community at https://img.jgi.doe.gov/submit and require filling out the project and associated metadata descriptions in the Genomes OnLine Database (GOLD) (S. Mukherjee, D. Stamatis, J. Bertsch, G. Ovchinnikova, H. Katta, A. Mojica, I. Chen, N. Kyrpides, and T. Reddy, Nucleic Acids Research, 2018). IMPORTANCE The DOE JGI Metagenome Workflow is designed for processing metagenomic datasets starting from Illumina fastq files. It performs data pre-processing, error correction, assembly, structural and functional annotation, and binning. 
The results of processing are provided in several standard formats, such as fasta and gff, and can be used for subsequent integration into the Integrated Microbial Genomes (IMG) system, where they can be compared to a comprehensive set of publicly available metagenomes. As of 7/30/2020, 7,155 JGI metagenomes have been processed by the JGI Metagenome Workflow.
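The abstract notes that results are delivered in standard formats such as fasta and gff. As a minimal sketch of consuming such output in Python — the nine-column layout follows the GFF3 convention, but the example line and attribute names here are invented for illustration, not actual JGI output:

```python
def parse_gff_line(line):
    """Split one tab-delimited GFF3 feature line into a dict."""
    cols = line.rstrip("\n").split("\t")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = dict(
        field.split("=", 1) for field in attrs.split(";") if "=" in field
    )
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "strand": strand, "attributes": attributes,
    }

# Hypothetical annotation line for a predicted coding sequence on a contig.
example = "contig_1\tProdigal\tCDS\t10\t420\t.\t+\t0\tID=gene_1;product=hypothetical protein"
feature = parse_gff_line(example)
print(feature["type"], feature["start"], feature["attributes"]["ID"])
```

Iterating such a parser over a downloaded gff file gives a simple starting point for custom downstream analyses of the annotations.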

mSystems ◽  
2021 ◽  
Vol 6 (3) ◽  
Author(s):  
Alicia Clum ◽  
Marcel Huntemann ◽  
Brian Bushnell ◽  
Brian Foster ◽  
Bryce Foster ◽  
...  

ABSTRACT The DOE Joint Genome Institute (JGI) Metagenome Workflow performs metagenome data processing, including assembly; structural, functional, and taxonomic annotation; and binning of metagenomic data sets that are subsequently included into the Integrated Microbial Genomes and Microbiomes (IMG/M) (I.-M. A. Chen, K. Chu, K. Palaniappan, A. Ratner, et al., Nucleic Acids Res, 49:D751–D763, 2021, https://doi.org/10.1093/nar/gkaa939) comparative analysis system and provided for download via the JGI data portal (https://genome.jgi.doe.gov/portal/). This workflow scales to run on thousands of metagenome samples per year, which can vary by the complexity of microbial communities and sequencing depth. Here, we describe the different tools, databases, and parameters used at different steps of the workflow to help with the interpretation of metagenome data available in IMG and to enable researchers to apply this workflow to their own data. We use 20 publicly available sediment metagenomes to illustrate the computing requirements for the different steps and highlight the typical results of data processing. The workflow modules for read filtering and metagenome assembly are available as a workflow description language (WDL) file (https://code.jgi.doe.gov/BFoster/jgi_meta_wdl). The workflow modules for annotation and binning are provided as a service to the user community at https://img.jgi.doe.gov/submit and require filling out the project and associated metadata descriptions in the Genomes OnLine Database (GOLD) (S. Mukherjee, D. Stamatis, J. Bertsch, G. Ovchinnikova, et al., Nucleic Acids Res, 49:D723–D733, 2021, https://doi.org/10.1093/nar/gkaa983). IMPORTANCE The DOE JGI Metagenome Workflow is designed for processing metagenomic data sets starting from Illumina fastq files. It performs data preprocessing, error correction, assembly, structural and functional annotation, and binning. 
The results of processing are provided in several standard formats, such as fasta and gff, and can be used for subsequent integration into the Integrated Microbial Genomes and Microbiomes (IMG/M) system where they can be compared to a comprehensive set of publicly available metagenomes. As of 30 July 2020, 7,155 JGI metagenomes have been processed by the DOE JGI Metagenome Workflow. Here, we present a metagenome workflow developed at the JGI that generates rich data in standard formats and has been optimized for downstream analyses ranging from assessment of the functional and taxonomic composition of microbial communities to genome-resolved metagenomics and the identification and characterization of novel taxa. This workflow is currently being used to analyze thousands of metagenomic data sets in a consistent and standardized manner.
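Binning, mentioned above as the basis of genome-resolved metagenomics, typically clusters contigs by nucleotide composition (for example, tetranucleotide frequencies) combined with read depth. The toy sketch below is purely illustrative, not the workflow's actual binning tool; it only shows why composition separates contigs of different origin:

```python
from collections import Counter
from itertools import product

def tetra_freqs(seq):
    """Normalized tetranucleotide frequency vector over all 256 4-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = max(len(seq) - 3, 1)
    return [counts[k] / total for k in kmers]

def distance(a, b):
    """Euclidean distance between two composition vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Toy contigs: two AT-rich repeats and one GC-rich repeat.  A real binner
# combines this composition signal with per-contig read depth.
at_contig_1 = "AT" * 100
at_contig_2 = "TA" * 100
gc_contig = "GC" * 100
f1, f2, f3 = map(tetra_freqs, (at_contig_1, at_contig_2, gc_contig))
# The two AT-rich contigs lie far closer to each other than to the GC-rich one.
```

Clustering contigs in this feature space (plus depth) is what groups fragments into draft genome bins.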


2021 ◽  
Vol 22 (S10) ◽  
Author(s):  
Zhenmiao Zhang ◽  
Lu Zhang

Abstract Background Due to the complexity of microbial communities, de novo assembly of next-generation sequencing data is commonly unable to produce complete microbial genomes. Metagenome assembly binning therefore becomes an essential step that groups the fragmented contigs into clusters representing microbial genomes, based on the contigs’ nucleotide compositions and read depths. These features work well for long contigs but are not stable for short ones. Contigs can be linked by sequence overlap (assembly graph) or by the paired-end reads aligned to them (PE graph), and linked contigs have a high chance of being derived from the same clusters. Results We developed METAMVGL, a multi-view graph-based metagenomic contig binning algorithm that integrates both the assembly and PE graphs. It can strikingly rescue short contigs and correct binning errors caused by dead ends. METAMVGL learns the two graphs’ weights automatically and predicts the contig labels in a uniform multi-view label propagation framework. In experiments, we observed that METAMVGL made use of significantly more high-confidence edges from the combined graph and linked dead ends to the main graph. It also outperformed many state-of-the-art contig binning algorithms, including MaxBin2, MetaBAT2, MyCC, CONCOCT, SolidBin, and GraphBin, on metagenomic sequencing data from simulation, two mock communities, and Sharon infant fecal samples. Conclusions Our findings demonstrate that METAMVGL markedly improves short-contig binning and outperforms the other existing contig binning tools on metagenomic sequencing data from simulation, mock communities, and infant fecal samples.
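The multi-view propagation idea can be sketched in a few lines. This toy version is illustrative only — the node names and the fixed per-view weights are invented, whereas METAMVGL learns the view weights automatically and uses a more principled propagation — but it shows how labels spread from seed contigs across both graphs:

```python
from collections import defaultdict

def propagate_labels(edges_by_view, weights, seeds, n_iter=20):
    """Toy multi-view label propagation: each unlabeled contig adopts the
    label with the highest weighted vote from its already-labeled neighbors,
    where edges from different views (assembly graph, PE graph) carry
    different weights.  Seed labels stay fixed."""
    adj = defaultdict(list)
    for view, edges in edges_by_view.items():
        for u, v in edges:
            adj[u].append((v, weights[view]))
            adj[v].append((u, weights[view]))
    labels = dict(seeds)
    for _ in range(n_iter):
        updates = {}
        for node, nbrs in adj.items():
            if node in labels:
                continue
            votes = defaultdict(float)
            for nbr, w in nbrs:
                if nbr in labels:
                    votes[labels[nbr]] += w
            if votes:
                updates[node] = max(votes, key=votes.get)
        if not updates:
            break
        labels.update(updates)
    return labels

# Two views over five contigs: an assembly-graph overlap and paired-end links.
edges = {"assembly": [("c1", "c2")], "pe": [("c2", "c3"), ("c4", "c5")]}
seeds = {"c1": "binA", "c5": "binB"}
result = propagate_labels(edges, {"assembly": 0.6, "pe": 0.4}, seeds)
```

Note how c3, connected only through the PE graph, still receives a label — the kind of short-contig rescue the paper describes.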


2020 ◽  
Vol 245 ◽  
pp. 06016
Author(s):  
Benjamin Edward Krikler ◽  
Olivier Davignon ◽  
Lukasz Kreczko ◽  
Jacob Linacre

The Faster Analysis Software Taskforce (FAST) is a small, European group of HEP researchers that have been investigating and developing modern software approaches to improve HEP analyses. We present here an overview of the key product of this effort: a set of packages that allows a complete implementation of an analysis using almost exclusively YAML files. Serving as an analysis description language (ADL), this toolset builds on top of the evolving technologies from the Scikit-HEP and IRIS-HEP projects as well as industry-standard libraries such as Pandas and Matplotlib. Data processing starts with event-level data (the trees) and can proceed by adding variables, selecting events, performing complex user-defined operations and binning data, as defined in the YAML description. The resulting outputs (the tables) are stored as Pandas dataframes which can be programmatically manipulated and converted to plots or inputs for fitting frameworks. No longer just a proof-of-principle, these tools are now being used in CMS analyses, the LUX-ZEPLIN experiment, and by students on several other experiments. In this talk we will showcase these tools through examples, highlighting how they address the different experiments’ needs, and compare them to other similar approaches.
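To give a flavor of the YAML-driven approach, here is a deliberately simplified, stdlib-only sketch: a dict stands in for a parsed YAML stage that selects events and bins a variable into counts. The config keys are invented for illustration and are not the real fast-carpenter schema, and the real toolset operates on Pandas dataframes rather than lists of dicts:

```python
# Hypothetical stage description, as it might come from yaml.safe_load().
config = {
    "select": {"variable": "pt", "min": 20.0},
    "binning": {"variable": "eta", "edges": [-2.5, 0.0, 2.5]},
}

# Toy event-level data ("the trees").
events = [
    {"pt": 35.0, "eta": -1.2},
    {"pt": 10.0, "eta": 0.4},   # fails the pt selection
    {"pt": 50.0, "eta": 1.9},
    {"pt": 22.0, "eta": -0.3},
]

def run_stage(events, config):
    """Apply the config-defined selection, then histogram the binned variable."""
    sel = config["select"]
    passed = [e for e in events if e[sel["variable"]] >= sel["min"]]
    edges = config["binning"]["edges"]
    var = config["binning"]["variable"]
    counts = [0] * (len(edges) - 1)
    for e in passed:
        for i in range(len(edges) - 1):
            if edges[i] <= e[var] < edges[i + 1]:
                counts[i] += 1
    return counts

print(run_stage(events, config))  # → [2, 1]
```

The appeal of the approach is that the dict above lives in a version-controlled YAML file, so the analysis definition is declarative and the processing code stays generic.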


2010 ◽  
Vol 36 ◽  
pp. 103-108
Author(s):  
Kuan Fang He ◽  
Ping Yu Zhu ◽  
Xue Jun Li ◽  
D.M. Xiao

Distributed optical fiber sensing technology has been applied in dam safety monitoring. The most typical instrument is the Swiss DiTeSt-STA202 distributed optical fiber analyzer. It can simultaneously measure strain and temperature at thousands of points along a fiber and is suitable for measuring objects up to several kilometers long, but its functions are limited to data acquisition and the most primitive conversion of strain and temperature, without deeper data processing capabilities, and its interface is not user-friendly and adapts poorly. In this paper, a secondary development of the DiTeSt-STA202 is carried out with Delphi and ADO technology, based on an analysis of the data storage types and formats. The resulting data processing and analysis system works in an efficient and accurate way.
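Deeper processing than the analyzer's raw output starts from the measured Brillouin frequency shift, which varies approximately linearly with both strain and temperature. The sketch below separates the two effects using a loose (strain-free) reference fiber; the coefficients are typical order-of-magnitude values for standard single-mode fiber, assumed here for illustration rather than taken from the DiTeSt-STA202 calibration:

```python
# Assumed linear sensitivities (illustrative, not instrument calibration).
C_STRAIN = 0.05   # MHz per microstrain
C_TEMP = 1.0      # MHz per degree Celsius

def solve_strain_temp(shift_sensing_mhz, shift_loose_mhz):
    """Separate strain from temperature: the loose reference fiber is
    strain-free, so its frequency shift reflects temperature only; the
    sensing fiber's shift is the sum of both contributions."""
    delta_t = shift_loose_mhz / C_TEMP
    strain = (shift_sensing_mhz - C_TEMP * delta_t) / C_STRAIN
    return strain, delta_t

# A 15 MHz shift on the sensing fiber with a 5 MHz shift on the loose fiber
# corresponds to 5 degrees of warming plus 200 microstrain.
strain, dt = solve_strain_temp(15.0, 5.0)
```

A secondary-development layer of the kind described in the paper would apply such conversions point by point along the fiber after reading the stored measurement records.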


2020 ◽  
Author(s):  
Iris Xu

Abstract As a reliable and high-throughput proteomics strategy, data-independent acquisition (DIA) has shown great potential for protein analysis. However, DIA also stresses the data processing algorithms by generating complex multiplexed spectra. Traditionally, DIA data are processed using spectral libraries refined from experiment histories, which requires stable experimental conditions and additional runs. Furthermore, scientists still need library-free tools to generate spectral libraries from those additional runs. To lessen those burdens, here we present DIAFree (https://github.com/xuesu/DIAFree), a library-free, tag-index-based software suite that enables both restricted search and open search on DIA data, using the information in MS1 scans in a precursor-centric and spectrum-centric style. We validate the quality of detection on publicly available data. We further evaluate the quality of the spectral libraries produced by DIAFree.
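One building block of any DIA pipeline, library-free or not, is relating precursor m/z values observed in MS1 scans to the fixed isolation windows in which their fragments were co-acquired. A minimal sketch follows, with an invented window scheme (25 m/z windows spanning 400 to 1200) standing in for any real acquisition method — this is generic DIA bookkeeping, not DIAFree's actual algorithm:

```python
def make_windows(start, stop, width):
    """Non-overlapping isolation windows [lo, hi) of equal width."""
    windows = []
    lo = start
    while lo < stop:
        windows.append((lo, min(lo + width, stop)))
        lo += width
    return windows

def window_for(mz, windows):
    """Index of the isolation window containing a precursor m/z, or None."""
    for i, (lo, hi) in enumerate(windows):
        if lo <= mz < hi:
            return i
    return None

windows = make_windows(400.0, 1200.0, 25.0)
idx = window_for(523.3, windows)  # falls in [500, 525) → index 4
```

A precursor-centric search then scores each MS1 feature only against the multiplexed MS2 spectra acquired in its window, which keeps the combinatorics of the complex spectra manageable.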

