Developing a Reproducible Microbiome Data Analysis Pipeline Using the Amazon Web Services Cloud for a Cancer Research Group: Proof-of-Concept Study (Preprint)

Background: Cloud computing for microbiome data sets can significantly increase working efficiencies and expedite the translation of research findings into clinical practice. The Amazon Web Services (AWS) cloud provides an invaluable option for microbiome data storage, computation, and analysis. Objective: The goals of this study were to develop a microbiome data analysis pipeline using the AWS cloud and to conduct a proof-of-concept test of microbiome data storage, processing, and analysis. Methods: A multidisciplinary team was formed to develop and test a reproducible microbiome data analysis pipeline with multiple AWS cloud services that could be used for storage, computation, and data analysis. The pipeline developed in AWS was tested using two data sets: 19 vaginal microbiome samples and 50 gut microbiome samples. Results: Using AWS features, we developed a microbiome data analysis pipeline that included Amazon Simple Storage Service for microbiome sequence storage, Linux Elastic Compute Cloud (EC2) instances (ie, servers) for data computation and analysis, and security keys to create and manage the use of encryption for the pipeline. Bioinformatics and statistical tools (ie, Quantitative Insights Into Microbial Ecology 2 and RStudio) were installed within the Linux EC2 instances to run microbiome statistical analysis. The pipeline was run through command-line interfaces within the Linux operating system or the Mac operating system. Using this new pipeline, we were able to successfully process and analyze 50 gut microbiome samples within 4 hours at a very low cost (a c4.4xlarge EC2 instance costs $0.80 per hour). Gut microbiome findings regarding diversity, taxonomy, and abundance analyses were easily shared within our research team. Conclusions: Building a microbiome data analysis pipeline with the AWS cloud is feasible. This pipeline is highly reliable, computationally powerful, and cost effective. Our AWS-based microbiome analysis pipeline provides an efficient tool to conduct microbiome data analysis.
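
As a rough check on the cost figure quoted in the abstract, the compute charge can be estimated as run time multiplied by the hourly instance rate. The sketch below is illustrative only: the function name `estimate_compute_cost` is hypothetical, the $0.80 rate is the one stated in the abstract (current pricing may differ), and storage, data transfer, and billing granularity are deliberately ignored.

```python
def estimate_compute_cost(hours: float, hourly_rate_usd: float) -> float:
    """Estimate the on-demand compute charge for a single instance.

    Assumption: cost is simply hours * hourly rate. Storage, data transfer,
    and per-second or per-hour billing rules are not modeled here.
    """
    if hours < 0 or hourly_rate_usd < 0:
        raise ValueError("hours and hourly_rate_usd must be non-negative")
    return hours * hourly_rate_usd


# Figures taken from the abstract: 4 hours on an instance billed at $0.80/hour.
cost = estimate_compute_cost(hours=4, hourly_rate_usd=0.80)
print(f"Estimated compute cost: ${cost:.2f}")  # prints "Estimated compute cost: $3.20"
```

Under these assumptions, processing the 50 gut microbiome samples would cost on the order of a few dollars of compute time.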

Developing a Reproducible Microbiome Data Analysis Pipeline Using the Amazon Web Services Cloud for a Cancer Research Group: Proof-of-Concept Study

JMIR Medical Informatics ◽

10.2196/14667 ◽

2019 ◽

Vol 7 (4) ◽

pp. e14667 ◽

Cited By ~ 1

Author(s):

Jinbing Bai ◽

Ileen Jhaney ◽

Jessica Wells

Keyword(s):

Data Analysis ◽

Data Storage ◽

Gut Microbiome ◽

Data Sets ◽

Proof Of Concept ◽

Analysis Pipeline ◽

Amazon Web Services ◽

Microbiome Data ◽

Data Analysis Pipeline ◽

Microbiome Data Analysis

xCELLanalyzer: A Framework for the Analysis of Cellular Impedance Measurements for Mode of Action Discovery

SLAS DISCOVERY Advancing Life Sciences ◽

10.1177/2472555218819459 ◽

2019 ◽

Vol 24 (3) ◽

pp. 213-223 ◽

Cited By ~ 1

Author(s):

Raimo Franke ◽

Bettina Hinkelmann ◽

Verena Fetz ◽

Theresia Stradal ◽

Florenz Sasse ◽

...

Keyword(s):

Data Analysis ◽

Bioactive Compounds ◽

Mode Of Action ◽

Mammalian Cells ◽

Cellular Response ◽

Label Free ◽

Synthesis Inhibitor ◽

Analysis Pipeline ◽

Bioactive Natural Products ◽

Data Analysis Pipeline

Mode of action (MoA) identification of bioactive compounds is often a challenging and time-consuming task. We used a label-free kinetic profiling method based on an impedance readout to monitor the time-dependent cellular response profiles for the interaction of bioactive natural products and other small molecules with mammalian cells. Such approaches have rarely been used so far because data mining tools that properly capture the characteristics of the impedance curves have been lacking. We developed a data analysis pipeline for the xCELLigence Real-Time Cell Analysis detection platform to process the data, assess and score their reproducibility, and provide rank-based MoA predictions for a reference set of 60 bioactive compounds. The method can reveal additional, previously unknown targets, as exemplified by the identification of tubulin-destabilizing activities of the RNA synthesis inhibitor actinomycin D and the effects on DNA replication of vioprolide A. The data analysis pipeline is based on the statistical programming language R and is available to the scientific community through a GitHub repository.

Design and implementation of a data analysis pipeline for paper spray ionization mass spectrometry of UO2Cl2

10.2172/1498917 ◽

2018 ◽

Author(s):

Samuel Koby ◽

Joseph Mannion ◽

Joshua Hewitt ◽

Matthew Wellons

Keyword(s):

Mass Spectrometry ◽

Data Analysis ◽

Ionization Mass Spectrometry ◽

Ionization Mass ◽

Analysis Pipeline ◽

Paper Spray ◽

Paper Spray Ionization ◽

Design And Implementation ◽

Data Analysis Pipeline

A De Novo-Assembly Based Data Analysis Pipeline for Plant Obligate Parasite Metatranscriptomic Studies

Frontiers in Plant Science ◽

10.3389/fpls.2016.00925 ◽

2016 ◽

Vol 7 ◽

Cited By ~ 5

Author(s):

Li Guo ◽

Kelly S. Allen ◽

Greg Deiulio ◽

Yong Zhang ◽

Angela M. Madeiras ◽

...

Keyword(s):

Data Analysis ◽

De Novo Assembly ◽

De Novo ◽

Analysis Pipeline ◽

Obligate Parasite ◽

Data Analysis Pipeline

Nephele: a cloud platform for simplified, standardized and reproducible microbiome data analysis

Bioinformatics ◽

10.1093/bioinformatics/btx617 ◽

2017 ◽

Vol 34 (8) ◽

pp. 1411-1413 ◽

Cited By ~ 22

Author(s):

Nick Weber ◽

David Liou ◽

Jennifer Dommer ◽

Philip MacMenamin ◽

Mariam Quiñones ◽

...

Keyword(s):

Data Analysis ◽

Cloud Platform ◽

Microbiome Data ◽

Microbiome Data Analysis

A Wavelet, Fourier, and PCA Data Analysis Pipeline: Application to Distinguishing Mixtures of Liquids.

ChemInform ◽

10.1002/chin.200321210 ◽

2003 ◽

Vol 34 (21) ◽

Author(s):

Muenevver Koekueer ◽

Fionn Murtagh ◽

Norman D. McMillan ◽

Sven Riedel ◽

Brian O'Rourke ◽

...

Keyword(s):

Data Analysis ◽

Analysis Pipeline ◽

Data Analysis Pipeline

miND pipeline AWS EC2 installation and setup v2

10.17504/protocols.io.b3f6qjre ◽

2022 ◽

Author(s):

Andreas B Diendorfer ◽

Kseniya Khamina ◽

Marianne Pultar

Keyword(s):

Data Analysis ◽

Public Repository ◽

Sequencing Data ◽

Analysis Pipeline ◽

Ngs Data Analysis ◽

Ngs Data ◽

Data Analysis Pipeline

miND is an NGS data analysis pipeline for small RNA sequencing data. In this protocol, the pipeline is set up and run on an AWS EC2 instance with example data from a public repository. Please see the publication on F1000 for more details on the pipeline and how to use it.

Cascabel: A Scalable and Versatile Amplicon Sequence Data Analysis Pipeline Delivering Reproducible and Documented Results

Frontiers in Genetics ◽

10.3389/fgene.2020.489357 ◽

2020 ◽

Vol 11 ◽

Author(s):

Alejandro Abdala Asbun ◽

Marc A. Besseling ◽

Sergio Balzano ◽

Judith D. L. van Bleijswijk ◽

Harry J. Witte ◽

...

Keyword(s):

Data Analysis ◽

Sequence Data ◽

Single Gene ◽

Marker Gene ◽

Gene Sequencing ◽

Data Generation ◽

Clustering Methods ◽

Analysis Pipeline ◽

Data Analysis Pipeline ◽

Marker Gene Sequencing

Marker gene sequencing of the rRNA operon (16S, 18S, ITS) or cytochrome c oxidase I (CO1) is a popular means to assess microbial communities of the environment, microbiomes associated with plants and animals, as well as communities of multicellular organisms via environmental DNA sequencing. Since this technique is based on sequencing a single gene, or even only parts of a single gene rather than the entire genome, the number of reads needed per sample to assess the microbial community structure is lower than that required for metagenome sequencing. This makes marker gene sequencing affordable to nearly any laboratory. Despite the relative ease and cost-efficiency of data generation, analyzing the resulting sequence data requires computational skills that may go beyond the standard repertoire of today's molecular biologists and ecologists. We have developed Cascabel, a scalable, flexible, and easy-to-use amplicon sequence data analysis pipeline, which uses Snakemake and a combination of existing and newly developed solutions for its computational steps. Cascabel takes the raw data as input and delivers a table of operational taxonomic units (OTUs) or amplicon sequence variants (ASVs) in BIOM and text format, together with representative sequences. Cascabel is highly versatile software that allows users to customize several steps of the pipeline, such as selecting from a set of OTU clustering methods or performing ASV analysis. In addition, we designed Cascabel to run in any Linux/Unix computing environment, from desktop computers to computing servers, making use of parallel processing where possible. The analyses and results are fully reproducible and documented in an HTML report and an optional PDF report. Cascabel is freely available on GitHub: https://github.com/AlejandroAb/CASCABEL.

MHSNMF: multi-view hessian regularization based symmetric nonnegative matrix factorization for microbiome data analysis

BMC Bioinformatics ◽

10.1186/s12859-020-03555-w ◽

2020 ◽

Vol 21 (S6) ◽

Author(s):

Yuanyuan Ma ◽

Junmin Zhao ◽

Yingjun Ma

Keyword(s):

Data Analysis ◽

Matrix Factorization ◽

Nonnegative Matrix Factorization ◽

Rapid Development ◽

Nonnegative Matrix ◽

Metabolomics Data ◽

Normalized Mutual Information ◽

Microbiome Data ◽

Symmetric Nonnegative Matrix Factorization ◽

Microbiome Data Analysis

Background: With the rapid development of high-throughput techniques, multiple heterogeneous omics data (e.g., genomics, proteomics, and metabolomics data) have accumulated rapidly. Integrating information from multiple sources or views to obtain profound insight into the complicated relations among micro-organisms, nutrients, and host environment is challenging. In this paper we propose a multi-view Hessian regularization based symmetric nonnegative matrix factorization algorithm (MHSNMF) for clustering heterogeneous microbiome data. Compared with many existing approaches, the advantages of MHSNMF are as follows: (1) MHSNMF combines multiple Hessian regularization to leverage the high-order information from the same cohort of instances with multiple representations; (2) MHSNMF utilizes the advantages of SNMF and naturally handles the complex relationships among microbiome samples; (3) using the consensus matrix obtained by MHSNMF, we also design a novel approach to predict the classification of new microbiome samples. Results: We conduct extensive experiments on two real-world datasets (Three-source dataset and Human Microbiome Plan dataset); the experimental results show that the proposed MHSNMF algorithm outperforms other baseline and state-of-the-art methods. Compared with other methods, MHSNMF achieves the best performance (accuracy: 95.28%, normalized mutual information: 91.79%) on microbiome data. This suggests the potential application of MHSNMF in microbiome data analysis. Conclusions: Results show that the proposed MHSNMF algorithm can effectively combine the phylogenetic, transporter, and metabolic profiles into a unified paradigm to analyze the relationships among different microbiome samples. Furthermore, the proposed prediction method based on MHSNMF has been shown to be effective in judging the types of new microbiome samples.
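
The abstract above reports clustering quality as normalized mutual information (NMI). For readers unfamiliar with the metric, the following is a minimal sketch of one common definition, in which the mutual information between predicted and reference labels is divided by the arithmetic mean of their entropies. This normalization is an assumption made for illustration; the abstract does not say which NMI variant the authors used, and the function names and example labels are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same samples."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def normalized_mutual_information(a, b):
    """NMI using the arithmetic mean of the two entropies (assumed variant)."""
    if len(a) != len(b):
        raise ValueError("label sequences must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both labelings put every sample in one cluster
    return mutual_information(a, b) / ((h_a + h_b) / 2)


# Hypothetical labels: the clustering matches the reference up to renaming.
reference = [0, 0, 1, 1]
predicted = [1, 1, 0, 0]
print(normalized_mutual_information(reference, predicted))
```

NMI ignores how cluster labels are named, so a clustering that reproduces the reference partition scores 1 even if the label numbers are swapped, while a partition that is statistically independent of the reference scores 0.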

Changes in Gut Microbiome Associated With Co-Occurring Symptoms Development During Chemo-Radiation for Rectal Cancer: A Proof of Concept Study

Biological Research For Nursing ◽

10.1177/1099800420942830 ◽

2020 ◽

Vol 23 (1) ◽

pp. 31-41

Author(s):

Velda J. González-Mercado ◽

Wendy A. Henderson ◽

Anujit Sarkar ◽

Jean Lim ◽

Leorey N. Saligan ◽

...

Keyword(s):

Rectal Cancer ◽

Gut Microbiome ◽

Learning Algorithm ◽

Descriptive Statistics ◽

Proof Of Concept ◽

Random Forest Classification ◽

Forest Classification ◽

Symptom Ratings ◽

Stool Samples ◽

Microbiome Data

Purpose: To examine a) whether there are significant differences in the severity of symptoms of fatigue, sleep disturbance, or depression between patients with rectal cancer who develop co-occurring symptoms and those with no symptoms before and at the end of chemotherapy and radiation therapy (CRT); b) differences in gut microbial diversity between those with co-occurring symptoms and those with no symptoms; and c) whether before-treatment diversity measurements and taxa abundances can predict co-occurrence of symptoms. Methods: Stool samples and symptom ratings were collected from 31 patients with rectal cancer prior to and at the end of (24–28 treatments) CRT. Descriptive statistics were computed and the Mann-Whitney U test was performed for symptoms. Gut microbiome data were analyzed using R’s vegan package. Results: Participants with co-occurring symptoms reported greater severity of fatigue at the end of CRT than those with no symptoms. Bacteroides and Blautia2 abundances differed between participants with co-occurring symptoms and those with no symptoms. Our random forest classification (unsupervised learning algorithm) predicted participants who developed co-occurring symptoms with 74% accuracy, using specific phylum, family, and genera abundances as predictors. Conclusion: Our preliminary results point to an association between the gut microbiota and co-occurring symptoms in rectal cancer patients and serve as a first step in potential identification of a microbiota-based classifier.
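
The Methods above mention the Mann-Whitney U test for comparing symptom ratings between two groups. As an illustration of what the statistic measures, here is a minimal sketch that computes the two U values from rank sums, assigning average ranks to ties. It omits the p-value, and the function names and symptom scores are hypothetical; it is not the authors' analysis code.

```python
def midranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u_y = n1 * n2 - u_x
    return u_x, u_y


# Hypothetical fatigue scores for two groups of participants.
no_symptoms = [1, 2, 3]
co_occurring = [4, 5, 6]
print(mann_whitney_u(no_symptoms, co_occurring))  # prints (0.0, 9.0)
```

A U value of 0 for one group means every observation in that group ranks below every observation in the other, which is the most extreme separation possible for these sample sizes.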
