microeco: An R package for data mining in microbial community ecology

Author(s):  
Chi Liu ◽  
Yaoming Cui ◽  
Xiangzhen Li ◽  
Minjie Yao

Abstract: High-throughput sequencing produces large amounts of data in microbial community ecology studies, especially amplicon-sequencing-based community data. After the initial bioinformatic analysis of amplicon sequencing data, performing the subsequent statistics and data mining based on the operational taxonomic unit and taxonomic assignment tables is still complicated and time-consuming. To address this problem, we present an integrated R package, 'microeco', as an analysis pipeline for microbial community and environmental data. This package was developed based on the R6 class system and combines a series of commonly used and advanced approaches in microbial community ecology research. The package includes classes for data preprocessing, taxa abundance plotting, Venn diagrams, alpha diversity analysis, beta diversity analysis, differential abundance testing and indicator taxon analysis, environmental data analysis, null model analysis, network analysis and functional analysis. Each class is designed to provide a set of approaches that are easily accessible to users. Compared with other R packages in the microbial ecology field, the microeco package is fast, flexible and modular, and provides powerful and convenient tools for researchers. The microeco package can be installed from CRAN (The Comprehensive R Archive Network) or GitHub (https://github.com/ChiLiubio/microeco).

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ellen S. Cameron ◽  
Philip J. Schmidt ◽  
Benjamin J.-M. Tremblay ◽  
Monica B. Emelko ◽  
Kirsten M. Müller

Abstract: Amplicon sequencing has revolutionized our ability to study DNA collected from environmental samples by providing a rapid and sensitive technique for microbial community analysis that eliminates the challenges associated with lab cultivation and taxonomic identification through microscopy. In water resources management, it can be especially useful to evaluate ecosystem shifts in response to natural and anthropogenic landscape disturbances to signal potential water quality concerns, such as the detection of toxic cyanobacteria or pathogenic bacteria. Amplicon sequencing data consist of discrete counts of sequence reads, the sum of which is the library size. Groups of samples typically have different library sizes that are not representative of biological variation; library size normalization is required to meaningfully compare diversity between them. Rarefaction is a widely used normalization technique that involves the random subsampling of sequences from the initial sample library to a selected normalized library size. This process is often dismissed as statistically invalid because subsampling effectively discards a portion of the observed sequences, yet it remains prevalent in practice and the suitability of rarefying, relative to many other normalization approaches, for diversity analysis has been argued. Here, repeated rarefying is proposed as a tool to normalize library sizes for diversity analyses. This enables (i) proportionate representation of all observed sequences and (ii) characterization of the random variation introduced to diversity analyses by rarefying to a smaller library size shared by all samples. While many deterministic data transformations are not tailored to produce equal library sizes, repeatedly rarefying reflects the probabilistic process by which amplicon sequencing data are obtained as a representation of the amplified source microbial community. Specifically, it evaluates which data might have been obtained if a particular sample’s library size had been smaller and allows graphical representation of the effects of this library size normalization process upon diversity analysis results.
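As an illustration of the idea described in this abstract, the following is a minimal Python sketch of repeated rarefying for a single sample. It is not the authors' implementation: the function names (`rarefy`, `shannon`, `repeated_rarefy_shannon`), the choice of the Shannon index as the diversity metric, and the default of 100 repeats are all assumptions made for the example. Each repeat subsamples reads without replacement to a common library size, so the spread of the resulting index values reflects the random variation introduced by rarefying.

```python
import math
import random
from collections import Counter


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from a list of taxon counts."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the sample's library size")
    # Expand the counts into one taxon label per read, then draw without replacement.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    drawn = Counter(rng.sample(reads, depth))
    return [drawn.get(taxon, 0) for taxon in range(len(counts))]


def shannon(counts):
    """Plug-in Shannon index (natural log) of a list of counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def repeated_rarefy_shannon(counts, depth, n_repeats=100, seed=0):
    """Return one Shannon index per rarefied replicate (hypothetical helper)."""
    rng = random.Random(seed)
    return [shannon(rarefy(counts, depth, rng)) for _ in range(n_repeats)]
```

The returned list can then be summarized, for example with `statistics.mean` and `statistics.stdev`, or plotted to show how much a diversity estimate varies across rarefied replicates at the chosen library size.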


2019 ◽  
Author(s):  
Igor Segota ◽  
Tao Long

We developed a High-resolution Microbial Analysis Pipeline (HiMAP) for 16S amplicon sequencing data analysis, aiming at bacterial species- or strain-level identification from the human microbiome to enable experimental validation of causal effects of the associated bacterial strains on health and disease. HiMAP achieved higher accuracy in identifying species in a human microbiome mock community than other pipelines. HiMAP identified the majority of the species detected by whole genome shotgun sequencing using MetaPhlAn2, with strain-level resolution wherever possible, and reported comparable relative abundances. HiMAP is an open-source R package available at https://github.com/taolonglab/himap.


2021 ◽  
Author(s):  
Philip J Schmidt ◽  
Ellen S Cameron ◽  
Kirsten M Müller ◽  
Monica B Emelko

Diversity analysis of amplicon sequencing data is mainly limited to plug-in estimates calculated using normalized data to obtain a single value of an alpha diversity metric or a single point on a beta diversity ordination plot for each sample. As recognized for count data generated using classical microbiological methods, read counts obtained from a sample are random data linked to source properties by a probabilistic process. Thus, diversity analysis has focused on diversity of (normalized) samples rather than probabilistic inference about source diversity. This study applies fundamentals of statistical analysis for quantitative microbiology (e.g., microscopy, plating, most probable number methods) to sample collection and processing procedures of amplicon sequencing methods to facilitate inference reflecting the probabilistic nature of such data and evaluation of uncertainty in diversity metrics. Types of random error are described and clustering of microorganisms in the source, differential analytical recovery during sample processing, and amplification are found to invalidate a multinomial relative abundance model. The zeros often abounding in amplicon sequencing data and their implications are addressed, and Bayesian analysis is applied to estimate the source Shannon index given unnormalized data (both simulated and real). Inference about source diversity is found to require knowledge of the exact number of unique variants in the source, which is practically unknowable due to library size limitations and the inability to differentiate zeros corresponding to variants that are actually absent in the source from zeros corresponding to variants that were merely not detected. Given these problems with estimation of diversity in the source even when the basic multinomial model is valid, sample-level diversity analysis approaches are discussed.
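For reference, the plug-in estimate of the Shannon index mentioned in this abstract can be written as follows. The notation is an assumption introduced here, not taken from the paper: $x_i$ is the read count of variant $i$ in a sample, $S$ is the number of variants observed, and $N$ is the library size.

```latex
\hat{p}_i = \frac{x_i}{N}, \qquad N = \sum_{i=1}^{S} x_i, \qquad
\hat{H}' = -\sum_{i=1}^{S} \hat{p}_i \ln \hat{p}_i
```

The estimate substitutes the observed proportions $\hat{p}_i$ for the unknown source proportions, which is why it describes the diversity of the (normalized) sample rather than providing probabilistic inference about the source.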


2020 ◽  
Author(s):  
Ellen S. Cameron ◽  
Philip J. Schmidt ◽  
Benjamin J.-M. Tremblay ◽  
Monica B. Emelko ◽  
Kirsten M. Müller

Abstract: The application of amplicon sequencing in water research provides a rapid and sensitive technique for microbial community analysis in a variety of environments ranging from freshwater lakes to water and wastewater treatment plants. It has revolutionized our ability to study DNA collected from environmental samples by eliminating the challenges associated with lab cultivation and taxonomic identification. DNA sequencing data consist of discrete counts of sequence reads, the total number of which is the library size. Samples may have different library sizes and thus, a normalization technique is required to meaningfully compare them. Rarefying, the process of randomly subsampling sequences from the sample library to a selected normalized library size, is one such normalization technique. However, rarefying has been criticized as a normalization technique because data can be omitted through the exclusion of either excess sequences or entire samples, depending on the rarefied library size selected. Although it has been suggested that rarefying should be avoided altogether, we propose that repeatedly rarefying enables (i) characterization of the variation introduced to diversity analyses by this random subsampling and (ii) selection of smaller library sizes where necessary to incorporate all samples in the analysis. Rarefying may be a statistically valid normalization technique, but researchers should evaluate their data to make appropriate decisions regarding library size selection and subsampling type. The impacts of normalized library size selection and of rarefying with or without replacement on diversity analyses were evaluated herein.

Highlights
▪ Amplicon sequencing technology for environmental water samples is reviewed
▪ Sequencing data must be normalized to allow comparison in diversity analyses
▪ Rarefying normalizes library sizes by subsampling from observed sequences
▪ Criticisms of data loss through rarefying can be resolved by rarefying repeatedly
▪ Rarefying repeatedly characterizes errors introduced by subsampling sequences
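The two decisions this abstract highlights, library size selection and subsampling type, can be sketched in Python as below. This is an illustrative sketch rather than the authors' code: `subsample` and `rarefy_table` are hypothetical names, and the rule of excluding any sample whose library size is below the chosen depth (even when sampling with replacement) is an assumption made for the example. Sampling without replacement can never return more reads of a taxon than were observed, whereas sampling with replacement draws each read independently in proportion to the observed counts.

```python
import random
from collections import Counter


def subsample(counts, depth, replace, rng):
    """Draw `depth` reads from taxon counts, with or without replacement."""
    taxa = range(len(counts))
    if replace:
        # Each draw is independent, weighted by the observed counts.
        drawn = Counter(rng.choices(taxa, weights=counts, k=depth))
    else:
        # Expand to one label per read so that no read is drawn twice.
        reads = [t for t in taxa for _ in range(counts[t])]
        drawn = Counter(rng.sample(reads, depth))
    return [drawn.get(t, 0) for t in taxa]


def rarefy_table(table, depth, replace=False, seed=0):
    """Rarefy every sample in `table` (name -> counts) to `depth` reads.

    Samples with fewer than `depth` reads are excluded (an assumption of
    this sketch) and reported so the data loss is visible.
    """
    rng = random.Random(seed)
    kept, excluded = {}, []
    for name, counts in table.items():
        if sum(counts) < depth:
            excluded.append(name)
        else:
            kept[name] = subsample(counts, depth, replace, rng)
    return kept, excluded
```

Calling `rarefy_table` with several candidate depths shows the trade-off described above: a larger depth retains more reads per sample but excludes more samples, while a smaller depth keeps every sample at the cost of discarding more reads.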


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 7
Author(s):  
Sebastien Theil ◽  
Etienne Rifa

Bioinformatic tools for marker gene sequencing data analysis are continuously and rapidly evolving, so integrating the most recent techniques and tools is challenging. We present an R package for data analysis of 16S and ITS amplicon-based sequencing. This workflow is based on several R functions and performs automatic processing from fastq sequence files to diversity and differential analysis with statistical validation. The main purpose of this package is to automate bioinformatic analysis, ensure reproducibility between projects, and be flexible enough to quickly integrate new bioinformatic tools or statistical methods. rANOMALY is an easy-to-install and customizable R package that uses the amplicon sequence variant (ASV) level for microbial community characterization. It integrates the assets of the latest bioinformatics methods, such as better sequence tracking, decontamination from control samples, use of multiple reference databases for taxonomic annotation, all main ecological analyses, for which we propose advanced statistical tests, and a cross-validated differential analysis by four different methods. Our package produces ready-to-publish figures, and all of its outputs are made to be integrated into Rmarkdown code to produce automated reports.


2000 ◽  
Vol 46 (3) ◽  
pp. 173-185 ◽  
Author(s):  
Michael G. Kaufman ◽  
Edward D. Walker ◽  
David A. Odelson ◽  
Michael J. Klug

2021 ◽  
Author(s):  
Pengfan Zhang ◽  
Stijn Spaepen ◽  
Yang Bai ◽  
Stephane Hacquard ◽  
Ruben Garrido-Oter

Abstract

Motivation: Synthetic microbial communities (SynComs) constitute an emergent and powerful tool in biological, biomedical, and biotechnological research. Despite recent advances in algorithms for analysis of culture-independent amplicon sequencing data from microbial communities, there is a lack of tools specifically designed for analysing SynCom data, where reference sequences for each strain are available.

Results: Here we present Rbec, a tool designed for analysing SynCom data that outperforms current methods by accurately correcting errors in amplicon sequences and identifying intra-strain polymorphic variation. Extensive evaluation using mock bacterial and fungal communities shows that our tool performs robustly for samples of varying complexity, diversity, and sequencing depth. Further, Rbec also allows accurate detection of contamination in SynCom experiments.

Availability: Rbec is freely available as an open-source R package and can be downloaded at: https://github.com/PengfanZhang/Microbiome.

