‘SRS’ R Package and ‘q2-srs’ QIIME 2 Plugin: Normalization of Microbiome Data Using Scaling with Ranked Subsampling (SRS)

Several ecological data types, especially microbiome count data, are commonly sample-wise normalized before analysis to correct for sampling bias and other technical artifacts. Recently, we developed an algorithm for the normalization of ecological count data called ‘scaling with ranked subsampling (SRS)’, which surpasses the widely adopted ‘rarefying’ (random subsampling without replacement) in reproducibility and in safeguarding the original community structure. Here, we describe an implementation of the SRS algorithm in the ‘SRS’ R package and the ‘q2-srs’ QIIME 2 plugin. We also provide accessory functions for dataset exploration to guide the choice of parameters for SRS.
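
To make the idea of scaling followed by ranked subsampling concrete, the following is a minimal Python sketch, not the code of the ‘SRS’ R package or the ‘q2-srs’ plugin. It assumes the usual reading of the method: counts are scaled so that they sum to a chosen target `c_min`, the integer parts are kept, and the remaining counts are handed out in descending order of the fractional parts. The function name `srs_sketch` and its tie-breaking rule (larger integer part first, then position) are illustrative assumptions; the published algorithm may treat ties differently.

```python
def srs_sketch(counts, c_min):
    """Normalize one sample's counts to a total of c_min (illustrative sketch).

    Hypothetical helper: the name and the tie-breaking rule are assumptions,
    not the API of the SRS R package.
    """
    total = sum(counts)
    if c_min <= 0 or c_min > total:
        raise ValueError("c_min must be positive and no larger than the sample total")

    # Exact integer arithmetic: c * c_min / total = floor part + remainder / total.
    floors = [(c * c_min) // total for c in counts]
    remainders = [(c * c_min) % total for c in counts]

    # Counts still missing after flooring (always fewer than len(counts)).
    missing = c_min - sum(floors)

    # Rank by fractional part, then by integer part, then by position (assumed rule).
    order = sorted(
        range(len(counts)),
        key=lambda i: (-remainders[i], -floors[i], i),
    )
    for i in order[:missing]:
        floors[i] += 1
    return floors


if __name__ == "__main__":
    print(srs_sketch([50, 30, 15, 5], 10))
```

Unlike rarefying, this procedure involves no random draw unless ties have to be broken at random, which is why repeated runs on the same sample give the same result in this sketch.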


Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models

BMC Bioinformatics ◽

10.1186/s12859-021-04221-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Arnaud Liehrmann ◽

Guillem Rigaill ◽

Toby Dylan Hocking

Keyword(s):

Histone Modifications ◽

Count Data ◽

High Throughput Sequencing ◽

Genetic Regulation ◽

Regulation Of Gene Expression ◽

Basic Mechanism ◽

R Package ◽

Detection Accuracy ◽

Full Potential ◽

Segmentation Models

Abstract. Background: Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. In the early 2000s, a powerful technique emerged that couples chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq). This technique provides a direct survey of the DNA regions associated with these modifications. To realize the full potential of this technique, increasingly sophisticated statistical algorithms have been developed or adapted to analyze the massive amount of data it generates. Many of these algorithms were built around natural assumptions, such as a Poisson distribution to model the noise in the count data. In this work we start from these natural assumptions and show that it is possible to improve upon them. Results: Our comparisons on seven reference datasets of histone modifications (H3K36me3 and H3K4me3) suggest that natural assumptions are not always realistic under application conditions. We show that the unconstrained multiple changepoint detection model with alternative noise assumptions and supervised learning of the penalty parameter reduces the over-dispersion exhibited by count data. These models, implemented in the R package CROCS (https://github.com/aLiehrmann/CROCS), detect peaks more accurately than algorithms that rely on natural assumptions. Conclusion: The segmentation models we propose can benefit researchers in the field of epigenetics by providing new high-quality peak prediction tracks for H3K36me3 and H3K4me3 histone modifications.
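
As a small illustration of what over-dispersion means relative to the Poisson assumption mentioned above, the sketch below computes the variance-to-mean ratio of a vector of counts. Under a Poisson model the variance equals the mean, so a ratio well above 1 hints at over-dispersion. This is a generic diagnostic written for illustration only; it is not part of CROCS, and the function name `dispersion_index` is an assumption.

```python
import statistics


def dispersion_index(counts):
    """Return sample variance divided by mean (about 1 for Poisson-like counts).

    Hypothetical helper for illustration; not taken from the CROCS package.
    """
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("dispersion index is undefined when all counts are zero")
    return statistics.variance(counts) / mean


if __name__ == "__main__":
    # A coverage-like profile with one tall peak among zeros.
    print(dispersion_index([0, 0, 0, 20]))
```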


tidyMicro: a pipeline for microbiome data analysis and visualization using the tidyverse in R

BMC Bioinformatics ◽

10.1186/s12859-021-03967-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Charlie M. Carpenter ◽

Daniel N. Frank ◽

Kayla Williamson ◽

Jaron Arbet ◽

Brandie D. Wagner ◽

...

Keyword(s):

Microbial Communities ◽

Open Source ◽

Data Structures ◽

Negative Binomial ◽

Rocky Mountain ◽

R Package ◽

Microbiome Analysis ◽

External Data ◽

Data Tables ◽

Microbiome Data

Abstract. Background: The drive to understand how microbial communities interact with their environments has inspired innovations across many fields. The data generated from sequence-based analyses of microbial communities are typically of high dimensionality and can involve multiple data tables consisting of taxonomic or functional gene/pathway counts. Merging multiple high-dimensional tables with study-related metadata can be challenging. Existing microbiome pipelines available in R have created their own data structures to manage this problem. However, these data structures may be unfamiliar to analysts new to microbiome data or R and do not allow for deviations from internal workflows. Existing analysis tools also focus primarily on community-level analyses and exploratory visualizations, as opposed to analyses of individual taxa. Results: We developed the R package “tidyMicro” to serve as a more complete microbiome analysis pipeline. This open source software provides all of the essential tools available in other popular packages (e.g., management of sequence count tables, standard exploratory visualizations, and diversity inference tools) supplemented with multiple options for regression modelling (e.g., negative binomial, beta binomial, and/or rank-based testing) and novel visualizations to improve interpretability (e.g., Rocky Mountain plots, longitudinal ordination plots). This comprehensive pipeline for microbiome analysis also maintains data structures familiar to R users to improve analysts’ control over workflow. A complete vignette is provided to aid new users in analysis workflow. Conclusions: tidyMicro provides a reliable alternative to popular microbiome analysis packages in R. We provide standard tools as well as novel extensions of standard analyses to improve the interpretability of results while maintaining object malleability to encourage open source collaboration. The simple examples and full workflow from the package are reproducible and applicable to external data sets.


mangal - making ecological network analysis simple

10.1101/002634 ◽

2014 ◽

Cited By ~ 1

Author(s):

Timothée E Poisot ◽

Benjamin Baiser ◽

Jennifer A Dunne ◽

Sonia Kéfi ◽

Francois Massol ◽

...

Keyword(s):

Network Analysis ◽

R Package ◽

Ecological Networks ◽

Ecological Interactions ◽

Ecological Network ◽

Ecological Data ◽

Meta Data ◽

Ecological Network Analysis ◽

Analysis Process ◽

Access Data

The study of ecological networks is severely limited by (i) the difficulty of accessing data, (ii) the lack of a standardized way to link meta-data with interactions, and (iii) the disparity of formats in which ecological networks themselves are represented. To overcome these limitations, we conceived a data specification for ecological networks. We implemented a database respecting this standard, and released an R package (`rmangal`) allowing users to programmatically access, curate, and deposit data on ecological interactions. In this article, we show how these tools, in conjunction with other frameworks for the programmatic manipulation of open ecological data, streamline the analysis process and improve the replicability and reproducibility of ecological network studies.


Missing value imputation for physical activity data measured by accelerometer

Statistical Methods in Medical Research ◽

10.1177/0962280216633248 ◽

2016 ◽

Vol 27 (2) ◽

pp. 490-506 ◽

Cited By ~ 11

Author(s):

Jung Ae Lee ◽

Jeff Gill

Keyword(s):

Physical Activity ◽

Count Data ◽

R Package ◽

Predictive Distribution ◽

Epidemiological Studies ◽

Accelerometer Data ◽

Activity Data ◽

Health And Nutrition ◽

Log Normal ◽

Over Dispersion

An accelerometer, a wearable motion sensor worn on the hip or wrist, is becoming a popular tool in clinical and epidemiological studies for measuring physical activity. Such data provide a series of activity counts every minute or even more often and display a person’s activity pattern throughout the day. Unfortunately, the collected data can include irregular missing intervals because of participant noncompliance, which makes the statistical analysis more challenging. The purpose of this study is to develop a novel imputation method to handle multivariate count data, motivated by the accelerometer data structure. We specify the predictive distribution of the missing data with a mixture of zero-inflated Poisson and log-normal distributions, which is shown to be effective in dealing with the minute-by-minute autocorrelation as well as the under- and over-dispersion of count data. The imputation is performed at the minute level and follows the principles of multiple imputation using a fully conditional specification with the chained algorithm. To facilitate the practical use of this method, we provide an R package, accelmissing. Our method is demonstrated using 2003−2004 National Health and Nutrition Examination Survey data.


A brief introduction to mixed effects modelling and multi-model inference in ecology

PeerJ ◽

10.7717/peerj.4794 ◽

2018 ◽

Vol 6 ◽

pp. e4794 ◽

Cited By ~ 343

Author(s):

Xavier A. Harrison ◽

Lynda Donaldson ◽

Maria Eugenia Correa-Cano ◽

Julian Evans ◽

David N. Fisher ◽

...

Keyword(s):

Best Practice ◽

Statistical Modelling ◽

Mixed Effects ◽

Complex Model ◽

Biological Data ◽

Data Types ◽

Ecological Data ◽

Model Inference ◽

Model Structures ◽

Modelling Process

The use of linear mixed effects models (LMMs) is increasingly common in the analysis of biological data. Whilst LMMs offer a flexible approach to modelling a broad range of data types, ecological data are often complex and require complex model structures, and the fitting and interpretation of such models is not always straightforward. The ability to achieve robust biological inference requires that practitioners know how and when to apply these tools. Here, we provide a general overview of current methods for the application of LMMs to biological data, and highlight the typical pitfalls that can be encountered in the statistical modelling process. We tackle several issues regarding methods of model selection, with particular reference to the use of information theory and multi-model inference in ecology. We offer practical solutions and direct the reader to key references that provide further technical detail for those seeking a deeper understanding. This overview should serve as a widely accessible code of best practice for applying LMMs to complex biological problems and model structures, and in doing so improve the robustness of conclusions drawn from studies investigating ecological and evolutionary questions.


A brief introduction to mixed effects modelling and multi-model inference in ecology

10.7287/peerj.preprints.3113v2 ◽

2018 ◽

Cited By ~ 2

Author(s):

Xavier A Harrison ◽

Lynda Donaldson ◽

Maria Eugenia Correa-Cano ◽

Julian Evans ◽

David N Fisher ◽

...

Keyword(s):

Best Practice ◽

Statistical Modelling ◽

Mixed Effects ◽

Complex Model ◽

Biological Data ◽

Type I ◽

Data Types ◽

Ecological Data ◽

Model Inference ◽

Model Structures

The use of linear mixed effects models (LMMs) is increasingly common in the analysis of biological data. Whilst LMMs offer a flexible approach to modelling a broad range of data types, ecological data are often complex and require complex model structures, and the fitting and interpretation of such models is not always straightforward. The ability to achieve robust biological inference requires that practitioners know how and when to apply these tools. Here, we provide a general overview of current methods for the application of LMMs to biological data, and highlight the typical pitfalls that can be encountered in the statistical modelling process. We tackle several issues relating to the use of information theory and multi-model inference in ecology, and demonstrate the tendency for data dredging to lead to greatly inflated Type I error rate (false positives) and impaired inference. We offer practical solutions and direct the reader to key references that provide further technical detail for those seeking a deeper understanding. This overview should serve as a widely accessible code of best practice for applying LMMs to complex biological problems and model structures, and in doing so improve the robustness of conclusions drawn from studies investigating ecological and evolutionary questions.


Biclique: An R package for Maximal Biclique Enumeration in Bipartite Graphs

10.21203/rs.2.16755/v2 ◽

2020 ◽

Author(s):

Yuping Lu ◽

Charles A. Phillips ◽

Michael A. Langston

Keyword(s):

State Of The Art ◽

Basic Research ◽

R Package ◽

Bipartite Graphs ◽

Heterogeneous Data ◽

General Purpose ◽

Public Repository ◽

Data Types ◽

Statistical Programming ◽

Reference Manual

Abstract. Objective: Bipartite graphs are widely used to model relationships between pairs of heterogeneous data types. Maximal bicliques are foundational structures in such graphs, and their enumeration is an important task in systems biology, epidemiology and many other problem domains. Thus, there is a need for an efficient, general-purpose, publicly available tool to enumerate maximal bicliques in bipartite graphs. The statistical programming language R is a logical choice for such a tool, but until now no R package has existed for this purpose. Our objective is to provide such a package, so that the research community can more easily perform this computationally demanding task. Results: Biclique is an R package that takes as input a bipartite graph and produces a listing of all maximal bicliques in this graph. Input and output formats are straightforward, with examples provided both in this paper and in the package documentation. Biclique employs a state-of-the-art algorithm previously developed for basic research in functional genomics. This package, along with its source code and reference manual, is freely available from the CRAN public repository at https://cran.r-project.org/web/packages/biclique/index.html.
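
For readers unfamiliar with the task, the sketch below enumerates maximal bicliques of a very small bipartite graph by brute force in Python. It is emphatically not the algorithm used by the Biclique package: it tries every non-empty subset of left-side vertices, which takes exponential time, and is meant only to show what the output looks like. The function name `maximal_bicliques` and the adjacency-dictionary input format are assumptions made for this example.

```python
from itertools import combinations


def maximal_bicliques(adj):
    """Enumerate maximal bicliques of a small bipartite graph by brute force.

    adj maps each left-side vertex to the set of right-side vertices it touches.
    Hypothetical helper for illustration; exponential in the number of left vertices.
    """
    left = sorted(adj)
    found = set()
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # Right-side vertices adjacent to every vertex in the subset.
            common = set.intersection(*(set(adj[u]) for u in subset))
            if not common:
                continue
            # Close the left side: every left vertex adjacent to all of `common`.
            closed_left = frozenset(u for u in left if common <= set(adj[u]))
            found.add((closed_left, frozenset(common)))
    return found


if __name__ == "__main__":
    graph = {"a": {1, 2}, "b": {1, 2}, "c": {2, 3}}
    for left_side, right_side in sorted(
        maximal_bicliques(graph), key=lambda pair: sorted(pair[0])
    ):
        print(sorted(left_side), sorted(right_side))
```

Each reported pair is closed on both sides: no vertex can be added to either side without breaking the requirement that every left vertex is connected to every right vertex.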


Phylofactorization: a graph-partitioning algorithm to identify phylogenetic scales of ecological data

10.1101/235341 ◽

2017 ◽

Cited By ~ 4

Author(s):

Alex D. Washburne ◽

Justin D. Silverman ◽

James T. Morton ◽

Daniel J. Becker ◽

Daniel Crowley ◽

...

Keyword(s):

Body Mass ◽

Graph Partitioning ◽

R Package ◽

Ecological Groups ◽

Reduced Rank Regression ◽

Ecological Data ◽

Phylogenetic Constraint ◽

Partitioning Algorithm ◽

Relative Abundances ◽

Additive Modeling

Abstract. The problem of pattern and scale is a central challenge in ecology. The problem of scale is central to community ecology, where functional ecological groups are aggregated and treated as a unit underlying an ecological pattern, such as aggregation of “nitrogen fixing trees” into a total abundance of a trait underlying ecosystem physiology. With the emergence of massive community ecological datasets, from microbiomes to breeding bird surveys, there is a need to objectively identify the scales of organization pertaining to well-defined patterns in community ecological data. The phylogeny is a scaffold for identifying key phylogenetic scales associated with macroscopic patterns. Phylofactorization was developed to objectively identify phylogenetic scales underlying patterns in relative abundance data. However, many ecological data, such as presence-absences and counts, are not relative abundances, yet it is still desirable and informative to identify phylogenetic scales underlying a pattern of interest. Here, we generalize phylofactorization beyond relative abundances to a graph-partitioning algorithm for any community ecological data. Generalizing phylofactorization connects many tools from data analysis to phylogenetically-informed analysis of community ecological data. Two-sample tests identify three phylogenetic factors of mammalian body mass which arose during the K-Pg extinction event, consistent with other analyses of mammalian body mass evolution. Projection of data onto coordinates defined by the phylogeny yields a phylogenetic principal components analysis which refines our understanding of the major sources of variation in the human gut microbiome. These same coordinates allow generalized additive modeling of microbes in Central Park soils and confirm that a large clade of Acidobacteria thrive in neutral soils. Generalized linear and additive modeling of exponential family random variables can be performed by phylogenetically-constrained reduced-rank regression or stepwise factor contrasts. We finish with a discussion of how phylofactorization produces an ecological species concept with a phylogenetic constraint. All of these tools can be implemented with a new R package available online.


Uncovering thematic structure to link co-occurring taxa and predicted functional content in 16S rRNA marker gene surveys

10.1101/146126 ◽

2017 ◽

Cited By ~ 3

Author(s):

Stephen Woloszynek ◽

Joshua Chang Mell ◽

Gideon Simpson ◽

Michael P. O’Connor ◽

Gail L. Rosen

Keyword(s):

16S rRNA ◽

Topic Model ◽

Marker Gene ◽

Amplicon Sequencing ◽

R Package ◽

Bayesian Regression ◽

Thematic Structure ◽

Microbiome Data ◽

Functional Components ◽

Functional Content

Abstract. Background: Analysis of microbiome data involves identifying co-occurring groups of taxa associated with sample features of interest (e.g., disease state). But elucidating key associations is often difficult since microbiome data are compositional, high dimensional, and sparse. Also, the configuration of co-occurring taxa may represent overlapping subcommunities that contribute to, for example, host status. Preserving the configuration of co-occurring microbes rather than detecting specific indicator species is more likely to facilitate biologically meaningful interpretations. In addition, analyses that utilize both taxonomic and predicted functional abundances typically characterize the taxonomic and functional profiles independently before linking them to sample information. This prevents investigators from identifying which specific functional components are associated with which subsets of co-occurring taxa. Results: We provide an approach to explore co-occurring taxa using “topics” generated via a topic model and then link these topics to specific sample classes (e.g., diseased versus healthy). Rather than inferring predicted functional content independently from taxonomic abundances, we instead focus on inference of functional content within topics, which we parse by estimating pathway-topic interactions through a multilevel, fully Bayesian regression model. We apply our methods to two large publicly available 16S amplicon sequencing datasets: an inflammatory bowel disease (IBD) dataset from Gevers et al. and data from the American Gut (AG) project. When applied to the Gevers et al. IBD study, we demonstrate that a topic highly associated with Crohn’s disease (CD) diagnosis is (1) dominated by a cluster of bacteria known to be linked with CD and (2) uniquely enriched for a subset of lipopolysaccharide (LPS) synthesis genes. In the AG data, our approach found that individuals with plant-based diets were enriched with Lachnospiraceae, Roseburia, Blautia, and Ruminococcaceae, as well as fluorobenzoate degradation pathways, whereas pathways involved in LPS biosynthesis were depleted. Conclusions: We introduce an approach for uncovering latent thematic structure in the context of sample features for 16S rRNA surveys. Using our topic-model approach, investigators can (1) capture groups of co-occurring taxa termed topics, (2) uncover within-topic functional potential, and (3) identify gene sets that may guide future inquiry. These methods have been implemented in a freely available R package https://github.com/EESI/themetagenomics.


Correcting for Background Noise Improves Phenotype Prediction from Human Gut Microbiome Data

10.1101/2021.03.19.436199 ◽

2021 ◽

Author(s):

Leah Briscoe ◽

Brunilda Balliu ◽

Sriram Sankararaman ◽

Eran Halperin ◽

Nandita R. Garud

Keyword(s):

Background Noise ◽

Prediction Accuracy ◽

Metagenomic Data ◽

Data Types ◽

Biological Variables ◽

Host Sex ◽

Sources Of Variation ◽

Microbiome Data ◽

Log Ratio ◽

Unsupervised Approaches

The ability to predict human phenotypes accurately from metagenomic data is crucial for developing biomarkers and therapeutics for diseases. However, metagenomic data is commonly affected by technical or biological variables, unrelated to the phenotype of interest, such as sequencing protocol or host sex, which can greatly reduce or, when correlated to the phenotype of interest, inflate prediction accuracy. We perform a comparative analysis of the ability of different data transformations and existing supervised and unsupervised methods to correct microbiome data for background noise. We find that supervised methods are limited because they cannot account for unmeasured sources of variation. In addition, we observe that unsupervised approaches are often superior in addressing these issues, but existing methods developed for other 'omic data types, e.g., gene expression and methylation, are restricted by parametric assumptions unsuitable for microbiome data, which is typically compositional, highly skewed, and sparse. We show that application of the centered log-ratio transformation prior to correction with unsupervised approaches improves prediction accuracy for many phenotypes while simultaneously reducing variance due to unwanted sources of variation. As new and larger metagenomic datasets become increasingly available, background noise correction will become essential for generating reproducible microbiome analyses.
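
The centered log-ratio (CLR) transformation named above can be sketched in a few lines of Python. For a composition x with D parts, CLR maps each part to log(x_i) minus the mean of the logs, so the transformed values sum to zero. Because microbiome tables contain many zeros and log(0) is undefined, this sketch adds a pseudocount before taking logs; the value 0.5 and the function name `clr` are assumptions chosen for illustration, not the settings used in the study.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's counts.

    The pseudocount (an assumed default) keeps zero counts finite under log.
    """
    if pseudocount <= 0 and any(c <= 0 for c in counts):
        raise ValueError("zero counts need a positive pseudocount")
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log equals dividing each part by the geometric mean.
    return [value - mean_log for value in logs]


if __name__ == "__main__":
    print(clr([10, 0, 5, 85]))
```

The result depends on the pseudocount for sparse samples, so in practice that choice deserves the same scrutiny as the downstream correction method.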
