MAGMA: inference of sparse microbial association networks

2019 ◽  
Author(s):  
Arnaud Cougoul ◽  
Xavier Bailly ◽  
Ernst C. Wit

Abstract Microorganisms often live in symbiotic relationships with their environment and play a central role in many biological processes. They form a complex system of interacting species. Within the gut microbiota, these interaction patterns have been shown to be involved in obesity, diabetes and mental disease. Understanding the mechanisms that govern this ecosystem is therefore an important scientific challenge. Recently, the acquisition of large samples of microbiota data through metabarcoding or metagenomics has become easier. Until now, correlation-based network analysis and graphical modelling have been used to identify the putative interaction networks formed by the species of microorganisms, but these methods do not take into account all features of microbiota data. Indeed, correlation-based networks cannot distinguish between direct and indirect correlations, and simple graphical models cannot include covariates such as the environmental factors that shape microbiota abundance. Furthermore, the compositional nature of microbiota data is often ignored, or existing normalizations are based on log-transformations, which is somewhat arbitrary and therefore affects the results in unknown ways. We have developed a novel method, called MAGMA, for detecting interactions between microbiota that takes into account the noisy structure of microbiota data, involving an excess of zero counts, overdispersion, compositionality and possible covariate inclusion. The method is based on copula Gaussian graphical models, whereby we model the marginals with zero-inflated negative binomial generalized linear models. The inference is based on an efficient median imputation procedure combined with the graphical lasso. We show that our method beats all existing methods in recovering microbial association networks in an extensive simulation study.
Moreover, the analysis of two 16S microbial data studies with our method reveals interesting new biology. MAGMA is implemented as an R package and is freely available at https://gitlab.com/arcgl/rmagma, which also includes the scripts used to prepare the material in this paper.
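The core pipeline, Gaussianize the count margins and then apply the graphical lasso to the latent correlation structure, can be sketched as follows. This is a minimal illustration, not MAGMA's implementation: the rank-based Gaussianization below stands in for MAGMA's zero-inflated negative binomial marginals, and compositionality and covariates are ignored.

```python
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
n = 200

# Simulated counts for 5 taxa; taxa 0 and 1 share a latent log-abundance,
# the remaining taxa are independent.
latent = rng.normal(size=n)
counts = np.column_stack(
    [rng.poisson(np.exp(1.0 + 0.8 * latent)) for _ in range(2)]
    + [rng.poisson(np.exp(1.0 + rng.normal(size=n))) for _ in range(3)]
)

# Copula step: map each margin to Gaussian scores via ranks (a crude
# stand-in for modelling the margins with ZINB generalized linear models).
z = norm.ppf(rankdata(counts, axis=0) / (n + 1))

# Sparse precision matrix via the graphical lasso, penalty chosen by CV.
model = GraphicalLassoCV().fit(z)
adj = (np.abs(model.precision_) > 1e-4) & ~np.eye(5, dtype=bool)
print(adj[0, 1])  # the linked pair of taxa should be recovered
```

The non-zero off-diagonal entries of the estimated precision matrix define the inferred association network; conditioning on all other taxa is what distinguishes this from a marginal correlation network, which cannot separate direct from indirect correlations.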

2020 ◽  
Author(s):  
Camilla Lingjærde ◽  
Tonje G Lien ◽  
Ørnulf Borgan ◽  
Ingrid K Glad

Abstract Background Identifying gene interactions is a topic of great importance in genomics, and approaches based on network models provide a powerful tool for studying them. Assuming a Gaussian graphical model, a gene association network may be estimated from multiomic data based on the non-zero entries of the inverse covariance matrix. Inferring such biological networks is challenging because of the high dimensionality of the problem, making traditional estimators unsuitable. The graphical lasso is constructed for the estimation of sparse inverse covariance matrices in Gaussian graphical models in such situations, using L1-penalization on the matrix entries. An extension of the graphical lasso is the weighted graphical lasso, in which prior biological information from other (data) sources is integrated into the model through the weights. There are, however, issues with this approach, as it naïvely forces the prior information into the network estimation, even if it is misleading or does not agree with the data at hand. Further, if an association network based on other data is used as the prior, the weighted graphical lasso often fails to utilize the information effectively. Results We propose a novel graphical lasso approach, the tailored graphical lasso, that aims to handle prior information of unknown accuracy more effectively. We provide an R package implementing the method, tailoredGlasso. Applying the method to both simulated and real multiomic data sets, we find that it outperforms the unweighted and weighted graphical lasso in terms of all performance measures we consider. In fact, the graphical lasso and weighted graphical lasso can be considered special cases of the tailored graphical lasso, and a parameter determined by the data measures the usefulness of the prior information.
With our method, mRNA data are demonstrated to provide highly useful prior information for protein-protein interaction networks. Conclusions The method we introduce utilizes useful prior information more effectively, without any risk of losing accuracy should the prior information be misleading.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Charlie M. Carpenter ◽  
Daniel N. Frank ◽  
Kayla Williamson ◽  
Jaron Arbet ◽  
Brandie D. Wagner ◽  
...  

Abstract Background The drive to understand how microbial communities interact with their environments has inspired innovations across many fields. The data generated from sequence-based analyses of microbial communities are typically of high dimensionality and can involve multiple data tables consisting of taxonomic or functional gene/pathway counts. Merging multiple high-dimensional tables with study-related metadata can be challenging. Existing microbiome pipelines available in R have created their own data structures to manage this problem. However, these data structures may be unfamiliar to analysts new to microbiome data or R and do not allow for deviations from internal workflows. Existing analysis tools also focus primarily on community-level analyses and exploratory visualizations, as opposed to analyses of individual taxa. Results We developed the R package "tidyMicro" to serve as a more complete microbiome analysis pipeline. This open source software provides all of the essential tools available in other popular packages (e.g., management of sequence count tables, standard exploratory visualizations, and diversity inference tools), supplemented with multiple options for regression modelling (e.g., negative binomial, beta binomial, and/or rank-based testing) and novel visualizations to improve interpretability (e.g., Rocky Mountain plots, longitudinal ordination plots). This comprehensive pipeline for microbiome analysis also maintains data structures familiar to R users to improve analysts' control over the workflow. A complete vignette is provided to aid new users in the analysis workflow. Conclusions tidyMicro provides a reliable alternative to popular microbiome analysis packages in R. We provide standard tools as well as novel extensions of standard analyses to improve the interpretability of results, while maintaining object malleability to encourage open source collaboration.
The simple examples and full workflow from the package are reproducible and applicable to external data sets.


2012 ◽  
Vol 28 (11) ◽  
pp. 2189-2197 ◽  
Author(s):  
Adriana Fagundes Gomes ◽  
Aline Araújo Nobre ◽  
Oswaldo Gonçalves Cruz

Dengue, a reemerging disease, is one of the most important viral diseases transmitted by mosquitoes. Climate is considered an important factor in the temporal and spatial distribution of vector-transmitted diseases. This study examined the effect of seasonal factors and the relationship between climatic variables and dengue risk in the city of Rio de Janeiro, Brazil, from 2001 to 2009. Generalized linear models were used, with Poisson and negative binomial distributions. The best-fitting model was the one with "minimum temperature" and "precipitation", both lagged by one month and controlled for "year". In that model, a 1°C increase in a month's minimum temperature led to a 45% increase in dengue cases in the following month, while a 10-millimeter rise in precipitation led to a 6% increase in dengue cases in the following month. Dengue transmission involves many factors and is still not fully understood, but climate is a critical one, since climatic variables facilitate analysis of the risk of epidemics.
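Because both the Poisson and negative binomial models are log-linear in the covariates, the reported effects translate into multiplicative rate ratios via exponentiated coefficients. A small arithmetic sketch using the effect sizes quoted above (the fitted model itself is not reproduced here):

```python
import math

# Rate-ratio interpretation of a log-linear (Poisson/negative binomial) GLM:
# exp(beta) is the multiplicative change in expected cases per covariate unit.
beta_temp = math.log(1.45)    # +45% cases per 1 degC rise in lagged min. temperature
beta_precip = math.log(1.06)  # +6% cases per 10 mm rise in lagged precipitation

def rate_ratio(d_temp_c, d_precip_10mm):
    """Multiplicative change in next month's expected dengue cases."""
    return math.exp(beta_temp * d_temp_c + beta_precip * d_precip_10mm)

print(round(rate_ratio(1, 0), 2))  # 1.45
print(round(rate_ratio(2, 1), 2))  # 2.23, i.e. 1.45**2 * 1.06
```

Effects combine multiplicatively on the case scale, which is why a 2°C rise more than doubles the expected rate rather than adding 90%.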


2021 ◽  
Vol 8 ◽  
Author(s):  
Ann Weaver

Adaptation is a biological mechanism by which organisms adjust physically or behaviorally to changes in their environment to become more suited to it. This is a report of free-ranging bottlenose dolphins’ behavioral adaptations to environmental changes from coastal construction in prime habitat. Construction was a 5-year bridge removal and replacement project in a tidal inlet along west central Florida’s Gulf of Mexico coastline. It occurred in two consecutive 2.5-year phases to replace the west and east lanes, respectively. Lane phases involved demolition/removal of above-water cement structures, below-water cement structures, and reinstallation of below + above water cement structures (N = 2,098 photos). Data were longitudinal (11 years: 2005–2016, N = 1,219 surveys 2–4 times/week/11 years, N = 4,753 dolphins, 591.95 h of observation in the construction zone, 126 before-construction surveys, 568 during-construction surveys, 525 after-construction surveys). The dependent variable was numbers of dolphins (count) in the immediate construction zone. Three analyses examined presence/absence, total numbers of dolphins, and numbers of dolphins engaged in five behavior states (forage-feeding, socializing, direct travel, meandering travel, and mixed states) across construction. Analyses were GLIMMIX generalized linear models for logistic and negative binomial regressions to account for observation time differences as an exposure (offset) variable. Results showed a higher probability of dolphin presence than absence before construction began, more total dolphins before construction, and significant decreases in the numbers of feeding but not socializing dolphins. Significant changes in temporal rhythms also revealed finer-grained adaptations. 
Conclusions were that the dolphins adapted to construction in two ways, by establishing feeding locations beyond the disturbed construction zone and shifting temporal rhythms of behaviors that they continued to exhibit in the construction zone to later in the day when construction activities were minimized. This is the first study to suggest that the dolphins learned to cope with coastal construction with variable adjustments.


2011 ◽  
Vol 54 (6) ◽  
pp. 661-675
Author(s):  
N. Mielenz ◽  
K. Thamm ◽  
M. Bulang ◽  
J. Spilke

Abstract. In this paper, count data with excess zeros and repeated observations per subject are evaluated. If the number of observed zero events in a trial substantially exceeds the expected number (derived from the Poisson or from the negative binomial distribution), there is an excess of zeros. Hurdle and zero-inflated models with random effects are available for evaluating this type of data. In this paper both model approaches are presented and used to evaluate the number of visits to the feeder per cow per hour. For the analysis of the target trait, a hurdle model with random effects based on a negative binomial distribution was ultimately used; it was chosen after a detailed comparison of models, and also because of its simpler computer implementation. For improved interpretation of the results, the levels of the explanatory factors (for example, the classes of lactation) were averaged not on the link scale but on the response scale. The decisive explanatory variables for the pattern of visiting activity over the 24-hour cycle are the milking and cleaning times at hours 4, 7, 12 and 20. The highly significant differences between the visiting frequencies of first-lactation cows and those of higher lactations are explained by competition for access to the feeder and thus to the feed.


2021 ◽  
Author(s):  
Ville N Pimenoff ◽  
Ramon Cleries

Many viruses infect humans, and several of them cause significant morbidity and mortality. Simulations that create large synthetic datasets from observed multiple viral strain infections in a limited population sample can be a powerful tool for inferring significant pathogen occurrence and interaction patterns, particularly when only a limited number of observed data units is available. Here, to demonstrate diverse human papillomavirus (HPV) strain occurrence patterns, we used log-linear models combined with a Bayesian framework for graphical independence network (GIN) analysis; that is, we simulated datasets based on modelling the probabilistic associations between observed viral data points, i.e., different viral strain infections in a set of population samples. Our GIN analysis outperformed, in precision, all oversampling methods tested for simulating a large synthetic strain-level prevalence dataset from the observed HPV data. Altogether, we demonstrate that network modelling is a potent tool for creating synthetic viral datasets for comprehensive estimation of pathogen occurrence and interaction patterns.


2021 ◽  
Author(s):  
Saket Choudhary ◽  
Rahul Satija

Heterogeneity in single-cell RNA-seq (scRNA-seq) data is driven by multiple sources, including biological variation in cellular state as well as technical variation introduced during experimental processing. Deconvolving these effects is a key challenge for preprocessing workflows. Recent work has demonstrated the importance and utility of count models for scRNA-seq analysis, but there is a lack of consensus on which statistical distributions and parameter settings are appropriate. Here, we analyze 58 scRNA-seq datasets that span a wide range of technologies, systems, and sequencing depths in order to evaluate the performance of different error models. We find that while a Poisson error model appears appropriate for sparse datasets, we observe clear evidence of overdispersion for genes with sufficient sequencing depth in all biological systems, necessitating the use of a negative binomial model. Moreover, we find that the degree of overdispersion varies widely across datasets, systems, and gene abundances, arguing for a data-driven approach to parameter estimation. Based on these analyses, we provide a set of recommendations for modeling variation in scRNA-seq data, particularly when using generalized linear models or likelihood-based approaches for preprocessing and downstream analysis.
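The diagnostic at the heart of this comparison is the variance-to-mean ratio: a Poisson model forces it to 1, while a gamma-Poisson (negative binomial) mixture inflates it to 1 + mu/theta. A small simulation, with invented parameter values, makes the contrast concrete:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Poisson: variance equals the mean. Negative binomial via a gamma-Poisson
# mixture: variance = mu + mu**2 / theta, exceeding the mean (overdispersion).
mu, theta = 5.0, 2.0
pois = rng.poisson(mu, n)
nb = rng.poisson(rng.gamma(shape=theta, scale=mu / theta, size=n))

print(pois.var() / pois.mean())  # close to 1: no overdispersion
print(nb.var() / nb.mean())      # close to 1 + mu/theta = 3.5: overdispersed
```

A gene whose variance-to-mean ratio stays near 1 at its observed depth is adequately described by Poisson error; ratios well above 1, as seen for deeply sequenced genes, require the extra dispersion parameter theta of the negative binomial.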


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e10849
Author(s):  
Maximilian Knoll ◽  
Jennifer Furkel ◽  
Juergen Debus ◽  
Amir Abdollahi

Background Model building is a crucial part of omics-based biomedical research, used to transfer classifications and to obtain insights into underlying mechanisms. Feature selection is often based on minimizing the error between model predictions and the given classification (maximizing accuracy). Human ratings/classifications, however, can be error-prone, with discordance rates between experts of 5–15%. We therefore evaluate whether a feature pre-filtering step might improve the identification of features associated with the true underlying groups. Methods Data were simulated for up to 100 samples and up to 10,000 features, 10% of which were associated with the ground truth comprising 2–10 normally distributed populations. Binary and semi-quantitative ratings with varying error probabilities were used as classifications. For feature preselection, standard cross-validation (V2) was compared to a novel heuristic (V1) applying univariate testing, multiplicity adjustment and cross-validation on switched dependent (classification) and independent (features) variables. Preselected features were used to train logistic regression/linear models (backward selection, AIC). Predictions were compared against the ground truth (ROC, multiclass ROC). As a use case, multiple feature selection/classification methods were benchmarked against the novel heuristic to identify prognostically different G-CIMP-negative glioblastoma tumors from the TCGA-GBM 450k methylation array cohort, starting from a rough and erroneous separation based on a fuzzy UMAP embedding. Results V1 yielded higher median AUC ranks for two true groups (ground truth), with smaller differences for true graduated differences (3–10 groups). A lower fraction of models was successfully fit with V1. Median AUCs for binary classification and two true groups were 0.91 (range: 0.54–1.00) for V1 (Benjamini-Hochberg) and 0.70 (range: 0.28–1.00) for V2; 13% (n = 616) of V2 models showed AUCs ≤ 0.5 for 25 samples and 100 features.
For larger numbers of features and samples, median AUCs were 0.75 (range: 0.59–1.00) for V1 and 0.54 (range: 0.32–0.75) for V2. In the TCGA-GBM data, modelBuildR allowed the best prognostic separation of patients, with the highest median overall survival difference (7.51 months), followed by a difference of 6.04 months for a random forest-based method. Conclusions The proposed heuristic is beneficial for the retrieval of features associated with two true groups classified with errors. We provide the R package modelBuildR to simplify the (comparative) evaluation and application of the proposed heuristic (http://github.com/mknoll/modelBuildR).
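The preselection idea, univariate testing of each feature against the noisy rating, Benjamini-Hochberg adjustment, then a model on the surviving features, can be sketched as follows. This is a simplified illustration, not modelBuildR: V1's switched-variable cross-validation step is omitted, and all names and effect sizes are invented.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n, p = 100, 1000
truth = rng.integers(0, 2, n)                   # true underlying groups
flip = rng.random(n) < 0.10                     # 10% rating errors
rating = np.where(flip, 1 - truth, truth)       # error-prone classification

X = rng.normal(size=(n, p))
X[:, :50] += truth[:, None] * 1.0               # first 50 features carry signal

# Univariate tests of each feature against the (noisy) rating,
# then Benjamini-Hochberg adjustment to preselect features.
pvals = np.array([stats.ttest_ind(X[rating == 0, j], X[rating == 1, j]).pvalue
                  for j in range(p)])
keep = multipletests(pvals, method="fdr_bh")[1] < 0.05

# Model on preselected features, evaluated against the ground truth.
clf = LogisticRegression(max_iter=1000).fit(X[:, keep], rating)
auc = roc_auc_score(truth, clf.predict_proba(X[:, keep])[:, 1])
print(keep.sum(), auc)
```

Even though the model is trained on the erroneous rating, filtering out features unrelated to any group structure lets it recover the true groups well, which is the effect the paper quantifies.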


2017 ◽  
Author(s):  
Zhun Miao ◽  
Ke Deng ◽  
Xiaowo Wang ◽  
Xuegong Zhang

Abstract Summary The excessive number of zeros in single-cell RNA-seq data includes "real" zeros due to the on-off nature of gene transcription in single cells and "dropout" zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package, DEsingle, which employs a zero-inflated negative binomial model to estimate the proportion of real and dropout zeros and to define and detect three types of DE genes in single-cell RNA-seq data with higher accuracy. Availability and Implementation The R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor's consideration. Contact [email protected] Supplementary information Supplementary data are available at bioRxiv online.
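The quantity behind this decomposition is that under a zero-inflated negative binomial, the observed zero fraction splits into structural ("dropout") zeros and zeros produced by the count distribution itself: pi + (1 - pi) * (theta / (theta + mu))**theta. A small simulation checking that identity, with invented parameter values:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000

# Zero-inflated negative binomial: with probability pi a structural
# "dropout" zero, otherwise an NB count (gamma-Poisson mixture),
# which can itself be zero (a "real" biological zero).
pi, mu, theta = 0.3, 2.0, 1.0
dropout = rng.random(n) < pi
nb = rng.poisson(rng.gamma(theta, mu / theta, n))
counts = np.where(dropout, 0, nb)

# Observed zero fraction decomposes as pi + (1 - pi) * P_NB(0),
# with P_NB(0) = (theta / (theta + mu)) ** theta.
p_nb_zero = (theta / (theta + mu)) ** theta
expected_zero = pi + (1 - pi) * p_nb_zero
print((counts == 0).mean(), expected_zero)
```

Fitting pi, mu and theta per gene is what lets a ZINB-based method apportion the observed zeros between dropout and real transcriptional silence, rather than treating all zeros alike.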

