A zero inflated log-normal model for inference of sparse microbial association networks

2021, Vol 17 (6), pp. e1009089
Author(s): Vincent Prost, Stéphane Gazut, Thomas Brüls

The advent of high-throughput metagenomic sequencing has prompted the development of efficient taxonomic profiling methods that allow measuring the presence, abundance and phylogeny of organisms in a wide range of environmental samples. Multivariate sequence-derived abundance data further have the potential to enable inference of ecological associations between microbial populations, but several technical issues must be accounted for, such as the compositional nature of the data, its extreme sparsity and overdispersion, as well as the frequent need to operate in under-determined regimes. The ecological network reconstruction problem is frequently cast in the framework of Gaussian Graphical Models (GGMs), for which efficient structure inference algorithms are available, such as the graphical lasso and neighborhood selection. Unfortunately, GGMs or variants thereof cannot properly account for the extremely sparse patterns occurring in real-world metagenomic taxonomic profiles. In particular, structural zeros (as opposed to sampling zeros), corresponding to true absences of biological signals, fail to be properly handled by most statistical methods. We present here a zero-inflated log-normal graphical model (available at https://github.com/vincentprost/Zi-LN) specifically aimed at handling such “biological” zeros, and demonstrate significant performance gains over state-of-the-art statistical methods for the inference of microbial association networks, with the most notable gains obtained when analyzing taxonomic profiles displaying sparsity levels on par with real-world metagenomic datasets.
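To make the zero-handling idea concrete, here is a minimal sketch of the general recipe (not the authors' Zi-LN estimator, whose implementation is in the linked repository): nonzero counts are log-transformed and standardized per taxon, structural zeros are kept as a separate mass below the observed signal, and the transformed matrix is passed to a sparse precision estimator such as scikit-learn's graphical lasso. Names and thresholds are illustrative only.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

def latent_gaussian_transform(counts):
    """Map zero-inflated counts (samples x taxa) onto an approximately
    Gaussian latent scale: nonzero entries are log-transformed and
    standardized per taxon; zeros are placed below the smallest observed
    signal so they behave as a separate 'absent' mass."""
    z = np.zeros_like(counts, dtype=float)
    for j in range(counts.shape[1]):
        col = counts[:, j].astype(float)
        nz = col > 0
        if nz.any():
            logged = np.log(col[nz])
            mu, sd = logged.mean(), logged.std() + 1e-8
            z[nz, j] = (logged - mu) / sd
            z[~nz, j] = z[nz, j].min() - 1.0   # structural zeros pushed below the signal
    return z

# Toy taxonomic profiles: 100 samples x 30 taxa of overdispersed counts.
counts = np.random.negative_binomial(2, 0.5, size=(100, 30))
z = latent_gaussian_transform(counts)
model = GraphicalLassoCV().fit(z)              # sparse precision matrix
network = (np.abs(model.precision_) > 1e-6) & ~np.eye(30, dtype=bool)
```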


2021, Vol 11 (1)
Author(s): Sakthi Kumar Arul Prakash, Conrad Tucker

This work investigates the ability to classify misinformation in online social media networks in a manner that avoids the need for ground truth labels. Rather than approach the classification problem as a task for humans or machine learning algorithms, this work leverages user–user and user–media (i.e., media likes) interactions to infer the type of information (fake vs. authentic) being spread, without needing to know the actual details of the information itself. To study the inception and evolution of user–user and user–media interactions over time, we create an experimental platform that mimics the functionality of real-world social media networks. We develop a graphical model that considers the evolution of this network topology to model the uncertainty (entropy) propagation when fake and authentic media disseminate across the network. The creation of a real-world social media network enables a wide range of hypotheses to be tested pertaining to users, their interactions with other users, and with media content. The discovery that the entropy of user–user and user–media interactions approximates fake and authentic media likes enables us to classify fake media in an unsupervised manner.
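The entropy signal at the core of the classification can be illustrated with a small sketch; the paper's graphical model additionally tracks how this uncertainty propagates over the evolving network topology, which is not shown here.

```python
import numpy as np

def interaction_entropy(counts):
    """Shannon entropy (bits) of an interaction count vector, e.g. how a
    media item's likes are distributed over users during a time window."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical examples: likes concentrated in a small clique vs. spread
# broadly across users yield clearly different entropy values.
print(interaction_entropy([40, 2, 1, 1]))    # concentrated -> low entropy
print(interaction_entropy([10, 11, 9, 12]))  # spread out   -> high entropy
```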


Author(s): Joachim Giesen, Frank Nussbaum, Christopher Schneider

Latent variable graphical models are an extension of Gaussian graphical models that decompose the precision matrix into a sparse and a low-rank component. These models can be learned with theoretical guarantees from data via a semidefinite program. This program features two regularization terms, one promoting sparsity and one promoting a low rank. In practice, however, it is not straightforward to learn a good model, since the model depends strongly on the regularization parameters that control the relative weight of the loss function and the two regularization terms. Selecting good regularization parameters can be modeled as a bi-level optimization problem, where the upper level optimizes some form of generalization error and the lower level provides a description of the solution gamut. The solution gamut is the set of feasible solutions for all possible values of the regularization parameters. In practice, it is often not feasible to describe the solution gamut efficiently. Hence, algorithmic schemes for approximating solution gamuts have been devised. One such scheme is Benson's generic vector optimization algorithm, which comes with approximation guarantees. So far, Benson's algorithm has not been used in conjunction with semidefinite programs like the latent variable graphical lasso. Here, we develop an adaptive variant of Benson's algorithm for the semidefinite case and show that it retains the known approximation and run-time guarantees. Furthermore, Benson's algorithm turns out to be more efficient in practice for the latent variable graphical model than the existing solution gamut approximation scheme on a wide range of data sets.
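For reference, the underlying semidefinite program (a sparse-plus-low-rank decomposition of the precision matrix) can be written down directly, for instance with cvxpy. The sketch below only exposes the two regularization terms whose weights the solution gamut sweeps over; it is not the adaptive Benson-style algorithm developed in the paper, and alpha/beta are illustrative parameters.

```python
import cvxpy as cp
import numpy as np

def latent_variable_glasso(emp_cov, alpha, beta):
    """Latent-variable graphical lasso: the precision matrix is modeled as
    S - L with S sparse (l1-penalized) and L low-rank (trace-penalized)."""
    p = emp_cov.shape[0]
    S = cp.Variable((p, p), symmetric=True)
    L = cp.Variable((p, p), PSD=True)
    R = S - L                                   # candidate precision matrix
    objective = cp.Minimize(
        cp.trace(emp_cov @ R) - cp.log_det(R)
        + alpha * cp.sum(cp.abs(S))             # sparsity-promoting term
        + beta * cp.trace(L)                    # low-rank-promoting term
    )
    problem = cp.Problem(objective, [R >> 0])
    problem.solve()
    return S.value, L.value

# Toy usage with an empirical covariance over 10 variables.
X = np.random.randn(200, 10)
S_hat, L_hat = latent_variable_glasso(np.cov(X, rowvar=False), alpha=0.1, beta=0.2)
```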


2020
Author(s): Luis Anunciacao, Janet Squires, J. Landeira-Fernandez

One of the main activities in psychometrics is to analyze the internal structure of a test. Multivariate statistical methods, including Exploratory Factor Analysis (EFA) and Principal Component Analysis (PCA), are frequently used to do this, but the growth of Network Analysis (NA) makes it a promising candidate as well. The results obtained by these methods are of great interest, as they not only produce evidence on whether the test measures its intended construct, but also bear on the substantive theory that motivated the test's development. However, these different statistical methods can arrive at different answers, providing the basis for different analytical and theoretical strategies when one needs to choose a solution. In this study, we took advantage of a large volume of published data (n = 22,331) obtained with the Ages and Stages Questionnaire: Social-Emotional (ASQ:SE) and formed a subset of 500 children to present and discuss alternative psychometric solutions to its internal structure, and also to its underlying theory. The analyses were based on a polychoric matrix, the number of factors to retain followed several well-known rules of thumb, and a wide range of exploratory methods was fitted to the data, including EFA, PCA, and NA. The statistical outcomes were divergent, varying from 1 to 6 domains, allowing a flexible interpretation of the results. We argue that the use of statistical methods in the absence of a well-grounded psychological theory has limited applications, despite its appeal. All data and code are available at https://osf.io/z6gwv/.
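The contrast between the factor-analytic and network views can be sketched as follows; this is an illustration only, using Pearson correlations and synthetic data, whereas the published analyses are based on a polychoric matrix appropriate for the ordinal ASQ:SE items.

```python
import numpy as np
import networkx as nx
from sklearn.decomposition import PCA, FactorAnalysis

# Synthetic respondents x items matrix with a latent factor structure.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 12))
X = latent @ loadings + rng.normal(scale=0.5, size=(500, 12))

# Factor-analytic views of the internal structure.
pca = PCA(n_components=3).fit(X)
efa = FactorAnalysis(n_components=3).fit(X)

# Network view: items as nodes, edges for sizeable correlations; detected
# communities play the role of "domains".
corr = np.corrcoef(X, rowvar=False)
G = nx.Graph()
G.add_nodes_from(range(corr.shape[0]))
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if abs(corr[i, j]) > 0.3:
            G.add_edge(i, j, weight=abs(corr[i, j]))
domains = nx.algorithms.community.greedy_modularity_communities(G)
```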


Genes, 2021, Vol 12 (2), pp. 311
Author(s): Zhenqiu Liu

Single-cell RNA-seq (scRNA-seq) is a powerful tool to measure the expression patterns of individual cells and discover heterogeneity and functional diversity among cell populations. Due to its variability, such data are challenging to analyze efficiently. Many clustering methods have been developed, most involving at least one free parameter; different choices for free parameters may lead to substantially different visualizations and clusters, and tuning them is time consuming. There is thus a need for a simple, robust, and efficient clustering method. In this paper, we propose a new regularized Gaussian graphical clustering (RGGC) method for scRNA-seq data. RGGC is based on high-order (partial) correlations and subspace learning, and is robust over a wide range of the regularization parameter λ. Therefore, we can simply set λ=2 or λ=log(p) for the AIC (Akaike information criterion) or BIC (Bayesian information criterion) without cross-validation. Cell subpopulations are discovered by the Louvain community detection algorithm, which determines the number of clusters automatically, so there is no free parameter to tune with RGGC. When evaluated on simulated and benchmark scRNA-seq data sets against widely used methods, RGGC is computationally efficient and one of the top performers. It can detect inter-sample cell heterogeneity when applied to glioblastoma scRNA-seq data.
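The overall pipeline described here (a regularized cell-cell partial-correlation graph followed by Louvain community detection) can be sketched as below. This is an illustration under simplifying assumptions, not the authors' RGGC estimator, which relies on high-order partial correlations and subspace learning.

```python
import numpy as np
import networkx as nx

def partial_correlation_graph(X, lam):
    """X: genes x cells expression matrix (e.g. log-normalized counts).
    Returns a graph over cells whose edge weights are regularized partial
    correlations; lam plays the role of the regularization parameter."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / Xc.shape[0]                       # cell-cell covariance
    prec = np.linalg.inv(cov + lam * np.eye(cov.shape[0]))  # ridge-regularized inverse
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)                       # partial correlations
    np.fill_diagonal(pcor, 0.0)
    G = nx.Graph()
    n = pcor.shape[0]
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if pcor[i, j] > 0:                          # keep positive partial correlations
                G.add_edge(i, j, weight=pcor[i, j])
    return G

X = np.random.lognormal(size=(2000, 150))               # 2000 genes x 150 cells (toy data)
G = partial_correlation_graph(X, lam=np.log(X.shape[0]))  # BIC-like choice (illustrative)
clusters = nx.algorithms.community.louvain_communities(G, weight="weight", seed=0)
```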


2021, Vol 35 (2)
Author(s): Nicolas Bougie, Ryutaro Ichise

Deep reinforcement learning methods have achieved significant successes in complex decision-making problems. However, they traditionally rely on well-designed extrinsic rewards, which limits their applicability to many real-world tasks where rewards are naturally sparse. While cloning behaviors provided by an expert is a promising approach to the exploration problem, learning from a fixed set of demonstrations may be impracticable due to lack of state coverage or distribution mismatch, i.e., when the learner's goal deviates from the demonstrated behaviors. Moreover, we are interested in learning how to reach a wide range of goals from the same set of demonstrations. In this work, we propose a novel goal-conditioned method that leverages very small sets of goal-driven demonstrations to massively accelerate the learning process. Crucially, we introduce the concept of active goal-driven demonstrations to query the demonstrator only in hard-to-learn and uncertain regions of the state space. We further present a strategy for prioritizing the sampling of goals where the disagreement between the expert and the policy is maximized. We evaluate our method on a variety of benchmark environments from the MuJoCo domain. Experimental results show that our method outperforms prior imitation learning approaches in most of the tasks in terms of exploration efficiency and average scores.
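The goal-prioritization idea can be illustrated with a small sketch; the scoring and query mechanism here are hypothetical stand-ins for the paper's active goal-driven demonstration criterion.

```python
import numpy as np

def sample_goals(goals, disagreement, k, temperature=1.0, rng=None):
    """Prioritized sampling of goals for active demonstration queries:
    goals where the expert and the current policy disagree the most are
    sampled with higher probability (softmax over disagreement scores).
    `disagreement` could be, e.g., an action-space distance between the
    expert's and the policy's actions on states leading to each goal."""
    rng = rng or np.random.default_rng()
    scores = np.asarray(disagreement, dtype=float) / temperature
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    idx = rng.choice(len(goals), size=k, replace=False, p=probs)
    return [goals[i] for i in idx]

# Hypothetical usage: query the demonstrator on the 4 most uncertain goals.
goals = [f"goal_{i}" for i in range(10)]
disagreement = np.random.rand(10)
queries = sample_goals(goals, disagreement, k=4)
```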


Author(s): Xin Lu, Pankaj Kumar, Anand Bahuguni, Yanling Wu

The design of offshore structures for extreme/abnormal waves assumes that there is sufficient air gap such that waves will not hit the platform deck. Due to inaccuracies in the predictions of extreme wave crests, in addition to settlement or sea-level increases, the required air gap between the crest of the extreme wave and the deck is often inadequate in existing platforms, and therefore wave-in-deck loads need to be considered when assessing the integrity of such platforms. The problem of wave-in-deck loading involves very complex physics and demands intensive study. In the Computational Fluid Dynamics (CFD) approach, two critical issues must be addressed, namely an efficient, realistic numerical wave maker and an accurate free-surface capturing methodology. Most reported CFD research on wave-in-deck loads considers regular waves only, for instance Stokes fifth-order waves. These are, however, recognized by designers as approximate approaches, since “real world” sea states consist of random irregular waves. In our work, we report a recently developed focused extreme wave maker based on NewWave theory. This model better approximates “real world” conditions and is more efficient than conventional random wave makers: it can efficiently generate targeted waves at a prescribed time and location. The work is implemented and integrated with OpenFOAM, an open-source platform that is receiving increasing attention across a wide range of industrial applications. We describe the developed numerical method for predicting highly non-linear wave-in-deck loads in the time domain. The model's capability is first demonstrated against 3D model testing experiments on a fixed block with various deck orientations under random waves. A detailed loading analysis is conducted and compared with available numerical and measurement data. The method is then applied to an extreme wave loading test on a selected bridge with multiple under-deck girders. The waves are focused extreme irregular waves derived from NewWave theory and JONSWAP spectra.
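As background for the focused wave maker, a linear NewWave profile is a sum of spectral components that all come into phase at the focus, with amplitudes proportional to the JONSWAP spectral density. The sketch below computes such a focused time series at the focus location; it only illustrates the underlying theory (with one common normalization convention), not the paper's CFD wave maker.

```python
import numpy as np

def jonswap(f, hs, tp, gamma=3.3):
    """JONSWAP variance density spectrum S(f) [m^2/Hz] for significant wave
    height hs [m] and peak period tp [s], normalized so that the zeroth
    moment equals hs**2/16 (a common engineering convention)."""
    fp = 1.0 / tp
    sigma = np.where(f <= fp, 0.07, 0.09)
    peak = gamma ** np.exp(-((f - fp) ** 2) / (2 * sigma ** 2 * fp ** 2))
    shape = f ** -5 * np.exp(-1.25 * (fp / f) ** 4) * peak
    shape *= (hs ** 2 / 16.0) / np.trapz(shape, f)
    return shape

def newwave_elevation(t, f, s, amplitude, t_focus=0.0):
    """Linear NewWave profile: all spectral components come into phase at
    t_focus, with component amplitudes proportional to S(f) df, scaled so
    the crest at the focus equals `amplitude`."""
    df = np.gradient(f)
    weights = s * df
    weights *= amplitude / weights.sum()
    return np.sum(weights[None, :] *
                  np.cos(2 * np.pi * f[None, :] * (t[:, None] - t_focus)), axis=1)

f = np.linspace(0.03, 0.5, 400)            # frequency axis [Hz]
t = np.linspace(-60, 60, 2401)             # time axis [s], focus at t = 0
s = jonswap(f, hs=12.0, tp=14.0)
eta = newwave_elevation(t, f, s, amplitude=15.0)   # toy 15 m crest at the focus
```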


Microbiome, 2021, Vol 9 (1)
Author(s): David Pellow, Alvah Zorea, Maraike Probst, Ori Furman, Arik Segal, ...

Background: Metagenomic sequencing has led to the identification and assembly of many new bacterial genome sequences. These bacteria often contain plasmids: usually small, circular, double-stranded DNA molecules that may transfer across bacterial species and confer antibiotic resistance. These plasmids are generally less studied and understood than their bacterial hosts. Part of the reason for this is the lack of computational tools for analyzing plasmids in metagenomic samples.
Results: We developed SCAPP (Sequence Contents-Aware Plasmid Peeler), an algorithm and tool to assemble plasmid sequences from metagenomic sequencing. SCAPP builds on key ideas from the Recycler algorithm while improving plasmid assemblies by integrating biological knowledge about plasmids. We compared the performance of SCAPP to Recycler and metaplasmidSPAdes on simulated metagenomes, real human gut microbiome samples, and a human gut plasmidome dataset that we generated. We also created plasmidome and metagenome data from the same cow rumen sample and used the parallel sequencing data to create a novel assessment procedure. Overall, SCAPP outperformed Recycler and metaplasmidSPAdes across this wide range of datasets.
Conclusions: SCAPP is an easy-to-use Python package that enables the assembly of full plasmid sequences from metagenomic samples. It outperformed existing metagenomic plasmid assemblers in most cases and assembled novel and clinically relevant plasmids in samples we generated, such as a human gut plasmidome. SCAPP is open-source software available from https://github.com/Shamir-Lab/SCAPP.


2021, Vol 11 (1)
Author(s): Seyed Hossein Jafari, Amir Mahdi Abdolhosseini-Qomi, Masoud Asadpour, Maseud Rahgozar, Naser Yazdani

The entities of real-world networks are connected via different types of connections (i.e., layers). The task of link prediction in multiplex networks is to find missing connections based on both intra-layer and inter-layer correlations. Our observations confirm that in a wide range of real-world multiplex networks, from social to biological and technological, a positive correlation exists between connection probability in one layer and similarity in other layers. Accordingly, a similarity-based, automatic, general-purpose multiplex link prediction method, SimBins, is devised that quantifies the amount of connection uncertainty based on observed inter-layer correlations in a multiplex network. Moreover, SimBins enhances the prediction quality in the target layer by incorporating the effect of link overlap across layers. Applying SimBins to various datasets from diverse domains, we find that it outperforms the compared methods (both baseline and state-of-the-art) in most instances when predicting links. Furthermore, SimBins imposes only minor computational overhead on the base similarity measures, making it a potentially fast method suitable for large-scale multiplex networks.
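The use of inter-layer correlation can be illustrated with a toy score that boosts an intra-layer similarity (common neighbours) whenever a node pair is already linked in other layers. This only illustrates the idea of exploiting link overlap; it is not the SimBins method itself.

```python
import numpy as np

def multiplex_link_scores(target_adj, other_layers, overlap_bonus=1.0):
    """Toy multiplex link-prediction score: common-neighbour similarity in
    the target layer plus a bonus counting in how many other layers the
    pair is already connected (exploiting inter-layer correlation)."""
    cn = target_adj @ target_adj                  # common-neighbour counts
    overlap = sum(other_layers)                   # pairwise link overlap across layers
    scores = cn + overlap_bonus * overlap
    np.fill_diagonal(scores, 0)
    scores[target_adj > 0] = 0                    # only rank non-edges of the target layer
    return scores

# Toy 2-layer example with symmetric 0/1 adjacency matrices.
rng = np.random.default_rng(1)
A = (rng.random((20, 20)) < 0.15).astype(float); A = np.triu(A, 1); A = A + A.T
B = (rng.random((20, 20)) < 0.15).astype(float); B = np.triu(B, 1); B = B + B.T
scores = multiplex_link_scores(A, [B])
```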

