Sandpiper: Scaling probabilistic inferencing to large scale graphical models

Author(s):  
Alexander Ulanov ◽  
Manish Marwah ◽  
Mijung Kim ◽  
Roshan Dathathri ◽  
Carlos Zubieta ◽  
...  
Keyword(s):  
2017 ◽  
Author(s):  
Stefano Beretta ◽  
Mauro Castelli ◽  
Ivo Gonçalves ◽  
Ivan Merelli ◽  
Daniele Ramazzotti

AbstractGene and protein networks are very important to model complex large-scale systems in molecular biology. Inferring or reverseengineering such networks can be defined as the process of identifying gene/protein interactions from experimental data through computational analysis. However, this task is typically complicated by the enormously large scale of the unknowns in a rather small sample size. Furthermore, when the goal is to study causal relationships within the network, tools capable of overcoming the limitations of correlation networks are required. In this work, we make use of Bayesian Graphical Models to attach this problem and, specifically, we perform a comparative study of different state-of-the-art heuristics, analyzing their performance in inferring the structure of the Bayesian Network from breast cancer data.


2009 ◽  
Vol 21 (11) ◽  
pp. 3010-3056 ◽  
Author(s):  
Shai Litvak ◽  
Shimon Ullman

In this letter, we develop and simulate a large-scale network of spiking neurons that approximates the inference computations performed by graphical models. Unlike previous related schemes, which used sum and product operations in either the log or linear domains, the current model uses an inference scheme based on the sum and maximization operations in the log domain. Simulations show that using these operations, a large-scale circuit, which combines populations of spiking neurons as basic building blocks, is capable of finding close approximations to the full mathematical computations performed by graphical models within a few hundred milliseconds. The circuit is general in the sense that it can be wired for any graph structure, it supports multistate variables, and it uses standard leaky integrate-and-fire neuronal units. Following previous work, which proposed relations between graphical models and the large-scale cortical anatomy, we focus on the cortical microcircuitry and propose how anatomical and physiological aspects of the local circuitry may map onto elements of the graphical model implementation. We discuss in particular the roles of three major types of inhibitory neurons (small fast-spiking basket cells, large layer 2/3 basket cells, and double-bouquet neurons), subpopulations of strongly interconnected neurons with their unique connectivity patterns in different cortical layers, and the possible role of minicolumns in the realization of the population-based maximum operation.


2020 ◽  
Author(s):  
Sacha Epskamp ◽  
Adela-Maria Isvoranu ◽  
Mike W.-L. Cheung

A growing number of publications focuses on estimating Gaussian graphical models (GGMs, networks of partial correlation coefficients). At the same time, generalizibility and replicability of these highly parameterized models are debated, and sample sizes typically found in datasets may not be sufficient for estimating the underlying network structure. In addition, while recent work emerged that aims to compare networks based on different samples, these studies do not take potential cross-study heterogeneity into account. To this end, this paper introduces methods for estimating GGMs through aggregating over multiple datasets. We first introduce a general maximum likelihood estimation modeling framework in which all discussed models are embedded. This modeling framework is subsequently used to introduce meta-analytic Gaussian network aggregation (MAGNA). We discuss two variants: fixed-effects MAGNA, in which heterogeneity across studies is not taken into account, and random-effects MAGNA, which models sample correlations and takes heterogeneity into account. We exemplify the method using four datasets of post-traumatic stress disorder symptoms, as well as one large dataset of depression, anxiety and stress symptoms. Finally, we assess the performance of MAGNA in large-scale simulation studies.


2015 ◽  
Author(s):  
Alina Frolova ◽  
Bartek Wilczynski

AbstractBackgroundBayesian networks are directed acyclic graphical models widely used to represent the probabilistic relationships between random variables. They have been applied in various biological contexts, including gene regulatory networks and protein-protein interactions inference. Generally, learning Bayesian networks from experimental data is NP-hard, leading to widespread use of heuristic search methods giving suboptimal results. However, in cases when the acyclicity of the graph can be externally ensured, it is possible to find the optimal network in polynomial time. While our previously developed tool BNFinder implements polynomial time algorithm, reconstructing networks with the large amount of experimental data still leads to computations on single CPU growing exceedingly.ResultsIn the present paper we propose parallelized algorithm designed for multi-core and distributed systems and its implementation in the improved version of BNFinder - tool for learning optimal Bayesian networks. The new algorithm has been tested on different simulated and experimental datasets showing that it has much better efficiency of parallelization than the previous version. BNFinder gives comparable results in terms of accuracy with respect to current state-of-the-art inference methods, giving significant advantage in cases when external information such as regulators list or prior edge probability can be introduced.ConclusionsWe show that the new method can be used to reconstruct networks in the size range of thousands of genes making it practically applicable to whole genome datasets of prokaryotic systems and large components of eukaryotic genomes. Our benchmarking results on realistic datasets indicate that the tool should be useful to wide audience of researchers interested in discovering dependencies in their large-scale transcriptomic datasets.


2018 ◽  
Author(s):  
Janko Tackmann ◽  
João Frederico Matias Rodrigues ◽  
Christian von Mering

AbstractThe recent explosion of metagenomic sequencing data opens the door towards the modeling of microbial ecosystems in unprecedented detail. In particular, co-occurrence based prediction of ecological interactions could strongly benefit from this development. However, current methods fall short on several fronts: univariate tools do not distinguish between direct and indirect interactions, resulting in excessive false positives, while approaches with better resolution are so far computationally highly limited. Furthermore, confounding variables typical for cross-study data sets are rarely addressed. We present FlashWeave, a new approach based on a flexible Probabilistic Graphical Models framework to infer highly resolved direct microbial interactions from massive heterogeneous microbial abundance data sets with seamless integration of metadata. On a variety of benchmarks, FlashWeave outperforms state-of-the-art methods by several orders of magnitude in terms of speed while generally providing increased accuracy. We apply FlashWeave to a cross-study data set of 69 818 publicly available human gut samples, resulting in one of the largest and most diverse models of microbial interactions in the human gut to date.


2019 ◽  
Author(s):  
Arshdeep Sekhon ◽  
Beilun Wang ◽  
Yanjun Qi

AbstractWe focus on integrating different types of extra knowledge (other than the observed samples) for estimating the sparse structure change between two p-dimensional Gaussian Graphical Models (i.e. differential GGMs). Previous differential GGM estimators either fail to include additional knowledge or cannot scale up to a high-dimensional (large p) situation. This paper proposes a novel method KDiffNet that incorporates Additional Knowledge in identifying Differential Networks via an Elementary Estimator. We design a novel hybrid norm as a superposition of two structured norms guided by the extra edge information and the additional node group knowledge. KDiffNet is solved through a fast parallel proximal algorithm, enabling it to work in large-scale settings. KDiffNet can incorporate various combinations of existing knowledge without re-designing the optimization. Through rigorous statistical analysis we show that, while considering more evidence, KDiffNet achieves the same convergence rate as the state-of-the-art. Empirically on multiple synthetic datasets and one real-world fMRI brain data, KDiffNet significantly outperforms the cutting edge baselines with regard to the prediction performance, while achieving the same level of time cost or less.


Genes ◽  
2020 ◽  
Vol 11 (2) ◽  
pp. 167 ◽  
Author(s):  
Qingyang Zhang

The nonparanormal graphical model has emerged as an important tool for modeling dependency structure between variables because it is flexible to non-Gaussian data while maintaining the good interpretability and computational convenience of Gaussian graphical models. In this paper, we consider the problem of detecting differential substructure between two nonparanormal graphical models with false discovery rate control. We construct a new statistic based on a truncated estimator of the unknown transformation functions, together with a bias-corrected sample covariance. Furthermore, we show that the new test statistic converges to the same distribution as its oracle counterpart does. Both synthetic data and real cancer genomic data are used to illustrate the promise of the new method. Our proposed testing framework is simple and scalable, facilitating its applications to large-scale data. The computational pipeline has been implemented in the R package DNetFinder, which is freely available through the Comprehensive R Archive Network.


Psychometrika ◽  
2021 ◽  
Author(s):  
Sacha Epskamp ◽  
Adela-Maria Isvoranu ◽  
Mike W.-L. Cheung

AbstractA growing number of publications focus on estimating Gaussian graphical models (GGM, networks of partial correlation coefficients). At the same time, generalizibility and replicability of these highly parameterized models are debated, and sample sizes typically found in datasets may not be sufficient for estimating the underlying network structure. In addition, while recent work emerged that aims to compare networks based on different samples, these studies do not take potential cross-study heterogeneity into account. To this end, this paper introduces methods for estimating GGMs by aggregating over multiple datasets. We first introduce a general maximum likelihood estimation modeling framework in which all discussed models are embedded. This modeling framework is subsequently used to introduce meta-analytic Gaussian network aggregation (MAGNA). We discuss two variants: fixed-effects MAGNA, in which heterogeneity across studies is not taken into account, and random-effects MAGNA, which models sample correlations and takes heterogeneity into account. We assess the performance of MAGNA in large-scale simulation studies. Finally, we exemplify the method using four datasets of post-traumatic stress disorder (PTSD) symptoms, and summarize findings from a larger meta-analysis of PTSD symptom.


Sign in / Sign up

Export Citation Format

Share Document