Counting and Sampling from Markov Equivalent DAGs Using Clique Trees

Author(s):  
AmirEmad Ghassami ◽  
Saber Salehkaleybar ◽  
Negar Kiyavash ◽  
Kun Zhang

A directed acyclic graph (DAG) is the most common graphical model for representing causal relationships among a set of variables. When only observational data are available, the structure of the ground-truth DAG is identifiable only up to Markov equivalence, based on conditional independence relations among the variables. The number of DAGs equivalent to the ground-truth DAG is therefore an indicator of the causal complexity of the underlying structure: roughly speaking, it shows how many interventions, or how much additional information, are needed to recover the underlying DAG. In this paper, we propose a new technique for counting the number of DAGs in a Markov equivalence class. Our approach is based on the clique tree representation of chordal graphs. We show that for graphs of bounded degree, the proposed algorithm runs in polynomial time. We further demonstrate that this technique can be used for uniform sampling from a Markov equivalence class, which provides a stochastic way to enumerate DAGs in the equivalence class and may be needed for finding the best DAG or for causal inference when the equivalence class is given as input. We also extend our counting and sampling method to the case where prior knowledge about the underlying DAG is available, and present applications of this extension in causal experiment design and in estimating the causal effect of joint interventions.
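The counting problem can be sanity-checked by brute force on very small graphs: two DAGs are Markov equivalent exactly when they share the same skeleton and the same v-structures, so one can enumerate every orientation of the skeleton and count those that are acyclic and preserve the v-structures. Below is a minimal Python sketch along these lines, assuming networkx is available; it is an exponential-time baseline for intuition only, not the paper's polynomial-time clique-tree algorithm.

```python
from itertools import product

import networkx as nx


def v_structures(dag):
    """Return the set of v-structures a -> c <- b with a, b non-adjacent."""
    vs = set()
    for c in dag.nodes:
        parents = list(dag.predecessors(c))
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                a, b = parents[i], parents[j]
                if not dag.has_edge(a, b) and not dag.has_edge(b, a):
                    vs.add((frozenset((a, b)), c))
    return vs


def count_markov_equivalent_dags(dag):
    """Count DAGs Markov equivalent to `dag` (same skeleton, same
    v-structures) by brute force over all edge orientations.
    Exponential in the number of edges; for tiny examples only."""
    edges = list(dag.edges)
    target = v_structures(dag)
    count = 0
    for flips in product([False, True], repeat=len(edges)):
        g = nx.DiGraph()
        g.add_nodes_from(dag.nodes)
        for (u, v), flip in zip(edges, flips):
            g.add_edge(*((v, u) if flip else (u, v)))
        if nx.is_directed_acyclic_graph(g) and v_structures(g) == target:
            count += 1
    return count


# The chain X -> Y -> Z has three Markov equivalent DAGs
# (X -> Y <- Z is excluded because it introduces a v-structure).
print(count_markov_equivalent_dags(nx.DiGraph([("X", "Y"), ("Y", "Z")])))  # 3
```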


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Adèle Weber Zendrera ◽  
Nataliya Sokolovska ◽  
Hédi A. Soula

In this manuscript, we propose a novel approach to assess relationships between environment and metabolic networks. We used a comprehensive dataset of more than 5000 prokaryotic species from which we derived the metabolic networks. From the reconstructed graphs we compute the scope, the set of all metabolites and reactions that can potentially be synthesized when external metabolites are provided. Using machine learning techniques, we show that the scope is an excellent predictor of taxonomic and environmental variables, namely growth temperature, oxygen tolerance, and habitat. In the literature, metabolites and pathways are rarely used to discriminate species. We make use of the scope's underlying structure (metabolites and pathways) to construct the predictive models, giving additional information on the important metabolic pathways needed to discriminate the species, information that is often absent from other metabolic network properties. For example, in the particular case of growth temperature, glutathione biosynthesis pathways are specific to species growing in cold environments, whereas tungsten metabolism is specific to species in warm environments, as hinted in the current literature. From a machine learning perspective, the scope reduces the dimension of our data and can thus be considered an interpretable graph embedding.
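The scope as defined here is usually computed by network expansion: starting from a set of external (seed) metabolites, repeatedly fire every reaction whose substrates are all available and add its products, until nothing new appears. A small illustrative sketch under that interpretation follows; the reaction and metabolite names are made up, and this is not the authors' reconstruction pipeline.

```python
def metabolic_scope(reactions, seed_metabolites):
    """Network expansion: starting from seed (external) metabolites,
    repeatedly fire every reaction whose substrates are all available and
    add its products, until no new metabolite or reaction appears.
    `reactions` maps a reaction id to a (substrates, products) pair."""
    available = set(seed_metabolites)
    fired = set()
    changed = True
    while changed:
        changed = False
        for rid, (subs, prods) in reactions.items():
            if rid in fired or not set(subs) <= available:
                continue
            fired.add(rid)
            available |= set(prods)
            changed = True
    return available, fired


# Toy network: r1 can fire from the seed set, r2 still lacks NADP.
reactions = {
    "r1": (["glucose", "ATP"], ["G6P", "ADP"]),
    "r2": (["G6P", "NADP"], ["6PG", "NADPH"]),
}
scope_metabolites, scope_reactions = metabolic_scope(reactions, {"glucose", "ATP"})
print(sorted(scope_metabolites))  # ['ADP', 'ATP', 'G6P', 'glucose']
print(sorted(scope_reactions))    # ['r1']
```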



2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Sakthi Kumar Arul Prakash ◽  
Conrad Tucker

This work investigates the ability to classify misinformation in online social media networks in a manner that avoids the need for ground-truth labels. Rather than approach the classification problem as a task for humans or machine learning algorithms, this work leverages user–user and user–media (i.e., media likes) interactions to infer the type of information (fake vs. authentic) being spread, without needing to know the actual details of the information itself. To study the inception and evolution of user–user and user–media interactions over time, we create an experimental platform that mimics the functionality of real-world social media networks. We develop a graphical model that considers the evolution of this network topology to model the uncertainty (entropy) propagation as fake and authentic media disseminate across the network. The creation of a real-world social media network enables a wide range of hypotheses to be tested pertaining to users, their interactions with other users, and their interactions with media content. The discovery that the entropies of user–user and user–media interactions approximate fake and authentic media likes enables us to classify fake media in an unsupervised manner.
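As a rough illustration of the kind of uncertainty measure involved, the sketch below computes the Shannon entropy of an interaction-count vector (for example, how many likes a media item received from different user groups). The grouping and the numbers are hypothetical, and this is not the authors' graphical model of entropy propagation.

```python
import numpy as np


def interaction_entropy(counts):
    """Shannon entropy (in bits) of a vector of interaction counts, e.g. how
    many likes a media item received from each user group."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


# Hypothetical like counts from five user communities: a broadly liked item
# has higher interaction entropy than one liked mostly by a single cluster.
print(interaction_entropy([20, 18, 22, 19, 21]))  # close to log2(5) ≈ 2.32
print(interaction_entropy([95, 2, 1, 1, 1]))      # much lower
```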



2021 ◽  
Author(s):  
Shikha Suman ◽  
Ashutosh Karna ◽  
Karina Gibert

Hierarchical clustering is one of the most popular approaches to understanding the underlying structure of a dataset and defining typologies, with multiple applications in real life. Unlike popular methods such as k-means, it reveals the inner structure of the dataset and yields the number of clusters as an output, and the granularity of the final clustering can be adjusted to the goals of the analysis. The number of clusters in a hierarchical method relies on the analysis of the resulting dendrogram: experts have criteria to visually inspect the dendrogram and determine the number of clusters, but finding automatic criteria that imitate experts in this task is still an open problem. Dependence on an expert to cut the tree is a limitation in real applications such as Industry 4.0 and additive manufacturing. This paper analyses several cluster validity indexes in the context of determining the suitable number of clusters in hierarchical clustering. A new Cluster Validity Index (CVI) is proposed that properly captures the implicit criteria used by experts when analyzing dendrograms. The proposal has been applied to a range of datasets and validated against expert ground truth, outperforming state-of-the-art results while significantly reducing the computational cost.
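The generic procedure the paper builds on can be sketched as follows: cut the dendrogram at several candidate numbers of clusters and keep the cut that maximizes a cluster validity index. The sketch below uses the silhouette score as a stand-in CVI, since the paper's proposed index is not reproduced here; scipy and scikit-learn are assumed.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def pick_k_by_cvi(X, k_range=range(2, 11), method="ward"):
    """Cut the dendrogram at each candidate k and keep the cut that maximizes
    a cluster validity index (silhouette here, as a stand-in for the
    paper's proposed CVI)."""
    Z = linkage(X, method=method)
    scores = {}
    for k in k_range:
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(set(labels)) > 1:
            scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic check: four well-separated blobs should be recovered as k = 4.
X, _ = make_blobs(n_samples=150, centers=4, random_state=0)
print(pick_k_by_cvi(X)[0])  # typically 4
```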



Author(s):  
Topi Talvitie ◽  
Mikko Koivisto

Exploring directed acyclic graphs (DAGs) in a Markov equivalence class is pivotal to infer causal effects or to discover the causal DAG via appropriate interventional data. We consider counting and uniform sampling of DAGs that are Markov equivalent to a given DAG. These problems efficiently reduce to counting the moral acyclic orientations of a given undirected connected chordal graph on n vertices, for which we give two algorithms. Our first algorithm requires O(2^n n^4) arithmetic operations, improving a previous superexponential upper bound. The second requires O(k! 2^k k^2 n) operations, where k is the size of the largest clique in the graph; for bounded-degree graphs this bound is linear in n. After a single run, both algorithms enable uniform sampling from the equivalence class at a computational cost linear in the graph size. Empirical results indicate that our algorithms are superior to previously presented algorithms over a range of inputs; graphs with hundreds of vertices and thousands of edges are processed in a second on a desktop computer.
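The reduction stated here, counting moral acyclic orientations of an undirected chordal graph, can be verified by brute force on tiny inputs: try all 2^m edge orientations and keep those that are acyclic and create no v-structure between non-adjacent vertices. A minimal sketch follows, assuming networkx; it also illustrates where the k! factor in the second bound comes from, since a single clique on k vertices has exactly k! such orientations. This is a correctness check, not either of the paper's algorithms.

```python
from itertools import product

import networkx as nx


def count_moral_acyclic_orientations(ug):
    """Count orientations of undirected graph `ug` that are acyclic and moral
    (no v-structure a -> c <- b with a, b non-adjacent in `ug`). Brute force
    over all 2^m edge orientations; only feasible for tiny graphs."""
    edges = list(ug.edges)
    count = 0
    for bits in product([False, True], repeat=len(edges)):
        dag = nx.DiGraph()
        dag.add_nodes_from(ug.nodes)
        for (u, v), flip in zip(edges, bits):
            dag.add_edge(*((v, u) if flip else (u, v)))
        if not nx.is_directed_acyclic_graph(dag):
            continue
        moral = all(
            ug.has_edge(a, b)
            for c in dag.nodes
            for a in dag.predecessors(c)
            for b in dag.predecessors(c)
            if a != b
        )
        if moral:
            count += 1
    return count


# A single clique on k vertices has exactly k! moral acyclic orientations,
# matching the k! factor in the clique-based bound.
print(count_moral_acyclic_orientations(nx.complete_graph(4)))  # 24
```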



Geosciences ◽  
2018 ◽  
Vol 8 (12) ◽  
pp. 455 ◽  
Author(s):  
Timo Gaida ◽  
Tengku Tengku Ali ◽  
Mirjam Snellen ◽  
Alireza Amiri-Simkooei ◽  
Thaiënne van Dijk ◽  
...  

Multi-frequency backscatter data collected from multibeam echosounders (MBESs) are increasingly becoming available. The ability to collect data at multiple frequencies at the same time is expected to allow for better discrimination between seabed sediments. We propose an extension of the Bayesian method for seabed classification to multi-frequency backscatter. By combining the information retrieved at single frequencies we produce a multispectral acoustic classification map, which allows us to distinguish more seabed environments. In this study we use three triple-frequency (100, 200, and 400 kHz) backscatter datasets acquired with an R2Sonic 2026 in the Bedford Basin, Canada, in 2016 and 2017, and in Patricia Bay, Canada, in 2016. The results are threefold: (1) combining 100 and 400 kHz, in general, reveals the most additional information about the seabed; (2) the use of multiple frequencies allows for a better acoustic discrimination of seabed sediments than single-frequency data; and (3) the optimal frequency selection for acoustic sediment classification depends on the local seabed. However, the benefit of using multiple frequencies cannot be clearly quantified based on the existing ground-truth data. Still, a qualitative comparison and a geological interpretation indicate an improved discrimination between different seabed environments using multi-frequency backscatter.
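One way to read the "combining the information retrieved at single frequencies" step is as forming a multispectral class from the tuple of single-frequency classes at each grid cell. The sketch below does exactly that on hypothetical class maps; it is a schematic of the combination idea, not the Bayesian classification method itself.

```python
import numpy as np


def combine_frequency_classes(class_maps):
    """Combine per-frequency acoustic class maps defined on the same grid:
    every distinct tuple of single-frequency classes becomes one
    multispectral class."""
    stacked = np.stack(class_maps, axis=-1)        # (rows, cols, n_freq)
    combo_ids = {}
    out = np.empty(stacked.shape[:-1], dtype=int)
    for idx in np.ndindex(*stacked.shape[:-1]):
        combo = tuple(int(v) for v in stacked[idx])
        out[idx] = combo_ids.setdefault(combo, len(combo_ids))
    return out, combo_ids


# Hypothetical 2x2 class maps at 100 kHz and 400 kHz (classes coded 0, 1).
c100 = np.array([[0, 0], [1, 0]])
c400 = np.array([[0, 1], [1, 1]])
multispectral, combos = combine_frequency_classes([c100, c400])
print(multispectral)  # [[0 1]
                      #  [2 1]] -- three multispectral classes from four cells
print(combos)         # {(0, 0): 0, (0, 1): 1, (1, 1): 2}
```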



1987 ◽  
Vol 9 ◽  
pp. 253
Author(s):  
N. Young ◽  
I. Goodwin

Ground surveys of the ice sheet in Wilkes Land, Antarctica, have been made on oversnow traverses operating out of Casey. Data collected include surface elevation, accumulation rate, snow temperature, and physical characteristics of the snow cover. By the nature of the surveys, the data are mostly restricted to line profiles. In some regions, aerial surveys of surface topography have been made over a grid network. Satellite imagery and remote sensing are two means of extrapolating the results from measurements along lines to an areal presentation. They are also the only source of data over large areas of the continent. Landsat images in the visible and near infra-red wavelengths clearly depict many of the large- and small-scale features of the surface. The intensity of the reflected radiation varies with the aspect and magnitude of the surface slope to reveal the surface topography. The multi-channel nature of the Landsat data is exploited to distinguish between different surface types through their different spectral signatures, e.g. bare ice, glaze, snow, etc. Additional information on surface type can be gained at a coarser scale from other satellite-borne sensors such as ESMR, SMMR, etc. Textural enhancement of the Landsat images reveals the surface micro-relief. Features in the enhanced images are compared to ground-truth data from the traverse surveys to produce a classification of surface types across the images and to determine the magnitude of the surface topography and micro-relief observed. The images can then be used to monitor changes over time.
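As a toy illustration of classifying surface types from multi-channel spectral signatures, the sketch below assigns each pixel to the reference signature nearest in band space. The band values, classes, and distance choice are all hypothetical assumptions; this is not the processing applied to the Landsat data described above.

```python
import numpy as np


def classify_by_spectral_signature(pixels, signatures):
    """Assign each pixel (a vector of band reflectances) to the surface type
    whose reference spectral signature is nearest in Euclidean distance."""
    names = list(signatures)
    refs = np.array([signatures[n] for n in names])          # (classes, bands)
    pixels = np.asarray(pixels, dtype=float)                 # (pixels, bands)
    d = np.linalg.norm(pixels[:, None, :] - refs[None, :, :], axis=-1)
    return [names[i] for i in d.argmin(axis=1)]


# Hypothetical two-band reference signatures for three surface types.
signatures = {"bare ice": [0.35, 0.30], "glaze": [0.55, 0.50], "snow": [0.85, 0.80]}
pixels = [[0.80, 0.78], [0.40, 0.33]]
print(classify_by_spectral_signature(pixels, signatures))  # ['snow', 'bare ice']
```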



2019 ◽  
Vol 31 (8) ◽  
pp. 1671-1717 ◽  
Author(s):  
Jérôme Tubiana ◽  
Simona Cocco ◽  
Rémi Monasson

A restricted Boltzmann machine (RBM) is an unsupervised machine learning bipartite graphical model that jointly learns a probability distribution over data and extracts their relevant statistical features. RBMs were recently proposed for characterizing the patterns of coevolution between amino acids in protein sequences and for designing new sequences. Here, we study how the nature of the features learned by RBM changes with its defining parameters, such as the dimensionality of the representations (size of the hidden layer) and the sparsity of the features. We show that for adequate values of these parameters, RBMs operate in a so-called compositional phase in which visible configurations sampled from the RBM are obtained by recombining these features. We then compare the performance of RBM with other standard representation learning algorithms, including principal or independent component analysis (PCA, ICA), autoencoders (AE), variational autoencoders (VAE), and their sparse variants. We show that RBMs, due to the stochastic mapping between data configurations and representations, better capture the underlying interactions in the system and are significantly more robust with respect to sample size than deterministic methods such as PCA or ICA. In addition, this stochastic mapping is not prescribed a priori as in VAE, but learned from data, which allows RBMs to show good performance even with shallow architectures. All numerical results are illustrated on synthetic lattice protein data that share similar statistical features with real protein sequences and for which ground-truth interactions are known.
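For readers unfamiliar with RBMs, the sketch below shows the stochastic visible-to-hidden mapping and a one-step contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM. Hyperparameters and data are arbitrary placeholders; this is a generic textbook RBM, not the architecture or training protocol used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


class BernoulliRBM:
    """Bernoulli-Bernoulli RBM trained with one-step contrastive divergence."""

    def __init__(self, n_visible, n_hidden, lr=0.05):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(self, v):
        # Stochastic mapping from data to representation: P(h_j = 1 | v).
        return self._sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        # Generative direction: P(v_i = 1 | h).
        return self._sigmoid(h @ self.W.T + self.b_v)

    def cd1_update(self, v0):
        # Positive phase: hidden activations driven by the data.
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step back to the visible layer.
        pv1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(pv1)
        # Update parameters with the difference of correlations.
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
        self.b_v += self.lr * (v0 - pv1).mean(axis=0)
        self.b_h += self.lr * (ph0 - ph1).mean(axis=0)


# Toy usage on random binary data (shapes only; not meaningful training).
data = (rng.random((32, 6)) < 0.5).astype(float)
rbm = BernoulliRBM(n_visible=6, n_hidden=3)
for _ in range(100):
    rbm.cd1_update(data)
print(rbm.hidden_probs(data[:2]))  # hidden-unit activation probabilities
```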



2015 ◽  
Vol 54 (9) ◽  
pp. 1861-1870 ◽  
Author(s):  
Jeffrey C. Snyder ◽  
Alexander V. Ryzhkov

Although radial velocity data from Doppler radars can partially resolve some tornadoes, particularly large tornadoes near the radar, most tornadoes are not explicitly resolved by radar owing to inadequate spatiotemporal resolution. In addition, it can be difficult to determine which mesocyclones typically observed on radar are associated with tornadoes. Since debris lofted by tornadoes has scattering characteristics that are distinct from those of hydrometeors, the additional information provided by polarimetric weather radars can aid in identifying debris from tornadoes; the polarimetric tornadic debris signature (TDS) provides what is nearly “ground truth” that a tornado is ongoing (or has recently occurred). This paper outlines a modification to the hydrometeor classification algorithm used with the operational Weather Surveillance Radar-1988 Doppler (WSR-88D) network in the United States to include a TDS category. Examples of automated TDS classification are provided for several recent cases that were observed in the United States.
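Schematically, a TDS is flagged where high reflectivity coincides with near-zero differential reflectivity, anomalously low correlation coefficient, and strong rotation. The toy rule below illustrates that logic; the thresholds are illustrative assumptions and are not the membership functions of the operational WSR-88D hydrometeor classification algorithm.

```python
def flag_tornadic_debris(z_dbz, zdr_db, rhohv, near_strong_rotation,
                         z_min=30.0, zdr_max=0.5, rhohv_max=0.80):
    """Toy tornadic-debris-signature (TDS) flag: high reflectivity, near-zero
    differential reflectivity, low co-polar correlation coefficient,
    collocated with strong rotation. Thresholds are illustrative only."""
    return (near_strong_rotation
            and z_dbz >= z_min
            and zdr_db <= zdr_max
            and rhohv <= rhohv_max)


print(flag_tornadic_debris(45.0, 0.2, 0.65, True))   # True: debris-like
print(flag_tornadic_debris(45.0, 2.5, 0.98, True))   # False: rain-like
```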



Author(s):  
Hadi Hosseini ◽  
Debmalya Mandal ◽  
Nisarg Shah ◽  
Kevin Shi

The wisdom of the crowd has long become the de facto approach for eliciting information from individuals or experts in order to predict the ground truth. However, classical democratic approaches for aggregating individual votes only work when the opinion of the majority of the crowd is relatively accurate. A clever recent approach, surprisingly popular voting, elicits additional information from the individuals, namely their prediction of other individuals' votes, and provably recovers the ground truth even when experts are in the minority. This approach works well when the goal is to pick the correct option from a small list, but when the goal is to recover a true ranking of the alternatives, a direct application of the approach requires eliciting too much information. We explore practical techniques for extending the surprisingly popular algorithm to ranked voting by partial votes and predictions and designing robust aggregation rules. We experimentally demonstrate that even a little prediction information helps surprisingly popular voting outperform classical approaches.
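The basic (non-ranking) surprisingly popular rule that this work extends can be stated compactly: choose the option whose actual vote share exceeds its average predicted share by the largest margin. A small sketch with a toy example follows; the authors' partial-vote and ranking extensions are not implemented here.

```python
from collections import Counter

import numpy as np


def surprisingly_popular(votes, predictions):
    """Pick the option whose actual vote share exceeds its mean predicted
    share by the largest margin. `votes` is a list of chosen options;
    `predictions[i]` maps each option to respondent i's predicted share."""
    options = sorted(set(votes) | {o for p in predictions for o in p})
    tally = Counter(votes)
    n = len(votes)
    margin = {
        o: tally[o] / n - np.mean([p.get(o, 0.0) for p in predictions])
        for o in options
    }
    return max(options, key=margin.get)


# Toy example: "no" wins the raw vote, but respondents predict an even
# heavier "no" majority, so "yes" is the surprisingly popular answer.
votes = ["no", "no", "no", "yes", "yes"]
predictions = [{"yes": 0.2, "no": 0.8}] * 3 + [{"yes": 0.3, "no": 0.7}] * 2
print(surprisingly_popular(votes, predictions))  # yes
```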



eLife ◽  
2021 ◽  
Vol 10 ◽  
Author(s):  
Shivesh Chaudhary ◽  
Sol Ah Lee ◽  
Yueyi Li ◽  
Dhaval S Patel ◽  
Hang Lu

Although identifying cell names in dense image stacks is critical for analyzing functional whole-brain data and enabling comparison across experiments, unbiased identification is very difficult and relies heavily on researchers' experience. Here we present a probabilistic graphical-model framework, CRF_ID, based on Conditional Random Fields, for unbiased and automated cell identification. CRF_ID focuses on maximizing intrinsic similarity between shapes. Compared to existing methods, CRF_ID achieves higher accuracy on simulated and ground-truth experimental datasets, and better robustness against challenging noise conditions common in experimental data. CRF_ID can further boost accuracy by building atlases from annotated data in a highly computationally efficient manner, and by easily adding new features (e.g., from new strains). We demonstrate cell annotation in C. elegans images across strains, animal orientations, and tasks including gene-expression localization, multi-cellular and whole-brain functional imaging experiments. Together, these successes demonstrate that unbiased cell annotation can facilitate biological discovery, and this approach may be valuable for annotation tasks in other systems.
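One concrete example of an "intrinsic similarity" feature is agreement of pairwise positional relationships between observed cells and their assigned atlas neurons. The sketch below scores a candidate assignment by anterior-posterior ordering agreement; it is a single illustrative feature, not the CRF_ID model, and the coordinate convention (x as the anterior-posterior axis) is an assumption.

```python
from itertools import combinations

import numpy as np


def pairwise_order_agreement(cell_xyz, atlas_xyz, assignment):
    """Fraction of observed cell pairs whose anterior-posterior (x-axis)
    ordering matches the ordering of their assigned atlas neurons.
    `assignment[i]` is the atlas index assigned to observed cell i."""
    cell_xyz = np.asarray(cell_xyz, dtype=float)
    atlas_xyz = np.asarray(atlas_xyz, dtype=float)
    agree, total = 0, 0
    for i, j in combinations(range(len(cell_xyz)), 2):
        obs = np.sign(cell_xyz[i, 0] - cell_xyz[j, 0])
        atl = np.sign(atlas_xyz[assignment[i], 0] - atlas_xyz[assignment[j], 0])
        agree += int(obs == atl)
        total += 1
    return agree / total if total else 1.0


# Hypothetical three observed cells matched to atlas neurons 2, 0, 1.
cells = [[0.1, 0.0, 0.0], [0.5, 0.1, 0.0], [0.9, 0.0, 0.1]]
atlas = [[0.6, 0.0, 0.0], [1.0, 0.0, 0.0], [0.2, 0.0, 0.0]]
print(pairwise_order_agreement(cells, atlas, [2, 0, 1]))  # 1.0
```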


