Message Passing-Based Inference for Time-Varying Autoregressive Models

Entropy ◽  
2021 ◽  
Vol 23 (6) ◽  
pp. 683
Author(s):  
Albert Podusenko ◽  
Wouter M. Kouw ◽  
Bert de Vries

Time-varying autoregressive (TVAR) models are widely used for modeling non-stationary signals. Unfortunately, online joint adaptation of both states and parameters in these models remains a challenge. In this paper, we represent the TVAR model by a factor graph and solve the inference problem by automated message passing-based inference for states and parameters. We derive structured variational update rules for a composite “AR node” with probabilistic observations that can be used as a plug-in module in hierarchical models, for example, to model the time-varying behavior of the hyper-parameters of a time-varying AR model. Our method includes tracking of variational free energy (FE) as a Bayesian measure of TVAR model performance. The proposed methods are verified on a synthetic data set and validated on real-world data from temperature modeling and speech enhancement tasks.
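For readers less familiar with the model class, a typical TVAR(M) specification consistent with the description above is (the Gaussian random-walk prior on the coefficients is an assumption of this sketch, not necessarily the authors' exact choice):

\[
x_t = \sum_{k=1}^{M} \theta_{t,k}\, x_{t-k} + e_t, \quad e_t \sim \mathcal{N}(0, \gamma^{-1}), \qquad
\theta_t = \theta_{t-1} + \omega_t, \quad \omega_t \sim \mathcal{N}(0, \Omega),
\]

so that online inference must jointly track the signal \(x_t\) and the drifting coefficient vector \(\theta_t\), which is exactly the joint state-and-parameter adaptation problem the message passing scheme addresses.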

Author(s):  
Drew Levin ◽  
Patrick Finley

Objective: To develop a spatially accurate biosurveillance synthetic data generator for the testing, evaluation, and comparison of new outbreak detection techniques.

Introduction: Development of new methods for the rapid detection of emerging disease outbreaks is a research priority in the field of biosurveillance. Because real-world data are often proprietary in nature, scientists must utilize synthetic data generation methods to evaluate new detection methodologies. Colizza et al. have shown that epidemic spread is dependent on the airline transportation network [1], yet current data generators do not operate over network structures. Here we present a new spatial data generator that models the spread of contagion across a network of cities connected by airline routes. The generator is developed in the R programming language and produces data compatible with the popular ‘surveillance’ software package.

Methods: Colizza et al. demonstrate the power-law relationships between city population, air traffic, and degree distribution [1]. We generate a transportation network as a Chung-Lu random graph [2] that preserves these scale-free relationships (Figure 1). First, given a power-law exponent and a desired number of cities, a probability mass function (PMF) is generated that mirrors the expected degree distribution for the given power-law relationship. Values are then sampled from this PMF to generate an expected degree (number of connected cities) for each city in the network. Edges (airline connections) are added to the network probabilistically as described in [2]. Unconnected graph components are each joined to the largest component using linear preferential attachment. Finally, city sizes are calculated based on an observed three-quarter power-law scaling relationship with the sampled degree distribution. Each city is represented as a customizable stochastic compartmental SIR model. Transportation between cities is modeled similar to [2]. An infection is initialized in a single random city and infection counts are recorded in each city for a fixed period of time. A consistent fraction of the modeled infection cases are recorded as daily clinic visits. These counts are then added onto statically generated baseline data for each city to produce a full synthetic data set. Alternatively, data sets can be generated using real-world networks, such as the one maintained by the International Air Transport Association.

Results: Dynamics such as the number of cities, degree distribution power-law exponent, traffic flow, and disease kinetics can be customized. In the presented example (Figure 2) the outbreak spreads over a 20-city transportation network. Infection spreads rapidly once the more populated hub cities are infected. Cities that are multiple flights away from the initially infected city are infected late in the process. The generator is capable of creating data sets of arbitrary size, length, and connectivity to better mirror a diverse set of observed network types.

Conclusions: New computational methods for outbreak detection and surveillance must be compared to established approaches. Outbreak mitigation strategies require a realistic model of human transportation behavior to best evaluate impact. These actions require test data that accurately reflect the complexity of the real-world data they would be applied to. The outbreak data generated here represent the complexity of modern transportation networks and are made to be easily integrated with established software packages to allow for rapid testing and deployment.

Figure 1: Randomly generated scale-free transportation network with a power-law degree exponent of λ = 1.8. City and link sizes are scaled to reflect their weight.
Figure 2: An example of observed daily outbreak-related clinic visits across a randomly generated network of 20 cities. Each city is colored by the number of flights required to reach the city from the initial infection location. These generated counts are then added onto baseline data to create a synthetic data set for experimentation.

Keywords: Simulation; Network; Spatial; Synthetic; Data
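A minimal sketch of the network-generation step described in the Methods, written in Python rather than the authors' R implementation; the direction of the population-degree scaling and the omission of the preferential-attachment join for disconnected components are simplifications of this sketch:

```python
import numpy as np
import networkx as nx

def chung_lu_network(n_cities=20, gamma=1.8, seed=0):
    rng = np.random.default_rng(seed)
    candidate_degrees = np.arange(1, n_cities)              # possible degrees 1..n-1
    pmf = candidate_degrees.astype(float) ** (-gamma)       # power-law degree PMF
    pmf /= pmf.sum()
    expected_deg = rng.choice(candidate_degrees, size=n_cities, p=pmf)
    # Chung-Lu graph with the sampled expected-degree sequence
    g = nx.expected_degree_graph(expected_deg, selfloops=False, seed=seed)
    # Assumed scaling: population ~ degree^(4/3), i.e. degree ~ population^(3/4)
    for node in g.nodes:
        g.nodes[node]["population"] = 1e4 * max(g.degree[node], 1) ** (4.0 / 3.0)
    return g

g = chung_lu_network()
```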


2020 ◽  
Vol 9 (1) ◽  
pp. 19
Author(s):  
Ke Ren ◽  
Dezhan Qu ◽  
Shaobin Xu ◽  
Xufeng Jiao ◽  
Liang Tai ◽  
...  

Uncertainty analysis of a time-varying ensemble vector field is a challenging topic in geoscience. Due to the complex data structure, the uncertainty of a time-varying ensemble vector field is hard to quantify and analyze. Measuring the differences between pathlines is an effective way to compute the uncertainty. However, existing metrics are either not accurate enough or sensitive to outliers; thus, a comprehensive tool for the further analysis of the uncertainty of transport patterns is required. In this paper, we propose a novel framework for quantifying and analyzing the uncertainty of an ensemble vector field. Based on the classical edit distance on real sequence (EDR) method, we propose a robust and accurate metric to measure pathline uncertainty. Considering spatial continuity, we compute the transport variance of the neighborhood of a location and evaluate the uncertainty correlation between each location and its neighborhood using the local Moran’s I. Based on the proposed uncertainty measurements, a visual analysis system called UP-Vis (uncertainty pathline visualization) is developed to interactively explore the uncertainty. It provides an overview of the uncertainty, supports detailed exploration of transport patterns at a selected location, and allows for the comparison of transport patterns between a location and its neighborhood. Through pathline clustering, the major trends of the ensemble pathlines at a location are extracted, and a glyph is designed to intuitively display the transport direction and degree of divergence of each cluster. For the uncertainty analysis of the neighborhood, a comparison view is designed to compare the transport patterns between a location and its neighborhood in detail. A synthetic data set and a weather simulation data set are used in our experiments. The evaluation and case studies demonstrate that the proposed framework can measure the uncertainty effectively and help users comprehensively explore the uncertainty of transport patterns.
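As a reference point, a minimal sketch of the classical EDR computation between two pathlines (the baseline the proposed metric builds on; the robustness modifications described in the paper are not reproduced here, and a Euclidean matching tolerance is an assumption of this sketch):

```python
import numpy as np

def edr(path_a, path_b, eps=0.1):
    """Edit Distance on Real sequence between two pathlines.
    path_a, path_b: NumPy arrays of shape (n_points, dim); eps: matching tolerance."""
    n, m = len(path_a), len(path_b)
    d = np.zeros((n + 1, m + 1))
    d[:, 0] = np.arange(n + 1)   # cost of deleting all remaining points of path_a
    d[0, :] = np.arange(m + 1)   # cost of inserting all remaining points of path_b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if np.linalg.norm(path_a[i - 1] - path_b[j - 1]) <= eps else 1
            d[i, j] = min(d[i - 1, j - 1] + match,  # match / substitute
                          d[i - 1, j] + 1,          # delete from path_a
                          d[i, j - 1] + 1)          # insert from path_b
    return d[n, m]
```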


Entropy ◽  
2021 ◽  
Vol 23 (7) ◽  
pp. 807
Author(s):  
İsmail Şenöz ◽  
Thijs van de Laar ◽  
Dmitry Bagaev ◽  
Bert de Vries

Accurate evaluation of Bayesian model evidence for a given data set is a fundamental problem in model development. Since evidence evaluations are usually intractable, in practice variational free energy (VFE) minimization provides an attractive alternative, as the VFE is an upper bound on negative model log-evidence (NLE). In order to improve tractability of the VFE, it is common to manipulate the constraints in the search space for the posterior distribution of the latent variables. Unfortunately, constraint manipulation may also lead to a less accurate estimate of the NLE. Thus, constraint manipulation implies an engineering trade-off between tractability and accuracy of model evidence estimation. In this paper, we develop a unifying account of constraint manipulation for variational inference in models that can be represented by a (Forney-style) factor graph, for which we identify the Bethe Free Energy as an approximation to the VFE. We derive well-known message passing algorithms from first principles, as the result of minimizing the constrained Bethe Free Energy (BFE). The proposed method supports evaluation of the BFE in factor graphs for model scoring and development of new message passing-based inference algorithms that potentially improve evidence estimation accuracy.
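For reference, for a factor graph with factors \(f_a\), factor beliefs \(q_a\), variable beliefs \(q_i\), and variable degrees \(d_i\), the Bethe free energy takes the standard form

\[
F_B[q] = \sum_a \sum_{x_a} q_a(x_a) \ln \frac{q_a(x_a)}{f_a(x_a)} \;-\; \sum_i (d_i - 1) \sum_{x_i} q_i(x_i) \ln q_i(x_i),
\]

minimized subject to normalization and marginalization constraints such as \(q_i(x_i) = \sum_{x_a \setminus x_i} q_a(x_a)\). This is the generic bipartite-factor-graph form; the Forney-style bookkeeping used in the paper differs only in notation, and the paper's contribution concerns how these constraints are manipulated to trade tractability against the accuracy of the evidence estimate.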


2016 ◽  
Vol 28 (12) ◽  
pp. 2687-2725 ◽  
Author(s):  
Ken Takano ◽  
Hideitsu Hino ◽  
Shotaro Akaho ◽  
Noboru Murata

This study considers the common situation in data analysis when there are few observations of the distribution of interest, or the target distribution, while abundant observations are available from auxiliary distributions. In this situation, it is natural to compensate for the lack of data from the target distribution by using data sets from these auxiliary distributions—in other words, approximating the target distribution in a subspace spanned by a set of auxiliary distributions. Mixture modeling is one of the simplest ways to integrate information from the target and auxiliary distributions in order to express the target distribution as accurately as possible. There are two typical mixtures in the context of information geometry: the m- and e-mixtures. The m-mixture is applied in a variety of research fields because of the presence of the well-known expectation-maximization algorithm for parameter estimation, whereas the e-mixture is rarely used because of the difficulty of its estimation, particularly for nonparametric models. The e-mixture, however, is a well-tempered distribution that satisfies the principle of maximum entropy. To model a target distribution with scarce observations accurately, this letter proposes a novel framework for nonparametric modeling of the e-mixture and a geometrically inspired estimation algorithm. As numerical examples of the proposed framework, a transfer learning setup is considered. The experimental results show that this framework works well for three types of synthetic data sets, as well as an EEG real-world data set.
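Concretely, for component distributions \(p_i(x)\) and weights \(w_i \ge 0\) with \(\sum_i w_i = 1\), the two mixtures take their standard information-geometric forms

\[
p_m(x) = \sum_i w_i\, p_i(x), \qquad \ln p_e(x) = \sum_i w_i \ln p_i(x) - \psi(w),
\]

where \(\psi(w)\) is the normalization term: the m-mixture averages densities, while the e-mixture averages log-densities, which is what makes its estimation, especially in the nonparametric setting, more difficult.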


Geophysics ◽  
2007 ◽  
Vol 72 (4) ◽  
pp. V79-V86 ◽  
Author(s):  
Kurang Mehta ◽  
Andrey Bakulin ◽  
Jonathan Sheiman ◽  
Rodney Calvert ◽  
Roel Snieder

The virtual source method has recently been proposed to image and monitor below complex and time-varying overburden. The method requires surface shooting recorded at downhole receivers placed below the distorting or changing part of the overburden. Redatuming with the measured Green’s function allows the reconstruction of a complete downhole survey as if the sources were also buried at the receiver locations. There are still some challenges that need to be addressed in the virtual source method, such as limited acquisition aperture and energy coming from the overburden. We demonstrate that up-down wavefield separation can substantially improve the quality of virtual source data. First, it allows us to eliminate artifacts associated with the limited acquisition aperture typically used in practice. Second, it allows us to reconstruct a new optimized response in the absence of downgoing reflections and multiples from the overburden. These improvements are illustrated on a synthetic data set of a complex layered model modeled after the Fahud field in Oman, and on ocean-bottom seismic data acquired in the Mars field in the deepwater Gulf of Mexico.
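Schematically, after up-down separation the virtual source trace between downhole receivers A and B is obtained by cross-correlating the downgoing wavefield recorded at A with the upgoing wavefield recorded at B and summing over the surface shots s, for example in the frequency domain

\[
V_{AB}(\omega) \;\approx\; \sum_{s} \overline{D_A(s,\omega)}\, U_B(s,\omega),
\]

where the overbar denotes complex conjugation; shot weighting, tapering, and time-windowing details are omitted in this schematic form, which is a simplified sketch rather than the exact processing flow used in the paper.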


Author(s):  
Joshua Plasse ◽  
Henrique Hoeltgebaum ◽  
Niall M. Adams

AbstractSequentially detecting multiple changepoints in a data stream is a challenging task. Difficulties relate to both computational and statistical aspects, and in the latter, specifying control parameters is a particular problem. Choosing control parameters typically relies on unrealistic assumptions, such as the distributions generating the data, and their parameters, being known. This is implausible in the streaming paradigm, where several changepoints will exist. Further, current literature is mostly concerned with streams of continuous-valued observations, and focuses on detecting a single changepoint. There is a dearth of literature dedicated to detecting multiple changepoints in transition matrices, which arise from a sequence of discrete states. This paper makes the following contributions: a complete framework is developed for adaptively and sequentially estimating a Markov transition matrix in the streaming data setting. A change detection method is then developed, using a novel moment matching technique, which can effectively monitor for multiple changepoints in a transition matrix. This adaptive detection and estimation procedure for transition matrices, referred to as ADEPT-M, is compared to several change detectors on synthetic data streams, and is implemented on two real-world data streams – one consisting of over nine million HTTP web requests, and the other being a well-studied electricity market data set.
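A minimal sketch of the adaptive, sequential transition-matrix estimation idea; the forgetting-factor scheme below is an illustrative assumption, not the exact ADEPT-M estimator or its moment-matching change detector:

```python
import numpy as np

class StreamingTransitionMatrix:
    """Sequentially estimate a Markov transition matrix from a stream of
    discrete states, down-weighting old transitions with a forgetting factor."""

    def __init__(self, n_states, lam=0.99, prior=1.0):
        self.counts = np.full((n_states, n_states), prior)  # smoothed transition counts
        self.lam = lam
        self.prev = None

    def update(self, state):
        if self.prev is not None:
            self.counts *= self.lam                  # forget old evidence
            self.counts[self.prev, state] += 1.0     # add the newly observed transition
        self.prev = state
        return self.transition_matrix()

    def transition_matrix(self):
        return self.counts / self.counts.sum(axis=1, keepdims=True)
```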


Author(s):  
Aparna Agarwal ◽  
Deevyankar Agarwal

Real commercial data often exhibit temporal features and time-varying behavior. Temporal association rule mining has therefore become an active area of research. Calendar units such as months and days, clock units such as hours and seconds, and specialized units such as business days and academic years play a major role in a wide range of information system applications. Calendar-based patterns have already been proposed by researchers to restrict time-based associations. This paper proposes a novel algorithm to discover association rules on time-dependent data using efficient T-tree and P-tree data structures. The algorithm offers a significant advantage in terms of time and memory while incorporating the time dimension. Our approach of scanning based on time intervals yields a smaller data set for a given valid interval, thus reducing the processing time. The approach is implemented on a synthetic data set, and the results show that the temporal TFP tree delivers better performance than a TFP tree approach.
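A minimal sketch of the interval-restricted scan described above; a plain Counter stands in for the T-tree/P-tree structures, purely for illustration:

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets_in_interval(transactions, start, end, min_support=2, max_len=2):
    """transactions: iterable of (timestamp, set_of_items); [start, end) is the valid
    time interval. Restricting the scan to the interval shrinks the data set before
    any itemset counting takes place."""
    window = [items for ts, items in transactions if start <= ts < end]
    counts = Counter()
    for items in window:
        for k in range(1, max_len + 1):
            for itemset in combinations(sorted(items), k):
                counts[itemset] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}
```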


2020 ◽  
Author(s):  
Jenna M Reps ◽  
Peter Rijnbeek ◽  
Alana Cuthbert ◽  
Patrick B Ryan ◽  
Nicole Pratt ◽  
...  

Background: Researchers developing prediction models are faced with numerous design choices that may impact model performance. One key decision is how to include patients who are lost to follow-up. In this paper we perform a large-scale empirical evaluation investigating the impact of this decision. In addition, we aim to provide guidelines for how to deal with loss to follow-up.
Methods: We generate a partially synthetic dataset with complete follow-up and simulate loss to follow-up based either on random selection or on selection based on comorbidity. In addition to our synthetic data study we investigate 21 real-world data prediction problems. We compare four simple strategies for developing models when using a cohort design that encounters loss to follow-up. Three strategies employ a binary classifier with data that: i) include all patients (including those lost to follow-up), ii) exclude all patients lost to follow-up, or iii) exclude only those patients lost to follow-up who do not have the outcome before being lost to follow-up. The fourth strategy uses a survival model with data that include all patients. We empirically evaluate the discrimination and calibration performance.
Results: The partially synthetic data study results show that excluding patients who are lost to follow-up can introduce bias when loss to follow-up is common and does not occur at random. However, when loss to follow-up was completely at random, the choice of how to address it had negligible impact on model performance. Our empirical real-world data results showed that the four design choices investigated to deal with loss to follow-up resulted in comparable performance when the time-at-risk was 1 year, but demonstrated differential bias when we looked at a 3-year time-at-risk. Removing patients who are lost to follow-up before experiencing the outcome but keeping patients who are lost to follow-up after the outcome can bias a model and should be avoided.
Conclusion: Based on this study we therefore recommend i) developing models using data that include patients who are lost to follow-up and ii) evaluating the discrimination and calibration of models twice: on a test set including patients lost to follow-up and on a test set excluding patients lost to follow-up.
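A minimal sketch of the three binary-classifier cohort designs compared above; the column names are assumptions for illustration, and the fourth, survival-model strategy (all patients with censoring) is not shown:

```python
import pandas as pd

def build_cohort(df: pd.DataFrame, strategy: str) -> pd.DataFrame:
    """df is assumed to contain boolean columns 'lost_to_followup' and
    'outcome_before_ltfu' (hypothetical names, not from the paper)."""
    if strategy == "include_all":                    # i) keep all patients
        return df
    if strategy == "exclude_ltfu":                   # ii) drop everyone lost to follow-up
        return df[~df["lost_to_followup"]]
    if strategy == "exclude_ltfu_without_outcome":   # iii) drop only LTFU patients without the outcome
        return df[~df["lost_to_followup"] | df["outcome_before_ltfu"]]
    raise ValueError(f"unknown strategy: {strategy}")
```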


Author(s):  
Michael Withnall ◽  
Edvard Lindelöf ◽  
Ola Engkvist ◽  
Hongming Chen

We introduce Attention and Edge Memory schemes to the existing Message Passing Neural Network framework for graph convolution, and benchmark our approaches against eight different physical-chemical and bioactivity datasets from the literature. We remove the need to introduce a priori knowledge of the task and chemical descriptor calculation by using only fundamental graph-derived properties. Our results consistently perform on-par with other state-of-the-art machine learning approaches, and set a new standard on sparse multi-task virtual screening targets. We also investigate model performance as a function of dataset preprocessing, and make some suggestions regarding hyperparameter selection.
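A minimal sketch of attention-weighted message aggregation in a single message passing step; the gating and parameterization benchmarked in the paper are not reproduced, and this generic NumPy illustration is an assumption of the sketch:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_aggregate(h, edge_feat, neighbors, w_msg, w_att):
    """h: (n_nodes, d) node states; edge_feat[(i, j)]: (d_e,) edge features;
    neighbors[i]: list of neighbor indices of node i;
    w_msg: (d + d_e, d) message weights; w_att: (2*d + d_e,) attention weights."""
    new_h = np.zeros_like(h)
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue
        msgs = np.stack([np.concatenate([h[j], edge_feat[(i, j)]]) @ w_msg for j in nbrs])
        scores = np.array([np.concatenate([h[i], h[j], edge_feat[(i, j)]]) @ w_att for j in nbrs])
        alpha = softmax(scores)          # attention weights over incoming messages
        new_h[i] = alpha @ msgs          # attention-weighted sum of messages
    return new_h
```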


2021 ◽  
Vol 15 (4) ◽  
pp. 1-46
Author(s):  
Kui Yu ◽  
Lin Liu ◽  
Jiuyong Li

In this article, we aim to develop a unified view of causal and non-causal feature selection methods. The unified view will fill in the gap in the research of the relation between the two types of methods. Based on the Bayesian network framework and information theory, we first show that causal and non-causal feature selection methods share the same objective: to find the Markov blanket of a class attribute, which is the theoretically optimal feature set for classification. We then examine the assumptions made by causal and non-causal feature selection methods when searching for the optimal feature set, and unify the assumptions by mapping them to the restrictions on the structure of the Bayesian network model of the studied problem. We further analyze in detail how the structural assumptions lead to the different levels of approximations employed by the methods in their search, which then result in the approximations in the feature sets found by the methods with respect to the optimal feature set. With the unified view, we can interpret the output of non-causal methods from a causal perspective and derive the error bounds of both types of methods. Finally, we present practical understanding of the relation between causal and non-causal methods using extensive experiments with synthetic data and various types of real-world data.
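To make the shared objective concrete, a minimal sketch of a grow-shrink style Markov blanket search is given below; `ci_test` is a hypothetical user-supplied conditional-independence test (e.g. a G-test or partial correlation), not an API from the paper:

```python
def markov_blanket(target, features, ci_test):
    """features: set of candidate feature names; ci_test(x, target, cond_set) returns
    True when x is independent of target given cond_set."""
    mb = set()
    # Grow phase: add any feature still dependent on the target given the current blanket.
    changed = True
    while changed:
        changed = False
        for x in features - mb:
            if not ci_test(x, target, mb):
                mb.add(x)
                changed = True
    # Shrink phase: remove features made independent of the target by the rest of the blanket.
    for x in list(mb):
        if ci_test(x, target, mb - {x}):
            mb.discard(x)
    return mb
```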

