data generating process
Recently Published Documents

TOTAL DOCUMENTS: 131 (FIVE YEARS: 59)
H-INDEX: 12 (FIVE YEARS: 2)

2021 ◽  
Vol 12 ◽  
Author(s):  
Zhao Yang ◽  
C. Mary Schooling ◽  
Man Ki Kwok

Selection bias is increasingly acknowledged as a limitation of Mendelian randomization (MR). However, few methods exist to assess this issue. We focus on two plausible causal structures relevant to MR studies and illustrate the data-generating process underlying selection bias via simulation studies. We conceptualize the use of control exposures to validate MR estimates derived from selected samples, both by detecting potential selection bias and by reproducing the exposure–outcome association of primary interest based on subject-matter knowledge. We discuss the criteria for choosing control exposures. We apply the proposal in an MR study investigating the potential effect of higher transferrin on stroke (including ischemic and cardioembolic stroke), using transferrin saturation and iron status as control exposures. Theoretically, selection bias affects the associations of genetic instruments with the outcome in selected samples, violating the exclusion-restriction assumption and distorting MR estimates. Our applied example, showing inconsistent effects of genetically predicted higher transferrin and higher transferrin saturation on stroke, suggests potential selection bias. Furthermore, the expected associations of genetically predicted higher iron status with stroke and longevity indicate no systematic selection bias. The routine use of control exposures in MR studies provides a valuable tool for validating estimated causal effects. As in the applied example, an antagonist, a decoy, or an exposure with biological activity similar to the exposure of primary interest, subject to the same potential sources of selection bias as the exposure–outcome association, is suggested as the control exposure. An additional or validated control exposure with a well-established association with the outcome is also recommended to explore possible systematic selection bias.
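The selection mechanism described here can be illustrated with a short simulation: a minimal sketch (not the authors' code) in which selection into the analysed sample depends on both the exposure and the outcome, opening a collider path that biases the instrumental-variable estimate. All effect sizes and the selection rule below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

g = rng.binomial(2, 0.3, n)              # genetic instrument (allele count)
u = rng.normal(size=n)                   # unmeasured confounder
x = 0.5 * g + u + rng.normal(size=n)     # exposure
y = 0.2 * x + u + rng.normal(size=n)     # outcome; true causal effect of x is 0.2

def wald_ratio(g, x, y):
    """Ratio of the gene-outcome to the gene-exposure association (basic MR estimate)."""
    return np.cov(g, y)[0, 1] / np.cov(g, x)[0, 1]

# Selection into the analysed sample depends on both exposure and outcome,
# which opens a collider path between the instrument and the confounder.
p_select = 1 / (1 + np.exp(-(0.8 * x + 0.8 * y)))
s = rng.uniform(size=n) < p_select

print("MR estimate, full sample:    ", round(wald_ratio(g, x, y), 3))
print("MR estimate, selected sample:", round(wald_ratio(g[s], x[s], y[s]), 3))
```

With this selection rule the estimate in the selected subsample deviates noticeably from the full-sample estimate, which is the kind of discrepancy a well-chosen control exposure is meant to flag.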


Author(s):  
Shahryar Minhas ◽  
Cassy Dorff ◽  
Max B. Gallop ◽  
Margaret Foster ◽  
Howard Liu ◽  
...  

Abstract International relations scholarship concerns dyads, yet standard modeling approaches fail to adequately capture the data-generating process behind dyadic events and processes. As a result, they suffer from biased coefficients and poorly calibrated standard errors. We show how a regression-based approach, the Additive and Multiplicative Effects (AME) model, can be used to account for the inherent dependencies in dyadic data and glean substantive insights into the interrelations between actors. First, we conduct a simulation to highlight how the model captures dependencies and show that accounting for these processes improves our ability to conduct inference on dyadic data. Second, we compare the AME model to approaches used in three prominent studies from recent international relations scholarship. For each study, we find that, compared to AME, the modeling approach used performs notably worse at capturing the data-generating process. Further, conventional methods misstate the effect of key variables and the uncertainty in these effects. Finally, AME outperforms standard approaches in terms of out-of-sample fit. In sum, our work shows the consequences of failing to take the dependencies inherent to dyadic data seriously. Most importantly, by better modeling the data-generating process underlying political phenomena, the AME framework improves scholars' ability to conduct inferential analyses on dyadic data.
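To make the dependency problem concrete, the following toy sketch simulates directed dyadic data with additive sender and receiver effects, the kind of first-order dependence the AME model absorbs; it is an illustration under assumed effect sizes, not the authors' AME implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actors = 50
a = rng.normal(size=n_actors)            # sender effects
b = rng.normal(size=n_actors)            # receiver effects

rows = []
for i in range(n_actors):
    for j in range(n_actors):
        if i == j:
            continue
        x = rng.normal()                                   # dyadic covariate
        y = 1.0 * x + a[i] + b[j] + rng.normal()           # dyadic outcome
        rows.append((i, j, x, y))
i_idx, j_idx, x, y = map(np.array, zip(*rows))

# Naive regression with only the dyadic covariate.
coeffs = np.polyfit(x, y, 1)
resid = y - np.polyval(coeffs, x)

# Residuals cluster strongly by sender, so i.i.d. standard errors are misleading.
within_sender = np.array([resid[i_idx == i].mean() for i in range(n_actors)])
print("naive slope:", round(coeffs[0], 3))
print("variance of sender-level mean residuals:", round(within_sender.var(), 3))
```

The slope itself is recovered in this simple setting, but the sender-level clustering of residuals is exactly the dependence that standard dyadic regressions leave unmodeled.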


Author(s):  
Cassy Dorff ◽  
Max Gallop ◽  
Shahryar Minhas

Abstract Spatial interdependencies commonly drive the spread of violence in civil conflict. To address such interdependence, scholars often use spatial lags to model the diffusion of violence, but this requires an explicit operationalization of the connectivity matrices that represent the spread of conflict. Unfortunately, in many cases there are multiple competing processes that facilitate the spread of violence, making it difficult to identify the true data-generating process. We show how a network-driven methodology can allow us to account for the spread of violence even in cases where we cannot directly measure the factors that drive diffusion. To do so, we estimate a latent connectivity matrix that captures a variety of possible diffusion patterns. We use this procedure to study intrastate conflict in eight conflict-prone countries and show that our framework delivers substantially better predictive performance than canonical spatial-lag measures. We also investigate the circumstances under which canonical spatial lags suffice and those under which a latent network approach is beneficial.
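For reference, the canonical spatial-lag construction the abstract contrasts with looks like the sketch below, where a toy border-adjacency matrix stands in for the connectivity matrix W; the latent-network approach replaces this fixed, hand-specified W with an estimated one.

```python
import numpy as np

# Toy adjacency among 5 districts (1 = shared border); an illustrative assumption.
W = np.array([
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
], dtype=float)
W = W / W.sum(axis=1, keepdims=True)        # row-normalize

violence = np.array([3.0, 0.0, 5.0, 1.0, 0.0])   # events per district
spatial_lag = W @ violence                        # average violence among neighbors
print(spatial_lag)
```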


2021 ◽  
Vol 2042 (1) ◽  
pp. 012035
Author(s):  
Evgenii Genov ◽  
Stefanos Petridis ◽  
Petros Iliadis ◽  
Nikos Nikopoulos ◽  
Thierry Coosemans ◽  
...  

Abstract A reliable and accurate load forecasting method is key to successful energy management of smart grids. Due to non-linear relations in the data-generating process and data availability issues, load forecasting remains a challenging task. Here, we investigate the application of feed-forward artificial neural networks (FFNNs), recurrent neural networks (RNNs), and cross-learning methods for day-ahead and three-days-ahead load forecasting. The effectiveness of the proposed methods is evaluated against a statistical benchmark, using multiple accuracy metrics. The test data sets are high-resolution, multi-seasonal time series of electricity demand of buildings in Belgium, Canada and the UK, drawn from private measurements and open-access sources. Both the FFNN and RNN methods show competitive results on the benchmarking datasets, and the best method varies depending on the accuracy metric selected. The use of cross-learning to fit a global RNN model improves the final accuracy.
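A minimal sketch of the feed-forward variant, using synthetic hourly demand rather than the Belgian, Canadian, and UK measurements used in the paper, and scikit-learn's MLPRegressor as a stand-in for the networks actually benchmarked.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)
hours = np.arange(24 * 365)
# Synthetic hourly demand with daily and weekly cycles plus noise.
load = (50
        + 10 * np.sin(2 * np.pi * hours / 24)
        + 5 * np.sin(2 * np.pi * hours / (24 * 7))
        + rng.normal(scale=2, size=hours.size))

lags, horizon = 48, 24                      # use the last 48 h to predict 24 h ahead
idx = range(lags, load.size - horizon + 1)
X = np.array([load[t - lags:t] for t in idx])
y = np.array([load[t + horizon - 1] for t in idx])

split = X.shape[0] - 24 * 30                # hold out the last 30 days
model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("day-ahead test MAE:", round(mean_absolute_error(y[split:], pred), 2))
```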


2021 ◽  
Author(s):  
Aaron C Miller ◽  
Joseph E Cavanaugh ◽  
Alan T Arakkal ◽  
Scott H Koeneman ◽  
Philip M Polgreen

The incidence of diagnostic delays is unknown for many diseases and particular healthcare settings. Many existing methods to identify diagnostic delays are resource intensive or inapplicable to various diseases or settings. In this paper we propose a comprehensive framework to estimate the frequency of missed diagnostic opportunities for a given disease using real-world longitudinal data sources. We start by providing a conceptual model of the disease–diagnosis data-generating process. We then propose a simulation-based method to estimate measures of the frequency of missed diagnostic opportunities and the duration of delays. This approach is specifically designed to identify missed diagnostic opportunities based on signs and symptoms that occur prior to an initial diagnosis, while accounting for expected patterns of healthcare that may appear as coincidental symptoms. Three different simulation algorithms are described for implementing this approach. We summarize estimation procedures that may be used to parameterize the simulation. Finally, we apply our approach to tuberculosis, acute myocardial infarction, and stroke, and evaluate the estimated frequency and duration of diagnostic delays for these diseases. Our approach can be customized to fit a range of diseases, and we summarize how the choice of simulation algorithm may affect the resulting estimates.
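The core idea, that symptom-related visits before diagnosis mix an expected baseline with excess visits attributable to missed opportunities, can be caricatured in a few lines; the rates, window length, and attribution rule below are illustrative assumptions, not the estimation procedures described in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n_patients, window = 10_000, 30                        # 30 days before the index diagnosis
days_before = np.arange(window, 0, -1)                 # 30, 29, ..., 1

baseline_rate = 0.02                                   # expected symptom visits per patient-day
excess_rate = np.where(days_before <= 10, 0.10, 0.0)   # missed opportunities cluster near diagnosis

# "Observed" symptom-related visits in the pre-diagnosis window.
observed = rng.poisson(baseline_rate + excess_rate, size=(n_patients, window))

# Simulation step: draw the visits expected from routine care and attribute
# any remainder to missed diagnostic opportunities.
expected_draw = rng.poisson(baseline_rate, size=observed.shape)
missed = np.clip(observed - expected_draw, 0, None)

print("missed opportunities per 1,000 patients:", round(1000 * missed.sum() / n_patients))
print("mean delay (days before diagnosis), given a miss:",
      round((missed * days_before).sum() / missed.sum(), 1))
```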


2021 ◽  
Vol 20 (3) ◽  
pp. 425-449
Author(s):  
Haruka Murayama ◽  
Shota Saito ◽  
Yuji Iikubo ◽  
Yuta Nakahara ◽  
Toshiyasu Matsushima

Abstract Prediction based on a single linear regression model is one of the most common approaches in many fields of study. It helps us understand the structure of data, but may not be suitable for data whose structure is complex. To express the structure of the data more accurately, we assume that the data can be divided into clusters, with a separate linear regression model in each cluster. In this case, each explanatory variable can play its own role: explaining the assignment to clusters, explaining the regression on the target variable, or both. Introducing a probabilistic structure for the data-generating process, we derive the optimal prediction under the Bayes criterion and an algorithm that computes it sub-optimally with a variational inference method. One advantage of our algorithm is that it automatically weights the probabilities of each possible number of clusters during estimation, which resolves the concern about selecting the number of clusters. Experiments on both synthetic and real data demonstrate these advantages and reveal some behaviors and tendencies of the algorithm.
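As a point of reference for the cluster-wise regression structure, the sketch below fits a two-component mixture of linear regressions with a plain EM loop; it uses simple point estimates rather than the Bayes-optimal prediction and variational inference developed in the paper, and the data, noise scale, and number of clusters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
z = rng.integers(0, 2, n)                              # true cluster assignment
x = rng.normal(size=n)
true_intercepts, true_slopes = np.array([0.0, 3.0]), np.array([2.0, -1.0])
y = true_intercepts[z] + true_slopes[z] * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
coef = np.array([[0.0, 1.0], [1.0, -0.5]])             # initial [intercept, slope] per cluster
sigma = 0.5                                            # noise scale, assumed known here

for _ in range(50):
    # E-step: responsibilities from Gaussian likelihoods of each cluster's fit.
    resid = y[:, None] - X @ coef.T
    lik = np.exp(-0.5 * (resid / sigma) ** 2) + 1e-12
    w = lik / lik.sum(axis=1, keepdims=True)
    # M-step: weighted least squares within each cluster.
    for k in range(2):
        Xw = X * w[:, k:k + 1]
        coef[k] = np.linalg.solve(Xw.T @ X, Xw.T @ y)

print("estimated [intercept, slope] per cluster:\n", np.round(coef, 2))
```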


Mathematics ◽  
2021 ◽  
Vol 9 (16) ◽  
pp. 1853
Author(s):  
Alina Bărbulescu ◽  
Cristian Ștefan Dumitriu

Artificial intelligence (AI) methods are interesting alternatives to classical approaches for modeling financial time series, since they relax the assumptions that parametric models impose on the data-generating process and do not constrain the model's functional form. Although many studies have employed these techniques for modeling financial time series, the connection between the models' performance and the statistical characteristics of the data series has not yet been investigated. Therefore, this research studies the performance of Gene Expression Programming (GEP) for modeling monthly and weekly financial series that present trend and/or seasonality, both before and after the removal of each component. It is shown that series normality and homoskedasticity do not influence the models' quality. Removing the trend increases the models' performance, whereas removing the seasonality diminishes the goodness of fit. Comparisons with ARIMA models are also provided.
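The preprocessing the study varies, removing the trend and the seasonal component before modeling, can be sketched as follows on a synthetic monthly series; the GEP modeling itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(5)
t = np.arange(120)                                   # 10 years of monthly observations
series = (100 + 0.8 * t                              # linear trend
          + 5 * np.sin(2 * np.pi * t / 12)           # annual seasonality
          + rng.normal(scale=3, size=t.size))        # noise

# Remove a linear trend by least squares.
slope, intercept = np.polyfit(t, series, 1)
detrended = series - (intercept + slope * t)

# Remove seasonality by subtracting month-of-year means of the detrended series.
month = t % 12
monthly_means = np.array([detrended[month == m].mean() for m in range(12)])
deseasonalized = detrended - monthly_means[month]

print("std dev: raw %.2f | detrended %.2f | detrended + deseasonalized %.2f"
      % (series.std(), detrended.std(), deseasonalized.std()))
```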


Author(s):  
Benjie Wang ◽  
Clare Lyle ◽  
Marta Kwiatkowska

Robustness of decision rules to shifts in the data-generating process is crucial to the successful deployment of decision-making systems. Such shifts can be viewed as interventions on a causal graph, which capture (possibly hypothetical) changes in the data-generating process, whether due to natural reasons or by the action of an adversary. We consider causal Bayesian networks and formally define the interventional robustness problem, a novel model-based notion of robustness for decision functions that measures worst-case performance with respect to a set of interventions that denote changes to parameters and/or causal influences. By relying on a tractable representation of Bayesian networks as arithmetic circuits, we provide efficient algorithms for computing guaranteed upper and lower bounds on the interventional robustness probabilities. Experimental results demonstrate that the methods yield useful and interpretable bounds for a range of practical networks, paving the way towards provably causally robust decision-making systems.
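On a two-node toy network, the interventional robustness question reduces to bounding a decision rule's accuracy over a set of parameter interventions, which the sketch below does by brute-force enumeration rather than with the arithmetic-circuit bounds developed in the paper; all probabilities and the intervention set are illustrative assumptions.

```python
import numpy as np

p_b_given_a = {0: 0.2, 1: 0.9}          # P(B=1 | A=a), held fixed

def decide(b):
    """Decision rule under scrutiny: predict A = B."""
    return b

def accuracy(p_a1):
    """P(decision correct) under an intervention setting P(A=1) = p_a1."""
    acc = 0.0
    for a, p_a in ((0, 1 - p_a1), (1, p_a1)):
        for b, p_b in ((0, 1 - p_b_given_a[a]), (1, p_b_given_a[a])):
            acc += p_a * p_b * (decide(b) == a)
    return acc

# Intervention set: P(A=1) may be shifted anywhere in [0.3, 0.8].
grid = np.linspace(0.3, 0.8, 51)
accs = [accuracy(p) for p in grid]
print("accuracy over the intervention set: [%.3f, %.3f]" % (min(accs), max(accs)))
```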


Author(s):  
Kyle L. Marquardt ◽  
Daniel Pemstein

Abstract Models for converting expert-coded data to estimates of latent concepts assume different data-generating processes (DGPs). In this paper, we simulate ecologically valid data according to different assumptions, and examine the degree to which common methods for aggregating expert-coded data (1) recover true values and (2) construct appropriate coverage intervals. We find that the mean and both hierarchical Aldrich–McKelvey (A–M) scaling and hierarchical item-response theory (IRT) models perform similarly when expert error is low; the hierarchical latent variable models (A–M and IRT) outperform the mean when expert error is high. Hierarchical A–M and IRT models generally perform similarly, although IRT models are often more likely to include true values within their coverage intervals. The median and non-hierarchical latent variable models perform poorly under most assumed DGPs.
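The aggregation problem can be illustrated with a small simulation comparing the mean and the median when experts differ in reliability; the hierarchical A–M and IRT models evaluated in the paper are not reproduced here, and the error structure below is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n_cases, n_experts = 500, 5
truth = rng.normal(size=n_cases)                       # latent concept per case

expert_error = np.array([0.3, 0.3, 0.5, 1.0, 2.0])     # heterogeneous expert reliability
codes = truth[:, None] + rng.normal(size=(n_cases, n_experts)) * expert_error

for name, agg in (("mean", codes.mean(axis=1)), ("median", np.median(codes, axis=1))):
    rmse = np.sqrt(np.mean((agg - truth) ** 2))
    print(f"{name}: RMSE vs. truth = {rmse:.3f}")
```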


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Peter A. Jones ◽  
Vincent Reitano ◽  
J.S. Butler ◽  
Robert Greer

Purpose: Public management researchers commonly model dichotomous dependent variables with parametric methods despite their relatively strong assumptions about the data generating process. Without testing those assumptions and considering semiparametric alternatives, such as maximum score, estimates might be biased, or predictions might not be as accurate as possible.
Design/methodology/approach: To guide researchers, this paper provides an evaluative framework for comparing parametric estimators with semiparametric and nonparametric estimators for dichotomous dependent variables. To illustrate the framework, the article estimates the factors associated with the passage of school district bond referenda in all Texas school districts from 1998 to 2015.
Findings: Estimates show that the correct prediction of a bond passing increases from 77.2% to 78% with maximum score estimation relative to a commonly used parametric alternative. While this is a small increase, it is meaningful in comparison to the random prediction base model.
Originality/value: Future research modeling any dichotomous dependent variable can use the framework to identify the most appropriate estimator and relevant statistical programs.
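A rough sketch of the comparison on simulated data (not the Texas bond referenda data): a logit fit versus a maximum score estimator that directly maximizes the share of correctly predicted outcomes, under an error structure chosen to violate the logit's distributional assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 2000
X = rng.normal(size=(n, 2))
# Heteroskedastic, non-logistic errors: a setting where the logit's assumptions fail.
e = rng.laplace(scale=0.5 + np.abs(X[:, 0]), size=n)
y = (1.0 * X[:, 0] - 0.5 * X[:, 1] + e > 0).astype(int)

logit = LogisticRegression().fit(X, y)
logit_rate = (logit.predict(X) == y).mean()

# Maximum score with two covariates: search directions b = (cos a, sin a) and
# keep the one that classifies the most observations correctly.
angles = np.linspace(0, 2 * np.pi, 2000)
ms_rate = max(((X @ np.array([np.cos(a), np.sin(a)]) > 0).astype(int) == y).mean()
              for a in angles)

print("logit correct-prediction rate:         %.3f" % logit_rate)
print("maximum score correct-prediction rate: %.3f" % ms_rate)
```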

