Imputing missings in official statistics for general tasks – our vote for distributional accuracy

2021 ◽  
pp. 1-12
Author(s):  
Maria Thurow ◽  
Florian Dumpert ◽  
Burim Ramosaj ◽  
Markus Pauly

In statistical survey analysis, (partial) non-response is an integral part of data acquisition. Handling missing values during data preparation and data analysis is therefore a non-trivial task. Focusing on the German Structure of Earnings data from the Federal Statistical Office of Germany (DESTATIS), we investigate various imputation methods with respect to their imputation accuracy and their impact on parameter estimates in the analysis phase after imputation. Since imputation accuracy measures are not uniquely determined in theory and practice, we study different measures for assessing imputation accuracy: beyond the most common measures, the normalized root mean squared error (NRMSE) and the proportion of false classification (PFC), we put a special focus on (distributional) distance measures. The aim is to deliver guidelines for correctly assessing distributional accuracy after imputation and its potential effect on parameter estimates such as the mean gross income. Our empirical findings indicate a discrepancy between NRMSE and PFC on the one hand and distance measures on the other: while the latter measure distributional similarity, NRMSE and PFC focus on reproducing individual data values. We find that a low NRMSE or PFC is in general not accompanied by low distributional discrepancies. Distribution-based measures, however, correspond to more accurate parameter estimates such as the mean gross income under the (multiple) imputation scheme.
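The contrast between value-reproduction measures and distributional measures can be made concrete with a small sketch. The following Python snippet is illustrative only (not the authors' code; NumPy and SciPy are assumed available): it computes NRMSE, PFC, and a Kolmogorov-Smirnov distance, and shows how mean imputation of a skewed income-like variable keeps the value-level error at the mean-imputation baseline while collapsing the distribution.

```python
# Illustrative only: three ways to score an imputation against held-out true values.
# NRMSE and PFC measure how well individual values are reproduced; the
# Kolmogorov-Smirnov statistic compares the imputed and true distributions.
import numpy as np
from scipy.stats import ks_2samp

def nrmse(true, imputed):
    """Normalized root mean squared error for a numeric variable."""
    true, imputed = np.asarray(true, float), np.asarray(imputed, float)
    return np.sqrt(np.mean((true - imputed) ** 2)) / np.std(true)

def pfc(true, imputed):
    """Proportion of false classification for a categorical variable."""
    return np.mean(np.asarray(true) != np.asarray(imputed))

def ks_distance(true, imputed):
    """Distributional distance between true and imputed values."""
    return ks_2samp(true, imputed).statistic

# Toy example with a skewed 'income' variable: mean imputation keeps the
# value-level error at the baseline level but collapses the distribution,
# which the KS distance exposes.
rng = np.random.default_rng(0)
true_income = rng.lognormal(mean=8.0, sigma=0.5, size=1000)
mean_imputed = np.full_like(true_income, true_income.mean())
print("NRMSE:      ", round(nrmse(true_income, mean_imputed), 3))
print("KS distance:", round(ks_distance(true_income, mean_imputed), 3))
```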

2018 ◽  
Author(s):  
Md. Bahadur Badsha ◽  
Rui Li ◽  
Boxiang Liu ◽  
Yang I. Li ◽  
Min Xian ◽  
...  

Background: Single-cell RNA-sequencing (scRNA-seq) is a rapidly evolving technology that enables measurement of gene expression levels at an unprecedented resolution. Despite the explosive growth in the number of cells that can be assayed by a single experiment, scRNA-seq still has several limitations, including high rates of dropouts, which result in a large number of genes having zero read counts in the scRNA-seq data and complicate downstream analyses. Methods: To overcome this problem, we treat zeros as missing values and develop nonparametric deep learning methods for imputation. Specifically, our LATE (Learning with AuToEncoder) method trains an autoencoder with randomly initialized parameters, whereas our TRANSLATE (TRANSfer learning with LATE) method further allows the use of a reference gene expression data set to provide LATE with an initial set of parameter estimates. Results: On both simulated and real data, LATE and TRANSLATE outperform existing scRNA-seq imputation methods, achieving lower mean squared error in most cases, recovering nonlinear gene-gene relationships, and better separating cell types. They are also highly scalable and can efficiently process over 1 million cells in just a few hours on a GPU. Conclusions: We demonstrate that our nonparametric approach to imputation based on autoencoders is powerful and highly efficient.
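As a rough illustration of the general idea (this is not the authors' LATE or TRANSLATE implementation; the layer sizes, learning rate, and training loop below are arbitrary), an autoencoder can be trained only on the observed (nonzero) entries of the expression matrix and its reconstructions used to fill in the zeros:

```python
# Minimal sketch (not the authors' LATE/TRANSLATE code): an autoencoder trained
# only on the observed (nonzero) entries of a cells x genes expression matrix,
# whose reconstructions are then used to fill in the zeros.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_genes, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_genes)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def impute(expr, epochs=200, lr=1e-3):
    """expr: cells x genes array of (log-transformed) counts; zeros are treated as missing."""
    x = torch.tensor(expr, dtype=torch.float32)
    mask = (x > 0).float()                       # 1 where observed, 0 where dropout
    model = Autoencoder(x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        recon = model(x)
        # Reconstruction loss only on observed entries, so the zeros do not
        # drag the predictions towards zero.
        loss = ((recon - x) ** 2 * mask).sum() / mask.sum()
        loss.backward()
        opt.step()
    with torch.no_grad():
        recon = model(x)
    # Keep observed values, fill the zeros with the reconstructions.
    return (x * mask + recon * (1 - mask)).numpy()
```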


2018 ◽  
Vol 28 (5) ◽  
pp. 1311-1327 ◽  
Author(s):  
Faisal M Zahid ◽  
Christian Heumann

Missing data is a common issue that can cause problems in estimation and inference in biomedical, epidemiological and social research. Multiple imputation is an increasingly popular approach for handling missing data. In the case of a large number of covariates with missing data, existing multiple imputation software packages may not work properly and often produce errors. We propose a multiple imputation algorithm called mispr based on sequential penalized regression models. Each variable with missing values is assumed to have a different distributional form and is imputed with its own imputation model using the ridge penalty. When the number of predictors is large relative to the sample size, the quadratic penalty guarantees unique parameter estimates and leads to better predictions than the usual maximum likelihood estimation (MLE), with a good compromise between bias and variance. As a result, the proposed algorithm performs well and provides better imputed values even for a large number of covariates with small samples. The results are compared with the existing software packages mice, VIM and Amelia in simulation studies. The missing at random mechanism was the main assumption in the simulation study. The imputation performance of the proposed algorithm is evaluated with mean squared imputation error and mean absolute imputation error. The mean squared error, parameter estimates with their standard errors, and confidence intervals are also computed to compare performance in the regression context. The proposed algorithm is observed to be a good competitor to the existing algorithms, with smaller mean squared imputation error, mean absolute imputation error and mean squared error. The algorithm's performance becomes considerably better than that of the existing algorithms with an increasing number of covariates, especially when the number of predictors is close to or even greater than the sample size. Two real-life datasets are also used to examine the performance of the proposed algorithm using simulations.
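A minimal sketch of the underlying idea, assuming all incomplete columns are numeric (the mispr package itself handles different distributional forms and is not reproduced here): each incomplete column is regressed on the other columns with a ridge (L2) penalty and its missing entries are replaced by the ridge predictions, cycling over the columns several times.

```python
# Illustrative sketch of sequential ridge-penalized imputation for numeric data;
# this is not the mispr algorithm itself. Categorical columns would need their
# own (e.g. penalized logistic) models.
import numpy as np
from sklearn.linear_model import Ridge

def sequential_ridge_impute(X, n_cycles=5, alpha=1.0):
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    X_imp = np.where(missing, np.nanmean(X, axis=0), X)   # start from column means
    for _ in range(n_cycles):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]
            others = np.delete(X_imp, j, axis=1)
            # Ridge keeps the regression well-defined even when the number of
            # predictors approaches or exceeds the number of observed rows.
            model = Ridge(alpha=alpha).fit(others[obs], X_imp[obs, j])
            X_imp[missing[:, j], j] = model.predict(others[missing[:, j]])
    return X_imp
```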


Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, this can lead to a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to the available software, using these modern missing data methods does not pose a major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations as well as a deeper understanding of the processes that have led to missing values in an empirical study. This article is Part 1: it first introduces Rubin's classical definition of missing data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation; it then presents a selection of visualization tools available in different R packages for describing and exploring missing data structures.
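Rubin's mechanisms are easy to illustrate with a small simulation (a generic Python sketch rather than the R packages discussed in the article; the missingness probabilities are arbitrary): under MCAR the observed mean of y stays close to the true mean, whereas under MAR and MNAR it is distorted.

```python
# Sketch of Rubin's three missing-data mechanisms on a toy two-variable data set.
import numpy as np

rng = np.random.default_rng(42)
n = 5000
x = rng.normal(size=n)            # fully observed covariate
y = 2 * x + rng.normal(size=n)    # variable that will receive missing values

def apply_missingness(y, p_miss):
    y = y.copy()
    y[rng.random(n) < p_miss] = np.nan
    return y

# MCAR: missingness does not depend on the data at all.
y_mcar = apply_missingness(y, 0.3)
# MAR: missingness depends only on the observed covariate x.
y_mar = apply_missingness(y, 0.6 / (1 + np.exp(-x)))
# MNAR: missingness depends on the (unobserved) value of y itself.
y_mnar = apply_missingness(y, 0.6 / (1 + np.exp(-y)))

print("true mean of y:", y.mean().round(3))
for label, col in [("MCAR", y_mcar), ("MAR", y_mar), ("MNAR", y_mnar)]:
    print(label, "observed mean of y:", np.nanmean(col).round(3))
```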


2018 ◽  
Vol 30 (12) ◽  
pp. 3227-3258 ◽  
Author(s):  
Ian H. Stevenson

Generalized linear models (GLMs) have a wide range of applications in systems neuroscience describing the encoding of stimulus and behavioral variables, as well as the dynamics of single neurons. However, in any given experiment, many variables that have an impact on neural activity are not observed or not modeled. Here we demonstrate, in both theory and practice, how these omitted variables can result in biased parameter estimates for the effects that are included. In three case studies, we estimate tuning functions for common experiments in motor cortex, hippocampus, and visual cortex. We find that including traditionally omitted variables changes estimates of the original parameters and that modulation originally attributed to one variable is reduced after new variables are included. In GLMs describing single-neuron dynamics, we then demonstrate how postspike history effects can also be biased by omitted variables. Here we find that omitted variable bias can lead to mistaken conclusions about the stability of single-neuron firing. Omitted variable bias can appear in any model with confounders—where omitted variables modulate neural activity and the effects of the omitted variables covary with the included effects. Understanding how and to what extent omitted variable bias affects parameter estimates is likely to be important for interpreting the parameters and predictions of many neural encoding models.
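A toy simulation (not taken from the paper; the covariates, effect sizes, and noise level are invented) shows the basic phenomenon: when two correlated covariates drive a Poisson GLM but only one is included, the included coefficient absorbs part of the omitted effect.

```python
# Toy illustration of omitted-variable bias in a Poisson GLM: the spike count
# depends on two correlated covariates, but the reduced model includes only one.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 20000
x1 = rng.normal(size=n)                        # included covariate (e.g., a stimulus variable)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)  # omitted covariate, correlated with x1
rate = np.exp(0.5 + 0.4 * x1 + 0.6 * x2)       # true log-linear model uses both
spikes = rng.poisson(rate)

full = sm.GLM(spikes, sm.add_constant(np.column_stack([x1, x2])),
              family=sm.families.Poisson()).fit()
reduced = sm.GLM(spikes, sm.add_constant(x1),
                 family=sm.families.Poisson()).fit()

print("true coefficient on x1: 0.4")
print("full model estimate:   ", full.params[1].round(3))
print("reduced model estimate:", reduced.params[1].round(3))  # absorbs part of x2's effect
```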


2021 ◽  
Vol 19 (1) ◽  
pp. 2-20
Author(s):  
Piyush Kant Rai ◽  
Alka Singh ◽  
Muhammad Qasim

This article introduces calibration estimators under different distance measures based on two auxiliary variables in stratified sampling. The theory of the calibration estimator is presented and the calibrated weights based on different distance functions are derived. A simulation study is carried out to judge the performance of the proposed estimators based on the minimum relative root mean squared error criterion. A real-life data set is also used to demonstrate the superiority of the proposed method.
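As a sketch of one common special case, calibration under the chi-square distance with two auxiliary variables reduces to a closed-form reweighting (a generic GREG-type adjustment; the sample values, design weights, and population totals below are invented, and the stratification is ignored for brevity):

```python
# Chi-square-distance calibration: find weights w close to the design weights d
# (minimizing sum (w - d)^2 / d) subject to the calibration constraints X'w = totals.
import numpy as np

def calibrate_chi2(d, X, totals):
    """
    d:      design weights, shape (n,)
    X:      auxiliary variables for the sample, shape (n, 2)
    totals: known population totals of the auxiliaries, shape (2,)
    """
    T = X.T @ (d[:, None] * X)                  # weighted cross-product matrix
    lam = np.linalg.solve(T, totals - X.T @ d)  # Lagrange multipliers
    return d * (1 + X @ lam)                    # calibrated weights

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([rng.normal(50, 10, n), rng.normal(20, 5, n)])
d = np.full(n, 25.0)                            # equal design weights for the sketch
totals = np.array([50.0, 20.0]) * n * 25        # assumed known population totals
w = calibrate_chi2(d, X, totals)
print("calibration check:", X.T @ w, "vs", totals)  # constraints hold exactly
```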


2020 ◽  
Vol 17 (173) ◽  
pp. 20200886
Author(s):  
L. Mihaela Paun ◽  
Mitchel J. Colebank ◽  
Mette S. Olufsen ◽  
Nicholas A. Hill ◽  
Dirk Husmeier

This study uses Bayesian inference to quantify the uncertainty of model parameters and haemodynamic predictions in a one-dimensional pulmonary circulation model, based on an integration of mouse haemodynamic and micro-computed tomography imaging data. We emphasize an often neglected but important source of uncertainty: the discrepancy between the mathematical model and reality, and a misspecified noise model for the measurements (jointly called 'model mismatch'). We demonstrate that minimizing the mean squared error between the measured and the predicted data (the conventional approach) in the presence of model mismatch leads to biased and overly confident parameter estimates and haemodynamic predictions. We show that our proposed method, which represents the model mismatch with Gaussian processes, corrects this bias. Additionally, we compare a linear and a nonlinear wall model, as well as models with different vessel stiffness relations. We use formal model selection based on the Watanabe-Akaike information criterion to select the model that best predicts the pulmonary haemodynamics. Results show that the nonlinear pressure–area relationship with stiffness dependent on the unstressed radius best predicts the data measured in a control mouse.
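The effect of ignoring model mismatch can be illustrated with a toy example (unrelated to the paper's haemodynamics model and Gaussian-process treatment; the functional forms and noise level are invented): fitting a straight line by least squares to data that contain a structured discrepancy yields biased coefficients, and the reported standard errors treat the systematic residual as if it were independent noise.

```python
# Toy illustration of the consequences of minimizing mean squared error when the
# fitted model omits a systematic discrepancy term.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(7)
t = np.linspace(0, 1, 200)
true_a, true_b = 2.0, 1.0
discrepancy = 0.4 * np.sin(6 * t)                 # structured model mismatch
y = true_a * t + true_b + discrepancy + rng.normal(scale=0.05, size=t.size)

def simple_model(t, a, b):                        # the (misspecified) fitted model
    return a * t + b

popt, pcov = curve_fit(simple_model, t, y)
perr = np.sqrt(np.diag(pcov))
print("true (a, b):     ", (true_a, true_b))
print("estimates:       ", popt.round(3))         # biased by the omitted discrepancy
print("reported std err:", perr.round(3))         # treats correlated residuals as iid noise
```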


2018 ◽  
Vol 23 (3) ◽  
pp. 239-254 ◽  
Author(s):  
Amir Qamar ◽  
Mark Hall

Purpose: The purpose of this paper is to robustly establish whether firms are implementing Lean or Agile production in the automotive supply chain (SC) and, by drawing on contingency theory (CT) as our theoretical lens, independently determine whether Lean and Agile firms can be distinguished based upon contextual factors. Design/methodology/approach: Primary quantitative data from 140 firms in the West Midlands (UK) automotive industry were obtained via a constructed survey. Analysis incorporated the use of logistic regressions to calculate the probability of Lean and Agile organisations belonging to different groups amongst the contextual factors investigated. Findings: Lean and Agile firms co-exist in the automotive SC and Lean firms were found to be at higher tiers of the SC, while Agile firms were found to be at lower tiers. Originality/value: The originality of this study lies within the novel methodological attempt used to distinguish Lean and Agile production, based upon the contextual factors investigated. Not only is the importance of CT theoretically approved, but “received wisdom” within SC management is also contested. Extant literature propagates that the automotive SC is comprised of organisations that predominantly adopt Lean production methods, and that in SCs comprised of both Lean and Agile organisations, the firms closer to the customer will adopt more flexible (Agile) practices, while those that operate upstream will adopt more efficient (Lean) practices. The findings from this study have implications for theory and practice, as Lean and Agile firms can be found in the automotive SC without any relationship to the value-adding process. To speculate as to why the findings contest existing views, resource dependence theory and, more specifically, a power perspective, was invoked. The authors provide readers with a new way of thinking concerning complicated SCs and urge that the discipline of SC management adopts a “fourth” SC model, depicting a new Lean and Agile SC configuration.


2021 ◽  
pp. e1-e9
Author(s):  
Elizabeth A. Erdman ◽  
Leonard D. Young ◽  
Dana L. Bernson ◽  
Cici Bauer ◽  
Kenneth Chui ◽  
...  

Objectives. To develop an imputation method to produce estimates for suppressed values within a shared government administrative data set, facilitating accurate data sharing and statistical and spatial analyses. Methods. We developed an imputation approach that incorporated known features of suppressed Massachusetts surveillance data from 2011 to 2017 to predict missing values more precisely. Our method, applied to 35 de-identified opioid prescription data sets, combined modified previous or next substitution, followed by mean imputation and a count adjustment, to estimate suppressed values before sharing. We modeled 4 methods and compared the results to baseline mean imputation. Results. We assessed performance by comparing root mean squared error (RMSE), mean absolute error (MAE), and proportional variance between imputed and suppressed values. Our method outperformed mean imputation; it retained 46% of the suppressed values' proportional variance with better precision (22% lower RMSE and 26% lower MAE) than simple mean imputation. Conclusions. Our easy-to-implement imputation technique largely overcomes the adverse effects of low-count value suppression, with results superior to simple mean imputation. This novel method is generalizable to researchers sharing protected public health surveillance data. (Am J Public Health. Published online ahead of print September 16, 2021: e1–e9. https://doi.org/10.2105/AJPH.2021.306432 )
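A rough sketch of the general idea (this is not the authors' exact algorithm, which also applies a count adjustment; the toy series and the choice of suppressed cells are invented): suppressed values in an ordered series are filled from the previous (or, failing that, the next) observed value, with the series mean as a last resort, and the result is compared with plain mean imputation via RMSE and MAE.

```python
# Sketch of previous/next substitution with a mean-imputation fallback, plus the
# RMSE/MAE comparison used to assess imputation of suppressed values.
import numpy as np

def impute_suppressed(series):
    s = np.asarray(series, dtype=float)
    out = s.copy()
    for i in np.flatnonzero(np.isnan(s)):
        prev_obs = s[:i][~np.isnan(s[:i])]
        next_obs = s[i + 1:][~np.isnan(s[i + 1:])]
        if prev_obs.size:
            out[i] = prev_obs[-1]          # previous substitution
        elif next_obs.size:
            out[i] = next_obs[0]           # next substitution
        else:
            out[i] = np.nanmean(s)         # fallback: mean imputation
    return out

def rmse(a, b): return float(np.sqrt(np.mean((a - b) ** 2)))
def mae(a, b): return float(np.mean(np.abs(a - b)))

# Toy trending series (invented): two low counts were suppressed before sharing.
truth = np.array([2.0, 3.0, 3.0, 4.0, 6.0, 7.0, 9.0, 8.0])
suppressed = truth.copy()
suppressed[[2, 6]] = np.nan
mask = np.isnan(suppressed)

neighbour = impute_suppressed(suppressed)
mean_only = np.where(mask, np.nanmean(suppressed), suppressed)
print("neighbour-based:", rmse(truth[mask], neighbour[mask]), mae(truth[mask], neighbour[mask]))
print("mean imputation:", rmse(truth[mask], mean_only[mask]), mae(truth[mask], mean_only[mask]))
```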


2010 ◽  
Vol 6 (3) ◽  
pp. 1-10 ◽  
Author(s):  
Shichao Zhang

In this paper, the author designs an efficient method for iteratively imputing missing target values with semi-parametric kernel regression, known as the semi-parametric iterative imputation algorithm (SIIA). Even when there is little prior knowledge about the data sets, the proposed iterative imputation method, which imputes each missing value several times until the algorithm converges in each model, utilizes a substantial amount of useful information. This information includes instances containing missing values, and the approach captures the real data-set distribution more easily than parametric or nonparametric imputation techniques. Experimental results show that the author's imputation methods outperform the existing methods in terms of imputation accuracy, in particular in situations with a high missing ratio.
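A simplified sketch in the spirit of iterative kernel-regression imputation (this is not the SIIA algorithm itself, which is semi-parametric; the bandwidth, convergence rule, and toy data are arbitrary): missing target values are filled by Nadaraya-Watson regression on a covariate, and the fit is repeated, feeding the current imputations back in, until the imputed values stabilise.

```python
# Iterative kernel-regression imputation of a target variable y given a covariate x.
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2)

def kernel_regression(x_train, y_train, x_query, bandwidth=0.3):
    """Nadaraya-Watson estimate of E[y | x] at the query points."""
    w = gaussian_kernel((x_query[:, None] - x_train[None, :]) / bandwidth)
    return (w @ y_train) / w.sum(axis=1)

def iterative_kernel_impute(x, y, max_iter=20, tol=1e-6):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).copy()
    miss = np.isnan(y)
    y[miss] = np.nanmean(y)                       # crude starting values
    for _ in range(max_iter):
        # Refit using all rows (observed + current imputations), then update.
        new_vals = kernel_regression(x, y, x[miss])
        if np.max(np.abs(new_vals - y[miss])) < tol:
            break
        y[miss] = new_vals
    return y

# Toy usage: 20% of the target values are missing.
rng = np.random.default_rng(2)
x = rng.uniform(0, 3, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)
y[rng.random(200) < 0.2] = np.nan
y_completed = iterative_kernel_impute(x, y)
```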


1988 ◽  
Vol 25 (3) ◽  
pp. 301-307
Author(s):  
Wilfried R. Vanhonacker

Estimating autoregressive current effects models is not straightforward when observations are aggregated over time. The author evaluates a familiar iterative generalized least squares (IGLS) approach and contrasts it to a maximum likelihood (ML) approach. Analytic and numerical results suggest that (1) IGLS and ML provide good estimates for the response parameters in instances of positive serial correlation, (2) ML provides superior (in mean squared error) estimates for the serial correlation coefficient, and (3) IGLS might have difficulty in deriving parameter estimates in instances of negative serial correlation.
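As a generic illustration of the iterative GLS idea (a Cochrane-Orcutt-style scheme for a regression with AR(1) errors; this is not the aggregated-data estimator analysed in the article, and the simulated data are invented): residuals from the current fit are used to re-estimate the serial correlation coefficient, the data are quasi-differenced, and the regression is refit until the estimates stabilise.

```python
# Iterative GLS for y = b0 + b1*x + e, with e following an AR(1) process.
import numpy as np

def iterative_gls_ar1(y, x, max_iter=50, tol=1e-8):
    X = np.column_stack([np.ones(len(y)), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # OLS start
    rho = 0.0
    for _ in range(max_iter):
        resid = y - X @ beta
        rho_new = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])
        # Quasi-difference the data to whiten the AR(1) errors, then re-estimate.
        y_star = y[1:] - rho_new * y[:-1]
        X_star = X[1:] - rho_new * X[:-1]
        beta_new = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
        if abs(rho_new - rho) < tol and np.allclose(beta_new, beta):
            break
        beta, rho = beta_new, rho_new
    return beta, rho

# Simulated example with positive serial correlation (rho = 0.6).
rng = np.random.default_rng(5)
n, rho_true = 300, 0.6
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = rho_true * e[t - 1] + rng.normal(scale=0.5)
y = 1.0 + 2.0 * x + e
beta, rho = iterative_gls_ar1(y, x)
print("estimated intercept, slope:", beta.round(3), "estimated rho:", round(rho, 3))
```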

