Multiple Imputation with Survey Weights: A Multilevel Approach

Abstract Multiple imputation is now well established as a practical and flexible method for analyzing partially observed data, particularly under the missing at random assumption. However, when the substantive model is a weighted analysis, there is concern about the empirical performance of Rubin’s rules and also about how to appropriately incorporate possible interaction between the weights and the distribution of the study variables. One approach that has been suggested is to include the weights in the imputation model, potentially also allowing for interactions with the other variables. We show that the theoretical criterion justifying this approach can be approximately satisfied if we stratify the weights to define level-two units in our data set and include random intercepts in the imputation model. Further, if we let the covariance matrix of the variables have a random distribution across the level-two units, we also allow imputation to reflect any interaction between weight strata and the distribution of the variables. We evaluate our proposal in a number of simulation scenarios, showing it has promising performance both in terms of coverage levels of the model parameters and bias of the associated Rubin’s variance estimates. We illustrate its application to a weighted analysis of factors predicting reception-year readiness in children in the UK Millennium Cohort Study.

Download Full-text

Flexible, Free Software for Multilevel Multiple Imputation: A Review of Blimp and jomo

Journal of Educational and Behavioral Statistics ◽

10.3102/1076998619858624 ◽

2019 ◽

Vol 44 (5) ◽

pp. 625-641

Author(s):

Timothy Hayes

Keyword(s):

Multiple Imputation ◽

Interaction Model ◽

Missing At Random ◽

Analysis Model ◽

Imputation Model ◽

Model Following ◽

Software Packages ◽

Random Intercept ◽

Impute Data ◽

Multilevel Multiple Imputation

Multiple imputation is a popular method for addressing data that are presumed to be missing at random. To obtain accurate results, one’s imputation model must be congenial to (appropriate for) one’s intended analysis model. This article reviews and demonstrates two recent software packages, Blimp and jomo, to multiply impute data in a manner congenial with three prototypical multilevel modeling analyses: (1) a random intercept model, (2) a random slope model, and (3) a cross-level interaction model. Following these analysis examples, I review and discuss both software packages.

Download Full-text

The Radioactivity Depth Analysis Tool (RADPAT)

ASME 2009 12th International Conference on Environmental Remediation and Radioactive Waste Management, Volume 2 ◽

10.1115/icem2009-16144 ◽

2009 ◽

Author(s):

B. Alan Shippen ◽

Malcolm J. Joyce

Keyword(s):

Silica Sand ◽

Model Parameters ◽

Analysis Tool ◽

Data Set ◽

Plant Materials ◽

Depth Analysis ◽

Absolute Efficiency ◽

The Uk ◽

High Degree ◽

Detector Type

The Radioactive Depth Analysis Tool (RADPAT) is a PhD bursary project currently being undertaken at Lancaster University in the UK. The RADPAT project involves the development of nuclear instrumentation capable of ascertaining depth of radioactive contamination within legacy plant materials such as concrete. This paper evaluates the merits of two types of detector; sodium iodide (NaI(Tl)) and cadmium zinc telluride (CZT), both of which have been identified as possible solutions for the final RADPAT detector. A bespoke concrete phantom has been developed to allow a set depth of simulated contamination to be obtained with a low measurement error within a concrete analogue: silica sand. Utilising this phantom, in combination with the selected detectors, a set of measurements have been obtained varied with increasing depth of caesium-137 contamination. By comparing the relative attenuation of the x-ray and γ-ray photo-peaks from the data-set to that suggested by a differential attenuation law, a set of model parameters can be obtained. This model, once calibrated, describes the contact depth of contamination with the relative intensity of the peaks in a measured spectrum with a high degree of accuracy. Thus, this technique allows for a set of measurements across the surface of a given material to obtain the inherent distribution of the depth of caesium-137 contamination. This paper is primarily interested in the ability of each detector type to derive the attenuation model, paying particular attention to the associated statistical uncertainty of the fitted parameters and thus the error in the derived depth. The paper describes the contributing effects of the inherent properties of each detector; effects such as their energy resolution, absolute efficiency as well as peak-to-Compton ratio. Finally a commentary on the applicability of each selected detector type is presented, including a comment on the extension the technique to a more generic, real world solution.

Download Full-text

Considerations for using multiple imputation in propensity score-weighted analysis

10.1101/2021.06.30.21259793 ◽

2021 ◽

Author(s):

Andreas Halgreen Eiset ◽

Morten Frydenberg

Keyword(s):

Missing Data ◽

Propensity Score ◽

Confidence Interval ◽

Multiple Imputation ◽

Computational Time ◽

Standardized Mortality Ratio ◽

Data Set ◽

Propensity Score Model ◽

Weighted Analysis ◽

Percentile Confidence Interval

We present our considerations for using multiple imputation to account for missing data in propensity score-weighted analysis with bootstrap percentile confidence interval. We outline the assumptions underlying each of the methods and discuss the methodological and practical implications of our choices and briefly point to alternatives. We made a number of choices a priori for example to use logistic regression-based propensity scores to produce standardized mortality ratio-weights and Substantive Model Compatible-Full Conditional Specification to multiply impute missing data (given no violation of underlying assumptions). We present a methodology to combine these methods by choosing the propensity score model based on covariate balance, using this model as the substantive model in the multiple imputation, producing and averaging the point estimates from each multiple imputed data set to give the estimate of association and computing the percentile confidence interval by bootstrapping. The described methodology is demanding in both work-load and in computational time, however, we do not consider the prior a draw-back: it makes some of the underlying assumptions explicit and the latter may be a nuisance that will diminish with faster computers and better implementations.

Download Full-text

Practical Data Synthesis for Large Samples

Journal of Privacy and Confidentiality ◽

10.29012/jpc.v7i3.407 ◽

2018 ◽

Vol 7 (3) ◽

pp. 67-97 ◽

Cited By ~ 11

Author(s):

Gillian M Raab ◽

Beata Nowok ◽

Chris Dibben

Keyword(s):

Longitudinal Study ◽

Synthetic Data ◽

Predictive Distribution ◽

Data Sets ◽

Posterior Predictive Distribution ◽

Data Set ◽

Data Synthesis ◽

Large Samples ◽

The Uk ◽

Variance Estimates

We describe results on the creation and use of synthetic data that were derived in the context of a project to make synthetic extracts available for users of the UK Longitudinal Studies. A critical review of existing methods of inference from large synthetic data sets is presented. We introduce new variance estimates for use with large samples of completely synthesised data that do not require them to be generated from the posterior predictive distribution derived from the observed data and can be used with a single synthetic data set. We make recommendations on how to synthesise data based on these results. The practical consequences of these results are illustrated with an example from the Scottish Longitudinal Study.

Download Full-text

Multiple imputation with missing data indicators

Statistical Methods in Medical Research ◽

10.1177/09622802211047346 ◽

2021 ◽

pp. 096228022110473

Author(s):

Lauren J Beesley ◽

Irina Bondarenko ◽

Michael R Elliot ◽

Allison W Kurian ◽

Steven J Katz ◽

...

Keyword(s):

Multiple Imputation ◽

Regression Models ◽

Missing Values ◽

Final Analysis ◽

Missing At Random ◽

Imputation Model ◽

General Technique ◽

Breast Cancer Study ◽

Sequential Regression ◽

Imputation Procedure

Multiple imputation is a well-established general technique for analyzing data with missing values. A convenient way to implement multiple imputation is sequential regression multiple imputation, also called chained equations multiple imputation. In this approach, we impute missing values using regression models for each variable, conditional on the other variables in the data. This approach, however, assumes that the missingness mechanism is missing at random, and it is not well-justified under not-at-random missingness without additional modification. In this paper, we describe how we can generalize the sequential regression multiple imputation imputation procedure to handle missingness not at random in the setting where missingness may depend on other variables that are also missing but not on the missing variable itself, conditioning on fully observed variables. We provide algebraic justification for several generalizations of standard sequential regression multiple imputation using Taylor series and other approximations of the target imputation distribution under missingness not at random. Resulting regression model approximations include indicators for missingness, interactions, or other functions of the missingness not at random missingness model and observed data. In a simulation study, we demonstrate that the proposed sequential regression multiple imputation modifications result in reduced bias in the final analysis compared to standard sequential regression multiple imputation, with an approximation strategy involving inclusion of an offset in the imputation model performing the best overall. The method is illustrated in a breast cancer study, where the goal is to estimate the prevalence of a specific genetic pathogenic variant.

Download Full-text

Impact of enzalutamide (ENZA) vs. bicalutamide (BIC) on health-related quality of life (HRQoL) of patients (pts) with castration-resistant prostate cancer (CRPC): STRIVE study.

Journal of Clinical Oncology ◽

10.1200/jco.2018.36.6_suppl.234 ◽

2018 ◽

Vol 36 (6_suppl) ◽

pp. 234-234

Author(s):

Raoul Concepcion ◽

Andrew J. Armstrong ◽

Lawrence Ivan Karsh ◽

Stefan Holmstrom ◽

Cristina Ivanescu ◽

...

Keyword(s):

Prostate Cancer ◽

Multiple Imputation ◽

Lower Risk ◽

Missing At Random ◽

Minimum Clinically Important Difference ◽

Well Being ◽

Castration Resistant Prostate Cancer ◽

Imputation Model ◽

Piecewise Exponential ◽

Outcome Index

234 Background: In STRIVE pts with CRPC (M0 n = 139; M1 n = 257), median time to 10-point decrease from baseline in FACT-P total for ENZA vs. BIC was 8.4 vs. 8.3 months (hazard ratio [HR] 0.91; 95% confidence interval [CI] 0.70, 1.19; p = 0.49). That assumed missing data was missing at random (MAR) and censored pts with no deterioration in FACT-P at last assessment. As HRQoL may worsen after progression/adverse events, for all STRIVE pts we replaced the MAR assumption with assumptions more likely to reflect clinically plausible HRQoL decline. Methods: Analyses of HRQoL decline (minimum clinically important difference or higher decrease in FACT-P vs. baseline) used a missing not at random (MNAR) assumption using a pattern mixture model (PMM) via sequential modeling with multiple imputation when imputation varies by reason of treatment discontinuation. Analysis of time to first clinically meaningful deterioration vs. baseline used a piecewise exponential survival multiple imputation model with reason-specific ∆ adjustment patterns similar to PMM analysis. Results: PMM analysis showed differences at week 61 in mean HRQoL change from baseline favoring ENZA vs. BIC for 7 of 10 scores: physical (PWB), functional, emotional (EWB), and social (SWB) well-being; FACT-P trial outcome index; FACT-G total; FACT-P total (all clinically meaningful except PWB). In the piecewise exponential survival imputation model, ENZA had a significantly lower risk of first deterioration in FACT-P total (0.76 [0.60, 0.95]), FACT-G total (0.66 [0.52, 0.83]), Prostate Cancer Subscale (PCS) pain-related (0.78 [0.62, 0.97]), SWB (0.49 [0.38, 0.64]), and EWB (0.58 [0.45, 0.75]) vs. BIC. For remaining domain scores, ENZA reduces risk of first deterioration (HR < 1) but the 95% CI includes 1 (which means not significant); sensitivity analysis showed similar results. Conclusions: In STRIVE pts, declines in all FACT-P scores were smaller for ENZA vs. BIC up to week 61. Comparison of change from baseline at week 61 favored ENZA for 7 of 10 scores (6 clinically meaningful). ENZA had a significantly lower risk of first deterioration in FACT-P or FACT-G total, PCS pain-related, EWB, and SWB. Clinical trial information: NCT01664923.

Download Full-text

Data Missingness Patterns in Homicide Datasets: An Applied Test on a Primary Data Set

Violence and Victims ◽

10.1891/vv-d-17-00189 ◽

2020 ◽

Vol 35 (4) ◽

pp. 589-614

Author(s):

Melanie-Angela Neuilly ◽

Ming-Li Hsieh ◽

Alex Kigerl ◽

Zachary K. Hamilton

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing At Random ◽

Primary Data ◽

Random Pattern ◽

Validity Of Results ◽

Data Set ◽

Missing Not At Random ◽

Listwise Deletion ◽

The Relationship

Research on homicide missing data conventionally posits a Missing At Random pattern despite the relationship between missing data and clearance. The latter, however, cannot be satisfactorily modeled using variables traditionally available in homicide datasets. For this reason, it has been argued that missingness in homicide data follows a Nonignorable pattern instead. Hence, the use of multiple imputation strategies as recommended in the field for ignorable patterns would thus pose a threat to the validity of results obtained in such a way. This study examines missing data mechanisms by using a set of primary data collected in New Jersey. After comparing Listwise Deletion, Multiple Imputation, Propensity Score Matching, and Log-Multiplicative Association Models, our findings underscore that data in homicide datasets are indeed Missing Not At Random.

Download Full-text

Statistical Approaches to Decreasing the Discrepancy of Non-detects in qPCR Data

10.1101/231621 ◽

2017 ◽

Author(s):

Valeriia Sherina ◽

Helene R. McMurray ◽

Winslow Powers ◽

Hartmut Land ◽

Tanzy M.T. Love ◽

...

Keyword(s):

Gene Expression ◽

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Limit Of Detection ◽

Model Parameters ◽

Residual Variance ◽

Modeling Framework ◽

Data Set ◽

Qpcr Data

AbstractQuantitative real-time PCR (qPCR) is one of the most widely used methods to measure gene expression. Despite extensive research in qPCR laboratory protocols, normalization, and statistical analysis, little attention has been given to qPCR non-detects – those reactions failing to produce a minimum amount of signal. While most current software replaces these non-detects with a value representing the limit of detection, recent work suggests that this introduces substantial bias in estimation of both absolute and differential expression. Recently developed single imputation procedures, while better than previously used methods, underestimate residual variance, which can lead to anti-conservative inference. We propose to treat non-detects as non-random missing data, model the missing data mechanism, and use this model to impute missing values or obtain direct estimates of relevant model parameters. To account for the uncertainty inherent in the imputation, we propose a multiple imputation procedure, which provides a set of plausible values for each non-detect. In the proposed modeling framework, there are three sources of uncertainty: parameter estimation, the missing data mechanism, and measurement error. All three sources of variability are incorporated in the multiple imputation and direct estimation algorithms. We demonstrate the applicability of these methods on three real qPCR data sets and perform an extensive simulation study to assess model sensitivity to misspecification of the missing data mechanism, to the number of replicates within the sample, and to the overall size of the data set. The proposed methods result in unbiased estimates of the model parameters; therefore, these approaches may be beneficial when estimating both absolute and differential gene expression. The developed methods are implemented in the R/Bioconductor package nondetects. The statistical methods introduced here reduce discrepancies in gene expression values derived from qPCR experiments, providing more confidence in generating scientific hypotheses and performing downstream analysis.

Download Full-text

Childhood Intelligence Predicts Adult Trait Openness

Journal of Individual Differences ◽

10.1027/1614-0001/a000194 ◽

2016 ◽

Vol 37 (2) ◽

pp. 105-111 ◽

Cited By ~ 6

Author(s):

Adrian Furnham ◽

Helen Cheng

Keyword(s):

Structural Equation Modeling ◽

Structural Equation ◽

Social Background ◽

Study Data ◽

Equation Modeling ◽

Direct Effects ◽

Data Set ◽

National Child Development Study ◽

Nationally Representative ◽

The Uk

Abstract. This study used a longitudinal data set of 5,672 adults followed for 50 years to determine the factors that influence adult trait Openness-to-Experience. In a large, nationally representative sample in the UK (the National Child Development Study), data were collected at birth, in childhood (age 11), adolescence (age 16), and adulthood (ages 33, 42, and 50) to examine the effects of family social background, childhood intelligence, school motivation during adolescence, education, and occupation on the personality trait Openness assessed at age 50 years. Structural equation modeling showed that parental social status, childhood intelligence, school motivation, education, and occupation all had modest, but direct, effects on trait Openness, among which childhood intelligence was the strongest predictor. Gender was not significantly associated with trait Openness. Limitations and implications of the study are discussed.

Download Full-text

EXPONENTIATED HALF-LOGISTIC LOMAX DISTRIBUTION WITH PROPERTIES AND APPLICATION

NED University Journal of Research ◽

10.35453/nedjr-ascn-2018-0033 ◽

2019 ◽

Vol XVI (2) ◽

pp. 1-11

Author(s):

Farrukh Jamal ◽

Hesham Mohammed Reyad ◽

Soha Othman Ahmed ◽

Muhammad Akbar Ali Shah ◽

Emrah Altun

Keyword(s):

Real Data ◽

Continuous Model ◽

Model Parameters ◽

Data Set ◽

Lomax Distribution ◽

Mathematical Properties ◽

Proposed Model ◽

Probability Weighted Moment ◽

Record Statistics ◽

Maximum Likelihood Criterion

A new three-parameter continuous model called the exponentiated half-logistic Lomax distribution is introduced in this paper. Basic mathematical properties for the proposed model were investigated which include raw and incomplete moments, skewness, kurtosis, generating functions, Rényi entropy, Lorenz, Bonferroni and Zenga curves, probability weighted moment, stress strength model, order statistics, and record statistics. The model parameters were estimated by using the maximum likelihood criterion and the behaviours of these estimates were examined by conducting a simulation study. The applicability of the new model is illustrated by applying it on a real data set.

Download Full-text