Dealing with multivariate missing data in principal component analyses and subsequent model estimation: a two-step worked example using data from the Canadian Longitudinal Study of Aging

2019
Author(s): Anni Hämäläinen, Paul Mick

Missing data can be a significant problem for statistical inference in many disciplines when information is not missing completely at random. In the worst case, it can lead to biased results when participants or subjects with certain characteristics contribute more data than others. Multiple imputation methods can be used to alleviate the loss of sample size and correct for this potential bias. Multiple imputation entails filling in the missing data using information from the same and other participants on the variables of interest, and potentially from other available data that correlate with those variables. The imputed values, and the uncertainty associated with their estimation, can then be taken into account in statistical inference. A complication arises when using compound variables, such as principal component (PC) scores, which draw on a number of raw variables that themselves have non-overlapping missing data. Here, we propose a sequential multiple imputation approach that makes use of all available data in the raw variables underlying compound variables in a way that conforms to the specifications of the multiple imputation framework. We first use multiple imputation to impute missing data for the subset of raw variables used in a principal component analysis (PCA) and perform the PCA on the imputed data; we then use the factor loadings to calculate PC scores for each individual with complete raw data. Finally, we include these PC scores as part of a global multiple imputation approach to estimate a final statistical model. We demonstrate the use of this approach (including annotated Stata code) by examining which sensory, health, social and cognitive factors explain self-reported sensory difficulties in the Canadian Longitudinal Study of Aging (CLSA) Comprehensive Cohort. The proposed sequential multiple imputation approach allows us to deal with a large cumulative amount of data that are missing (not completely at random) across a large number of variables, including composite cognitive scores derived from a battery of cognitive tests. We examine the resulting parameter estimates using a range of recommended diagnostic tools to highlight the potential of the approach and its consequences for the statistical results.
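
A minimal sketch of the two-step logic described in this abstract is shown below in Python (the paper itself provides annotated Stata code). The data frame `df` and the names `raw_vars`, `covariates`, and `outcome` are hypothetical, as are the choices of scikit-learn's IterativeImputer and an OLS final model; this is a simplified variant in which one PCA is run per imputed dataset and the per-imputation results are pooled with Rubin's rules.

```python
# Sketch, not the authors' pipeline: impute PCA inputs, score the first PC,
# fold the score into a second round of imputation, fit, and pool.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.decomposition import PCA

def sequential_mi_pca(df, raw_vars, covariates, outcome, m=20, seed=0):
    """Two-step sequential MI: per-imputation PCA, then a pooled final model."""
    estimates, variances = [], []
    for i in range(m):
        # Step 1: one stochastic imputation of the raw variables feeding the PCA.
        imp = IterativeImputer(sample_posterior=True, random_state=seed + i)
        raw_imp = imp.fit_transform(df[raw_vars])
        # Step 2: PCA on the imputed raw variables; keep the first PC score.
        pc1 = PCA(n_components=1).fit_transform(raw_imp).ravel()
        # Step 3: impute the remaining analysis variables alongside the PC
        # score, then fit the final model on the completed data.
        analysis = df[covariates + [outcome]].copy()
        analysis["pc1"] = pc1
        full = IterativeImputer(sample_posterior=True,
                                random_state=seed + i).fit_transform(analysis)
        full = pd.DataFrame(full, columns=analysis.columns)
        fit = sm.OLS(full[outcome],
                     sm.add_constant(full.drop(columns=outcome))).fit()
        estimates.append(fit.params.values)
        variances.append(fit.bse.values ** 2)
    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    q = np.mean(estimates, axis=0)
    w = np.mean(variances, axis=0)
    b = np.var(estimates, axis=0, ddof=1)
    return q, np.sqrt(w + (1 + 1 / m) * b)  # pooled estimates and SEs
```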

2021
Author(s): Adrienne D. Woods, Pamela Davis-Kean, Max Andrew Halvorson, Kevin Michael King, Jessica A. R. Logan, et al.

A common challenge in developmental research is the amount of incomplete and missing data that arises when respondents fail to complete tasks or questionnaires, or disengage from the study altogether (i.e., attrition). This missingness can lead to biases in parameter estimates and, hence, in the interpretation of findings. These biases can be addressed through statistical techniques that adjust for missing data, such as multiple imputation. Although this technique is highly effective, it has not been widely adopted by developmental scientists, owing to barriers such as a lack of training and misconceptions about imputation methods; instead, researchers often fall back on software defaults such as listwise deletion. This manuscript provides practical guidelines for developmental researchers to follow when examining their data for missingness, deciding how to handle that missingness, and reporting the extent of missing data biases and the specific multiple imputation procedures used in publications.
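
The kind of missingness audit these guidelines call for can be sketched in a few lines. The sketch below assumes a pandas DataFrame `df`; the 5% reporting threshold is an illustrative choice, not a rule from the manuscript.

```python
# Sketch of a per-variable missingness report to run before deciding on an
# imputation strategy and to cite when reporting missing data in publications.
import pandas as pd

def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-variable missingness counts and rates, sorted worst-first."""
    n_missing = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(1),
    })
    return report.sort_values("pct_missing", ascending=False)

# Example: flag variables above a (hypothetical) 5% threshold.
# flagged = missingness_report(df).query("pct_missing > 5")
```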


2019 · Vol 80 (1) · pp. 41-66
Author(s): Dexin Shi, Taehun Lee, Amanda J. Fairchild, Alberto Maydeu-Olivares

This study compares two missing data procedures in the context of ordinal factor analysis models: pairwise deletion (PD; the default setting in Mplus) and multiple imputation (MI). We examine which procedure yields parameter estimates and model fit indices closer to those obtained from complete data. The performance of PD and MI is compared under a wide range of conditions, including the number of response categories, sample size, percentage of missingness, and degree of model misfit. Results indicate that both PD and MI yield parameter estimates similar to those from the analysis of complete data when the data are missing completely at random (MCAR). When the data are missing at random (MAR), PD parameter estimates are severely biased across the parameter combinations in the study. When the percentage of missingness is less than 50%, MI yields parameter estimates similar to those from complete data; however, the fit indices (i.e., χ2, RMSEA, and WRMR) suggest a worse fit than that observed with complete data. We recommend that applied researchers use MI when fitting ordinal factor models with missing data, and that they interpret model fit based on the TLI and CFI incremental fit indices.
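
The contrast between the two procedures can be illustrated outside of Mplus. The sketch below compares pairwise deletion with MI on a plain Pearson correlation matrix rather than the full ordinal (polychoric) factor model studied in the article; the DataFrame `df` and the use of scikit-learn's IterativeImputer are assumptions for illustration.

```python
# Sketch: pairwise deletion vs. multiple imputation on a correlation matrix.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def pd_vs_mi_corr(df: pd.DataFrame, m: int = 10, seed: int = 0):
    # Pairwise deletion: pandas computes each correlation from the cases
    # observed on that specific pair of variables.
    corr_pd = df.corr()
    # Multiple imputation: average the correlation matrix over m completions.
    mats = []
    for i in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=seed + i)
        completed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
        mats.append(completed.corr().values)
    corr_mi = pd.DataFrame(np.mean(mats, axis=0),
                           index=df.columns, columns=df.columns)
    return corr_pd, corr_mi
```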


2015
Author(s): Shinichi Nakagawa, Pierre de Villemereuil

Phylogenetic comparative methods (PCMs), especially those based on linear models, have played a central role in understanding species' trait evolution. These methods, however, usually assume that phylogenetic trees are known without error or uncertainty, an assumption that is most likely incorrect. So far, Markov chain Monte Carlo (MCMC)-based Bayesian methods have been deployed successfully to account for such phylogenetic uncertainty in PCMs, yet their use seems to have been limited, probably owing to difficulties in implementation. Here, we propose an approach that incorporates phylogenetic uncertainty in a simple, readily implementable and reliable manner. Our approach uses Rubin's rules, an integral part of the standard multiple imputation procedure often employed to recover missing data. In our case, we treat the true phylogenetic tree as a missing piece of data and apply Rubin's rules to amalgamate parameter estimates from a number of models fitted over a set of phylogenetic trees (e.g., a Bayesian posterior distribution of phylogenetic trees). Using a simulation study, we demonstrate that our approach accounts for phylogenetic uncertainty better than alternatives such as MCMC-based Bayesian and Akaike information criterion (AIC)-based model averaging approaches; that is, on average, it has the best 95% confidence/credible interval coverage of all the methods compared. A unique property of the multiple imputation procedure is that an index, the 'relative efficiency', can be used to quantify the number of trees required to incorporate phylogenetic uncertainty. Using the relative efficiency, we show that the required number of trees is surprisingly small (~50), at least in our simulations. In addition to these advantages, our approach can be combined seamlessly with PCMs that use multiple imputation to recover missing data. Given the ubiquity of missing data, the multiple imputation procedure with Rubin's rules is likely to become a popular way to deal with phylogenetic uncertainty as well as missing data in comparative analyses.
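
The pooling step the abstract describes is standard Rubin's-rules arithmetic, sketched below in Python. The inputs (per-tree estimates and standard errors from models fitted over a set of candidate trees) are assumed to be available; the fraction-of-missing-information formula used is the common large-m approximation.

```python
# Sketch: pool one parameter across m trees with Rubin's rules and report the
# relative efficiency used in the paper to judge how many trees are needed.
import numpy as np

def rubins_rules(estimates, std_errors):
    """Return pooled estimate, pooled SE, and relative efficiency."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2
    m = len(q)
    q_bar = q.mean()                 # pooled point estimate
    w = u.mean()                     # within-tree variance
    b = q.var(ddof=1)                # between-tree variance
    t = w + (1 + 1 / m) * b          # total variance
    fmi = (1 + 1 / m) * b / t        # approximate fraction of missing information
    rel_eff = 1 / (1 + fmi / m)      # relative efficiency for m trees
    return q_bar, np.sqrt(t), rel_eff
```

With roughly 50 trees, as in the authors' simulations, `rel_eff` is typically already close to 1, which is the sense in which a small tree set suffices.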


2021 · Vol 1 (1)
Author(s): Danielle M. Rodgers, Ross Jacobucci, Kevin J. Grimm

Decision trees (DTs) are a machine learning technique that searches the predictor space for the variable and observed value that lead to the best prediction when the data are split into two nodes based on that variable and splitting value. The algorithm repeats its search within each partition of the data until a stopping rule ends the search. Missing data are problematic in DTs because an observation with a missing value on the chosen splitting variable cannot be placed into a node, which can also distort the variable selection process. Simple missing data approaches (e.g., listwise deletion, majority rule, and surrogate splits) have been implemented in DT algorithms; however, more sophisticated missing data techniques have not been thoroughly examined. We propose a modified multiple imputation approach to handling missing data in DTs and, via Monte Carlo simulation, compare it with these simple approaches as well as with single imputation and multiple imputation with prediction averaging. The study evaluated the performance of each missing data approach when data were MAR or MCAR. The proposed multiple imputation approach and surrogate splits had superior performance, with the proposed approach performing best in the more severe missing data conditions. We conclude with recommendations for handling missing data in DTs.
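
The abstract does not give the internals of the authors' modified algorithm, but the "multiple imputation with prediction averaging" comparator can be sketched directly: impute m times, grow one tree per completed dataset, and average the predictions. The function names, the regression setting, and the scikit-learn components below are illustrative assumptions.

```python
# Sketch: MI with prediction averaging for decision trees (a comparator in
# the study, not the authors' proposed modified MI approach).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

def mi_tree_predict(X_train, y_train, X_test, m=10, seed=0):
    """Average test-set predictions over m singly-imputed training sets.
    Assumes a complete outcome y_train and numeric predictors."""
    preds = []
    for i in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=seed + i)
        Xtr = imp.fit_transform(X_train)
        Xte = imp.transform(X_test)
        tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=seed + i)
        preds.append(tree.fit(Xtr, y_train).predict(Xte))
    return np.mean(preds, axis=0)  # pooled (averaged) predictions
```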


2017 · Vol 43 (3) · pp. 316-353
Author(s): Simon Grund, Oliver Lüdtke, Alexander Robitzsch

Multiple imputation (MI) can be used to address missing data at Level 2 in multilevel research. In this article, we compare joint modeling (JM) and the fully conditional specification (FCS) of MI as well as different strategies for including auxiliary variables at Level 1 using either their manifest or their latent cluster means. We show with theoretical arguments and computer simulations that (a) an FCS approach that uses latent cluster means is comparable to JM and (b) using manifest cluster means provides similar results except in relatively extreme cases with unbalanced data. We outline a computational procedure for including latent cluster means in an FCS approach using plausible values and provide an example using data from the Programme for International Student Assessment 2012 study.
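
One ingredient of the comparison above, the manifest cluster means of Level-1 auxiliary variables, is simple to construct; the sketch below assumes a pandas DataFrame with a cluster identifier column. The latent-cluster-mean variant with plausible values requires dedicated multilevel imputation software and is not shown.

```python
# Sketch: append manifest (observed) cluster means so an FCS-style imputation
# model can use them as Level-2 auxiliary information. Names are hypothetical.
import pandas as pd

def add_manifest_cluster_means(df: pd.DataFrame, cluster: str, l1_vars):
    """Add one '<var>_cluster_mean' column per Level-1 auxiliary variable."""
    out = df.copy()
    for v in l1_vars:
        # groupby + transform broadcasts each cluster's observed mean back
        # to every row in that cluster, skipping missing values by default.
        out[f"{v}_cluster_mean"] = out.groupby(cluster)[v].transform("mean")
    return out
```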


2016 · Vol 27 (9) · pp. 2610-2626
Author(s): Thomas R Sullivan, Ian R White, Amy B Salter, Philip Ryan, Katherine J Lee

The use of multiple imputation has increased markedly in recent years, and journal reviewers may expect to see it used to handle missing data. However, in randomized trials, where treatment group is always observed and independent of baseline covariates, other approaches may be preferable. Using data simulation, we evaluated multiple imputation, performed both overall and separately by randomized group, across a range of commonly encountered scenarios. We considered both missing outcome and missing baseline data, with missing outcome data induced under missing at random mechanisms. Provided the analysis model was correctly specified, multiple imputation produced unbiased treatment effect estimates, but alternative unbiased approaches were often more efficient. When the analysis model overlooked an interaction effect involving randomized group, multiple imputation produced biased estimates of the average treatment effect when applied to missing outcome data, unless imputation was performed separately by randomized group. Based on these results, we conclude that multiple imputation should not be seen as the only acceptable way to handle missing data in randomized trials. In settings where multiple imputation is adopted, we recommend that imputation be carried out separately by randomized group.
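
The recommendation to impute separately by randomized group amounts to running the imputation model within each arm so that any treatment-by-covariate interaction is preserved. A minimal sketch follows; the column name `treatment`, the all-numeric data assumption, and the choice of imputer are illustrative, not from the paper.

```python
# Sketch: fit the imputation model within each randomized arm, then recombine.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_by_arm(df: pd.DataFrame, arm_col: str = "treatment", seed: int = 0):
    """Impute each randomized group on its own, preserving arm-specific
    relationships among outcomes and baseline covariates."""
    completed = []
    for arm, grp in df.groupby(arm_col):
        cols = grp.columns.drop(arm_col)  # arm is constant within the group
        imp = IterativeImputer(sample_posterior=True, random_state=seed)
        filled = pd.DataFrame(imp.fit_transform(grp[cols]),
                              columns=cols, index=grp.index)
        filled[arm_col] = arm
        completed.append(filled)
    return pd.concat(completed).sort_index()
```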


2021
Author(s): Marcus Richard Waldman, Katherine E. Masyn

It is well established that omitting important variables related to the propensity for missingness can lead to biased parameter estimates and invalid inference. Nevertheless, researchers conducting a person-centered analysis ubiquitously adopt a full information maximum likelihood (FIML) approach, which treats missing data under the assumption that missingness is related only to the observed indicators and not to any external variables. Such an assumption is generally considered overly restrictive in the behavioral sciences, where the data are observational in nature. At the same time, previous research has discouraged the adoption of multiple imputation for person-centered analyses because traditional imputation models make a single-class assumption and do not reflect the multiple-group structure of data with latent subpopulations (Enders & Gottschall, 2011). However, more modern imputation models that rely on recursive partitioning do not impose a single-class structure on the data. Focusing on latent profile analysis, we demonstrate in simulations that, in samples of N = 1,200 or greater, recursive partitioning imputation algorithms can incorporate external information from auxiliary variables to attenuate nonresponse bias more effectively than FIML and multivariate normal imputation. Moreover, we find that recursive partitioning imputation models lead to confidence intervals with adequate coverage and better recover posterior class probabilities than alternative missing data strategies. Taken together, our findings point to the promise and potential of multiple imputation in person-centered analyses once remaining methodological gaps around pooling and class enumeration are filled.
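
A recursive-partitioning imputation model in the spirit described here can be sketched with chained equations whose conditional models are tree ensembles, so no single-class (multivariate normal) structure is imposed. The exact algorithms studied by the authors may differ; the scikit-learn configuration below is an illustrative assumption.

```python
# Sketch: FCS-style chained imputation with tree-based conditional models,
# which can capture multiple-group structure that a normal model would miss.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

def tree_based_imputer(seed: int = 0) -> IterativeImputer:
    """Chained-equations imputer using extremely randomized trees as the
    per-variable conditional model (hypothetical settings)."""
    return IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=100, random_state=seed),
        max_iter=10,
        random_state=seed,
    )

# Usage: completed = tree_based_imputer().fit_transform(X_with_missing)
```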

