Missing Data - Better "Not to Have Them", but What If You Do? (Part 1)

Marketing ZFP, 2019, Vol. 41 (4), pp. 21-32
Authors: Dirk Temme, Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, the result can be a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to available software, using these modern missing-data methods poses no major technical obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as of the processes that produced the missing values in an empirical study. This article, Part 1 of two, first introduces Rubin's classical definition of missing-data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation. Second, it presents a selection of visualization tools, available in different R packages, for describing and exploring missing-data structures.
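To make Rubin's taxonomy concrete, the following minimal Python sketch (illustrative; not from the article) simulates the three classical mechanisms: missing completely at random (MCAR), missing at random given an observed covariate (MAR), and missing not at random (MNAR). Note how the naive observed mean is distorted only under MNAR here, because the MAR covariate is unrelated to the target variable.

```python
# Hypothetical illustration of Rubin's three missing-data mechanisms:
# values go missing independently of everything (MCAR), depending on an
# observed variable (MAR), or depending on the unobserved value itself (MNAR).
import numpy as np

rng = np.random.default_rng(42)
n = 1_000
income = rng.normal(50, 10, n)      # variable that will receive holes
age = rng.normal(40, 12, n)         # fully observed covariate

# MCAR: every value has the same 20% chance of being missing.
mcar = income.copy()
mcar[rng.random(n) < 0.20] = np.nan

# MAR: missingness depends only on the *observed* age, not on income.
p_mar = 1 / (1 + np.exp(-(age - 40) / 5))   # older -> more likely missing
mar = income.copy()
mar[rng.random(n) < p_mar * 0.4] = np.nan

# MNAR: high incomes are more likely to be withheld.
p_mnar = 1 / (1 + np.exp(-(income - 50) / 5))
mnar = income.copy()
mnar[rng.random(n) < p_mnar * 0.4] = np.nan

for label, x in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{label}: {np.isnan(x).mean():.0%} missing, "
          f"observed mean = {np.nanmean(x):.1f} (true mean ~50)")
```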

2015, Vol. 5 (2), pp. 137-148
Authors: Jeremy N. V. Miles, Priscillia Hunt

Purpose – In applied psychology research settings, such as criminal psychology, missing data are to be expected. Missing data can cause both biased estimates and a loss of statistical power. This paper aims to discuss these issues.
Design/methodology/approach – Sophisticated methods for dealing appropriately with missing data, minimizing bias and maximizing power, have been developed in recent years. The authors use an artificial data set to demonstrate the problems that can arise when some data are missing and the shortcomings of naïve attempts to handle them.
Findings – Using the artificial data set and a data set comprising the results of a survey on prices paid for recreational and medical marijuana, the authors demonstrate how multiple imputation and maximum likelihood estimation yield appropriate estimates and standard errors when data are missing.
Originality/value – Missing data are ubiquitous in applied research. This paper demonstrates that techniques for handling missing data are accessible and should be employed by researchers.
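As a concrete counterpart to the Findings, here is a minimal Python sketch (not the authors' code; the data and all parameter values are invented for illustration) in which listwise deletion biases a mean estimate under MAR missingness while multiple imputation recovers it.

```python
# Contrast listwise deletion with multiple imputation on artificial data
# where missingness in y depends on an observed covariate x (MAR).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)   # true mean(y) = 2.0

# Delete y where x is large (MAR: depends on observed x only).
y_obs = y.copy()
y_obs[x > 0.5] = np.nan
data = np.column_stack([x, y_obs])

# Listwise deletion: drop incomplete rows, then estimate the mean of y.
complete = data[~np.isnan(data).any(axis=1)]
print("listwise mean(y):", complete[:, 1].mean())     # biased low

# Multiple imputation: impute m times with different seeds, pool the means.
means = []
for m in range(20):
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    filled = imp.fit_transform(data)
    means.append(filled[:, 1].mean())
print("MI pooled mean(y):", np.mean(means))           # close to true mean
print("true mean(y):", y.mean())
```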


2016, Vol. 53 (2), pp. 83-103
Authors: Dilip C. Nath, Ramesh K. Vishwakarma, Atanu Bhattacharjee

Abstract
Methods for dealing with missing data in clinical trials have received increased attention from regulators and practitioners in the pharmaceutical industry over the last few years. Consideration of missing data in a study is important because they can lead to substantial biases and reduce overall statistical power. The problem is often caused by patients dropping out before completion of the study. The new guidelines of the International Conference on Harmonization place great emphasis on carefully choosing primary analysis methods based on clearly formulated assumptions about the missingness mechanism. The reason for dropout or withdrawal may be either related to the trial (e.g., adverse event, death, unpleasant study procedures, lack of improvement) or unrelated to it (e.g., moving away, unrelated disease). We applied selection models to liver cirrhosis patient data to analyse treatment efficacy, comparing surgery plus infusion of Human Fetal Liver Progenitor Cells (HFLPC) in consenting patients against surgery alone. We found that comparing treatment conditions while ignoring missing values potentially leads to biased conclusions.
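Selection models jointly model the outcome and the process by which observations go missing. As a hedged illustration only (the authors use likelihood-based selection models for longitudinal trial dropout, not the cross-sectional two-step version below), here is a minimal Heckman-type sketch in Python in which ignoring the selection step biases a regression slope:

```python
# A minimal two-step Heckman-type selection model, a simple member of the
# selection-model family; all data and parameters are invented.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 5_000
x = rng.normal(size=n)
# Errors of the selection and outcome equations are correlated (rho = 0.6),
# so dropout is informative about the outcome.
u, e = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], n).T
observed = (0.5 + 1.0 * x + u) > 0          # selection (dropout) equation
y = 1.0 + 2.0 * x + e                       # outcome, true slope = 2.0

# Naive analysis: regress y on x using only the observed cases.
naive = sm.OLS(y[observed], sm.add_constant(x[observed])).fit()
print("naive slope:", naive.params[1])      # attenuated

# Step 1: probit for the probability of being observed.
probit = sm.Probit(observed.astype(int), sm.add_constant(x)).fit(disp=0)
mills = norm.pdf(probit.fittedvalues) / norm.cdf(probit.fittedvalues)

# Step 2: add the inverse Mills ratio to correct for non-random selection.
X = sm.add_constant(np.column_stack([x[observed], mills[observed]]))
corrected = sm.OLS(y[observed], X).fit()
print("corrected slope:", corrected.params[1])  # closer to 2.0
```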


2018
Authors: Kieu Trinh Do, Simone Wahl, Johannes Raffler, Sophie Molnos, Michael Laimighofer, ...

Abstract
BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, a systematic assessment of the various sources of missing values and of strategies to handle them has received little attention. Missing data can occur systematically, e.g., from run day-dependent effects due to limits of detection (LOD), or randomly, for instance as a consequence of sample preparation.
METHODS: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined each method's ability to reconstruct biochemical pathways from data-driven correlation networks and to increase statistical power while preserving the strength of established metabolite quantitative trait loci (mQTLs).
RESULTS: Run day-dependent, LOD-based missing data account for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.
CONCLUSION: Missing data in untargeted MS-based metabolomics occur for various reasons. Based on our results, we recommend KNN-based imputation on observations with variable pre-selection, since it showed robust results in all evaluation schemes.
Key messages:
- Untargeted MS-based metabolomics data show missing values due to both batch-specific LOD-based and non-LOD-based effects.
- Statistical evaluation of multiple imputation methods was conducted on both simulated and real datasets.
- Biological evaluation on real data assessed the ability of imputation methods to preserve statistical inference of biochemical pathways and to correctly estimate effects of genetic variants on metabolite levels.
- KNN-based imputation on observations with variable pre-selection and K = 10 showed robust performance for all data scenarios across all evaluation schemes.
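A rough sketch of the recommended approach, KNN imputation on observations with K = 10 after a variable pre-selection step, follows. scikit-learn's KNNImputer stands in for the authors' implementation, the pre-selection rule is reduced to a simple completeness filter, and the simulated intensities are purely illustrative.

```python
# K-nearest-neighbour imputation on observations (rows) with K = 10,
# applied on the log scale to simulated metabolite intensities.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
n_samples, n_metabolites = 200, 50

# Simulate correlated, log-normal metabolite intensities.
base = rng.normal(size=(n_samples, 5))
loadings = rng.normal(size=(5, n_metabolites))
X = np.exp(base @ loadings * 0.3
           + rng.normal(scale=0.2, size=(n_samples, n_metabolites)))

# Introduce LOD-type missingness: the lowest 15% of each column vanishes.
X_miss = X.copy()
lod = np.quantile(X, 0.15, axis=0)
X_miss[X < lod] = np.nan

# Crude variable pre-selection: drop columns with > 40% missing values.
keep = np.isnan(X_miss).mean(axis=0) <= 0.40
X_sel = np.log(X_miss[:, keep])            # impute on the log scale

imputer = KNNImputer(n_neighbors=10)       # K = 10, as recommended
X_imputed = np.exp(imputer.fit_transform(X_sel))
print("remaining NaNs:", np.isnan(X_imputed).sum())
```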


2019, Vol. 3 (Supplement_1), pp. S972-S972
Authors: Chen Kan, Won Hwa Kim, Ling Xu, Noelle L. Fields

Abstract
Background: Questionnaires are widely used to evaluate cognitive function, depression, and loneliness in persons with dementia (PWDs). Successful assessment and treatment of dementia hinge on effective analysis of PWDs' answers. However, many studies, especially pilot studies, have small sample sizes, and most contain missing data because PWDs skip some study sessions due to their clinical conditions. Conventional imputation strategies are not well suited here, as insufficient samples introduce bias.
Method: A novel machine learning framework was developed, based on harmonic analysis on graphs, to robustly handle missing values. Participants were first embedded as nodes in a graph, with edges derived from their similarities in demographic information, activities of daily living, etc. Questionnaire scores with missing values were then regarded as a function on the nodes and estimated via spectral analysis of the graph under a smoothness constraint. The proposed approach was evaluated using data from our pilot study of dementia subjects (N = 15) with 15% of the data missing.
Result: A few complete variables (binary or ordinal) were available for all participants. For each such variable, we randomly removed 5 scores to mimic missing values. With our approach, we recovered the removed values with 90% accuracy on average. We were also able to impute the actual missing values in the dataset within reasonable ranges.
Conclusion: Our proposed approach imputes missing values with high accuracy despite the small sample size and should significantly boost the statistical power of small-scale studies with missing data.
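The spectral estimation step can be sketched as follows (a minimal illustration, not the authors' implementation; the similarity kernel, feature set, and regularization weight are invented). Observed scores anchor a signal on the graph, and missing scores are filled in by requiring the signal to vary smoothly over the similarity structure:

```python
# Graph-based score imputation: minimize the squared error on observed
# nodes plus a Laplacian smoothness penalty, which has a closed-form
# solution via one linear solve.
import numpy as np

rng = np.random.default_rng(3)
n = 15                                       # participants (as in the pilot)
features = rng.normal(size=(n, 4))           # e.g., demographics, ADL scores

# Edge weights from feature similarity (Gaussian kernel, illustrative).
d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / d2.mean())
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(1)) - W                    # graph Laplacian

# True scores correlate with the graph structure; hide three of them.
scores = features @ rng.normal(size=4) + rng.normal(scale=0.1, size=n)
mask = np.ones(n, dtype=bool)                # True = observed
mask[rng.choice(n, size=3, replace=False)] = False
y = np.where(mask, scores, 0.0)

# Impute by solving (M + lam * L) f = M y, i.e. minimizing
# ||f - y||^2 on observed nodes + lam * f' L f (smoothness).
M = np.diag(mask.astype(float))
lam = 0.5
f = np.linalg.solve(M + lam * L, M @ y)
print("RMSE on missing entries:",
      np.sqrt(np.mean((f[~mask] - scores[~mask]) ** 2)))
```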


2019, Vol. 20 (S24)
Authors: Jasmit Shah, Guy N. Brock, Jeremy Gaskins

Abstract
Background: With the rise of metabolomics, the development of methods to address analytical challenges in the analysis of metabolomics data is of great importance. Missing values (MVs) are pervasive, and their treatment can have a substantial impact on downstream statistical analyses. The MVs problem in metabolomics is quite challenging: a value may be missing because the metabolite is not biologically present in the sample, because it is present but at a concentration below the lower limit of detection (LOD), or because it is present but undetected due to technical issues in the sample pre-processing steps. The first two cases are considered missing not at random (MNAR), while the last is an example of missing at random (MAR). Typically, such MVs are substituted by a minimum value, which may severely bias downstream analyses.
Results: We develop a Bayesian model, called BayesMetab, that systematically accounts for missing values via a Markov chain Monte Carlo (MCMC) algorithm with data augmentation, allowing MVs to arise either from truncation below the LOD or from technical reasons unrelated to abundance. Based on a variety of performance metrics (power for detecting differential abundance, area under the curve, and bias and MSE of parameter estimates), our simulation results indicate that BayesMetab outperformed other imputation algorithms when missingness is a mixture of MAR and MNAR. Further, our approach was competitive with methods tailored specifically to MNAR in situations where missing data were entirely MNAR. Applying our approach to metabolomics data from a mouse model of myocardial infarction revealed several statistically significant, previously unidentified metabolites of direct biological relevance to the study.
Conclusions: Our findings demonstrate that BayesMetab imputes missing values and performs statistical inference better than other current methods when missing values are due to a mixture of MNAR and MAR. Analysis of real metabolomics data strongly suggests this mixture is likely to occur in practice, so it is important to consider an imputation model that accounts for a mixture of missing-data types.
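A toy sketch of the data-augmentation idea follows (this is not BayesMetab; the model and update rules are heavily simplified, and only the MNAR branch, truncation below the LOD, is implemented). Within each MCMC sweep, missing values are redrawn from a normal distribution truncated above at the LOD, and the mean parameter is then updated from the completed data:

```python
# Toy data augmentation for below-LOD missingness in one metabolite.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(11)
n, lod = 500, 1.0
x = rng.normal(1.5, 1.0, n)                  # true mean = 1.5
obs = x.copy()
obs[x < lod] = np.nan                        # MNAR: truncated below the LOD
miss = np.isnan(obs)

mu, sd = 0.0, 1.0                            # initial parameter guesses
mus = []
for sweep in range(2_000):
    # Augmentation step: redraw missing values from N(mu, sd) truncated
    # above at the LOD (all missingness is MNAR in this toy example).
    b = (lod - mu) / sd                      # standardized upper bound
    filled = obs.copy()
    filled[miss] = truncnorm.rvs(-np.inf, b, loc=mu, scale=sd,
                                 size=miss.sum(), random_state=rng)
    # Parameter step: draw mu given the completed data; the sd update is
    # a crude plug-in for brevity, not a proper posterior draw.
    mu = rng.normal(filled.mean(), sd / np.sqrt(n))
    sd = filled.std()
    mus.append(mu)

print("posterior mean of mu:", np.mean(mus[500:]))   # should move toward 1.5
```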


2021, Vol. 12
Authors: Lihan Chen, Victoria Savalei

In missing data analysis, reporting missing rates alone is insufficient for readers to determine the impact of missing data on the efficiency of parameter estimates. A more diagnostic measure, the fraction of missing information (FMI), shows how much the standard errors of parameter estimates increase because of the information lost to ignorable missing data. FMI is well known in the multiple imputation literature (Rubin, 1987), but it has only more recently been developed for full information maximum likelihood (FIML; Savalei and Rhemtulla, 2012). Sample FMI estimates using this approach have since been made accessible in the lavaan package (Rosseel, 2012) for the R statistical programming language. However, the properties of FMI estimates at finite sample sizes have not been comprehensively investigated. In this paper, we present a simulation study on the properties of three sample FMI estimates from FIML in two models common in psychology: regression and two-factor analysis. We summarize the performance of these FMI estimates and make recommendations on their application.
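For comparison, in the multiple imputation framework the abstract cites (Rubin, 1987), FMI has a simple closed form in terms of the within- and between-imputation variances. The sketch below computes this large-sample MI version for the mean of one variable; the FIML-based estimators studied in the paper are different and are available in lavaan. The data and imputation engine are illustrative, and the small-sample degrees-of-freedom correction is omitted.

```python
# FMI for one parameter (the mean) under Rubin's rules:
# FMI = (1 + 1/m) * B / T, with T = W + (1 + 1/m) * B.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
n, m = 500, 50
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)
y[x > 0.5] = np.nan                         # MAR missingness in y
data = np.column_stack([x, y])

est, var = [], []
for k in range(m):                          # m imputed datasets
    filled = IterativeImputer(sample_posterior=True,
                              random_state=k).fit_transform(data)
    yk = filled[:, 1]
    est.append(yk.mean())                   # parameter estimate
    var.append(yk.var(ddof=1) / n)          # its sampling variance

W = np.mean(var)                            # within-imputation variance
B = np.var(est, ddof=1)                     # between-imputation variance
T = W + (1 + 1 / m) * B                     # total variance
fmi = (1 + 1 / m) * B / T
print(f"FMI for the mean of y: {fmi:.2f}")
```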

