Multiple Imputation with Predictive Mean Matching Method for Numerical Missing Data

Author(s):  
Emha Fathul Akmam ◽  
Titin Siswantining ◽  
Saskya Mary Soemartojo ◽  
Devvi Sarwinda
Author(s):  
Jin Hyuk Lee ◽  
J. Charles Huber Jr.

Background: Multiple imputation (MI) is known as an effective method for handling missing data in public health research. However, it is not clear whether the method remains effective when a variable has a high percentage of missing observations. Methods: Using data from the “Predictive Study of Coronary Heart Disease” study, this study examined the effectiveness of MI in data with 20% to 80% missing observations, using the absolute bias (|bias|) and root mean square error (RMSE) of MI under Missing Completely at Random (MCAR), Missing at Random (MAR), and Not Missing at Random (NMAR) assumptions. Results: The |bias| and RMSE of MI were much smaller than those of complete case analysis (CCA) under all missing mechanisms, especially with a high percentage of missing data. In addition, the |bias| and RMSE of MI were consistent as the number of imputations increased from M=10 to M=50. Moreover, when comparing imputation mechanisms, the MCMC method had universally smaller |bias| and RMSE than the Regression method and the Predictive Mean Matching method under all missing mechanisms. Conclusion: As the percentage of missing data becomes higher, using MI is recommended, because MI produced less biased estimates under all missing mechanisms. However, when large proportions of data are missing, other factors need to be considered for proper imputation, such as the number of imputations, imputation mechanisms, and missing data mechanisms.
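The two performance measures named in this abstract can be illustrated with a short sketch. It assumes the usual simulation-study definitions (bias as the distance between the average estimate and the known true value, RMSE as the root of the mean squared deviation); the function names and the example numbers are hypothetical and are not taken from the study.

```python
import math
from statistics import fmean


def absolute_bias(estimates, true_value):
    """Absolute difference between the mean estimate and the true value."""
    return abs(fmean(estimates) - true_value)


def rmse(estimates, true_value):
    """Root of the mean squared deviation of the estimates from the true value."""
    squared_errors = [(e - true_value) ** 2 for e in estimates]
    return math.sqrt(fmean(squared_errors))


# Hypothetical estimates from repeated simulation runs (illustration only).
example_estimates = [1.0, 3.0, 2.0, 2.0]
example_bias = absolute_bias(example_estimates, true_value=2.0)
example_rmse = rmse(example_estimates, true_value=2.0)
```

An estimator can show zero bias and still have a non-zero RMSE, as in this example, which is why the abstract reports both measures.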


Rheumatology ◽  
2021 ◽  
Vol 60 (Supplement_1) ◽  
Author(s):  
Alice Gottlieb ◽  
Frank Behrens ◽  
Peter Nash ◽  
Joseph F Merola ◽  
Pascale Pellet ◽  
...  

Abstract Background/Aims  Psoriatic arthritis (PsA) is a heterogeneous disease comprising musculoskeletal and dermatological manifestations, especially plaque psoriasis. Secukinumab, an interleukin-17A inhibitor, provided significantly greater PASI75/100 responses in two head-to-head trials versus etanercept, a tumour necrosis factor inhibitor (TNFi), or ustekinumab in patients with moderate-to-severe plaque psoriasis. The EXCEED study (NCT02745080) investigated whether secukinumab was superior to adalimumab, another TNFi, as monotherapy in biologic-naive active PsA patients with active plaque psoriasis (defined as having ≥1 psoriatic plaque of ≥2 cm diameter, nail changes consistent with psoriasis or documented history of plaque psoriasis). Here we report the pre-specified skin outcomes from the EXCEED study in the subset of patients with ≥3% body surface area (BSA) affected with psoriasis at baseline. Methods  In this head-to-head, Phase 3b, randomised, double-blind, active-controlled, multicentre, parallel-group trial, patients were randomised to receive subcutaneous secukinumab 300 mg at baseline and Weeks 1-4, followed by dosing every 4 weeks until Week 48, or subcutaneous adalimumab 40 mg at baseline followed by the same dosing every 2 weeks until Week 50. The primary endpoint was superiority of secukinumab versus adalimumab on ACR20 response at Week 52. Pre-specified outcomes included the proportion of patients achieving a combined ACR50 and PASI100 response, PASI100 response, and absolute PASI score ≤3. Missing data were handled using multiple imputation. Results  Overall, 853 patients were randomised to receive secukinumab (n = 426) or adalimumab (n = 427). At baseline, 215 and 202 patients had at least 3% BSA affected with psoriasis in the secukinumab and adalimumab groups, respectively.
At Week 52, more patients achieved simultaneous improvement in ACR50 and PASI100 response with secukinumab versus adalimumab (30.7% versus 19.2%, respectively; P = 0.0087). Greater efficacy was demonstrated for secukinumab versus adalimumab for PASI100 responses and for the proportion of patients achieving absolute PASI score ≤3 (Table 1). Conclusion  In this pre-specified analysis, secukinumab provided higher responses compared with adalimumab in achievement of combined improvement in joint and skin disease (combined ACR50 and PASI100 response) and in skin-specific endpoints (PASI100 and absolute PASI score ≤3) at Week 52.

P189 Table 1: Skin-specific outcomes at Week 52

Endpoints, % response         SEC 300 mg (N = 215)   ADA 40 mg (N = 202)   P value (unadjusted)
PASI100                       46                     30                    0.0007
Combined ACR50 and PASI100    31                     19                    0.0087
Absolute PASI score ≤3        79                     65                    0.0015

P value vs ADA; unadjusted P values are presented. Multiple imputation was used for handling missing data. ADA, adalimumab; ACR, American College of Rheumatology; N, number of patients in the psoriasis subset; PASI, Psoriasis Area and Severity Index; SEC, secukinumab.

Disclosure  A. Gottlieb: Grants/research support; A.G. has received research support, consultation fees or speaker honoraria from Pfizer, AbbVie, BMS, Lilly, MSD, Novartis, Roche, Sanofi, Sandoz, Nordic, Celltrion and UCB. F. Behrens: Consultancies; F.B. is a consultant for Pfizer, AbbVie, Sanofi, Lilly, Novartis, Genzyme, Boehringer Ingelheim, Janssen, MSD, Celgene, Roche and Chugai. Grants/research support; F.B. has received grant/research support from Pfizer, Janssen, Chugai, Celgene, Lilly and Roche. P. Nash: Consultancies; P.N. is a consultant for AbbVie, Bristol Myers Squibb, Celgene, Eli Lilly, Gilead, Janssen, MSD, Novartis, Pfizer Inc., Roche, Sanofi and UCB. Member of speakers’ bureau; for AbbVie, Bristol Myers Squibb, Celgene, Eli Lilly, Gilead, Janssen, MSD, Novartis, Pfizer Inc., Roche, Sanofi and UCB. Grants/research support; P.N. has received research support from AbbVie, Bristol Myers Squibb, Celgene, Eli Lilly and Company, Gilead, Janssen, MSD, Novartis, Pfizer Inc, Roche, Sanofi and UCB. J. Merola: Consultancies; J.F.M. is a consultant for Merck, AbbVie, Dermavant, Eli Lilly, Novartis, Janssen, UCB Pharma, Celgene, Sanofi, Regeneron, Arena, Sun Pharma, Biogen, Pfizer, EMD Serono, Avotres and LEO Pharma. P. Pellet: Corporate appointments; P.P. is an employee of Novartis. Shareholder/stock ownership; P.P. is a shareholder of Novartis. L. Pricop: Corporate appointments; L.P. is an employee of Novartis. Shareholder/stock ownership; L.P. is a shareholder of Novartis. I. McInnes: Consultancies; I.M. is a consultant for AbbVie, Bristol Myers Squibb, Celgene, Eli Lilly and Company, Gilead, Janssen, Novartis, Pfizer and UCB. Grants/research support; I.M. has received grant/research support from Bristol Myers Squibb, Celgene, Eli Lilly and Company, Janssen and UCB.


2015 ◽  
Vol 8 (1) ◽  
pp. 133 ◽  
Author(s):  
Hamid Heidarian Miri ◽  
Jafar Hassanzadeh ◽  
Abdolreza Rajaeefard ◽  
Majid Mirmohammadkhani ◽  
Kambiz Ahmadi Angali

BACKGROUND: This study used multiple imputation (MI) to correct for potential nonresponse bias in measurements of fasting blood glucose (FBS) in the non-communicable disease risk factors survey conducted in Iran in 2007. METHODS: Five MI methods, namely bootstrap expectation maximization, multivariate normal regression, univariate linear regression, MI by chained equations, and predictive mean matching, were applied to impute FBS. To make FBS consistent with the normality assumption, natural logarithm (Ln) and Box-Cox (BC) transformations were applied prior to imputation. The measurements from which we intended to remove nonresponse bias were the mean of FBS and the percentage of respondents with high FBS. RESULTS: The mean of FBS did not change considerably after applying the MI methods. For the prevalence of high blood sugar, all methods on the original scale tended to increase the estimates, except for predictive mean matching, which, like all methods applied to BC- or Ln-transformed data, did not change the results. CONCLUSIONS: FBS-related measurements did not change after applying different MI methods. It seems that nonresponse bias was not an important challenge for these measurements. However, the use of MI methods resulted in more efficient estimates. Further studies on the accuracy of MI methods in these settings are encouraged.
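Whichever MI method produces the completed datasets, the per-dataset estimates are conventionally combined with Rubin's rules. The abstract does not state how its estimates were pooled, so the following is a generic sketch of that standard step rather than the authors' procedure; the function name `pool_rubin` and the example numbers are hypothetical.

```python
from statistics import fmean, variance


def pool_rubin(estimates, variances):
    """Combine M point estimates and their variances with Rubin's rules.

    estimates: one point estimate per imputed dataset (M >= 2).
    variances: the squared standard error of each estimate.
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    pooled = fmean(estimates)        # average of the M estimates
    within = fmean(variances)        # average within-imputation variance
    between = variance(estimates)    # between-imputation variance, divisor M - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates of a mean from M = 3 imputed datasets.
pooled_mean, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what distinguishes MI from single imputation: it carries the extra uncertainty due to the missing values into the final standard error.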


2021 ◽  
Vol 50 (Supplement_1) ◽  
Author(s):  
Jiaxin Zhang ◽  
S. Ghazaleh Dashti ◽  
John B. Carlin ◽  
Katherine J. Lee ◽  
Margarita Moreno-Betancur

Abstract Background Outcome regression remains widely applied for estimating causal effects in observational studies, in which causal inference is conceptualised as emulating a randomized controlled trial (RCT). Multiple imputation (MI) is a commonly used method for handling missing data, but while in RCTs it has been shown that MI should be conducted by treatment group to reduce bias, whether imputation should be conducted by exposure group in observational studies has not been studied. Methods We conducted a simulation study to evaluate the performance of seven methods for handling missing data: Complete-case analysis (CCA), MI of main effect, MI with interactions (between exposure and: outcome, a strong confounder, outcome and a strong confounder, all incomplete), and MI conducted by exposure group. We simulated data based on an example from the Victorian Adolescent Health Cohort Study. Three exposure prevalences and seven outcome generation models were considered, the latter ranging from no interaction to strong-positive or negative exposure-confounder interaction. Various missingness scenarios were examined: with incomplete outcome only or also incomplete confounders, and three levels of complexity regarding the missingness mechanism. Results For all scenarios, MI by exposure led to the least bias, followed by MI approaches that included exposure-confounder interactions. Conclusions If MI is adopted in outcome regression, we recommend conducting MI by exposure group and, when not feasible, including exposure-confounder interactions in the imputation model. Key messages Similar to RCTs, MI should be conducted by exposure group when estimating average causal effects using outcome regression in observational studies.
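The recommended "MI by exposure group" strategy amounts to building the imputation model separately within each exposure level. The sketch below shows only that stratification step. As a loudly stated simplification, it replaces the regression-based imputation model with a random draw from the observed values of the same group; the function name, field names, and data are hypothetical.

```python
import random


def impute_by_group(records, group_key, target, m=5, seed=0):
    """Create m completed copies of records, imputing target within each group.

    Simplifying assumption: a missing value (None) is filled with a random
    draw from the observed target values of the same group. Every group must
    contain at least one observed value.
    """
    rng = random.Random(seed)
    # Pools of observed target values, kept separately per exposure group.
    pools = {}
    for r in records:
        if r[target] is not None:
            pools.setdefault(r[group_key], []).append(r[target])
    datasets = []
    for _ in range(m):
        completed = []
        for r in records:
            filled = dict(r)
            if filled[target] is None:
                filled[target] = rng.choice(pools[filled[group_key]])
            completed.append(filled)
        datasets.append(completed)
    return datasets


# Hypothetical records: exposure is binary, outcome is partly missing.
records = [
    {"exposure": 0, "outcome": 1.0},
    {"exposure": 0, "outcome": 1.2},
    {"exposure": 0, "outcome": None},
    {"exposure": 1, "outcome": 5.0},
    {"exposure": 1, "outcome": 5.5},
    {"exposure": 1, "outcome": None},
]
completed_datasets = impute_by_group(records, "exposure", "outcome", m=5, seed=1)
```

Because each group has its own pool, an unexposed record can never receive a value borrowed from an exposed record, which is how imputing by group preserves exposure-specific relationships without specifying interaction terms.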


Author(s):  
Tra My Pham ◽  
Irene Petersen ◽  
James Carpenter ◽  
Tim Morris

ABSTRACT Background Ethnicity is an important factor to be considered in health research because of its association with inequality in disease prevalence and the utilisation of healthcare. Ethnicity recording has been incorporated in primary care electronic health records, and hence is available in large UK primary care databases such as The Health Improvement Network (THIN). However, since primary care data are routinely collected for clinical purposes, a large amount of data that are relevant for research including ethnicity is often missing. A popular approach for missing data is multiple imputation (MI). However, the conventional MI method assuming data are missing at random does not give plausible estimates of the ethnicity distribution in THIN compared to the general UK population. This might be due to the fact that ethnicity data in primary care are likely to be missing not at random. Objectives I propose a new MI method, termed ‘weighted multiple imputation’, to deal with data that are missing not at random in categorical variables. Methods Weighted MI combines MI and probability weights which are calculated using external data sources. Census summary statistics for ethnicity can be used to form weights in weighted MI such that the correct marginal ethnic breakdown is recovered in THIN. I conducted a simulation study to examine weighted MI when ethnicity data are missing not at random. In this simulation study which resembled a THIN dataset, ethnicity was an independent variable in a survival model alongside other covariates. Weighted MI was compared to the conventional MI and other traditional missing data methods including complete case analysis and single imputation. Results While a small bias was still present in ethnicity coefficient estimates under weighted MI, it was less severe compared to MI assuming missing at random. Complete case analysis and single imputation were inadequate to handle data that are missing not at random in ethnicity. Conclusions Although not a total cure, weighted MI represents a pragmatic approach that has potential applications not only in ethnicity but also in other incomplete categorical health indicators in electronic health records.


Author(s):  
Thelma Dede Baddoo ◽  
Zhijia Li ◽  
Samuel Nii Odai ◽  
Kenneth Rodolphe Chabi Boni ◽  
Isaac Kwesi Nooni ◽  
...  

Reconstructing missing streamflow data can be challenging when additional data are not available, and studies that impute missing data in real-world datasets in order to investigate how to ascertain the accuracy of imputation algorithms for such datasets are lacking. This study investigated the necessary complexity of missing data reconstruction schemes to obtain relevant results for a real-world single-station streamflow observation and so facilitate its further use. The investigation applied different missing data reconstruction schemes, ranging from univariate algorithms to multiple imputation methods designed for multivariate data, taking time as an explicit variable. The accuracy of these schemes was assessed using the total error measurement (TEM) and a localized error measurement (LEM) recommended in this study. The results show that univariate missing value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but the ones which provide the best results are usually time-consuming and computationally intensive. Also, multiple imputation algorithms which consider the surrounding observed values and/or which can capture the characteristics of the data provide similar results to the univariate missing data algorithms and, in some cases, perform better without the added time and computational costs when time is taken as an explicit variable. Furthermore, the LEM would be especially useful when the missing data are in specific portions of the dataset or where very large gaps of ‘missingness’ occur. Finally, proper handling of missing values in real-world hydroclimatic datasets depends on extensive study of the particular dataset to be imputed.


2021 ◽  
Author(s):  
Rahibu A. Abassi ◽  
Amina S. Msengwa ◽  
Rocky R. J. Akarro

Abstract Background Clinical data are at risk of having missing or incomplete values for several reasons, including patients’ failure to attend clinical measurements, wrong interpretations of measurements, and defects in the measurement recorder. Missing data can significantly affect the analysis, and results might be doubtful due to bias caused by omitting incomplete observations during statistical analysis, especially if a dataset is considerably small. The objective of this study is to compare several imputation methods in terms of their efficiency in filling in the missing data, so as to increase prediction and classification accuracy in a breast cancer dataset. Methods Five imputation methods, namely series mean, k-nearest neighbour, hot deck, predictive mean matching, and multiple imputation, were applied to replace the missing values in a real breast cancer dataset. The efficiency of the imputation methods was compared using the Root Mean Square Error and Mean Absolute Error to obtain a suitable complete dataset. Binary logistic regression and linear discriminant classifiers were applied to the imputed dataset to compare their efficacy in classification and discrimination. Results The evaluation of imputation methods revealed that the predictive mean matching method performed better than the other imputation methods. In addition, the binary logistic regression and linear discriminant analyses yielded almost similar values for overall classification rates, sensitivity and specificity. Conclusion Predictive mean matching imputation showed higher accuracy in estimating and replacing missing or incomplete data values in the real breast cancer dataset under study. It is an effective method for handling missing data in this scenario. We recommend replacing missing data using predictive mean matching, since it is a plausible approach to multiple imputation for numerical variables and improves estimation and prediction accuracy over the use of complete-case analysis, especially when the percentage of missing data is not very small.
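Predictive mean matching fills a missing value with an observed value borrowed from a "donor" whose predicted mean is close to that of the incomplete case. The sketch below is a minimal single-predictor illustration, not the implementation used in the study. It assumes the simplified variant that matches on fitted values from ordinary least squares, without drawing the regression parameters from their posterior as full multiple-imputation implementations do; the function names, `k`, and the data are hypothetical.

```python
import random
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def pmm_impute(x, y, k=3, rng=None):
    """Return a copy of y with each None replaced by a donor's observed value.

    Simplification: donors are matched on fitted values from one OLS fit.
    """
    rng = rng or random.Random()
    observed = [i for i, v in enumerate(y) if v is not None]
    a, b = fit_line([x[i] for i in observed], [y[i] for i in observed])
    predicted = [a + b * xi for xi in x]
    completed = list(y)
    for i, v in enumerate(y):
        if v is None:
            # The k observed cases whose predicted means are closest to case i.
            donors = sorted(observed, key=lambda j: abs(predicted[j] - predicted[i]))[:k]
            completed[i] = y[rng.choice(donors)]
    return completed


# Hypothetical data: y is missing for the second and fifth cases.
x_values = [1, 2, 3, 4, 5, 6]
y_values = [2.1, None, 6.2, 7.9, None, 12.1]
y_completed = pmm_impute(x_values, y_values, k=2, rng=random.Random(0))
```

Since every imputed value is copied from an observed case, the completed variable never contains values outside the observed range, which is one reason the method is regarded as a plausible choice for numerical variables.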

