predictive mean matching
Recently Published Documents

TOTAL DOCUMENTS: 30 (five years: 20)
H-INDEX: 5 (five years: 0)
2022 ◽  
Vol 2022 ◽  
pp. 1-8
Author(s):  
Xiaoying Lv ◽  
Ruonan Zhao ◽  
Tongsheng Su ◽  
Liyun He ◽  
Rui Song ◽  
...  

Objective. To explore the optimal fitting path for missing scale data so that the fitted data come close to the real situation of the patients' data. Methods. Based on the complete SDS dataset of 507 patients with stroke, simulated datasets were constructed in R under the three missing-data mechanisms Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), each with missing rates of 5%, 10%, 15%, 20%, 25%, 30%, 35%, and 40%. Mean substitution (MS), random forest regression (RFR), and predictive mean matching (PMM) were used to fit the data. Root mean square error (RMSE), the width of the 95% confidence interval (95% CI), and Spearman's correlation coefficient (SCC) were used to evaluate the fitting effect and determine the optimal fitting path. Results. When dealing with missing scale data, the optimal fitting path is: ① under the MCAR mechanism, when the missing proportion is less than 20%, the MS method is the most convenient; when it is greater than 20%, the RFR algorithm is the best fitting method. ② Under the MAR mechanism, when the missing proportion is less than 35%, the MS method is the most convenient; when it is greater than 35%, RFR preserves correlations better. ③ Under the MNAR mechanism, RFR is the best fitting method, especially when the missing proportion is greater than 30%. In practice, complete-case deletion is the most commonly used approach when the missing proportion is small, but the RFR algorithm can greatly expand the usable sample and save clinical research costs when the missing proportion is less than 30%. The best way to handle missing data should be chosen according to the missing mechanism and proportion of the actual data, balancing the statistical capabilities of the research team, the effectiveness of the method, and the readers' understanding.
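
The three missingness mechanisms used in the simulation above can be sketched as follows. This is a minimal Python illustration (not the authors' R code): missingness is induced by ranking a noisy score that is pure noise (MCAR), depends on an observed covariate (MAR), or depends on the variable itself (MNAR).

```python
import numpy as np

def make_missing(y, x, rate, mechanism, rng=None):
    """Return a copy of y with values set to NaN under MCAR, MAR or MNAR.

    MCAR: every entry is equally likely to be missing.
    MAR:  missingness depends on the fully observed covariate x.
    MNAR: missingness depends on the (unseen) value of y itself.
    All three are calibrated so roughly `rate` of entries end up missing.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    if mechanism == "MCAR":
        score = rng.random(n)                 # pure noise
    elif mechanism == "MAR":
        score = -x + 0.5 * rng.random(n)      # low x -> more likely missing
    elif mechanism == "MNAR":
        score = -y + 0.5 * rng.random(n)      # low y -> more likely missing
    else:
        raise ValueError(mechanism)
    # Mask the `rate` fraction of cases with the highest scores.
    cutoff = np.quantile(score, 1 - rate)
    y_miss = np.asarray(y, float).copy()
    y_miss[score > cutoff] = np.nan
    return y_miss

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)
y_mcar = make_missing(y, x, 0.20, "MCAR", rng=1)
```

Under MNAR the observed cases are no longer representative of the full sample, which is why simple methods like mean substitution degrade fastest there.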


2021 ◽  
Vol 11 (12) ◽  
pp. 1356
Author(s):  
Carlos Traynor ◽  
Tarjinder Sahota ◽  
Helen Tomkinson ◽  
Ignacio Gonzalez-Garcia ◽  
Neil Evans ◽  
...  

Missing data is a universal problem in analysing Real-World Evidence (RWE) datasets. In RWE datasets, there is a need to understand which features best correlate with clinical outcomes. In this context, the missing status of several biomarkers may appear as gaps in the dataset that hide meaningful values for analysis. Imputation methods are general strategies that replace missing values with plausible values. Using the Flatiron NSCLC dataset, including more than 35,000 subjects, we compare the imputation performance of six such methods on missing data: predictive mean matching, expectation-maximisation, factorial analysis, random forest, generative adversarial networks and multivariate imputations with tabular networks. We also conduct extensive synthetic data experiments with structural causal models. Statistical learning from incomplete datasets should select an appropriate imputation algorithm accounting for the nature of missingness, the impact of missing data, and the distribution shift induced by the imputation algorithm. For our synthetic data experiments, tabular networks had the best overall performance. Methods using neural networks are promising for complex datasets with non-linearities. However, conventional methods such as predictive mean matching work well for the Flatiron NSCLC biomarker dataset.
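
For reference, the core of predictive mean matching, the conventional method that performed well here, can be sketched in a few lines. This is an illustrative single-covariate Python version, not the implementation used in the study: fit a regression on complete cases, then fill each gap with the observed value of a donor whose predicted mean is closest.

```python
import numpy as np

def pmm_impute(x, y, k=5, rng=None):
    """Impute missing y by predictive mean matching (PMM).

    Fits a linear regression of y on x using complete cases, then for each
    missing entry draws its value from the k observed "donors" whose
    predicted means are closest to the missing entry's prediction.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    obs = ~np.isnan(y)
    # Least-squares fit on complete cases: y ~ b0 + b1 * x
    X = np.column_stack([np.ones(obs.sum()), x[obs]])
    beta, *_ = np.linalg.lstsq(X, y[obs], rcond=None)
    pred_all = beta[0] + beta[1] * x
    donors_y = y[obs]
    donors_pred = pred_all[obs]
    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        # k donors with predicted means nearest to this case's prediction
        nearest = np.argsort(np.abs(donors_pred - pred_all[i]))[:k]
        y_imp[i] = donors_y[rng.choice(nearest)]
    return y_imp

# Small demo: y depends linearly on x; knock out the first 20 values.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
y_miss = y.copy()
y_miss[:20] = np.nan
y_filled = pmm_impute(x, y_miss, k=5, rng=1)
```

Because imputed values are always drawn from observed ones, PMM preserves the marginal distribution of the variable, which limits the distribution shift the abstract warns about.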


2021 ◽  
pp. 1-9
Author(s):  
Moritz Marbach

Abstract Imputing missing values is an important preprocessing step in data analysis, but the literature offers little guidance on how to choose between imputation models. This letter suggests adopting the imputation model that generates a density of imputed values most similar to those of the observed values for an incomplete variable after balancing all other covariates. We recommend stable balancing weights as a practical approach to balance covariates whose distribution is expected to differ if the values are not missing completely at random. After balancing, discrepancy statistics can be used to compare the density of imputed and observed values. We illustrate the application of the suggested approach using simulated and real-world survey data from the American National Election Study, comparing popular imputation approaches including random forests, hot-deck, predictive mean matching, and multivariate normal imputation. An R package implementing the suggested approach accompanies this letter.
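
The final step of the suggested approach, comparing the densities of imputed and observed values with a discrepancy statistic, can be illustrated with a two-sample Kolmogorov-Smirnov statistic. The balancing step is omitted here, and the data and choice of statistic are illustrative assumptions, not the letter's exact procedure.

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the two empirical CDFs (smaller = more similar)."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
observed = rng.normal(loc=0.0, scale=1.0, size=500)

# Two hypothetical sets of imputed values for the same incomplete variable:
# one drawn from a matching distribution, one visibly shifted.
imputed_good = rng.normal(loc=0.0, scale=1.0, size=200)
imputed_bad = rng.normal(loc=1.5, scale=1.0, size=200)

d_good = ks_stat(observed, imputed_good)
d_bad = ks_stat(observed, imputed_bad)
```

Under the letter's criterion, the imputation model producing the smaller discrepancy (here `d_good`) would be preferred.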


2021 ◽  
Author(s):  
Sara Javadi ◽  
Abbas Bahrampour ◽  
Mohammad Mehdi Saber ◽  
Mohammad Reza Baneshi

Abstract Background: Among the newer multiple imputation methods, Multiple Imputation by Chained Equations (MICE) is a popular approach because of its flexibility. Our main focus in this study is to compare the performance of parametric imputation models based on predictive mean matching with that of recursive partitioning methods within MICE in the presence of interaction in the data. Methods: We compared the performance of parametric and tree-based imputation methods via simulation using two data generation models. For each combination of data generation model and imputation method, the following steps were performed: data generation, removal of observations, imputation, logistic regression analysis, and calculation of bias, Coverage Probability (CP), and Confidence Interval (CI) width for each coefficient. Furthermore, model-based and empirical standard errors, and the estimated proportion of the variance attributable to the missing data (λ), were calculated. Results: We have shown by simulation that, to impute a binary response in observations involving an interaction, manually entering the interaction term into the predictive mean matching imputation model improves the performance of the PMM method compared to the recursive partitioning models in MICE. The parametric method in which we entered the interaction term into the imputation model (MICE-Interaction) led to smaller bias and slightly higher coverage probability for the interaction effect, but slightly wider confidence intervals than tree-based imputation (especially classification and regression trees).
Conclusions: The application of MICE-Interaction led to better performance than recursive partitioning methods in MICE; however, when the user is interested in estimating the interaction but does not know enough about the structure of the observations, recursive partitioning methods can be suggested for imputing the missing values.
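
The benefit of manually entering the interaction term into a parametric imputation model can be illustrated with a deliberately simple regression-imputation sketch. This is synthetic Python data with a strong interaction, not the MICE procedure itself: the same least-squares imputer is run with and without the hand-entered interaction column.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + 2.0 * x1 * x2 + rng.normal(scale=0.5, size=n)

y_miss = y.copy()
miss = rng.random(n) < 0.3          # 30% MCAR in the response
y_miss[miss] = np.nan
obs = ~miss

def regression_impute(design):
    """Fill missing y with least-squares predictions from `design`."""
    beta, *_ = np.linalg.lstsq(design[obs], y_miss[obs], rcond=None)
    out = y_miss.copy()
    out[miss] = (design @ beta)[miss]
    return out

ones = np.ones(n)
main_only = np.column_stack([ones, x1, x2])            # interaction omitted
with_inter = np.column_stack([ones, x1, x2, x1 * x2])  # interaction entered by hand

rmse_main = np.sqrt(np.mean((regression_impute(main_only)[miss] - y[miss]) ** 2))
rmse_inter = np.sqrt(np.mean((regression_impute(with_inter)[miss] - y[miss]) ** 2))
```

When the data-generating model contains the interaction, the imputation model that includes it recovers the missing responses far more accurately, which is the intuition behind MICE-Interaction.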


2021 ◽  
Author(s):  
Rahibu A. Abassi ◽  
Amina S. Msengwa ◽  
Rocky R. J. Akarro

Abstract Background Clinical data are at risk of having missing or incomplete values for several reasons, including patients' failure to attend clinical measurements, wrong interpretation of measurements, and defects in the measurement recorder. Missing data can significantly affect the analysis, and results might be doubtful due to bias caused by omitting missed observations during statistical analysis, especially if a dataset is considerably small. The objective of this study is to compare several imputation methods in terms of efficiency in filling in missing data, so as to increase prediction and classification accuracy in a breast cancer dataset. Methods Five imputation methods, namely series mean, k-nearest neighbour, hot deck, predictive mean matching, and multiple imputation, were applied to replace the missing values in a real breast cancer dataset. The efficiency of the imputation methods was compared using Root Mean Square Error and Mean Absolute Error to obtain a suitable complete dataset. Binary logistic regression and linear discriminant classifiers were applied to the imputed dataset to compare their efficacy in classification and discrimination. Results The evaluation revealed that predictive mean matching outperformed the other imputation methods. In addition, binary logistic regression and linear discriminant analysis yielded almost identical overall classification rates, sensitivity, and specificity. Conclusion Predictive mean matching showed higher accuracy in estimating and replacing missing or incomplete values in the real breast cancer dataset under study, and is an effective method for handling missing data in this scenario. 
We recommend replacing missing data using predictive mean matching, since it is a plausible approach to multiple imputation for numerical variables and improves estimation and prediction accuracy over complete-case analysis, especially when the percentage of missing data is not very small.
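
The masking-and-scoring evaluation described in the Methods, where artificially deleted values are filled in and compared with the truth via RMSE and MAE, can be sketched as follows. The data are synthetic, and series mean versus a simple k-nearest-neighbour donor stand in for the five methods compared in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)                       # fully observed covariate
y = 3.0 * x + rng.normal(scale=1.0, size=n)  # variable to be imputed

miss = rng.random(n) < 0.25                  # mask 25% of known values
y_true_miss = y[miss]
y_obs = np.where(miss, np.nan, y)

# Series mean: every gap gets the grand mean of the observed values.
mean_fill = np.where(miss, np.nanmean(y_obs), y_obs)

# k-nearest-neighbour donor: borrow the average y of the k observed
# cases whose covariate x is closest.
def knn_fill(k=5):
    out = y_obs.copy()
    xo, yo = x[~miss], y_obs[~miss]
    for i in np.flatnonzero(miss):
        nearest = np.argsort(np.abs(xo - x[i]))[:k]
        out[i] = yo[nearest].mean()
    return out

knn_filled = knn_fill()

def rmse(a): return np.sqrt(np.mean((a[miss] - y_true_miss) ** 2))
def mae(a):  return np.mean(np.abs(a[miss] - y_true_miss))
```

Whichever method minimises RMSE and MAE on the masked entries is the one recommended for completing the dataset, which is how predictive mean matching was selected in the study.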


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Chong Kim ◽  
Kathryn L. Colborn ◽  
Stef van Buuren ◽  
Timothy Loar ◽  
Jennifer E. Stevens-Lapsley ◽  
...  

Abstract The purpose of this study was to develop and test personalized predictions for functional recovery after Total Knee Arthroplasty (TKA) surgery, using a novel neighbors-based prediction approach. We used data from 397 patients with TKA to develop the prediction methodology and then tested the predictions in a temporally distinct sample of 202 patients. The Timed Up and Go (TUG) Test was used to assess physical function. Neighbors-based predictions were generated by estimating an index patient’s prognosis from the observed recovery data of previous similar patients (a.k.a., the index patient’s “matches”). Matches were determined by an adaptation of predictive mean matching. Matching characteristics included preoperative TUG time, age, sex and Body Mass Index. The optimal number of matches was determined to be m = 35, based on low bias (− 0.005 standard deviations), accurate coverage (50% of the realized observations within the 50% prediction interval), and acceptable precision (the average width of the 50% prediction interval was 2.33 s). Predictions were well-calibrated in out-of-sample testing. These predictions have the potential to inform care decisions both prior to and following TKA surgery.
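
A PMM-style neighbors-based prediction can be sketched as follows: model the outcome's conditional mean from the matching characteristics, select the m donors with the closest predicted means, and read the prediction interval off the donors' observed outcomes. The synthetic data and reduced covariate set (two of the study's four matching characteristics) are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 397
# Synthetic stand-ins for the matching characteristics.
pre_tug = rng.normal(10, 2, n)    # preoperative TUG time (seconds)
age = rng.normal(65, 8, n)
post_tug = 0.8 * pre_tug + 0.05 * age + rng.normal(0, 1.5, n)

# Step 1: model the outcome's conditional mean from the covariates.
X = np.column_stack([np.ones(n), pre_tug, age])
beta, *_ = np.linalg.lstsq(X, post_tug, rcond=None)
train_pred = X @ beta

def predict_interval(index_pre_tug, index_age, m=35):
    """50% prediction interval from the m nearest matches (PMM-style)."""
    index_pred = beta[0] + beta[1] * index_pre_tug + beta[2] * index_age
    # Donors: training patients whose predicted means are closest.
    matches = np.argsort(np.abs(train_pred - index_pred))[:m]
    donors = post_tug[matches]    # observed recovery of similar patients
    return np.percentile(donors, 25), np.percentile(donors, 75)

lo, hi = predict_interval(11.0, 70.0)
```

Because the interval comes from real donors' outcomes rather than a parametric error model, its calibration can be checked directly, as the study did (50% of realized observations inside the 50% interval).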


Author(s):  
Jin Hyuk Lee ◽  
J. Charles Huber Jr.

Background: Multiple Imputation (MI) is known as an effective method for handling missing data in public health research. However, it is not clear that the method remains effective when the data contain a high percentage of missing observations on a variable. Methods: Using data from the “Predictive Study of Coronary Heart Disease”, this study examined the effectiveness of multiple imputation in data with 20% to 80% missing observations, using the absolute bias (|bias|) and Root Mean Square Error (RMSE) of MI measured under the Missing Completely at Random (MCAR), Missing at Random (MAR), and Not Missing at Random (NMAR) assumptions. Results: The |bias| and RMSE of MI were much smaller than those of Complete Case Analysis (CCA) under all missing mechanisms, especially with a high percentage of missing data. In addition, the |bias| and RMSE of MI were consistent regardless of increasing the number of imputations from M=10 to M=50. Moreover, when comparing imputation mechanisms, the MCMC method had universally smaller |bias| and RMSE than the Regression and Predictive Mean Matching methods under all missing mechanisms. Conclusion: As the percentage of missing data becomes higher, using MI is recommended, because MI produced less biased estimates under all missing mechanisms. However, when large proportions of data are missing, other things need to be considered for proper imputation, such as the number of imputations, the imputation mechanism, and the missing data mechanism.
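
Once the M imputed datasets have each been analysed, the per-dataset estimates are combined with Rubin's rules. A minimal sketch with toy numbers (not the study's estimates):

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool M completed-data estimates with Rubin's rules.

    estimates: length-M array of the statistic from each imputed dataset
    variances: length-M array of its squared standard error per dataset
    Returns the pooled estimate and its total variance
    T = W + (1 + 1/M) * B, where W is the mean within-imputation variance
    and B the between-imputation variance of the estimates.
    """
    q = np.asarray(estimates, float)
    u = np.asarray(variances, float)
    m = len(q)
    qbar = q.mean()
    w = u.mean()           # within-imputation variance
    b = q.var(ddof=1)      # between-imputation variance
    t = w + (1 + 1 / m) * b
    return qbar, t

# Toy example: five imputed-data analyses of the same coefficient.
est = [0.52, 0.48, 0.55, 0.50, 0.45]
var = [0.010, 0.012, 0.011, 0.010, 0.013]
qbar, total_var = rubin_pool(est, var)
```

The between-imputation term B grows with the fraction of missing information, which is why the total variance, and hence the reliability of MI, depends on the missing percentage studied here.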


2021 ◽  
Vol 80 (Suppl 1) ◽  
pp. 403.1-404
Author(s):  
S. Jurado Zapata ◽  
M. Maurits ◽  
Y. Abraham ◽  
E. Van den Akker ◽  
A. Barton ◽  
...  

Background: Patients who achieve remission promptly could have a specific genetic risk profile that supports regaining immune tolerance. The identification of these genes could provide novel drug targets. Objectives: To test the association between RA genetic risk variants and achieving remission at 6 months. Methods: We computed genetic risk scores (GRS) comprising the RA susceptibility variants [1] and HLA-SE status separately in 4425 patients across eight datasets from inception cohorts. Remission was defined as DAS28CRP<2.6 at 6 months. Missing DAS28CRP values were imputed using predictive mean matching by MICE. We first tested whether baseline DAS28CRP changed with increasing GRS using linear regression. Next, we calculated odds ratios for GRS and HLA-SE on remission using logistic regression. Heterogeneity of the outcome between datasets was mitigated by running an inverse-variance meta-analysis. Results: In the complete dataset, baseline clinical variables did not differ between patients achieving remission and those who did not (Table 1). The distribution of the GRS was consistent between datasets. Neither GRS nor HLA-SE was associated with baseline DAS28CRP (OR 1.01; 95% CI 0.99-1.04). A fixed-effect meta-analysis (Figure 1) showed no significant effect of the GRS (OR 0.99; 95% CI 0.94-1.03) or HLA-SE (OR 0.87; 95% CI 0.75-1.01) on remission at 6 months.

Table 1. Summary of the data separated by disease activity after 6 months.

                            All            Remission at 6 months   No remission at 6 months
N                           4425*          1558                    2430
Age, mean (sd)              55.38 (13.87)  55.17 (14.09)           55.62 (13.59)
Female %                    68.98%         65.43%                  70.73%
ACPA+ %                     61.94%         63.53%                  61.67%
Baseline DAS28, mean (sd)   4.76 (1.22)    4.47 (1.23)             5.1 (1.15)

*not all patients had 6 months data

Conclusion: In these combined cohorts, RA genetic risk variants are not associated with early disease remission. At baseline there was no difference in genetic risk between patients achieving remission or not. 
Studies encompassing other genetic variants are needed to elucidate the genetics of RA remission.

References: [1] Knevel R et al. Sci Transl Med. 2020;12(545):eaay1548.

Acknowledgements: This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No 777357, RTCure. This project has received funding from Pfizer Inc.

Disclosure of Interests: Samantha Jurado Zapata: None declared, Marc Maurits: None declared, Yann Abraham Employee of: Pfizer, Erik van den Akker: None declared, Anne Barton: None declared, Philip Brown: None declared, Andrew Cope: None declared, Isidoro González-Álvaro: None declared, Carl Goodyear: None declared, Annette van der Helm - van Mil: None declared, Xinli Hu Employee of: Pfizer, Thomas Huizinga: None declared, Martina Johannesson: None declared, Lars Klareskog: None declared, Dennis Lendrem: None declared, Iain McInnes: None declared, Fraser Morton: None declared, Caron Paterson: None declared, Duncan Porter: None declared, Arthur Pratt: None declared, Luis Rodriguez Rodriguez: None declared, Daniela Sieghart: None declared, Paul Studenic: None declared, Suzanne Verstappen: None declared, Leonid Padyukov: None declared, Aaron Winkler Employee of: Pfizer, John D Isaacs: None declared, Rachel Knevel Grant/research support from: Pfizer
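
The inverse-variance fixed-effect meta-analysis used to combine per-dataset odds ratios can be sketched as follows. The log odds ratios and standard errors below are toy values, not the study's estimates.

```python
import numpy as np

def fixed_effect_meta(log_ors, ses):
    """Inverse-variance fixed-effect pooling of per-cohort log odds ratios."""
    b = np.asarray(log_ors, float)
    se = np.asarray(ses, float)
    w = 1.0 / se**2                           # inverse-variance weights
    pooled = np.sum(w * b) / np.sum(w)        # precision-weighted mean
    pooled_se = 1.0 / np.sqrt(np.sum(w))
    lo = pooled - 1.96 * pooled_se
    hi = pooled + 1.96 * pooled_se
    # Back-transform from the log scale to the odds-ratio scale.
    return np.exp(pooled), (np.exp(lo), np.exp(hi))

# Toy per-dataset estimates (log OR, SE) for a GRS-on-remission effect.
or_pooled, (ci_lo, ci_hi) = fixed_effect_meta(
    [0.02, -0.05, 0.01, -0.02], [0.08, 0.10, 0.09, 0.12]
)
```

Pooling on the log scale is what makes the weights valid, since standard errors of odds ratios are only approximately normal after the log transform.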


2021 ◽  
pp. 1-11
Author(s):  
Riccardo D’Allerto ◽  
Meri Raggi

Big Data and the ‘Internet of Things’ are transforming the processes of data collection, storage and use. The relationship between data collected first-hand (primary data) and data collected by someone else (secondary data) is becoming more fluid, and new possibilities for data collection are envisaged. Data integration is emerging as a reliable strategy to overcome data shortage and other challenges such as coverage, quality, temporal misalignment and representativeness. When we have two (or more) data sources whose units are not (or only partially) overlapping and/or whose unique unit identifiers are unavailable, the different information collected can be integrated by using Micro Statistical Matching (MiSM). MiSM has been used in the social sciences, politics and economics, but there are very few applications using agricultural and farm data. We present an example of MiSM integration of primary and secondary farm data on agricultural holdings in the Emilia-Romagna region (Italy). The novelty of the work lies in the fact that the integration is carried out with non-parametric MiSM, which is compared to predictive mean matching and Bayesian linear regression, and that the matching validity is assessed with a new strategy. The main issues addressed, the lessons learned and the use in a research field characterised by critical data shortage are discussed.
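
A minimal sketch of non-parametric MiSM as distance hot-deck matching: each unit in the recipient file inherits the target variable from its nearest donor in the space of the common variables. The data, variable names and choice of Euclidean distance are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Recipient file A: common variables only (e.g. farm size, labour units).
# Donor file B: the same common variables plus a target variable z
# (e.g. an income measure) that A lacks. The units do not overlap.
n_a, n_b = 150, 400
common_a = rng.normal(size=(n_a, 2))
common_b = rng.normal(size=(n_b, 2))
z_b = 1.5 * common_b[:, 0] - common_b[:, 1] + rng.normal(0.0, 0.3, n_b)

def mism_transfer(recipients, donors, donor_z):
    """Distance hot-deck: each recipient inherits z from its nearest
    donor in the space of the common (matching) variables."""
    z_out = np.empty(len(recipients))
    for i, r in enumerate(recipients):
        d = np.sum((donors - r) ** 2, axis=1)   # squared Euclidean distance
        z_out[i] = donor_z[np.argmin(d)]
    return z_out

z_a = mism_transfer(common_a, common_b, z_b)
```

Like PMM, this transfers only genuinely observed values of z, so the integrated file preserves the donor file's marginal distribution of the target variable.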


Author(s):  
V. Jinubala ◽  
P. Jeyakumar

Data mining is an emerging research field in the analysis of agricultural data. The most important problem in extracting knowledge from agricultural data is missing values in the attributes of the selected dataset. If such deficiencies are present, the data need to be cleaned during preprocessing in order to obtain a usable dataset. The main objective of this paper is to analyse the effectiveness of various imputation methods in producing a complete dataset that is more useful for applying data mining techniques, and to present a comparative analysis of imputation methods for handling missing values. The rice crop pest dataset collected throughout Maharashtra state under the Crop Pest Surveillance and Advisory Project (CROPSAP) during 2009-2013 was used for the analysis. The methodologies compared for imputing missing values were deletion of rows, mean and median substitution, linear regression, and predictive mean matching. The comparative analysis shows that predictive mean matching performed better than the other methods and is effective for imputing missing values in large datasets.

