A Nonparametric Imputation Approach for Dealing With Missing Variables in SHR Data

2004 ◽  
Vol 8 (3) ◽  
pp. 255-266 ◽  
Author(s):  
Robert L. Flewelling

2004 ◽  
Vol 80 (2) ◽  
pp. 271-278 ◽  
Author(s):  
Badre T Hassani ◽  
Valerie LeMay ◽  
Peter L Marshall ◽  
H. Temesgen ◽  
Abdel-Azim Zumrawi

Two imputation techniques for predicting natural regeneration in the complex stands prevalent in southeastern British Columbia (BC) were compared using data from the Interior Cedar-Hemlock moist warm subzone variant 2 (ICHmw2) in the vicinity of Nelson, BC. Imputation approaches offer advantages over other modeling approaches in that they provide estimates of many variables at one time (multivariate) and make no assumptions about the probability distributions of the variables to be predicted. For tabular imputation, the average regeneration per ha was calculated for each combination of five site groups, two residual density classes, five time-since-disturbance intervals, species, and height classes. For Most Similar Neighbour (MSN) imputation, plots with both regeneration information and overstory tree and site information (reference plots) were used to impute regeneration for plots with only overstory tree and site information (target plots), by selecting the most similar reference plot. Of the two approaches studied, MSN gave better results than tabular imputation. Tabular imputation is simpler to implement, since tables of results can be published and made available for use. However, the MSN software has been made freely available, resulting in greater ease of access. Key words: multi-species, multi-cohort, nonparametric imputation, multivariate prediction, regeneration estimation
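The core of the MSN step described above is a nearest-neighbour lookup: for each target plot, find the reference plot with the most similar overstory and site attributes and copy its regeneration record. The sketch below is a minimal illustration with a plain Euclidean distance over two made-up plot attributes; the actual MSN method weights the distance using a canonical-correlation matrix derived from the reference data, which this sketch omits.

```python
import math

def nearest_reference(target, reference_plots):
    """Return the regeneration record of the reference plot whose
    overstory/site attributes are closest to the target plot.
    Each reference plot is a pair (attributes, regeneration)."""
    best, best_dist = None, float("inf")
    for attrs, regen in reference_plots:
        d = math.dist(target, attrs)  # Euclidean distance between attribute vectors
        if d < best_dist:
            best, best_dist = regen, d
    return best

# Hypothetical plots: attributes = (basal area, site index), regeneration = stems/ha
refs = [((12.0, 20.0), 1500), ((30.0, 18.0), 400), ((22.0, 25.0), 900)]
print(nearest_reference((28.0, 19.0), refs))
```

Because the neighbour's whole regeneration record is copied, the approach is naturally multivariate: every regeneration variable is imputed at once from the same most-similar plot.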


2020 ◽  
Vol 41 (Supplement_2) ◽  
Author(s):  
P Codina ◽  
M De Antonio ◽  
E Santiago-Vacas ◽  
M Domingo ◽  
E Zamora ◽  
...  

Abstract Background Contemporary heart failure (HF) management has improved significantly over the past two decades, leading to better survival. How application of contemporary HF management guidelines affects the risk of death estimated by available web-based risk scores has not been elucidated. Objective To assess changes in mortality risk prediction after a 12-month management period in a multidisciplinary HF Clinic. Methods Out of 1,689 consecutive patients with HF admitted to our ambulatory HF Clinic from May 2006 to November 2018, those who completed one year of follow-up were considered for the study. Patients without an NTproBNP measurement or with more than 3 missing variables for risk estimation were excluded. Three contemporary web-based HF risk scores were evaluated: MAGGIC-HF, the Seattle HF Model (SHFM) and the Barcelona Bio-HF Calculator containing NTproBNP (BCN Bio-HF). Risk of all-cause death at one year and at 3 years was calculated at baseline and re-evaluated after 12 months of management in a multidisciplinary HF Clinic. The Wilcoxon paired data test was used to compare changes in mortality risk estimation over time, and the test of equality of matched pairs to compare estimated change among tools. The 442 patients used to derive the Barcelona Bio-HF Calculator were excluded for discrimination purposes. Results 1,157 patients were included (age 65.7±12.7 years, 70.4% men). A significant reduction in mortality risk estimation at 12 months was observed with all three HF risk scores (Table). The BCN Bio-HF model showed significantly different changes in risk estimation, which was accompanied by numerically better discrimination. AUCs at 1 and 3 years, respectively, were: BCN Bio-HF (0.773 and 0.775), MAGGIC-HF (0.686 and 0.748) and SHFM (0.773 and 0.739). Conclusions The three web-based risk scores evaluated showed a significant reduction in mortality risk estimation after 12 months of management in a multidisciplinary HF Clinic.
The BCN Bio-HF score showed a greater reduction in estimated risk, together with better discrimination, likely because it incorporates contemporary treatments and biomarkers. Funding Acknowledgement Type of funding source: None


Author(s):  
Caio Ribeiro ◽  
Alex A. Freitas

Abstract Longitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicability and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method and a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on our results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally resulted in models with more accurate estimations for missing data and better-performing classifiers in longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data produced very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that it can be achieved through the proposed data-driven approach.
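The feature-wise selection idea can be sketched compactly: for each feature, mask known values one at a time, score every candidate imputation method by its estimation error on those masked values, and keep the best-scoring method. The two candidates below (column mean and previous-value carry-forward with a back-fill for leading gaps) are illustrative stand-ins, not the five methods the article actually evaluates.

```python
def mean_impute(values):
    """Replace missing entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    m = sum(known) / len(known)
    return [m if v is None else v for v in values]

def locf_impute(values):
    """Carry the previous known value forward; back-fill a leading gap."""
    out, last = [], None
    for v in values:
        last = v if v is not None else last
        out.append(last)
    first = next(v for v in out if v is not None)
    return [first if v is None else v for v in out]

def best_method(values, methods):
    """Rank candidate methods by masking each known value in turn
    (leave-one-out) and measuring the mean absolute estimation error."""
    def error(method):
        errs = []
        for i, v in enumerate(values):
            if v is None:
                continue
            masked = values[:i] + [None] + values[i + 1:]
            errs.append(abs(method(masked)[i] - v))
        return sum(errs) / len(errs)
    return min(methods, key=error)

# A smooth temporal series: carrying the previous wave forward beats the mean
series = [1.0, 2.0, 3.0, None, 5.0, 6.0]
print(best_method(series, [mean_impute, locf_impute]).__name__)
```

On this trending series the temporal method wins, which mirrors the article's observation that longitudinal-aware imputation tends to estimate more accurately than static methods.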


Author(s):  
Gustavo H. da Silva ◽  
Santos H. B. Dias ◽  
Lucas B. Ferreira ◽  
Jannaylton É. O. Santos ◽  
Fernando F. da Cunha

ABSTRACT FAO Penman-Monteith (FAO-PM) is considered the standard method for estimating reference evapotranspiration (ET0) but requires various meteorological data, which are often not available. The objective of this work was to evaluate the performance of the FAO-PM method with limited meteorological data, and of other methods as alternatives for estimating ET0 in Jaíba-MG. The study used daily meteorological data from 2007 to 2016 from the National Institute of Meteorology station. Daily ET0 values were randomized, and 70% of them were used to determine calibration parameters for the equations of each method under study. The remaining data were used to test the calibration against the standard method. Performance evaluation was based on Willmott's index of agreement, the confidence coefficient and the root mean square error. When one meteorological variable was missing (solar radiation, relative air humidity or wind speed), or in the simultaneous absence of wind speed and relative air humidity, the FAO-PM method showed the best performance and was therefore recommended for Jaíba. The FAO-PM method with two missing variables, one of them solar radiation, showed intermediate performance. Methods that use only air temperature data are not recommended for the region.
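Two of the performance measures named above, the root mean square error and Willmott's index of agreement, are easy to compute directly. The sketch below uses hypothetical daily ET0 values, with the "observed" series standing in for FAO-PM estimates computed from full data (the confidence coefficient, the third measure, is simply Willmott's d multiplied by the correlation coefficient and is omitted here).

```python
import math

def rmse(pred, obs):
    """Root mean square error between predicted and observed series."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def willmott_d(pred, obs):
    """Willmott's index of agreement: 1 = perfect agreement, 0 = none."""
    o_bar = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for p, o in zip(pred, obs))
    den = sum((abs(p - o_bar) + abs(o - o_bar)) ** 2 for p, o in zip(pred, obs))
    return 1 - num / den

# Hypothetical daily ET0 (mm/day): alternative method vs. FAO-PM reference
obs = [3.1, 4.0, 5.2, 4.8, 3.5]
pred = [3.0, 4.2, 5.0, 5.0, 3.4]
print(round(rmse(pred, obs), 3), round(willmott_d(pred, obs), 3))
```

In a study like this one, the metrics would be computed on the held-out 30% of days, comparing each calibrated equation's ET0 against the full-data FAO-PM values.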


2016 ◽  
Vol 02 (02) ◽  
Author(s):  
Ali Ali B ◽  
Fortun M ◽  
Belzunegui T

2021 ◽  
Vol 39 (28_suppl) ◽  
pp. 143-143
Author(s):  
Marita Yaghi ◽  
Nadeem Bilani ◽  
Iktej Jabbal ◽  
Leah Elson ◽  
Maroun Bou Zerdan ◽  
...  

143 Background: The National Cancer Database (NCDB) is a large registry that collates real-world medical record data from millions of patients in the United States. A previously published study using the NCDB found that gaps in the medical record were associated with worse overall survival outcomes. We investigated cases of breast cancer in this registry to understand which factors were predictive of records with missing data. Methods: We screened for missing data in 54 clinical parameters documented by the NCDB pertaining to the diagnosis, workup, management and survival of patients with breast cancer diagnosed between 2004 and 2017. We performed univariate statistics to describe gaps in the dataset, followed by multivariate logistic regression modeling to identify factors associated with a lack of completeness of the medical record, defined as the presence of > 3 missing variables. Results: A total of n = 2,981,732 patients were included in this analysis. The median number of missing variables per record was 3 (5.6% of clinical parameters surveyed). 52.1% of records had ≤ 3 variables missing, while 47.9% had > 3 variables missing. Predictors of a record with missing data in > 3 variables were: age, race, insurance status and facility type. Regarding race, we found that records of Asian patients were less likely to have missing data compared to records of White patients (OR 0.75, 95% CI: 0.74-0.76, p < 0.001). Conversely, there was no difference in completeness of the medical record between Black and White patients (OR 0.99, 95% CI: 0.99-1.01, p = 0.890). Patients with private insurance (OR 0.77, 95% CI 0.76-0.79, p < 0.001), Medicaid (OR 0.65, 95% CI 0.64-0.67, p < 0.001) or Medicare (OR 0.66, 95% CI 0.64-0.67, p < 0.001) were also less likely to have missing data compared to uninsured patients, with patients on private insurance being the least likely to have incomplete records.
Finally, patient records from academic programs (OR 0.91, 95% CI 0.90-0.92, p < 0.001) were less likely to contain > 3 missing variables compared to records from patients treated at community cancer programs. Conclusions: Despite the high fidelity of NCDB data, social determinants of health, including insurance status and treating facility type, were associated with differences in the completeness of the medical record. Improvements in documentation and data quality are necessary to optimize use of real-world data in cancer registries. Further research is needed to determine how these differences could be independently associated with inferior outcomes.
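The completeness definition used above (a record is incomplete when it has more than 3 missing variables) and the odds-ratio comparisons lend themselves to a short sketch. The field names below are hypothetical placeholders, not actual NCDB variable names, and `odds_ratio` computes a crude unadjusted OR from a 2x2 table rather than the adjusted estimates a logistic regression model would produce.

```python
def n_missing(record, fields):
    """Count fields that are absent or blank in a record."""
    return sum(1 for f in fields if record.get(f) in (None, ""))

def incomplete(record, fields, threshold=3):
    """Flag a record as incomplete when it has more than `threshold`
    missing variables, mirroring the study's definition."""
    return n_missing(record, fields) > threshold

def odds_ratio(a, b, c, d):
    """Crude OR for a 2x2 table: a = exposed incomplete, b = exposed
    complete, c = unexposed incomplete, d = unexposed complete."""
    return (a * d) / (b * c)

# Hypothetical field names, not real NCDB variables
FIELDS = ["grade", "er_status", "pr_status", "her2", "stage", "ln_status"]
rec = {"grade": "2", "er_status": None, "pr_status": "", "her2": None,
       "stage": "II", "ln_status": None}
print(n_missing(rec, FIELDS), incomplete(rec, FIELDS))
```

An OR below 1 for a group (as reported for insured patients and academic programs) means that group's records were less likely to cross the > 3 missing-variable threshold than the reference group's.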


2021 ◽  
pp. e1-e9
Author(s):  
Elizabeth A. Erdman ◽  
Leonard D. Young ◽  
Dana L. Bernson ◽  
Cici Bauer ◽  
Kenneth Chui ◽  
...  

Objectives. To develop an imputation method to produce estimates for suppressed values within a shared government administrative data set to facilitate accurate data sharing and statistical and spatial analyses. Methods. We developed an imputation approach that incorporated known features of suppressed Massachusetts surveillance data from 2011 to 2017 to predict missing values more precisely. Our methods for 35 de-identified opioid prescription data sets combined modified previous or next substitution followed by mean imputation and a count adjustment to estimate suppressed values before sharing. We modeled 4 methods and compared the results to baseline mean imputation. Results. We assessed performance by comparing root mean squared error (RMSE), mean absolute error (MAE), and proportional variance between imputed and suppressed values. Our method outperformed mean imputation; we retained 46% of the suppressed values' proportional variance with better precision (22% lower RMSE and 26% lower MAE) than simple mean imputation. Conclusions. Our easy-to-implement imputation technique largely overcomes the adverse effects of low-count value suppression, with superior results to simple mean imputation. This novel method is generalizable to researchers sharing protected public health surveillance data. (Am J Public Health. Published online ahead of print September 16, 2021: e1–e9. https://doi.org/10.2105/AJPH.2021.306432 )
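A simplified reading of the recipe (previous or next substitution followed by mean imputation) might look like the sketch below: each suppressed entry gets the average of its nearest observed neighbours, with the overall mean as a fallback. The published method also applies a count adjustment and tuning informed by the suppression rule, which this sketch omits.

```python
def impute_suppressed(series, marker=None):
    """Simplified previous-or-next substitution followed by mean
    imputation: a suppressed entry (the `marker` value) becomes the
    average of its nearest observed neighbours on either side; entries
    with no observed neighbour fall back to the overall observed mean."""
    known = [v for v in series if v is not marker]
    overall = sum(known) / len(known)
    out = []
    for i, v in enumerate(series):
        if v is not marker:
            out.append(v)
            continue
        prev = next((series[j] for j in range(i - 1, -1, -1)
                     if series[j] is not marker), None)
        nxt = next((series[j] for j in range(i + 1, len(series))
                    if series[j] is not marker), None)
        sides = [s for s in (prev, nxt) if s is not None]
        out.append(sum(sides) / len(sides) if sides else overall)
    return out

# Hypothetical quarterly counts with two suppressed cells (None)
print(impute_suppressed([12, None, 18, 20, None, 14]))
```

Using neighbouring time points rather than a global mean is what lets an approach like this retain more of the suppressed values' variance than simple mean imputation.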


2010 ◽  
Vol 6 (3) ◽  
pp. 1-10 ◽  
Author(s):  
Shichao Zhang

In this paper, the author designs an efficient method for iteratively imputing missing target values with semi-parametric kernel regression imputation, known as the semi-parametric iterative imputation algorithm (SIIA). While there is little prior knowledge of the datasets, the proposed iterative imputation method, which imputes each missing value several times until the algorithm converges in each model, utilizes a substantial amount of useful information. This information includes occurrences involving missing values, and the method captures the real dataset distribution more easily than parametric or nonparametric imputation techniques. Experimental results show that the author's imputation methods outperform existing methods in terms of imputation accuracy, particularly in situations with a high missing ratio.
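The nonparametric ingredient of kernel-regression imputation is the Nadaraya-Watson estimator, which predicts a missing target value as a kernel-weighted average of the observed targets. The sketch below shows a single imputation pass with a Gaussian kernel; the SIIA described above would repeat such passes, re-using previously imputed values, until convergence. Bandwidth and data here are illustrative.

```python
import math

def nw_estimate(x0, xs, ys, bandwidth=1.0):
    """Nadaraya-Watson kernel regression estimate of y at x0,
    using a Gaussian kernel over the observed (x, y) pairs."""
    weights = [math.exp(-((x0 - x) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def impute_targets(pairs, bandwidth=1.0):
    """Fill missing target values y using the complete (x, y) pairs."""
    complete = [(x, y) for x, y in pairs if y is not None]
    xs, ys = zip(*complete)
    return [(x, y if y is not None else nw_estimate(x, xs, ys, bandwidth))
            for x, y in pairs]

# Hypothetical data roughly following y = 2x; the missing y at x = 3
# should land near 6, pulled toward its closest neighbours
data = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.1)]
filled = impute_targets(data)
print(round(filled[2][1], 2))
```

No distributional form is assumed for y: the estimate comes entirely from the kernel-weighted neighbourhood, which is what makes the approach semi-parametric when combined with a parametric mean component.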


Author(s):  
Tshilidzi Marwala

This chapter develops and compares the merits of three different data imputation models by using accuracy measures. The three methods are auto-associative neural networks, principal component analysis and support vector regression, all combined with cultural genetic algorithms to impute missing variables. The use of principal component analysis improves the overall performance of the auto-associative network, while the use of support vector regression shows promising potential for future investigation. Imputation accuracies of up to 97.4% are achieved for some of the variables.
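Of the three models, the principal-component idea is the easiest to sketch: fill missing cells with column means, then repeatedly project the data onto a low-rank PCA reconstruction and copy the reconstructed values back into the missing cells. This generic iterative-PCA sketch is a stand-in, not the chapter's method, which additionally optimizes the models with cultural genetic algorithms.

```python
import numpy as np

def pca_impute(X, rank=1, n_iter=50):
    """Iterative PCA imputation: initialize missing cells with column
    means, then alternate between a rank-`rank` SVD reconstruction of
    the centered data and overwriting the missing cells with the
    reconstructed values."""
    X = np.array(X, dtype=float)
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        X[mask] = approx[mask]  # only missing cells are updated
    return X

# Strongly correlated columns (y = 2x): the missing cell should converge
# toward 2 * 3 = 6 as the rank-1 structure is recovered
data = [[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]]
filled = pca_impute(data, rank=1)
print(round(filled[2][1], 1))
```

Because only the missing cells are overwritten each iteration, the observed data are preserved exactly while the imputations are pulled toward the dominant correlation structure, which is the sense in which PCA can improve an auto-associative imputer.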

