Evaluating Methods for Imputing Missing Data from Longitudinal Monitoring of Athlete Workload

2021 ◽  
pp. 188-196 ◽  
Author(s):  
Lauren C. Benson ◽  
Carlyn Stilling ◽  
Oluwatoyosi B.A. Owoeye ◽  
Carolyn A. Emery

Missing data can influence calculations of accumulated athlete workload. The objectives were to identify the best single imputation methods and examine workload trends using multiple imputation. External (jumps per hour) and internal (session rating of perceived exertion; sRPE) workload were recorded for 93 (45 females, 48 males) high school basketball players throughout a season. Recorded data were simulated as missing and imputed using ten imputation methods based on the context of the individual, team and session. Both single imputation and machine learning methods were used to impute the simulated missing data. The difference between the imputed data and the actual workload values was computed as root mean squared error (RMSE). A generalized estimating equation determined the effect of imputation method on RMSE. Multiple imputation of the original dataset, with all known and actual missing workload data, was used to examine trends in longitudinal workload data. Following multiple imputation, a Pearson correlation evaluated the longitudinal association between jump count and sRPE over the season. A single imputation method based on the specific context of the session for which data are missing (team mean) was outperformed only by methods that combine information about the session and the individual (machine learning models). There was a significant and strong association between jump count and sRPE in the original data and in the datasets imputed using multiple imputation. The amount and nature of the missing data should be considered when choosing a method for single imputation of workload data in youth basketball. Multiple imputation using several predictor variables in a regression model can be used for analyses where workload is accumulated across an entire season.
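The team-mean imputation and RMSE evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' code; the player counts, session counts and workload distribution are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: jumps per hour for 10 players over 20 sessions
# (rows = players, columns = sessions); shapes and rates are illustrative.
workload = rng.poisson(lam=40, size=(10, 20)).astype(float)

# Simulate 15% of the recorded values as missing completely at random.
mask = rng.random(workload.shape) < 0.15
observed = workload.copy()
observed[mask] = np.nan

# Team-mean imputation: replace each missing value with the mean of the
# remaining players in the same session (column).
session_means = np.nanmean(observed, axis=0)
imputed = np.where(np.isnan(observed), session_means, observed)

# RMSE between imputed and true values, over the simulated-missing cells only.
rmse = np.sqrt(np.mean((imputed[mask] - workload[mask]) ** 2))
print(round(float(rmse), 2))
```

Because the true values are known before masking, the RMSE directly scores each candidate imputation method on the same simulated-missing cells.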

2019 ◽  
Vol 6 (339) ◽  
pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not confined to any particular scientific domain; it arises in economics, sociology, education, the behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete them. However, this leads to ineffective and biased results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In this paper, selected multiple imputation methods were considered, with special attention paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations compared with two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Missing values were then imputed with MICE, missForest and the PCA‑based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as a multiple imputation method providing the lowest rates of imputation errors for all types of missingness, whereas PCA‑based imputation does not perform well in terms of accuracy.
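The NRMSE accuracy measure used in this comparison can be computed as below. A minimal sketch, assuming the common convention (used by missForest) of normalising the mean squared error by the variance of the true values; the mean-imputation baseline and the data are purely illustrative.

```python
import numpy as np

def nrmse(imputed, actual, mask):
    """Normalised RMSE over the cells that were simulated as missing.

    Normalises by the variance of the true values, so a method that just
    imputes the mean scores close to 1 and a perfect method scores 0.
    """
    diff = imputed[mask] - actual[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(actual[mask])))

rng = np.random.default_rng(1)
actual = rng.normal(size=(100, 5))
mask = rng.random(actual.shape) < 0.3          # 30% simulated missingness

# Mean imputation as a simple baseline method to score.
observed = actual.copy()
observed[mask] = np.nan
col_means = np.nanmean(observed, axis=0)
imputed = np.where(np.isnan(observed), col_means, observed)

score = nrmse(imputed, actual, mask)
print(round(score, 3))
```

Running the same scoring function over MICE, missForest and MIPCA outputs on each simulated dataset yields the per-method error rates the study compares.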


Mathematics ◽  
2021 ◽  
Vol 9 (24) ◽  
pp. 3252
Author(s):  
Encarnación Álvarez-Verdejo ◽  
Pablo J. Moya-Fernández ◽  
Juan F. Muñoz-Rosas

The problem of missing data is a common feature in any study, and a single imputation method is often applied to deal with this problem. The first contribution of this paper is to analyse the empirical performance of some traditional single imputation methods when they are applied to the estimation of the Gini index, a popular measure of inequality used in many studies. Various methods for constructing confidence intervals for the Gini index are also empirically evaluated. We consider several empirical measures to analyse the performance of estimators and confidence intervals, allowing us to quantify the magnitude of the non-response bias problem. We find extremely large biases under certain non-response mechanisms, and this problem gets noticeably worse as the proportion of missing data increases. For a large correlation coefficient between the target and auxiliary variables, the regression imputation method may notably mitigate this bias problem, yielding appropriate mean square errors. We also find that confidence intervals have poor coverage rates when the probability of data being missing is not uniform, and that the regression imputation method substantially improves the handling of this problem as the correlation coefficient increases.
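The effect of non-uniform nonresponse on the Gini index, and how regression imputation from a correlated auxiliary variable mitigates it, can be sketched as follows. The income model, missingness mechanism and coefficients are invented for illustration; the Gini formula is the standard sorted-vector expression.

```python
import numpy as np

def gini(y):
    """Gini index of a nonnegative vector, via the sorted-values formula
    G = (2 * sum(i * y_i)) / (n * sum(y)) - (n + 1) / n."""
    y = np.sort(np.asarray(y, dtype=float))
    n = y.size
    i = np.arange(1, n + 1)
    return float(2.0 * np.sum(i * y) / (n * np.sum(y)) - (n + 1.0) / n)

rng = np.random.default_rng(2)
n = 2000
aux = rng.lognormal(mean=3.0, sigma=0.5, size=n)     # auxiliary variable
income = 2.0 * aux + rng.normal(scale=5.0, size=n)   # strongly correlated target
income = np.clip(income, 0.01, None)

# Non-uniform nonresponse: higher incomes are more likely to be missing.
p_miss = np.clip(income / income.max(), 0.05, 0.6)
miss = rng.random(n) < p_miss

# Regression imputation from the auxiliary variable, fit on responders only.
slope, intercept = np.polyfit(aux[~miss], income[~miss], 1)
imputed = income.copy()
imputed[miss] = intercept + slope * aux[miss]

g_true, g_imp = gini(income), gini(imputed)
print(round(g_true, 3), round(g_imp, 3))
```

With a strong auxiliary correlation the imputed-data Gini stays close to the complete-data value, whereas dropping the (disproportionately high-income) nonrespondents would bias it downward.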


Missing data imputation is an essential task, because removing all records with missing values discards useful information from the other attributes. This paper estimates the prediction performance for an autism dataset with imputed missing values. Statistical imputation methods, such as mean imputation and imputation with zero or a constant, and machine learning imputation methods, such as K-nearest neighbour and chained-equation methods, were compared with the proposed deep learning imputation method. Predictions for patients with autism spectrum disorder were made using a support vector machine on each imputed dataset. Among the imputation methods, the deep learning algorithm outperformed the statistical and machine learning imputation methods. This was validated by significant differences in p-values revealed by Friedman's test.
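The comparison of statistical and machine learning imputers described above can be sketched with scikit-learn, scoring each method by RMSE on cells with known true values. This is an illustrative harness on synthetic data, not the paper's pipeline, and it omits the deep learning imputer and downstream SVM evaluation.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)   # correlated column helps KNN/MICE

mask = rng.random(X.shape) < 0.2                 # 20% simulated missingness
X_obs = X.copy()
X_obs[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "zero": SimpleImputer(strategy="constant", fill_value=0.0),
    "knn": KNNImputer(n_neighbors=5),
    "chained": IterativeImputer(random_state=0),  # MICE-style chained equations
}

results = {}
for name, imp in imputers.items():
    X_hat = imp.fit_transform(X_obs)
    results[name] = float(np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2)))
    print(name, round(results[name], 3))
```

Methods that exploit between-column structure (KNN, chained equations) beat the constant fills on the correlated columns, which is the pattern the paper's deep learning method extends.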


2020 ◽  
Vol 79 (Suppl 1) ◽  
pp. 519.1-519
Author(s):  
A. Alsaber ◽  
A. Al-Herz ◽  
J. Pan ◽  
K. Saleh ◽  
A. Al-Awadhi ◽  
...  

Background: Missing data in clinical epidemiological research violate the intention-to-treat principle, reduce statistical power and can induce bias if they are related to patients' response to treatment. In multiple imputation (MI), covariates are included in the imputation equation to predict the values of missing data.

Objectives: To find the best approach to estimate and impute the missing values in Kuwait Registry for Rheumatic Diseases (KRRD) patient data.

Methods: A number of methods were implemented for dealing with missing data. These included multivariate imputation by chained equations (MICE), K-nearest neighbours (KNN), Bayesian principal component analysis (BPCA), EM with bootstrapping (Amelia II), sequential random forest (missForest) and mean imputation. The best imputation method was judged by the minimum scores of root mean square error (RMSE), mean absolute error (MAE) and the Kolmogorov–Smirnov D test statistic (KS) between the imputed data points and the original data points that were subsequently set to missing.

Results: A total of 1,685 rheumatoid arthritis (RA) patients and 10,613 hospital visits were included in the registry. Among them, a number of variables had missing values exceeding 5% of the total: duration of RA (13.0%), smoking history (26.3%), rheumatoid factor (7.93%), anti-citrullinated peptide antibodies (20.5%), anti-nuclear antibodies (20.4%), sicca symptoms (19.2%), family history of a rheumatic disease (28.5%), steroid therapy (5.94%), ESR (5.16%), CRP (22.9%) and SDAI (38.0%). The results showed that, among the methods used, missForest gave the highest accuracy in estimating the missing values. It had the smallest imputation errors for both continuous and categorical variables at each frequency of missingness, and the smallest prediction differences when the models used imputed laboratory values. In both data sets, MICE had the second smallest imputation errors and prediction differences, followed by KNN and mean imputation.

Conclusion: MissForest is a highly accurate method of imputation for missing data in the KRRD and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in clinical predictive models. This approach can be used in registries to improve the accuracy of data, including those for rheumatoid arthritis patients.

Disclosure of Interests: None declared
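The three scoring criteria used to judge the imputation methods (RMSE, MAE and the Kolmogorov–Smirnov D statistic between set-to-missing values and their imputed replacements) can be computed as below. A minimal sketch with invented data; the mean-imputation baseline simply illustrates why the distributional KS check matters alongside the pointwise errors.

```python
import numpy as np
from scipy import stats

def imputation_scores(true_vals, imputed_vals):
    """RMSE, MAE and Kolmogorov-Smirnov D between the values that were
    set to missing and the values imputed in their place."""
    err = imputed_vals - true_vals
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ks_d = float(stats.ks_2samp(true_vals, imputed_vals).statistic)
    return rmse, mae, ks_d

rng = np.random.default_rng(4)
true_vals = rng.normal(50, 10, size=500)       # e.g. a lab value, illustrative
mean_imputed = np.full(500, true_vals.mean())  # mean-imputation baseline

rmse, mae, ks_d = imputation_scores(true_vals, mean_imputed)
print(round(rmse, 2), round(mae, 2), round(ks_d, 2))
```

Mean imputation can have a modest RMSE yet a large KS D, because every gap receives the same value and the imputed distribution collapses to a point; methods like missForest preserve the spread and so score better on all three criteria.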


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5947
Author(s):  
Liang Zhang

Building operation data are important for monitoring, analysis, modeling, and control of building energy systems. However, missing data are among the major data quality issues, making data imputation techniques increasingly important. There are two key research gaps for missing sensor data imputation in buildings: the lack of a customized and automated imputation methodology, and the difficulty of validating data imputation methods. In this paper, a framework is developed to address these two gaps. First, a validation data generation module is developed based on pattern recognition to create a validation dataset for quantifying the performance of data imputation methods. Second, a pool of data imputation methods is tested on the validation dataset to find the optimal single imputation method for each sensor, which is termed an ensemble method. The method can reflect the specific mechanism and randomness of the missing data from each sensor. The effectiveness of the framework is demonstrated on 18 sensors from a real campus building. The overall accuracy of data imputation for those sensors improves by 18.2% on average compared with the best single data imputation method.
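The per-sensor selection step can be sketched as follows: score every candidate method on a validation set with known true readings, then keep the best method for each sensor. The sensor names, signals and the two candidate methods (column mean and linear interpolation) are invented stand-ins for the paper's larger method pool.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical validation series for 3 sensors (names illustrative).
sensors = {
    "zone_temp": np.sin(np.linspace(0, 20, 500)) * 2 + 22,  # smooth signal
    "supply_flow": rng.normal(1.0, 0.3, 500),               # noisy signal
    "power": np.abs(rng.normal(50, 5, 500)),
}

def mean_impute(x, mask):
    out = x.copy()
    out[mask] = x[~mask].mean()
    return out

def interp_impute(x, mask):
    out = x.copy()
    idx = np.arange(x.size)
    out[mask] = np.interp(idx[mask], idx[~mask], x[~mask])
    return out

methods = {"mean": mean_impute, "linear": interp_impute}

# Ensemble selection: score each method per sensor on masked cells, keep the best.
best = {}
for name, series in sensors.items():
    mask = rng.random(series.size) < 0.1
    scores = {m: np.sqrt(np.mean((f(series, mask)[mask] - series[mask]) ** 2))
              for m, f in methods.items()}
    best[name] = min(scores, key=scores.get)

print(best)
```

Smooth signals favour interpolation while noisy, structureless ones may not, which is why a single method chosen globally underperforms the per-sensor ensemble.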


2020 ◽  
Vol 10 (14) ◽  
pp. 5020
Author(s):  
Youngdoo Son ◽  
Wonjoon Kim

Estimating stature is essential in the process of personal identification. Because it is difficult to find human remains intact at crime scenes and disaster sites, methods are needed for estimating stature from different body parts. The dimensions of the upper and lower limbs, for instance, may vary depending on ancestry and sex, and it is of great importance to design an adequate methodology for incorporating these factors in stature estimation. In addition, it is necessary to use machine learning rather than simple linear regression to improve the accuracy of stature estimation. In this study, the accuracy of statures estimated from anthropometric data was compared across three imputation methods. In addition, by comparing accuracy among linear and nonlinear methods, the best method was derived for estimating stature from anthropometric data. For both sexes, multiple imputation was superior when the missing data ratio was low, and mean imputation performed well when the ratio was high. The support vector machine recorded the highest accuracy at all ratios of missing data. The findings of this study indicate appropriate imputation methods for estimating stature with missing anthropometric data. In particular, machine learning algorithms can be used effectively for estimating stature in humans.
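The impute-then-predict pipeline described above can be sketched with scikit-learn. The anthropometric variables, coefficients and missingness rate are synthetic and illustrative; mean imputation stands in for the study's compared methods, and a support vector regressor plays the role of the nonlinear model.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(6)

# Synthetic anthropometric data: stature (cm) driven by three limb lengths.
n = 300
limbs = rng.normal([45.0, 40.0, 25.0], [3.0, 2.5, 1.5], size=(n, 3))
stature = 60 + 1.2 * limbs[:, 0] + 1.1 * limbs[:, 1] + rng.normal(0, 2, n)

# 20% of measurements missing, mean-imputed before model fitting.
X = limbs.copy()
X[rng.random(X.shape) < 0.2] = np.nan
X = SimpleImputer(strategy="mean").fit_transform(X)

X_tr, X_te, y_tr, y_te = train_test_split(X, stature, random_state=0)
model = make_pipeline(StandardScaler(), SVR(C=100.0)).fit(X_tr, y_tr)
mae = float(np.mean(np.abs(model.predict(X_te) - y_te)))
print(round(mae, 2))
```

Swapping the imputer (e.g. for an iterative multiple-imputation variant) and the estimator in this pipeline reproduces the kind of method-by-method comparison the study performs.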


2014 ◽  
Vol 134 ◽  
pp. 23-33 ◽  
Author(s):  
M.P. Gómez-Carracedo ◽  
J.M. Andrade ◽  
P. López-Mahía ◽  
S. Muniategui ◽  
D. Prada

2018 ◽  
Author(s):  
Jean Gaudart ◽  
Pascal Adalian ◽  
George Leonetti

Abstract

Introduction: In many studies, covariates are not fully observed because of the missing-data process. Usually, subjects with missing data are excluded from the analysis, but the number of covariates can then exceed the sample size when the number of removed subjects is high. Subjective selection or imputation procedures are used instead, but these lead to biased or underpowered models. The aim of our study was to develop a method based on selecting the covariate nearest to the centroid of a homogeneous cluster of covariates. We applied this method to a forensic medicine data set to estimate the age of aborted fetuses.

Methods: We measured 46 biometric covariates on 50 aborted fetuses, but the covariates were complete for only 18 fetuses. First, to obtain homogeneous clusters of covariates, we used a hierarchical cluster analysis. Second, for each cluster obtained, we selected the covariate nearest to the centroid of the cluster, maximizing the sum of correlations (the centroid criterion). Third, with the covariates selected this way, the sample size was sufficient to compute a classical linear regression model. We showed the almost sure convergence of the centroid criterion, and simulations were performed to build its empirical distribution. We compared our method to a subjective deletion method, two simple imputation methods and the multiple imputation method.

Results: The hierarchical cluster analysis built 2 clusters of covariates, with 6 covariates remaining unclustered. After selecting the covariate nearest to the centroid of each cluster, we computed a stepwise linear regression model. The model was adequate (R² = 90.02%) and cross-validation showed low prediction errors (2.23 × 10⁻³). The empirical distribution of the criterion had an empirical mean (31.91) and median (32.07) close to the theoretical value (32.03). The comparisons showed that the deletion and simple imputation methods provided models of inferior quality compared with the multiple imputation method and the centroid method.

Conclusion: When the number of continuous covariates exceeds the sample size because of the missing-data process, the usual procedures are biased. Our selection procedure based on the centroid criterion is a valid alternative for composing a set of predictors.
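The two core steps, clustering the covariates hierarchically and then applying the centroid criterion within each cluster, can be sketched as follows. The data are synthetic (two latent factors generating eight covariates, stand-ins for the 46 biometric measurements), and picking the variable with the largest within-cluster correlation sum is one plausible reading of the criterion, not the authors' exact implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(7)

# Synthetic covariates: two latent factors each generate four noisy variables.
n = 60
f1, f2 = rng.normal(size=(2, n))
X = np.column_stack([f1 + 0.3 * rng.normal(size=n) for _ in range(4)] +
                    [f2 + 0.3 * rng.normal(size=n) for _ in range(4)])

# Hierarchical clustering of covariates with 1 - |correlation| as the distance.
corr = np.corrcoef(X, rowvar=False)
dist = 1 - np.abs(corr)
labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                  t=2, criterion="maxclust")

# Centroid criterion: in each cluster, keep the covariate with the largest
# sum of absolute correlations to the members of its cluster.
selected = []
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    sums = np.abs(corr[np.ix_(members, members)]).sum(axis=1)
    selected.append(int(members[np.argmax(sums)]))

print(sorted(selected))
```

One representative per cluster replaces the whole cluster in the regression, which is how the method shrinks 46 covariates down to a set small enough for the 18 complete cases.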


2014 ◽  
Vol 30 (1) ◽  
pp. 147-161 ◽  
Author(s):  
Taylor Lewis ◽  
Elizabeth Goldberg ◽  
Nathaniel Schenker ◽  
Vladislav Beresovsky ◽  
Susan Schappert ◽  
...  

Abstract The National Ambulatory Medical Care Survey collects data on office-based physician care from a nationally representative, multistage sampling scheme in which the ultimate unit of analysis is a patient-doctor encounter. Patient race, a commonly analyzed demographic, has been subject to a steadily increasing item nonresponse rate. In 1999, race was missing for 17 percent of cases; by 2008, that figure had risen to 33 percent. Over this entire period, single imputation has been the compensation method employed. Recent research at the National Center for Health Statistics evaluated multiply imputing race to better represent the missing-data uncertainty. Given item nonresponse rates of 30 percent or greater, we were surprised to find that, for many estimates, the ratio of the multiple-imputation estimated standard error to the single-imputation estimated standard error was close to 1. A likely explanation is that the design effects attributable to the complex sample design largely outweigh any increase in variance attributable to missing-data uncertainty.
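The standard-error comparison above rests on how multiple-imputation variances are pooled. A minimal sketch of Rubin's rules with invented numbers (five hypothetical imputed-dataset estimates of a proportion): the total variance adds a between-imputation term to the within-imputation average, which is what can inflate the MI standard error relative to single imputation.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Rubin's rules: pool m point estimates and their within-imputation
    variances into one pooled estimate and its total variance."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    qbar = estimates.mean()              # pooled point estimate
    ubar = variances.mean()              # within-imputation variance
    b = estimates.var(ddof=1)            # between-imputation variance
    total = ubar + (1 + 1 / m) * b       # total variance
    return float(qbar), float(total)

# Five hypothetical estimates of a proportion and their sampling variances.
ests = [0.31, 0.33, 0.32, 0.30, 0.34]
vars_ = [0.0004] * 5
qbar, total = rubin_combine(ests, vars_)

# Ratio of the multiple- to single-imputation standard error.
ratio = float(np.sqrt(total) / np.sqrt(vars_[0]))
print(round(qbar, 3), round(ratio, 2))   # → 0.32 1.32
```

When the complex-design variance (here the 0.0004 within term) dominates the between-imputation spread, this ratio approaches 1, which is the pattern the authors observed.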

