Single Imputation Methods and Confidence Intervals for the Gini Index

Mathematics ◽  
2021 ◽  
Vol 9 (24) ◽  
pp. 3252
Author(s):  
Encarnación Álvarez-Verdejo ◽  
Pablo J. Moya-Fernández ◽  
Juan F. Muñoz-Rosas

Missing data are a common feature of almost any study, and a single imputation method is often applied to deal with the problem. The first contribution of this paper is to analyse the empirical performance of some traditional single imputation methods when they are applied to the estimation of the Gini index, a popular measure of inequality used in many studies. Various methods for constructing confidence intervals for the Gini index are also empirically evaluated. We consider several empirical measures to analyse the performance of estimators and confidence intervals, allowing us to quantify the magnitude of the non-response bias. We find extremely large biases under certain non-response mechanisms, and the problem becomes noticeably worse as the proportion of missing data increases. When the correlation between the target and auxiliary variables is large, the regression imputation method can notably mitigate this bias, yielding appropriate mean square errors. We also find that confidence intervals have poor coverage rates when the probability of data being missing is not uniform, and that the regression imputation method substantially improves the handling of this problem as the correlation coefficient increases.
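As a rough illustration of the bias described in this abstract, the following sketch simulates an auxiliary variable x, a correlated income-like target y, and non-response that depends on x, then compares a plug-in Gini estimate on the complete data, on respondents only, and after simple regression imputation. It is a minimal numerical sketch, not the authors' simulation design; the variable names, sample size and missingness rule are invented.

```python
import numpy as np

def gini(y):
    """Plug-in Gini index of a non-negative sample (illustrative estimator)."""
    y = np.sort(np.asarray(y, dtype=float))
    n = y.size
    ranks = np.arange(1, n + 1)  # 1-based ranks
    return 2.0 * np.sum(ranks * y) / (n * y.sum()) - (n + 1.0) / n

rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.0, sigma=0.5, size=1000)      # auxiliary variable, fully observed
y = np.clip(1.5 * x + rng.normal(scale=2.0, size=1000), 0, None)  # target (income-like)

# Non-response that depends on x: larger units are more likely to be missing
missing = rng.random(1000) < 0.3 * (x > np.median(x))
resp = ~missing
y_obs = np.where(missing, np.nan, y)

# Regression imputation: fit y ~ x on respondents, predict for non-respondents
b1, b0 = np.polyfit(x[resp], y_obs[resp], 1)
y_imp = np.where(missing, b0 + b1 * x, y_obs)

print("Gini (complete data):     ", round(gini(y), 4))
print("Gini (respondents only):  ", round(gini(y_obs[resp]), 4))
print("Gini (regression imputed):", round(gini(y_imp), 4))
```

In this toy setup the correlation between x and y is strong, so the regression-imputed Gini sits much closer to the complete-data value than the respondents-only estimate, which is the pattern the abstract reports.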

2021 ◽  
pp. 188-196 ◽  
Author(s):  
Lauren C. Benson ◽  
Carlyn Stilling ◽  
Oluwatoyosi B.A. Owoeye ◽  
Carolyn A. Emery

Missing data can influence calculations of accumulated athlete workload. The objectives were to identify the best single imputation methods and to examine workload trends using multiple imputation. External (jumps per hour) and internal (rating of perceived exertion; RPE) workload were recorded for 93 high school basketball players (45 females, 48 males) throughout a season. Portions of the recorded data were simulated as missing and imputed using ten imputation methods based on the context of the individual, team and session. Both single imputation and machine learning methods were used to impute the simulated missing data. The difference between the imputed data and the actual workload values was computed as the root mean squared error (RMSE). A generalized estimating equation determined the effect of imputation method on RMSE. Multiple imputation of the original dataset, with all known and actual missing workload data, was used to examine trends in longitudinal workload data. Following multiple imputation, a Pearson correlation evaluated the longitudinal association between jump count and session RPE (sRPE) over the season. A single imputation method based on the specific context of the session for which data are missing (team mean) was outperformed only by methods that combine information about the session and the individual (machine learning models). There was a significant and strong association between jump count and sRPE in both the original data and the datasets imputed with multiple imputation. The amount and nature of the missing data should be considered when choosing a method for single imputation of workload data in youth basketball. Multiple imputation using several predictor variables in a regression model can be used for analyses where workload is accumulated across an entire season.
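The "team mean" single imputation mentioned above can be written in a few lines of pandas. The sketch below uses an invented long-format log with hypothetical column names (session, player, jumps); it only illustrates the idea of filling a player's missing session value with the mean of teammates' values for that same session, not the study's full pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format workload log: one row per player per session
df = pd.DataFrame({
    "session": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "player":  ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
    "jumps":   [42, 55, np.nan, 38, np.nan, 47, np.nan, 60, 51],
})

# Team-mean single imputation: replace a missing value with the mean of the
# other players' values recorded in the same session
df["jumps_imputed"] = df.groupby("session")["jumps"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```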


2021 ◽  
Author(s):  
Rahibu A. Abassi ◽  
Amina S. Msengwa ◽  
Rocky R. J. Akarro

Abstract
Background: Clinical data are at risk of having missing or incomplete values for several reasons, including patients' failure to attend clinical measurements, wrong interpretations of measurements, and defects in measurement recorders. Missing data can significantly affect the analysis, and results may be doubtful because of the bias introduced by omitting incomplete observations during statistical analysis, especially if the dataset is small. The objective of this study is to compare several imputation methods in terms of their efficiency in filling in the missing data, so as to increase prediction and classification accuracy on a breast cancer dataset.
Methods: Five imputation methods, namely series mean, k-nearest neighbour, hot deck, predictive mean matching, and multiple imputation, were applied to replace the missing values in a real breast cancer dataset. The efficiency of the imputation methods was compared using the root mean square error and the mean absolute error to obtain a suitable complete dataset. Binary logistic regression and linear discriminant classifiers were then applied to the imputed dataset to compare their efficacy in classification and discrimination.
Results: The evaluation revealed that predictive mean matching outperformed the other imputation methods. In addition, binary logistic regression and linear discriminant analysis yielded similar overall classification rates, sensitivity and specificity.
Conclusion: Predictive mean matching imputation showed higher accuracy in estimating and replacing missing or incomplete data values in the real breast cancer dataset under study, and is an effective method for handling missing data in this scenario. We recommend replacing missing data using predictive mean matching, since it is a plausible approach to multiple imputation for numerical variables and improves estimation and prediction accuracy over complete-case analysis, especially when the percentage of missing data is not very small.
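The comparison logic described in the Methods can be illustrated by masking known values and scoring imputations with RMSE and MAE. The sketch below uses scikit-learn's bundled breast cancer dataset and two stand-in imputers (column mean, akin to series mean, and k-nearest neighbours); predictive mean matching, hot deck and multiple imputation are not available in scikit-learn and are therefore not shown.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(1)
X = load_breast_cancer().data.copy()

# Mask 10% of entries completely at random so the truth is known for scoring
mask = rng.random(X.shape) < 0.10
X_missing = X.copy()
X_missing[mask] = np.nan

def score(imputer, name):
    """Impute the masked matrix and report RMSE and MAE on the masked entries."""
    X_hat = imputer.fit_transform(X_missing)
    err = X_hat[mask] - X[mask]
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    print(f"{name:12s} RMSE={rmse:8.3f}  MAE={mae:8.3f}")

score(SimpleImputer(strategy="mean"), "series mean")
score(KNNImputer(n_neighbors=5), "kNN")
```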


2010 ◽  
Vol 6 (3) ◽  
pp. 1-10 ◽  
Author(s):  
Shichao Zhang

In this paper, the author designs an efficient method for iteratively imputing missing target values with semi-parametric kernel regression, the semi-parametric iterative imputation algorithm (SIIA). Even when there is little prior knowledge about the datasets, the proposed iterative imputation method, which imputes each missing value several times until the algorithm converges in each model, utilizes a substantial amount of useful information. This information includes observations that themselves contain missing values, and the method captures the real distribution of the dataset more easily than purely parametric or nonparametric imputation techniques. Experimental results show that the proposed imputation method outperforms existing methods in terms of imputation accuracy, particularly when the missing ratio is high.
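A much-simplified sketch of iterative kernel-regression imputation is given below: each missing target value is repeatedly re-estimated from a Nadaraya-Watson fit that also uses the current imputations, until the imputed values stop changing. This is an illustrative loop only; it omits the semi-parametric component and the per-model convergence scheme of the SIIA, and the data, bandwidth and tolerance are invented.

```python
import numpy as np

def nw_kernel_regression(x_train, y_train, x_query, bandwidth=1.0):
    """Nadaraya-Watson estimator with a Gaussian kernel (nonparametric part)."""
    d = (x_query[:, None] - x_train[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    return (w @ y_train) / w.sum(axis=1)

def iterative_kernel_impute(x, y, max_iter=50, tol=1e-6):
    """Repeatedly re-impute missing y values from kernel-regression fits
    until the imputations stop changing (illustrative, not the SIIA itself)."""
    miss = np.isnan(y)
    y_imp = y.copy()
    y_imp[miss] = np.nanmean(y)               # crude starting values
    for _ in range(max_iter):
        fitted = nw_kernel_regression(x, y_imp, x[miss])
        if np.max(np.abs(fitted - y_imp[miss])) < tol:
            break
        y_imp[miss] = fitted
    return y_imp

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(scale=0.2, size=300)
y[rng.random(300) < 0.3] = np.nan             # 30% missing at random
print(iterative_kernel_impute(x, y)[:10])
```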


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5947
Author(s):  
Liang Zhang

Building operation data are important for monitoring, analysis, modeling, and control of building energy systems. However, missing data are one of the major data quality issues, making data imputation techniques increasingly important. There are two key research gaps for missing sensor data imputation in buildings: the lack of a customized and automated imputation methodology, and the difficulty of validating data imputation methods. In this paper, a framework is developed to address these two gaps. First, a validation data generation module is developed based on pattern recognition to create a validation dataset for quantifying the performance of data imputation methods. Second, a pool of data imputation methods is tested on the validation dataset to find an optimal single imputation method for each sensor; the resulting per-sensor combination is termed an ensemble method. The method can reflect the specific mechanism and randomness of the missing data from each sensor. The effectiveness of the framework is demonstrated on 18 sensors from a real campus building. The overall accuracy of data imputation for those sensors improves by 18.2% on average compared with the best single data imputation method.
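The per-sensor selection step can be sketched as follows: for each sensor, hold out a validation subset of observed values, impute it with each method in the pool, and keep the method with the lowest error for that sensor. The sensor names, the simple random hold-out mask and the three candidate methods below are invented for illustration; the paper's validation module is based on pattern recognition rather than a random mask.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Hypothetical hourly data for three sensors in one building
idx = pd.date_range("2020-01-01", periods=24 * 14, freq="h")
n = len(idx)
data = pd.DataFrame({
    "supply_air_temp": 20 + 2 * np.sin(np.arange(n) * 2 * np.pi / 24) + rng.normal(0, 0.3, n),
    "chw_flow": np.clip(rng.normal(5, 1, n), 0, None),
    "zone_co2": 400 + 200 * (np.arange(n) % 24 > 8) + rng.normal(0, 20, n),
}, index=idx)

# Pool of candidate single imputation methods
methods = {
    "ffill": lambda s: s.ffill().bfill(),
    "interpolate": lambda s: s.interpolate(limit_direction="both"),
    "hourly_mean": lambda s: s.fillna(s.groupby(s.index.hour).transform("mean")),
}

best = {}
for col in data.columns:
    truth = data[col]
    holdout = rng.random(n) < 0.1                # validation mask with known truth
    s = truth.mask(holdout)
    rmse = {name: np.sqrt(np.mean((f(s)[holdout] - truth[holdout]) ** 2))
            for name, f in methods.items()}
    best[col] = min(rmse, key=rmse.get)          # best method for this sensor
print(best)
```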


2018 ◽  
Author(s):  
Jean Gaudart ◽  
Pascal Adalian ◽  
George Leonetti

Abstract
Introduction: In many studies, covariates are not fully observed because of the missing data process. Usually, subjects with missing data are excluded from the analysis, but when the number of removed subjects is high the number of covariates can exceed the sample size. Subjective selection or imputation procedures are then used, but this leads to biased or underpowered models. The aim of our study was to develop a method based on selecting, within each homogeneous cluster of covariates, the covariate nearest to the cluster centroid. We applied this method to a forensic medicine data set to estimate the age of aborted fetuses.
Methods: We measured 46 biometric covariates on 50 aborted fetuses, but the covariates were complete for only 18 fetuses. First, to obtain homogeneous clusters of covariates, we used a hierarchical cluster analysis. Second, for each cluster obtained, we selected the covariate nearest to the centroid of the cluster, maximizing the sum of correlations (the centroid criterion). Third, with the covariates selected in this way, the sample size was sufficient to compute a classical linear regression model. We have shown the almost sure convergence of the centroid criterion, and simulations were performed to build its empirical distribution. We compared our method to a subjective deletion method, two simple imputation methods and the multiple imputation method.
Results: The hierarchical cluster analysis built 2 clusters of covariates, with 6 remaining covariates. After selecting the covariate nearest to the centroid of each cluster, we computed a stepwise linear regression model. The model was adequate (R² = 90.02%) and cross-validation showed low prediction errors (2.23 × 10⁻³). The empirical distribution of the criterion provided an empirical mean (31.91) and median (32.07) close to the theoretical value (32.03). The comparisons showed that the deletion and simple imputation methods provided models of lower quality than the multiple imputation method and the centroid method.
Conclusion: When the number of continuous covariates exceeds the sample size because of the missing data process, the usual procedures are biased. Our selection procedure based on the centroid criterion is a valid alternative for composing a set of predictors.
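A rough sketch of the centroid-style selection on simulated data is shown below: covariates are clustered hierarchically on a correlation-based dissimilarity, and within each cluster the covariate with the largest summed absolute correlation to the other members is kept. The simulated data, the number of clusters and the use of absolute correlations are assumptions for illustration; the paper's exact criterion and its convergence results are not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(4)
# Hypothetical biometric covariates (columns); few complete cases (rows)
X = rng.normal(size=(18, 10))
X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=18)   # correlated block
X[:, 2] = X[:, 0] + rng.normal(scale=0.1, size=18)

corr = np.corrcoef(X, rowvar=False)
dist = 1 - np.abs(corr)                              # dissimilarity between covariates

# Hierarchical clustering of covariates (condensed upper-triangle distances)
Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

# Centroid-style criterion: in each cluster keep the covariate whose summed
# absolute correlation with the other members is largest
selected = []
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    sums = np.abs(corr[np.ix_(members, members)]).sum(axis=1)
    selected.append(members[np.argmax(sums)])
print("selected covariates:", selected)
```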


2019 ◽  
Vol 6 (339) ◽  
pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not tied to any particular scientific domain; it arises in economics, sociology, education, the behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete the incomplete cases. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In this paper, selected multiple imputation methods are considered. Special attention is paid to using principal component analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA-based imputations compared with two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Missing values were then imputed with MICE, missForest and the PCA-based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as a multiple imputation method providing the lowest rates of imputation errors for all types of missingness, whereas PCA-based imputation does not perform well in terms of accuracy.
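The NRMSE scoring used in the comparison can be illustrated with scikit-learn stand-ins, since MICE, missForest and MIPCA are R packages (mice, missForest, missMDA). The sketch below masks 30% of a small UCI-style dataset completely at random and scores mean imputation against a chained-equations-style imputer; note that NRMSE normalisation conventions vary (the standard deviation of the masked truth is used here), so the numbers are only indicative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

def nrmse(X_true, X_imputed, mask):
    """Normalised RMSE over the entries that were set to missing."""
    err = X_imputed[mask] - X_true[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(X_true[mask])

rng = np.random.default_rng(5)
X = load_wine().data
mask = rng.random(X.shape) < 0.30                # 30% missing completely at random
X_miss = np.where(mask, np.nan, X)

for name, imp in [("mean", SimpleImputer(strategy="mean")),
                  ("chained equations", IterativeImputer(random_state=0))]:
    print(name, round(nrmse(X, imp.fit_transform(X_miss), mask), 3))
```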


2016 ◽  
Vol 20 (6) ◽  
pp. 965-970 ◽  
Author(s):  
Karen E Lamb ◽  
Dana Lee Olstad ◽  
Cattram Nguyen ◽  
Catherine Milte ◽  
Sarah A McNaughton

Abstract
Objective: FFQs are a popular method of capturing dietary information in epidemiological studies and may be used to derive dietary exposures such as nutrient intake or overall dietary patterns and diet quality. As FFQs can involve large numbers of questions, participants may fail to respond to all questions, leaving researchers to decide how to deal with missing data when deriving intake measures. The aim of the present commentary is to discuss the current practice for dealing with item non-response in FFQs and to propose a research agenda for reporting and handling missing data in FFQs.
Results: Single imputation techniques, such as zero imputation (assuming no consumption of the item) or mean imputation, are commonly used to deal with item non-response in FFQs. However, single imputation methods make strong assumptions about the missing data mechanism and do not reflect the uncertainty created by the missing data. This can lead to incorrect inference about associations between diet and health outcomes. Although the use of multiple imputation methods in epidemiology has increased, these have seldom been used in the field of nutritional epidemiology to address missing data in FFQs. We discuss methods for dealing with item non-response in FFQs, highlighting the assumptions made under each approach.
Conclusions: Researchers analysing FFQs should ensure that missing data are handled appropriately and clearly report how missing data were treated in analyses. Simulation studies are required to enable systematic evaluation of the utility of various methods for handling item non-response in FFQs under different assumptions about the missing data mechanism.
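To make the contrast concrete, the sketch below applies the two single imputation approaches the commentary cautions against, zero imputation and item-mean imputation, to a tiny hypothetical FFQ table (invented food items and servings-per-week values). The point is only that the derived intake totals depend on the assumption made about item non-response.

```python
import numpy as np
import pandas as pd

# Hypothetical FFQ responses (servings/week); NaN marks an unanswered item
ffq = pd.DataFrame({
    "wholegrain_bread": [7, np.nan, 14, 3],
    "leafy_vegetables": [4, 5, np.nan, np.nan],
    "processed_meat":   [np.nan, 2, 1, 0],
})

# Zero imputation: assume a skipped item means the food was not consumed
zero_imputed = ffq.fillna(0)

# Item-mean imputation: replace a skipped item with the respondents' mean
mean_imputed = ffq.fillna(ffq.mean())

print(zero_imputed.sum(axis=1))   # total intake under the "no consumption" assumption
print(mean_imputed.sum(axis=1))   # total intake under the item-mean assumption
```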


2015 ◽  
Vol 754-755 ◽  
pp. 923-932 ◽  
Author(s):  
Norazian Mohamed Noor ◽  
A.S. Yahaya ◽  
N.A. Ramli ◽  
Mohd Mustafa Al Bakri Abdullah

Hourly PM10 concentrations measured at eight monitoring stations within peninsular Malaysia in 2006 were used to simulate missing data. The gap lengths of the simulated missing values were limited to 12 hours, since the actual pattern of missingness is typically short. Two percentages of simulated missing gaps, 5% and 15%, were generated. A number of single imputation methods (linear interpolation (LI), nearest neighbour interpolation (NN), mean above below (MAB), daily mean (DM), mean 12-hour (12M), mean 6-hour (6M), row mean (RM) and previous year (PY)) were applied to fill in the simulated missing data. In addition, multiple imputation (MI) was conducted for comparison with the single imputation methods. Performance was evaluated using four statistical criteria, namely mean absolute error, root mean squared error, prediction accuracy and index of agreement. The results show that 6M performs comparably to LI, indicating that a smaller averaging time gives better predictions. The other single imputation methods predict the missing data well, except for PY. RM and MI perform moderately, with performance increasing at the higher fraction of missing gaps, whereas LR is the worst method for both simulated missing data percentages.
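A few of the listed single imputation methods can be sketched on an hourly series with pandas: linear interpolation (LI), nearest-neighbour interpolation (NN) and a mean-above-below style fill (MAB). The synthetic PM10 series, the gap simulation and the exact definition of MAB below are assumptions made for illustration; they do not reproduce the stations, gap pattern or evaluation criteria of the study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
idx = pd.date_range("2006-01-01", periods=24 * 30, freq="h")
pm10 = pd.Series(50 + 15 * np.sin(np.arange(len(idx)) * 2 * np.pi / 24)
                 + rng.normal(0, 5, len(idx)), index=idx)

# Simulate short missing gaps (up to 12 hours) at roughly 5% of the hours
gap_starts = rng.choice(len(idx) - 12, size=int(0.05 * len(idx) / 6), replace=False)
simulated = pm10.copy()
for g in gap_starts:
    simulated.iloc[g:g + rng.integers(1, 13)] = np.nan

filled = pd.DataFrame({
    "LI": simulated.interpolate(method="linear"),          # linear interpolation
    "NN": simulated.interpolate(method="nearest"),         # nearest-neighbour interpolation
    "MAB": (simulated.ffill() + simulated.bfill()) / 2,    # mean of values above/below a gap
})

# RMSE on the simulated gaps, per method
rmse = ((filled.sub(pm10, axis=0)[simulated.isna()]) ** 2).mean() ** 0.5
print(rmse)
```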

