Application of Sequential Regression Multivariate Imputation Method on Multivariate Normal Missing Data

Author(s):  
Nurzaman ◽  
Titin Siswantining ◽  
Saskya Mary Soemartojo ◽  
Devvi Sarwinda


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data often pose a challenge for statistical modeling. This paper applies advanced techniques for handling missing values in an air quality data set using a multiple imputation (MI) approach. The MCAR, MAR, and NMAR missing data mechanisms are simulated on the data set at five missingness levels: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is missForest, an iterative imputation method based on random forests. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism yielded the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy: missForest had the lowest imputation error (RMSE and MAE) among the compared imputation methods and can therefore be considered appropriate for analyzing air quality data.
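The missForest procedure described in this abstract can be approximated with off-the-shelf tools. A minimal sketch using scikit-learn's IterativeImputer with a random-forest estimator; the column names and synthetic lognormal data are illustrative stand-ins, not the Kuwait dataset:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic, skewed "pollutant" data; the real study used five Kuwaiti stations
df = pd.DataFrame(rng.lognormal(size=(200, 5)),
                  columns=["NO2", "CO", "PM10", "SO2", "O3"])
log_df = np.log(df)  # log-transform to reduce skewness, as in the paper

# Simulate 20% MCAR missingness (one of the paper's five levels)
log_df = log_df.mask(rng.random(log_df.shape) < 0.20)

# missForest-style imputation: iteratively regress each column on the others
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=30, random_state=0),
    max_iter=5, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(log_df), columns=log_df.columns)
print(imputed.isna().sum().sum())  # 0: every gap has been filled
```

After imputation, the values can be exponentiated back to the original scale; RMSE/MAE against known held-out values would quantify accuracy as in the study.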


Author(s):  
Caio Ribeiro ◽  
Alex A. Freitas

Longitudinal datasets from human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimates. However, there are many methods for estimating missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicability and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated from datasets prepared with each imputation method and with a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on the results of both sets of experiments, we conclude that the proposed data-driven missing value imputation approach generally produced more accurate estimates of missing data and better-performing classifiers on longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data produced very accurate estimates. This reinforces the idea that exploiting the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that it can be achieved through the proposed data-driven approach.


Symmetry ◽  
2020 ◽  
Vol 12 (11) ◽  
pp. 1792
Author(s):  
Shu-Fen Huang ◽  
Ching-Hsue Cheng

Medical data usually have missing values; hence, imputation methods have become an important issue. In previous studies, many imputation methods, such as expectation-maximization and regression-based imputation, assume that the variables follow a multivariate normal distribution. These assumptions may bias the results, which sometimes creates a bottleneck. In addition, directly deleting instances with missing values may cause several problems, such as losing important data, producing invalid research samples, and introducing bias into the research. Therefore, this study proposed a safe-region imputation method for handling medical data with missing values; we also built a medical prediction model and compared deleting instances with missing values against the imputation methods in terms of the generated rules, accuracy, and AUC. First, this study used kNN imputation, multiple imputation, and the proposed imputation to fill in the missing data, and then applied four attribute selection methods to select the important attributes. We then used decision tree (C4.5), random forest, REP tree, and LMT classifiers to generate the rules, accuracy, and AUC for comparison. Because four datasets had imbalanced (asymmetric) classes, the AUC was an important criterion. In the experiments, we collected four open medical datasets from UCI and one international stroke trial dataset. The results show that the proposed safe-region imputation outperforms the listed imputation methods, and that imputing offers better results than directly deleting instances with missing values in terms of the number of rules, accuracy, and AUC. These results will provide a reference for medical stakeholders.
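The study's classifier-plus-AUC evaluation protocol can be sketched with one of the baseline imputers it lists (kNN imputation) on a synthetic imbalanced task; the safe-region method itself is not reproduced here, and the dataset and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced task standing in for the medical datasets
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8],
                           random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.15] = np.nan  # 15% missing cells

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# kNN imputation fitted on the training split only, then a classifier
imp = KNNImputer(n_neighbors=5)
clf = RandomForestClassifier(random_state=0).fit(imp.fit_transform(X_tr), y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(imp.transform(X_te))[:, 1])
print(f"AUC with kNN imputation: {auc:.3f}")
```

Swapping in other imputers (or deleting incomplete rows) and comparing the resulting AUCs mirrors the study's comparison on imbalanced classes.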


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5947
Author(s):  
Liang Zhang

Building operation data are important for the monitoring, analysis, modeling, and control of building energy systems. However, missing data is one of the major data quality issues, making data imputation techniques increasingly important. There are two key research gaps for imputing missing sensor data in buildings: the lack of a customized and automated imputation methodology, and the difficulty of validating data imputation methods. In this paper, a framework is developed to address these two gaps. First, a validation data generation module, based on pattern recognition, creates a validation dataset for quantifying the performance of data imputation methods. Second, a pool of data imputation methods is tested against the validation dataset to find the optimal single imputation method for each sensor, which collectively is termed an ensemble method. This approach reflects the specific mechanism and randomness of the missing data from each sensor. The effectiveness of the framework is demonstrated on 18 sensors from a real campus building. The overall accuracy of data imputation for those sensors improves by 18.2% on average compared with the best single data imputation method.
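The validation idea in this framework can be sketched by punching artificial gaps into a complete stretch of a sensor's series and scoring candidate fillers against the held-out truth; the gap pattern and the three candidate fillers are illustrative assumptions, not the paper's actual pool:

```python
import numpy as np
import pandas as pd

def score_fillers(series, gap_len=4, n_gaps=10, seed=0):
    """RMSE of each candidate filler on artificially punched gaps."""
    rng = np.random.default_rng(seed)
    truth = series.copy()
    holed = series.copy()
    for s in rng.choice(len(series) - gap_len, size=n_gaps, replace=False):
        holed.iloc[s:s + gap_len] = np.nan  # artificial validation gap
    candidates = {
        "ffill": holed.ffill().bfill(),
        "linear": holed.interpolate(limit_direction="both"),
        "global_mean": holed.fillna(holed.mean()),
    }
    mask = holed.isna()
    return {name: float(np.sqrt(((filled[mask] - truth[mask]) ** 2).mean()))
            for name, filled in candidates.items()}

# Hypothetical hourly sensor with a clean daily cycle
t = pd.date_range("2020-01-01", periods=200, freq="h")
sensor = pd.Series(np.sin(np.arange(200) * 2 * np.pi / 24), index=t)
scores = score_fillers(sensor)
print(min(scores, key=scores.get))  # linear interpolation wins on this smooth signal
```

Running this per sensor and keeping each sensor's winner is the per-sensor "ensemble" selection the paper describes.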


2010 ◽  
Vol 2010 ◽  
pp. 1-14 ◽  
Author(s):  
Shang Zhaowei ◽  
Zhang Lingfeng ◽  
Ma Shangjun ◽  
Fang Bin ◽  
Zhang Taiping

This paper discusses the prediction of time series with missing data. A novel forecasting model is proposed based on max-margin classification of data with absent features. Modeling an incomplete time series is treated as a problem of classifying data with absent features, and the optimal separating hyperplane of the classifier is used to predict future values. In contrast to the traditional process of predicting from incomplete time series, our method solves the problem directly rather than filling in the missing data in advance. In addition, we introduce an imputation method to estimate the missing data in the historical series. Experimental results validate the effectiveness of our model in both prediction and imputation.


Author(s):  
Xu Wang ◽  
Yuechun Ge ◽  
Lei Niu ◽  
Yi He ◽  
Tony Z. Qiu

Real-time traffic control systems are widely implemented on roadways around the world to improve freeway mobility. However, these systems, which rely on data from roadside and on-road sensors and other electronic equipment, continue to suffer from missing and erroneous data. While many data imputation methods are documented in the literature, traffic control systems still lack an imputation method that is applicable in practice, accurate in imputation, and simple in computation. In response, this paper puts forth a linear imputation model that considers both the temporal traffic trend and spatial detector correlations. To adapt the model to dynamic traffic variations, the imputation method is equipped with an online calibration module. The proposed method was evaluated with field data from two stations on Whitemud Drive, a busy urban freeway in Edmonton, Alberta, Canada. The proposed model benefited from its time-of-day temporal trend and outperformed a previous model that considers only spatial correlations. Moreover, the online calibration module was effective in improving imputation accuracy. Finally, a sensitivity analysis showed that imputation with online calibration is more sensitive to the missing data ratio than imputation with offline calibration, and that online calibration is more suitable for online imputation in traffic control implementations.
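A linear model combining a temporal trend with a spatial neighbour, calibrated online, can be sketched as a two-weight least-mean-squares update. The structure (time-of-day mean plus neighbour detector reading) follows the abstract; the LMS learning rule, learning rate, and synthetic data are illustrative assumptions, not the paper's calibration scheme:

```python
import numpy as np

class OnlineLinearImputer:
    """Missing value estimated as w1 * time-of-day mean + w2 * neighbour reading."""

    def __init__(self, lr=0.1):
        self.w = np.array([0.5, 0.5])  # temporal and spatial weights
        self.lr = lr

    def impute(self, tod_mean, neighbour):
        return float(self.w @ np.array([tod_mean, neighbour]))

    def calibrate(self, tod_mean, neighbour, actual):
        # One LMS gradient step whenever the true reading is observed
        x = np.array([tod_mean, neighbour])
        self.w -= self.lr * (self.w @ x - actual) * x

rng = np.random.default_rng(0)
imp = OnlineLinearImputer()
for _ in range(2000):  # hypothetical "true" relation: 0.3*trend + 0.7*neighbour
    t, n = rng.uniform(0, 1, size=2)
    imp.calibrate(t, n, 0.3 * t + 0.7 * n)
print(np.round(imp.w, 2))  # weights recovered: approximately [0.3, 0.7]
```

The continual weight updates are what let such a model track dynamic traffic variations, which is the role of the paper's online calibration module.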


2018 ◽  
Author(s):  
Jean Gaudart ◽  
Pascal Adalian ◽  
George Leonetti

Introduction: In many studies, covariates are not fully observed because of the missing data process. Usually, subjects with missing data are excluded from the analysis, but the number of covariates can then exceed the sample size when the number of removed subjects is high. Subjective selection or imputation procedures are used instead, but these lead to biased or underpowered models. The aim of our study was to develop a method based on selecting the covariate nearest to the centroid of a homogeneous cluster of covariates. We applied this method to a forensic medicine data set to estimate the age of aborted fetuses.

Methods: We measured 46 biometric covariates on 50 aborted fetuses, but the covariates were complete for only 18 fetuses. First, we used hierarchical cluster analysis to obtain homogeneous clusters of covariates. Second, for each cluster we selected the covariate nearest to the cluster centroid, that is, the one maximizing the sum of correlations (the centroid criterion). Third, with the covariates selected this way, the sample size was sufficient to fit a classical linear regression model. We proved the almost sure convergence of the centroid criterion, and simulations were performed to build its empirical distribution. We compared our method to a subjective deletion method, two simple imputation methods, and the multiple imputation method.

Results: The hierarchical cluster analysis produced 2 clusters of covariates and 6 remaining covariates. After selecting the covariate nearest to the centroid of each cluster, we fitted a stepwise linear regression model. The model was adequate (R² = 90.02%) and cross-validation showed low prediction errors (2.23 × 10⁻³). The empirical distribution of the criterion gave an empirical mean (31.91) and median (32.07) close to the theoretical value (32.03). The comparisons showed that the deletion and simple imputation methods produced models of lower quality than the multiple imputation method and the centroid method.

Conclusion: When the number of continuous covariates exceeds the sample size because of the missing data process, the usual procedures are biased. Our selection procedure based on the centroid criterion is a valid alternative for composing a set of predictors.
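The centroid selection step can be sketched with SciPy: cluster the covariates hierarchically on a correlation-based distance, then keep from each cluster the covariate with the largest sum of correlations to its cluster mates. The two-factor synthetic data and the cluster count are illustrative, not the 46-covariate forensic dataset:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Two latent factors, three noisy covariates each (illustrative data)
rng = np.random.default_rng(0)
base = rng.normal(size=(60, 2))
X = pd.DataFrame(np.hstack([base[:, [0]] + 0.1 * rng.normal(size=(60, 3)),
                            base[:, [1]] + 0.1 * rng.normal(size=(60, 3))]),
                 columns=list("abcdef"))

# Hierarchical clustering of covariates on a correlation-based distance
corr = X.corr().abs()
dist = squareform(1 - corr.values, checks=False)
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")

# From each cluster, keep the covariate maximizing the sum of correlations
selected = []
for k in np.unique(labels):
    members = corr.columns[labels == k]
    selected.append(corr.loc[members, members].sum().idxmax())
print(selected)  # one representative covariate per cluster
```

The selected representatives then serve as predictors in an ordinary regression, which is how the study restores a workable covariates-to-sample-size ratio.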

