Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses several advanced techniques for handling missing values in an air quality data set using a multiple imputation (MI) approach. The missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) mechanisms are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used is missForest, an iterative imputation method based on the random forest approach. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. All pollutant data were log-transformed to normalize their distributions and reduce skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism yielded the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy. MissForest had the lowest imputation error (RMSE and MAE) of the imputation methods compared and can therefore be considered appropriate for analyzing air quality data.
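The evaluation described above (hide known values at a fixed rate, impute them, then score the imputations with RMSE and MAE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's procedure: the data are synthetic, only the MCAR case is simulated, and a simple mean fill stands in for missForest. The function names `mask_mcar`, `impute_mean`, and `rmse_mae` are hypothetical.

```python
import math
import random


def mask_mcar(values, rate, seed=0):
    """Hide a fixed share of entries, each index equally likely (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    hidden = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def impute_mean(masked):
    """Stand-in imputer (NOT missForest): fill gaps with the observed mean."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def rmse_mae(truth, imputed, hidden):
    """Score imputations only at the positions that were hidden."""
    errors = [imputed[i] - truth[i] for i in hidden]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


# Synthetic, log-transformed "daily concentrations" (assumed values).
rng = random.Random(42)
truth = [math.log(rng.uniform(10.0, 80.0)) for _ in range(200)]

for rate in (0.05, 0.10, 0.20, 0.30, 0.40):
    masked, hidden = mask_mcar(truth, rate)
    rmse, mae = rmse_mae(truth, impute_mean(masked), hidden)
    print(f"{rate:.0%} missing: RMSE={rmse:.3f} MAE={mae:.3f}")
```

Swapping `impute_mean` for a random-forest-based imputer would leave the masking and scoring steps unchanged, which is what makes the error figures comparable across methods.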


Random Forest Missing Data Imputation Methods: Implications for Predicting At-Risk Students

Advances in Intelligent Systems and Computing - Intelligent Systems Design and Applications ◽

10.1007/978-3-030-49342-4_29 ◽

2020 ◽

pp. 298-308

Author(s):

Bevan I. Smith ◽

Charles Chimedza ◽

Jacoba H. Bührmann

Keyword(s):

At Risk ◽

Missing Data ◽

Random Forest ◽

At Risk Students ◽

Data Imputation ◽

Missing Data Imputation ◽

Imputation Methods


Investigating the Performance of CART- and Random Forest-Based Procedures for Dealing with Longitudinal Dropout in Small Sample Designs under MNAR Missing Data

Longitudinal Multivariate Psychology ◽

10.4324/9781315160542-11 ◽

2018 ◽

pp. 212-239 ◽

Cited By ~ 1

Author(s):

Timothy Hayes

Keyword(s):

Missing Data ◽

Random Forest ◽

Small Sample


Effect of the Temporal Gradient of Vegetation Indices on Early-Season Wheat Classification Using the Random Forest Classifier

Applied Sciences ◽

10.3390/app8081216 ◽

2018 ◽

Vol 8 (8) ◽

pp. 1216 ◽

Cited By ~ 5

Author(s):

Mousa Abad ◽

Ali Abkar ◽

Barat Mojaradi

Keyword(s):

Missing Data ◽

Random Forest ◽

Winter Wheat ◽

Temporal Resolution ◽

Decision Makers ◽

Wheat Crop ◽

Early Season ◽

Temporal Gradient ◽

Spectral Bands ◽

Multi Temporal

Early-season area estimation of the winter wheat crop, a strategic product, is important for decision-makers. Multi-temporal images are the best tool for measuring early-season winter wheat crops, but their classification raises several issues. Classification of multi-temporal images is affected by factors such as training sample size, temporal resolution, vegetation index (VI) type, the temporal gradient of spectral bands and VIs, the classifier, and values missing under cloudy conditions. This study addresses the effect of temporal resolution and VIs, along with the spectral and VI gradients, on the random forest (RF) classifier when missing data occur in multi-temporal images. To investigate the appropriate temporal resolution for image acquisition, a study area is selected in the overlap between two Landsat Data Continuity Mission (LDCM) paths. In the proposed method, the missing data from cloudy pixels are retrieved using the average of the k-nearest cloudless pixels in the feature space. Next, multi-temporal image analysis is performed by considering different scenarios provided by decision-makers for the desired crop types, which should be extracted early in the season in the study areas. The classification results obtained by RF improved by 2.2% when the temporally-missing data were retrieved using the proposed method. Moreover, the experimental results demonstrated that when the temporal resolution of Landsat-8 is increased to one week, the classification task can be conducted earlier with slightly better overall accuracy (OA) and kappa values. Incorporating VIs along with the temporal gradients of spectral bands and VIs into the RF classifier improved the OA by 3.1% and the kappa value by 6.6%, on average. The results show that if only three optimum images from seasonal changes in crops are available, the temporal gradient of the VIs and spectral bands becomes the primary tool available for discriminating wheat from barley.
The results also showed that if wheat and barley are considered as a single class versus other classes, using images from paths 162 and 163, both crops can be classified in March (at the beginning of the growth stage) with an overall accuracy of 97.1% and a kappa coefficient of 93.5%.
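The gap-filling step described above (replace a cloudy pixel's value with the average of its k-nearest cloudless pixels in feature space) might look like the sketch below. The abstract does not state the distance metric or which features define the space, so Euclidean distance and the two-band feature vectors are assumptions; the function name `knn_fill` and all numbers are hypothetical.

```python
import math


def knn_fill(features, target, k=3):
    """Fill None entries in `target` with the mean target value of the
    k nearest pixels (Euclidean distance on `features`) that are observed."""
    clear = [i for i, t in enumerate(target) if t is not None]
    if not clear:
        raise ValueError("at least one cloudless pixel is required")
    filled = list(target)
    for i, t in enumerate(target):
        if t is not None:
            continue  # pixel is cloud-free on this date, keep its value
        nearest = sorted(
            clear, key=lambda j: math.dist(features[i], features[j])
        )[:k]
        filled[i] = sum(target[j] for j in nearest) / len(nearest)
    return filled


# Feature vectors from cloud-free acquisition dates (assumed reflectances).
features = [(0.10, 0.30), (0.12, 0.31), (0.11, 0.29), (0.50, 0.70), (0.52, 0.69)]
# Band value on the cloudy date; None marks a cloud-covered pixel.
target = [0.20, 0.22, None, 0.60, 0.62]

print(knn_fill(features, target, k=2))
```

Neighbours are searched only among pixels that are cloud-free on the target date, so a filled value never depends on another filled value.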


MISSING DATA IMPUTATION ON IOT SENSOR NETWORKS: IMPLICATIONS FOR ON-SITE SENSOR CALIBRATION

10.36227/techrxiv.13633529 ◽

2021 ◽

Author(s):

Nwamaka Okafor

Keyword(s):

Sensor Networks ◽

Missing Data ◽

Random Forest ◽

Data Collection ◽

Missing Values ◽

Imputation Accuracy ◽

Sensor Calibration ◽

Concentration Data ◽

Missing Data Imputation

IoT sensors are gaining popularity in the environmental monitoring space due to their relatively small size, cost of acquisition, and ease of installation and operation. They are becoming an increasingly important supplement to traditional monitoring systems, particularly for in-situ monitoring. However, data collection based on IoT sensors is often plagued by missing values, usually the result of sensor faults, network failures, drifts, and other operational issues. Several imputation strategies have been proposed for handling missing values in various application domains. This paper examines the performance of different imputation techniques, including Multiple Imputation by Chained Equations (MICE), random-forest-based imputation (missForest), and K-Nearest Neighbour (KNN), for handling missing values on sensor networks deployed for the quantification of greenhouse gases (GHGs). Two tasks were conducted. First, ozone (O3) and NO2/O3 concentration data, collected using Aeroqual and Cairclip sensors respectively over a six-month data collection period, were corrupted by removing data intervals of different missing periods p, where p ∈ {1 day, 1 week, 2 weeks, 1 month}, and also at random points in the dataset at varying proportions r, where r ∈ {5%, 10%, 30%, 50%, 70%}. The missing data were then filled using the different imputation strategies and their imputation accuracy calculated. Second, the performance of sensor calibration by different regression models, including Multiple Linear Regression (MLR), Decision Tree (DT), Random Forest (RF), and XGBoost (XGB), trained on the different imputed datasets was evaluated. The analysis showed that the MICE technique outperformed the others in imputing the missing values on both the O3 and NO2/O3 datasets when missingness was introduced over periods p. MissForest, however, outperformed the rest when missingness was introduced as randomly occurring point errors.
While the analysis demonstrated the effects of missing and imputed data on sensor calibration, experimental results showed that a simple model on the imputed dataset can achieve state-of-the-art results on in-situ sensor calibration, improving the data quality of the sensor.
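The two corruption schemes described above, contiguous outages of period p versus scattered point gaps at proportion r, can be sketched as follows. This is an illustrative assumption-laden example, not the paper's code: the hourly sampling rate, the synthetic series, and the helper names `remove_block`, `remove_random`, and `longest_gap` are all hypothetical.

```python
import random


def remove_block(series, length, seed=0):
    """Blank one contiguous run of `length` readings (e.g. a sensor outage)."""
    rng = random.Random(seed)
    start = rng.randrange(len(series) - length + 1)
    return [None if start <= i < start + length else v
            for i, v in enumerate(series)]


def remove_random(series, proportion, seed=0):
    """Blank a given proportion of readings at randomly chosen positions."""
    rng = random.Random(seed)
    n_missing = round(proportion * len(series))
    hidden = set(rng.sample(range(len(series)), n_missing))
    return [None if i in hidden else v for i, v in enumerate(series)]


def longest_gap(series):
    """Length of the longest run of consecutive missing readings."""
    longest = current = 0
    for v in series:
        current = current + 1 if v is None else 0
        longest = max(longest, current)
    return longest


# Synthetic 60-day series, assuming one reading per hour.
hourly = [float(i % 24) for i in range(24 * 60)]

day_gap = remove_block(hourly, length=24)           # p = 1 day
scattered = remove_random(hourly, proportion=0.05)  # r = 5%

print("block:", longest_gap(day_gap), "random:", longest_gap(scattered))
```

A block gap leaves no nearby observations in time, while scattered gaps usually keep close neighbours on both sides, which is one plausible reason different imputers can rank differently under the two schemes.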


Handling missing data in a rheumatoid arthritis registry using random forest approach

International Journal of Rheumatic Diseases ◽

10.1111/1756-185x.14203 ◽

2021 ◽

Author(s):

Ahmad Alsaber ◽

Adeeba Al‐Herz ◽

Jiazhu Pan ◽

Ahmad T. AL‐Sultan ◽

Divya Mishra ◽

...

Keyword(s):

Rheumatoid Arthritis ◽

Missing Data ◽

Random Forest


Missing data completion method based on KNN and random forest

10.1117/12.2622876 ◽

2021 ◽

Author(s):

Songyu Zhang ◽

Yuchen Zhou ◽

Jinghua Yan ◽

Fanliang Bu

Keyword(s):

Missing Data ◽

Random Forest ◽

Data Completion


Improvement of random forest by multiple imputation applied to tower crane accident prediction with missing data

Engineering Construction & Architectural Management ◽

10.1108/ecam-07-2021-0606 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Ling Jiang ◽

Tingsheng Zhao ◽

Chuxuan Feng ◽

Wei Zhang

Keyword(s):

Missing Data ◽

Random Forest ◽

Prediction Model ◽

Multiple Imputation ◽

Incomplete Data ◽

Missing Values ◽

Critical Factors ◽

Content Type ◽

Accident Prediction ◽

Tower Crane

Purpose: This research aims to predict tower crane accident phases with incomplete data.
Design/methodology/approach: Tower crane accidents are collected for prediction model training. Random forest (RF) is used to conduct prediction. When there are missing values in the new inputs, they should be filled in advance. Nevertheless, it is difficult to collect complete data on construction sites. Thus, the authors use the multiple imputation (MI) method to improve RF. Finally, the prediction model is applied to a case study.
Findings: The results show that multiple imputation RF (MIRF) can effectively predict tower crane accidents when the data are incomplete. This research provides the importance ranking of tower crane safety factors. The critical factors should be the focus on site, because missing data seriously affect the prediction results. The values of the critical factors also influence tower crane safety.
Practical implication: This research promotes the application of machine learning methods for accident prediction in actual projects. From the onsite data, the authors can predict the accident phase of a tower crane. The results can be used for tower crane accident prevention.
Originality/value: Previous studies have seldom predicted tower crane accidents, especially the phase of the accident. This research uses tower crane data collected on site to predict the phase of the tower crane accident. Incomplete data collection is considered in this research, in line with actual site conditions.
