Missing data completion method based on KNN and random forest

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses advanced techniques for handling missing values in an air quality data set using a multiple imputation (MI) approach. The MCAR, MAR, and NMAR missing data mechanisms are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used is missForest, an iterative imputation method based on the random forest approach. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. All pollutant data were log-transformed to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and, thus, can be considered appropriate for analyzing air quality data.
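
The evaluation protocol described in this abstract (hide known values at a given rate, impute them, then score the imputed values with RMSE and MAE) can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the data are synthetic, only MCAR masking is shown, and a simple mean imputer stands in for missForest. The names `mask_mcar`, `impute_mean`, and `rmse_mae` are hypothetical.

```python
import math
import random


def mask_mcar(values, rate, rng):
    """Hide a fixed fraction of entries completely at random (MCAR)."""
    n_missing = round(rate * len(values))
    hidden = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, sorted(hidden)


def impute_mean(masked):
    """Stand-in imputer (assumption): fill gaps with the observed mean."""
    observed = [v for v in masked if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in masked]


def rmse_mae(truth, imputed, hidden):
    """Score only the positions that were deliberately hidden."""
    errors = [imputed[i] - truth[i] for i in hidden]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


rng = random.Random(0)
# Synthetic, log-transformed "pollutant" series (hypothetical values).
truth = [math.log(rng.uniform(5.0, 80.0)) for _ in range(200)]

results = {}
for rate in (0.05, 0.10, 0.20, 0.30, 0.40):
    masked, hidden = mask_mcar(truth, rate, rng)
    results[rate] = rmse_mae(truth, impute_mean(masked), hidden)
    print(f"{rate:.0%} missing: RMSE={results[rate][0]:.3f}, MAE={results[rate][1]:.3f}")
```

Replacing `impute_mean` with a stronger imputer leaves the rest of the loop unchanged, which is what makes this kind of comparison across methods and missing levels straightforward.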

Random Forest Missing Data Imputation Methods: Implications for Predicting At-Risk Students

Advances in Intelligent Systems and Computing - Intelligent Systems Design and Applications ◽

10.1007/978-3-030-49342-4_29 ◽

2020 ◽

pp. 298-308

Author(s):

Bevan I. Smith ◽

Charles Chimedza ◽

Jacoba H. Bührmann

Keyword(s):

At Risk ◽

Missing Data ◽

Random Forest ◽

At Risk Students ◽

Data Imputation ◽

Missing Data Imputation ◽

Imputation Methods

Investigating the Performance of CART- and Random Forest-Based Procedures for Dealing with Longitudinal Dropout in Small Sample Designs under MNAR Missing Data

Longitudinal Multivariate Psychology ◽

10.4324/9781315160542-11 ◽

2018 ◽

pp. 212-239 ◽

Cited By ~ 1

Author(s):

Timothy Hayes

Keyword(s):

Missing Data ◽

Random Forest ◽

Small Sample

Baseline distribution optimization and missing data completion in wavelet-based CS-TomoSAR

Science China Information Sciences ◽

10.1007/s11432-016-9068-y ◽

2017 ◽

Vol 61 (4) ◽

Cited By ~ 2

Author(s):

Hui Bi ◽

Jianguo Liu ◽

Bingchen Zhang ◽

Wen Hong

Keyword(s):

Missing Data ◽

Data Completion

Clustering-Based Multiple Imputation via Gray Relational Analysis for Missing Data and Its Application to Aerospace Field

The Scientific World Journal ◽

10.1155/2013/720392 ◽

2013 ◽

Vol 2013 ◽

pp. 1-10 ◽

Cited By ~ 5

Author(s):

Jing Tian ◽

Bing Yu ◽

Dan Yu ◽

Shilong Ma

Keyword(s):

Missing Data ◽

Data Quality ◽

Knowledge Discovery ◽

Multiple Imputation ◽

Industrial Applications ◽

Gray Relational Analysis ◽

Missing Value ◽

Similarity Metric ◽

Relational Analysis ◽

Data Completion

Scientific research and industrial applications commonly suffer from missing data. Some inappropriate techniques for treating missing values compromise data quality, which in turn harms knowledge discovery. In this paper, we propose a missing data completion method named CBGMI. First, it separates the nonmissing data instances into several clusters by excluding the missing-valued entries. Then, it utilizes the entropy of the proximal category for each incomplete instance in terms of the similarity metric based on gray relational analysis. Experiments on UCI datasets and aerospace datasets demonstrate the superiority of our algorithm over other approaches in terms of validity.
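
To illustrate the similarity metric this abstract mentions, the sketch below computes gray relational grades between an incomplete instance and several complete ones, using the standard gray relational coefficient with a distinguishing coefficient of 0.5. It covers only the similarity step and is not the CBGMI algorithm itself; the clustering and entropy steps are omitted, and the function name and data are hypothetical.

```python
def gray_relational_grades(reference, candidates, rho=0.5):
    """Gray relational grade of each candidate relative to the reference.

    Attributes that are missing (None) in the reference are skipped.
    """
    observed = [k for k, v in enumerate(reference) if v is not None]
    deltas = [[abs(reference[k] - cand[k]) for k in observed] for cand in candidates]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0:
        # Every candidate matches the reference exactly on observed attributes.
        return [1.0] * len(candidates)
    grades = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))
    return grades


# Hypothetical, already-normalized data: the incomplete instance lacks attribute 2.
incomplete = [0.20, 0.40, None]
complete = [
    [0.25, 0.35, 0.90],
    [0.80, 0.90, 0.10],
    [0.20, 0.40, 0.70],
]
grades = gray_relational_grades(incomplete, complete)
nearest = max(range(len(complete)), key=grades.__getitem__)
print(grades, nearest)
```

The instance with the highest grade is the most similar one on the observed attributes, so it is a natural source of information for the missing attribute.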

Effect of the Temporal Gradient of Vegetation Indices on Early-Season Wheat Classification Using the Random Forest Classifier

Applied Sciences ◽

10.3390/app8081216 ◽

2018 ◽

Vol 8 (8) ◽

pp. 1216 ◽

Cited By ~ 5

Author(s):

Mousa Abad ◽

Ali Abkar ◽

Barat Mojaradi

Keyword(s):

Missing Data ◽

Random Forest ◽

Winter Wheat ◽

Temporal Resolution ◽

Decision Makers ◽

Wheat Crop ◽

Early Season ◽

Temporal Gradient ◽

Spectral Bands ◽

Multi Temporal

Early-season area estimation of the winter wheat crop as a strategic product is important for decision-makers. Multi-temporal images are the best tool to measure early-season winter wheat crops, but there are issues with classification. Classification of multi-temporal images is affected by factors such as training sample size, temporal resolution, vegetation index (VI) type, temporal gradient of spectral bands and VIs, classifiers, and values missed under cloudy conditions. This study addresses the effect of the temporal resolution and VIs, along with the spectral and VIs gradient on the random forest (RF) classifier when missing data occurs in multi-temporal images. To investigate the appropriate temporal resolution for image acquisition, a study area is selected on an overlapping area between two Landsat Data Continuity Mission (LDCM) paths. In the proposed method, the missing data from cloudy pixels are retrieved using the average of the k-nearest cloudless pixels in the feature space. Next, multi-temporal image analysis is performed by considering different scenarios provided by decision-makers for the desired crop types, which should be extracted early in the season in the study areas. The classification results obtained by RF improved by 2.2% when the temporally-missing data were retrieved using the proposed method. Moreover, the experimental results demonstrated that when the temporal resolution of Landsat-8 is increased to one week, the classification task can be conducted earlier with slightly better overall accuracy (OA) and kappa values. The effect of incorporating VIs along with the temporal gradients of spectral bands and VIs into the RF classifier improved the OA by 3.1% and the kappa value by 6.6%, on average. The results show that if only three optimum images from seasonal changes in crops are available, the temporal gradient of the VIs and spectral bands becomes the primary tool available for discriminating wheat from barley. 
The results also showed that if wheat and barley are considered as a single class versus the other classes, using images from paths 162 and 163, both crops can be classified in March (at the beginning of the growth stage) with an overall accuracy of 97.1% and a kappa coefficient of 93.5%.
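
The retrieval step this abstract describes, filling a cloudy pixel with the average of its k nearest cloudless pixels in feature space, can be sketched as below. The sketch assumes Euclidean distance over the features the cloudy pixel still has; the feature values and the name `knn_fill` are hypothetical, and the paper's actual feature space is not reproduced here.

```python
import math


def knn_fill(target, donors, k=3):
    """Fill None entries of `target` with the mean of the k nearest donors.

    Distance is Euclidean over the features that `target` does have.
    """
    observed = [j for j, v in enumerate(target) if v is not None]
    missing = [j for j, v in enumerate(target) if v is None]

    def distance(donor):
        return math.sqrt(sum((target[j] - donor[j]) ** 2 for j in observed))

    nearest = sorted(donors, key=distance)[:k]
    filled = list(target)
    for j in missing:
        filled[j] = sum(d[j] for d in nearest) / len(nearest)
    return filled


# Hypothetical reflectance-like features; the last one is hidden by cloud.
cloudless = [
    [0.10, 0.30, 0.50],
    [0.12, 0.28, 0.54],
    [0.11, 0.31, 0.52],
    [0.60, 0.70, 0.90],
]
cloudy = [0.11, 0.30, None]
restored = knn_fill(cloudy, cloudless, k=3)
print(restored)
```

With these values the three nearest donors are the first three rows, so the hidden feature is restored to about 0.52 and the distant fourth pixel has no influence.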

Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

American Journal of Epidemiology ◽

10.1093/aje/kwt312 ◽

2014 ◽

Vol 179 (6) ◽

pp. 764-774 ◽

Cited By ~ 140

Author(s):

Anoop D. Shah ◽

Jonathan W. Bartlett ◽

James Carpenter ◽

Owen Nicholas ◽

Harry Hemingway

Keyword(s):

Missing Data ◽

Random Forest

Joint Matrix Decomposition-Based Missing Data Completion in Low-Voltage Area

Mathematical Problems in Engineering ◽

10.1155/2021/4170064 ◽

2021 ◽

Vol 2021 ◽

pp. 1-15

Author(s):

Haowen Wu ◽

Chen Yang ◽

Wenwang Xie ◽

Wei Zhang

Keyword(s):

Missing Data ◽

Low Voltage ◽

Learning Algorithm ◽

Matrix Decomposition ◽

Current Data ◽

Power Grids ◽

Data Completion ◽

Completion Effect ◽

Better Than ◽

Area Data

In-depth mining and analysis of electricity data in low-voltage areas are essential for the further intelligent development of power grids. However, in actual data collection and measurement in low-voltage areas, data go missing and complete electricity data cannot be obtained. To obtain complete power data, this paper proposes a missing data completion model for low-voltage station areas based on joint matrix decomposition. First, we analyze the characteristics of the low-voltage station area data. Then, a model that comprehensively considers these characteristics is proposed. Named LPZ, it comprises three parts: the construction of a low-voltage station area data tensor, the joint matrix decomposition, and the completion of the missing data. After that, the CIM learning algorithm proposed in this paper is used to solve the model iteratively and obtain the completed data. Finally, the proposed method is applied to two situations, random loss and all-day loss of real current data in a low-voltage station area, and compared with traditional completion methods. The experimental results show that the method is effective and that its completion effect is better than that of other completion methods.
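
The general idea behind decomposition-based completion is to fit low-rank factors to the observed entries and read the missing entries off the fitted product. The sketch below shows it with a plain rank-1 factorization solved by alternating least squares. It is a simplified stand-in under stated assumptions, not the LPZ model or the CIM algorithm from the paper; the data and the name `complete_rank1` are hypothetical.

```python
def complete_rank1(matrix, iterations=200):
    """Fit matrix[i][j] ~ u[i] * v[j] on observed entries by alternating
    least squares, then fill the None entries from the fitted factors."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Update each row factor using only that row's observed entries.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            u[i] = sum(matrix[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Update each column factor using only that column's observed entries.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            v[j] = sum(matrix[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    return [
        [matrix[i][j] if matrix[i][j] is not None else u[i] * v[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Hypothetical readings with exact rank-1 structure (each row is a multiple of row 0).
readings = [
    [1.0, 2.0, None],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, 9.0],
    [None, 8.0, 12.0],
]
completed = complete_rank1(readings)
print(completed[0][2], completed[3][0])
```

Because the example data are exactly rank 1, the two hidden entries are recovered as approximately 3 and 4. Real measurements would need a higher rank, regularization, and a model of their temporal structure.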

Missing Data Imputation on IoT Sensor Networks: Implications for On-Site Sensor Calibration

10.36227/techrxiv.13633529 ◽

2021 ◽

Author(s):

Nwamaka Okafor

Keyword(s):

Sensor Networks ◽

Missing Data ◽

Random Forest ◽

Data Collection ◽

Missing Values ◽

Imputation Accuracy ◽

Sensor Calibration ◽

Concentration Data ◽

Missing Data Imputation

IoT sensors are gaining popularity in the environmental monitoring space due to their relatively small size and acquisition cost and their ease of installation and operation. They are becoming an increasingly important supplement to traditional monitoring systems, particularly for in-situ monitoring. However, data collected from IoT sensors are often plagued with missing values, usually as a result of sensor faults, network failures, drifts, and other operational issues. Several imputation strategies have been proposed for handling missing values in various application domains. This paper examines the performance of different imputation techniques, including Multiple Imputation by Chained Equations (MICE), random forest based imputation (missForest), and K-Nearest Neighbour (KNN), for handling missing values on sensor networks deployed for the quantification of greenhouse gases (GHGs). Two tasks were conducted. First, ozone (O3) and NO2/O3 concentration data, collected using Aeroqual and Cairclip sensors respectively over a six-month data collection period, were corrupted by removing data intervals of different missing periods p, where p ∈ {1 day, 1 week, 2 weeks, 1 month}, and also at random points in the dataset at varying proportions r, where r ∈ {5%, 10%, 30%, 50%, 70%}. The missing data were then filled using the different imputation strategies and their imputation accuracy was calculated. Second, the performance of sensor calibration by different regression models, including Multiple Linear Regression (MLR), Decision Tree (DT), Random Forest (RF), and XGBoost (XGB), trained on the different imputed datasets was evaluated. The analysis showed that the MICE technique outperformed the others in imputing the missing values on both the O3 and NO2/O3 datasets when missingness was introduced over periods p. MissForest, however, outperformed the rest when missingness was introduced as randomly occurring point errors. While the analysis demonstrated the effects of missing and imputed data on sensor calibration, experimental results showed that a simple model on the imputed dataset can achieve state-of-the-art results on in-situ sensor calibration, improving the data quality of the sensor.
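
The two corruption schemes in this abstract, removing a contiguous interval of a given period versus removing isolated points at a given proportion, produce very different gap patterns. A minimal sketch of both follows, assuming an hourly series; the sampling rate, the series values, and the function names are hypothetical and not taken from the paper.

```python
import random


def remove_block(series, length, rng):
    """Hide one contiguous interval of `length` samples (a missing period p)."""
    start = rng.randrange(len(series) - length + 1)
    return [None if start <= i < start + length else v for i, v in enumerate(series)]


def remove_random(series, proportion, rng):
    """Hide a proportion r of samples at random, unconnected positions."""
    hidden = set(rng.sample(range(len(series)), round(proportion * len(series))))
    return [None if i in hidden else v for i, v in enumerate(series)]


rng = random.Random(42)
# Hypothetical hourly series covering 60 days.
hourly = [float(i % 24) for i in range(24 * 60)]

one_week_gap = remove_block(hourly, 24 * 7, rng)
thirty_percent = remove_random(hourly, 0.30, rng)
print(one_week_gap.count(None), thirty_percent.count(None))  # prints "168 432"
```

A one-week block leaves no nearby observations inside the gap, while scattered point gaps are usually surrounded by observed neighbours. That difference is one plausible reason different imputers can come out ahead under the two schemes.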

Handling missing data in a rheumatoid arthritis registry using random forest approach

International Journal of Rheumatic Diseases ◽

10.1111/1756-185x.14203 ◽

2021 ◽

Author(s):

Ahmad Alsaber ◽

Adeeba Al‐Herz ◽

Jiazhu Pan ◽

Ahmad T. AL‐Sultan ◽

Divya Mishra ◽

...

Keyword(s):

Rheumatoid Arthritis ◽

Missing Data ◽

Random Forest
