Missing data completion method based on KNN and random forest

2021 ◽  
Author(s):  
Songyu Zhang ◽  
Yuchen Zhou ◽  
Jinghua Yan ◽  
Fanliang Bu
Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses some advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing data mechanisms are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is based on the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithmic transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism yielded the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) of the imputation methods compared and, thus, can be considered appropriate for analyzing air quality data.
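The evaluation protocol described above (hide a known fraction of values, impute them, then score the imputed values with RMSE and MAE) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the series is synthetic, only MCAR masking is shown, and a simple mean fill stands in for missForest. The function names `mask_mcar`, `impute_mean`, and `rmse_mae` are hypothetical.

```python
import math
import random


def mask_mcar(values, rate, seed=0):
    """Hide round(rate * n) entries chosen uniformly at random (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    hidden = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, sorted(hidden)


def impute_mean(masked):
    """Placeholder imputer (NOT missForest): fill gaps with the observed mean."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def rmse_mae(truth, imputed, hidden):
    """Score only the positions that were artificially hidden."""
    errors = [imputed[i] - truth[i] for i in hidden]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


# Synthetic, log-transformed "pollutant" series (an assumption for illustration).
truth = [math.log(20 + 10 * math.sin(i / 5.0)) for i in range(200)]

for rate in (0.05, 0.10, 0.20, 0.30, 0.40):
    masked, hidden = mask_mcar(truth, rate, seed=42)
    rmse, mae = rmse_mae(truth, impute_mean(masked), hidden)
    print(f"missing={rate:.0%}  RMSE={rmse:.4f}  MAE={mae:.4f}")
```

Swapping `impute_mean` for another imputer leaves the masking and scoring steps unchanged, which is what makes error figures comparable across methods.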


2013 ◽  
Vol 2013 ◽  
pp. 1-10 ◽  
Author(s):  
Jing Tian ◽  
Bing Yu ◽  
Dan Yu ◽  
Shilong Ma

Many scientific studies and industrial applications suffer from missing data. Inappropriate missing-value treatment compromises data quality, which in turn harms knowledge discovery. In this paper, we propose a missing data completion method named CBGMI. First, it separates the nonmissing data instances into several clusters by excluding the missing-valued entries. Then, it utilizes the entropy of the proximal category for each incomplete instance in terms of the similarity metric based on gray relational analysis. Experiments on UCI datasets and aerospace datasets demonstrate the superiority of our algorithm over other approaches in terms of validity.
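To make the similarity metric more concrete, the sketch below computes a standard gray relational grade between one incomplete instance and several complete candidates, skipping the attributes that are missing in the incomplete instance. It is not the CBGMI algorithm itself (the clustering and entropy steps are omitted). It assumes attributes are already scaled to a common range and uses the conventional distinguishing coefficient of 0.5. The function name `gray_relational_grades` is hypothetical.

```python
def gray_relational_grades(reference, candidates, rho=0.5):
    """Gray relational grade of each candidate with respect to `reference`.

    `reference` may contain None for missing attributes; those columns are
    ignored. `candidates` are assumed to be complete instances.
    """
    cols = [k for k, v in enumerate(reference) if v is not None]
    # Absolute differences on the observed attributes only.
    deltas = [[abs(reference[k] - cand[k]) for k in cols] for cand in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        # Every candidate matches the reference exactly on observed attributes.
        return [1.0 for _ in candidates]
    return [
        sum((d_min + rho * d_max) / (d + rho * d_max) for d in row) / len(row)
        for row in deltas
    ]


# Toy data: the second attribute of the reference instance is missing.
reference = [0.2, None, 0.5]
candidates = [
    [0.2, 0.9, 0.5],
    [0.8, 0.1, 0.1],
    [0.3, 0.4, 0.5],
]
grades = gray_relational_grades(reference, candidates)
for cand, grade in zip(candidates, grades):
    print(cand, round(grade, 3))
```

A higher grade means the candidate is closer to the incomplete instance on the attributes that are actually observed.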


2018 ◽  
Vol 8 (8) ◽  
pp. 1216 ◽  
Author(s):  
Mousa Abad ◽  
Ali Abkar ◽  
Barat Mojaradi

Early-season area estimation of the winter wheat crop as a strategic product is important for decision-makers. Multi-temporal images are the best tool to measure early-season winter wheat crops, but there are issues with classification. Classification of multi-temporal images is affected by factors such as training sample size, temporal resolution, vegetation index (VI) type, temporal gradient of spectral bands and VIs, classifiers, and values missing under cloudy conditions. This study addresses the effect of the temporal resolution and VIs, along with the gradients of the spectral bands and VIs, on the random forest (RF) classifier when missing data occur in multi-temporal images. To investigate the appropriate temporal resolution for image acquisition, a study area is selected on an overlapping area between two Landsat Data Continuity Mission (LDCM) paths. In the proposed method, the missing data from cloudy pixels are retrieved using the average of the k-nearest cloudless pixels in the feature space. Next, multi-temporal image analysis is performed by considering different scenarios provided by decision-makers for the desired crop types, which should be extracted early in the season in the study areas. The classification results obtained by RF improved by 2.2% when the temporally-missing data were retrieved using the proposed method. Moreover, the experimental results demonstrated that when the temporal resolution of Landsat-8 is increased to one week, the classification task can be conducted earlier with slightly better overall accuracy (OA) and kappa values. Incorporating VIs along with the temporal gradients of spectral bands and VIs into the RF classifier improved the OA by 3.1% and the kappa value by 6.6%, on average. The results show that if only three optimum images from seasonal changes in crops are available, the temporal gradient of the VIs and spectral bands becomes the primary tool available for discriminating wheat from barley. 
The results also showed that if wheat and barley are considered as a single class versus other classes, with the use of images associated with paths 162 and 163, both crops can be classified in March (at the beginning of the growth stage) with an overall accuracy of 97.1% and a kappa coefficient of 93.5%.
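The gap-filling step described above (replace a cloudy pixel with the average of its k nearest cloudless pixels in feature space) might look roughly like the following. This is an illustrative sketch only. It assumes each pixel has a feature vector taken from cloud-free acquisitions, that distance is Euclidean, and that a cloudy value is marked as `None`. The function name `fill_cloudy_pixels` and the toy numbers are hypothetical.

```python
import math


def fill_cloudy_pixels(features, values, k=3):
    """Fill None entries in `values` with the mean of the k nearest clear pixels.

    features[i] is the feature vector of pixel i (from cloud-free dates);
    values[i] is the band value on the target date, or None if cloudy.
    """
    clear = [i for i, v in enumerate(values) if v is not None]
    if not clear:
        raise ValueError("no cloudless pixels available to borrow from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Rank the clear pixels by Euclidean distance in feature space.
        nearest = sorted(clear, key=lambda j: math.dist(features[i], features[j]))[:k]
        filled[i] = sum(values[j] for j in nearest) / len(nearest)
    return filled


# Toy example: the last pixel is cloudy on the target date.
features = [
    [0.10, 0.20], [0.12, 0.22], [0.11, 0.19],  # one spectral cluster
    [0.80, 0.90], [0.82, 0.88],                # a different cluster
    [0.11, 0.21],                              # cloudy pixel, near the first cluster
]
values = [0.30, 0.34, 0.32, 0.70, 0.72, None]
filled = fill_cloudy_pixels(features, values, k=3)
print(filled[5])
```

Because neighbors are chosen in feature space rather than by image position, the cloudy pixel borrows from spectrally similar pixels even when they are far apart on the ground.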


2014 ◽  
Vol 179 (6) ◽  
pp. 764-774 ◽  
Author(s):  
Anoop D. Shah ◽  
Jonathan W. Bartlett ◽  
James Carpenter ◽  
Owen Nicholas ◽  
Harry Hemingway

2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Haowen Wu ◽  
Chen Yang ◽  
Wenwang Xie ◽  
Wei Zhang

In-depth mining and analysis of electricity data in low-voltage station areas are essential for the further intelligent development of power grids. However, in actual data collection and measurement in low-voltage station areas, data go missing, and complete electricity data cannot be obtained. To obtain complete power data, this paper proposes a missing data completion model for low-voltage station areas based on joint matrix decomposition. First, we analyse the characteristics of the low-voltage station area data. Then, a model that comprehensively considers these characteristics is proposed. The model, named LPZ, includes three parts: the construction of a low-voltage station area data tensor, the joint matrix decomposition, and the completion of the missing data. After that, the CIM learning algorithm proposed in this paper is used to iteratively solve the model to obtain the completed data. Finally, the proposed method is applied to two situations, random loss and all-day loss of real current data in a low-voltage station area, and compared with traditional completion methods. The experimental results show that the method is effective and that its completion results are better than those of other completion methods.


2021 ◽  
Author(s):  
Nwamaka Okafor

IoT sensors are gaining popularity in the environmental monitoring space due to their relatively small size, low cost of acquisition, and ease of installation and operation. They are becoming an increasingly important supplement to traditional monitoring systems, particularly for in-situ monitoring. However, data collected with IoT sensors are often plagued by missing values, usually as a result of sensor faults, network failures, drifts, and other operational issues. Several imputation strategies have been proposed for handling missing values in various application domains. This paper examines the performance of different imputation techniques, including Multiple Imputation by Chained Equations (MICE), random forest based imputation (missForest), and K-Nearest Neighbour (KNN), for handling missing values in sensor networks deployed for the quantification of greenhouse gases (GHGs). Two tasks were conducted. First, ozone (O3) and NO2/O3 concentration data, collected using Aeroqual and Cairclip sensors respectively over a six-month data collection period, were corrupted by removing data intervals of different missing periods (p), where p ∈ {1 day, 1 week, 2 weeks, 1 month}, and also at random points in the dataset at varying proportions (r), where r ∈ {5%, 10%, 30%, 50%, 70%}. The missing data were then filled using the different imputation strategies and their imputation accuracy calculated. Second, the performance of sensor calibration by different regression models, including Multiple Linear Regression (MLR), Decision Tree (DT), Random Forest (RF), and XGBoost (XGB), trained on the different imputed datasets was evaluated. The analysis showed that the MICE technique outperformed the others in imputing the missing values on both the O3 and NO2/O3 datasets when missingness was introduced over periods p. MissForest, however, outperformed the rest when missingness was introduced as randomly occurring point errors. 
While the analysis demonstrated the effects of missing and imputed data on sensor calibration, experimental results showed that a simple model on the imputed dataset can achieve state-of-the-art results on in-situ sensor calibration, improving the data quality of the sensor.
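The two corruption patterns used in the first task, a contiguous missing period p versus randomly scattered points at proportion r, differ mainly in the length of the gaps they produce. The sketch below illustrates that difference on a placeholder series. It assumes hourly sampling, which the abstract does not state, and the helper names `remove_interval`, `remove_random_points`, and `longest_gap` are hypothetical.

```python
import random

SAMPLES_PER_DAY = 24  # assumption: one reading per hour


def remove_interval(series, start, length):
    """Blank out one contiguous block of `length` samples beginning at `start`."""
    return [None if start <= i < start + length else v for i, v in enumerate(series)]


def remove_random_points(series, proportion, seed=0):
    """Blank out round(proportion * n) samples at uniformly random positions."""
    rng = random.Random(seed)
    n_missing = round(proportion * len(series))
    hidden = set(rng.sample(range(len(series)), n_missing))
    return [None if i in hidden else v for i, v in enumerate(series)]


def longest_gap(series):
    """Length of the longest run of consecutive missing samples."""
    best = run = 0
    for v in series:
        run = run + 1 if v is None else 0
        best = max(best, run)
    return best


series = [float(i % 24) for i in range(60 * SAMPLES_PER_DAY)]  # 60 days of dummy data

one_week = remove_interval(series, start=10 * SAMPLES_PER_DAY, length=7 * SAMPLES_PER_DAY)
scattered = remove_random_points(series, proportion=0.10, seed=7)

print("interval: missing =", one_week.count(None), "longest gap =", longest_gap(one_week))
print("random:   missing =", scattered.count(None), "longest gap =", longest_gap(scattered))
```

A week-long block leaves no observed neighbors near the middle of the gap, whereas scattered point losses are usually surrounded by observed values, so the two patterns can favor different imputers.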


Author(s):  
Ahmad Alsaber ◽  
Adeeba Al‐Herz ◽  
Jiazhu Pan ◽  
Ahmad T. AL‐Sultan ◽  
Divya Mishra ◽  
...  
