Analysis and Impact Evaluation of Missing Data Imputation in Day-ahead PV Generation Forecasting

Over the past decade, PV power plants have increasingly contributed to power generation. However, PV power generation widely varies due to environmental factors; thus, the accurate forecasting of PV generation becomes essential. Meanwhile, weather data for environmental factors include many missing values; for example, when we estimated the missing values in the precipitation data of the Korea Meteorological Agency, they amounted to ~16% from 2015–2016, and further, 19% of the weather data were missing for 2017. Such missing values deteriorate the PV power generation prediction performance, and they need to be eliminated by filling in other values. Here, we explore the impact of missing data imputation methods that can be used to replace these missing values. We apply four missing data imputation methods to the training data and test data of the prediction model based on support vector regression. When the k-nearest neighbors method is applied to the test data, the prediction performance yields results closest to those for the original data with no missing values, and the prediction model’s performance is stable even when the missing data rate increases. Therefore, we conclude that the most appropriate missing data imputation for application to PV forecasting is the KNN method.

Download Full-text

Missing Data Imputation in Internet of Things Gateways

Information ◽

10.3390/info12100425 ◽

2021 ◽

Vol 12 (10) ◽

pp. 425

Author(s):

Cinthya M. França ◽

Rodrigo S. Couto ◽

Pedro B. Velloso

Keyword(s):

Neural Network ◽

Missing Data ◽

Internet Of Things ◽

Execution Time ◽

Regression Models ◽

Weather Data ◽

Data Imputation ◽

Missing Data Imputation ◽

Imputation Methods ◽

The Neural Network

In an Internet of Things (IoT) environment, sensors collect and send data to application servers through IoT gateways. However, these data may be missing values due to networking problems or sensor malfunction, which reduces applications’ reliability. This work proposes a mechanism to predict and impute missing data in IoT gateways to achieve greater autonomy at the network edge. These gateways typically have limited computing resources. Therefore, the missing data imputation methods must be simple and provide good results. Thus, this work presents two regression models based on neural networks to impute missing data in IoT gateways. In addition to the prediction quality, we analyzed both the execution time and the amount of memory used. We validated our models using six years of weather data from Rio de Janeiro, varying the missing data percentages. The results show that the neural network regression models perform better than the other imputation methods analyzed, based on the averages and repetition of previous values, for all missing data percentages. In addition, the neural network models present a short execution time and need less than 140 KiB of memory, which allows them to run on IoT gateways.

Download Full-text

Evaluation of Missing Data Imputation Methods for an Enhanced Distributed PV Generation Prediction

Advances in Intelligent Systems and Computing - Proceedings of the Future Technologies Conference (FTC) 2019 ◽

10.1007/978-3-030-32520-6_43 ◽

2019 ◽

pp. 590-609 ◽

Cited By ~ 2

Author(s):

Aditya Sundararajan ◽

Arif I. Sarwat

Keyword(s):

Missing Data ◽

Data Imputation ◽

Missing Data Imputation ◽

Imputation Methods ◽

Pv Generation

Download Full-text

Advanced methods for missing values imputation based on similarity learning

PeerJ Computer Science ◽

10.7717/peerj-cs.619 ◽

2021 ◽

Vol 7 ◽

pp. e619

Author(s):

Khaled M. Fouad ◽

Mahmoud M. Ismail ◽

Ahmad Taher Azar ◽

Mona M. Arafa

Keyword(s):

Missing Data ◽

Missing Values ◽

Imputation Accuracy ◽

Nearest Neighbors ◽

Imputation Method ◽

Data Imputation ◽

K Nearest Neighbors ◽

Missing Data Imputation ◽

K Value ◽

Imputation Methods

The real-world data analysis and processing using data mining techniques often are facing observations that contain missing values. The main challenge of mining datasets is the existence of missing values. The missing values in a dataset should be imputed using the imputation method to improve the data mining methods’ accuracy and performance. There are existing techniques that use k-nearest neighbors algorithm for imputing the missing values but determining the appropriate k value can be a challenging task. There are other existing imputation techniques that are based on hard clustering algorithms. When records are not well-separated, as in the case of missing data, hard clustering provides a poor description tool in many cases. In general, the imputation depending on similar records is more accurate than the imputation depending on the entire dataset's records. Improving the similarity among records can result in improving the imputation performance. This paper proposes two numerical missing data imputation methods. A hybrid missing data imputation method is initially proposed, called KI, that incorporates k-nearest neighbors and iterative imputation algorithms. The best set of nearest neighbors for each missing record is discovered through the records similarity by using the k-nearest neighbors algorithm (kNN). To improve the similarity, a suitable k value is estimated automatically for the kNN. The iterative imputation method is then used to impute the missing values of the incomplete records by using the global correlation structure among the selected records. An enhanced hybrid missing data imputation method is then proposed, called FCKI, which is an extension of KI. It integrates fuzzy c-means, k-nearest neighbors, and iterative imputation algorithms to impute the missing data in a dataset. The fuzzy c-means algorithm is selected because the records can belong to multiple clusters at the same time. This can lead to further improvement for similarity. FCKI searches a cluster, instead of the whole dataset, to find the best k-nearest neighbors. It applies two levels of similarity to achieve a higher imputation accuracy. The performance of the proposed imputation techniques is assessed by using fifteen datasets with variant missing ratios for three types of missing data; MCAR, MAR, MNAR. These different missing data types are generated in this work. The datasets with different sizes are used in this paper to validate the model. Therefore, proposed imputation techniques are compared with other missing data imputation methods by means of three measures; the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.

Download Full-text

Data Imputation Methods for Missing Values in the Context of Clustering

Big Data and Knowledge Sharing in Virtual Organizations - Advances in Knowledge Acquisition, Transfer, and Management ◽

10.4018/978-1-5225-7519-1.ch011 ◽

2019 ◽

pp. 240-274

Author(s):

Mehmet S. Aktaş ◽

Sinan Kaplan ◽

Hasan Abacı ◽

Oya Kalipsiz ◽

Utku Ketenci ◽

...

Keyword(s):

Missing Data ◽

Expectation Maximization ◽

Missing Values ◽

Nearest Neighbor ◽

Real Life ◽

Data Imputation ◽

K Nearest Neighbor ◽

Missing Data Imputation ◽

Data Scarcity ◽

Imputation Methods

Missing data is a common problem for data clustering quality. Most real-life datasets have missing data, which in turn has some effect on clustering tasks. This chapter investigates the appropriate data treatment methods for varying missing data scarcity distributions including gamma, Gaussian, and beta distributions. The analyzed data imputation methods include mean, hot-deck, regression, k-nearest neighbor, expectation maximization, and multiple imputation. To reveal the proper methods to deal with missing data, data mining tasks such as clustering is utilized for evaluation. With the experimental studies, this chapter identifies the correlation between missing data imputation methods and missing data distributions for clustering tasks. The results of the experiments indicated that expectation maximization and k-nearest neighbor methods provide best results for varying missing data scarcity distributions.

Download Full-text

Missing Data Imputation Method for Autism Prediction

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d4551.018520 ◽

2020 ◽

Vol 8 (5) ◽

pp. 940-944

Keyword(s):

Machine Learning ◽

Missing Data ◽

Missing Values ◽

Imputation Method ◽

Support Vector ◽

Data Imputation ◽

Missing Data Imputation ◽

Imputation Methods ◽

Significant Difference ◽

Friedman's Test

Missing data imputation is essential task becauseremoving all records with missing values will discard useful information from other attributes. This paper estimates the performanceof prediction for autism dataset with imputed missing values. Statistical imputation methods like mean, imputation with zero or constant and machine learning imputation methods like K-nearest neighbour chained Equation methods were compared with the proposed deep learning imputation method. The predictions of patients with autistic spectrum disorder were measured using support vector machine for imputed dataset. Among the imputation methods, Deeplearningalgorithm outperformed statistical and machine learning imputation methods. The same is validated using significant difference in p values revealed using Friedman’s test

Download Full-text

Kernel weighted least square approach for imputing missing values of metabolomics data

Scientific Reports ◽

10.1038/s41598-021-90654-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Nishith Kumar ◽

Md. Aminul Hoque ◽

Masahiro Sugimoto

Keyword(s):

Missing Data ◽

Large Scale ◽

Missing Values ◽

Kernel Weight ◽

Least Square ◽

Data Matrix ◽

Data Imputation ◽

Metabolomics Data ◽

Missing Value ◽

Missing Data Imputation

AbstractMass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional large-scale matrix (samples × metabolites) of quantified data that often contain missing cells in the data matrix as well as outliers that originate for several reasons, including technical and biological sources. Although several missing data imputation techniques are described in the literature, all conventional existing techniques only solve the missing value problems. They do not relieve the problems of outliers. Therefore, outliers in the dataset decrease the accuracy of the imputation. We developed a new kernel weight function-based proposed missing data imputation technique that resolves the problems of missing values and outliers. We evaluated the performance of the proposed method and other conventional and recently developed missing imputation techniques using both artificially generated data and experimentally measured data analysis in both the absence and presence of different rates of outliers. Performances based on both artificial data and real metabolomics data indicate the superiority of our proposed kernel weight-based missing data imputation technique to the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https://github.com/NishithPaul/tWLSA.

Download Full-text

Random Forest Missing Data Imputation Methods: Implications for Predicting At-Risk Students

Advances in Intelligent Systems and Computing - Intelligent Systems Design and Applications ◽

10.1007/978-3-030-49342-4_29 ◽

2020 ◽

pp. 298-308

Author(s):

Bevan I. Smith ◽

Charles Chimedza ◽

Jacoba H. Bührmann

Keyword(s):

At Risk ◽

Missing Data ◽

Random Forest ◽

At Risk Students ◽

Data Imputation ◽

Missing Data Imputation ◽

Imputation Methods

Download Full-text

Reviewing Autoencoders for Missing Data Imputation: Technical Trends, Applications and Outcomes

Journal of Artificial Intelligence Research ◽

10.1613/jair.1.12312 ◽

2020 ◽

Vol 69 ◽

pp. 1255-1285

Author(s):

Ricardo Cardoso Pereira ◽

Miriam Seoane Santos ◽

Pedro Pereira Rodrigues ◽

Pedro Henriques Abreu

Keyword(s):

Missing Data ◽

Missing Values ◽

State Of The Art ◽

Data Imputation ◽

Tabular Data ◽

Missing Data Imputation ◽

Learning Techniques ◽

Real World Datasets ◽

And Training ◽

Machine Learning Models

Missing data is a problem often found in real-world datasets and it can degrade the performance of most machine learning models. Several deep learning techniques have been used to address this issue, and one of them is the Autoencoder and its Denoising and Variational variants. These models are able to learn a representation of the data with missing values and generate plausible new ones to replace them. This study surveys the use of Autoencoders for the imputation of tabular data and considers 26 works published between 2014 and 2020. The analysis is mainly focused on discussing patterns and recommendations for the architecture, hyperparameters and training settings of the network, while providing a detailed discussion of the results obtained by Autoencoders when compared to other state-of-the-art methods, and of the data contexts where they have been applied. The conclusions include a set of recommendations for the technical settings of the network, and show that Denoising Autoencoders outperform their competitors, particularly the often used statistical methods.

Download Full-text

SICE: an improved missing data imputation technique

Journal Of Big Data ◽

10.1186/s40537-020-00313-w ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Shahidul Islam Khan ◽

Abu Sayed Md Latiful Hoque

Keyword(s):

Missing Data ◽

Binary Data ◽

Missing Values ◽

Hybrid Approach ◽

Data Imputation ◽

Missing Data Imputation ◽

Wrong Prediction ◽

A New Technique ◽

Numeric Data ◽

Public Datasets

Abstract In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values could lead to a wrong prediction. In this era of big data, when a massive volume of data is generated in every second, and utilization of these data is a major concern to the stakeholders, efficiently handling missing values becomes more important. In this paper, we have proposed a new technique for missing data imputation, which is a hybrid approach of single and multiple imputation techniques. We have proposed an extension of popular Multivariate Imputation by Chained Equation (MICE) algorithm in two variations to impute categorical and numeric data. We have also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We have collected sixty-five thousand real health records from different hospitals and diagnostic centers of Bangladesh, maintaining the privacy of data. We have also collected three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We have compared the performance of our proposed algorithms with existing algorithms using these datasets. Experimental results show that our proposed algorithm achieves 20% higher F-measure for binary data imputation and 11% less error for numeric data imputations than its competitors with similar execution time.

Download Full-text