missing data imputation
Recently Published Documents


TOTAL DOCUMENTS

384
(FIVE YEARS 191)

H-INDEX

28
(FIVE YEARS 5)

2022 ◽  
Vol 14 (1) ◽  
pp. 0-0

Many real world datasets may contain missing values for various reasons. These incomplete datasets can pose severe issues to the underlying machine learning algorithms and decision support systems. It may result in high computational cost, skewed output and invalid deductions. Various solutions exist to mitigate this issue; the most popular strategy is to estimate the missing values by applying inferential techniques such as linear regression, decision trees or Bayesian inference. In this paper, the missing data problem is discussed in detail with a comprehensive review of the approaches to tackle it. The paper concludes with a discussion on the effectiveness of three imputation methods namely, imputation based on Multiple Linear Regression (MLR), Predictive Mean Matching (PMM) and Classification And Regression Tree (CART) in the context of subspace clustering. The experimental results obtained on real benchmark datasets and high-dimensional synthetic datasets highlight that, MLR based imputation method is more efficient on high-dimensional incomplete datasets.


2021 ◽  
Author(s):  
Yuanjun Li ◽  
Roland Horne ◽  
Ahmed Al Shmakhy ◽  
Tania Felix Menchaca

Abstract The problem of missing data is a frequent occurrence in well production history records. Due to network outage, facility maintenance or equipment failure, the time series production data measured from surface and downhole gauges can be intermittent. The fragmentary data are an obstacle for reservoir management. The incomplete dataset is commonly simplified by omitting all observations with missing values, which will lead to significant information loss. Thus, to fill the missing data gaps, in this study, we developed and tested several missing data imputation approaches using machine learning and deep learning methods. Traditional data imputation methods such as interpolation and counting most frequent values can introduce bias to the data as the correlations between features are not considered. Thus, in this study, we investigated several multivariate imputation algorithms that use the entire set of available data streams to estimate the missing values. The methods use a full suite of well measurements, including wellhead and downhole pressures, oil, water and gas flow rates, surface and downhole temperatures, choke settings, etc. Any parameter that has gaps in its recorded history can be imputed from the other available data streams. The models were tested on both synthetic and real datasets from operating Norwegian and Abu Dhabi reservoirs. Based on the characteristics of the field data, we introduced different types of continuous missing distributions, which are the combinations of single-multiple missing sections in a long-short time span, to the complete dataset. We observed that as the missing time span expands, the stability of the more successful methods can be kept to a threshold of 30% of the entire dataset. In addition, for a single missing section over a shorter period, which could represent a weather perturbation, most methods we tried were able to achieve high imputation accuracy. In the case of multiple missing sections over a longer time span, which is typical of gauge failures, other methods were better candidates to capture the overall correlation in the multivariate dataset. Most missing data problems addressed in our industry focus on single feature imputation. In this study, we developed an efficient procedure that enables fast reconstruction of the entire production dataset with multiple missing sections in different variables. Ultimately, the complete information can support the reservoir history matching process, production allocation, and develop models for reservoir performance prediction.


Author(s):  
C. V. S. R. Syavasya ◽  
M. A. Lakshmi

With the rapid explosion of the data streams from the applications, ensuring accurate data analysis is essential for effective real-time decision making. Nowadays, data stream applications often confront the missing values that affect the performance of the classification models. Several imputation models have adopted the deep learning algorithms for estimating the missing values; however, the lack of parameter and structure tuning in classification, degrade the performance for data imputation. This work presents the missing data imputation model using the adaptive deep incremental learning algorithm for streaming applications. The proposed approach incorporates two main processes: enhancing the deep incremental learning algorithm and enhancing deep incremental learning-based imputation. Initially, the proposed approach focuses on tuning the learning rate with both the Adaptive Moment Estimation (Adam) along with Stochastic Gradient Descent (SGD) optimizers and tuning the hidden neurons. Secondly, the proposed approach applies the enhanced deep incremental learning algorithm to estimate the imputed values in two steps: (i) imputation process to predict the missing values based on the temporal-proximity and (ii) generation of complete IoT dataset by imputing the missing values from both the predicted values. The experimental outcomes illustrate that the proposed imputation model effectively transforms the incomplete dataset into a complete dataset with minimal error.


Author(s):  
Daoming Wan ◽  
Roozbeh Razavi-Far ◽  
Mehrdad Saif ◽  
Niloofar Mozafari

2021 ◽  
Author(s):  
farah adibah adnan ◽  
Khairur Rijal Jamaludin ◽  
Wan Zuki Azman Wan Muhamad ◽  
Suraya Miskon

Abstract Missing value or sometimes synonym as missing data, is an unavoidable issue when collecting data. It is uncontrollable and happen in almost any research fields. Hence, this study focused on identifying the current publications trend on missing data imputation techniques (1991- 2021) specifically in classification problems using bibliometric analysis. Most importantly, this research aims to uncover the potential missing data imputation methods. Two software were used; VOSViewer and Harzing Publish or Perish. Based on the Scopus database extracted in June 2021, the findings indicate an emerging trend in missing data imputation research to date, while there are two imputation methods that get the most attention; the random forest and the nearest neighbor methods.


Sign in / Sign up

Export Citation Format

Share Document