Data Imputation
Recently Published Documents


TOTAL DOCUMENTS

676
(FIVE YEARS 325)

H-INDEX

30
(FIVE YEARS 8)

2022 ◽  
Vol 16 (4) ◽  
pp. 1-24
Author(s):  
Kui Yu ◽  
Yajing Yang ◽  
Wei Ding

Causal feature selection aims to learn the Markov blanket (MB) of a class variable for feature selection. The MB of a class variable implies the local causal structure around the class variable, and all other features are probabilistically independent of the class variable conditioned on its MB. This enables causal feature selection to identify potentially causal features for building robust and physically meaningful prediction models. Missing data, ubiquitous in many real-world applications, remains an open research problem in causal feature selection due to its technical complexity. In this article, we propose a novel multiple imputation MB (MimMB) framework for causal feature selection with missing data. MimMB integrates Data Imputation with MB Learning in a unified framework so that the two key components reinforce each other: MB Learning constrains Data Imputation to a potentially causal feature space for accurate imputation, while accurate Data Imputation in turn helps MB Learning identify a reliable MB of the class variable. We further design an enhanced kNN estimator for imputing missing values and use it to instantiate MimMB. In a comprehensive experimental evaluation on synthetic and real-world datasets, our new approach effectively learns the MB of a given variable in a Bayesian network and outperforms rival algorithms.
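The core idea of restricting a kNN imputer to a causally relevant feature subset can be sketched as follows. This is a minimal illustration, not the MimMB algorithm itself: the Markov-blanket indices `mb_features` are a hypothetical output of an MB-learning step, and scikit-learn's generic `KNNImputer` stands in for the paper's enhanced kNN estimator.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[rng.random(X.shape) < 0.1] = np.nan  # introduce ~10% missing values

# Hypothetical Markov-blanket subset for the variable being imputed:
# restricting the imputer to (potentially) causal features mirrors how
# MB Learning constrains Data Imputation in the MimMB framework.
mb_features = [0, 2, 3]
imputer = KNNImputer(n_neighbors=5)
X_mb_imputed = imputer.fit_transform(X[:, mb_features])
```

Imputing within the MB subset keeps neighbour distances focused on features that actually bear on the target variable, rather than diluting them across irrelevant columns.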


2021 ◽  
Author(s):  
Srinath Yelchuri ◽  
A. Rangaraj ◽  
Yu Xie ◽  
Aron Habte ◽  
Mohit Joshi ◽  
...  

2021 ◽  
Author(s):  
Panyi Wei ◽  
Jianling Huang ◽  
Yanyan Chen ◽  
Ronggui Zhou ◽  
Jiyang Sun

2021 ◽  
Author(s):  
Yuanjun Li ◽  
Roland Horne ◽  
Ahmed Al Shmakhy ◽  
Tania Felix Menchaca

Abstract Missing data are a frequent occurrence in well production history records. Due to network outages, facility maintenance, or equipment failures, the time-series production data measured by surface and downhole gauges can be intermittent. These fragmentary data are an obstacle for reservoir management. An incomplete dataset is commonly simplified by omitting all observations with missing values, which leads to significant information loss. Thus, to fill the missing data gaps, in this study we developed and tested several missing data imputation approaches using machine learning and deep learning methods. Traditional imputation methods such as interpolation and most-frequent-value filling can introduce bias because they ignore the correlations between features. We therefore investigated several multivariate imputation algorithms that use the entire set of available data streams to estimate the missing values. The methods use a full suite of well measurements, including wellhead and downhole pressures; oil, water, and gas flow rates; surface and downhole temperatures; choke settings; etc. Any parameter that has gaps in its recorded history can be imputed from the other available data streams. The models were tested on both synthetic and real datasets from operating Norwegian and Abu Dhabi reservoirs. Based on the characteristics of the field data, we introduced different types of continuous missing-value distributions (combinations of single or multiple missing sections over long or short time spans) into the complete dataset. We observed that, as the missing time span expands, the more successful methods remain stable up to a threshold of 30% of the entire dataset. In addition, for a single missing section over a short period, which could represent a weather perturbation, most of the methods we tried achieved high imputation accuracy.
In the case of multiple missing sections over a longer time span, which are typical of gauge failures, other methods were better at capturing the overall correlation in the multivariate dataset. Most missing data work in our industry focuses on single-feature imputation. In this study, we developed an efficient procedure that enables fast reconstruction of an entire production dataset with multiple missing sections in different variables. Ultimately, the complete information can support reservoir history matching, production allocation, and the development of models for reservoir performance prediction.
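The multivariate idea described above, where each gappy stream is estimated from the other available streams, can be sketched with scikit-learn's `IterativeImputer`. The variable names and the synthetic relationships between streams are illustrative only, not the paper's actual field data or chosen algorithm.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 500
t = np.arange(n)
# Synthetic, correlated "well measurement" streams (illustrative only).
whp = 200 + 5 * np.sin(t / 30) + rng.normal(0, 0.5, n)   # wellhead pressure
bhp = 1.5 * whp + rng.normal(0, 1.0, n)                  # downhole pressure
oil_rate = 1000 - 2 * whp + rng.normal(0, 2.0, n)        # oil flow rate
data = np.column_stack([whp, bhp, oil_rate])

# One continuous missing section (e.g., a gauge outage) in one stream.
data_missing = data.copy()
data_missing[100:150, 1] = np.nan

# Multivariate imputation: each column with gaps is modelled as a
# regression on the other columns, iterating until estimates stabilise.
imputer = IterativeImputer(max_iter=10, random_state=0)
data_imputed = imputer.fit_transform(data_missing)
```

Because the downhole pressure is strongly correlated with the other streams, the regression-based fill recovers the gap far better than interpolating the gappy column in isolation would.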


2021 ◽  
Author(s):  
◽  
Baligh Al-Helali

Symbolic regression is the process of constructing mathematical expressions that best fit given data sets, where a target variable is expressed in terms of input variables. Unlike traditional regression methods, which optimise the parameters of pre-defined models, symbolic regression learns both the model structure and its parameters simultaneously.

Genetic programming (GP) is a biologically inspired evolutionary algorithm that automatically generates computer programs to solve a given task. The flexible representation of GP, along with its "white box" nature, makes it a dominant method for symbolic regression. Moreover, GP has been successfully employed for other learning tasks such as feature selection and transfer learning.

Data incompleteness is a pervasive problem in symbolic regression, and in machine learning in general, especially with real-world data sets. One common approach to handling missingness is data imputation: estimating missing values from the existing data. Another is to build learning algorithms that work directly with missing values.

Although a number of methods have been proposed to tackle data missingness in machine learning, most studies focus on classification tasks. Little attention has been paid to symbolic regression on incomplete data; existing symbolic regression methods are applicable only when the given data set is complete.

The overall goal of the thesis is to improve the performance of symbolic regression on incomplete data by using GP for data imputation, instance selection, feature selection, and transfer learning.

This thesis develops an imputation method to handle missing values for symbolic regression. The method integrates the instance-based similarity of the k-nearest neighbour method with the feature-based predictability of GP to estimate the missing values. The results show that the proposed method outperforms existing popular imputation methods.

This thesis develops an instance selection method for improving imputation for symbolic regression on incomplete data. The proposed method builds the imputation and symbolic regression models simultaneously so that both improve. The results show that combining instance selection with imputation outperforms imputation alone.

High dimensionality is a serious data challenge, made even harder by incomplete data. To address this problem in symbolic regression tasks, this thesis develops a feature selection method that can select a good set of features directly from incomplete data. The method not only improves regression accuracy but also enhances the efficiency of symbolic regression on high-dimensional incomplete data.

Another challenging problem is data shortage, which is even more difficult when the data is also incomplete. To handle this situation, this thesis develops transfer learning methods to improve symbolic regression in domains with incomplete and limited data. These methods utilise two powerful abilities of GP: feature construction and feature selection. The results show that these methods achieve positive transfer from domains with complete data to different (but related) domains with incomplete data.

In summary, the thesis develops a range of GP-based methods to improve the effectiveness and efficiency of symbolic regression on incomplete data. The methods are evaluated on different types of data sets under various missingness and learning scenarios.
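The hybrid idea of the thesis's imputation method, instance-based kNN similarity weighted by per-feature predictability, can be sketched in plain NumPy. This is a simplified stand-in: the `weights` vector is a hypothetical per-feature score that the GP-evolved models would supply, and the neighbour aggregation is a plain mean rather than the thesis's actual estimator.

```python
import numpy as np

def weighted_knn_impute(X, weights, k=5):
    # Feature-weighted kNN imputation: distances between rows are computed
    # only over features observed in both rows, scaled by `weights`
    # (standing in for GP-derived feature predictability).
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    for i, j in zip(*np.where(np.isnan(X))):
        diffs = (X - X[i]) ** 2 * weights   # weighted squared differences
        mask = ~np.isnan(diffs)             # features observed in both rows
        mask[:, j] = False                  # exclude the feature being filled
        counts = np.maximum(mask.sum(axis=1), 1)
        d = np.where(mask, diffs, 0.0).sum(axis=1) / counts
        d[i] = np.inf                       # a row is not its own neighbour
        d[np.isnan(X[:, j])] = np.inf       # neighbours must observe feature j
        order = np.argsort(d)
        nbrs = order[np.isfinite(d[order])][:k]
        filled[i, j] = X[nbrs, j].mean() if nbrs.size else col_means[j]
    return filled

# Demo on synthetic data with ~5% missing values.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan
X_filled = weighted_knn_impute(X_demo, weights=np.ones(4), k=5)
```

With uniform weights this reduces to ordinary kNN imputation; uneven weights let a feature-scoring model (GP in the thesis) steer which columns dominate the neighbour search.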


Author(s):  
C. V. S. R. Syavasya ◽  
M. A. Lakshmi

With the rapid growth of data streams from applications, accurate data analysis is essential for effective real-time decision making. Data stream applications often confront missing values that degrade the performance of classification models. Several imputation models have adopted deep learning algorithms to estimate missing values; however, the lack of parameter and structure tuning degrades their imputation performance. This work presents a missing data imputation model using an adaptive deep incremental learning algorithm for streaming applications. The proposed approach incorporates two main processes: enhancing the deep incremental learning algorithm, and deep incremental learning-based imputation. First, the approach tunes the learning rate with both the Adaptive Moment Estimation (Adam) and Stochastic Gradient Descent (SGD) optimizers, and tunes the number of hidden neurons. Second, it applies the enhanced deep incremental learning algorithm to estimate the imputed values in two steps: (i) predicting the missing values based on temporal proximity, and (ii) generating a complete IoT dataset by filling in the predicted values. The experimental outcomes illustrate that the proposed imputation model effectively transforms an incomplete dataset into a complete one with minimal error.
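The incremental pattern described above, where a model is updated batch by batch and used to fill gaps as the stream arrives, can be sketched with scikit-learn's `SGDRegressor` and `partial_fit`. This is a minimal linear sketch, not the paper's deep architecture: the sensor relationship, the `last_seen` temporal-proximity fallback, and the batch sizes are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(3)
model = SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0)

last_seen = 0.0  # temporal-proximity fallback before the model is trained
imputed = []
for _ in range(50):  # 50 streamed batches of 32 readings
    X = rng.normal(size=(32, 2))
    y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 32)  # target sensor
    miss = rng.random(32) < 0.1  # ~10% of target readings are missing
    if miss.any():
        if hasattr(model, "coef_"):           # model has seen data: predict
            y_hat = model.predict(X[miss])
        else:                                  # cold start: temporal proximity
            y_hat = np.full(miss.sum(), last_seen)
        imputed.extend(y_hat)
    if (~miss).any():
        model.partial_fit(X[~miss], y[~miss])  # incremental update
        last_seen = y[~miss][-1]
```

The key property is that imputation quality improves as the stream progresses, since each `partial_fit` call refines the model on newly observed values without revisiting past batches.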


2021 ◽  
Vol 11 (23) ◽  
pp. 11491
Author(s):  
Laura Sofía Hoyos-Gomez ◽  
Belizza Janet Ruiz-Mendoza

Solar irradiance is an available resource that could support electrification in regions with low socio-economic indices, so it is increasingly important to understand its behavior and to have solar irradiance data. Some locations, especially those with low socio-economic populations, have no measured solar irradiance data, and where such data exist they are incomplete. There are different approaches to estimating solar irradiance, from learning models to empirical models. The latter have the advantage of low computational cost, allowing their wide use. Researchers estimate the solar energy resource using information from other meteorological variables, such as temperature. However, there is no broad analysis of these techniques in tropical and mountainous environments. To address this gap, our research analyzes the performance of three well-known empirical temperature-based models (Hargreaves and Samani, Bristow and Campbell, and Okundamiya and Nzeako) and proposes a new one for tropical and mountainous environments. The new empirical technique models daily solar irradiance in some areas better than the other three models. Statistical error comparison allows us to select the best model for each location and determines the data imputation model. The Hargreaves and Samani model performed best in the Pacific zone, with an average RMSE of 936.195 Wh/m²/day, SD of 36.01%, MAE of 748.435 Wh/m²/day, and U95 of 1,836.325 Wh/m²/day. The new proposed model performed best in the Andean and Amazon zones, with an average RMSE of 1,032.99 Wh/m²/day, SD of 34.455 Wh/m²/day, MAE of 825.46 Wh/m²/day, and U95 of 2,025.84 Wh/m²/day. Another result was a linear relationship between the new empirical model's constants and altitude up to 2500 m.a.s.l. (meters above sea level).
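For reference, the Hargreaves and Samani model compared above estimates daily solar irradiance from the daily temperature range alone. A minimal implementation follows; the input values are illustrative, not the paper's data, and the adjustment coefficient `k_rs` is the commonly cited FAO-56 default.

```python
import math

def hargreaves_samani(ra, t_max, t_min, k_rs=0.16):
    """Estimate daily solar irradiance Rs = k_rs * sqrt(Tmax - Tmin) * Ra.

    `ra` is the extraterrestrial radiation for the day and latitude (Rs is
    returned in the same units as `ra`); temperatures are in deg C.
    k_rs ~ 0.16 for interior sites and ~ 0.19 for coastal sites (FAO-56).
    """
    return k_rs * math.sqrt(t_max - t_min) * ra

# Illustrative values: Ra in Wh/m²/day, a 12 °C diurnal temperature range.
rs = hargreaves_samani(ra=10_000.0, t_max=28.0, t_min=16.0)
```

The appeal of this class of model, as the abstract notes, is that it needs only routinely measured temperature data, which is why its constants can be re-fitted per region (the paper's proposed model plays this role for tropical and mountainous zones).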

