Causal Feature Selection with Missing Data

2022 ◽  
Vol 16 (4) ◽  
pp. 1-24
Author(s):  
Kui Yu ◽  
Yajing Yang ◽  
Wei Ding

Causal feature selection aims at learning the Markov blanket (MB) of a class variable for feature selection. The MB of a class variable implies the local causal structure among the class variable and its MB, and all other features are probabilistically independent of the class variable conditional on its MB. This enables causal feature selection to identify potentially causal features for building robust and physically meaningful prediction models. Missing data, ubiquitous in many real-world applications, remain an open research problem in causal feature selection due to their technical complexity. In this article, we propose a novel multiple imputation MB (MimMB) framework for causal feature selection with missing data. MimMB integrates Data Imputation with MB Learning in a unified framework so that the two key components engage with each other: MB Learning enables Data Imputation in a potentially causal feature space for accurate data imputation, while accurate Data Imputation in turn helps MB Learning identify a reliable MB of the class variable. We further design an enhanced kNN estimator for imputing missing values and use it to instantiate MimMB. In our comprehensive experimental evaluation, the new approach effectively learns the MB of a given variable in a Bayesian network and outperforms rival algorithms on synthetic and real-world datasets.
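The kNN imputation step that MimMB builds on can be sketched in a few lines. This is a plain, generic kNN imputer for illustration only, not the authors' enhanced estimator: distances are taken over jointly observed features, and each missing cell is filled with the mean of its k nearest donors.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs in each row using the k nearest rows (a generic kNN
    imputer, not the enhanced estimator described in the paper).

    Distances are computed over the features both rows observe; a missing
    entry is replaced by the mean of that feature over the k closest rows
    that observe it."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        # distance to every other row over the jointly observed features
        dists = []
        for j, other in enumerate(X):
            if j == i:
                continue
            shared = obs & ~np.isnan(other)
            if not shared.any():
                continue
            d = np.sqrt(np.mean((row[shared] - other[shared]) ** 2))
            dists.append((d, j))
        dists.sort()
        for col in np.where(miss)[0]:
            donors = [X[j, col] for _, j in dists if not np.isnan(X[j, col])][:k]
            if donors:
                filled[i, col] = np.mean(donors)
    return filled
```

The framework's point is that this estimator runs over the (potentially causal) MB feature space rather than all features, which is what couples the imputation to the MB learning.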

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Nishith Kumar ◽  
Md. Aminul Hoque ◽  
Masahiro Sugimoto

Abstract Mass spectrometry is a modern, sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers, which arise from several sources, both technical and biological. Although several missing data imputation techniques are described in the literature, the existing conventional techniques address only the missing-value problem; they do not relieve the problem of outliers, so outliers in the dataset decrease the accuracy of the imputation. We developed a new kernel weight function-based missing data imputation technique that resolves the problems of both missing values and outliers. We evaluated the performance of the proposed method and of other conventional and recently developed imputation techniques using both artificially generated and experimentally measured data, in both the absence and presence of different rates of outliers. Performance on both artificial and real metabolomics data indicates the superiority of the proposed kernel weight-based technique over the existing alternatives. For user convenience, an R package implementing the proposed technique was developed, which is available at https://github.com/NishithPaul/tWLSA.
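The general idea of kernel-weighted, outlier-resistant imputation can be illustrated in a few lines. This is a minimal sketch of the principle, not the tWLSA implementation: observed values far from a robust center get small Gaussian-kernel weights, so outliers barely influence the imputed value.

```python
import numpy as np

def kernel_weighted_impute(X, bandwidth_scale=1.4826):
    """Impute each missing cell with a kernel-weighted column mean (an
    illustrative sketch, not the tWLSA algorithm).

    Observed values far from the column median (likely outliers) receive
    small Gaussian-kernel weights. The bandwidth is the median absolute
    deviation (MAD), scaled to be consistent with the standard deviation
    under normality."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for col in range(X.shape[1]):
        v = X[:, col]
        obs = ~np.isnan(v)
        vals = v[obs]
        med = np.median(vals)
        mad = np.median(np.abs(vals - med))
        h = bandwidth_scale * mad if mad > 0 else 1.0
        w = np.exp(-0.5 * ((vals - med) / h) ** 2)  # Gaussian kernel weights
        out[~obs, col] = np.sum(w * vals) / np.sum(w)
    return out
```

A mean-based imputer would be dragged toward any large outlier in the column; here the outlier's kernel weight is effectively zero.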


Hydrology ◽  
2018 ◽  
Vol 5 (4) ◽  
pp. 63 ◽  
Author(s):  
Benjamin Nelsen ◽  
D. Williams ◽  
Gustavious Williams ◽  
Candace Berrett

Complete and accurate data are necessary for analyzing and understanding trends in time-series datasets; however, many of the available time-series datasets have gaps that affect the analysis, especially in the earth sciences. As most available data have missing values, researchers use various interpolation methods or ad hoc approaches to data imputation. Since analysis based on inaccurate data can lead to inaccurate conclusions, more accurate data imputation methods can provide more accurate analysis. We present a spatial-temporal data imputation method using Empirical Mode Decomposition (EMD) based on spatial correlations. We call this method EMD-spatial data imputation, or EMD-SDI. Though the method is applicable to other time-series datasets, here we demonstrate it using temperature data. The EMD algorithm decomposes data into periodic components called intrinsic mode functions (IMFs) and exactly reconstructs the original signal by summing these IMFs. EMD-SDI initially decomposes the data from the target station and other stations in the region into IMFs. It then evaluates each IMF from the target station in turn and selects the IMF from the other stations in the region whose periodic behavior is most correlated with the target IMF. EMD-SDI then replaces a section of missing data in the target-station IMF with the corresponding section from the most closely correlated regional IMF. We found that EMD-SDI selects the IMFs used for reconstruction from different stations throughout the region, not necessarily the geographically closest station. EMD-SDI accurately filled data gaps from 3 months to 5 years in length in our tests and compares favorably to a simple temporal method. It leverages regional correlation and the fact that different stations can be subject to different periodic behaviors. In addition to data imputation, the EMD-SDI method provides IMFs that can be used to better understand regional correlations and processes.
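The donor-selection and gap-filling step of EMD-SDI can be sketched as follows, assuming the IMFs have already been computed (e.g., with an EMD library). The function name and interface are assumptions for illustration: the donor IMF is chosen by correlation with the target over the observed samples, and its values are copied into the gap.

```python
import numpy as np

def fill_gap_from_best_imf(target_imf, candidate_imfs, gap):
    """Fill a gap in one target-station IMF from regional-station IMFs
    (a simplified EMD-SDI step; decomposing each station's record into
    IMFs is assumed to have been done beforehand).

    `gap` is a boolean mask of missing samples. The donor is the candidate
    IMF with the highest Pearson correlation to the target over the
    observed samples; its values are copied into the gap."""
    target_imf = np.asarray(target_imf, dtype=float)
    obs = ~gap
    best, best_r = None, -np.inf
    for cand in candidate_imfs:
        cand = np.asarray(cand, dtype=float)
        r = np.corrcoef(target_imf[obs], cand[obs])[0, 1]
        if r > best_r:
            best, best_r = cand, r
    filled = target_imf.copy()
    filled[gap] = best[gap]
    return filled, best_r
```

Running this for every IMF of the target station and summing the filled IMFs reconstructs the complete series, which is the exact-reconstruction property of EMD the method relies on.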


2020 ◽  
Vol 69 ◽  
pp. 1255-1285
Author(s):  
Ricardo Cardoso Pereira ◽  
Miriam Seoane Santos ◽  
Pedro Pereira Rodrigues ◽  
Pedro Henriques Abreu

Missing data is a problem often found in real-world datasets, and it can degrade the performance of most machine learning models. Several deep learning techniques have been used to address this issue, among them the Autoencoder and its Denoising and Variational variants. These models are able to learn a representation of the data with missing values and generate plausible new values to replace them. This study surveys the use of Autoencoders for the imputation of tabular data, covering 26 works published between 2014 and 2020. The analysis focuses on patterns and recommendations for the architecture, hyperparameters, and training settings of the network, while providing a detailed discussion of the results obtained by Autoencoders in comparison with other state-of-the-art methods, and of the data contexts where they have been applied. The conclusions include a set of recommendations for the technical settings of the network and show that Denoising Autoencoders outperform their competitors, particularly the often-used statistical methods.
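The core Denoising Autoencoder idea the surveyed works build on can be sketched with a deliberately tiny numpy network (an illustrative toy with assumed layer sizes and hyperparameters, not any specific architecture from the survey): corrupt inputs with masking noise, train the network to reconstruct the clean row, then use the reconstruction to fill missing entries.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_denoising_ae(X, hidden=8, epochs=200, lr=0.1, mask_prob=0.2):
    """Train a one-hidden-layer denoising autoencoder on complete data.

    Inputs are corrupted by randomly zeroing entries (masking noise, which
    mimics missingness); the network learns to reconstruct the clean row.
    Returns the weights and the training-loss trace."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        noisy = X * (rng.random(X.shape) > mask_prob)  # masking noise
        H = np.tanh(noisy @ W1 + b1)                   # encoder
        out = H @ W2 + b2                              # linear decoder
        err = out - X
        losses.append(float(np.mean(err ** 2)))
        # backpropagation of the reconstruction error
        dW2 = H.T @ err / n; db2 = err.mean(0)
        dH = err @ W2.T * (1 - H ** 2)
        dW1 = noisy.T @ dH / n; db1 = dH.mean(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return (W1, b1, W2, b2), losses

def impute_with_ae(params, row_with_nan):
    """Replace NaNs by the autoencoder's reconstruction of the row."""
    W1, b1, W2, b2 = params
    x = np.nan_to_num(row_with_nan)  # treat missing entries as masked (zero)
    recon = np.tanh(x @ W1 + b1) @ W2 + b2
    filled = row_with_nan.copy()
    filled[np.isnan(row_with_nan)] = recon[np.isnan(row_with_nan)]
    return filled
```

The training-time masking noise is exactly what distinguishes the Denoising variant: the model sees missingness during training, which is why the survey finds it well suited to imputation.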


Author(s):  
Thelma Dede Baddoo ◽  
Zhijia Li ◽  
Samuel Nii Odai ◽  
Kenneth Rodolphe Chabi Boni ◽  
Isaac Kwesi Nooni ◽  
...  

Reconstructing missing streamflow data can be challenging when additional data are not available, and studies investigating how to ascertain the accuracy of imputation algorithms on real-world datasets are lacking. This study investigated the complexity of missing data reconstruction schemes needed to obtain relevant results for a real-world single-station streamflow record, to facilitate its further use. The investigation applied different imputation schemes, spanning univariate algorithms to multiple imputation methods suited to multivariate data, with time taken as an explicit variable. The accuracy of these schemes was assessed using the total error measurement (TEM) and a localized error measurement (LEM) recommended in this study. The results show that univariate missing value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but those that provide the best results are usually time- and computationally intensive. Multiple imputation algorithms that consider the surrounding observed values and/or can capture the characteristics of the data provide results similar to the univariate algorithms and, in some cases, perform better without the added time and computational costs when time is taken as an explicit variable. Furthermore, the LEM is especially useful when the missing data fall in specific portions of the dataset or where very large gaps of missingness occur. Finally, proper handling of missing values in real-world hydroclimatic datasets depends on an extensive study of the particular dataset to be imputed.
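The contrast between a total and a localized error measurement can be made concrete with a short sketch. The paper's exact TEM/LEM formulas are not reproduced here; this illustrates only the general idea that scoring each contiguous gap separately prevents one badly imputed large gap from hiding inside a global average.

```python
import numpy as np

def total_error(truth, imputed, missing_mask):
    """RMSE over every imputed point (a total-error style measurement)."""
    diff = truth[missing_mask] - imputed[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def localized_errors(truth, imputed, missing_mask):
    """Per-gap RMSE: score each contiguous run of missing values
    separately (the general idea behind a localized error measurement;
    not the paper's exact LEM definition)."""
    idx = np.where(missing_mask)[0]
    gaps, start = [], idx[0]
    for a, b in zip(idx[:-1], idx[1:]):
        if b != a + 1:                 # run of missing values ended
            gaps.append((start, a))
            start = b
    gaps.append((start, idx[-1]))
    return {g: float(np.sqrt(np.mean(
                (truth[g[0]:g[1] + 1] - imputed[g[0]:g[1] + 1]) ** 2)))
            for g in gaps}
```

With many small, well-imputed gaps and one large, poorly imputed one, the total error can still look acceptable, while the per-gap scores immediately expose the problem section.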


Author(s):  
Juheng Zhang ◽  
Xiaoping Liu ◽  
Xiao-Bai Li

We study strategically missing data problems in predictive analytics with regression. In many real-world situations, such as financial reporting, college admission, job application, and marketing advertisement, data providers often conceal certain information on purpose in order to gain a favorable outcome. It is important for the decision-maker to have a mechanism to deal with such strategic behaviors. We propose a novel approach to handling strategically missing data in regression prediction. The proposed method derives imputation values for strategically missing data based on Support Vector Regression models, and it provides incentives for data providers to disclose their true information. We show that, under some reasonable conditions, the proposed method minimizes imputation errors for the missing values. An experimental study on real-world data demonstrates the effectiveness of the proposed approach.
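The basic regression-imputation step can be sketched as follows. The paper fits Support Vector Regression models with an incentive-compatible mechanism; here ordinary least squares is used as a lightweight stand-in purely for illustration, and the mechanism itself is not modeled.

```python
import numpy as np

def regression_impute(X, target_col):
    """Impute one column from the others with a regression fit on
    complete rows (OLS as a stand-in for the paper's SVR models).

    Assumes missing values occur only in `target_col`, so the predictor
    columns of incomplete rows are observed."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X[:, target_col])
    others = [c for c in range(X.shape[1]) if c != target_col]
    complete = ~np.isnan(X).any(axis=1)
    # fit target ~ other columns + intercept on the complete rows
    A = np.column_stack([X[complete][:, others], np.ones(complete.sum())])
    coef, *_ = np.linalg.lstsq(A, X[complete, target_col], rcond=None)
    filled = X.copy()
    pred_A = np.column_stack([X[miss][:, others], np.ones(miss.sum())])
    filled[miss, target_col] = pred_A @ coef
    return filled
```

The strategic twist in the paper is that, because concealed values are imputed from the provider's other disclosed attributes, withholding information no longer guarantees a favorable imputed value, which is what creates the incentive to disclose.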


2015 ◽  
Vol 2015 ◽  
pp. 1-14 ◽  
Author(s):  
Jaemun Sim ◽  
Jonathan Sangyun Lee ◽  
Ohbyung Kwon

In a ubiquitous environment, high-accuracy data analysis is essential because it affects real-world decision-making. However, in the real world, user-related data from information systems are often missing due to users’ concerns about privacy or lack of obligation to provide complete data. This data incompleteness can impair the accuracy of data analysis using classification algorithms, which can degrade the value of the data. Many studies have attempted to overcome these data incompleteness issues and to improve the quality of data analysis using classification algorithms. The performance of classification algorithms may be affected by the characteristics and patterns of the missing data, such as the ratio of missing data to complete data. We perform a concrete causal analysis of differences in performance of classification algorithms based on various factors. The characteristics of missing values, datasets, and imputation methods are examined. We also propose imputation and classification algorithms appropriate to different datasets and circumstances.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Shahidul Islam Khan ◽  
Abu Sayed Md Latiful Hoque

Abstract In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values can lead to wrong predictions. In this era of big data, when a massive volume of data is generated every second and utilization of these data is a major concern to stakeholders, efficiently handling missing values becomes even more important. In this paper, we propose a new technique for missing data imputation: a hybrid of single and multiple imputation techniques. We propose an extension of the popular Multivariate Imputation by Chained Equations (MICE) algorithm in two variations, to impute categorical and numeric data. We also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from different hospitals and diagnostic centers in Bangladesh, maintaining the privacy of the data, along with three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of our proposed algorithms with the existing algorithms on these datasets. Experimental results show that our proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% less error for numeric data imputation than its competitors, with similar execution time.
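The chained-equations idea that MICE-style methods share can be sketched for the numeric case. This is a minimal illustration of the base algorithm the paper extends, not the proposed hybrid: initialize missing cells with column means, then repeatedly regress each incomplete column on the others and refresh its missing entries with the predictions.

```python
import numpy as np

def mice_numeric(X, n_iter=10):
    """Chained-equations imputation for numeric data (a minimal sketch of
    the MICE idea, not the paper's extended algorithm)."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = X.copy()
    # step 1: crude initialization with column means
    col_means = np.nanmean(X, axis=0)
    for c in range(X.shape[1]):
        filled[miss[:, c], c] = col_means[c]
    # step 2: cycle through incomplete columns, refining the fills
    for _ in range(n_iter):
        for c in range(X.shape[1]):
            if not miss[:, c].any():
                continue
            others = [j for j in range(X.shape[1]) if j != c]
            A = np.column_stack([filled[:, others], np.ones(len(filled))])
            rows = ~miss[:, c]               # fit only on observed targets
            coef, *_ = np.linalg.lstsq(A[rows], filled[rows, c], rcond=None)
            filled[miss[:, c], c] = A[miss[:, c]] @ coef
    return filled
```

Each pass uses the current fills of the other columns as predictors, so correlated columns progressively correct one another; full MICE additionally draws multiple imputed datasets to reflect imputation uncertainty.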


Author(s):  
Yuyu Yin ◽  
Song Aihua ◽  
Gao Min ◽  
Xu Yueshen ◽  
Wang Shuoping

Web service recommendation is one of the key problems in service computing, especially in the case of a large number of service candidates. QoS (quality of service) values are usually leveraged to recommend the services that best satisfy a user's demand. Many existing methods use collaborative filtering (CF) to predict missing QoS values, but few works leverage the network location information on the user side and the service side. In real-world service invocation scenarios, the network location of a user or a service has a great impact on QoS. In this paper, we propose a novel collaborative recommendation framework containing three prediction models, based on two techniques: matrix factorization (MF) and network location-aware neighbor selection. We first propose two individual models that use the user and service information, respectively. We then propose a unified model that combines the results of the two individual models. We conduct extensive experiments on a real-world dataset. The experimental results demonstrate that our models achieve higher prediction accuracy than baseline models and are not sensitive to the parameters.


2021 ◽  
Author(s):  
Nwamaka Okafor ◽  
Declan Delaney

IoT sensors are becoming an increasingly important supplement to traditional monitoring systems, particularly for in-situ monitoring. However, data collection based on IoT sensors is often plagued by missing values, usually occurring as a result of sensor faults, network failures, drifts, and other operational issues.


2021 ◽  
Author(s):  
Yuanjun Li ◽  
Roland Horne ◽  
Ahmed Al Shmakhy ◽  
Tania Felix Menchaca

Abstract The problem of missing data is a frequent occurrence in well production history records. Due to network outages, facility maintenance, or equipment failure, the time-series production data measured by surface and downhole gauges can be intermittent. These fragmentary data are an obstacle for reservoir management. The incomplete dataset is commonly simplified by omitting all observations with missing values, which leads to significant information loss. Thus, to fill the missing data gaps, in this study we developed and tested several missing data imputation approaches using machine learning and deep learning methods. Traditional data imputation methods, such as interpolation and imputing the most frequent value, can introduce bias into the data because the correlations between features are not considered. We therefore investigated several multivariate imputation algorithms that use the entire set of available data streams to estimate the missing values. The methods use a full suite of well measurements, including wellhead and downhole pressures; oil, water, and gas flow rates; surface and downhole temperatures; choke settings; etc. Any parameter that has gaps in its recorded history can be imputed from the other available data streams. The models were tested on both synthetic and real datasets from operating Norwegian and Abu Dhabi reservoirs. Based on the characteristics of the field data, we introduced different types of continuous missing distributions, combinations of single or multiple missing sections over long or short time spans, into the complete dataset. We observed that, as the missing time span expands, the more successful methods remain stable up to a threshold of 30% of the entire dataset missing. In addition, for a single missing section over a shorter period, which could represent a weather perturbation, most methods we tried were able to achieve high imputation accuracy.
In the case of multiple missing sections over a longer time span, which is typical of gauge failures, other methods were better candidates to capture the overall correlation in the multivariate dataset. Most missing data problems addressed in our industry focus on single feature imputation. In this study, we developed an efficient procedure that enables fast reconstruction of the entire production dataset with multiple missing sections in different variables. Ultimately, the complete information can support the reservoir history matching process, production allocation, and develop models for reservoir performance prediction.
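The evaluation setup described, introducing continuous missing sections into a complete dataset, can be sketched as follows. The exact placement scheme used in the study is not stated, so uniform random, non-overlapping placement is an assumption here.

```python
import numpy as np

def mask_continuous_sections(series, n_sections, section_len, seed=0):
    """Insert contiguous runs of NaN into a complete series to emulate
    gauge outages (single or multiple missing sections, as in the study's
    test patterns; random non-overlapping placement is an assumption).

    Assumes the series is long enough to host the requested sections."""
    rng = np.random.default_rng(seed)
    out = np.asarray(series, dtype=float).copy()
    taken = np.zeros(len(out), dtype=bool)
    placed = 0
    while placed < n_sections:
        start = rng.integers(0, len(out) - section_len + 1)
        if taken[start:start + section_len].any():
            continue                     # re-draw on overlap
        out[start:start + section_len] = np.nan
        taken[start:start + section_len] = True
        placed += 1
    return out
```

Varying `n_sections` and `section_len` reproduces the single-versus-multiple and short-versus-long span combinations against which the imputation methods were compared.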

