imputation methods
Recently Published Documents





2022 ◽  
Vol 9 (3) ◽  
pp. 0-0

Missing data is a universal complication across most research fields and introduces uncertainty into data analysis. It can arise for many reasons, such as mishandled samples, failure to collect an observation, measurement errors, deletion of aberrant values, or simply an incomplete study. The nutrition field is no exception to the missing-data problem. Most often the problem is handled by substituting means or medians computed from the existing data, an approach that needs improvement. The paper proposes a hybrid scheme combining multiple imputation by chained equations (MICE) and an artificial neural network (ANN), called extended ANN, to locate and analyze missing values and perform imputation on a given dataset. The proposed mechanism efficiently analyzes the blank entries and fills them by examining their neighboring records in order to improve the accuracy of the dataset. To validate the proposed scheme, the extended ANN is compared against various recent algorithms to assess the efficiency and accuracy of the results.
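The contrast between mean substitution and a chained-equation approach can be illustrated with a small sketch. This is not the paper's extended ANN: it omits the neural network and the multiple stochastic draws of real MICE, and it handles only two columns with a plain least-squares line. The function names (`fit_line`, `mean_impute`, `chained_impute`) and the toy data are assumptions made for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b).

    Assumes xs is not constant (otherwise the slope is undefined).
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mean_impute(rows):
    """Baseline: replace each missing entry (None) with its column mean."""
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None)
                 for j in range(n_cols)]
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


def chained_impute(rows, n_iter=25):
    """Simplified, deterministic chained-equation imputation for two columns.

    Start from mean imputation, then repeatedly regress each column on the
    other (using rows where the target is observed) and refresh the
    imputed entries with the fitted values.
    """
    filled = mean_impute(rows)
    for _ in range(n_iter):
        for target, predictor in ((1, 0), (0, 1)):
            observed = [i for i, r in enumerate(rows) if r[target] is not None]
            a, b = fit_line([filled[i][predictor] for i in observed],
                            [filled[i][target] for i in observed])
            for i, r in enumerate(rows):
                if r[target] is None:
                    filled[i][target] = a + b * filled[i][predictor]
    return filled


# Toy data following y = 2x, with one missing value in each column.
rows = [[1, 2], [2, 4], [3, None], [None, 8], [5, 10]]
baseline = mean_impute(rows)
chained = chained_impute(rows)
```

On this toy data the mean baseline fills the missing first-column entry with the column mean, 2.75, whereas the chained version moves it close to 4, the value implied by the linear relationship between the two columns.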

PLoS ONE ◽  
2022 ◽  
Vol 17 (1) ◽  
pp. e0262131
Adil Aslam Mir ◽  
Kimberlee Jane Kearfott ◽  
Fatih Vehbi Çelebi ◽  
Muhammad Rafique

A new methodology, imputation by feature importance (IBFI), is studied that can be applied to any machine learning method to efficiently fill in any missing or irregularly sampled data. It applies to data missing completely at random (MCAR), missing not at random (MNAR), and missing at random (MAR). IBFI utilizes the feature importance and iteratively imputes missing values using any base learning algorithm. For this work, IBFI is tested on soil radon gas concentration (SRGC) data. XGBoost is used as the learning algorithm and missing data are simulated using R for different missingness scenarios. IBFI is based on the physically meaningful assumption that SRGC depends upon environmental parameters such as temperature and relative humidity. This assumption leads to a model obtained from the complete multivariate series where the controls are available by taking the attribute of interest as a response variable. IBFI is tested against other frequently used imputation methods, namely mean, median, mode, predictive mean matching (PMM), and hot-deck procedures. The performance of the different imputation methods was assessed using root mean squared error (RMSE), mean squared log error (MSLE), mean absolute percentage error (MAPE), percent bias (PB), and mean squared error (MSE) statistics. The imputation process requires more attention when multiple variables are missing in different samples, resulting in challenges to machine learning methods because some controls are missing. IBFI appears to have an advantage in such circumstances. IBFI was tested on radon time series (RTS) data collected from 1 March 2017 to 11 May 2018, a period that included four seismic events.
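The error statistics named in this abstract can be written out in a few lines. The sketch below uses common textbook definitions; the exact formulas used in the paper, in particular for percent bias, are not given in the abstract, so the definitions and the function names here are assumptions.

```python
import math


def mse(true, imputed):
    """Mean squared error between true and imputed values."""
    return sum((t - p) ** 2 for t, p in zip(true, imputed)) / len(true)


def rmse(true, imputed):
    """Root mean squared error."""
    return math.sqrt(mse(true, imputed))


def msle(true, imputed):
    """Mean squared log error, using log(1 + x); inputs must be > -1."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(true, imputed)) / len(true)


def mape(true, imputed):
    """Mean absolute percentage error, in percent; true values must be nonzero."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(true, imputed)) / len(true)


def percent_bias(true, imputed):
    """Percent bias (assumed definition): 100 * sum(imputed - true) / sum(true)."""
    return 100 * sum(p - t for t, p in zip(true, imputed)) / sum(true)


# Hypothetical held-out values and the values an imputer produced for them.
true_values = [10, 20, 30, 40]
imputed_values = [12, 18, 33, 40]
scores = {
    "MSE": mse(true_values, imputed_values),
    "RMSE": rmse(true_values, imputed_values),
    "MSLE": msle(true_values, imputed_values),
    "MAPE": mape(true_values, imputed_values),
    "PB": percent_bias(true_values, imputed_values),
}
```

In a simulation study of the kind described, entries are deleted from complete data under a chosen missingness scenario, imputed, and then scored against the held-out true values with statistics such as these.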

Rie Toyomoto ◽  
Satoshi Funada ◽  
Toshi A. Furukawa

Ryan J. Van Lieshout ◽  
Calan Savoy ◽  
Steven Hanna

Mathematics ◽  
2021 ◽  
Vol 9 (24) ◽  
pp. 3252
Encarnación Álvarez-Verdejo ◽  
Pablo J. Moya-Fernández ◽  
Juan F. Muñoz-Rosas

The problem of missing data is a common feature in any study, and a single imputation method is often applied to deal with this problem. The first contribution of this paper is to analyse the empirical performance of some traditional single imputation methods when they are applied to the estimation of the Gini index, a popular measure of inequality used in many studies. Various methods for constructing confidence intervals for the Gini index are also empirically evaluated. We consider several empirical measures to analyse the performance of estimators and confidence intervals, allowing us to quantify the magnitude of the non-response bias problem. We find extremely large biases under certain non-response mechanisms, and this problem gets noticeably worse as the proportion of missing data increases. For a large correlation coefficient between the target and auxiliary variables, the regression imputation method may notably mitigate this bias problem, yielding appropriate mean square errors. We also find that confidence intervals have poor coverage rates when the probability of data being missing is not uniform, and that the regression imputation method substantially improves the handling of this problem as the correlation coefficient increases.
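A small sketch may help show why the choice of single imputation method matters for the Gini index. It assumes a standard sample formula for the Gini index and a simple linear regression on one auxiliary variable; the estimators and confidence intervals studied in the paper may differ, and the data and function names are hypothetical.

```python
def gini(values):
    """Sample Gini index via the sorted-rank formula (non-negative values)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def mean_impute(target):
    """Replace missing target values (None) with the mean of observed ones."""
    observed = [y for y in target if y is not None]
    m = sum(observed) / len(observed)
    return [m if y is None else y for y in target]


def regression_impute(aux, target):
    """Replace missing target values with predictions from target ~ aux.

    The least-squares line is fitted on the complete cases only.
    """
    pairs = [(x, y) for x, y in zip(aux, target) if y is not None]
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    b = sxy / sxx
    a = my - b * mx
    return [a + b * x if y is None else y for x, y in zip(aux, target)]


# Hypothetical data: the target is perfectly correlated with the auxiliary
# variable, and two target values are missing.
aux = [1, 2, 3, 4, 5]
target = [10, 20, None, 40, None]

gini_mean = gini(mean_impute(target))
gini_reg = gini(regression_impute(aux, target))
gini_full = gini([10, 20, 30, 40, 50])  # value with no missing data
```

With this toy data, mean imputation pulls the imputed values toward the centre of the distribution and so understates inequality, while regression imputation recovers the full-data Gini index because the auxiliary variable predicts the target exactly. Real data would show a weaker effect that depends on the correlation, in line with the abstract's findings.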

2021 ◽  
Vol 2021 ◽  
pp. 1-8
Lingju Chen ◽  
Shaoxin Hong ◽  
Bo Tang

We study the identification and estimation of graphical models with nonignorable nonresponse. An observable variable correlated to nonresponse is added to identify the mean of response for the unidentifiable model. An approach to estimating the marginal mean of response is proposed, based on simulation imputation methods which are introduced for a variety of models including linear, generalized linear, and monotone nonlinear models. The proposed mean estimators are √N-consistent, where N is the sample size. Finite sample simulations confirm the effectiveness of the proposed method. Sensitivity analysis for the untestable assumption on our augmented model is also conducted. A real data example is employed to illustrate the use of the proposed methodology.

2021 ◽  
Vol 13 (23) ◽  
pp. 4875
Álvaro Acción ◽  
Francisco Argüello ◽  
Dora B. Heras

Deep Learning (DL) has been recently introduced into the hyperspectral and multispectral image classification landscape. Despite the success of DL in the remote sensing field, DL models are computationally intensive due to the large number of parameters they need to learn. The high density of information present in remote sensing imagery with high spectral resolution can make the application of DL models to large scenes challenging. Methods such as patch-based classification require large amounts of data to be processed during the training and prediction stages, which translates into long processing times and high energy consumption. One of the solutions to decrease the computational cost of these models is to perform segment-based classification. Segment-based classification schemes can significantly decrease training and prediction times, and also offer advantages over simply reducing the size of the training datasets by randomly sampling training data. The lack of a large enough number of samples can, however, pose an additional challenge, causing these models to not generalize properly. Data augmentation methods are used to generate new synthetic samples based on existing data to increase the classification performance. In this work, we propose a new data augmentation scheme using data imputation and matrix completion methods for segment-based classification. The proposal has been validated using two high-resolution multispectral datasets from the literature. The results obtained show that the proposed approach successfully increases the classification performance across all the scenes tested and that data imputation methods applied to multispectral imagery are a valid means to perform data augmentation. A comparison of classification accuracy between different imputation methods applied to the proposed scheme was also carried out.
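The general idea of using imputation for data augmentation can be sketched as follows. This is not the authors' scheme, whose details are not given in the abstract: it simply hides a random fraction of the entries of a segment (pixels by spectral bands) and refills them with the per-band mean of the entries left visible. The function name, the mean-based imputer, and the toy segment are all assumptions.

```python
import random


def augment_by_imputation(segment, mask_fraction=0.3, seed=0):
    """Create a synthetic copy of a segment by masking and re-imputing entries.

    segment: list of pixel spectra, each a list of band values.
    Masked entries are replaced with the mean of the unmasked values in the
    same band (a stand-in for any imputation or matrix completion method).
    """
    rng = random.Random(seed)
    n_rows, n_bands = len(segment), len(segment[0])
    mask = [[rng.random() < mask_fraction for _ in range(n_bands)]
            for _ in range(n_rows)]
    synthetic = [row[:] for row in segment]  # copy so the input is untouched
    for b in range(n_bands):
        kept = [segment[r][b] for r in range(n_rows) if not mask[r][b]]
        if not kept:
            # Every pixel in this band was masked: leave the band unchanged.
            continue
        band_mean = sum(kept) / len(kept)
        for r in range(n_rows):
            if mask[r][b]:
                synthetic[r][b] = band_mean
    return synthetic


# Hypothetical segment of 4 pixels with 3 spectral bands each.
segment = [
    [0.1, 0.5, 0.9],
    [0.2, 0.6, 1.0],
    [0.3, 0.7, 1.1],
    [0.4, 0.8, 1.2],
]
synthetic = augment_by_imputation(segment, mask_fraction=0.3, seed=1)
```

Calling the function with different seeds yields different synthetic samples from the same segment, which is the property an augmentation scheme needs. A stronger imputer could be substituted for the band mean without changing the surrounding logic.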
