A Novel Imputation Approach for Sharing Protected Public Health Data

2021 ◽  
pp. e1-e9
Author(s):  
Elizabeth A. Erdman ◽  
Leonard D. Young ◽  
Dana L. Bernson ◽  
Cici Bauer ◽  
Kenneth Chui ◽  
...  

Objectives. To develop an imputation method to produce estimates for suppressed values within a shared government administrative data set to facilitate accurate data sharing and statistical and spatial analyses. Methods. We developed an imputation approach that incorporated known features of suppressed Massachusetts surveillance data from 2011 to 2017 to predict missing values more precisely. Our methods for 35 de-identified opioid prescription data sets combined modified previous or next substitution followed by mean imputation and a count adjustment to estimate suppressed values before sharing. We modeled 4 methods and compared the results to baseline mean imputation. Results. We assessed performance by comparing root mean squared error (RMSE), mean absolute error (MAE), and proportional variance between imputed and suppressed values. Our method outperformed mean imputation; we retained 46% of the suppressed value’s proportional variance with better precision (22% lower RMSE and 26% lower MAE) than simple mean imputation. Conclusions. Our easy-to-implement imputation technique largely overcomes the adverse effects of low count value suppression with superior results to simple mean imputation. This novel method is generalizable to researchers sharing protected public health surveillance data. (Am J Public Health. Published online ahead of print September 16, 2021: e1–e9. https://doi.org/10.2105/AJPH.2021.306432 )
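A minimal sketch of this style of suppressed-count imputation, assuming a one-dimensional count series in which values below a suppression threshold are masked as missing; the threshold, data, and capping rule here are illustrative stand-ins, not the authors' exact procedure.

```python
import numpy as np
import pandas as pd

THRESHOLD = 10  # assumed suppression cutoff, not taken from the paper
counts = pd.Series([23, np.nan, 15, np.nan, np.nan, 31, 12, np.nan, 44])

# Step 1: modified previous/next substitution -- average the nearest
# observed neighbours on either side of each suppressed value.
neighbour_est = (counts.ffill() + counts.bfill()) / 2

# Step 2: fall back to mean imputation where no neighbour estimate exists.
est = neighbour_est.fillna(counts.mean())

# Step 3: count adjustment -- suppressed values are known to lie below
# the threshold, so cap the estimates at THRESHOLD - 1.
imputed = counts.copy()
mask = counts.isna()
imputed[mask] = est[mask].clip(upper=THRESHOLD - 1)
print(imputed)
```

The capping step is what distinguishes this setting from generic imputation: suppression tells us the missing value is small, and the adjustment keeps estimates consistent with that constraint.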

2021 ◽  
Vol 20 ◽  
pp. 415-430
Author(s):  
Juthaphorn Sinsomboonthong ◽  
Saichon Sinsomboonthong

A proposed estimator, the weighted maximum likelihood (WML) correlation coefficient, is presented for measuring the relationship between two variables in the presence of missing values and outliers. The estimator is derived by applying the conditional probability function to handle missing values and to give greater weight to values near the centre of the data, while outliers are assigned only a slight weight. These techniques make the proposed method robust when the preliminary assumptions of the analysis are not met. To assess the quality of the proposed estimator, six methods (WML, Pearson, median, percentage bend, biweight mid-, and composite correlation coefficients) are compared on two criteria, bias and mean squared error, via a simulation study. The results on generated data show that the WML estimator performs best at withstanding missing values and outliers, especially for tiny sample sizes and large percentages of outliers, regardless of the missing-data level. For large sample sizes, however, the median correlation coefficient appears to be a good estimator when the linear relationship between the two variables is approximately above 0.4, irrespective of outlier and missing-data levels.
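A hedged illustration of the weighting idea, not the authors' WML derivation: a weighted Pearson-style correlation in which points far from the centre receive small weights, so outliers contribute little. The distance-based weighting scheme is an assumption chosen for illustration.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted Pearson-style correlation for nonnegative weights w."""
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    sx = np.sqrt(np.sum(w * (x - mx) ** 2))
    sy = np.sqrt(np.sum(w * (y - my) ** 2))
    return cov / (sx * sy)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)
x[:5] += 8  # inject a few outliers

# Illustrative weights: downweight points far from the coordinate-wise median.
d = np.abs(x - np.median(x)) + np.abs(y - np.median(y))
w = 1.0 / (1.0 + d)

print(weighted_corr(x, y, np.ones_like(x)))  # ordinary Pearson, pulled by outliers
print(weighted_corr(x, y, w))                # outlier-resistant weighted variant
```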


2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Waqed H. Hassan ◽  
Halah K. Jalal

Local scouring around bridge piers is one of the major causes of bridge failure, potentially resulting in heavy losses in terms of both the economy and human life. Accurate prediction of local scour depth is a difficult task, however, because many factors contribute to this process. The main aim of this study is thus to offer a new formula for predicting the local scour depth around a bridge pier using a modern soft-computing modelling technique known as gene expression programming (GEP), with data obtained from numerical simulations used to compare GEP performance with that of a standard non-linear regression (NLR) model. The better technique for predicting the local scour depth is then determined based on three statistical parameters: the coefficient of determination (R2), mean absolute error (MAE), and root mean squared error (RMSE). A total data set of 243 measurements, obtained by numerical simulation in Flow-3D and covering flow intensity, pier width ratio, flow depth ratio, pier Froude number, and pier shape factor, is divided into training and validation (testing) datasets. The results suggest that the formula from the GEP model predicts the local scour depth better than conventional regression with the NLR model, achieving R2 = 0.901, MAE = 0.111, and RMSE = 0.142. The sensitivity analysis further suggests that the flow depth ratio has the greatest impact on the prediction of local scour depth compared with the other input parameters. The formula obtained from the GEP model thus gives the best prediction of scour depth and, in addition, GEP offers the special feature of providing explicit, compact arithmetical terms that allow direct calculation of the scour depth.
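A brief sketch of the model-comparison step described above, assuming predictions from the two fitted models are available as arrays; the observed depths and predictions here are placeholder values.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Hypothetical observed scour depths and predictions from the two models.
y_obs = np.array([0.42, 0.55, 0.61, 0.38, 0.70, 0.49])
y_gep = np.array([0.44, 0.52, 0.63, 0.40, 0.66, 0.51])   # GEP predictions
y_nlr = np.array([0.50, 0.47, 0.70, 0.33, 0.60, 0.58])   # NLR predictions

for name, y_hat in [("GEP", y_gep), ("NLR", y_nlr)]:
    r2 = r2_score(y_obs, y_hat)
    mae = mean_absolute_error(y_obs, y_hat)
    rmse = np.sqrt(mean_squared_error(y_obs, y_hat))
    print(f"{name}: R2={r2:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```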


2015 ◽  
Vol 754-755 ◽  
pp. 923-932 ◽  
Author(s):  
Norazian Mohamed Noor ◽  
A.S. Yahaya ◽  
N.A. Ramli ◽  
Mohd Mustafa Al Bakri Abdullah

Hourly PM10 concentrations measured at eight monitoring stations in peninsular Malaysia in 2006 were used to simulate missing data. The gap lengths of the simulated missing values were limited to 12 hours, since the actual pattern of missingness is typically short. Two percentages of simulated missing gaps were generated, 5% and 15%. Eight single imputation methods (linear interpolation (LI), nearest neighbour interpolation (NN), mean above below (MAB), daily mean (DM), mean 12-hour (12M), mean 6-hour (6M), row mean (RM), and previous year (PY)) were applied to fill in the simulated missing data. In addition, multiple imputation (MI) was conducted for comparison with the single imputation methods. Performance was evaluated using four statistical criteria, namely mean absolute error, root mean squared error, prediction accuracy, and index of agreement. The results show that 6M performs comparably well to LI, indicating that a smaller averaging window gives better prediction. The other single imputation methods also predict the missing data well, except for PY. RM and MI perform moderately, with performance increasing at the higher fraction of missing gaps, whereas PY is the worst method for both simulated missing-data percentages.
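A compact sketch of two of the single imputation methods named above, linear interpolation (LI) and a mean-above-below style fill (MAB), using pandas; the series and the 12-hour cap are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly PM10 series with a short simulated gap.
pm10 = pd.Series([48.0, 52.0, np.nan, np.nan, np.nan, 61.0, 58.0])

# Linear interpolation (LI), capped at 12 consecutive missing hours.
li = pm10.interpolate(method="linear", limit=12)

# Mean above/below (MAB): average of the nearest observed values
# before and after each gap.
mab = (pm10.ffill() + pm10.bfill()) / 2

print(pd.DataFrame({"raw": pm10, "LI": li, "MAB": mab}))
```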


Prediction of client behaviour and feedback remains a challenging task for manufacturing companies. Companies struggle to increase their profit and annual turnover because customer likes and dislikes cannot be predicted exactly, which motivates the application of machine learning algorithms to predict customer demand. This paper attempts to identify the important features of the wine data set from the UCI Machine Learning Repository for predicting the customer segment. Feature importances are extracted using several ensemble methods: AdaBoost regressor, AdaBoost classifier, random forest regressor, extra trees regressor, and gradient boosting regressor. The features selected by each ensemble method are then fitted with logistic regression to analyse performance; the same selected features are also subjected to feature scaling before fitting logistic regression, for comparison. Performance is analysed using the metrics mean squared error (MSE), mean absolute error (MAE), R2 score, explained variance score (EVS), and mean squared log error (MSLE). Experimental results show that, after applying feature scaling, the feature importances extracted from the extra trees regressor are the most effective, with an MSE of 0.04, MAE of 0.03, R2 score of 94%, EVS of 0.9, and MSLE of 0.01, compared with the other ensemble methods.
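A minimal sketch of the pipeline described above, assuming the UCI wine data via scikit-learn's bundled copy; the importance cutoff (keeping features above the mean importance) is an illustrative choice, not the paper's rule.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Rank features by extra-trees importance and keep the strongest ones.
et = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, y)
keep = et.feature_importances_ >= et.feature_importances_.mean()  # illustrative cutoff
X_sel = X[:, keep]

# Scale the selected features, then fit logistic regression on them.
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
print("accuracy:", clf.score(scaler.transform(X_te), y_te))
```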


Author(s):  
S W J Nijman ◽  
J Hoogland ◽  
T K J Groenhof ◽  
M Brandjes ◽  
J J L Jacobs ◽  
...  

Introduction. Use of prediction models is widely recommended by clinical guidelines, but it usually requires complete information on all predictors, which is not always available in daily practice. Methods. We describe two methods for real-time handling of missing predictor values when using prediction models in practice. We compare the widely used method of mean imputation (M-imp) to a method that personalizes the imputations by taking advantage of the observed patient characteristics, which may include both prediction model variables and other characteristics (auxiliary variables). The method was implemented using imputation from a joint multivariate normal model of the patient characteristics (joint modeling imputation; JMI). Data from two different cardiovascular cohorts with cardiovascular predictors and outcome were used to evaluate the real-time imputation methods. We quantified the prediction model's overall performance (mean squared error (MSE) of the linear predictor), discrimination (c-index), calibration (intercept and slope), and net benefit (decision curve analysis). Results. When compared with mean imputation, JMI substantially improved the MSE (0.10 vs. 0.13), c-index (0.70 vs. 0.68), and calibration (calibration-in-the-large: 0.04 vs. 0.06; calibration slope: 1.01 vs. 0.92), especially when incorporating auxiliary variables. When the imputation method was based on an external cohort, calibration deteriorated, but discrimination remained similar. Conclusions. We recommend JMI with auxiliary variables for real-time imputation of missing values, and updating of imputation models when implementing them in new settings or (sub)populations.
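A hedged sketch of the joint modeling idea: under a multivariate normal model fitted on a reference cohort, a new patient's missing predictor is imputed in real time by its conditional mean given the observed values. The cohort, covariance, and patient vector below are synthetic placeholders.

```python
import numpy as np

def conditional_mean_impute(x, mu, sigma):
    """Impute NaNs in x by their conditional mean under N(mu, sigma)."""
    mis = np.isnan(x)
    if not mis.any():
        return x
    obs = ~mis
    # x_mis | x_obs ~ N(mu_m + S_mo S_oo^{-1} (x_obs - mu_o), ...)
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(mis, obs)]
    x = x.copy()
    x[mis] = mu[mis] + s_mo @ np.linalg.solve(s_oo, x[obs] - mu[obs])
    return x

# Fit the joint model on a (hypothetical) complete reference cohort ...
rng = np.random.default_rng(1)
cohort = rng.multivariate_normal(
    [0, 0, 0], [[1, .5, .3], [.5, 1, .4], [.3, .4, 1]], 500)
mu, sigma = cohort.mean(axis=0), np.cov(cohort, rowvar=False)

# ... then impute a new patient's missing second predictor in real time.
patient = np.array([1.2, np.nan, -0.4])
print(conditional_mean_impute(patient, mu, sigma))
```

Including auxiliary variables simply means fitting the joint model over both the model predictors and the extra characteristics, so the conditioning set is richer.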


2018 ◽  
Author(s):  
Md. Bahadur Badsha ◽  
Rui Li ◽  
Boxiang Liu ◽  
Yang I. Li ◽  
Min Xian ◽  
...  

Background. Single-cell RNA-sequencing (scRNA-seq) is a rapidly evolving technology that enables measurement of gene expression levels at an unprecedented resolution. Despite the explosive growth in the number of cells that can be assayed by a single experiment, scRNA-seq still has several limitations, including high rates of dropouts, which result in a large number of genes having zero read count in the scRNA-seq data and complicate downstream analyses. Methods. To overcome this problem, we treat zeros as missing values and develop nonparametric deep learning methods for imputation. Specifically, our LATE (Learning with AuToEncoder) method trains an autoencoder with random initial values of the parameters, whereas our TRANSLATE (TRANSfer learning with LATE) method further allows for the use of a reference gene expression data set to provide LATE with an initial set of parameter estimates. Results. On both simulated and real data, LATE and TRANSLATE outperform existing scRNA-seq imputation methods, achieving lower mean squared error in most cases, recovering nonlinear gene-gene relationships, and better separating cell types. They are also highly scalable and can efficiently process over 1 million cells in just a few hours on a GPU. Conclusions. We demonstrate that our nonparametric approach to imputation based on autoencoders is powerful and highly efficient.
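A toy sketch of the autoencoder idea, not the authors' LATE implementation: train the reconstruction on the nonzero (observed) entries only, then use the trained network's output to fill the zeros. Matrix sizes, layer widths, and training settings are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(128, 50)                    # toy expression matrix (cells x genes)
X[torch.rand_like(X) < 0.4] = 0.0          # simulate dropout zeros
mask = X > 0                               # observed (nonzero) entries

model = nn.Sequential(
    nn.Linear(50, 16), nn.ReLU(),          # encoder
    nn.Linear(16, 50), nn.ReLU(),          # decoder (nonnegative output)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(200):
    opt.zero_grad()
    recon = model(X)
    # Loss over observed entries only; zeros are treated as missing.
    loss = ((recon - X)[mask] ** 2).mean()
    loss.backward()
    opt.step()

# Keep observed values, fill the zeros with the reconstruction.
X_imputed = torch.where(mask, X, model(X).detach())
```

In this framing, TRANSLATE's transfer learning would amount to initializing `model` from weights trained on a reference data set instead of from random values.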


Agriculture plays an important role in the Indian economy. Maximizing crop productivity is one of the main tasks that farmers face in their day-to-day life, and they often lack basic knowledge of the nutrient content of their soil and of which crops best suit it, knowledge that would improve crop productivity. In this work, the dataset was taken from the soil test centres of Dindigul district, Tamil Nadu. The parameters are the 12 nutrients present in the soil samples collected from different regions of Dindigul district. Using PCA, the dataset was reduced to 8 parameters. Data mining classification techniques (decision tree, KNN, kernel SVM, linear SVM, logistic regression, naive Bayes, and random forest) were deployed on the original and dimensionality-reduced datasets to predict the crops to be cultivated based on the soil nutrients available in the datasets. The performance of the algorithms was analysed with metrics including accuracy score, Cohen's kappa, precision, recall and F-measure, Hamming loss, explained variance score, mean absolute error, mean squared error, and mean squared logarithmic error; the confusion matrix and classification report were used for the analysis. The decision tree was found to be the best algorithm for the soil datasets, and dimensionality reduction did not affect the prediction.
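A minimal sketch of the reduce-then-classify comparison, using synthetic stand-ins for the soil nutrient data (the Dindigul dataset is not publicly reproduced here); the dimensions mirror the 12-to-8 PCA reduction described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))             # 12 soil nutrient parameters (synthetic)
y = rng.integers(0, 4, size=300)           # 4 hypothetical crop labels

X8 = PCA(n_components=8).fit_transform(X)  # reduce 12 parameters to 8

clf = DecisionTreeClassifier(random_state=0)
print("original :", cross_val_score(clf, X, y, cv=5).mean())
print("PCA (8)  :", cross_val_score(clf, X8, y, cv=5).mean())
```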


2012 ◽  
Vol 57 (1) ◽  
Author(s):  
HO MING KANG ◽  
FADHILAH YUSOF ◽  
ISMAIL MOHAMAD

This paper presents a study on the estimation of missing data. Data samples with different missingness mechanisms, namely missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), are simulated accordingly. The expectation maximization (EM) algorithm and mean imputation (MI) are applied to these data sets and compared, with performance evaluated by the mean absolute error (MAE) and root mean squared error (RMSE). The results show that EM estimates the missing data with smaller errors than mean imputation (MI) under all three missingness mechanisms. However, the graphical results show that EM fails to estimate the missing values in the missing quadrants when the mechanism is MNAR.
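A compact sketch of EM-style estimation under an assumed bivariate normal model: alternate a simplified E-step (conditional-mean fill of the missing coordinate) with an M-step (re-estimating the mean and covariance from the completed data). A full EM E-step would also add the conditional variance of the imputed cells to the covariance update; that correction is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([5.0, 10.0], [[2.0, 1.2], [1.2, 2.0]], 400)
X[rng.random(400) < 0.2, 1] = np.nan       # MCAR missingness in column 1

# Initialize from the complete cases.
mu = np.nanmean(X, axis=0)
sigma = np.cov(X[~np.isnan(X).any(axis=1)], rowvar=False)

for _ in range(50):
    # E-step (simplified): conditional mean of x1 given x0.
    Xf = X.copy()
    mis = np.isnan(X[:, 1])
    Xf[mis, 1] = mu[1] + sigma[1, 0] / sigma[0, 0] * (X[mis, 0] - mu[0])
    # M-step: re-estimate parameters from the completed data.
    mu, sigma = Xf.mean(axis=0), np.cov(Xf, rowvar=False)

print("estimated mean:", mu)
```

The MNAR failure noted above is visible in this framing: when missingness depends on the unobserved value itself, the conditional-mean fill is computed from a distorted observed sample and the estimates no longer converge to the truth.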


2019 ◽  
Vol 8 (3) ◽  
pp. 8070-8074

Data quality is an important aspect of any data mining or statistical task, and the presence of missing values in a dataset degrades it. A missing value means that an event did not happen or that the value does not exist, and data mining algorithms are not robust to incomplete data. Imputation of missing values is therefore necessary to improve data quality before performing data mining and statistical analysis. Existing methods such as Expectation Maximization Imputation (EMI) and A Framework for Imputing Missing values Using co-appearance, correlation and Similarity analysis (FIMUS) use the whole dataset to impute missing values; in such cases, the influence of irrelevant records can reduce imputation accuracy. This can be controlled by considering only locally similar records when imputing, which can be achieved through clustering algorithms such as k-means. The efficiency of k-means clustering depends on the number of clusters, which must be defined by the user. To increase the clustering efficiency, a distinctive value is first imputed in place of each missing one, and this initially imputed dataset is given to a stacked autoencoder for dimensionality reduction, which also improves clustering efficiency. The initial number of clusters for the k-means algorithm is determined using fast clustering. Because of the initial imputation, some irrelevant records may be partitioned into a cluster, and using these records for imputation decreases accuracy. The proposed local similarity imputation algorithm therefore uses only the top k nearest neighbours within the cluster to impute missing values. The performance of the proposed algorithm is evaluated using root mean squared error (RMSE) and index of agreement (d2) on University of California Irvine (UCI) datasets.
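A hedged sketch of the local-similarity idea: cluster an initially filled copy of the data with k-means, then impute each cluster using only its own k nearest neighbours. A column-mean initial fill stands in for the paper's distinctive-value fill, and the stacked autoencoder and fast-clustering steps are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import KNNImputer

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(6, 1, (100, 5))])
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.1] = np.nan

# Step 1: initial fill (column means here) so k-means sees complete data.
col_means = np.nanmean(X_miss, axis=0)
X_init = np.where(np.isnan(X_miss), col_means, X_miss)

# Step 2: cluster, then impute each cluster from its own nearest neighbours.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_init)
X_imputed = X_miss.copy()
for c in np.unique(labels):
    idx = labels == c
    X_imputed[idx] = KNNImputer(n_neighbors=5).fit_transform(X_miss[idx])

rmse = np.sqrt(np.mean((X_imputed - X)[np.isnan(X_miss)] ** 2))
print("RMSE on imputed cells:", rmse)
```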


2013 ◽  
Vol 594-595 ◽  
pp. 889-895 ◽  
Author(s):  
M.N. Noor ◽  
A.S. Yahaya ◽  
N.A. Ramli ◽  
Abdullah Mohd Mustafa Al Bakri

The presence of missing values in statistical survey data is an important issue to deal with. Such data usually contain missing values due to factors such as machine failures, changes in the siting of monitors, routine maintenance, and human error. An incomplete data set usually causes bias due to differences between the observed and unobserved data, so it is important to ensure that the data analyzed are of high quality. A straightforward approach to this problem is to ignore the missing data and discard the incomplete cases from the data set. This approach is generally not valid for time-series prediction, in which the value of a system typically depends on the system's historical data. One approach commonly used to treat missing items is imputation. This paper discusses three interpolation methods: linear, quadratic, and cubic. A total of 8577 observations of PM10 data over one year were used to compare the three methods when fitting the Gamma distribution. Goodness-of-fit was assessed using three performance indicators: mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R2). The results show that the linear interpolation method provides a very good fit to the data.
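A short sketch comparing the three interpolation orders on a series with simulated gaps, scored on the imputed positions by the same three indicators; the series is synthetic, not the PM10 data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(4)
t = np.arange(200)
truth = pd.Series(50 + 10 * np.sin(t / 12) + rng.normal(0, 1, 200))

# Punch interior holes into the series (endpoints kept observed).
gappy = truth.copy()
holes = rng.choice(200, size=30, replace=False)
holes = holes[(holes > 3) & (holes < 196)]
gappy[holes] = np.nan

for method in ("linear", "quadratic", "cubic"):
    filled = gappy.interpolate(method=method)
    mae = mean_absolute_error(truth[holes], filled[holes])
    rmse = np.sqrt(mean_squared_error(truth[holes], filled[holes]))
    r2 = r2_score(truth[holes], filled[holes])
    print(f"{method:9s} MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```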

