COMPARISON OF THE HOT-DECK IMPUTATION METHOD AND THE KNNI METHOD FOR HANDLING MISSING VALUES

2020 ◽  
Vol 2019 (1) ◽  
pp. 275-285
Author(s):  
Iman Jihad Fadillah ◽  
Siti Muchlisoh

One characteristic of high-quality statistical data is completeness. However, in censuses and surveys, incomplete or missing data (missing values) are frequently encountered, and the Indonesian National Socioeconomic Survey (Susenas) is no exception. Missing values can cause a variety of problems and must therefore be handled. Imputation is a common way to address this issue, and several imputation methods have been developed for it. Hot-deck Imputation and K-Nearest Neighbor Imputation (KNNI) are two methods that can be used to handle missing values. Both exploit predictor variables in the imputation process and do not require complicated assumptions. Because the two methods use different algorithms and treat missing values differently, they can also produce different estimates. This study compares Hot-deck Imputation and KNNI for handling missing values. The comparison assesses estimator accuracy through the RMSE and MAPE values, and computational performance through the running time of the imputation process. Applying both methods to the March 2017 Susenas data shows that KNNI yields more accurate estimators than Hot-deck Imputation, whereas Hot-deck Imputation has better computational performance than KNNI.
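
As a rough illustration of the comparison described above (not the authors' code), the sketch below imputes a single numeric survey variable with a random within-class hot-deck and with scikit-learn's KNNImputer, then scores both against the masked-out true values using RMSE and MAPE. The data set and column names are synthetic stand-ins for the Susenas variables.

```python
# Minimal sketch: hot-deck vs. KNN imputation of one numeric variable,
# scored against masked-out true values with RMSE and MAPE.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Toy survey-like data: 'expenditure' is the target, the rest are predictors.
n = 1000
df = pd.DataFrame({
    "hh_size": rng.integers(1, 8, n),
    "income": rng.normal(5e6, 1e6, n),
})
df["expenditure"] = 0.6 * df["income"] + 2e5 * df["hh_size"] + rng.normal(0, 1e5, n)

# Mask 10% of the target at random to simulate item nonresponse.
mask = rng.random(n) < 0.10
truth = df.loc[mask, "expenditure"].copy()
df_missing = df.copy()
df_missing.loc[mask, "expenditure"] = np.nan

def hot_deck(data, target, by):
    """Random hot-deck within classes defined by a predictor: each missing
    value is replaced by a randomly drawn observed donor value."""
    out = data.copy()
    for _, grp in out.groupby(by):
        donors = grp[target].dropna()
        idx = grp.index[grp[target].isna()]
        if len(donors) and len(idx):
            out.loc[idx, target] = rng.choice(donors.values, size=len(idx))
    return out

hd = hot_deck(df_missing, "expenditure", "hh_size")

# KNN imputation: missing entries are filled from the k nearest records
# in the standardized predictor space.
scaled = (df_missing - df_missing.mean()) / df_missing.std()
knn_filled = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(scaled),
                          columns=df.columns, index=df.index)
knn_est = (knn_filled["expenditure"] * df_missing["expenditure"].std()
           + df_missing["expenditure"].mean())

def rmse(a, b): return float(np.sqrt(np.mean((a - b) ** 2)))
def mape(a, b): return float(np.mean(np.abs((a - b) / a)) * 100)

print("hot-deck RMSE/MAPE:", rmse(truth, hd.loc[mask, "expenditure"]),
      mape(truth, hd.loc[mask, "expenditure"]))
print("KNNI     RMSE/MAPE:", rmse(truth, knn_est[mask]), mape(truth, knn_est[mask]))
```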

1997 ◽  
Vol 08 (03) ◽  
pp. 301-315 ◽  
Author(s):  
Marcel J. Nijman ◽  
Hilbert J. Kappen

A Radial Basis Boltzmann Machine (RBBM) is a specialized Boltzmann Machine architecture that combines a feed-forward mapping with probability estimation in the input space, and for which very efficient learning rules exist. The hidden representation of the network displays symmetry breaking as a function of the noise in the dynamics. Thus, generalization can be studied as a function of the noise in the neuron dynamics instead of as a function of the number of hidden units. We show that the RBBM can be seen as an elegant alternative to k-nearest neighbor, leading to comparable performance without the need to store all data. We show that the RBBM has good classification performance compared to the MLP. The main advantage of the RBBM is that, simultaneously with the input-output mapping, a model of the input space is obtained which can be used for learning with missing values. We derive learning rules for the case of incomplete data, and show that they perform better on incomplete data than the traditional learning rules applied to a 'repaired' data set.


2021 ◽  
Author(s):  
Ayesha Sania ◽  
Nicolo Pini ◽  
Morgan Nelson ◽  
Michael Myers ◽  
Lauren Shuffrey ◽  
...  

Abstract Background — Missing data are a source of bias in epidemiologic studies. This is problematic in alcohol research, where data missingness is linked to drinking behavior. Methods — The Safe Passage Study was a prospective investigation of prenatal drinking and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing data using a machine learning algorithm, "K Nearest Neighbor" (K-NN). K-NN imputes missing values for a participant using data from the participants closest to it. Imputed values were weighted by the distances from the nearest neighbors and matched on day of the week. Validation was performed on randomly deleted data spanning 5-15 consecutive days. Results — Data from 5 nearest neighbors and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from a first-trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values. Conclusions — K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.
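
A minimal sketch of the imputation idea described in the Methods (assumptions only, not the study's code): participants' drinking records form a participants-by-days matrix, distances are computed over days both participants report, and each missing day is filled with an inverse-distance-weighted average over the K nearest neighbours that observed that same calendar day (hence the same day of the week).

```python
# Distance-weighted K-NN imputation of a participants-by-days drinking matrix.
# Rows = participants, columns = calendar days, entries = drinks/day, NaN = missing.
import numpy as np

def knn_impute_days(X, k=5):
    X = X.astype(float)
    filled = X.copy()
    n = X.shape[0]
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        # Distance from participant i to every other participant,
        # using only days observed for both.
        dists = np.full(n, np.inf)
        for j in range(n):
            if j == i:
                continue
            both = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if both.sum() == 0:
                continue
            dists[j] = np.sqrt(np.mean((X[i, both] - X[j, both]) ** 2))
        order = np.argsort(dists)[:k]
        for d in np.where(miss)[0]:
            # Donors: nearest neighbours that observed day d
            # (same calendar day, hence same day of the week).
            donors = [j for j in order
                      if not np.isnan(X[j, d]) and np.isfinite(dists[j])]
            if not donors:
                continue
            w = 1.0 / (np.array([dists[j] for j in donors]) + 1e-9)
            filled[i, d] = np.sum(w * X[donors, d]) / w.sum()
    return filled

# Tiny example: 6 participants, 7 days; participant 0 misses days 2-3.
rng = np.random.default_rng(1)
X = rng.poisson(1.5, size=(6, 7)).astype(float)
X[0, 2:4] = np.nan
print(knn_impute_days(X, k=3).round(2))
```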


2021 ◽  
Vol 8 (3) ◽  
pp. 215-226
Author(s):  
Parisa Saeipourdizaj ◽  
Parvin Sarbakhsh ◽  
Akbar Gholampour

Background: In air quality studies, missing data are very common, arising from causes such as machine failure or human error. The approach used to handle such missing data can affect the results of the analysis. The main aim of this study was to review the types of missingness mechanisms and imputation methods, apply several of them to impute missing PM10 and O3 values in Tabriz, and compare their efficiency. Methods: Mean imputation, the EM algorithm, regression, classification and regression trees, predictive mean matching (PMM), interpolation, moving average, and K-nearest neighbor (KNN) imputation were used. PMM was investigated with the spatial and temporal dependencies included in the model. Missing data were randomly simulated at 10, 20, and 30% missingness. The efficiency of the methods was compared using the coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE). Results: Across all indicators, interpolation, moving average, and KNN had the best performance, in that order. PMM did not perform well either with or without spatio-temporal information. Conclusion: Since pollutant concentrations inherently depend on preceding and subsequent observations, methods whose computations draw on the values before and after a gap performed better than the others; these methods are therefore recommended for pollutant data.
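
For concreteness, the sketch below (hypothetical data, not the paper's code) compares linear interpolation, a moving-average fill, and KNN imputation on a simulated hourly pollutant series, scoring each with R2, MAE, and RMSE on artificially deleted values; the lag features used for KNN are an assumption of this sketch.

```python
# Compare interpolation, moving-average fill and KNN imputation on a
# simulated hourly PM10-like series with 20% values deleted at random.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
t = np.arange(24 * 90)                       # 90 days of hourly data
pm10 = 40 + 15 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 5, t.size)
s = pd.Series(pm10)

mask = rng.random(t.size) < 0.20             # delete 20% at random
truth = s[mask]
s_missing = s.copy()
s_missing[mask] = np.nan

# 1) Linear interpolation along time.
interp = s_missing.interpolate(method="linear", limit_direction="both")

# 2) Moving average: fill gaps with a centred rolling mean of observed values.
roll = s_missing.rolling(window=24, center=True, min_periods=1).mean()
movavg = s_missing.fillna(roll)

# 3) KNN imputation using lagged copies of the series as features.
lags = pd.concat({f"lag{l}": s_missing.shift(l) for l in (-2, -1, 0, 1, 2)}, axis=1)
knn = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(lags),
                   columns=lags.columns, index=lags.index)["lag0"]

for name, est in [("interpolation", interp), ("moving average", movavg), ("KNN", knn)]:
    e = est[mask]
    print(f"{name:15s} R2={r2_score(truth, e):.3f} "
          f"MAE={mean_absolute_error(truth, e):.2f} "
          f"RMSE={np.sqrt(mean_squared_error(truth, e)):.2f}")
```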


2015 ◽  
Vol 12 (13) ◽  
pp. 10511-10544
Author(s):  
R. Dalinina ◽  
V. A. Petryshyn ◽  
D. S. Lim ◽  
A. J. Braverman ◽  
A. K. Tripati

Abstract. Microbialites are a product of trapping and binding of sediment by microbial communities, and are considered to be some of the most ancient records of life on Earth. It is a commonly held belief that microbialites are limited to extreme, hypersaline settings. However, more recent studies report their occurrence in a wider range of environments. The goal of this study is to explore whether microbialite-bearing sites share common geochemical properties. We apply statistical techniques to distinguish any common traits in these environments. These techniques ultimately could be used to address questions of microbialite distribution: are microbialites restricted to environments with specific characteristics, or are they more broadly distributed? A dataset containing hydrographic characteristics of several microbialite sites, with data on pH, conductivity, alkalinity, and concentrations of several major anions and cations, was constructed from previously published studies. In order to group the water samples by their natural similarities and differences, a clustering approach was chosen for analysis. k-means clustering with partial distances was applied to the dataset with missing values, and separated the data into two clusters. One of the clusters is formed by samples from Kiritimati atoll (central Pacific Ocean), and the second cluster contains all other observations. Using these two clusters, the missing values were imputed by the k-nearest neighbor method, producing a complete dataset that can be used for further multivariate analysis. Salinity is not found to be an important variable defining clustering, and although pH defines clustering in this dataset, it is not an important variable for microbialite formation. The clustering and imputation procedures outlined here can be applied to an expanded dataset of microbialite characteristics in order to determine properties associated with microbialite-containing environments.
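
The sketch below illustrates the two-step procedure described in the abstract under simplifying assumptions: k-means clustering using partial distances (computed over observed coordinates only and rescaled to full dimensionality), followed by k-nearest-neighbour imputation within each cluster. The toy data and parameter choices are not from the study.

```python
# k-means with partial distances on data containing missing values,
# followed by k-NN imputation within the resulting clusters.
import numpy as np

def partial_dist(x, c):
    # Squared distance over observed coordinates, rescaled to full dimensionality.
    obs = ~np.isnan(x)
    if not obs.any():
        return np.inf
    return np.sum((x[obs] - c[obs]) ** 2) * x.size / obs.sum()

def kmeans_partial(X, k=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    col_means = np.nanmean(X, axis=0)
    # Start from k random rows, missing entries filled by column means.
    centres = X[rng.choice(len(X), k, replace=False)].copy()
    centres = np.where(np.isnan(centres), col_means, centres)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.array([np.argmin([partial_dist(x, c) for c in centres]) for x in X])
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = np.nanmean(members, axis=0)
                centres[j] = np.where(np.isnan(m), centres[j], m)
    return labels, centres

def knn_impute_within_cluster(X, labels, k_nn=3):
    # Fill each missing entry from the k nearest rows of the same cluster.
    out = X.copy()
    for i, x in enumerate(X):
        miss = np.where(np.isnan(x))[0]
        if miss.size == 0:
            continue
        same = np.where(labels == labels[i])[0]
        same = same[same != i]
        d = np.array([partial_dist(x, X[j]) for j in same])
        order = same[np.argsort(d)][:k_nn]
        for col in miss:
            vals = X[order, col]
            vals = vals[~np.isnan(vals)]
            if vals.size:
                out[i, col] = vals.mean()
    return out

# Toy usage: two hidden groups, ~15% missing entries.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(5, 1, (20, 4))])
X[rng.random(X.shape) < 0.15] = np.nan
labels, _ = kmeans_partial(X, k=2)
X_complete = knn_impute_within_cluster(X, labels)
```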


Author(s):  
Vinutha M.R ◽  
Chandrika J

In today's fast-moving world, Liver Cirrhosis is considered a condition of substantial significance at both the national and international levels. A primary interest of medical science is to develop a reliable method for predicting Liver Cirrhosis at an early stage. The extremely heterogeneous nature of the disease, along with non-standardized treatment, makes its management a complex issue. Although medical modalities can assess the disease, variation in patients' responses limits them. Machine learning techniques have been used in medical prognosis, as they help physicians assess the disease faster. Motivated by this, and considering the difficulties physicians face in diagnosing Liver Cirrhosis, we propose a novel technique called EANNMHO. EANNMHO is a hybrid technique combining an Ensemble Artificial Neural Network (EANN) with Modified Harris Hawk Optimization (MHO); missing values are first imputed using K-Nearest Neighbor. When evaluated against other machine learning techniques, the proposed model produces conclusive results.
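
A minimal sketch of the pre-processing step mentioned above, where missing values are imputed with K-Nearest Neighbor before classification. The EANN-MHO ensemble itself is not reproduced; a generic scikit-learn classifier stands in purely as a placeholder, and the clinical feature values are simulated.

```python
# KNN imputation as a pre-processing step ahead of a placeholder classifier.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 6))                 # simulated clinical measurements
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, n) > 0).astype(int)
X[rng.random(X.shape) < 0.10] = np.nan      # 10% missing values

clf = make_pipeline(
    KNNImputer(n_neighbors=5),              # impute missing values first
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(3))
```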


2020 ◽  
Author(s):  
Ayesha Sania ◽  
Nicolò Pini ◽  
Morgan E. Nelson ◽  
Michael M. Myers ◽  
Lauren C. Shuffrey ◽  
...  

Abstract Background — Missing data are a source of bias in many epidemiologic studies. This is problematic in alcohol research, where data missingness may not be random, as it depends on patterns of drinking behavior. Methods — The Safe Passage Study was a prospective investigation of prenatal alcohol consumption and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing exposure data using a machine learning algorithm, "K Nearest Neighbor" (K-NN). K-NN imputes missing values for a participant using data from the other participants closest to it. Since participants with no missing days may not be comparable to those with missing data, segments from those with complete and incomplete data were included as a reference. Imputed values were weighted by the distances from the nearest neighbors and matched on day of the week. We validated our approach by randomly deleting non-missing data for 5-15 consecutive days. Results — We found that data from 5 nearest neighbors (i.e. K=5) and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from a first trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values. Conclusions — K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.
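
The validation idea described above can be sketched as follows (assumptions only, not the study's code): repeatedly delete a random run of 5-15 consecutive observed days for a participant, impute them, and tabulate how often the imputed value matches the actual drinks/day exactly or falls within +/-1 drink/day. The `knn_impute_days` helper referenced in the usage comment is the hypothetical imputer sketched earlier in this document.

```python
# Validation by deleting runs of 5-15 consecutive observed days and scoring
# the imputed values against the held-out truth.
import numpy as np

def validate_segment_deletion(X, impute_fn, n_trials=200, seed=0):
    rng = np.random.default_rng(seed)
    exact, within_one, total = 0, 0, 0
    for _ in range(n_trials):
        i = rng.integers(len(X))
        row = X[i]
        obs = np.where(~np.isnan(row))[0]
        seg_len = rng.integers(5, 16)        # 5-15 consecutive days
        if obs.size < seg_len:
            continue
        start = rng.choice(obs[: obs.size - seg_len + 1])
        cols = np.arange(start, start + seg_len)
        truth = row[cols].copy()
        X_holdout = X.copy()
        X_holdout[i, cols] = np.nan          # delete the segment
        imputed = impute_fn(X_holdout)[i, cols]
        valid = ~np.isnan(truth)             # score only days that were observed
        exact += np.sum(np.round(imputed[valid]) == truth[valid])
        within_one += np.sum(np.abs(imputed[valid] - truth[valid]) <= 1)
        total += valid.sum()
    return exact / total, within_one / total

# Usage with the hypothetical K-NN imputer sketched earlier, e.g.:
# exact_rate, within_one_rate = validate_segment_deletion(
#     X, lambda M: knn_impute_days(M, k=5))
```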


2010 ◽  
Vol 40 (2) ◽  
pp. 184-199 ◽  
Author(s):  
Michael J. Falkowski ◽  
Andrew T. Hudak ◽  
Nicholas L. Crookston ◽  
Paul E. Gessler ◽  
Edward H. Uebler ◽  
...  

Sustainable forest management requires timely, detailed forest inventory data across large areas, which is difficult to obtain via traditional forest inventory techniques. This study evaluated k-nearest neighbor imputation models incorporating LiDAR data to predict tree-level inventory data (individual tree height, diameter at breast height, and species) across a 12 100 ha study area in northeastern Oregon, USA. The primary objective was to provide spatially explicit data to parameterize the Forest Vegetation Simulator, a tree-level forest growth model. The final imputation model utilized LiDAR-derived height measurements and topographic variables to spatially predict tree-level forest inventory data. When compared with an independent data set, the accuracy of the forest inventory metrics was high; the root mean square differences of the imputed basal area and stem volume estimates were 5 m²·ha⁻¹ and 16 m³·ha⁻¹, respectively. However, the error of imputed forest inventory metrics incorporating small trees (e.g., quadratic mean diameter, tree density) was considerably higher. Forest Vegetation Simulator growth projections based upon the imputed forest inventory data follow trends similar to growth projections based upon independent inventory data. This study represents a significant improvement in our capabilities to predict detailed, tree-level forest inventory data across large areas, which could ultimately lead to more informed forest management practices and policies.
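
As a schematic of nearest-neighbour imputation from LiDAR predictors (hypothetical variables, not the authors' workflow), the sketch below copies plot-level inventory attributes from the most similar reference plot in a standardized LiDAR/topographic feature space.

```python
# Nearest-neighbour imputation of plot-level inventory attributes
# (basal area, stem volume) from simulated LiDAR/topographic predictors.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Reference plots: predictors X_ref and field-measured responses Y_ref
# (basal area m2/ha, stem volume m3/ha). Values are simulated.
n_ref, n_target = 200, 500
X_ref = rng.normal(size=(n_ref, 4))          # e.g. mean height, p95 height, cover, elevation
Y_ref = np.column_stack([
    20 + 6 * X_ref[:, 1] + rng.normal(0, 3, n_ref),     # basal area
    180 + 60 * X_ref[:, 1] + rng.normal(0, 20, n_ref),  # stem volume
])
X_target = rng.normal(size=(n_target, 4))    # grid cells to be imputed

scaler = StandardScaler().fit(X_ref)
nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(X_ref))
_, idx = nn.kneighbors(scaler.transform(X_target))
Y_imputed = Y_ref[idx[:, 0]]                 # attributes copied from nearest reference plot

print(Y_imputed[:3])
```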


2021 ◽  
Author(s):  
Tlamelo Emmanuel ◽  
Thabiso Maupong ◽  
Dimane Mpoeleng ◽  
Thabo Semong ◽  
Mphago Banyatsang ◽  
...  

Abstract Machine learning has been the cornerstone of analysing and extracting information from data, and a problem of missing values is often encountered. Missing values arise under different mechanisms: missing completely at random, missing at random, or missing not at random. The missingness itself may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. Several proposals for handling missing values exist in the literature. In this paper we aggregate some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of the proposed techniques, how they perform, their limitations, and the kind of data they are most suitable for. Finally, we experiment with the K-nearest neighbor and random forest imputation techniques on novel power plant induced fan data and offer some possible directions for future research.
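
A small sketch in the spirit of the experiment mentioned above (synthetic data, not the paper's): KNN imputation contrasted with an iterative imputer driven by a random forest, scored by RMSE on entries deleted completely at random.

```python
# Contrast KNN imputation with an iterative random-forest imputer on
# synthetic data with 15% of entries missing completely at random.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X_true = rng.normal(size=(n, 5))
X_true[:, 4] = X_true[:, :4].sum(axis=1) + rng.normal(0, 0.3, n)   # correlated column

X = X_true.copy()
holes = rng.random(X.shape) < 0.15          # 15% missing completely at random
X[holes] = np.nan

knn = KNNImputer(n_neighbors=5).fit_transform(X)
rf = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0),
                      max_iter=5, random_state=0).fit_transform(X)

for name, filled in [("KNN", knn), ("random forest", rf)]:
    rmse = np.sqrt(np.mean((filled[holes] - X_true[holes]) ** 2))
    print(f"{name:13s} imputation RMSE: {rmse:.3f}")
```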

