The K nearest neighbor algorithm for imputation of missing longitudinal prenatal alcohol data

Abstract Background: Missing data are a source of bias in many epidemiologic studies. This is particularly problematic in alcohol research, where missingness may not be random because it depends on patterns of drinking behavior. Methods: The Safe Passage Study was a prospective investigation of prenatal alcohol consumption and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing exposure data using a machine learning algorithm, K Nearest Neighbor (K-NN). K-NN imputes missing values for a participant using data from the other participants closest to that participant. Since participants with no missing days may not be comparable to those with missing data, segments from participants with both complete and incomplete data were included as a reference. Imputed values were weighted by the distances from the nearest neighbors and matched for day of week. We validated our approach by randomly deleting non-missing data for 5-15 consecutive days. Results: Data from 5 nearest neighbors (i.e., K=5) and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from a first trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values. Conclusions: K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.
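
The distance-weighted imputation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_impute`, the root-mean-square distance, the inverse-distance weights, and the assumption that all series are aligned so that each index falls on the same weekday are illustrative choices only.

```python
import math

def knn_impute(target, donors, k=5):
    """Fill None entries of `target` from the k closest donor series.

    Every series is an equal-length list of drinks per day. Assumption:
    index i falls on the same day of week in every series, which stands in
    for the day-of-week matching mentioned in the abstract.
    """
    observed = [i for i, v in enumerate(target) if v is not None]
    missing = [i for i, v in enumerate(target) if v is None]
    scored = []
    for donor in donors:
        # Compare only on days observed in both the target and the donor.
        shared = [i for i in observed if donor[i] is not None]
        # Skip donors with no overlap or without values for the gap days.
        if not shared or any(donor[i] is None for i in missing):
            continue
        dist = math.sqrt(
            sum((target[i] - donor[i]) ** 2 for i in shared) / len(shared)
        )
        scored.append((dist, donor))
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    if not nearest:
        return list(target)  # nothing to borrow from; leave gaps as they are
    # Inverse-distance weights; the small constant avoids division by zero.
    weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
    total = sum(weights)
    filled = list(target)
    for i in missing:
        filled[i] = sum(
            w * donor[i] for w, (_, donor) in zip(weights, nearest)
        ) / total
    return filled
```

Closer neighbors dominate the weighted average, so a donor whose observed days match the target exactly contributes almost all of the imputed value.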

The K Nearest Neighbor Algorithm for Imputation of Missing Longitudinal Prenatal Alcohol Data

10.21203/rs.3.rs-32456/v2 ◽

2021 ◽

Author(s):

Ayesha Sania ◽

Nicolo Pini ◽

Morgan Nelson ◽

Michael Myers ◽

Lauren Shuffrey ◽

...

Keyword(s):

Missing Data ◽

Missing Values ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Drinking Behavior ◽

Nearest Neighbors ◽

First Trimester ◽

Epidemiologic Studies ◽

K Nearest Neighbor ◽

Timeline Followback

Abstract Background: Missing data are a source of bias in epidemiologic studies. This is problematic in alcohol research, where data missingness is linked to drinking behavior. Methods: The Safe Passage Study was a prospective investigation of prenatal drinking and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing data using a machine learning algorithm, K Nearest Neighbor (K-NN). K-NN imputes missing values for a participant using data from the participants closest to that participant. Imputed values were weighted by the distances from the nearest neighbors and matched for day of week. Validation was done on data randomly deleted for 5-15 consecutive days. Results: Data from 5 nearest neighbors and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from a first trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values. Conclusions: K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.

The K Nearest Neighbor Algorithm for Imputation of Missing Longitudinal Prenatal Alcohol Data

10.21203/rs.3.rs-32456/v3 ◽

2021 ◽

Author(s):

Ayesha Sania ◽

Nicolò Pini ◽

Morgan E. Nelson ◽

Michael M. Myers ◽

Lauren C. Shuffrey ◽

...

Keyword(s):

Missing Data ◽

Missing Values ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Drinking Behavior ◽

Nearest Neighbors ◽

First Trimester ◽

Epidemiologic Studies ◽

K Nearest Neighbor ◽

Using Data

Abstract Background: Missing data are a source of bias in epidemiologic studies. This is problematic in alcohol research, where data missingness is linked to drinking behavior. Methods: The Safe Passage Study was a prospective investigation of prenatal drinking and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Follow-back method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing data using a machine learning algorithm, k-Nearest Neighbor (k-NN). k-NN imputes missing values for a participant using data from the participants closest to that participant. Imputed values were weighted by the distances from the nearest neighbors and matched for day of week. Validation was done over 500 iterations after randomly deleting data for 5-15 consecutive days. Results: Data from 5 nearest neighbors and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from a first trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values. Conclusions: k-NN can be used to impute missing data from longitudinal studies of alcohol use during pregnancy with high accuracy.
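
The validation design in this abstract (repeatedly deleting a run of 5-15 consecutive days and comparing imputed with actual values) might be sketched as follows. The helper names, the rounding rule used to decide that a segment shows "no difference", and the trivial `zero_fill` stand-in imputer are all assumptions for illustration; the abstract does not specify how agreement was scored.

```python
import random

def mask_segment(series, rng, min_len=5, max_len=15):
    """Return (masked copy, indices) with one random run of days set to None."""
    length = rng.randint(min_len, min(max_len, len(series)))
    start = rng.randint(0, len(series) - length)
    idx = list(range(start, start + length))
    masked = list(series)
    for i in idx:
        masked[i] = None
    return masked, idx

def validate(series, impute, iterations=500, seed=0):
    """Share of deleted segments whose rounded imputed values equal the truth.

    `impute` is any callable that takes a series with None gaps and returns
    a fully filled list of the same length.
    """
    rng = random.Random(seed)
    exact = 0
    for _ in range(iterations):
        masked, idx = mask_segment(series, rng)
        filled = impute(masked)
        if all(round(filled[i]) == series[i] for i in idx):
            exact += 1
    return exact / iterations

def zero_fill(series):
    """Hypothetical stand-in imputer: treat every missing day as 0 drinks."""
    return [0 if v is None else v for v in series]

if __name__ == "__main__":
    abstainer = [0] * 55  # a 55-day segment with no drinking
    print(validate(abstainer, zero_fill))
```

Passing the imputer as a callable keeps the deletion and scoring logic independent of the imputation method being evaluated.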

The K Nearest Neighbor Algorithm for Imputation of Missing Longitudinal Prenatal Alcohol Data

10.21203/rs.3.rs-153387/v1 ◽

2021 ◽

Author(s):

Ayesha Sania ◽

Nicolo Pini ◽

Morgan Nelson ◽

Michael Myers ◽

Lauren Shuffrey ◽

...

Keyword(s):

Missing Data ◽

Missing Values ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Drinking Behavior ◽

Nearest Neighbors ◽

First Trimester ◽

Epidemiologic Studies ◽

K Nearest Neighbor ◽

Timeline Followback

Symmetry Breaking and Training from Incomplete Data with Radial Basis Boltzmann Machines

International Journal of Neural Systems ◽

10.1142/s0129065797000318 ◽

1997 ◽

Vol 08 (03) ◽

pp. 301-315 ◽

Cited By ~ 8

Author(s):

Marcel J. Nijman ◽

Hilbert J. Kappen

Keyword(s):

Symmetry Breaking ◽

Incomplete Data ◽

Missing Values ◽

Nearest Neighbor ◽

Boltzmann Machine ◽

K Nearest Neighbor ◽

Data Set ◽

Input Space ◽

Learning Rules ◽

Radial Basis

A Radial Basis Boltzmann Machine (RBBM) is a specialized Boltzmann Machine architecture that combines feed-forward mapping with probability estimation in the input space, and for which very efficient learning rules exist. The hidden representation of the network displays symmetry breaking as a function of the noise in the dynamics. Thus, generalization can be studied as a function of the noise in the neuron dynamics instead of as a function of the number of hidden units. We show that the RBBM can be seen as an elegant alternative to k-nearest neighbor, leading to comparable performance without the need to store all data. We show that the RBBM has good classification performance compared to the MLP. The main advantage of the RBBM is that, simultaneously with the input-output mapping, a model of the input space is obtained which can be used for learning with missing values. We derive learning rules for the case of incomplete data, and show that they perform better on incomplete data than the traditional learning rules on a 'repaired' data set.

Application of imputation methods for missing values of PM10 and O3 data: Interpolation, moving average and K-nearest neighbor methods

Environmental Health Engineering and Management ◽

10.34172/ehem.2021.25 ◽

2021 ◽

Vol 8 (3) ◽

pp. 215-226

Author(s):

Parisa Saeipourdizaj ◽

Parvin Sarbakhsh ◽

Akbar Gholampour

Keyword(s):

Missing Data ◽

Human Error ◽

Missing Values ◽

Nearest Neighbor ◽

Moving Average ◽

Classification And Regression Tree ◽

Coefficient Of Determination ◽

K Nearest Neighbor ◽

Imputation Methods ◽

Machine Failure

Background: In air quality studies, missing data are very common due to reasons such as machine failure or human error. The approach used to deal with such missing data can affect the results of the analysis. The main aim of this study was to review the types of missingness mechanism and imputation methods, apply some of these methods to the imputation of missing PM10 and O3 data in Tabriz, and compare their efficiency. Methods: Mean imputation, the EM algorithm, regression, classification and regression tree, predictive mean matching (PMM), interpolation, moving average, and K-nearest neighbor (KNN) were used. PMM was investigated by considering the spatial and temporal dependencies in the model. Missing data were randomly simulated with 10, 20, and 30% missing values. The efficiency of the methods was compared using the coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE). Results: Based on the results for all indicators, interpolation, moving average, and KNN had the best performance, in that order. PMM did not perform well either with or without spatio-temporal information. Conclusion: Because pollution data always depend on the preceding and following observations, methods whose computation is based on those neighboring observations performed better than the others, so these methods are recommended for pollutant data.
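
To make the comparison criteria concrete, here is a small sketch of linear interpolation across gaps together with MAE, RMSE, and R2 computed on the simulated-missing positions. The function names, the choice of R2 as 1 - SS_res/SS_tot, and the toy concentration values are assumptions for illustration and are not taken from the study.

```python
import math

def interpolate(series):
    """Linearly interpolate interior None gaps; edge gaps copy the nearest value."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return out
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = series[left] + frac * (series[right] - series[left])
    return out

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def r2(actual, predicted):
    """Coefficient of determination, defined here as 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

if __name__ == "__main__":
    truth = [10, 13, 14, 15, 18]           # toy hourly concentrations
    observed = [10, None, 14, None, 18]    # two values deleted on purpose
    filled = interpolate(observed)
    gaps = [i for i, v in enumerate(observed) if v is None]
    actual = [truth[i] for i in gaps]
    predicted = [filled[i] for i in gaps]
    print(mae(actual, predicted), rmse(actual, predicted), r2(actual, predicted))
```

Scoring only the deliberately deleted positions mirrors the simulation design, where the true values at those positions are known.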

Impact of a pulsed xenon disinfection system on hospital onset Clostridioides difficile infections in 48 hospitals over a 5-year period

BMC Infectious Diseases ◽

10.1186/s12879-021-06789-y ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Sarah Simmons ◽

Grady Wier ◽

Antonio Pedraza ◽

Mark Stibich

Keyword(s):

Nearest Neighbor ◽

Negative Binomial ◽

Learning Algorithm ◽

Compliance Rate ◽

Negative Binomial Regression ◽

Inverse Association ◽

K Nearest Neighbor ◽

Data Set ◽

Clostridioides Difficile ◽

The Impact

Abstract Background The role of the environment in hospital acquired infections is well established. We examined the impact on the infection rate for hospital onset Clostridioides difficile (HO-CDI) of an environmental hygiene intervention in 48 hospitals over a 5 year period using a pulsed xenon ultraviolet (PX-UV) disinfection system. Methods Utilization data was collected directly from the automated PX-UV system and uploaded in real time to a database. HO-CDI data was provided by each facility. Data was analyzed at the unit level to determine compliance to disinfection protocols. Final data set included 5 years of data aggregated to the facility level, resulting in a dataset of 48 hospitals and a date range of January 2015–December 2019. Negative binomial regression was used with an offset on patient days to convert infection count data and assess HO-CDI rates vs. intervention compliance rate, total successful disinfection cycles, and total rooms disinfected. The K-Nearest Neighbor (KNN) machine learning algorithm was used to compare intervention compliance and total intervention cycles to presence of infection. Results All regression models depict a statistically significant inverse association between the intervention and HO-CDI rates. The KNN model predicts the presence of infection (or whether an infection will be present or not) with greater than 98% accuracy when considering both intervention compliance and total intervention cycles. Conclusions The findings of this study indicate a strong inverse relationship between the utilization of the pulsed xenon intervention and HO-CDI rates.
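
For readers unfamiliar with the offset mentioned in the Methods, a negative binomial rate model of this kind is commonly written as below. The notation is assumed here and does not come from the paper: Y_i is the infection count for facility i, t_i its patient days, x_i an intervention measure such as the compliance rate, and alpha the dispersion parameter of the usual NB2 form.

```latex
\begin{align}
  \log \mathbb{E}[Y_i] &= \beta_0 + \beta_1 x_i + \log t_i
  \quad\Longleftrightarrow\quad
  \log \frac{\mathbb{E}[Y_i]}{t_i} = \beta_0 + \beta_1 x_i, \\
  \operatorname{Var}(Y_i) &= \mu_i + \alpha \mu_i^{2},
  \qquad \mu_i = \mathbb{E}[Y_i].
\end{align}
```

Because the coefficient on \log t_i is fixed at 1, the model describes infections per patient day, and \exp(\beta_1) is the rate ratio per unit increase in x_i; an inverse association corresponds to \beta_1 < 0.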

A Survey On Missing Data in Machine Learning

10.21203/rs.3.rs-535520/v1 ◽

2021 ◽

Author(s):

Tlamelo Emmanuel ◽

Thabiso Maupong ◽

Dimane Mpoeleng ◽

Thabo Semong ◽

Mphago Banyatsang ◽

...

Keyword(s):

Machine Learning ◽

Missing Data ◽

Human Error ◽

Missing Values ◽

Nearest Neighbor ◽

Research Direction ◽

Machine Learning Techniques ◽

Future Research ◽

Learning Approaches ◽

K Nearest Neighbor

Abstract Machine learning has been the cornerstone of analysing and extracting information from data, and the problem of missing values is often encountered. Missing values may be missing completely at random, missing at random, or missing not at random. They may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. In the literature there have been several proposals for handling missing values. In this paper we aggregate some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of the proposed techniques, how they perform, their limitations, and the kind of data they are most suitable for. Finally, we experiment with the K nearest neighbor and random forest imputation techniques on novel power plant induced fan data and offer some possible future research directions.

Fast Approximate Complete-data k-nearest-neighbor Estimation

Austrian Journal of Statistics ◽

10.17713/ajs.v49i2.907 ◽

2020 ◽

Vol 49 (2) ◽

pp. 18-30

Author(s):

Alejandro Murua ◽

Nicolas Wicker

Keyword(s):

Nearest Neighbor ◽

Nearest Neighbors ◽

Complete Data ◽

Fast Method ◽

K Nearest Neighbor ◽

K Nearest Neighbors ◽

Data Set ◽

Neighbor Graph ◽

Very Large Datasets ◽

Nearest Neighbor Graph

We introduce a fast method to estimate the complete-data set of k-nearest-neighbors. This is equivalent to finding an estimate of the k-nearest-neighbor graph of the data. The method relies on random normal projections. The k-nearest-neighbors are estimated by sorting points in a number of random lines. For very large datasets, the method is quasi-linear in the data size. As an application, we show that the intrinsic dimension of a manifold can be reliably estimated from the estimated set of k-nearest-neighbors in time about two orders of magnitude faster than when using the exact set of k-nearest-neighbors.
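
The general idea of estimating neighbors by sorting points along random lines can be shown with a simplified sketch. This is not the authors' estimator: the candidate window, the number of lines, and the final exact re-ranking of candidates are assumptions chosen to keep the example short.

```python
import math
import random

def approx_knn(points, k, n_lines=8, window=10, seed=0):
    """Approximate k-nearest-neighbor lists via random normal projections.

    Each point is projected onto `n_lines` random directions with standard
    normal coordinates. Points that are close in the sorted order along any
    line become candidates, and candidates are then ranked by true distance.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    candidates = [set() for _ in range(n)]
    for _ in range(n_lines):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        order = sorted(
            range(n),
            key=lambda i: sum(d * x for d, x in zip(direction, points[i])),
        )
        for pos, i in enumerate(order):
            lo, hi = max(0, pos - window), min(n, pos + window + 1)
            candidates[i].update(order[lo:hi])
    neighbors = []
    for i in range(n):
        candidates[i].discard(i)  # a point is not its own neighbor
        ranked = sorted(
            candidates[i],
            key=lambda j: (math.dist(points[i], points[j]), j),
        )
        neighbors.append(ranked[:k])
    return neighbors

if __name__ == "__main__":
    rng = random.Random(42)
    pts = [(rng.random(), rng.random()) for _ in range(200)]
    print(approx_knn(pts, k=3)[0])
```

Each line costs one sort, so the work grows roughly as n log n per line instead of the n squared distance computations of a brute-force search; a wider window or more lines trades speed for accuracy.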

CHOOSING APPROPRIATE IMPUTATION METHODS FOR MISSING DATA: A DECISION ALGORITHM ON METHODS FOR MISSING DATA

Journal of Al-Qadisiyah for Computer Science and Mathematics ◽

10.29304/jqcm.2019.11.2.588 ◽

2019 ◽

Vol 11 (2) ◽

pp. 65-73

Author(s):

Wisam A. Mahmood ◽

Mohammed S. Rashid ◽

Teaba Wala Aldeen

Keyword(s):

Missing Data ◽

Simulation Study ◽

Missing Values ◽

Nearest Neighbor ◽

Support Vector ◽

K Nearest Neighbor ◽

Decision Algorithm ◽

Imputation Methods ◽

Regression Imputation ◽

Mean Imputation

Missing values commonly occur in medical research and can introduce substantial bias if they are neglected or handled poorly. Although some standard statistical methods have been developed for this problem, no method is yet available that reliably yields credible estimates. Missing values reduce the effective size of a dataset and decrease efficiency. A number of imputation methods have been proposed in earlier work to handle missing values. Commonly used methods include the complete case method, mean imputation, Last Observation Carried Forward (LOCF), the Expectation-Maximization (EM) algorithm, Markov Chain Monte Carlo (MCMC), Hot Deck (HOT), Regression Imputation (Regress), K-nearest neighbor (KNN), K-Mean Clustering, Fuzzy K-Mean Clustering, Support Vector Machine, and Multiple Imputation (MI). In the present paper, a simulation study is carried out to investigate the efficacy of the above-mentioned imputation methods in a longitudinal data setting under missing completely at random (MCAR). We considered three levels of missingness: a low level of 5% and higher levels of 30% and 50%. From this simulation study, we concluded that the LOCF method has more bias than the other methods in most situations.
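
A toy example can show why LOCF tends to be biased on longitudinal data with a trend. The sketch below is not the simulation from the paper; the function names, the linear toy series, and the chosen gap are assumptions for illustration.

```python
def locf(series):
    """Last Observation Carried Forward: repeat the latest observed value.

    Leading gaps stay None because there is nothing to carry forward yet.
    """
    out, last = [], None
    for value in series:
        if value is not None:
            last = value
        out.append(last)
    return out

def mean_impute(series):
    """Replace every gap with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]

def bias(truth, imputed, missing_idx):
    """Average signed error (imputed minus true) over the missing positions."""
    return sum(imputed[i] - truth[i] for i in missing_idx) / len(missing_idx)

if __name__ == "__main__":
    truth = [float(t) for t in range(10)]   # steadily increasing outcome
    missing_idx = [3, 4, 5]                 # positions deleted for the demo
    observed = [None if i in missing_idx else v for i, v in enumerate(truth)]
    print("LOCF bias:", bias(truth, locf(observed), missing_idx))
    print("Mean bias:", bias(truth, mean_impute(observed), missing_idx))
```

On an increasing series LOCF freezes the value at the start of the gap, so every carried-forward value falls below the truth and the error grows with the length of the gap.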

Comparative Study of Three Imputation Methods to Treat Missing Values

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v11i7.3472 ◽

2013 ◽

Vol 11 (7) ◽

pp. 2779-2786

Author(s):

Rahul Singhai

Keyword(s):

Data Mining ◽

Missing Data ◽

Missing Values ◽

Learning Algorithm ◽

Poor Quality ◽

Imputation Method ◽

Data Set ◽

Imputation Methods ◽

Missing Data Treatment

One relevant problem in data preprocessing is the presence of missing data, which leads to poor-quality patterns being extracted by mining. Imputation is a widely used procedure that replaces the missing values in a data set with probable values. The advantage of this approach is that the missing data treatment is independent of the learning algorithm used. This allows the user to select the most suitable imputation method for each situation. This paper analyzes the various imputation methods proposed in the field of statistics with respect to data mining. A comparative analysis of three different imputation approaches that can be used to impute missing attribute values in data mining is given, showing the most promising method. An artificial input data file (of numeric type) with 1000 records is used to investigate the performance of these methods. A Z-test was used to test the significance of the results.
