Handling Outliers and Missing Data in Regression Models Using R: Simulation Examples

Academic Journal of Applied Mathematical Sciences ◽

10.32861/ajams.68.187.203 ◽

2020 ◽

pp. 187-203

Author(s):

Mohamed Reda Abonazel

Keyword(s):

Monte Carlo Simulation ◽

Missing Data ◽

Regression Models ◽

Missing Values ◽

Model Simulation ◽

Nearest Neighbors ◽

K Nearest Neighbors ◽

Monte Carlo Simulation Study ◽

Simulation Results ◽

Handling Methods

This paper reviews two important problems in regression analysis, outliers and missing data, along with some methods for handling them. Two applications are presented to illustrate these methods using R code, giving researchers practical guidance on dealing with these problems in regression modeling with R. Finally, a Monte Carlo simulation study compares different methods for handling missing data in the regression model. The simulation results indicate that, under the simulation factors considered, the k-nearest neighbors method performs best at estimating the missing values in regression models.


Advanced methods for missing values imputation based on similarity learning

PeerJ Computer Science ◽

10.7717/peerj-cs.619 ◽

2021 ◽

Vol 7 ◽

pp. e619

Author(s):

Khaled M. Fouad ◽

Mahmoud M. Ismail ◽

Ahmad Taher Azar ◽

Mona M. Arafa

Keyword(s):

Missing Data ◽

Missing Values ◽

Imputation Accuracy ◽

Nearest Neighbors ◽

Imputation Method ◽

Data Imputation ◽

K Nearest Neighbors ◽

Missing Data Imputation ◽

K Value ◽

Imputation Methods

Real-world data analysis and processing with data mining techniques often face observations that contain missing values, and their presence is the main challenge in mining datasets. Missing values should be imputed with an imputation method to improve the accuracy and performance of data mining methods. Existing techniques use the k-nearest neighbors algorithm to impute missing values, but determining an appropriate k value can be challenging. Other existing imputation techniques are based on hard clustering algorithms. When records are not well separated, as in the case of missing data, hard clustering often describes the data poorly. In general, imputation based on similar records is more accurate than imputation based on all records in the dataset, so improving the similarity among records can improve imputation performance. This paper proposes two imputation methods for numerical missing data. The first is a hybrid method, called KI, that incorporates k-nearest neighbors and iterative imputation algorithms. The best set of nearest neighbors for each incomplete record is found through record similarity using the k-nearest neighbors algorithm (kNN). To improve the similarity, a suitable k value is estimated automatically for the kNN. The iterative imputation method is then used to impute the missing values of the incomplete records using the global correlation structure among the selected records. The second is an enhanced hybrid method, called FCKI, which extends KI. It integrates fuzzy c-means, k-nearest neighbors, and iterative imputation algorithms to impute the missing data in a dataset. The fuzzy c-means algorithm is selected because records can belong to multiple clusters at the same time, which can further improve similarity. FCKI searches a cluster, instead of the whole dataset, to find the best k-nearest neighbors. It applies two levels of similarity to achieve higher imputation accuracy. The performance of the proposed imputation techniques is assessed on fifteen datasets of different sizes with varying missing ratios for three types of missing data: MCAR, MAR, and MNAR. These missing data types are generated in this work. The proposed imputation techniques are compared with other missing data imputation methods using three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.
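As a rough illustration of the building blocks named in this abstract, the sketch below implements plain k-nearest neighbors imputation together with the three error measures. It is not the KI or FCKI method: there is no automatic choice of k, no iterative imputation, and no fuzzy clustering. The function names, the use of complete rows as donors, and the normalisation of NRMSE by the range of the true values are all assumptions made for this example.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs with the mean of the k nearest complete rows (plain kNN, not KI/FCKI)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    incomplete = np.isnan(X).any(axis=1)
    donors = X[~incomplete]                      # assumption: only complete rows donate
    for i in np.where(incomplete)[0]:
        miss = np.isnan(X[i])
        # Euclidean distance computed over the columns observed in row i
        d = np.sqrt(((donors[:, ~miss] - X[i, ~miss]) ** 2).sum(axis=1))
        nearest = donors[np.argsort(d)[:k]]
        filled[i, miss] = nearest[:, miss].mean(axis=0)
    return filled

def rmse(true, pred):
    return float(np.sqrt(np.mean((true - pred) ** 2)))

def nrmse(true, pred):
    # Normalised by the range of the true values; other conventions exist
    return rmse(true, pred) / float(np.max(true) - np.min(true))

def mae(true, pred):
    return float(np.mean(np.abs(true - pred)))

# Toy data (hypothetical): the second column is exactly twice the first
X_true = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [5.0, 10.0]])
X_miss = X_true.copy()
X_miss[2, 1] = np.nan                            # hide one value
X_hat = knn_impute(X_miss, k=2)                  # neighbours hold 4.0 and 8.0, so 6.0 is imputed
col_rmse = rmse(X_true[:, 1], X_hat[:, 1])
```

In a comparison like the one described, values would be deleted under a chosen missingness mechanism and the three measures computed between the hidden and imputed values.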


The K Nearest Neighbor Algorithm for Imputation of Missing Longitudinal Prenatal Alcohol Data

10.21203/rs.3.rs-32456/v2 ◽

2021 ◽

Author(s):

Ayesha Sania ◽

Nicolo Pini ◽

Morgan Nelson ◽

Michael Myers ◽

Lauren Shuffrey ◽

...

Keyword(s):

Missing Data ◽

Missing Values ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Drinking Behavior ◽

Nearest Neighbors ◽

First Trimester ◽

Epidemiologic Studies ◽

K Nearest Neighbor ◽

Timeline Followback

Background: Missing data are a source of bias in epidemiologic studies. This is problematic in alcohol research, where data missingness is linked to drinking behavior. Methods: The Safe Passage study was a prospective investigation of prenatal drinking and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing data using a machine learning algorithm, K Nearest Neighbor (K-NN). K-NN imputes missing values for a participant using data from the participants closest to it. Imputed values were weighted for the distances from nearest neighbors and matched for day of week. Validation was done on randomly deleted data for 5-15 consecutive days. Results: Data from 5 nearest neighbors and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from first trimester data with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values. Conclusions: K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.
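The distance weighting mentioned in the methods can be sketched as follows. This is a minimal illustration, not the study's implementation: inverse-distance weights, Euclidean distance, the function name `weighted_knn_value`, and the toy numbers are assumptions, and day-of-week matching is left out (it would amount to taking each donor value from the same weekday as the missing day).

```python
import numpy as np

def weighted_knn_value(target_obs, donor_obs, donor_vals, k=5, eps=1e-9):
    """Inverse-distance-weighted average of the k nearest donors' values."""
    target_obs = np.asarray(target_obs, dtype=float)
    donor_obs = np.asarray(donor_obs, dtype=float)
    donor_vals = np.asarray(donor_vals, dtype=float)
    # Distance between the target's observed segment and each donor's segment
    d = np.sqrt(((donor_obs - target_obs) ** 2).sum(axis=1))
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + eps)                 # eps avoids division by zero for exact matches
    return float(np.sum(w * donor_vals[idx]) / np.sum(w))

# Hypothetical example: three donors at distances 1, 2 and 10 from the target
imputed = weighted_knn_value(
    target_obs=[0.0, 0.0],
    donor_obs=[[0.0, 1.0], [0.0, 2.0], [0.0, 10.0]],
    donor_vals=[1.0, 4.0, 100.0],
    k=2,
)
```

With k=2 the distant third donor is ignored, and the closer of the two remaining donors receives twice the weight of the other.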


Regression Models for Symbolic Interval-Valued Variables

Entropy ◽

10.3390/e23040429 ◽

2021 ◽

Vol 23 (4) ◽

pp. 429

Author(s):

Jose Emmanuel Chacón ◽

Oldemar Rodríguez

Keyword(s):

Regression Models ◽

Mean Squared Error ◽

Nearest Neighbors ◽

Support Vector ◽

K Nearest Neighbors ◽

R Language ◽

Squared Error ◽

Vector Machines ◽

Synthetic Datasets ◽

Interval Valued

This paper presents new approaches to fit regression models for symbolic interval-valued variables, which are shown to improve and extend the center method suggested by Billard and Diday and the center and range method proposed by Lima-Neto and De Carvalho. Like the previously mentioned methods, the proposed regression models consider the midpoints and half of the length of the intervals as additional variables. We considered various methods to fit the regression models, including tree-based models, K-nearest neighbors, support vector machines, and neural networks. The approaches proposed in this paper were applied to a real dataset and to synthetic datasets generated with linear and nonlinear relations. For an evaluation of the methods, the root-mean-squared error and the correlation coefficient were used. The methods presented herein are available in the RSDA package written in the R language, which can be installed from CRAN.
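To make the midpoint and half-length idea concrete, here is a minimal sketch of a center-and-range style fit with ordinary least squares: one linear model for the interval midpoints and a second for the half-ranges. It does not use the RSDA package or the models proposed in the paper; the function names and the toy intervals are hypothetical, and nothing here prevents a predicted half-range from becoming negative.

```python
import numpy as np

def fit_center_range(x_lo, x_hi, y_lo, y_hi):
    """Fit separate least-squares lines for interval midpoints and half-ranges."""
    xc, xr = (x_lo + x_hi) / 2, (x_hi - x_lo) / 2
    yc, yr = (y_lo + y_hi) / 2, (y_hi - y_lo) / 2
    A_c = np.column_stack([np.ones_like(xc), xc])
    A_r = np.column_stack([np.ones_like(xr), xr])
    beta_c, *_ = np.linalg.lstsq(A_c, yc, rcond=None)   # midpoint model
    beta_r, *_ = np.linalg.lstsq(A_r, yr, rcond=None)   # half-range model
    return beta_c, beta_r

def predict_interval(beta_c, beta_r, x_lo, x_hi):
    """Predict [lower, upper] as predicted midpoint -/+ predicted half-range."""
    xc, xr = (x_lo + x_hi) / 2, (x_hi - x_lo) / 2
    yc = beta_c[0] + beta_c[1] * xc
    yr = beta_r[0] + beta_r[1] * xr
    return yc - yr, yc + yr

# Hypothetical intervals built so that yc = 1 + 2*xc and yr = 0.5 + 0.5*xr
x_lo = np.array([0.0, 1.0, 2.0, 3.0])
x_hi = np.array([2.0, 3.0, 6.0, 5.0])
y_lo = np.array([2.0, 4.0, 7.5, 8.0])
y_hi = np.array([4.0, 6.0, 10.5, 10.0])

beta_c, beta_r = fit_center_range(x_lo, x_hi, y_lo, y_hi)
lo, hi = predict_interval(beta_c, beta_r, np.array([4.0]), np.array([6.0]))
```

Replacing the two least-squares fits with other learners (trees, K-nearest neighbors, and so on) keeps the same structure: one model per derived variable.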


A comparison of missing data methods for hypothesis tests of the treatment effect in substance abuse clinical trials: a Monte-Carlo simulation study

Substance Abuse Treatment Prevention and Policy ◽

10.1186/1747-597x-3-13 ◽

2008 ◽

Vol 3 (1) ◽

Cited By ~ 5

Author(s):

Sarra L Hedden ◽

Robert F Woolson ◽

Robert J Malcolm

Keyword(s):

Substance Abuse ◽

Monte Carlo Simulation ◽

Clinical Trials ◽

Monte Carlo ◽

Missing Data ◽

Simulation Study ◽

Treatment Effect ◽

Hypothesis Tests ◽

Monte Carlo Simulation Study


Statistical Tests for the Reciprocal of a Normal Mean with a Known Coefficient of Variation

Journal of Probability and Statistics ◽

10.1155/2015/723924 ◽

2015 ◽

Vol 2015 ◽

pp. 1-5

Author(s):

Wararit Panichkitkosolkul

Keyword(s):

Monte Carlo Simulation ◽

Coefficient Of Variation ◽

Statistical Tests ◽

Taylor Series Expansion ◽

Type I ◽

Type I Errors ◽

Monte Carlo Simulation Study ◽

Asymptotic Test ◽

Normal Mean ◽

Simulation Results

An asymptotic test and an approximate test for the reciprocal of a normal mean with a known coefficient of variation were proposed in this paper. The asymptotic test was based on the expectation and variance of the estimator of the reciprocal of a normal mean. The approximate test used the approximate expectation and variance of the estimator by Taylor series expansion. A Monte Carlo simulation study was conducted to compare the performance of the two statistical tests. Simulation results showed that the two proposed tests performed well in terms of empirical type I errors and power. Nevertheless, the approximate test was easier to compute than the asymptotic test.
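The abstract does not give the test statistics, so the sketch below only illustrates the underlying idea: a Taylor series (delta-method) approximation to the mean and variance of the reciprocal of a sample mean, checked against a Monte Carlo simulation. The approximations used are the standard second-order ones and may differ from the paper's expressions; the function names and parameter values are assumptions, and the check is only meaningful when the mean is many standard errors away from zero.

```python
import numpy as np

def taylor_moments(mu, sigma, n):
    """Taylor approximations for the mean and variance of 1 / (sample mean)."""
    tau = sigma / mu                                  # coefficient of variation
    approx_mean = (1.0 / mu) * (1.0 + tau**2 / n)     # 1/mu + Var(xbar)/mu^3
    approx_var = tau**2 / (n * mu**2)                 # Var(xbar)/mu^4
    return approx_mean, approx_var

def simulate_reciprocal_mean(mu, sigma, n, reps=100_000, seed=0):
    """Monte Carlo mean and variance of 1 / (sample mean) for normal samples."""
    rng = np.random.default_rng(seed)
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    est = 1.0 / xbar
    return float(est.mean()), float(est.var(ddof=1))

# Hypothetical setting: mu = 10, sigma = 2 (coefficient of variation 0.2), n = 25
approx_mean, approx_var = taylor_moments(mu=10.0, sigma=2.0, n=25)
mc_mean, mc_var = simulate_reciprocal_mean(mu=10.0, sigma=2.0, n=25)
```

A simulation study of the kind described would repeat such draws under the null hypothesis and record how often each test rejects, giving the empirical type I error.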


NMF-Based Approach for Missing Values Imputation of Mass Spectrometry Metabolomics Data

Molecules ◽

10.3390/molecules26195787 ◽

2021 ◽

Vol 26 (19) ◽

pp. 5787

Author(s):

Jingjing Xu ◽

Yuanshan Wang ◽

Xiangnan Xu ◽

Kian-Kai Cheng ◽

Daniel Raftery ◽

...

Keyword(s):

Mass Spectrometry ◽

Missing Values ◽

Nearest Neighbors ◽

Imputation Method ◽

Challenging Problem ◽

K Nearest Neighbors ◽

Metabolomics Data ◽

Global And Local ◽

Sample Heterogeneity ◽

Non Negative Matrix Factorization

In mass spectrometry (MS)-based metabolomics, missing values (NAs) may be due to different causes, including sample heterogeneity, ion suppression, spectral overlap, inappropriate data processing, and instrumental errors. Although a number of methodologies have been applied to handle NAs, NA imputation remains a challenging problem. Here, we propose a non-negative matrix factorization (NMF)-based method for NA imputation in MS-based metabolomics data, which makes use of both global and local information of the data. The proposed method was compared with three commonly used methods: k-nearest neighbors (kNN), random forest (RF), and outlier-robust (ORI) missing values imputation. These methods were evaluated from the perspectives of accuracy of imputation, retrieval of data structures, and rank of imputation superiority. The experimental results showed that the NMF-based method is well-adapted to various cases of data missingness and the presence of outliers in MS-based metabolic profiles. It outperformed kNN and ORI and showed results comparable with the RF method. Furthermore, the NMF method is more robust and less susceptible to outliers as compared with the RF method. The proposed NMF-based scheme may serve as an alternative NA imputation method which may facilitate biological interpretations of metabolomics data.


The K nearest neighbor algorithm for imputation of missing longitudinal prenatal alcohol data

10.21203/rs.3.rs-32456/v1 ◽

2020 ◽

Author(s):

Ayesha Sania ◽

Nicolò Pini ◽

Morgan E. Nelson ◽

Michael M. Myers ◽

Lauren C. Shuffrey ◽

...

Keyword(s):

Missing Data ◽

Alcohol Consumption ◽

Missing Values ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Drinking Behavior ◽

Nearest Neighbors ◽

K Nearest Neighbor ◽

Data Set ◽

Prenatal Alcohol

Background: Missing data are a source of bias in many epidemiologic studies. This is problematic in alcohol research, where data missingness may not be random because it depends on patterns of drinking behavior. Methods: The Safe Passage Study was a prospective investigation of prenatal alcohol consumption and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing exposure data using a machine learning algorithm, K Nearest Neighbor (K-NN). K-NN imputes missing values for a participant using data from the other participants closest to it. Since participants with no missing days may not be comparable to those with missing data, segments from those with complete and incomplete data were included as a reference. Imputed values were weighted for the distances from nearest neighbors and matched for day of week. We validated our approach by randomly deleting non-missing data for 5-15 consecutive days. Results: We found that data from 5 nearest neighbors (i.e. K=5) and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from a first trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values. Conclusions: K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.


Incomplete Data Analysis

Applications of Pattern Recognition ◽

10.5772/intechopen.94068 ◽

2021 ◽

Author(s):

Bo-Wei Chen ◽

Jia-Ching Wang

Keyword(s):

Machine Learning ◽

Pattern Recognition ◽

Data Analysis ◽

Future Development ◽

Missing Values ◽

Regression Tree ◽

Nearest Neighbors ◽

K Nearest Neighbors ◽

Numerical Examples ◽

Single Imputation

This chapter discusses missing-value problems from the perspective of machine learning. Missing values frequently occur during data acquisition. When a dataset contains missing values, nonvectorial data are generated. This causes a serious problem for pattern recognition models because nonvectorial data need further data wrangling before models can be built. In view of this, the chapter reviews the methodologies of related works and examines their empirical effectiveness. A great deal of effort has been devoted to this field, and the resulting works can be roughly divided into two types: multiple imputation and single imputation. The latter can be further classified into subcategories, which include deletion, fixed-value replacement, K-Nearest Neighbors, regression, tree-based algorithms, and latent component-based approaches. These approaches are introduced and discussed in this chapter. Finally, numerical examples are provided along with recommendations on future development.


NOVEL ENSEMBLE TECHNIQUES FOR REGRESSION WITH MISSING DATA

New Mathematics and Natural Computation ◽

10.1142/s1793005709001477 ◽

2009 ◽

Vol 05 (03) ◽

pp. 635-652 ◽

Cited By ~ 3

Author(s):

MOSTAFA M. HASSAN ◽

AMIR F. ATIYA ◽

NEAMAT EL GAYAR ◽

RAAFAT EL-FOULY

Keyword(s):

Distribution Function ◽

Missing Data ◽

Probability Distribution ◽

Network Model ◽

Probability Distribution Function ◽

Missing Values ◽

Data Sets ◽

Training Set ◽

Ensemble Techniques ◽

Simulation Results

In this paper, we consider the problem of missing data and develop an ensemble-network model for handling it. The proposed method uses the inherent uncertainty of the missing records to generate diverse training sets for the ensemble's networks. Specifically, we generate the missing values using their probability distribution function. We repeat this procedure many times, thereby creating a number of complete data sets. A network is trained on each of these data sets, yielding an ensemble of networks. Several variants are proposed, and we show analytically that one of these variants is superior to the conventional mean-substitution approach in the limit of a large training set. Simulation results confirm the general superiority of the proposed methods compared to the conventional approaches.
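The procedure described here (draw the missing values repeatedly, train one model per completed data set, combine the models) can be sketched as below. This is an illustration under simplifying assumptions, not the paper's method: a linear least-squares model stands in for each network, missing values are drawn from an independent normal distribution fitted to each column's observed values, and the function names and simulated data are hypothetical.

```python
import numpy as np

def fit_imputation_ensemble(X, y, n_members=20, seed=0):
    """Train one linear model per randomly completed copy of X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    mu = np.nanmean(X, axis=0)          # per-column mean of the observed values
    sd = np.nanstd(X, axis=0)           # per-column spread of the observed values
    members = []
    for _ in range(n_members):
        draws = rng.normal(mu, sd, size=X.shape)
        X_full = np.where(mask, draws, X)            # fill only the missing cells
        A = np.column_stack([np.ones(len(X_full)), X_full])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        members.append(coef)
    return np.array(members)

def predict_ensemble(members, X_new):
    """Average the predictions of all ensemble members."""
    A = np.column_stack([np.ones(len(X_new)), X_new])
    return (A @ members.T).mean(axis=1)

# Simulated data (hypothetical): y = 1 + 2*x0 - x1 + noise, about 10% of cells missing
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
X_obs = X.copy()
X_obs[rng.random(X.shape) < 0.1] = np.nan

members = fit_imputation_ensemble(X_obs, y)
pred = predict_ensemble(members, np.array([[0.0, 0.0]]))
```

Because each member sees a different completion of the data, the spread of the members' predictions gives a rough sense of how much the missing values matter for a given input.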
