Missing data treatment method on cluster analysis

2015 ◽ 
Vol 5 (1) ◽ 
pp. 15
Author(s):  
Elsiddig Elsadig Mohamed Koko ◽  
Amin Ibrahim Adam Mohamed

Missing data in household health surveys challenge researchers because they leave the analysis incomplete. In this study, cluster analysis is applied to data collected in Sudan's 2006 household health survey.

The research focuses specifically on data analysis, with the objective of handling missing values in cluster analysis. Two-Step Cluster Analysis is applied, in which each participant is classified into one of the identified patterns and the optimal number of classes is determined using SPSS Statistics (IBM). Because cluster analysis is a multivariable statistical technique, the risk of over-fitting the data must be considered. Like other multivariable statistical techniques, cluster analysis excludes any observation with missing data. Therefore, before the cluster analysis is performed, missing values are imputed using multiple imputation (SPSS Statistics/IBM). The clustering results are displayed in tables: descriptive statistics and cluster frequencies are produced for the final cluster model, while the information criterion table reports results for a range of cluster solutions. A further goal of the analysis is to reduce the bias that arises when non-respondents differ from respondents and to bring the sample data up to the dimensions of the target population totals.
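
The impute-then-cluster workflow described above can be sketched outside SPSS. The following is a minimal Python analogue (all data and parameter choices here are hypothetical): a single pass of a chained-equations imputer stands in for multiple imputation, and a Gaussian mixture's BIC across candidate cluster counts stands in for the two-step procedure's information criterion table.

```python
# Illustrative analogue (not the authors' SPSS workflow): impute missing
# values first, then cluster and pick the number of clusters by BIC.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% of values missing at random

# Step 1: impute (a single pass of a chained-equations imputer; true
# multiple imputation would repeat this m times and pool the results).
X_imp = IterativeImputer(random_state=0).fit_transform(X)

# Step 2: fit candidate cluster solutions and compare information criteria,
# mirroring the "information criterion table" in the abstract.
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X_imp)
    print(f"k={k}: BIC={gm.bic(X_imp):.1f}")
```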


Author(s):  
Fan Ye ◽  
Yong Wang

Data quality, including record inaccuracy and missingness (incompletely recorded crashes and crash underreporting), has always been a concern in crash data analysis. Limited efforts have been made to handle specific aspects of crash data quality problems, such as using weights in estimation to account for unreported crashes and applying multiple imputation (MI) to fill in missing information on drivers' attention status before crashes. Yet a general investigation of how different statistical methods perform in handling missing crash data has been lacking. This paper explores and evaluates the performance of three missing data treatments, complete-case analysis (CC), inverse probability weighting (IPW) and MI, in crash severity modeling using the ordered probit model. CC discards crash records with missing information on any of the variables; IPW includes weights in estimation to adjust for bias, using each complete record's estimated probability of being a complete case; and MI imputes the missing values based on the conditional distribution of the variable with missing information given the observed data. These treatments perform differently in model estimation. Based on analysis of both simulated and real crash data, this paper suggests that the choice of an appropriate missing data treatment should be based on sample size and data missing rate. Meanwhile, it is recommended that MI be used for incompletely recorded crash data and IPW for unreported crashes before applying crash severity models to crash data.
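
A compact sketch of the three treatments follows; it is illustrative, not the authors' code. A linear model (OLS/WLS) stands in for the paper's ordered probit to keep the example short, and all variable names and data are hypothetical.

```python
# Sketch of CC, IPW, and MI on a toy crash-severity dataset.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 2000
speed = rng.normal(60, 10, n)
age = rng.normal(40, 12, n)
severity = 0.05 * speed + 0.02 * age + rng.normal(size=n)
df = pd.DataFrame({"severity": severity, "speed": speed, "age": age})
df.loc[rng.random(n) < 0.3, "age"] = np.nan   # 30% of ages unrecorded

complete = df["age"].notna()
X = sm.add_constant(df[["speed", "age"]])

# 1) Complete-case analysis: drop records missing any covariate.
cc = sm.OLS(df.loc[complete, "severity"], X[complete]).fit()

# 2) IPW: model P(complete | fully observed vars), then weight complete
#    cases by the inverse of that estimated probability.
p = sm.Logit(complete.astype(int), sm.add_constant(df[["speed"]])).fit(disp=0)
w = 1.0 / p.predict(sm.add_constant(df.loc[complete, ["speed"]]))
ipw = sm.WLS(df.loc[complete, "severity"], X[complete], weights=w).fit()

# 3) MI: impute m times from a conditional model, fit each completed
#    dataset, and average the point estimates (Rubin's rules).
betas = []
for m in range(5):
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    dfm = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
    betas.append(sm.OLS(dfm["severity"],
                        sm.add_constant(dfm[["speed", "age"]])).fit().params)
mi = pd.concat(betas, axis=1).mean(axis=1)
print(cc.params, ipw.params, mi, sep="\n")
```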


2013 ◽  
Vol 11 (7) ◽  
pp. 2779-2786
Author(s):  
Rahul Singhai

One relevant problem in data preprocessing is the presence of missing data, which leads to poor-quality patterns extracted after mining. Imputation is one of the widely used procedures for replacing the missing values in a data set with probable values. The advantage of this approach is that the missing data treatment is independent of the learning algorithm used, allowing the user to select the most suitable imputation method for each situation. This paper analyzes various imputation methods proposed in the field of statistics with respect to data mining. A comparative analysis of three different imputation approaches for missing attribute values in data mining is given, identifying the most promising method. An artificial numeric input data file of 1,000 records is used to investigate the performance of these methods, and a Z-test approach is used to test their significance.
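
The abstract does not name the three imputation approaches compared, so the sketch below uses mean, kNN, and iterative (regression-based) imputation as plausible stand-ins, scored on a synthetic 1,000-record numeric file with held-out true values.

```python
# Compare three imputation methods on a synthetic numeric file, testing
# imputed-vs-true values with a large-sample two-sample test.
import numpy as np
from scipy import stats
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(42)
truth = rng.normal(50, 10, size=(1000, 5))
X = truth.copy()
mask = rng.random(X.shape) < 0.15          # remove 15% of values
X[mask] = np.nan

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("kNN", KNNImputer(n_neighbors=5)),
                      ("iterative", IterativeImputer(random_state=0))]:
    filled = imputer.fit_transform(X)
    # Z-test via its large-sample t equivalent: do the imputed values
    # differ systematically from the held-out true values?
    z, pval = stats.ttest_ind(filled[mask], truth[mask])
    rmse = np.sqrt(np.mean((filled[mask] - truth[mask]) ** 2))
    print(f"{name:9s} RMSE={rmse:.2f}  z={z:.2f}  p={pval:.3f}")
```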


Mathematics ◽  
2021 ◽  
Vol 9 (19) ◽  
pp. 2474
Author(s):  
Nitzan Cohen ◽  
Yakir Berchenko

Information criteria such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC) are commonly used for model selection. However, the current theory does not support unconventional data, so naive use of these criteria is not suitable for data with missing values. Imputation, at the core of most alternative methods, both distorts results and is computationally demanding. We propose a new approach that enables the use of classic, well-known information criteria for model selection when there are missing data. We adapt the current theory of information criteria through normalization, accounting for the different sample sizes used for each candidate model (focusing on AIC and BIC). Interestingly, when the sample sizes differ, our theoretical analysis finds that AIC_j/n_j is the proper correction for AIC_j that we need to optimize (where n_j is the sample size available to the j-th model), while −(BIC_j − BIC_i)/(n_j − n_i) is the correction for BIC. Furthermore, we find that the computational complexity of normalized information criteria methods is exponentially better than that of imputation methods. In a series of simulation studies, we find that normalized-AIC and normalized-BIC outperform previous methods (i.e., normalized-AIC is more efficient, and normalized-BIC includes only important variables, although it tends to exclude some of them in cases of large correlation). We propose three additional methods aimed at increasing the statistical efficiency of normalized-AIC: post-selection imputation, Akaike sub-model averaging, and minimum-variance averaging. The latter succeeds in increasing efficiency further.
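
To make the normalization concrete, here is a small sketch (toy data, hypothetical models, not the authors' code): each candidate model is fit on its own complete cases, so the models see different sample sizes n_j, and AIC_j/n_j rather than raw AIC_j is compared across models.

```python
# Normalized-AIC sketch: models with different available sample sizes.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 1000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
x2[rng.random(n) < 0.4] = np.nan            # x2 is 40% missing

candidates = {"y ~ x1": [x1], "y ~ x1 + x2": [x1, x2]}
for name, cols in candidates.items():
    X = np.column_stack([np.ones(n)] + cols)
    keep = ~np.isnan(X).any(axis=1)          # complete cases for this model
    fit = sm.OLS(y[keep], X[keep]).fit()
    print(f"{name:12s} n_j={keep.sum():4d}  AIC_j={fit.aic:8.1f}  "
          f"AIC_j/n_j={fit.aic / keep.sum():.3f}")
```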


1990 ◽  
Vol 29 (03) ◽  
pp. 200-204 ◽  
Author(s):  
J. A. Koziol

A basic problem of cluster analysis is the determination or selection of the number of clusters evinced in any set of data. We address this issue with multinomial data using Akaike's information criterion and demonstrate its utility in identifying an appropriate number of clusters of tumor types with similar profiles of cell surface antigens.
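
The same selection rule is easy to demonstrate on synthetic data. The sketch below is only an analogue: the paper treats multinomial antigen-profile data, whereas a Gaussian mixture is used here to keep the example short; AIC is computed for k = 1..6 and the minimizer is chosen.

```python
# Choose the number of clusters by minimizing AIC over candidate k.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(4, 1, (100, 2)),
               rng.normal((0, 5), 1, (100, 2))])  # 3 true clusters

aics = {k: GaussianMixture(k, random_state=0).fit(X).aic(X)
        for k in range(1, 7)}
best = min(aics, key=aics.get)
print(aics, "-> chosen number of clusters:", best)
```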


Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, the result can be a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to the available software, using these modern missing data methods does not pose a major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that led to the missing values in an empirical study. As Part 1, this article first introduces Rubin's classical definition of missing data mechanisms and an alternative, variable-based taxonomy, which provides a graphical representation. Second, it presents a selection of visualization tools available in different R packages for describing and exploring missing data structures.
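
Rubin's three mechanisms can be made concrete with a small simulation. The sketch below uses Python for consistency with the other examples (the article itself discusses R packages), and all variables are hypothetical.

```python
# Generate missingness under MCAR, MAR, and MNAR on a toy dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 1000
income = rng.lognormal(10, 0.5, n)
age = rng.integers(18, 80, n).astype(float)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: missingness is independent of everything.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.2, "income"] = np.nan

# MAR: missingness in income depends only on the observed variable age.
mar = df.copy()
mar.loc[rng.random(n) < (age - 18) / 62 * 0.4, "income"] = np.nan

# MNAR: missingness in income depends on income itself.
mnar = df.copy()
mnar.loc[income > np.quantile(income, 0.8), "income"] = np.nan

for name, d in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, f"missing rate: {d['income'].isna().mean():.2f}")
```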


2021 ◽  
Vol 12 (1) ◽  
pp. 43
Author(s):  
Xingchen Yan ◽  
Xiaofei Ye ◽  
Jun Chen ◽  
Tao Wang ◽  
Zhen Yang ◽  
...  

Cycling is an increasingly popular mode of transport as part of the response to air pollution, urban congestion, and public health issues. The emergence of bike sharing programs and electric bicycles has also brought about notable changes in cycling characteristics, especially cycling speed. In order to provide a better basis for bicycle-related traffic simulations and theoretical derivations, this study sought the best-fitting distribution for bicycle riding speed considering cyclist characteristics, vehicle type, and track attributes. K-means clustering was performed on speed subcategories, with the optimal number of clusters selected using the L method. Then, 15 common models were fitted to the grouped speed data, and the Kolmogorov–Smirnov test, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) were applied to determine the best-fit distribution. The following results were acquired: (1) the bicycle speed sub-clusters generated by combinations of bicycle type, bicycle lateral position, gender, age, and lane width were grouped into three clusters; (2) among the common distributions, the generalized extreme value, gamma, and lognormal were the top three models for the three clusters of the speed dataset; and (3) integrating stability and overall performance, the generalized extreme value was the best-fit distribution for bicycle speed.
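
The fit-and-compare step can be sketched as follows (synthetic speed data; only three of the paper's 15 candidate models are shown): each distribution is fit by maximum likelihood, then ranked by KS statistic and AIC.

```python
# Fit candidate distributions to speed data and compare KS statistic / AIC.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
speeds = stats.genextreme.rvs(c=-0.1, loc=15, scale=3,
                              size=500, random_state=5)  # km/h, synthetic

for name, dist in [("genextreme", stats.genextreme),
                   ("gamma", stats.gamma),
                   ("lognorm", stats.lognorm)]:
    params = dist.fit(speeds)                       # maximum likelihood fit
    ks = stats.kstest(speeds, name, args=params)    # goodness of fit
    loglik = np.sum(dist.logpdf(speeds, *params))
    aic = 2 * len(params) - 2 * loglik
    print(f"{name:10s} KS={ks.statistic:.3f}  AIC={aic:.1f}")
```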


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Rahi Jain ◽  
Wei Xu

Background: Developing statistical and machine learning methods on studies with missing information is a ubiquitous challenge in real-world biological research. Strategies in the literature rely on either removing samples with missing values, as in complete case analysis (CCA), or imputing the missing information, as in predictive mean matching (PMM) as implemented in MICE. Limitations of these strategies include information loss and the question of how close the imputed values are to the true missing values. Further, in scenarios with piecemeal medical data, these strategies must wait for the data collection process to finish before a complete dataset is available for statistical modeling.

Method and results: This study proposes a dynamic model updating (DMU) approach, a different strategy for developing statistical models with missing data. DMU uses only the information available in the dataset to prepare the statistical models. It segments the original dataset into small complete datasets using hierarchical clustering and then performs Bayesian regression on each of the small complete datasets. Predictor estimates are updated using the posterior estimates from each dataset. The performance of DMU is evaluated on both simulated data and real studies and shows results that are better than, or on par with, approaches such as CCA and PMM.

Conclusion: The DMU approach provides an alternative to the existing approaches of information elimination and imputation when processing datasets with missing values. While the study applied the approach to continuous cross-sectional data, it can also be applied to longitudinal, categorical, and time-to-event biological data.
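
A loose sketch of the DMU idea follows; it is not the authors' algorithm. Grouping rows by exact missingness pattern stands in for hierarchical clustering, and a precision-weighted average of per-segment coefficients stands in for the paper's posterior updating.

```python
# Segment by missingness pattern, fit Bayesian regression per segment,
# combine coefficient estimates by precision weighting.
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(8)
n = 600
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"] - X["x2"] + 0.5 * X["x3"] + rng.normal(size=n)
X.loc[rng.random(n) < 0.3, "x3"] = np.nan   # one block arrives without x3

num, den = {}, {}                            # precision-weighted running sums
for _, sub in X.groupby(X.isna().apply(tuple, axis=1)):
    cols = [c for c in X.columns if sub[c].notna().all()]
    model = BayesianRidge().fit(sub[cols], y[sub.index])
    var = np.diag(model.sigma_)              # posterior variance of coefs
    for c, b, v in zip(cols, model.coef_, var):
        num[c] = num.get(c, 0) + b / v
        den[c] = den.get(c, 0) + 1 / v
print({c: num[c] / den[c] for c in num})     # combined coefficient estimates
```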


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing-data mechanisms are applied to the data set, and five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is missForest, an iterative imputation method related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithmic transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values in the NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism had the lowest RMSE and MAE. We conclude that MI using the missForest approach is highly accurate in estimating missing values: missForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and can thus be considered appropriate for analyzing air quality data.
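
missForest itself is an R package; a commonly used Python analogue is an iterative imputer with random-forest base learners. The sketch below (synthetic data, arbitrary parameters) also shows the RMSE/MAE scoring against held-out true values used in the paper's evaluation.

```python
# missForest-style imputation via a random-forest iterative imputer,
# scored on artificially removed values.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(9)
truth = np.exp(rng.normal(0, 0.5, size=(800, 5)))  # log-normal "pollutants"
X = np.log(truth)                                   # log-transform first
mask = rng.random(X.shape) < 0.2                    # 20% missing level
X_miss = X.copy()
X_miss[mask] = np.nan

imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0)
X_filled = imp.fit_transform(X_miss)

rmse = np.sqrt(mean_squared_error(X[mask], X_filled[mask]))
mae = mean_absolute_error(X[mask], X_filled[mask])
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}")
```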

