Missing data treatment method on cluster analysis

2015 ◽ 
Vol 5 (1) ◽ 
pp. 15
Author(s):  
Elsiddig Elsadig Mohamed Koko ◽  
Amin Ibrahim Adam Mohamed

Missing data in household health surveys challenge researchers because they leave the analysis incomplete. In this study, cluster analysis is applied to data collected in Sudan's 2006 household health survey.

The research focuses specifically on data analysis, with the objective of handling missing values in cluster analysis. Two-Step Cluster Analysis is applied, in which each participant is classified into one of the identified patterns and the optimal number of classes is determined using SPSS Statistics (IBM). Because cluster analysis is a multivariable statistical technique, the risk of over-fitting the data must be considered. Like other multivariable statistical techniques, cluster analysis excludes any observation with missing data. Therefore, before the cluster analysis is performed, missing values are imputed using multiple imputation (SPSS Statistics/IBM). The clustering results are displayed in tables: descriptive statistics and cluster frequencies are produced for the final cluster model, while the information criterion table reports results for a range of cluster solutions. A further goal of the analysis is to reduce the bias that arises when non-respondents differ from respondents and to bring the sample data up to the dimensions of the target population totals.
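
The impute-then-cluster workflow described above can be sketched outside SPSS. The following is a minimal Python analogue (all data and parameter choices here are hypothetical): a single pass of a chained-equations imputer stands in for multiple imputation, and a Gaussian mixture's BIC across candidate cluster counts stands in for the two-step procedure's information criterion table.

```python
# Illustrative analogue (not the authors' SPSS workflow): impute missing
# values first, then cluster and pick the number of clusters by BIC.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% of values missing at random

# Step 1: impute (a single pass of a chained-equations imputer; true
# multiple imputation would repeat this m times and pool the results).
X_imp = IterativeImputer(random_state=0).fit_transform(X)

# Step 2: fit candidate cluster solutions and compare information criteria,
# mirroring the "information criterion table" in the abstract.
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X_imp)
    print(f"k={k}: BIC={gm.bic(X_imp):.1f}")
```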


Author(s):  
Fan Ye ◽  
Yong Wang

Data quality, including record inaccuracy and missingness (incompletely recorded crashes and crash underreporting), has always been a concern in crash data analysis. Limited efforts have been made to handle specific aspects of crash data quality problems, such as using weights in estimation to account for unreported crashes and applying multiple imputation (MI) to fill in missing information on drivers' attention status before crashes. Yet a general investigation of how different statistical methods perform in handling missing crash data has been lacking. This paper explores and evaluates the performance of three missing data treatments, complete-case analysis (CC), inverse probability weighting (IPW) and MI, in crash severity modeling using the ordered probit model. CC discards crash records with missing information on any of the variables; IPW includes weights in estimation to adjust for bias, using each complete record's estimated probability of being a complete case; and MI imputes the missing values based on the conditional distribution of the variable with missing information given the observed data. These treatments perform differently in model estimation. Based on analysis of both simulated and real crash data, this paper suggests that the choice of an appropriate missing data treatment should be based on sample size and data missing rate. Meanwhile, it is recommended that MI be used for incompletely recorded crash data and IPW for unreported crashes before applying crash severity models to crash data.
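
A compact sketch of the three treatments follows; it is illustrative, not the authors' code. A linear model (OLS/WLS) stands in for the paper's ordered probit to keep the example short, and all variable names and data are hypothetical.

```python
# Sketch of CC, IPW, and MI on a toy crash-severity dataset.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 2000
speed = rng.normal(60, 10, n)
age = rng.normal(40, 12, n)
severity = 0.05 * speed + 0.02 * age + rng.normal(size=n)
df = pd.DataFrame({"severity": severity, "speed": speed, "age": age})
df.loc[rng.random(n) < 0.3, "age"] = np.nan   # 30% of ages unrecorded

complete = df["age"].notna()
X = sm.add_constant(df[["speed", "age"]])

# 1) Complete-case analysis: drop records missing any covariate.
cc = sm.OLS(df.loc[complete, "severity"], X[complete]).fit()

# 2) IPW: model P(complete | fully observed vars), then weight complete
#    cases by the inverse of that estimated probability.
p = sm.Logit(complete.astype(int), sm.add_constant(df[["speed"]])).fit(disp=0)
w = 1.0 / p.predict(sm.add_constant(df.loc[complete, ["speed"]]))
ipw = sm.WLS(df.loc[complete, "severity"], X[complete], weights=w).fit()

# 3) MI: impute m times from a conditional model, fit each completed
#    dataset, and average the point estimates (Rubin's rules).
betas = []
for m in range(5):
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    dfm = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
    betas.append(sm.OLS(dfm["severity"],
                        sm.add_constant(dfm[["speed", "age"]])).fit().params)
mi = pd.concat(betas, axis=1).mean(axis=1)
print(cc.params, ipw.params, mi, sep="\n")
```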


2013 ◽  
Vol 11 (7) ◽  
pp. 2779-2786
Author(s):  
Rahul Singhai

One relevant problem in data preprocessing is the presence of missing data, which leads to poor-quality patterns extracted after mining. Imputation is one of the widely used procedures for replacing the missing values in a data set with probable values. The advantage of this approach is that the missing data treatment is independent of the learning algorithm used, allowing the user to select the most suitable imputation method for each situation. This paper analyzes various imputation methods proposed in the field of statistics with respect to data mining. A comparative analysis of three different imputation approaches for missing attribute values in data mining is given, identifying the most promising method. An artificial numeric input data file of 1,000 records is used to investigate the performance of these methods, and a Z-test approach is used to test their significance.
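
The abstract does not name the three imputation approaches compared, so the sketch below uses mean, kNN, and iterative (regression-based) imputation as plausible stand-ins, scored on a synthetic 1,000-record numeric file with held-out true values.

```python
# Compare three imputation methods on a synthetic numeric file, testing
# imputed-vs-true values with a large-sample two-sample test.
import numpy as np
from scipy import stats
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(42)
truth = rng.normal(50, 10, size=(1000, 5))
X = truth.copy()
mask = rng.random(X.shape) < 0.15          # remove 15% of values
X[mask] = np.nan

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("kNN", KNNImputer(n_neighbors=5)),
                      ("iterative", IterativeImputer(random_state=0))]:
    filled = imputer.fit_transform(X)
    # Z-test via its large-sample t equivalent: do the imputed values
    # differ systematically from the held-out true values?
    z, pval = stats.ttest_ind(filled[mask], truth[mask])
    rmse = np.sqrt(np.mean((filled[mask] - truth[mask]) ** 2))
    print(f"{name:9s} RMSE={rmse:.2f}  z={z:.2f}  p={pval:.3f}")
```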


Mathematics ◽  
2021 ◽  
Vol 9 (19) ◽  
pp. 2474
Author(s):  
Nitzan Cohen ◽  
Yakir Berchenko

Information criteria such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC) are commonly used for model selection. However, the current theory does not support unconventional data, so naive use of these criteria is not suitable for data with missing values. Imputation, at the core of most alternative methods, both distorts results and is computationally demanding. We propose a new approach that enables the use of classic, well-known information criteria for model selection when there are missing data. We adapt the current theory of information criteria through normalization, accounting for the different sample sizes used for each candidate model (focusing on AIC and BIC). Interestingly, when the sample sizes differ, our theoretical analysis finds that AIC_j/n_j is the proper correction for AIC_j that we need to optimize (where n_j is the sample size available to the j-th model), while −(BIC_j − BIC_i)/(n_j − n_i) is the correction for BIC. Furthermore, we find that the computational complexity of normalized information criteria methods is exponentially better than that of imputation methods. In a series of simulation studies, we find that normalized-AIC and normalized-BIC outperform previous methods (i.e., normalized-AIC is more efficient, and normalized-BIC includes only important variables, although it tends to exclude some of them in cases of large correlation). We propose three additional methods aimed at increasing the statistical efficiency of normalized-AIC: post-selection imputation, Akaike sub-model averaging, and minimum-variance averaging. The latter succeeds in increasing efficiency further.
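
To make the normalization concrete, here is a small sketch (toy data, hypothetical models, not the authors' code): each candidate model is fit on its own complete cases, so the models see different sample sizes n_j, and AIC_j/n_j rather than raw AIC_j is compared across models.

```python
# Normalized-AIC sketch: models with different available sample sizes.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 1000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
x2[rng.random(n) < 0.4] = np.nan            # x2 is 40% missing

candidates = {"y ~ x1": [x1], "y ~ x1 + x2": [x1, x2]}
for name, cols in candidates.items():
    X = np.column_stack([np.ones(n)] + cols)
    keep = ~np.isnan(X).any(axis=1)          # complete cases for this model
    fit = sm.OLS(y[keep], X[keep]).fit()
    print(f"{name:12s} n_j={keep.sum():4d}  AIC_j={fit.aic:8.1f}  "
          f"AIC_j/n_j={fit.aic / keep.sum():.3f}")
```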


1990 ◽  
Vol 29 (03) ◽  
pp. 200-204 ◽  
Author(s):  
J. A. Koziol

A basic problem of cluster analysis is the determination or selection of the number of clusters evinced in any set of data. We address this issue with multinomial data using Akaike's information criterion and demonstrate its utility in identifying an appropriate number of clusters of tumor types with similar profiles of cell surface antigens.
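
The same selection rule is easy to demonstrate on synthetic data. The sketch below is only an analogue: the paper treats multinomial antigen-profile data, whereas a Gaussian mixture is used here to keep the example short; AIC is computed for k = 1..6 and the minimizer is chosen.

```python
# Choose the number of clusters by minimizing AIC over candidate k.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(4, 1, (100, 2)),
               rng.normal((0, 5), 1, (100, 2))])  # 3 true clusters

aics = {k: GaussianMixture(k, random_state=0).fit(X).aic(X)
        for k in range(1, 7)}
best = min(aics, key=aics.get)
print(aics, "-> chosen number of clusters:", best)
```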


Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, the result can be a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to the available software, using these modern missing data methods does not pose a major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that led to the missing values in an empirical study. As Part 1, this article first introduces Rubin's classical definition of missing data mechanisms and an alternative, variable-based taxonomy, which provides a graphical representation. Second, it presents a selection of visualization tools available in different R packages for describing and exploring missing data structures.
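
Rubin's three mechanisms can be made concrete with a small simulation. The sketch below uses Python for consistency with the other examples (the article itself discusses R packages), and all variables are hypothetical.

```python
# Generate missingness under MCAR, MAR, and MNAR on a toy dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 1000
income = rng.lognormal(10, 0.5, n)
age = rng.integers(18, 80, n).astype(float)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: missingness is independent of everything.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.2, "income"] = np.nan

# MAR: missingness in income depends only on the observed variable age.
mar = df.copy()
mar.loc[rng.random(n) < (age - 18) / 62 * 0.4, "income"] = np.nan

# MNAR: missingness in income depends on income itself.
mnar = df.copy()
mnar.loc[income > np.quantile(income, 0.8), "income"] = np.nan

for name, d in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, f"missing rate: {d['income'].isna().mean():.2f}")
```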


2021 ◽  
Vol 12 (1) ◽  
pp. 43
Author(s):  
Xingchen Yan ◽  
Xiaofei Ye ◽  
Jun Chen ◽  
Tao Wang ◽  
Zhen Yang ◽  
...  

Cycling is an increasingly popular mode of transport as part of the response to air pollution, urban congestion, and public health issues. The emergence of bike sharing programs and electric bicycles has also brought about notable changes in cycling characteristics, especially cycling speed. In order to provide a better basis for bicycle-related traffic simulations and theoretical derivations, this study sought the best-fitting distribution for bicycle riding speed considering cyclist characteristics, vehicle type, and track attributes. K-means clustering was performed on speed subcategories, with the optimal number of clusters selected using the L method. Then, 15 common models were fitted to the grouped speed data, and the Kolmogorov–Smirnov test, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) were applied to determine the best-fit distribution. The following results were acquired: (1) the bicycle speed sub-clusters generated by combinations of bicycle type, bicycle lateral position, gender, age, and lane width were grouped into three clusters; (2) among the common distributions, the generalized extreme value, gamma, and lognormal were the top three models for the three clusters of the speed dataset; and (3) integrating stability and overall performance, the generalized extreme value was the best-fit distribution for bicycle speed.
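
The fit-and-compare step can be sketched as follows (synthetic speed data; only three of the paper's 15 candidate models are shown): each distribution is fit by maximum likelihood, then ranked by KS statistic and AIC.

```python
# Fit candidate distributions to speed data and compare KS statistic / AIC.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
speeds = stats.genextreme.rvs(c=-0.1, loc=15, scale=3,
                              size=500, random_state=5)  # km/h, synthetic

for name, dist in [("genextreme", stats.genextreme),
                   ("gamma", stats.gamma),
                   ("lognorm", stats.lognorm)]:
    params = dist.fit(speeds)                       # maximum likelihood fit
    ks = stats.kstest(speeds, name, args=params)    # goodness of fit
    loglik = np.sum(dist.logpdf(speeds, *params))
    aic = 2 * len(params) - 2 * loglik
    print(f"{name:10s} KS={ks.statistic:.3f}  AIC={aic:.1f}")
```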


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Rahi Jain ◽  
Wei Xu

Background: Developing statistical and machine learning methods on studies with missing information is a ubiquitous challenge in real-world biological research. Strategies in the literature rely on either removing samples with missing values, as in complete case analysis (CCA), or imputing the missing information, as in predictive mean matching (PMM) as implemented in MICE. Limitations of these strategies include information loss and the question of how close the imputed values are to the true missing values. Further, in scenarios with piecemeal medical data, these strategies must wait for the data collection process to finish before a complete dataset is available for statistical modeling.

Method and results: This study proposes a dynamic model updating (DMU) approach, a different strategy for developing statistical models with missing data. DMU uses only the information available in the dataset to prepare the statistical models. It segments the original dataset into small complete datasets using hierarchical clustering and then performs Bayesian regression on each of the small complete datasets. Predictor estimates are updated using the posterior estimates from each dataset. The performance of DMU is evaluated on both simulated data and real studies and shows results that are better than, or on par with, approaches such as CCA and PMM.

Conclusion: The DMU approach provides an alternative to the existing approaches of information elimination and imputation when processing datasets with missing values. While the study applied the approach to continuous cross-sectional data, it can also be applied to longitudinal, categorical, and time-to-event biological data.
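
A loose sketch of the DMU idea follows; it is not the authors' algorithm. Grouping rows by exact missingness pattern stands in for hierarchical clustering, and a precision-weighted average of per-segment coefficients stands in for the paper's posterior updating.

```python
# Segment by missingness pattern, fit Bayesian regression per segment,
# combine coefficient estimates by precision weighting.
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(8)
n = 600
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"] - X["x2"] + 0.5 * X["x3"] + rng.normal(size=n)
X.loc[rng.random(n) < 0.3, "x3"] = np.nan   # one block arrives without x3

num, den = {}, {}                            # precision-weighted running sums
for _, sub in X.groupby(X.isna().apply(tuple, axis=1)):
    cols = [c for c in X.columns if sub[c].notna().all()]
    model = BayesianRidge().fit(sub[cols], y[sub.index])
    var = np.diag(model.sigma_)              # posterior variance of coefs
    for c, b, v in zip(cols, model.coef_, var):
        num[c] = num.get(c, 0) + b / v
        den[c] = den.get(c, 0) + 1 / v
print({c: num[c] / den[c] for c in num})     # combined coefficient estimates
```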


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing-data mechanisms are applied to the data set, and five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is missForest, an iterative imputation method related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithmic transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values in the NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism had the lowest RMSE and MAE. We conclude that MI using the missForest approach is highly accurate in estimating missing values: missForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and can thus be considered appropriate for analyzing air quality data.
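
missForest itself is an R package; a commonly used Python analogue is an iterative imputer with random-forest base learners. The sketch below (synthetic data, arbitrary parameters) also shows the RMSE/MAE scoring against held-out true values used in the paper's evaluation.

```python
# missForest-style imputation via a random-forest iterative imputer,
# scored on artificially removed values.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(9)
truth = np.exp(rng.normal(0, 0.5, size=(800, 5)))  # log-normal "pollutants"
X = np.log(truth)                                   # log-transform first
mask = rng.random(X.shape) < 0.2                    # 20% missing level
X_miss = X.copy()
X_miss[mask] = np.nan

imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0)
X_filled = imp.fit_transform(X_miss)

rmse = np.sqrt(mean_squared_error(X[mask], X_filled[mask]))
mae = mean_absolute_error(X[mask], X_filled[mask])
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}")
```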

