Performance Evaluation of Various Missing Data Treatments in Crash Severity Modeling

Author(s):  
Fan Ye ◽  
Yong Wang

Data quality, including record inaccuracy and missingness (incompletely recorded crashes and crash underreporting), has always been a concern in crash data analysis. Limited efforts have been made to handle specific aspects of crash data quality problems, such as using weights in estimation to account for unreported crashes and applying multiple imputation (MI) to fill in missing information on drivers’ attention status before crashes. Yet there has been no general investigation of how different statistical methods for handling missing crash data perform. This paper explores and evaluates the performance of three missing data treatments in crash severity modeling with the ordered probit model: complete-case analysis (CC), inverse probability weighting (IPW), and MI. CC discards crash records with missing information on any variable; IPW adjusts for bias by weighting each complete record by the inverse of its estimated probability of being a complete case; and MI fills in missing values by drawing from the conditional distribution of the incomplete variable given the observed data. These treatments perform differently in model estimation. Based on analyses of both simulated and real crash data, this paper suggests that the choice of missing data treatment should depend on the sample size and the data missing rate. It further recommends using MI for incompletely recorded crash data and IPW for unreported crashes before applying crash severity models to crash data.
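
As a concrete illustration of the three treatments compared above, here is a minimal Python sketch (not the authors' code; the toy crash variables, the logistic model for completeness, and the use of sklearn's IterativeImputer as a stand-in for full MI are all illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({"speed": rng.normal(60, 10, n), "age": rng.normal(40, 12, n)})
df["severity"] = (df["speed"] / 20 + rng.normal(size=n)).round().clip(0, 4)
df.loc[rng.random(n) < 0.2, "age"] = np.nan   # 20% missingness in one covariate

# 1) Complete-case analysis (CC): discard records missing any variable.
cc = df.dropna()

# 2) Inverse probability weighting (IPW): model the probability that a
#    record is complete, then weight complete cases by its inverse.
obs = df["age"].notna()
p_complete = (LogisticRegression()
              .fit(df[["speed"]], obs)
              .predict_proba(df[["speed"]])[:, 1])
ipw_weights = 1.0 / p_complete[obs.to_numpy()]  # weights for complete cases

# 3) Multiple imputation (MI), approximated here with a single iterative
#    imputation; real MI draws several imputed datasets and pools results.
mi = df.copy()
mi[["speed", "age"]] = IterativeImputer(random_state=0).fit_transform(df[["speed", "age"]])
```

A full analysis would then feed the CC subset, the IPW weights, and several imputed datasets into an ordered probit model of severity (e.g., statsmodels' OrderedModel) and pool the MI estimates.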

2018 ◽  
Vol 22 (4) ◽  
pp. 391-409
Author(s):  
John M. Roberts ◽  
Aki Roberts ◽  
Tim Wadsworth

Incident-level homicide datasets such as the Supplementary Homicide Reports (SHR) commonly exhibit missing data. We evaluated multiple imputation methods (which produce multiple completed datasets, across which imputed values may vary) using unique data that included the actual values, drawn from police agency incident reports, of seemingly missing SHR fields. This permitted evaluation under a real, rather than assumed or simulated, missing data mechanism. We compared analytic results based on multiply imputed and actual data; multiple imputation rather successfully recovered the victim–offender relationship distributions and regression coefficients that hold in the actual data. The results are encouraging for users of multiple imputation, though it remains important to minimize the extent of missing information in SHR and similar data.
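
The combination step behind such multiply imputed analyses is Rubin's pooling rules; a small self-contained Python sketch follows (the coefficient and variance values are made up for illustration):

```python
import numpy as np

def pool_rubin(estimates, variances):
    # Pool per-imputation point estimates and variances (Rubin's rules).
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    m = len(estimates)
    qbar = estimates.mean()            # pooled point estimate
    ubar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    total = ubar + (1 + 1 / m) * b     # total variance of the pooled estimate
    return qbar, total

# e.g. one regression coefficient across m = 5 imputed datasets:
qbar, total = pool_rubin([0.42, 0.39, 0.45, 0.41, 0.44],
                         [0.010, 0.011, 0.009, 0.010, 0.012])
print(qbar, total)
```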


Author(s):  
Elsiddig Elsadig Mohamed Koko ◽  
Amin Ibrahim Adam Mohamed

Missing data in household health surveys pose a challenge to researchers because they leave the analysis incomplete. This study applies cluster analysis to the data collected in Sudan's 2006 household health survey. The research focuses specifically on handling missing values within the cluster analysis. Two-Step Cluster Analysis is applied, in which each participant is classified into one of the identified patterns, and the optimal number of classes is determined using SPSS Statistics (IBM). Because cluster analysis is a multivariable statistical technique, the risk of over-fitting the data must be considered. Moreover, like other multivariable statistical techniques, cluster analysis excludes any observation with missing data. Therefore, before the cluster analysis is performed, missing values are imputed using multiple imputation (SPSS Statistics/IBM). The clustering results are displayed in tables: descriptive statistics and cluster frequencies are produced for the final cluster model, while an information criterion table displays results for a range of cluster solutions.
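
A rough Python analogue of this workflow, impute first, then compare cluster solutions over a range of cluster counts by an information criterion, is sketched below (SPSS's Two-Step algorithm itself is not reproduced; the data and settings are illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X[rng.random(X.shape) < 0.10] = np.nan        # 10% missing entries

# Impute first, since clustering excludes incomplete observations.
X_imp = IterativeImputer(random_state=1).fit_transform(X)

# Compare an information criterion over a range of cluster solutions,
# akin to SPSS's information criterion table.
for k in range(2, 7):
    gm = GaussianMixture(n_components=k, random_state=1).fit(X_imp)
    print(k, round(gm.bic(X_imp), 1))
```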


2013 ◽  
Vol 11 (7) ◽  
pp. 2779-2786
Author(s):  
Rahul Singhai

One relevant problem in data preprocessing is the presence of missing data, which leads to poor-quality patterns being extracted after mining. Imputation is a widely used procedure that replaces the missing values in a data set with plausible values. The advantage of this approach is that the missing data treatment is independent of the learning algorithm used, which allows the user to select the most suitable imputation method for each situation. This paper analyzes various imputation methods proposed in the field of statistics with respect to data mining. A comparative analysis of three imputation approaches for missing attribute values in data mining is given, identifying the most promising method. An artificial input data file (of numeric type) of 1000 records is used to investigate the performance of these methods, and a Z-test approach is used to test their significance.
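
A hedged sketch of this kind of comparison in Python (the three approaches shown, mean, median, and random donor draws, and all data are illustrative; the paper's exact methods may differ):

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(2)
true = rng.normal(50, 5, 1000)
x = true.copy()
mask = rng.random(1000) < 0.15                 # hide 15% of the values
x[mask] = np.nan

mean_imp = np.where(mask, np.nanmean(x), x)                 # mean imputation
median_imp = np.where(mask, np.nanmedian(x), x)             # median imputation
random_imp = np.where(mask, rng.choice(x[~mask], 1000), x)  # random donor draw

# Z-test: does each method's imputed sample differ from the hidden truth?
for name, imp in [("mean", mean_imp), ("median", median_imp), ("random", random_imp)]:
    z, p = ztest(imp[mask], true[mask])
    print(name, round(z, 2), round(p, 3))
```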


2021 ◽  
Author(s):  
Maxwell Hong ◽  
Matt Carter ◽  
Cheyeon Kim ◽  
Ying Cheng

Data preprocessing is an integral step prior to analyzing data in the social sciences. The purpose of this article is to report the current practices psychological researchers use to address data preprocessing and quality concerns, with a focus on aberrant responses and missing data in self-report measures. A total of 240 articles were sampled from four journals, Psychological Science, Journal of Personality and Social Psychology, Developmental Psychology, and Abnormal Psychology, from 2012 to 2018. We found that nearly half of the studies did not report any missing data treatment (111/240; 46.25%) and that, when one was reported, the most common approach was listwise deletion (71/240; 29.6%). Studies that removed data due to missingness discarded, on average, 12% of the sample. We also found that most studies did not report any methodology for addressing aberrant responses (194/240; 80.83%). Among studies that did, on average 4% of the sample was classified as suspect responses. These results suggest that most studies are either not transparent enough about their data preprocessing steps or may be leveraging suboptimal procedures. We outline recommendations for researchers to improve the transparency and/or data quality of their studies.
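
A minimal sketch of the transparency these findings argue for, reporting per-item missingness and the cost of listwise deletion before choosing a treatment (the data and column names are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["item_1", "item_2", "item_3"])
df = df.mask(rng.random(df.shape) < 0.05)      # sprinkle in 5% missing responses

print(df.isna().mean().sort_values(ascending=False))   # per-item missing rate
retained = len(df.dropna()) / len(df)
print(f"listwise deletion would discard {1 - retained:.1%} of the sample")
```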



Data Mining ◽  
2013 ◽  
pp. 603-623
Author(s):  
Rahime Belen ◽  
Tugba Taskaya Temizel

Many manually populated very large databases suffer from data quality problems such as missing or inaccurate data and duplicate entries. A recently recognized data quality problem is that of disguised missing data, which arises when no explicit code for missing data (such as NA, Not Available) is provided and a legitimate data value is recorded instead. The presence of these values can severely affect the outcome of data mining tasks: association mining algorithms may produce biased, inaccurate association rules, and clustering techniques may yield invalid clusters. Detecting and eliminating these values is necessary but burdensome to carry out manually. In this chapter, methods to detect disguised missing values by visual inspection are explained first. Then, the authors describe methods to detect these values automatically. Finally, a framework to detect disguised missing data is proposed and demonstrated on spatial and categorical data sets.
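
As a loose illustration only (not the authors' framework), one simple automatic signal for disguised missing data is a single value whose frequency dwarfs the rest of the column, as defaults such as 0 or 99999 tend to do:

```python
import pandas as pd

def suspicious_values(series: pd.Series, ratio: float = 5.0):
    # Flag values whose frequency is far above the column's typical frequency.
    counts = series.value_counts()
    if len(counts) < 2:
        return []
    typical = counts.iloc[1:].median()   # typical frequency, excluding the mode
    return counts[counts > ratio * typical].index.tolist()

ages = pd.Series([34, 27, 0, 0, 0, 0, 0, 41, 0, 29, 0, 0, 0, 55, 0])
print(suspicious_values(ages))  # [0] -- 0 is likely a disguised missing value
```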


Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, the result can be a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Given the available software, using these modern missing data methods no longer poses a major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that led to the missing values in an empirical study. This article, Part 1, first introduces Rubin's classical definition of missing data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation. Second, it presents a selection of visualization tools, available in different R packages, for describing and exploring missing data structures.
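
For quick reference, Rubin's three mechanisms can be stated compactly. With complete data $Y = (Y_{\text{obs}}, Y_{\text{mis}})$ and missingness indicator $R$:

```latex
% MCAR: missingness is unrelated to the data
\Pr(R \mid Y) = \Pr(R)
% MAR: missingness depends only on the observed part of the data
\Pr(R \mid Y) = \Pr(R \mid Y_{\text{obs}})
% MNAR: missingness depends on the unobserved values themselves
\Pr(R \mid Y) \ne \Pr(R \mid Y_{\text{obs}})
```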


2021 ◽  
Vol 23 (1) ◽  
Author(s):  
Lisa Lindner ◽  
Anja Weiß ◽  
Andreas Reich ◽  
Siegfried Kindler ◽  
Frank Behrens ◽  
...  

Background: Clinical data collection requires correct and complete data sets in order to perform correct statistical analyses and draw valid conclusions. While randomized clinical trials devote much effort to data monitoring, this is rarely the case in observational studies, due to high case numbers and often-restricted resources. We have developed a valid and cost-effective monitoring tool, which can substantially contribute to increased data quality in observational research. Methods: An automated digital monitoring system for cohort studies developed by the German Rheumatism Research Centre (DRFZ) was tested within the disease register RABBIT-SpA, a longitudinal observational study including patients with axial spondyloarthritis and psoriatic arthritis. Physicians and patients complete electronic case report forms (eCRFs) twice a year for up to 10 years. Automatic plausibility checks were implemented to verify all data after entry into the eCRF. To identify conflicts that cannot be found by this approach, all possible conflicts were compiled into a catalog. This “conflict catalog” was used to create queries, which are displayed as part of the eCRF. The proportions of queried eCRFs and of responses were analyzed by descriptive methods. For the analysis of responses, each conflict was classified as either a single conflict (affecting individual items) or a conflict requiring the entire eCRF to be queried. Results: Data from 1883 patients were analyzed. A total of n = 3145 eCRFs submitted between baseline (T0) and T3 (12 months) had conflicts (40–64%). Between 56 and 100% of the queries regarding completely missing eCRFs were answered. A mean of 1.4 to 2.4 single conflicts occurred per eCRF, of which 59–69% were answered. The most common missing values were CRP, ESR, Schober’s test, data on systemic glucocorticoid therapy, and presence of enthesitis. Conclusion: Providing high data quality in large observational cohort studies is a major challenge that requires careful monitoring. An automated monitoring process was successfully implemented and well accepted by the study centers. Two-thirds of the queries were answered with new data. While conventional manual monitoring is resource-intensive and may itself create new sources of error, automated processes are a convenient way to augment data quality.
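
A toy sketch of how such plausibility checks might look in code (the field names, ranges, and query texts are invented for illustration; the DRFZ system's actual rules are not given in this abstract):

```python
def check_ecrf(record: dict) -> list[str]:
    # Return the queries raised by simple plausibility rules for one eCRF.
    queries = []
    if record.get("crp_mg_l") is None:
        queries.append("CRP value missing")
    esr = record.get("esr_mm_h")
    if esr is not None and not 0 <= esr <= 150:
        queries.append("ESR outside plausible range")
    if record.get("glucocorticoid_therapy") is None:
        queries.append("systemic glucocorticoid therapy not documented")
    return queries

print(check_ecrf({"crp_mg_l": None, "esr_mm_h": 200}))
```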


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Rahi Jain ◽  
Wei Xu

Background: Developing statistical and machine learning methods on studies with missing information is a ubiquitous challenge in real-world biological research. Strategies in the literature rely on either removing the samples with missing values, as in complete case analysis (CCA), or imputing the missing information, as in predictive mean matching (PMM) implemented in MICE. Limitations of these strategies include information loss and the uncertain closeness of imputed values to the true missing values. Further, with piecemeal medical data, these strategies must wait for the data collection process to finish before a complete dataset is available for statistical modeling. Method and results: This study proposes a dynamic model updating (DMU) approach, a different strategy for developing statistical models with missing data. DMU uses only the information available in the dataset to prepare the statistical models. It segments the original dataset into small complete datasets using hierarchical clustering and fits a Bayesian regression on each of them; predictor estimates are updated using the posterior estimates from each dataset. The performance of DMU is evaluated on both simulated data and real studies and shows results better than, or on par with, approaches such as CCA and PMM. Conclusion: The DMU approach provides an alternative to the existing approaches of information elimination and imputation for processing datasets with missing values. While the study applied the approach to continuous cross-sectional data, it can be extended to longitudinal, categorical, and time-to-event biological data.
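
The sequential flavor of DMU can be illustrated with a conjugate Bayesian linear regression in which each small complete segment updates the posterior carried over from the previous one (a sketch under simplifying assumptions, known noise variance and simulated segments, not the authors' implementation):

```python
import numpy as np

def posterior_update(mu0, prec0, X, y, noise_var=1.0):
    # One conjugate Gaussian update of regression coefficients:
    # prior N(mu0, prec0^-1) -> posterior N(mu_n, prec_n^-1).
    prec_n = prec0 + X.T @ X / noise_var
    mu_n = np.linalg.solve(prec_n, prec0 @ mu0 + X.T @ y / noise_var)
    return mu_n, prec_n

rng = np.random.default_rng(3)
true_beta = np.array([1.0, -2.0])
mu, prec = np.zeros(2), np.eye(2)              # vague prior
for _ in range(5):                             # five small complete segments
    X = rng.normal(size=(30, 2))
    y = X @ true_beta + rng.normal(size=30)
    mu, prec = posterior_update(mu, prec, X, y)
print(mu)  # posterior mean approaches the true coefficients
```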


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses some advanced techniques for dealing with missing values in a data set measuring air quality, using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing data mechanisms are applied to the data set, at five missingness levels: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is missForest, an iterative imputation method related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that imputation under the MAR mechanism had the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy: missForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and can thus be considered appropriate for analyzing air quality data.
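
Since missForest itself is an R package, a hedged Python analogue of the evaluation described here uses random-forest-based iterative imputation on log-scale data and scores RMSE/MAE against deliberately masked values (all names and settings are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(4)
X_log = rng.normal(size=(365, 5))              # log-transformed daily pollutant levels
mask = rng.random(X_log.shape) < 0.20          # one of the paper's missingness levels
X_miss = np.where(mask, np.nan, X_log)

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=4),
    random_state=4,
)
X_imp = imputer.fit_transform(X_miss)

# Score the imputation against the deliberately masked values.
rmse = mean_squared_error(X_log[mask], X_imp[mask]) ** 0.5
mae = mean_absolute_error(X_log[mask], X_imp[mask])
print(round(rmse, 3), round(mae, 3))
```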

