Imputing Cross-Sectional Missing Data: Comparison of Common Techniques

Objective: Increasing awareness of how missing data affects the analysis of clinical and public health interventions has led to increasing numbers of missing data procedures. There is little advice regarding which procedures should be selected under different circumstances. This paper compares six popular procedures: listwise deletion, item mean substitution, person mean substitution at two levels, regression imputation and hot deck imputation. Method: Using a complete dataset, each was examined under a variety of sample sizes and differing levels of missing data. The criteria were the true t-values for the entire sample. Results: The results suggest important differences. If missing data are from a scale where about half the items are present, hot deck imputation or person mean substitution are best. Because person mean substitution is computationally simpler, similar in its efficiency, advocated by other researchers and more likely to be an option on statistical software packages, it is the method of choice. If the missing data are from a scale where more than half the items are missing, or with single-item measures, then hot deck imputation is recommended. The findings also showed that listwise deletion and item mean substitution performed poorly. Conclusions: Person mean and hot deck imputation are preferred. Since listwise deletion and item mean substitution performed poorly, yet are the most widely reported methods, the findings have broad implications.
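
To make the difference between two of the compared procedures concrete, here is a minimal sketch of person mean substitution and item mean substitution on a small scale matrix. It is an illustration only, not the paper's implementation: the function names, the use of None for a missing item, and the `min_present=0.5` cut-off (loosely modelled on the abstract's "about half the items are present") are all assumptions.

```python
def _mean(values):
    return sum(values) / len(values)


def person_mean_impute(rows, min_present=0.5):
    """Fill a respondent's missing items with the mean of that respondent's
    own observed items. Rows with too few observed items are left unchanged.
    min_present is a hypothetical threshold, not taken from the paper."""
    out = []
    for row in rows:
        seen = [v for v in row if v is not None]
        if seen and len(seen) / len(row) >= min_present:
            fill = _mean(seen)
            out.append([fill if v is None else v for v in row])
        else:
            out.append(list(row))
    return out


def item_mean_impute(rows):
    """Fill a missing item with the mean of that item across all respondents
    who answered it. An item nobody answered stays missing."""
    n_items = len(rows[0])
    col_means = []
    for j in range(n_items):
        seen = [row[j] for row in rows if row[j] is not None]
        col_means.append(_mean(seen) if seen else None)
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


# Three respondents, four scale items; None marks a missing response.
scale = [
    [4, None, 5, 3],
    [2, 2, None, None],
    [None, None, None, 1],
]
by_person = person_mean_impute(scale)
by_item = item_mean_impute(scale)
```

Person mean substitution keeps each respondent's own level, whereas item mean substitution pulls every missing response toward the sample average for that item.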

Imputing Missing Repeated Measures Data: How Should We Proceed?

Australian & New Zealand Journal of Psychiatry ◽

10.1080/j.1440-1614.2005.01629.x ◽

2005 ◽

Vol 39 (7) ◽

pp. 575-582 ◽

Cited By ~ 37

Author(s):

Peter Elliott ◽

Graeme Hawthorne

Keyword(s):

Missing Data ◽

Repeated Measures ◽

Statistical Tests ◽

Average Value ◽

Listwise Deletion ◽

A Value ◽

Software Packages ◽

Repeated Measures Data ◽

Complete Dataset ◽

T Values

Objective: This paper compares six missing data methods that can be used for carrying out statistical tests on repeated measures data: listwise deletion, last value carried forward (LVCF), standardized score imputation, regression and two versions of a closest match method. Method: The efficacy of each was investigated under a variety of sample sizes and with differing levels of missingness. Randomly selected samples from a dataset (n=804) were used to compare the methods using t-tests. Efficacy was defined as the closeness of the estimated t-values to the true t-values from the complete dataset. Results: The results suggest a reliable and efficacious basis for imputation method for repeated measures data is to substitute a missing datum with a value from another individual who has the closest scores on the same variable measured at other timepoints, or the average value of four individuals who have the closest scores on the same variable at other timepoints. The LVCF and standardized score methods performed relatively poorly, which is of concern since these are often recommended. Listwise deletion was also an inefficient missing data method. Conclusions: Researchers should consider using closest match missing data imputation. Since listwise deletion performed poorly, is widely reported and is the default method in many statistical software packages, the findings have broad implications.
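
The closest match idea described in the results can be sketched roughly as follows. This is an assumption-laden illustration rather than the authors' procedure: the abstract does not say how closeness is measured, so the sketch uses the sum of squared differences over the other timepoints, breaks ties by record order, and uses the hypothetical name `closest_match_impute`. Setting `k=1` borrows the value from the single closest individual, and `k=4` averages the four closest, mirroring the two versions mentioned above.

```python
def closest_match_impute(records, k=1):
    """records: one list per person, one entry per timepoint, None = missing.
    A missing value at timepoint t is replaced by the mean of the values at t
    from the k donors whose scores at the person's other observed timepoints
    are closest (squared-difference distance, an assumption of this sketch)."""
    out = [list(r) for r in records]
    for i, rec in enumerate(records):
        for t, val in enumerate(rec):
            if val is not None:
                continue
            # Timepoints at which this person was actually observed.
            others = [s for s in range(len(rec)) if s != t and rec[s] is not None]
            if not others:
                continue  # nothing to match on; leave the value missing
            candidates = []
            for j, donor in enumerate(records):
                if j == i or donor[t] is None:
                    continue
                if any(donor[s] is None for s in others):
                    continue
                dist = sum((rec[s] - donor[s]) ** 2 for s in others)
                candidates.append((dist, j))
            if not candidates:
                continue  # no usable donor; leave the value missing
            candidates.sort()
            nearest = candidates[:k]
            out[i][t] = sum(records[j][t] for _, j in nearest) / len(nearest)
    return out


# Six people measured at three timepoints; the first is missing timepoint 3.
records = [
    [10, 12, None],
    [10, 13, 15],
    [20, 22, 25],
    [9, 12, 14],
    [11, 12, 16],
    [30, 31, 33],
]
single_donor = closest_match_impute(records, k=1)
four_donors = closest_match_impute(records, k=4)
```

Only observed values are used as donors here; imputed values are never fed back into later matches.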

Goodbye, Listwise Deletion: Presenting Hot Deck Imputation as an Easy and Effective Tool for Handling Missing Data

Communication Methods and Measures ◽

10.1080/19312458.2011.624490 ◽

2011 ◽

Vol 5 (4) ◽

pp. 297-310 ◽

Cited By ~ 248

Author(s):

Teresa A. Myers

Keyword(s):

Missing Data ◽

Hot Deck Imputation ◽

Listwise Deletion

Deriving Correlation Matrices for Missing Financial Time-Series Data

International Journal of Economics and Finance ◽

10.5539/ijef.v10n10p105 ◽

2018 ◽

Vol 10 (10) ◽

pp. 105

Author(s):

Schalk Burger ◽

Searle Silverman ◽

Gary van Vuuren

Keyword(s):

Time Series ◽

Missing Data ◽

Statistical Power ◽

Time Series Data ◽

Financial Time Series ◽

Series Data ◽

Financial Time ◽

Listwise Deletion ◽

Original Dataset ◽

Complete Dataset

The problem of missing data is prevalent in financial time series, particularly data such as foreign exchange rates and interest rate indices. Reasons for missing data include the closure of financial markets over weekends and holidays and the fact that index data sometimes do not change between consecutive dates, resulting in stale data (also considered as missing data). Most statistical software packages function best when applied to complete datasets. Listwise deletion, a commonly used approach to deal with missing data, is straightforward to use and implement, but it can exclude large portions of the original dataset (Allison, 2002). Where data are randomly missing or if the deleted data are insignificant (measured by statistical power), listwise deletion may add value. Techniques to handle missing data were suggested and implemented. These techniques were assessed to ascertain which provided the most accurate reconstructed datasets compared with the complete dataset.

IMPLEMENTATION OF MISSING VALUES HANDLING METHOD FOR EVALUATING THE SYSTEM/COMPONENT MAINTENANCE HISTORICAL DATA

JURNAL TEKNOLOGI REAKTOR NUKLIR TRI DASA MEGA ◽

10.17146/tdm.2017.19.1.3159 ◽

2017 ◽

Vol 19 (1) ◽

pp. 11 ◽

Cited By ~ 1

Author(s):

Entin Hartini

Keyword(s):

Machine Learning ◽

Missing Data ◽

Missing Values ◽

Historical Data ◽

Data Evaluation ◽

System Component ◽

Missing Value ◽

Listwise Deletion ◽

Mean Substitution ◽

Handling Method

Missing values are problems in data evaluation. Missing values analysis can resolve the problem of incomplete data that is not stored properly. The missing data can reduce the precision of calculation, since the amount of information is incomplete. The purpose of this study is to implement missing values handling method for systems/components maintenance historical data evaluation in RSG GAS. Statistical methods, such as listwise deletion and mean substitution, and machine learning (KNNI), were used to determine the missing data that correspond to the systems/components maintenance historical data. Mean substitution and KNNI methods were chosen since those methods do not require the formation of predictive models for each item which is experiencing missing data. Implementation of missing data analysis on systems/components maintenance data using KNNI method results in the smallest RMSE value. The result shows that KNNI method is the best method to handle missing value compared with listwise deletion or mean substitution. Keywords: missing value, data evaluation, algorithm, implementation
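
A comparison of this kind can be sketched by hiding values that are actually known, imputing them, and scoring each method by RMSE. The snippet below is a hedged illustration, not the study's code: the function names, the squared-difference distance, `k=2`, and the toy numbers are all assumptions, and the outcome on made-up data says nothing about the study's maintenance records.

```python
from math import sqrt


def rmse(true_vals, imputed_vals):
    """Root mean squared error between hidden true values and imputed ones."""
    n = len(true_vals)
    return sqrt(sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n)


def mean_substitute(column):
    """Replace every None in a column with the mean of the observed values."""
    seen = [v for v in column if v is not None]
    m = sum(seen) / len(seen)
    return [m if v is None else v for v in column]


def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean of that column over the k
    nearest rows that have it, measuring distance on the other columns.
    Assumes the other columns are complete and at least one donor exists."""
    donors = [r for r in rows if r[target] is not None]
    if not donors:
        raise ValueError("no row has an observed value in the target column")
    out = []
    for r in rows:
        if r[target] is not None:
            out.append(list(r))
            continue

        def dist(d, r=r):
            return sum((a - b) ** 2
                       for c, (a, b) in enumerate(zip(r, d)) if c != target)

        nearest = sorted(donors, key=dist)[:k]
        filled = list(r)
        filled[target] = sum(d[target] for d in nearest) / len(nearest)
        out.append(filled)
    return out


# Toy data (invented): the second column is deliberately hidden in two rows.
truth = [(1, 2), (2, 4), (3, 6), (8, 16), (9, 18), (10, 20)]
masked = [(1, 2), (2, None), (3, 6), (8, 16), (9, None), (10, 20)]
hidden = [1, 4]

mean_filled = mean_substitute([r[1] for r in masked])
knn_filled = knn_impute(masked, target=1, k=2)

true_hidden = [truth[i][1] for i in hidden]
rmse_mean = rmse(true_hidden, [mean_filled[i] for i in hidden])
rmse_knn = rmse(true_hidden, [knn_filled[i][1] for i in hidden])
```

Listwise deletion has no imputed values to score this way, so it would have to be compared on a downstream estimate instead.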

Simple Relationship Quality Measures

10.31234/osf.io/26ytu ◽

2020 ◽

Author(s):

Sylvia Niehuis

Keyword(s):

Convergent Validity ◽

Online Survey ◽

Romantic Partners ◽

Age Dating ◽

Simple Relationship ◽

Cross Sectional ◽

Single Item ◽

College Age ◽

Close Relationship ◽

Dating Couples

Issues in applied survey research, including minimizing respondent burden to encourage survey completion and the increasing administration of questionnaires over smartphones, have intensified efforts to create short measures. We conducted two studies to examine the psychometric properties of single-item measures of four close-relationship variables: satisfaction, love, conflict, and commitment. Study 1 was longitudinal, surveying an initial sample of 121 college-age dating couples at three monthly phases. Romantic partners completed single- and multi-item measures of the four constructs, along with other variables, to examine test-retest reliability and convergent (single-item measures with their corresponding multi-item scales), concurrent, and predictive validity. Our single-item measures of satisfaction, love, and commitment exhibited impressive psychometric qualities, but our single-item conflict measure performed somewhat less strongly. Study 2, a cross-sectional online survey (n = 280; mainly through Facebook), showed strong convergent validity of the single-item measures, including a .60 correlation between single- and multi-item conflict measures.

Missing Data Methods: Cross-sectional Methods and Applications

10.1108/s0731-9053(2011)27_part_1 ◽

2011 ◽

Cited By ~ 1

Keyword(s):

Missing Data ◽

Cross Sectional

Combined effect of hypertension and hyperuricemia on ischemic stroke in a rural Chinese population

BMC Public Health ◽

10.1186/s12889-021-10858-x ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Peng Sun ◽

Mengqi Chen ◽

Xiaofan Guo ◽

Zhao Li ◽

Ying Zhou ◽

...

Keyword(s):

Ischemic Stroke ◽

Missing Data ◽

Chinese Population ◽

Joint Effect ◽

Cross Sectional Study ◽

Combined Effect ◽

Sectional Study ◽

Cross Sectional ◽

Current Smoking ◽

Positive Correlations

Background: To investigate the combined effect of hypertension and hyperuricemia on the risk of ischemic stroke in a rural Chinese population. Methods: The cross-sectional study was conducted from 2012 to 2013 in a rural area of China. After exclusion for missing data, we finally included 11,731 participants in the analysis. Results: After adjusting for age, current smoking, current drinking, BMI, TG, HDL-C and eGFR, hypertension was significantly associated with ischemic stroke in men (OR: 2.783, 95% CI: 1.793, 4.320) and in women (OR: 4.800, 95% CI: 2.945, 7.822). However, hyperuricemia was significantly associated with ischemic stroke only in women (OR: 1.888, 95% CI: 1.244, 2.864). After full adjustment, participants with both hypertension and hyperuricemia had 8.9 times higher risk than those without them. Finally, the interaction between hypertension and hyperuricemia was statistically significant only in women rather than in men after full adjustment. Conclusions: This study demonstrated the positive correlations between hypertension, hyperuricemia and ischemic stroke. Our study also demonstrated the joint effect between hypertension and hyperuricemia towards ischemic stroke only in women, not in men.

An easy way to create duration variables in binary cross-sectional time-series data

The Stata Journal: Promoting communications on statistics and Stata ◽

10.1177/1536867x20976322 ◽

2020 ◽

Vol 20 (4) ◽

pp. 916-930

Author(s):

Andrew Q. Philips

Keyword(s):

Time Series ◽

Missing Data ◽

Time Series Data ◽

Series Data ◽

Duration Dependence ◽

Cross Sectional ◽

Common Solution ◽

The Common

In cross-sectional time-series data with a dichotomous dependent variable, failing to account for duration dependence when it exists can lead to faulty inferences. A common solution is to include duration dummies, polynomials, or splines to proxy for duration dependence. Because creating these is not easy for the common practitioner, I introduce a new command, mkduration, that is a straightforward way to generate a duration variable for binary cross-sectional time-series data in Stata. mkduration can handle various forms of missing data and allows the duration variable to easily be turned into common parametric and nonparametric approximations.
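
As a rough illustration of what a duration variable is, the sketch below counts, for each unit, the periods elapsed since that unit's last event and then builds a cubic polynomial from it. It is written in Python and is not a port of mkduration: the function names are hypothetical, the convention of starting the count at zero is an assumption (conventions differ), and it does not attempt the missing-data handling the command offers.

```python
def spell_duration(events):
    """events: 0/1 outcomes for one unit in time order.
    Returns, for each period, the number of periods elapsed since the last
    event (or since the start of the series if no event has occurred yet)."""
    out = []
    counter = 0
    for e in events:
        out.append(counter)
        # An event resets the clock for the following period.
        counter = 0 if e == 1 else counter + 1
    return out


def add_duration(panel):
    """panel: dict mapping unit id -> list of 0/1 outcomes in time order.
    Durations are computed separately within each unit."""
    return {unit: spell_duration(ev) for unit, ev in panel.items()}


def cubic_terms(duration):
    """Polynomial approximation of duration dependence: t, t^2, t^3."""
    return [(t, t ** 2, t ** 3) for t in duration]


panel = {
    "A": [0, 0, 1, 0, 0, 0, 1, 1, 0],
    "B": [1, 0, 0],
}
durations = add_duration(panel)
poly_a = cubic_terms(durations["A"])
```

The polynomial terms would then enter the binary outcome model alongside the other regressors in place of a full set of duration dummies.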

Potential value of the current mental health monitoring of children in state care in England

BJPsych Open ◽

10.1192/bjo.2018.70 ◽

2018 ◽

Vol 4 (6) ◽

pp. 486-491 ◽

Cited By ~ 3

Author(s):

Christine Cocker ◽

Helen Minnis ◽

Helen Sweeting

Keyword(s):

Mental Health ◽

Missing Data ◽

Mental Health Problems ◽

Population Level ◽

Cross Sectional ◽

Data Set ◽

Looked After Children ◽

Education Data ◽

Sample Attrition ◽

Aggregate Population

Background: Routine screening to identify mental health problems in English looked-after children has been conducted since 2009 using the Strengths and Difficulties Questionnaire (SDQ). Aims: To investigate the degree to which data collection achieves screening aims (identifying scale of problem, having an impact on mental health) and the potential analytic value of the data set. Method: Department for Education data (2009–2017) were used to examine: aggregate, population-level trends in SDQ scores in 4/5- to 16/17-year-olds; representativeness of the SDQ sample; attrition in this sample. Results: Mean SDQ scores (around 50% ‘abnormal’ or ‘borderline’) were stable over 9 years. Levels of missing data were high (25–30%), as was attrition (28% retained for 4 years). Cross-sectional SDQ samples were not representative and longitudinal samples were biased. Conclusions: Mental health screening appears justified and the data set has research potential, but the English screening programme falls short because of missing data and inadequate referral routes for those with difficulties. Declaration of interest: None.

Knowledge and Beliefs Regarding Harm From Specific Tobacco Products: Findings From the H.I.N.T. Survey

American Journal of Health Promotion ◽

10.1177/08901171211026116 ◽

2021 ◽

pp. 089011712110261

Author(s):

Wenxue Lin ◽

Joshua E. Muscat

Keyword(s):

Electronic Cigarettes ◽

Multiple Linear Regression Model ◽

Cross Sectional ◽

Cigarette Smokers ◽

Tobacco Products ◽

Single Item ◽

Knowledge And Beliefs ◽

Nationally Representative ◽

National Trends ◽

Medical Recommendations

Purpose: Determine whether dual tobacco users have different levels of knowledge about nicotine addiction, perceived harm beliefs of low nicotine cigarettes (LNCs) and beliefs about electronic cigarettes (e-cigarettes). Design: Quantitative, cross-sectional. Setting: Health Information National Trends Survey 5 (Cycle 3, 2019). Participants: Nationally representative adult non-smokers (n=3113), exclusive cigarette smokers (n=302), and dual (cigarette and e-cigarette) users (n=77). Measures: The survey included single-item measures on whether nicotine causes addiction and whether nicotine causes cancer. A five-point Likert scale assessed comparative harm of e-cigarettes and LNCs relative to conventional combustible cigarettes (1=much more harmful, 3=equally harmful…5=much less harmful, or don’t know). Analysis: We used a weighted multiple linear regression model to estimate means and 95% confidence intervals (CI) of e-cigarette and LNC beliefs by current tobacco user status. Results: Over 97% of dual users, 83% of non-smokers and 86% of exclusive cigarette smokers correctly identified that nicotine is addictive. The majority of subjects incorrectly identified nicotine as a cause of cancer, with dual users having the lowest proportion of incorrect responses (60%). Dual users rated e-cigarettes as less harmful than combustibles (mean=2.20; 95% CI=1.73, 2.66) while exclusive cigarette smokers and non-smokers rated them as similarly harmful. LNCs were considered as harmful and addictive as conventional cigarettes. Conclusion: Dual users had a higher knowledge base of tobacco-related health effects. The effectiveness of policies or medical recommendations to encourage smokers to switch from cigarettes to LNCs or e-cigarettes will need to consider accurate and inaccurate perceptions about the harm and addictiveness of nicotine. Improved public health messages about different tobacco products are needed.
