Aspects of Data Quality in Psychology: Missing Data and Aberrant Responses

2021 ◽  
Author(s):  
Maxwell Hong ◽  
Matt Carter ◽  
Cheyeon Kim ◽  
Ying Cheng

Data preprocessing is an integral step prior to analyzing data in the social sciences. The purpose of this article is to report the current practices psychological researchers use to address data preprocessing or quality concerns, with a focus on issues pertaining to aberrant responses and missing data in self-report measures. We sampled 240 articles published from 2012 to 2018 in four journals: Psychological Science, Journal of Personality and Social Psychology, Developmental Psychology, and Abnormal Psychology. We found that nearly half of the studies did not report any missing data treatment (111/240; 46.25%), and when they did, the most common approach was listwise deletion (71/240; 29.6%). Studies that removed data due to missingness discarded, on average, 12% of the sample. We also found that most studies did not report any methodology to address aberrant responses (194/240; 80.83%). Among studies that did, on average 4% of the sample was classified as suspect responses. These results suggest that most studies are either not transparent enough about their data preprocessing steps or may be leveraging suboptimal procedures. We outline recommendations for researchers to improve the transparency and/or the data quality of their studies.
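Listwise deletion, the most commonly reported treatment above, simply drops every case with any missing item. A minimal sketch, with hypothetical records and item names (not the authors' code):

```python
# Listwise (complete-case) deletion: drop any record with a missing item.
# Hypothetical self-report data; None marks a missing response.
records = [
    {"id": 1, "item_a": 4, "item_b": 2},
    {"id": 2, "item_a": None, "item_b": 5},  # dropped: item_a missing
    {"id": 3, "item_a": 3, "item_b": None},  # dropped: item_b missing
    {"id": 4, "item_a": 5, "item_b": 1},
]

complete = [r for r in records if all(v is not None for v in r.values())]
pct_removed = 100 * (len(records) - len(complete)) / len(records)
print(len(complete), pct_removed)  # 2 records kept, 50.0% removed
```

The convenience of this approach is exactly what makes it common; the cost, as the article notes, is discarding a nontrivial share of the sample.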

2018 ◽  
Vol 7 (4) ◽  
pp. 572-588
Author(s):  
Hanyu Sun ◽  
Roger Tourangeau ◽  
Stanley Presser

Abstract It is well established that taking part in earlier rounds of a panel survey can affect how respondents answer questions in later rounds. It is less clear, however, whether panel participation affects the quality of the data that respondents provide. We examined two panels to investigate how participation affects several indicators of data quality—including straightlining, item missing data, scale reliabilities, and differences in item functioning over time—and to test the hypotheses that it is less educated and older respondents who mainly account for any panel effects. The two panels were the GfK Knowledge Panel, in which some respondents completed up to four rounds measuring their attitudes toward terrorism and ways to counter terrorism, and the General Social Survey (GSS), in which respondents completed up to three rounds with an omnibus set of questions. The two panels differ sharply in terms of response rates and the level of prior survey experience of the respondents. Most of our comparisons are within-respondent, comparing the answers panel members gave in earlier rounds with those they gave in later rounds, but we also confirm the main results using between-subject comparisons. We find little evidence that respondents gave either better or worse data over time in either panel and little support for either the education or age hypotheses.
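Straightlining, one of the data-quality indicators examined above, can be flagged with a simple check for identical answers across a battery of items; the respondents and responses here are invented for illustration:

```python
# Flag straightlining: a respondent giving the same answer to every item
# in a battery of Likert-type questions (hypothetical 1-5 answers).
def is_straightliner(answers):
    """True if all answered items in the battery are identical."""
    observed = [a for a in answers if a is not None]
    return len(observed) > 1 and len(set(observed)) == 1

respondents = {
    "r1": [3, 3, 3, 3, 3],     # straightliner
    "r2": [4, 2, 5, 1, 3],
    "r3": [2, None, 2, 2, 2],  # straightliner among answered items
}
flags = {rid: is_straightliner(a) for rid, a in respondents.items()}
print(flags)
```

In a panel setting, comparing each respondent's flag rate across rounds is one way to operationalize the "worse data over time" hypothesis the authors test.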


Author(s):  
Hatice Uenal ◽  
David Hampel

Registries are indispensable in medical studies and provide the basis for reliable study results for research questions. Depending on the purpose of use, high data quality is a prerequisite. However, as registry quality increases, costs increase accordingly. Considering these time and cost factors, this work attempts to estimate the cost advantages of applying statistical tools to existing registry data, including quality evaluation. The quality analysis showed unquestionable savings of millions in study costs from reducing the time horizon, averaging €523,126 for each year saved. By additionally replacing the more than 25% of values missing in some variables, data quality was markedly improved. To conclude, our findings clearly showed the importance of data quality and statistical input in avoiding biased conclusions due to incomplete data.


2018 ◽  
Author(s):  
Robert Goodspeed ◽  
Xiang Yan ◽  
Jean Hardy ◽  
V.G. Vinod Vydiswaran ◽  
Veronica J. Berrocal ◽  
...  

BACKGROUND Mobile devices are increasingly used to collect location-based information from individuals about their physical activities, dietary intake, environmental exposures, and mental well-being. Such research, which typically uses wearable devices or smartphones to track location, benefits from the growing availability of fine-grained data regarding human mobility. However, little is known about the comparative geospatial accuracy of such devices. OBJECTIVE In this study, we compared the data quality of location information collected from two mobile devices that determine location in different ways: a GPS watch and a smartphone with Google's Location History feature enabled. METHODS Twenty-one chronically ill participants carried both devices, which generated digital traces of locations, for 28 days. A smartphone-based brief ecological momentary assessment (EMA) survey asked participants to manually report their location at four random times throughout each day. Participants also took part in qualitative interviews and completed surveys twice during the study period, reviewing recent phone and watch trace data to compare the devices' traces to their memory of their activities on those days. Trace data from the devices were compared on the basis of: (1) missing data days; (2) reasons for missing data; (3) distance between the route data collected for each matching day and the associated EMA survey locations; and (4) activity space total area and density surfaces. RESULTS The watch produced a much higher proportion of missing data days, with missing data explained by technical differences between the devices as well as participant behaviors. The phone was significantly more accurate in detecting home locations, and marginally significantly more accurate for all types of locations combined. The watch data yielded a smaller activity space area and more accurately recorded outdoor travel and recreation.
CONCLUSIONS The most suitable mobile device for location-based health research depends on the particular study objectives. Further, data generated from mobile devices, such as GPS phones and smartwatches, require careful analysis to ensure quality and completeness. Studies that seek precise measurement of outdoor activity and travel, such as measuring outdoor physical activity or exposure to localized environmental hazards, would benefit from the use of GPS devices. Conversely, studies that aim to account for time within buildings at home or work, or that document visits to particular places (such as supermarkets, medical facilities, or fast food restaurants), would benefit from the phone's demonstrated greater precision in recording indoor activities. CLINICALTRIAL N/A
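Comparing a device trace to an EMA-reported location, as in criterion (3) above, reduces to a point-to-point great-circle distance. A haversine sketch (this is not the authors' code, and the coordinates are invented):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6_371_000  # mean Earth radius, meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical EMA-reported location vs. the nearest trace point recorded
# by a device around the survey time.
ema = (42.2808, -83.7430)    # invented self-reported position
trace = (42.2815, -83.7422)  # invented GPS fix
error_m = haversine_m(*ema, *trace)
print(round(error_m), "m")
```

Averaging this error over all matched EMA prompts per device gives a simple per-device accuracy summary of the kind the comparison above relies on.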


Author(s):  
Fan Ye ◽  
Yong Wang

Data quality, including record inaccuracy and missingness (incompletely recorded crashes and crash underreporting), has always been of concern in crash data analysis. Limited efforts have been made to handle specific aspects of crash data quality, such as using weights in estimation to account for unreported crash data and applying multiple imputation (MI) to fill in missing information on drivers' status of attention before crashes. Yet there has been no general investigation of how different statistical methods perform in handling missing crash data. This paper explores and evaluates the performance of three missing data treatments, complete-case analysis (CC), inverse probability weighting (IPW), and MI, in crash severity modeling using the ordered probit model. CC discards crash records with missing information on any of the variables; IPW includes weights in estimation to adjust for bias, using each complete record's probability of being a complete case; and MI imputes the missing values based on the conditional distribution of the variable with missing information given the observed data. These treatments perform differently in model estimation. Based on analysis of both simulated and real crash data, this paper suggests that the choice of an appropriate missing data treatment should be based on sample size and data missing rate. Meanwhile, it is recommended that MI be used for incompletely recorded crash data and IPW for unreported crashes, before applying crash severity models to crash data.
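The three treatments can be contrasted on a toy sample. This sketch (not the paper's code) simulates a speed variable missing at random given observed severity; the single conditional-mean imputation here is only a stand-in for full MI, which would draw and pool several imputed datasets:

```python
import random

random.seed(0)

# Toy crash data: severity (1-3) always observed; speed missing at random
# given severity (MAR). All numbers are invented for illustration.
data = []
for _ in range(1000):
    sev = random.choice([1, 2, 3])
    speed = 40 + 10 * sev + random.gauss(0, 5)  # population mean is 60
    if random.random() < 0.25 * sev:            # more missing when severe
        speed = None
    data.append((sev, speed))

# 1) Complete-case analysis (CC): discard records with missing speed.
cc = [sp for _, sp in data if sp is not None]
cc_mean = sum(cc) / len(cc)                     # biased low under MAR

# 2) Inverse probability weighting (IPW): weight each complete case by
#    1 / P(complete | severity), estimated from the data itself.
n_tot = {s: 0 for s in (1, 2, 3)}
n_obs = {s: 0 for s in (1, 2, 3)}
for sev, sp in data:
    n_tot[sev] += 1
    n_obs[sev] += sp is not None
w = {s: n_tot[s] / n_obs[s] for s in (1, 2, 3)}
ipw_mean = (sum(w[sev] * sp for sev, sp in data if sp is not None)
            / sum(w[sev] for sev, sp in data if sp is not None))

# 3) Conditional-mean imputation (MI stand-in): fill missing speed with
#    the observed group mean for that severity level.
grp_mean = {s: sum(sp for v, sp in data if v == s and sp is not None) / n_obs[s]
            for s in (1, 2, 3)}
filled = [sp if sp is not None else grp_mean[sev] for sev, sp in data]
imp_mean = sum(filled) / len(filled)

print(round(cc_mean, 1), round(ipw_mean, 1), round(imp_mean, 1))
```

Because high-severity (high-speed) crashes go missing more often, CC underestimates the mean, while IPW and the conditional imputation recover it, mirroring the bias adjustment the paper evaluates.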


2016 ◽  
Vol 41 (1) ◽  
pp. 143-153 ◽  
Author(s):  
Jody S. Nicholson ◽  
Pascal R. Deboeck ◽  
Waylon Howard

Inherent in the applied developmental sciences is a threat to validity and generalizability due to missing data resulting from participant drop-out. The current paper provides an overview of how attrition should be reported, which tests can examine the potential for bias due to attrition (e.g., t-tests, logistic regression, Little's MCAR test, sensitivity analysis), and how it is best corrected through modern missing data analyses. To complement this discussion of best practices in managing and reporting attrition, an assessment of how the developmental sciences currently handle attrition was conducted. Longitudinal studies (n = 541) published from 2009–2012 in major developmental journals were reviewed for attrition reporting practices and for how authors handled missing data, based on recommendations in the Publication Manual of the American Psychological Association (APA, 2010). Results suggest that attrition reporting is not following APA recommendations, that the quality of reporting has not improved since the APA publication, and that a low proportion of authors provided sufficient information to show that the data properly met the MAR assumption. An example based on simulated data demonstrates the bias that may result from various missing data mechanisms in longitudinal data, the utility of auxiliary variables for the MAR assumption, and the need to view missingness along a continuum from MAR to MNAR.
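One of the attrition checks listed above, a t-test comparing completers and dropouts on a baseline variable, can be sketched as follows; the sample and drop-out rule are invented to make attrition selective (i.e., not MCAR):

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical longitudinal sample: baseline score observed for everyone,
# drop-out more likely for low scorers, so attrition is not MCAR.
people = []
for _ in range(400):
    b = random.gauss(100, 15)
    p_drop = 0.6 if b < 95 else 0.2
    people.append((b, random.random() < p_drop))

dropped = [b for b, d in people if d]
retained = [b for b, d in people if not d]

def welch_t(x, y):
    """Welch's two-sample t statistic (unequal variances)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (mx - my) / math.sqrt(vx / len(x) + vy / len(y))

t = welch_t(retained, dropped)
print(round(t, 2))  # a large |t| flags selective attrition
```

A significant baseline difference like this one is the signal the paper recommends reporting, and a cue to use modern missing data methods rather than listwise deletion.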


2013 ◽  
Vol 11 (7) ◽  
pp. 2779-2786
Author(s):  
Rahul Singhai

One relevant problem in data preprocessing is the presence of missing data, which leads to poor-quality patterns extracted after mining. Imputation is one of the widely used procedures for replacing the missing values in a data set with probable values. The advantage of this approach is that the missing data treatment is independent of the learning algorithm used, allowing the user to select the most suitable imputation method for each situation. This paper analyzes various imputation methods proposed in the field of statistics with respect to data mining. A comparative analysis of three different imputation approaches for imputing missing attribute values in data mining is given, identifying the most promising method. An artificial input data file (of numeric type) with 1000 records is used to investigate the performance of these methods, and a Z-test approach is used to test their significance.
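A comparison of simple imputation approaches in this spirit can be sketched on artificial numeric data; the specific methods here (mean, median, and random hot-deck) are illustrative choices, not necessarily the three the author tested:

```python
import random
import statistics

random.seed(2)

# Artificial numeric attribute with roughly 20% of values missing (MCAR).
true_vals = [random.gauss(50, 10) for _ in range(1000)]
observed = [v if random.random() > 0.2 else None for v in true_vals]
donors = [v for v in observed if v is not None]

def impute(values, method):
    """Fill None entries by mean, median, or a random hot-deck donor."""
    fill = {
        "mean": lambda: statistics.fmean(donors),
        "median": lambda: statistics.median(donors),
        "hot_deck": lambda: random.choice(donors),
    }[method]
    return [v if v is not None else fill() for v in values]

for method in ("mean", "median", "hot_deck"):
    filled = impute(observed, method)
    print(method,
          round(statistics.fmean(filled), 1),
          round(statistics.stdev(filled), 1))
```

All three recover the center of the distribution, but mean and median imputation shrink the spread, while hot-deck imputation preserves the distribution's shape; this variance distortion is exactly what a significance comparison between methods should account for.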


2021 ◽  
pp. 089801012110380
Author(s):  
Asa B. Smith ◽  
Debra L. Barton ◽  
Matthew Davis ◽  
Elizabeth A. Jackson ◽  
Jacqui Smith ◽  
...  

Sexuality is an important component of holistic quality of life, and myocardial infarction (MI) negatively influences many aspects of sexuality, including sexual function. However, there is limited literature that examines sexuality beyond the most basic physical components. This pilot study aimed to describe the relationships between the physical, psychologic, and social domains of holistic sexuality at an early timepoint post-MI. Adult men post-MI were mailed self-report surveys at 2 weeks post-discharge. Physical domains of sexuality were measured with the arousal, orgasm, erection, lubrication, and pain subscales of the Male Sexual Function Index (MSFI). The social domain utilized the sexual satisfaction subscale of the MSFI. The psychologic domain included the desire subscale of the MSFI and sexual fear (Multidimensional Sexuality Questionnaire). Spearman correlations were estimated to examine associations among the different measurement subscales. Twenty-four men post-MI were analyzed. Average scores on the MSFI were 9.2 (SD 7.7). Desire and satisfaction were the highest-scoring subscales among men compared with the other subscales (e.g., erection, lubrication). There was minimal evidence supporting a relationship between sexual fear and function. Additional research is needed with larger samples and among women post-MI.


2019 ◽  
Vol 38 (6) ◽  
pp. 720-738
Author(s):  
Anne-Roos Verbree ◽  
Vera Toepoel ◽  
Dominique Perada

Nonserious, inattentive, or careless respondents pose a threat to the validity of self-report research. The current study uses data from the Growth from Knowledge Online Panel, in which respondents are representative of the Dutch population over 15 years of age in terms of education, gender, and age (N = 5,077). Using regression analyses, we investigated whether self-reported seriousness and motivation are predictive of data quality, as measured by multiple indicators (i.e., nonsubstantial values, speeding, internal data consistency, nondifferentiation, response effects). Device group and demographic characteristics (i.e., education, gender, age) were also included in these analyses to see whether they predict data quality. Moreover, we examined whether self-reported seriousness differed by device group and demographic characteristics. The results show that self-reported seriousness and motivation significantly predict multiple data quality indicators. Data quality seems similar across device users, although smartphone users showed less speeding. Demographic characteristics explain little of the variance in data quality; of those, education seems to be the most consistent predictor, with lower-educated respondents showing lower data quality. Effect sizes for all analyses were in the small to medium range. The present study shows that self-reported seriousness can be used in online attitude survey research to detect careless respondents. Future research should clarify the nature of this relationship, for example regarding longer surveys and different wordings of seriousness checks.
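Nondifferentiation, one of the quality indicators listed above, is commonly operationalized as low variation across a respondent's answers to a battery; a sketch with an invented cutoff and invented responses:

```python
import statistics

# Nondifferentiation: little variation across a respondent's answers.
# A simple index is the (population) SD of the responses; the cutoff of
# 0.5 below is an arbitrary illustration, not a standard threshold.
def nondifferentiation(answers, sd_cutoff=0.5):
    """Flag respondents whose answer SD falls below the chosen cutoff."""
    return statistics.pstdev(answers) < sd_cutoff

panel = {
    "serious":  [1, 4, 2, 5, 3],
    "careless": [3, 3, 3, 4, 3],
}
flags = {rid: nondifferentiation(a) for rid, a in panel.items()}
print(flags)
```

Combined with speeding and nonsubstantial values, such an index gives the kind of multi-indicator quality score that self-reported seriousness was found to predict.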


Author(s):  
Meike Klettke ◽  
Adrian Lutsch ◽  
Uta Störl

Abstract Data engineering is an integral part of any data science and ML process. It consists of several subtasks that are performed to improve data quality and to transform data into a target format suitable for analysis. The quality and correctness of the data engineering steps are therefore important to ensure the quality of the overall process. In machine learning processes, requirements such as fairness and explainability are essential, and the data engineering subtasks must also help satisfy them. In this article, we show how this can be achieved by logging, monitoring, and controlling data changes in order to evaluate their correctness. Since data preprocessing algorithms are part of any machine learning pipeline, they must also guarantee that they do not introduce data biases. We briefly introduce three classes of methods for measuring data changes in data engineering and present the research questions that remain unanswered in this area.
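Logging data changes around each preprocessing step, in the spirit of what the article proposes, can be as simple as recording row counts before and after every transform; this sketch uses invented step names and records:

```python
# Sketch of auditing a preprocessing pipeline by logging what each step
# changed, so its effect on the data can be reviewed later.
def with_change_log(step_name, transform, rows, log):
    """Run a transform and record row counts before/after it."""
    before = len(rows)
    out = transform(rows)
    log.append({"step": step_name, "rows_in": before,
                "rows_out": len(out), "dropped": before - len(out)})
    return out

log = []
rows = [{"age": 34}, {"age": None}, {"age": 29}, {"age": -5}]
rows = with_change_log("drop_missing_age",
                       lambda rs: [r for r in rs if r["age"] is not None],
                       rows, log)
rows = with_change_log("drop_invalid_age",
                       lambda rs: [r for r in rs if r["age"] >= 0],
                       rows, log)
print(log)
```

Extending the log entries with per-column statistics (distributions before and after) is one route toward detecting whether a step disproportionately drops records from particular subgroups, i.e., introduces bias.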


2019 ◽  
Vol 30 (10) ◽  
pp. 999-1008 ◽  
Author(s):  
J Reyes-Urueña ◽  
L Fernàndez-Lopez ◽  
A Montoliu ◽  
A Conway ◽  
L Tavoschi ◽  
...  

The COBATEST Network of Community-Based Voluntary Counselling and STI/HIV Testing (CBVCT) services was created to standardise monitoring and evaluation of CBVCT services across Europe. This study aims to assess the quality of data collected in the network from 2015 to 2016. A survey was completed by 34 COBATEST Network members and an evaluation was performed of data quality based on three dimensions: transcription validity, completeness and consistency. The weakest area that we identified was data management processes. Only 8.8% of services had a written procedure to address data quality errors, 29.4% had procedures in place to resolve discrepancies and 35.3% performed quality control. We found that 41.2% of services utilised the COBATEST data, 11.8% made decisions based on the COBATEST data and 61.8% analysed their data in an independent manner for internal purposes. We conclude that while services have reliable data to support planning and management of services, improvements to quality procedures would ensure data are translated into evidence. This evidence would support further expansion of CBVCT services in the EU/EEA, including the integration of CBVCT-generated data into national surveillance systems.
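Of the three quality dimensions assessed, completeness is the most mechanical to compute: the share of non-missing values per field. A sketch with invented testing-service records and field names:

```python
# Per-field completeness: fraction of records with a non-missing value.
# Records and field names are invented for illustration.
records = [
    {"age": 30, "test_date": "2016-02-01", "result": "neg"},
    {"age": None, "test_date": "2016-02-03", "result": "neg"},
    {"age": 25, "test_date": None, "result": None},
]

fields = ["age", "test_date", "result"]
completeness = {
    f: sum(r[f] is not None for r in records) / len(records) for f in fields
}
print(completeness)
```

Routine checks like this, together with consistency rules across fields, are the kind of written quality-control procedure most surveyed services lacked.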

