item responses
Recently Published Documents


TOTAL DOCUMENTS: 209 (five years: 58)
H-INDEX: 26 (five years: 3)

2021 · Vol 11 (4) · pp. 1653-1687
Author(s): Alexander Robitzsch

Missing item responses are prevalent in educational large-scale assessment studies such as the Programme for International Student Assessment (PISA). The current operational practice scores missing item responses as wrong, but several psychometricians have advocated for a model-based treatment based on the latent ignorability assumption. In this approach, item responses and response indicators are jointly modeled conditional on a latent ability and a latent response propensity variable. Alternatively, imputation-based approaches can be used. The latent ignorability assumption is weakened in the Mislevy-Wu model, which characterizes a nonignorable missingness mechanism and allows the missingness of an item to depend on the item response itself. The scoring of missing item responses as wrong and the latent ignorable model are submodels of the Mislevy-Wu model. In an illustrative simulation study, it is shown that the Mislevy-Wu model provides unbiased estimates of model parameters. Moreover, the simulation replicates the finding from various simulation studies in the literature that scoring missing item responses as wrong yields biased estimates if the latent ignorability assumption holds in the data-generating model. However, if missing item responses can arise only from incorrect item responses, applying an item response model that relies on latent ignorability results in biased estimates. The Mislevy-Wu model guarantees unbiased parameter estimates whenever this more general model holds in the data-generating process. In addition, this article uses the PISA 2018 mathematics dataset as a case study to investigate the consequences of different missing data treatments on country means and country standard deviations. Obtained country means and country standard deviations can differ substantially across the scaling models. In contrast to previous statements in the literature, scoring missing item responses as incorrect provided a better model fit than a latent ignorable model for most countries. Furthermore, the dependence of the missingness of an item on the item response itself, after conditioning on the latent response propensity, was much more pronounced for constructed-response items than for multiple-choice items. As a consequence, scaling models that presuppose latent ignorability should be rejected from two perspectives. First, the Mislevy-Wu model is preferred over the latent ignorable model for reasons of model fit. Second, in the discussion section, we argue that model fit should play only a minor role in choosing psychometric models in large-scale assessment studies because validity aspects are most relevant. Missing data treatments that countries (and, hence, their students) can simply manipulate result in unfair country comparisons.
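
To make the model hierarchy explicit, here is a minimal sketch of the response-indicator part of the Mislevy-Wu model in a commonly used parameterization; the notation below is an assumption for illustration, not a quotation from the article.

```latex
% Minimal sketch of the Mislevy-Wu response-indicator model (assumed notation).
% R_{pi}: response indicator (1 = observed) for person p and item i;
% X_{pi}: the (possibly unobserved) item response; \xi_p: latent response
% propensity; \Psi: the logistic function.
P(R_{pi} = 1 \mid X_{pi} = x, \xi_p) = \Psi\!\left(\xi_p - \beta_i + x\,\delta_i\right)
% Submodels:
%   \delta_i = 0:        latent ignorability (missingness independent of X_{pi}
%                        given \xi_p);
%   \delta_i \to \infty: correct responses are almost surely observed, so a
%                        missing response implies an incorrect one, which
%                        corresponds to scoring missing responses as wrong.
```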


Author(s): W. Kyle Ingle, Stephen M. Leach, Amy S. Lingo

We examined the characteristics of 77 high school participants from four school districts who took part in the Teaching and Learning Career Pathway (TLCP) at the University of Louisville during the 2018–2019 school year. The program seeks to support the recruitment of a diverse and effective educator workforce by recruiting high school students as potential teachers for dual-credit courses that explore the teaching profession. Utilizing descriptive and inferential analysis (χ² tests) of closed-ended item responses, as well as qualitative analysis of program documents, websites, and students' open-ended item responses, we compared the characteristics of the participants with those of their home school districts and examined their perceptions of the program. When considering gender and race/ethnicity, our analysis revealed that the program was unsuccessful in its first year, reaching predominantly white female high school students who were already interested in teaching. Respondents reported learning about the TLCP from school personnel, specifically guidance counselors (39%), non-TLCP teachers (25%), or TLCP teachers (20%). We found that the TLCP has not defined diversity in a measurable way and that the lack of an explicit program theory hinders its evaluation and improvement. Program recruitment and outcomes are the result of luck or idiosyncratic personnel recommendations rather than intentional processes. We identified a need for qualitative exploration of in-school recruitment processes and for statewide longitudinal studies that track participant outcomes in college and in the teacher labor market.
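
As a hedged illustration of the kind of χ² comparison described above, the sketch below tests whether participant demographics match home-district proportions; the counts, category labels, and proportions are hypothetical, and only standard scipy functionality is assumed.

```python
# Hypothetical sketch of a chi-square goodness-of-fit test comparing TLCP
# participant demographics against home-district proportions (made-up numbers).
from scipy.stats import chisquare

# Observed participant counts by race/ethnicity (hypothetical).
observed = [58, 10, 5, 4]  # e.g., White, Black, Hispanic, Other

# District-wide proportions (hypothetical), scaled to the participant total.
district_props = [0.55, 0.25, 0.12, 0.08]
n = sum(observed)
expected = [p * n for p in district_props]

stat, pvalue = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {pvalue:.4f}")  # small p => participants differ from district
```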


Assessment · 2021 · pp. 107319112110429
Author(s): Allison J. Ames, Brian C. Leventhal

Traditional psychometric models focus on observed categorical item responses, but they often oversimplify the respondent's cognitive response process by assuming responses are driven by a single substantive trait. A further weakness is that the analysis of ordinal responses has been primarily limited to a single substantive trait at one time point. This study significantly expands the modeling framework to account for complex response processes across multiple waves of data collection using the item response tree (IRTree) framework. We apply a novel model, the longitudinal IRTree, to response processes in longitudinal studies and investigate whether changes in response style are proportional to changes in the substantive trait of interest. To do so, we present an empirical example using a six-item sexual knowledge scale from the National Longitudinal Study of Adolescent to Adult Health across two waves of data collection. Results show an increase in sexual knowledge from the first wave to the second and a decrease in midpoint and extreme response styles. Model validation revealed that failure to account for response style can bias the estimation of substantive trait growth. The longitudinal IRTree model captures midpoint and extreme response styles, as well as the trait of interest, at both waves.
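
To make the IRTree idea concrete, here is a minimal sketch of how a 5-point Likert response can be decomposed into binary pseudo-items for midpoint, direction, and extremity nodes. This three-node tree is a common textbook example and an assumption; it is not necessarily the exact tree used in the article.

```python
# Minimal IRTree sketch: decompose a 5-point Likert response (1..5) into
# three binary pseudo-items (a common tree; the article's tree may differ).
#   node 1 (midpoint):  1 if response == 3, else 0
#   node 2 (direction): 1 if response > 3, 0 if response < 3 (undefined at midpoint)
#   node 3 (extreme):   1 if response in {1, 5}, 0 if in {2, 4} (undefined at midpoint)
# Undefined nodes are coded None and treated as structurally missing when the
# pseudo-items are fit with a standard binary IRT model.

def irtree_pseudo_items(response: int) -> dict:
    midpoint = 1 if response == 3 else 0
    direction = None if response == 3 else int(response > 3)
    extreme = None if response == 3 else int(response in (1, 5))
    return {"midpoint": midpoint, "direction": direction, "extreme": extreme}

for r in range(1, 6):
    print(r, irtree_pseudo_items(r))
# 1 -> midpoint 0, direction 0, extreme 1
# 3 -> midpoint 1, direction None, extreme None
# 5 -> midpoint 0, direction 1, extreme 1
```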


Author(s): Cai Xu, Mark V. Schaverien, Joani M. Christensen, Chris J. Sidey-Gibbons

Purpose: This study aimed to evaluate and improve the accuracy and efficiency of the QuickDASH for assessing limb function in patients with upper extremity lymphedema using modern psychometric techniques. Method: We conducted confirmatory factor analysis (CFA) and Mokken analysis to examine the assumption of unidimensionality for an item response theory (IRT) model using data from 285 patients who completed the QuickDASH. We then fit the data to Samejima's graded response model (GRM), assessed the assumption of local independence of items, and calibrated the item responses for a computerized adaptive testing (CAT) simulation. Results: Initial CFA and Mokken analyses demonstrated good scalability of items and unidimensionality. However, the assumption of local independence of items was violated between items 9 (severity of pain) and 11 (sleeping difficulty due to pain) (Yen's Q3 = 0.46), and disordered thresholds were evident for item 5 (cutting food). After addressing these breaches of assumptions, the re-analyzed GRM with the remaining 10 items achieved an improved fit. Simulation of CAT administration demonstrated a high correlation between scores on the CAT and the QuickDASH (r = 0.98). Items 2 (doing heavy chores) and 8 (limiting work or daily activities) were the most frequently used. The correlation between factor scores derived from the 11-item QuickDASH and the Ultra-QuickDASH comprising items 2 and 8 was as high as 0.91. Conclusion: By administering just these two best-performing QuickDASH items, we can obtain estimates very similar to those from the full-length QuickDASH without the need for CAT technology.
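
For reference, here is a minimal sketch of Samejima's GRM category probabilities, the model named above; the discrimination and threshold values are hypothetical, not the calibrated QuickDASH parameters.

```python
# Minimal sketch of Samejima's graded response model (GRM): the probability of
# scoring in category k is the difference of adjacent cumulative probabilities.
import math

def grm_category_probs(theta: float, a: float, b: list) -> list:
    """Category probabilities for one polytomous item.

    theta : latent trait value
    a     : item discrimination
    b     : ordered category thresholds (K-1 values for K categories)
    """
    # Cumulative probabilities P(X >= k | theta), with P(X >= 0) = 1.
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

# Hypothetical item: discrimination 1.5, thresholds -1, 0, 1 (four categories).
print(grm_category_probs(theta=0.5, a=1.5, b=[-1.0, 0.0, 1.0]))  # sums to 1
```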


2021 · Vol 9 (1)
Author(s): Shinichiro Tomitaka, Toshiaki A. Furukawa

Background: Recent studies have shown that, among the general population, responses to depression-rating scales follow a common mathematical pattern. However, the mathematical pattern among responses to the items of the Generalized Anxiety Disorder-7 (GAD-7) is currently unknown. The present study investigated whether item responses to the GAD-7, when administered to the general population, follow the same mathematical distribution as those of depression-rating scales. Methods: We used data from the 2019 National Health Interview Survey (31,997 individuals), a nationwide survey of adults conducted annually in the United States. The patterns of item responses to the GAD-7 and the Patient Health Questionnaire-8 (PHQ-8) were analyzed inductively. Results: For all GAD-7 items, the frequency distribution across the response options ("not at all," "several days," "more than half the days," and "nearly every day") was positively skewed. Line charts representing the responses to each GAD-7 item all crossed at a single point between "not at all" and "several days" and, on a logarithmic scale, ran parallel from "several days" to "nearly every day." This mathematical pattern was identical to that of the PHQ-8. The characteristic pattern arose because the ratio of "more than half the days" to "several days" responses was similar across all items, as was the ratio of "nearly every day" to "more than half the days" responses. Conclusions: Our results suggest that the symptom criteria of generalized anxiety disorder and major depression have a common distribution pattern in the general population.
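
The "parallel on a logarithmic scale" finding is equivalent to saying that adjacent-category frequency ratios are roughly constant across items. The sketch below computes those ratios from a frequency table; the frequencies are hypothetical, not the NHIS 2019 values.

```python
# Sketch: if the ratios between adjacent response options are similar across
# items, the log-frequency lines from "several days" onward are near-parallel.
# Options per item: not at all, several days, >half the days, nearly every day.
items = {
    "item 1": [24000, 5200, 1600, 1200],  # hypothetical counts
    "item 2": [26000, 4100, 1250, 950],
}
for name, freq in items.items():
    # Ratios: (>half the days / several days) and (nearly every day / >half the days).
    ratios = [freq[k + 1] / freq[k] for k in range(1, len(freq) - 1)]
    print(name, [round(r, 2) for r in ratios])
# Similar ratio lists across items reproduce the parallel log-scale pattern.
```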


2021 · pp. 001316442110453
Author(s): Gabriel Nagy, Esther Ulitzsch

Disengaged item responses pose a threat to the validity of the results provided by large-scale assessments. Several procedures for identifying disengaged responses on the basis of observed response times have been suggested, and item response theory (IRT) models for response engagement have been proposed. We show that response time-based procedures for classifying response engagement and IRT models for response engagement rest on common ideas, and we propose a distinction between independent and dependent latent class IRT models. In all IRT models considered, response engagement is represented by an item-level latent class variable, but the models assume that response times either reflect or predict engagement. We summarize existing IRT models belonging to each group and extend them to increase their flexibility. Furthermore, we propose a flexible multilevel mixture IRT framework in which all of these IRT models can be estimated by means of marginal maximum likelihood. The framework is based on the widely used Mplus software, thereby making the procedure accessible to a broad audience. The procedures are illustrated on publicly available large-scale data. Our results show that the different IRT models for response engagement provided slightly different adjustments of item parameters and of individuals' proficiency estimates relative to a conventional IRT model.
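
One simple observed-response-time procedure of the kind referenced above is a threshold rule: flag a response as disengaged when it is faster than some fraction of the item's typical response time. The sketch below is a hedged illustration; the 10%-of-median fraction is a hypothetical tuning choice, not the rule used in the article.

```python
# Sketch of a response-time threshold rule for flagging disengaged responses:
# a response is flagged when it is faster than a fraction of the item's
# median response time (the fraction is a hypothetical tuning choice).
from statistics import median

def flag_disengaged(times_by_item: dict, fraction: float = 0.10) -> dict:
    flags = {}
    for item, times in times_by_item.items():
        threshold = fraction * median(times)
        flags[item] = [t < threshold for t in times]
    return flags

# Hypothetical response times in seconds for two items.
rt = {"item_1": [1.2, 14.5, 22.0, 18.3, 0.9], "item_2": [30.1, 2.0, 25.4, 28.8, 27.5]}
print(flag_disengaged(rt))  # True marks a likely disengaged (rapid) response
```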


2021 · Vol 2
Author(s): Louise Moeldrup Nielsen, Lisa Gregersen Oestergaard, Hans Kirkegaard, Thomas Maribo

Introduction: The World Health Organization Disability Assessment Schedule 2.0 (WHODAS 2.0) is designed to measure functioning and disability in six domains and is included in the International Classification of Diseases 11th Revision (ICD-11). The objective of this study was to examine the construct validity of WHODAS 2.0 and to describe its clinical utility for assessing functioning and disability among older patients discharged from emergency departments (EDs). Material and Methods: This cross-sectional study is based on data from 129 older patients. Patients completed the 36-item version of WHODAS 2.0 together with the Barthel-20, the Assessment of Motor and Process Skills (AMPS), the Timed Up and Go (TUG) test, and the 30-Second Chair Stand Test (30s-CST). Construct validity was examined through hypothesis testing by correlating the WHODAS 2.0 with the other instruments, and specifically the WHODAS 2.0 mobility domain with the TUG and 30s-CST. Clinical utility was explored through floor/ceiling effects and missing item responses. Results: WHODAS 2.0 showed fair correlations with the Barthel-20 (r = −0.49), AMPS process skills (r = −0.26), and TUG (r = 0.30), and moderate correlations with AMPS motor skills (r = −0.58) and the 30s-CST (r = −0.52). The WHODAS 2.0 mobility domain showed a fair correlation with TUG (r = 0.33) and a moderate correlation with the 30s-CST (r = −0.60). Four domains demonstrated a floor effect: D1 "Cognition," D3 "Self-care," D4 "Getting along," and D5 "Household." No ceiling effect was identified. The highest proportions of missing item responses occurred for Item 3.4 (staying by yourself for a few days), Item 4.4 (making new friends), and Item 4.5 (sexual activities). Conclusion: WHODAS 2.0 showed fair-to-moderate correlations with the Barthel-20, AMPS, TUG, and 30s-CST and captures additional aspects of disability compared with commonly used instruments. However, its clinical utility for older patients discharged from EDs poses some challenges due to floor effects and missing item responses. Accordingly, patient and health professional perspectives need further investigation.
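
A hedged sketch of the clinical-utility checks reported above: floor/ceiling effects as the share of respondents at the minimum or maximum domain score, and per-item missingness. The scores below are hypothetical, not the study data.

```python
# Sketch of floor/ceiling-effect and missingness checks for a scale domain.
def floor_ceiling(scores: list, lo: float, hi: float) -> tuple:
    """Fractions of respondents at the scale minimum (floor) and maximum (ceiling)."""
    n = len(scores)
    return (sum(s == lo for s in scores) / n, sum(s == hi for s in scores) / n)

domain_scores = [0, 0, 5, 10, 0, 20, 0, 15]        # hypothetical 0-100 domain scores
print(floor_ceiling(domain_scores, lo=0, hi=100))  # (0.5, 0.0) -> clear floor effect

item_responses = [3, None, 2, None, 1, 4, None, 2]  # None = missing item response
missing_rate = sum(r is None for r in item_responses) / len(item_responses)
print(f"missing rate: {missing_rate:.0%}")
```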


Author(s): Shiwei Tong, Qi Liu, Runlong Yu, Wei Huang, Zhenya Huang, ...

Cognitive diagnosis, a fundamental task in education, aims to reveal students' proficiency levels on knowledge concepts. Monotonicity is one of the basic conditions in cognitive diagnosis theory: it assumes that a student's proficiency is monotonically related to the probability of giving the right response to a test item. However, few previous methods consider monotonicity during optimization. To this end, we propose the Item Response Ranking (IRR) framework, which introduces pairwise learning into cognitive diagnosis to properly model the monotonicity between item responses. Specifically, we first use an item-specific sampling method to sample item responses and construct response pairs based on their partial order, proposing two-branch sampling methods to handle unobserved responses. After that, we use a pairwise objective function to exploit the monotonicity in the pair formulation. IRR is a general framework that can be applied to most contemporary cognitive diagnosis models. Extensive experiments demonstrate the effectiveness and interpretability of our method.
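
A minimal sketch of the pairwise idea: for two students answering the same item, the one observed to respond correctly should receive a higher predicted score, enforced with a pairwise logistic loss. The function names and plain-numpy setup are assumptions for illustration, not the authors' implementation.

```python
# Sketch of a pairwise ranking objective in the spirit of IRR: for a pair
# (u, v) on the same item where u answered correctly and v did not, penalize
# the model unless score(u, item) > score(v, item).
import numpy as np

def pairwise_loss(score_pos: np.ndarray, score_neg: np.ndarray) -> float:
    """Mean pairwise logistic loss: -log sigmoid(s_pos - s_neg)."""
    diff = score_pos - score_neg
    return float(np.mean(np.log1p(np.exp(-diff))))

# Hypothetical model scores for three response pairs on the same item.
s_correct = np.array([1.2, 0.4, 2.0])     # students who answered correctly
s_incorrect = np.array([0.3, 0.8, -1.0])  # students who answered incorrectly
print(pairwise_loss(s_correct, s_incorrect))  # lower = better-ordered pairs
```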


Psychometrika · 2021
Author(s): Udo Boehm, Maarten Marsman, Han L. J. van der Maas, Gunter Maris

The emergence of computer-based assessments has made response times, in addition to response accuracies, available as a source of information about test takers' latent abilities. The development of substantively meaningful accounts of the cognitive processes underlying item responses is critical to establishing the validity of psychometric tests. However, existing substantive theories such as the diffusion model have been slow to gain traction due to their unwieldy functional form and regular violations of model assumptions in psychometric contexts. In the present work, we develop an attention-based diffusion model built on process assumptions that are appropriate for psychometric applications. This model is straightforward to analyze using Gibbs sampling and can be readily extended. We demonstrate our model's good computational and statistical properties in a comparison with two well-established psychometric models.
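
For context, the standard two-boundary diffusion model referenced above yields the probability of absorption at the upper boundary in closed form; this is the classic result for a Wiener process with drift, shown here as background, not the attention-based variant developed in the article.

```latex
% First-passage probability for a Wiener diffusion with drift \nu, diffusion
% coefficient \sigma^2, boundaries 0 and a, and starting point z (0 < z < a):
P(\text{upper boundary first}) =
  \frac{1 - e^{-2\nu z / \sigma^2}}{1 - e^{-2\nu a / \sigma^2}}, \qquad \nu \neq 0,
% reducing to z/a when \nu = 0. In diffusion-based accounts of item responses,
% absorption at the upper boundary is typically mapped to a correct response.
```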

