Reliability Estimates for IRT-Based Forced-Choice Assessment Scores

2021 ◽  
pp. 109442812199908
Author(s):  
Yin Lin

Forced-choice (FC) assessments of noncognitive psychological constructs (e.g., personality, behavioral tendencies) are popular in high-stakes organizational testing scenarios (e.g., informing hiring decisions) because of their enhanced resistance to response distortions (e.g., faking good, impression management). The measurement precision of FC assessment scores used to inform personnel decisions is of paramount importance in practice. Different types of reliability estimates are reported for FC assessment scores in current publications, but consensus on best practices appears to be lacking. To provide understanding and structure around the reporting of FC reliability, this study systematically examined different reliability estimation methods for Thurstonian IRT-based FC assessment scores: their theoretical differences were discussed, and their numerical differences were illustrated through a series of simulation and empirical studies. In doing so, this study provides a practical guide for appraising different reliability estimation methods for IRT-based FC assessment scores.
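The kinds of reliability estimates the abstract contrasts can be sketched in a toy simulation. The one-observation normal measurement model, sample size, and error SD below are illustrative assumptions, not the study's Thurstonian IRT setup:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative setup: one latent trait, one noisy observation per person,
# EAP scoring under a N(0, 1) prior.
n, s = 20000, 0.5                      # persons, measurement error SD
theta = rng.normal(0.0, 1.0, n)        # true trait values
x = theta + rng.normal(0.0, s, n)      # observed scores
w = 1.0 / (1.0 + s**2)                 # EAP shrinkage weight
theta_hat = w * x                      # EAP trait estimates
post_var = s**2 / (1.0 + s**2)         # posterior variance (constant here)

# Empirical reliability: share of score variance that is true-score variance.
rel_empirical = np.var(theta_hat) / (np.var(theta_hat) + post_var)

# Simulation-based reliability: squared correlation with the true traits
# (available only in simulations, where theta is known).
rel_squared_corr = np.corrcoef(theta, theta_hat)[0, 1] ** 2

print(round(rel_empirical, 2), round(rel_squared_corr, 2))  # both ≈ 0.80
```

In this idealized setup the two estimates agree; with real IRT-based FC scores they can diverge, which is the kind of numerical difference the study examines.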

2021 ◽  
Vol 12 ◽  
Author(s):  
Esraa Al-Shatti ◽  
Marc Ohana

Despite the popularity of the term impression management (IM) in the literature, there is no consensus on how different types of IM (direct vs. indirect) and modes of interaction (face-to-face vs. online) promote career-related outcomes. While most empirical studies focus on direct IM, individuals engage in both types of IM and both interaction modes, particularly indirect IM in the online context. Indeed, recent developments suggest that online interactions now prevail over face-to-face interactions, especially during the COVID-19 pandemic. Accordingly, this study presents the first systematic literature review to differentiate between types of IM (direct vs. indirect) and modes of interaction (face-to-face vs. online) from a career development perspective. The review shows that direct IM is more widely studied in the face-to-face than in the online interaction mode, while indirect IM is neglected in both modes. This study thus provides evidence of the need to investigate and differentiate between the different types of IM and interaction modes for career-related outcomes, highlighting research gaps and directions for future inquiry.


2019 ◽  
Vol 79 (5) ◽  
pp. 827-854 ◽  
Author(s):  
Paul-Christian Bürkner ◽  
Niklas Schulte ◽  
Heinz Holling

Forced-choice questionnaires have been proposed to avoid common response biases typically associated with rating scale questionnaires. To overcome ipsativity issues of trait scores obtained from classical scoring approaches for forced-choice items, advanced methods from item response theory (IRT), such as the Thurstonian IRT model, have been proposed. For convenient model specification, we introduce the thurstonianIRT R package, which uses Mplus, lavaan, and Stan for model estimation. Based on practical considerations, we establish that items within one block need to be equally keyed to achieve similar social desirability, which is essential for creating forced-choice questionnaires that have the potential to resist faking intentions. According to extensive simulations, measuring up to five traits using blocks of only equally keyed items does not yield sufficiently accurate trait scores and inter-trait correlation estimates, under either frequentist or Bayesian estimation methods. As a result, persons’ trait scores remain partially ipsative and thus do not allow for valid comparisons between persons. However, we demonstrate that trait scores based on only equally keyed blocks can be improved substantially by measuring a sizable number of traits. More specifically, in our simulations of 30 traits, scores based on only equally keyed blocks were non-ipsative and highly accurate. We conclude that in high-stakes situations where persons are motivated to give fake answers, Thurstonian IRT models should only be applied to tests measuring a sizable number of traits.
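The ipsativity problem the abstract describes can be seen in a small simulation of classical forced-choice scoring. This is a hypothetical Python sketch (the paper's analyses use the thurstonianIRT R package, not this setup): because every person receives the same total number of points, scores carry only within-person information and inter-trait correlations are forced to be negative on average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 traits, 20 blocks, one item per trait per block.
n_persons, n_traits, n_blocks = 4000, 5, 20
theta = rng.normal(0.0, 1.0, (n_persons, n_traits))   # true trait levels
scores = np.zeros((n_persons, n_traits))

for _ in range(n_blocks):
    # Item utility = trait level + noise; classical scoring awards
    # 0..4 points per block according to the within-person ranking.
    util = theta + rng.normal(0.0, 1.0, (n_persons, n_traits))
    scores += util.argsort(axis=1).argsort(axis=1)

# Every person's total is identical -> purely relative (ipsative) scores.
totals = scores.sum(axis=1)

# Ipsativity forces negative average inter-trait correlations,
# roughly -1/(k-1) = -0.25 for k = 5 independent traits.
corr = np.corrcoef(scores.T)
mean_offdiag = (corr.sum() - n_traits) / (n_traits * (n_traits - 1))
print(totals.min(), totals.max(), round(mean_offdiag, 2))
```

The constant row totals are exactly the property that makes between-person comparisons invalid under classical scoring, which is what the Thurstonian IRT approach is designed to overcome.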


2020 ◽  
Author(s):  
John J Shaw ◽  
Zhisen Urgolites ◽  
Padraic Monaghan

Visual long-term memory has a large and detailed storage capacity for individual scenes, objects, and actions. However, memory for combinations of actions and scenes is poorer, suggesting difficulty in binding this information together. Sleep can enhance declarative memory, but whether sleep can also boost memory for bound information, and whether any effect generalizes across different types of information, is not yet known. Experiments 1 to 3 tested the effects of sleep on binding actions and scenes, and Experiments 4 and 5 tested the binding of objects and scenes. Participants viewed composites and were tested 12 hours later, after a delay spent asleep (9 p.m.-9 a.m.) or awake (9 a.m.-9 p.m.), on an alternative forced-choice recognition task. For action-scene composites, memory was relatively poor, with no significant effect of sleep. For object-scene composites, sleep did improve memory. Sleep can promote binding in memory, depending on the type of information to be combined.


2021 ◽  
pp. 001316442110089
Author(s):  
Yuanshu Fu ◽  
Zhonglin Wen ◽  
Yang Wang

Composite reliability, or coefficient omega, can be estimated using structural equation modeling. Composite reliability is usually estimated under the basic independent clusters model of confirmatory factor analysis (ICM-CFA). However, due to the existence of cross-loadings, the model fit of the exploratory structural equation model (ESEM) is often found to be substantially better than that of ICM-CFA. The present study first illustrated the method used to estimate composite reliability under ESEM and then compared the difference between ESEM and ICM-CFA in composite reliability estimation under various numbers of indicators per factor, target factor loadings, cross-loadings, and sample sizes. The results showed no apparent difference between ESEM and ICM-CFA in estimating composite reliability, and the rotation type did not affect the composite reliability estimates generated by ESEM. An empirical example further corroborated the simulation results. Based on the present study, we suggest that if the model fit of ESEM (regardless of the rotation criteria used) is acceptable but that of ICM-CFA is not, the composite reliability estimates based on the two models should still be similar. If the target factor loadings are relatively small, researchers should increase the number of indicators per factor or increase the sample size.
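For readers unfamiliar with coefficient omega, the quantity being estimated can be computed directly from a factor solution. This is a minimal sketch assuming a single factor, standardized loadings, and uncorrelated errors; the loadings below are hypothetical, chosen purely for illustration:

```python
import numpy as np

def composite_reliability(loadings, error_vars=None):
    """Coefficient omega for a congeneric one-factor model with
    uncorrelated errors: omega = (sum of loadings)^2 /
    ((sum of loadings)^2 + sum of error variances).
    With standardized loadings, error variances default to 1 - loading**2."""
    lam = np.asarray(loadings, dtype=float)
    if error_vars is None:
        error_vars = 1.0 - lam**2
    num = lam.sum() ** 2
    return num / (num + np.sum(error_vars))

# Hypothetical standardized loadings for one factor:
print(round(composite_reliability([0.7, 0.6, 0.8, 0.5]), 3))  # 0.749
```

Under ESEM the same formula applies per factor once cross-loadings are estimated; the study's point is that the resulting omega values differ little between ESEM and ICM-CFA.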


Author(s):  
Kristian Miok ◽  
Blaž Škrlj ◽  
Daniela Zaharie ◽  
Marko Robnik-Šikonja

Hate speech is an important problem in the management of user-generated content. To remove offensive content or ban misbehaving users, content moderators need reliable hate speech detectors. Recently, deep neural networks based on the transformer architecture, such as the (multilingual) BERT model, have achieved superior performance in many natural language classification tasks, including hate speech detection. So far, these methods have not been able to quantify the reliability of their outputs. We propose a Bayesian method using Monte Carlo dropout within the attention layers of the transformer models to provide well-calibrated reliability estimates. We evaluate and visualize the results of the proposed approach on hate speech detection problems in several languages. Additionally, we test whether affective dimensions can enhance the information extracted by the BERT model in hate speech classification. Our experiments show that Monte Carlo dropout provides a viable mechanism for reliability estimation in transformer networks. Used within the BERT model, it offers state-of-the-art classification performance and can detect less trusted predictions.
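The core of Monte Carlo dropout is simply to keep dropout active at prediction time and read uncertainty off the spread of repeated stochastic forward passes. A minimal numeric sketch follows; the toy two-layer network with random fixed weights is an assumption for illustration, not the paper's BERT-based models:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy fixed weights standing in for a trained network.
W1 = rng.normal(0, 0.5, (8, 16))
W2 = rng.normal(0, 0.5, (16, 1))

def mc_dropout_predict(x, T=200, p=0.2):
    """Run T stochastic forward passes with dropout kept ON at inference.
    Returns predictive mean and std; a large std flags a less trusted
    prediction."""
    h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
    outs = []
    for _ in range(T):
        mask = rng.random(h.shape) > p          # Bernoulli dropout mask
        outs.append((h * mask / (1 - p)) @ W2)  # inverted-dropout scaling
    outs = np.array(outs)
    return outs.mean(axis=0), outs.std(axis=0)

x = rng.normal(0, 1, (4, 8))                    # four dummy inputs
mean, std = mc_dropout_predict(x)
print(mean.shape, std.shape)                    # (4, 1) (4, 1)
```

In the paper's setting the same idea is applied inside transformer attention layers, and the per-prediction spread is calibrated into a reliability estimate.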


2020 ◽  
Vol 122 (4) ◽  
pp. 1-26
Author(s):  
Maria Paula Ghiso ◽  
Stephanie A. Burdick-Shepherd

Background: This paper is part of the special issue “Reimagining Research and Practice at the Crossroads of Philosophy, Teaching, and Teacher Education.” Early childhood initiatives have joined a nexus of educational reforms characterized by increased accountability and a focus on measurement as a marker of student and teacher learning, with early education being framed as an economic good necessary for competing in the global marketplace. Underlying the recent push for early childhood education is what we see as a “discourse of improvement”: depictions of school change that prioritize achievement as reflected in assessment scores, data collection on teacher effectiveness, and high-stakes evaluation. These characteristics, we argue, foster increasingly inequitable educational contexts and obscure the particularities of what it means to be a child in the world.

Purpose: We use the practice of philosophical meditation, as articulated in Pierre Hadot's examination of philosophy as a way of life, to inquire into the logics of educational improvement as instantiated in particular contexts, and to cultivate cross-disciplinary partnerships committed to fostering children's flourishing. We link this meditational focus with feminist and decolonial theoretical perspectives to make visible the role of power in characterizing children's learning in relation to norms of development, minoritized identities, and hierarchies of knowledge.

Research Design: In this collaborative inquiry, we compose a series of meditations on our experiences with the logics of improvement, inspired by 12 months of systematic conversation. Our data sources include correspondence between the two authors, written reflections on specific practices in teacher education that each author engages with, and a set of literary, philosophical, and teacher education texts.

Conclusions/Recommendations: Our meditations illuminate the value of collective inquiry into what constitutes improvement in schools. We raise questions about how the measurement of learning is entwined in historical and present-day relations of power and in idealized formulations of the universal “child” or “teacher,” and we argue that we must work together to reimagine the framings that inform our work. Ultimately and most directly, these meditations can support dynamic attempts to cultivate meaningful and more equitable educational experiences for teachers and students. Philosophical meditations at the crossroads of philosophy, teaching, and teacher education thus extend beyond critique toward imagining and enacting a better world in our classrooms, even though (and especially when) the path is not clear.


2018 ◽  
Vol 55 (5) ◽  
pp. 671-686 ◽  
Author(s):  
Nils-Christian Bormann ◽  
Burcu Savun

Barbara Walter’s application of reputation theory to self-determination movements has advanced our understanding of why many separatist movements result in armed conflict. Walter has shown that governments of multi-ethnic societies often respond to territorial disputes with violence to deter similar future demands by other ethnic groups. When governments grant territorial accommodation to one ethnic group, they encourage other ethnic groups to seek similar concessions. However, a number of recent empirical studies cast doubt on the validity of Walter’s argument. We address recent challenges to the efficacy of reputation building in the context of territorial conflicts by delineating the precise scope conditions of reputation theory. First, we argue that only concessions granted after fighting should trigger additional conflict onsets. Second, demonstration effects should apply particularly to groups with grievances against the state. We then test the observable implications of our conditional argument for political power-sharing concessions. Using a global sample of ethnic groups in 120 states between 1946 and 2013, we find support for our arguments. Our theoretical framework enables us to identify the conditions under which different types of governmental concessions are likely to trigger future conflicts, and thus has important implications for conflict resolution.


2013 ◽  
pp. 294-312
Author(s):  
Wei-Min Hu ◽  
James E. Prieger

Accurate measurement of digital divides is important for policy purposes. Empirical studies of broadband subscription gaps have largely used cross-sectional data, which cannot speak to the timing of technological adoption. Yet the dynamics of a digital divide are important and deserve study. With the goal of improving our understanding of appropriate techniques for analyzing digital divides, we review econometric methodology and propose the use of duration analysis. We compare the performance of alternative estimation methods using a large dataset on DSL subscription in the U.S., paying particular attention to whether women, blacks, and Hispanics catch up to others in the broadband adoption race. We conclude that duration analysis best captures the dynamics of broadband gaps and is a useful addition to the analytic toolbox of digital divide researchers. Our results support the official collection of broadband statistics in panel form, where the same households are followed over time.
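Duration analysis of adoption data typically starts from a survival-curve estimate that handles right-censored households (those not yet subscribed when last observed). A minimal Kaplan-Meier sketch with made-up adoption times follows; it is illustrative only, and far simpler than the chapter's DSL analysis:

```python
import numpy as np

def kaplan_meier(times, adopted):
    """Kaplan-Meier survival curve for time-to-adoption data with right
    censoring (adopted=0 means the household had not adopted by its last
    observation). Returns the adoption times and the estimated probability
    of *not yet* having adopted at each of them."""
    times = np.asarray(times, dtype=float)
    adopted = np.asarray(adopted, dtype=bool)
    event_times = np.unique(times[adopted])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)               # still unsubscribed at t
        events = np.sum((times == t) & adopted)    # adoptions at t
        s *= 1.0 - events / at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Hypothetical months-until-DSL-adoption for eight households (0 = censored):
t, surv = kaplan_meier([3, 5, 5, 8, 10, 12, 12, 15], [1, 1, 0, 1, 0, 1, 1, 0])
print(t, np.round(surv, 3))
```

Unlike a cross-sectional subscription rate, this estimator uses the censored households' observation windows, which is exactly the information a cross-sectional snapshot discards.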


Author(s):  
Darko Pevec ◽  
Zoran Bosnic ◽  
Igor Kononenko

Current machine learning algorithms perform well in many problem domains, but in risk-sensitive decision making – for example, in medicine and finance – experts do not rely on common evaluation methods that provide overall assessments of models, because such techniques give no information about individual predictions. This chapter summarizes the research areas that have motivated the development of various approaches to individual prediction reliability. Based on these motivations, the authors describe six approaches to reliability estimation: inverse transduction, local sensitivity analysis, bagging variance, local cross-validation, local error modelling, and density-based estimation. Empirical evaluation on benchmark datasets provides promising results, especially for use with decision and regression trees. The testing results also reveal that the reliability estimators exhibit different performance levels when used with different models and in different domains. The authors show the usefulness of individual prediction reliability estimates in attempts to predict breast cancer recurrence. In this context, estimating reliability for individual predictions is of crucial importance for physicians seeking to validate predictions derived from classification and regression models.
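One of the six approaches, bagging variance, is straightforward to sketch: disagreement among models trained on bootstrap resamples serves as a per-prediction (un)reliability signal. The toy linear models and synthetic data below are assumptions for illustration, not the chapter's experiments:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic regression data (illustrative only).
n = 200
X = rng.normal(0, 1, (n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.3, n)

def bagged_predict(X_train, y_train, X_new, n_bags=50):
    """Fit one least-squares model per bootstrap resample; the spread of
    the ensemble's predictions is a per-instance unreliability estimate."""
    Xb = np.column_stack([np.ones(len(X_train)), X_train])
    Xn = np.column_stack([np.ones(len(X_new)), X_new])
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(y_train), len(y_train))  # bootstrap sample
        beta, *_ = np.linalg.lstsq(Xb[idx], y_train[idx], rcond=None)
        preds.append(Xn @ beta)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)  # prediction, unreliability

X_new = rng.normal(0, 1, (5, 3))
pred, unreliability = bagged_predict(X, y, X_new)
print(pred.shape, unreliability.shape)  # (5,) (5,)
```

The same recipe works with any base learner; the chapter's finding that estimator performance varies across models and domains is one reason several such approaches are compared.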

