Reliability Estimates for IRT-Based Forced-Choice Assessment Scores

2021 ◽  
pp. 109442812199908
Author(s):  
Yin Lin

Forced-choice (FC) assessments of noncognitive psychological constructs (e.g., personality, behavioral tendencies) are popular in high-stakes organizational testing scenarios (e.g., informing hiring decisions) because of their enhanced resistance to response distortions (e.g., faking good, impression management). The measurement precision of FC assessment scores used to inform personnel decisions is of paramount importance in practice. Different types of reliability estimates are reported for FC assessment scores in current publications, but consensus on best practices appears to be lacking. To provide understanding and structure around the reporting of FC reliability, this study systematically examined different reliability estimation methods for Thurstonian IRT-based FC assessment scores: their theoretical differences were discussed, and their numerical differences were illustrated through a series of simulation and empirical studies. In doing so, this study provides a practical guide for appraising different reliability estimation methods for IRT-based FC assessment scores.
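The kinds of reliability estimates the abstract contrasts can be sketched in a toy simulation. The one-observation normal measurement model, sample size, and error SD below are illustrative assumptions, not the study's Thurstonian IRT setup:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative setup: one latent trait, one noisy observation per person,
# EAP scoring under a N(0, 1) prior.
n, s = 20000, 0.5                      # persons, measurement error SD
theta = rng.normal(0.0, 1.0, n)        # true trait values
x = theta + rng.normal(0.0, s, n)      # observed scores
w = 1.0 / (1.0 + s**2)                 # EAP shrinkage weight
theta_hat = w * x                      # EAP trait estimates
post_var = s**2 / (1.0 + s**2)         # posterior variance (constant here)

# Empirical reliability: share of score variance that is true-score variance.
rel_empirical = np.var(theta_hat) / (np.var(theta_hat) + post_var)

# Simulation-based reliability: squared correlation with the true traits
# (available only in simulations, where theta is known).
rel_squared_corr = np.corrcoef(theta, theta_hat)[0, 1] ** 2

print(round(rel_empirical, 2), round(rel_squared_corr, 2))  # both ≈ 0.80
```

In this idealized setup the two estimates agree; with real IRT-based FC scores they can diverge, which is the kind of numerical difference the study examines.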

2021 ◽  
Vol 12 ◽  
Author(s):  
Esraa Al-Shatti ◽  
Marc Ohana

Despite the popularity of the term impression management (IM) in the literature, there is no consensus on how different types of IM (direct vs. indirect) and modes of interaction (face-to-face vs. online) promote career-related outcomes. While most empirical studies focus on direct IM, individuals engage in both types of IM and both interaction modes, particularly indirect IM in the online context. Indeed, recent developments suggest that online interactions now prevail over face-to-face interactions, especially during the COVID-19 pandemic. Accordingly, this study presents the first systematic literature review to differentiate between types of IM (direct vs. indirect) and modes of interaction (face-to-face vs. online) from a career development perspective. The review shows that direct IM is more widely studied in the face-to-face than in the online interaction mode, while indirect IM is neglected in both modes. This study thus provides evidence of the need to investigate and differentiate between the different types of IM and interaction modes for career-related outcomes, highlighting research gaps and directions for future inquiry.


2019 ◽  
Vol 79 (5) ◽  
pp. 827-854 ◽  
Author(s):  
Paul-Christian Bürkner ◽  
Niklas Schulte ◽  
Heinz Holling

Forced-choice questionnaires have been proposed to avoid common response biases typically associated with rating scale questionnaires. To overcome ipsativity issues of trait scores obtained from classical scoring approaches for forced-choice items, advanced methods from item response theory (IRT), such as the Thurstonian IRT model, have been proposed. For convenient model specification, we introduce the thurstonianIRT R package, which uses Mplus, lavaan, and Stan for model estimation. Based on practical considerations, we establish that items within one block need to be equally keyed to achieve similar social desirability, which is essential for creating forced-choice questionnaires that have the potential to resist faking intentions. According to extensive simulations, measuring up to five traits using blocks of only equally keyed items does not yield sufficiently accurate trait scores and inter-trait correlation estimates, under either frequentist or Bayesian estimation methods. As a result, persons’ trait scores remain partially ipsative and thus do not allow for valid comparisons between persons. However, we demonstrate that trait scores based on only equally keyed blocks can be improved substantially by measuring a sizable number of traits. More specifically, in our simulations of 30 traits, scores based on only equally keyed blocks were non-ipsative and highly accurate. We conclude that in high-stakes situations where persons are motivated to give fake answers, Thurstonian IRT models should only be applied to tests measuring a sizable number of traits.
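The ipsativity problem the abstract describes can be seen in a small simulation of classical forced-choice scoring. This is a hypothetical Python sketch (the paper's analyses use the thurstonianIRT R package, not this setup): because every person receives the same total number of points, scores carry only within-person information and inter-trait correlations are forced to be negative on average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 traits, 20 blocks, one item per trait per block.
n_persons, n_traits, n_blocks = 4000, 5, 20
theta = rng.normal(0.0, 1.0, (n_persons, n_traits))   # true trait levels
scores = np.zeros((n_persons, n_traits))

for _ in range(n_blocks):
    # Item utility = trait level + noise; classical scoring awards
    # 0..4 points per block according to the within-person ranking.
    util = theta + rng.normal(0.0, 1.0, (n_persons, n_traits))
    scores += util.argsort(axis=1).argsort(axis=1)

# Every person's total is identical -> purely relative (ipsative) scores.
totals = scores.sum(axis=1)

# Ipsativity forces negative average inter-trait correlations,
# roughly -1/(k-1) = -0.25 for k = 5 independent traits.
corr = np.corrcoef(scores.T)
mean_offdiag = (corr.sum() - n_traits) / (n_traits * (n_traits - 1))
print(totals.min(), totals.max(), round(mean_offdiag, 2))
```

The constant row totals are exactly the property that makes between-person comparisons invalid under classical scoring, which is what the Thurstonian IRT approach is designed to overcome.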


2020 ◽  
Author(s):  
John J Shaw ◽  
Zhisen Urgolites ◽  
Padraic Monaghan

Visual long-term memory has a large and detailed storage capacity for individual scenes, objects, and actions. However, memory for combinations of actions and scenes is poorer, suggesting difficulty in binding this information together. Sleep can enhance declarative memory, but whether sleep can also boost memory for bound information, and whether any effect generalizes across different types of information, is not yet known. Experiments 1 to 3 tested the effects of sleep on binding actions and scenes, and Experiments 4 and 5 tested the binding of objects and scenes. Participants viewed composites and were tested 12 hours later, after a delay spent asleep (9 p.m.-9 a.m.) or awake (9 a.m.-9 p.m.), on an alternative forced-choice recognition task. For action-scene composites, memory was relatively poor, with no significant effect of sleep. For object-scene composites, sleep did improve memory. Sleep can promote binding in memory, depending on the type of information to be combined.


2021 ◽  
pp. 001316442110089
Author(s):  
Yuanshu Fu ◽  
Zhonglin Wen ◽  
Yang Wang

Composite reliability, or coefficient omega, can be estimated using structural equation modeling. Composite reliability is usually estimated under the basic independent clusters model of confirmatory factor analysis (ICM-CFA). However, due to the existence of cross-loadings, the model fit of the exploratory structural equation model (ESEM) is often found to be substantially better than that of ICM-CFA. The present study first illustrated the method used to estimate composite reliability under ESEM and then compared the difference between ESEM and ICM-CFA in composite reliability estimation under various numbers of indicators per factor, target factor loadings, cross-loadings, and sample sizes. The results showed no apparent difference between ESEM and ICM-CFA in estimating composite reliability, and the rotation type did not affect the composite reliability estimates generated by ESEM. An empirical example further corroborated the simulation results. Based on the present study, we suggest that if the model fit of ESEM (regardless of the rotation criteria used) is acceptable but that of ICM-CFA is not, the composite reliability estimates based on the two models should still be similar. If the target factor loadings are relatively small, researchers should increase the number of indicators per factor or increase the sample size.
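For readers unfamiliar with coefficient omega, the quantity being estimated can be computed directly from a factor solution. This is a minimal sketch assuming a single factor, standardized loadings, and uncorrelated errors; the loadings below are hypothetical, chosen purely for illustration:

```python
import numpy as np

def composite_reliability(loadings, error_vars=None):
    """Coefficient omega for a congeneric one-factor model with
    uncorrelated errors: omega = (sum of loadings)^2 /
    ((sum of loadings)^2 + sum of error variances).
    With standardized loadings, error variances default to 1 - loading**2."""
    lam = np.asarray(loadings, dtype=float)
    if error_vars is None:
        error_vars = 1.0 - lam**2
    num = lam.sum() ** 2
    return num / (num + np.sum(error_vars))

# Hypothetical standardized loadings for one factor:
print(round(composite_reliability([0.7, 0.6, 0.8, 0.5]), 3))  # 0.749
```

Under ESEM the same formula applies per factor once cross-loadings are estimated; the study's point is that the resulting omega values differ little between ESEM and ICM-CFA.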


Author(s):  
Kristian Miok ◽  
Blaž Škrlj ◽  
Daniela Zaharie ◽  
Marko Robnik-Šikonja

Hate speech is an important problem in the management of user-generated content. To remove offensive content or ban misbehaving users, content moderators need reliable hate speech detectors. Recently, deep neural networks based on the transformer architecture, such as the (multilingual) BERT model, have achieved superior performance in many natural language classification tasks, including hate speech detection. So far, these methods have not been able to quantify the reliability of their outputs. We propose a Bayesian method using Monte Carlo dropout within the attention layers of the transformer models to provide well-calibrated reliability estimates. We evaluate and visualize the results of the proposed approach on hate speech detection problems in several languages. Additionally, we test whether affective dimensions can enhance the information extracted by the BERT model in hate speech classification. Our experiments show that Monte Carlo dropout provides a viable mechanism for reliability estimation in transformer networks. Used within the BERT model, it offers state-of-the-art classification performance and can detect less trusted predictions.
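The core of Monte Carlo dropout is simply to keep dropout active at prediction time and read uncertainty off the spread of repeated stochastic forward passes. A minimal numeric sketch follows; the toy two-layer network with random fixed weights is an assumption for illustration, not the paper's BERT-based models:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy fixed weights standing in for a trained network.
W1 = rng.normal(0, 0.5, (8, 16))
W2 = rng.normal(0, 0.5, (16, 1))

def mc_dropout_predict(x, T=200, p=0.2):
    """Run T stochastic forward passes with dropout kept ON at inference.
    Returns predictive mean and std; a large std flags a less trusted
    prediction."""
    h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
    outs = []
    for _ in range(T):
        mask = rng.random(h.shape) > p          # Bernoulli dropout mask
        outs.append((h * mask / (1 - p)) @ W2)  # inverted-dropout scaling
    outs = np.array(outs)
    return outs.mean(axis=0), outs.std(axis=0)

x = rng.normal(0, 1, (4, 8))                    # four dummy inputs
mean, std = mc_dropout_predict(x)
print(mean.shape, std.shape)                    # (4, 1) (4, 1)
```

In the paper's setting the same idea is applied inside transformer attention layers, and the per-prediction spread is calibrated into a reliability estimate.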


2020 ◽  
Vol 122 (4) ◽  
pp. 1-26
Author(s):  
Maria Paula Ghiso ◽  
Stephanie A. Burdick-Shepherd

Background: This paper is part of the special issue “Reimagining Research and Practice at the Crossroads of Philosophy, Teaching, and Teacher Education.” Early childhood initiatives have joined a nexus of educational reforms characterized by increased accountability and a focus on measurement as a marker of student and teacher learning, with early education being framed as an economic good necessary for competing in the global marketplace. Underlying the recent push for early childhood education is what we see as a “discourse of improvement”: depictions of school change that prioritize achievement as reflected in assessment scores, data collection on teacher effectiveness, and high-stakes evaluation. These characteristics, we argue, foster increasingly inequitable educational contexts and obscure the particularities of what it means to be a child in the world.

Purpose: We use the practice of philosophical meditation, as articulated in Pierre Hadot's examination of philosophy as a way of life, to inquire into the logics of educational improvement as instantiated in particular contexts, and to cultivate cross-disciplinary partnerships committed to fostering children's flourishing. We link this meditational focus with feminist and decolonial theoretical perspectives to make visible the role of power in characterizing children's learning in relation to norms of development, minoritized identities, and hierarchies of knowledge.

Research Design: In this collaborative inquiry, we compose a series of meditations on our experiences with the logics of improvement, inspired by 12 months of systematic conversation. Our data sources include correspondence between the two authors, written reflections on specific practices in teacher education that each author engages with, and a set of literary, philosophical, and teacher education texts.

Conclusions/Recommendations: Our meditations illuminate the value of collective inquiry into what constitutes improvement in schools. We raise questions about how the measurement of learning is entwined in historical and present-day relations of power and in idealized formulations of the universal “child” or “teacher,” and we argue that we must work together to reimagine the framings that inform our work. Ultimately and most directly, these meditations can support dynamic attempts to cultivate meaningful and more equitable educational experiences for teachers and students. Philosophical meditations at the crossroads of philosophy, teaching, and teacher education thus extend beyond critique toward imagining and enacting a better world in our classrooms, even though (and especially when) the path is not clear.


2018 ◽  
Vol 55 (5) ◽  
pp. 671-686 ◽  
Author(s):  
Nils-Christian Bormann ◽  
Burcu Savun

Barbara Walter’s application of reputation theory to self-determination movements has advanced our understanding of why many separatist movements result in armed conflict. Walter has shown that governments of multi-ethnic societies often respond to territorial disputes with violence to deter similar future demands by other ethnic groups. When governments grant territorial accommodation to one ethnic group, they encourage other ethnic groups to seek similar concessions. However, a number of recent empirical studies cast doubt on the validity of Walter’s argument. We address recent challenges to the efficacy of reputation building in the context of territorial conflicts by delineating the precise scope conditions of reputation theory. First, we argue that only concessions granted after fighting should trigger additional conflict onsets. Second, demonstration effects should apply particularly to groups with grievances against the state. We then test the observable implications of our conditional argument for political power-sharing concessions. Using a global sample of ethnic groups in 120 states between 1946 and 2013, we find support for our arguments. Our theoretical framework enables us to identify the conditions under which different types of governmental concessions are likely to trigger future conflicts, and thus has important implications for conflict resolution.


2013 ◽  
pp. 294-312
Author(s):  
Wei-Min Hu ◽  
James E. Prieger

Accurate measurement of digital divides is important for policy purposes. Empirical studies of broadband subscription gaps have largely used cross-sectional data, which cannot speak to the timing of technological adoption. Yet the dynamics of a digital divide are important and deserve study. With the goal of improving our understanding of appropriate techniques for analyzing digital divides, we review econometric methodology and propose the use of duration analysis. We compare the performance of alternative estimation methods using a large dataset on DSL subscription in the U.S., paying particular attention to whether women, blacks, and Hispanics catch up to others in the broadband adoption race. We conclude that duration analysis best captures the dynamics of broadband gaps and is a useful addition to the analytic toolbox of digital divide researchers. Our results support the official collection of broadband statistics in panel form, where the same households are followed over time.
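Duration analysis of adoption data typically starts from a survival-curve estimate that handles right-censored households (those not yet subscribed when last observed). A minimal Kaplan-Meier sketch with made-up adoption times follows; it is illustrative only, and far simpler than the chapter's DSL analysis:

```python
import numpy as np

def kaplan_meier(times, adopted):
    """Kaplan-Meier survival curve for time-to-adoption data with right
    censoring (adopted=0 means the household had not adopted by its last
    observation). Returns the adoption times and the estimated probability
    of *not yet* having adopted at each of them."""
    times = np.asarray(times, dtype=float)
    adopted = np.asarray(adopted, dtype=bool)
    event_times = np.unique(times[adopted])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)               # still unsubscribed at t
        events = np.sum((times == t) & adopted)    # adoptions at t
        s *= 1.0 - events / at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Hypothetical months-until-DSL-adoption for eight households (0 = censored):
t, surv = kaplan_meier([3, 5, 5, 8, 10, 12, 12, 15], [1, 1, 0, 1, 0, 1, 1, 0])
print(t, np.round(surv, 3))
```

Unlike a cross-sectional subscription rate, this estimator uses the censored households' observation windows, which is exactly the information a cross-sectional snapshot discards.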


Author(s):  
Darko Pevec ◽  
Zoran Bosnic ◽  
Igor Kononenko

Current machine learning algorithms perform well in many problem domains, but in risk-sensitive decision making – for example, in medicine and finance – experts do not rely on common evaluation methods that provide overall assessments of models, because such techniques give no information about individual predictions. This chapter summarizes the research areas that have motivated the development of various approaches to individual prediction reliability. Based on these motivations, the authors describe six approaches to reliability estimation: inverse transduction, local sensitivity analysis, bagging variance, local cross-validation, local error modelling, and density-based estimation. Empirical evaluation on benchmark datasets provides promising results, especially for use with decision and regression trees. The testing results also reveal that the reliability estimators exhibit different performance levels when used with different models and in different domains. The authors show the usefulness of individual prediction reliability estimates in attempts to predict breast cancer recurrence. In this context, estimating reliability for individual predictions is of crucial importance for physicians seeking to validate predictions derived from classification and regression models.
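One of the six approaches, bagging variance, is straightforward to sketch: disagreement among models trained on bootstrap resamples serves as a per-prediction (un)reliability signal. The toy linear models and synthetic data below are assumptions for illustration, not the chapter's experiments:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic regression data (illustrative only).
n = 200
X = rng.normal(0, 1, (n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.3, n)

def bagged_predict(X_train, y_train, X_new, n_bags=50):
    """Fit one least-squares model per bootstrap resample; the spread of
    the ensemble's predictions is a per-instance unreliability estimate."""
    Xb = np.column_stack([np.ones(len(X_train)), X_train])
    Xn = np.column_stack([np.ones(len(X_new)), X_new])
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(y_train), len(y_train))  # bootstrap sample
        beta, *_ = np.linalg.lstsq(Xb[idx], y_train[idx], rcond=None)
        preds.append(Xn @ beta)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)  # prediction, unreliability

X_new = rng.normal(0, 1, (5, 3))
pred, unreliability = bagged_predict(X, y, X_new)
print(pred.shape, unreliability.shape)  # (5,) (5,)
```

The same recipe works with any base learner; the chapter's finding that estimator performance varies across models and domains is one reason several such approaches are compared.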

