Ice Is Hot and Water Is Dry

Author(s):  
Natalie Förster ◽  
Jörg-Tobias Kuhn

Abstract. To monitor students’ progress and adapt instruction to students’ needs, teachers increasingly use repeated assessments with equivalent tests. The present study investigates whether equivalent reading tests can be successfully developed via rule-based item design. Based on theoretical considerations, we identified three item features each for reading comprehension at the word, sentence, and text levels, which should influence the difficulty and time intensity of reading processes. Using optimal design algorithms, a design matrix was calculated, and four equivalent test forms of the German reading test series for second graders (quop-L2) were developed. A total of N = 7,751 students completed the tests. We estimated item difficulty and time intensity parameters as well as person ability and speed parameters using bivariate item response theory (IRT) models, and we investigated the influence of item features on item parameters. Results indicate that all item properties significantly affected either item difficulty or response time. Moreover, as indicated by the IRT-based test information functions and analyses of variance, the four test forms showed similar levels of difficulty and time intensity at the word, sentence, and text levels (all η2 < .002). Results were successfully cross-validated using a sample of N = 5,654 students.
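A common formalization of such bivariate accuracy-and-speed modeling, given here as a sketch of the general framework rather than the exact parameterization used for quop-L2, is van der Linden's hierarchical model, which pairs a 2PL model for responses with a lognormal model for response times:

P(Y_{pi} = 1 \mid \theta_p) = \frac{1}{1 + \exp[-a_i(\theta_p - b_i)]}, \qquad \ln T_{pi} \sim \mathcal{N}(\beta_i - \tau_p,\ \sigma_i^2)

Here b_i and \beta_i are the item's difficulty and time-intensity parameters, and \theta_p and \tau_p are the person's ability and speed, matching the four parameter types estimated in the study.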

Author(s):  
Rob Kim Marjerison ◽  
Pengfei Liu ◽  
Liam P. Duffy ◽  
Rongjuan Chen

This study explores which types of IELTS Academic Reading strategies are used and the impact of these strategies on test outcomes. The study was quantitative, using a descriptive-correlational design based on data collected from students at Sino-US University in China. Descriptive and inferential statistics were used to analyze the data. The method was a partial replication of a previous researcher's exploration of the reading processes learners engage in when taking IELTS Reading tests. Participants first finished an IELTS reading test and then completed a written retrospective protocol. The analysis reveals a moderately positive relationship between the choice of text preview strategy (from 1 to 5) and the outcome. A common pattern among high-scoring participants was to use expeditious reading strategies to initially locate information and then more careful reading strategies to identify answers to the question tasks.


2019 ◽  
Author(s):  
Daniela Ramona Crișan ◽  
Jorge Tendeiro ◽  
Rob Meijer

In this chapter, the practical consequences of violations of unidimensionality on selection decisions in the framework of unidimensional item response theory (IRT) models are investigated based on simulated data. The factors manipulated include the severity of violations, the proportion of misfitting items, and test length. The outcomes considered were the precision and accuracy of the estimated model parameters; the correlations of estimated ability (θ-hat) and number-correct (NC) scores with the true ability (θ); the ranks of the examinees and the overlap between sets of examinees selected based on either θ, θ-hat, or NC scores; and the bias in criterion-related validity estimates. Results show that the θ-hat values were unbiased by violations of unidimensionality, but their precision decreased as multidimensionality and the proportion of misfitting items increased; the estimated item parameters were robust to violations of unidimensionality. The correlations between θ, θ-hat, and NC scores, the agreement between the three selection criteria, and the accuracy of criterion-related validity estimates were all negatively affected, to some extent, by increasing levels of multidimensionality and the proportion of misfitting items. However, removing the misfitting items improved the results only in the case of severe multidimensionality and a large proportion of misfitting items, and deteriorated them otherwise.
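For readers who want to see the shape of such a robustness simulation, here is a minimal sketch; the dimensions, loadings, and sample sizes are invented for illustration and do not reproduce the chapter's design. It generates two correlated ability dimensions, lets a fraction of "misfitting" items load on the nuisance dimension, and checks how well number-correct scores recover the target trait.

import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items, prop_misfit, rho = 2000, 40, 0.3, 0.4

# True abilities: target trait theta1 and nuisance trait theta2, correlated rho
cov = np.array([[1.0, rho], [rho, 1.0]])
theta = rng.multivariate_normal([0.0, 0.0], cov, size=n_persons)

# Item parameters: discriminations and difficulties
a = rng.uniform(0.8, 2.0, n_items)
b = rng.normal(0.0, 1.0, n_items)

# Misfitting items load on the nuisance dimension instead of the target one
misfit = rng.choice(n_items, int(prop_misfit * n_items), replace=False)
load_dim = np.zeros(n_items, dtype=int)
load_dim[misfit] = 1

# 2PL response probabilities and simulated dichotomous responses
logits = a * (theta[:, load_dim] - b)
responses = (rng.random((n_persons, n_items)) < 1 / (1 + np.exp(-logits))).astype(int)

# Number-correct scores vs. the true target ability
nc = responses.sum(axis=1)
print("corr(NC, theta1):", np.corrcoef(nc, theta[:, 0])[0, 1])

Raising prop_misfit or lowering rho in this sketch mimics more severe violations of unidimensionality and visibly lowers the reported correlation.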


2021 ◽  
Author(s):  
William Goette

Objective: To develop and test an explanatory item response theory (IRT) model that examines properties of both the test (e.g., word order, learning over trials) and the items (e.g., frequency of words in English) on the CERAD List Learning Test immediate recall trials. Methods: Item-level response data from 1,050 participants (Mage = 73.74 [SD = 6.89], Medu = 13.77 [SD = 2.41]) in the Harmonized Cognitive Assessment Protocol were used to construct various IRT models. A Bayesian generalized (non-)linear multilevel modeling framework was utilized to specify the Rasch and two-parameter logistic (2PL) IRT models. Leave-one-out cross-validation information criteria and pseudo-Bayesian model averaging were used to compare models. Posterior predictive checks helped validate model performance in predicting observed data. Fixed effects for learning over trials, the serial position of words, and nine properties of the words (obtained through the English Lexicon Project) were modeled for their effects on item properties. Results: A random-person, random-item 2PL model with an item-specific inter-trial learning effect (i.e., a local dependency effect) provided the best fit of any of the models examined. Of the nine word traits examined, only four had highly probable effects on item difficulty: words became harder to learn with increasing frequency in English, average age of acquisition, and concreteness, and with lower levels of body-object integration. Conclusions: Results support that memory performance depends on more than repetition of words across trials. The finding that word traits affect difficulty and predict learning raises interesting possibilities for test translation, equating word lists, and extending test interpretation to more nuanced semantic deficits.
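A minimal sketch of what a random-item 2PL with an item-specific inter-trial learning effect might look like (the notation and the exact form of the learning term are assumptions; the paper's Bayesian multilevel specification is richer than this):

\operatorname{logit} P(Y_{pit} = 1) = a_i\,[\theta_p - (b_i - \delta_i\, y_{pi,t-1})]

where \theta_p is person ability, a_i and b_i are random item discrimination and difficulty, and \delta_i lets recalling word i on the previous trial lower its effective difficulty on the current one, which is the local dependency effect described above. Word properties then enter as fixed-effect predictors of b_i.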


2016 ◽  
Vol 41 (2) ◽  
pp. 97-114 ◽  
Author(s):  
Yongsang Lee ◽  
Mark Wilson

The Model With Internal Restrictions on Item Difficulty (MIRID; Butter, 1994) has been useful for investigating cognitive behavior in terms of the processes that lead to that behavior. The main objective of the MIRID model is to enable one to test how component processes influence complex cognitive behavior in terms of the item parameters. The original MIRID model is, however, a fairly restricted model for a number of reasons. One of these restrictions is that the model treats items as fixed and so does not fit measurement contexts where the concept of random items is needed. In this article, random-item approaches to the MIRID model are proposed, and both simulation and empirical studies are conducted to test and illustrate the random-item MIRID models. The simulation and empirical studies show that the random-item MIRID models provide more accurate estimates when substantial random errors exist, and thus these models may be more beneficial.
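The internal restriction that gives MIRID its name can be stated in one line: the difficulty of a composite item is a weighted combination of the difficulties of the items measuring its component processes plus a scaling constant,

\beta_{\text{composite}} = \sum_{k=1}^{K} \lambda_k \beta_k + \tau

The random-item extensions proposed in the article replace the fixed \beta's with item-level distributions, so the restriction holds for populations of items rather than for a fixed item set.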


2018 ◽  
Vol 17 (1) ◽  
pp. 1
Author(s):  
Farida Agus Setiawati ◽  
Rita Eka Izzaty ◽  
Veny Hidayat

This study aims to analyze the characteristics of the Scholastic Aptitude Test (SAT), consisting of both verbal and numerical subtests. We used a descriptive quantitative approach, describing the characteristics of the SAT in terms of item difficulty, the item discrimination index, the pseudoguessing index, the test information function, and the standard error of measurement. The data are responses to the SAT instrument, collected from 1,047 subjects in Yogyakarta using the documentation technique. The data were then analyzed with an item response theory (IRT) approach, with the help of the BILOG program, on all logistic parameter models, preceded by an assessment of item fit to the models. The analysis concludes that the verbal subtest tends to fit the 2-PL and 3-PL models, whereas the numerical subtest fits only the 2-PL model. The majority of SAT items have good characteristics in terms of item difficulty, item discrimination, and pseudoguessing, and, based on the test information function, the SAT is accurate for use under the 1-PL, 2-PL, and 3-PL IRT models at all levels of ability.
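For reference, the three logistic models compared in the analysis differ only in which item parameters are freed (this is the standard IRT formulation, not anything specific to the BILOG runs reported here):

P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp[-a_i(\theta - b_i)]}

The 3-PL model estimates difficulty b_i, discrimination a_i, and pseudoguessing c_i; the 2-PL model fixes c_i = 0; and the 1-PL model additionally constrains all discriminations a_i to be equal, leaving only b_i to vary.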


2018 ◽  
Author(s):  
Moritz Körber

The increasing number of interactions with automated systems has sparked researchers’ interest in trust in automation, because it predicts not only whether but also how an operator interacts with an automated system. In this work, a theoretical model of trust in automation is established, and the development and evaluation of a corresponding questionnaire (Trust in Automation, TiA) are described. Building on the model of organizational trust by Mayer, Davis, and Schoorman (1995) and the theoretical account by Lee and See (2004), a model of trust in automation containing six underlying dimensions was established. Following a deductive approach, an initial set of 57 items was generated. In a first online study, these items were analyzed, and based on the criteria of item difficulty, standard deviation, item-total correlation, internal consistency, content overlap with other items, and response rate, 40 items were eliminated and two scales were merged, leaving six scales (Reliability/Competence, Understandability/Predictability, Propensity to Trust, Intention of Developers, Familiarity, and Trust in Automation) containing a total of 19 items. The internal structure of the resulting questionnaire was analyzed in a second online study by means of an exploratory factor analysis. The results show sufficient preliminary evidence for the proposed factor structure and demonstrate that further pursuit of the model is reasonable, although certain revisions may be necessary. The calculated omega coefficients indicated good to excellent reliability for all scales. The results also provide evidence for the questionnaire’s criterion validity: consistent with expectations, an unreliable automated driving system received lower trust ratings than a reliably functioning system. In a subsequent driving simulator study, trust ratings predicted reliance on an automated driving system as well as monitoring in the form of gaze behavior. Possible steps for revision are discussed, and recommendations for the application of the questionnaire are given.
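As a hedged illustration of the classical item-analysis step used for item selection (corrected item-total correlations and internal consistency; the data below are random and the function names are invented for this sketch, not taken from the TiA materials):

import numpy as np

rng = np.random.default_rng(1)
# Fake Likert-type responses: 200 respondents x 10 items on a 5-point scale
data = rng.integers(1, 6, size=(200, 10)).astype(float)

def corrected_item_total(data):
    # Correlate each item with the sum of the remaining items
    total = data.sum(axis=1)
    return np.array([np.corrcoef(data[:, j], total - data[:, j])[0, 1]
                     for j in range(data.shape[1])])

def cronbach_alpha(data):
    # Classical internal-consistency estimate
    k = data.shape[1]
    return k / (k - 1) * (1 - data.var(axis=0, ddof=1).sum()
                          / data.sum(axis=1).var(ddof=1))

print("item means (difficulty):", data.mean(axis=0).round(2))
print("corrected item-total r: ", corrected_item_total(data).round(2))
print("Cronbach's alpha:       ", round(cronbach_alpha(data), 2))

Items with low item-total correlations or extreme means would be candidates for elimination; note that the study itself reports omega rather than alpha for the final scales.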


2018 ◽  
Author(s):  
Maja Olsbjerg ◽  
Karl Bang Christensen

IRT models are often applied when observed items are used to measure a unidimensional latent variable. Originally used in educational research, IRT models are now widely used when the focus is on physical functioning or psychological well-being. Modern applications often require more general models, typically models for multidimensional latent variables or longitudinal models for repeated measurements. This paper describes a collection of SAS macros that can be used for fitting longitudinal IRT models to data, simulating from them, and visualizing them. The macros encompass dichotomous as well as polytomous item response formats and are sufficiently flexible to accommodate changes in item parameters across time points and local dependence between responses at different time points.
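The macros themselves are SAS code; as a language-neutral sketch of the data-generating model they address, the following simulates dichotomous responses from a longitudinal Rasch model with a latent variable correlated across two time points and item-parameter drift (all parameter values invented for illustration):

import numpy as np

rng = np.random.default_rng(2)
n_persons, n_items, rho = 500, 10, 0.6

# Latent variable at two time points, correlated across time; mean change of 0.5
theta = rng.multivariate_normal([0.0, 0.5], [[1.0, rho], [rho, 1.0]], size=n_persons)

# Item difficulties, allowed to shift at time 2 (change in item parameters)
beta_t1 = rng.normal(0.0, 1.0, n_items)
beta_t2 = beta_t1 + rng.normal(0.0, 0.2, n_items)

def rasch_sim(theta_t, beta):
    # Rasch model: logit P(Y = 1) = theta - beta
    p = 1 / (1 + np.exp(-(theta_t[:, None] - beta[None, :])))
    return (rng.random(p.shape) < p).astype(int)

y1 = rasch_sim(theta[:, 0], beta_t1)
y2 = rasch_sim(theta[:, 1], beta_t2)
print("mean proportion correct, t1 vs. t2:", y1.mean().round(3), y2.mean().round(3))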


2017 ◽  
Vol 20 ◽  
Author(s):  
Miguel A. García-Pérez

Abstract. Threshold parameters have distinct referents across models for ordered responses. In difference models, thresholds are trait levels at which responding beyond category k is as likely as responding at or below it; in divide-by-total models, thresholds are trait levels at which responding in category k is as likely as responding in category k – 1. Thus, thresholds in divide-by-total models (but not in difference models) are the crossings of the option response functions for consecutive categories. Thresholds in difference models are always ordered but they may inconsequentially yield ordered or disordered crossings. In contrast, assimilation of thresholds and crossings in divide-by-total models questions category order when crossings are disordered. We analyze these aspects of difference and divide-by-total models, their relation to the order of response categories, and the consequences of collapsing categories to instate ordered crossings under divide-by-total models. We also show that item parameters in models for ordered responses can never contradict the pre-assumed order of categories and that the empirical order can only be established using a polytomous model that does not assume ordered categories, although this often gives rise to spurious outcomes. Practical implications for scale development are discussed.
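In symbols, for an item with ordered categories 0, …, m (standard formulations consistent with the description above): in a difference model such as the graded response model, the threshold b_k is defined through the cumulative probabilities,

P(X > k \mid \theta = b_k) = P(X \leq k \mid \theta = b_k) = 0.5

whereas in a divide-by-total model such as the (generalized) partial credit model, the threshold \delta_k is the trait level at which adjacent categories are equally likely,

P(X = k \mid \theta = \delta_k) = P(X = k - 1 \mid \theta = \delta_k)

which is precisely the crossing point of the option response functions for categories k − 1 and k.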


2017 ◽  
Vol 41 (5) ◽  
pp. 323-337 ◽  
Author(s):  
Bozhidar M. Bashkov ◽  
Christine E. DeMars

The purpose of this study was to examine the performance of the Metropolis–Hastings Robbins–Monro (MH-RM) algorithm in the estimation of multilevel multidimensional item response theory (ML-MIRT) models. The accuracy and efficiency of MH-RM in recovering item parameters, latent variances and covariances, as well as ability estimates within and between clusters (e.g., schools) were investigated in a simulation study, varying the number of dimensions, the intraclass correlation coefficient, the number of clusters, and cluster size, for a total of 24 conditions. Overall, MH-RM performed well in recovering the item, person, and group-level parameters of the model. Ratios of the empirical to analytical standard errors indicated that the analytical standard errors reported in flexMIRT were somewhat overestimated for the cluster-level ability estimates, a little too large for the person-level ability estimates, and essentially accurate for the other parameters. Limitations of the study, implications for educational measurement practice, and directions for future research are offered.
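For orientation, the MH-RM iteration alternates a Metropolis–Hastings imputation of the latent traits with a Robbins–Monro stochastic-approximation update of the item parameters; in generic form (flexMIRT's internal details are not reproduced here),

\xi^{(t+1)} = \xi^{(t)} + \gamma_t\, \tilde{s}\big(\xi^{(t)}\big), \qquad \gamma_t \to 0, \quad \sum_t \gamma_t = \infty

where \tilde{s} is a noisy estimate of the complete-data score computed from the MH draws, and the decreasing gain sequence \gamma_t averages out the Monte Carlo noise over iterations.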


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Masoud Geramipour

Abstract. Rasch testlet and bifactor models are two measurement models that can deal with local item dependency (LID) in assessing the dimensionality of reading comprehension testlets. This study aimed to apply these measurement models to real item response data from Iranian EFL reading comprehension tests and to compare the validity of the bifactor models and the corresponding item parameters with unidimensional and multidimensional Rasch models. The data were collected from the EFL reading comprehension section of the Iranian national university entrance examinations from 2016 to 2018. Various advanced packages of the R system were employed to fit the unidimensional, multidimensional, and testlet Rasch models and the exploratory and confirmatory bifactor models. Item parameters were then estimated and testlet effects identified; moreover, goodness-of-fit indices and the item parameter correlations for the different models were calculated. Results showed that the testlet effects were small but non-negligible for all of the EFL reading testlets. Moreover, the bifactor models were superior in terms of goodness of fit, and the exploratory bifactor model better explained the factor structure of the EFL reading comprehension tests. However, item difficulty parameters in the Rasch models were more consistent than those in the bifactor models. This study has substantial implications for methods of dealing with LID and dimensionality in assessing reading comprehension in EFL testing.
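In the Rasch testlet model, the LID induced by a shared passage is absorbed by a testlet-specific person effect (standard notation; the study's exact parameterization may differ):

\operatorname{logit} P(Y_{pi} = 1) = \theta_p + \gamma_{p\,d(i)} - b_i

where d(i) is the testlet (passage) containing item i and \gamma_{p\,d(i)} is person p's effect for that testlet; the variance of \gamma quantifies the testlet effect, and setting it to zero recovers the unidimensional Rasch model. The bifactor models generalize this by freeing the loadings on the general and group factors.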

