Modern Test Theory Techniques for Adaptive Testing in Short Scales Comprising Polytomous Items: A Monte Carlo Simulation Study Comparing Rasch Measurement Theory to Unidimensional and Multidimensional Graded Response Models. (Preprint)

2021 ◽  
Author(s):  
Conrad J. Harrison ◽  
Bao Sheng Loe ◽  
Inge Apon ◽  
Chris J. Sidey-Gibbons ◽  
Marc C. Swan ◽  
...  

BACKGROUND There are two philosophical approaches to contemporary psychometrics: Rasch measurement theory (RMT) and item response theory (IRT). Either measurement strategy can be applied to computerized adaptive testing (CAT). There are potential benefits of IRT over RMT with regards to measurement precision, but also potential risks to measurement generalizability. RMT CAT assessments have demonstrated good performance with the CLEFT-Q, a patient-reported outcome measure for use in orofacial clefting. OBJECTIVE To test whether the post-hoc application of IRT (graded response models, GRMs, and multidimensional GRMs) to RMT-validated CLEFT-Q appearance scales could improve CAT accuracy at given assessment lengths. METHODS Partial credit Rasch models, unidimensional GRMs and a multidimensional GRM were calibrated for each of the 7 CLEFT-Q appearance scales (which measure the appearance of the face, jaw, teeth, nose, nostrils, cleft lip scar, and lips) using data from the CLEFT-Q field test. A second, simulated dataset was generated with 1000 plausible response sets to each scale. Rasch and GRM scores were calculated for each simulated response set, scaled to 0-100 scores, and compared by Pearson’s correlation coefficient, root mean square error (RMSE), mean absolute error (MAE) and 95% limits of agreement. For the face, teeth and jaw scales, we repeated this in an independent, real patient dataset. We then used the simulated data to compare the performance of a range of fixed-length CAT assessments that were generated with partial credit Rasch models, unidimensional GRMs and the multidimensional GRM. Median standard error of measurement (SEM) was recorded for each assessment. CAT scores were scaled to 0-100 and compared to linear assessment Rasch scores with RMSE, MAE and 95% limits of agreement.
This was repeated in the independent, real patient dataset with the RMT and unidimensional GRM CAT assessments for the face, teeth and jaw scales to test the generalizability of our simulated data analysis. RESULTS Linear assessment scores generated by Rasch models and unidimensional GRMs showed close agreement, with RMSE ranging from 2.2 to 6.1, and MAE ranging from 1.5 to 4.9 in the simulated dataset. These findings were closely reproduced in the real patient dataset. Unidimensional GRM CAT algorithms achieved lower median SEM than Rasch counterparts, but reproduced linear assessment scores with very similar accuracy (RMSE, MAE and 95% limits of agreement). The multidimensional GRM had poorer accuracy than the unidimensional models at comparable assessment lengths. CONCLUSIONS Partial credit Rasch models and GRMs produce very similar CAT scores. GRM CAT assessments achieve a lower SEM, but this does not translate into better accuracy. Commonly used SEM heuristics for target measurement reliability should not be generalized across CAT assessments built with different psychometric models. In this study, a relatively parsimonious multidimensional GRM CAT algorithm performed more poorly than unidimensional GRM comparators.
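The agreement metrics used throughout this study (RMSE, MAE, and Bland-Altman 95% limits of agreement between two sets of 0-100 scaled scores) can be sketched as follows. The score vectors below are invented for illustration, not taken from the study:

```python
import math

def agreement_metrics(scores_a, scores_b):
    """Return RMSE, MAE, and Bland-Altman 95% limits of agreement."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    mae = sum(abs(d) for d in diffs) / n
    mean_diff = sum(diffs) / n
    # Sample SD of the differences; limits of agreement at +/- 1.96 SD.
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    loa = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)
    return rmse, mae, loa

# Hypothetical 0-100 scaled scores for the same five simulated respondents.
rasch = [52.0, 61.5, 47.3, 70.2, 55.8]
grm = [50.1, 63.0, 45.9, 68.8, 57.2]
rmse, mae, loa = agreement_metrics(rasch, grm)
```

Because RMSE penalizes large differences quadratically, MAE is never larger than RMSE on the same data, which is consistent with the ranges reported in the results (RMSE 2.2-6.1 vs. MAE 1.5-4.9).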

2021 ◽  
pp. 014662162110131
Author(s):  
Leah Feuerstahler ◽  
Mark Wilson

In between-item multidimensional item response models, it is often desirable to compare individual latent trait estimates across dimensions. These comparisons are only justified if the model dimensions are scaled relative to each other. Traditionally, this scaling is done using approaches such as standardization—fixing the latent mean and standard deviation to 0 and 1 for all dimensions. However, approaches such as standardization do not guarantee that Rasch model properties hold across dimensions. Specifically, for between-item multidimensional Rasch family models, the unique ordering of items holds within dimensions, but not across dimensions. Previously, Feuerstahler and Wilson described the concept of scale alignment, which aims to enforce the unique ordering of items across dimensions by linearly transforming item parameters within dimensions. In this article, we extend the concept of scale alignment to the between-item multidimensional partial credit model and to models fit using incomplete data. We illustrate this method in the context of the Kindergarten Individual Development Survey (KIDS), a multidimensional survey of kindergarten readiness used in the state of Illinois. We also present simulation results that demonstrate the effectiveness of scale alignment in the context of polytomous item response models and missing data.
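The core mechanism of scale alignment, linearly transforming item parameters within a dimension, can be illustrated with a toy mean/sigma linking of two dimensions' item locations. This is only a sketch of the general idea; the actual alignment criterion developed by Feuerstahler and Wilson is more involved, and the item locations here are invented:

```python
import statistics

def linear_transform(params, slope, intercept):
    """Apply delta' = slope * delta + intercept to each item location."""
    return [slope * d + intercept for d in params]

# Toy item locations on two dimensions of a between-item model.
dim1 = [-1.0, -0.2, 0.4, 1.1]
dim2 = [-2.0, -0.5, 0.8, 2.1]

# Rescale dimension 2 so its mean and spread match dimension 1's.
slope = statistics.stdev(dim1) / statistics.stdev(dim2)
intercept = statistics.mean(dim1) - slope * statistics.mean(dim2)
aligned = linear_transform(dim2, slope, intercept)
```

A positive-slope linear transformation preserves the ordering of items within the dimension, which is why alignment can reconcile scales across dimensions without disturbing within-dimension Rasch properties.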


2014 ◽  
Vol 22 (2) ◽  
pp. 323-341 ◽  
Author(s):  
Dheeraj Raju ◽  
Xiaogang Su ◽  
Patricia A. Patrician

Background and Purpose: The purpose of this article is to introduce different types of item response theory models and to demonstrate their usefulness by evaluating the Practice Environment Scale. Methods: Item response theory models such as the constrained and unconstrained graded response models, the partial credit model, the Rasch model, and the one-parameter logistic model are demonstrated. The Akaike information criterion (AIC) and Bayesian information criterion (BIC) indices are used as model selection criteria. Results: The unconstrained graded response and partial credit models indicated the best fit for the data. Almost all items in the instrument performed well. Conclusions: Although most of the items strongly measure the construct, there are a few items that could be eliminated without substantially altering the instrument. The analysis revealed that the instrument may function differently when administered to different unit types.
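The AIC and BIC comparison used for model selection above reduces to two simple formulas applied to each fitted model's log-likelihood; lower values indicate the preferred model. The log-likelihoods and parameter counts below are made-up numbers, not the article's results:

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical fits: an unconstrained GRM (more parameters, better fit)
# vs. a Rasch model (fewer parameters, worse fit), n = 500 respondents.
models = {
    "unconstrained GRM": (-4200.0, 60),
    "Rasch": (-4350.0, 31),
}
for name, (ll, k) in models.items():
    print(name, round(aic(ll, k), 1), round(bic(ll, k, 500), 1))
```

BIC penalizes extra parameters more heavily than AIC once n exceeds about 8 observations, so the two criteria can disagree; here the fit advantage of the richer model is large enough that both favor it.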


2018 ◽  
Vol 28 (10-11) ◽  
pp. 3273-3285 ◽  
Author(s):  
Hui Zhang ◽  
Li Tang ◽  
Yuanyuan Kong ◽  
Tian Chen ◽  
Xueyan Liu ◽  
...  

Many biomedical and psychosocial studies involve population mixtures, which consist of multiple latent subpopulations. Because group membership cannot be observed, standard methods do not apply when differential treatment effects need to be studied across subgroups. We consider a two-group mixture in which membership of latent subgroups is determined by structural zeroes of a zero-inflated count variable and propose a new approach to model treatment differences between latent subgroups in a longitudinal setting. The approach is also combined with the inverse probability weighted method to address data missingness. As the approach builds on distribution-free functional response models, it requires no parametric distribution model and thereby provides a robust inference. We illustrate the approach with both real and simulated data.
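A minimal sketch of the inverse probability weighting (IPW) idea mentioned above: observed outcomes are up-weighted by the inverse of their estimated probability of being observed, correcting the estimate for data missing at random. The observation probabilities here are assumed known for illustration; in practice they would be estimated, e.g. by logistic regression:

```python
def ipw_mean(values, observed, prob_observed):
    """IPW estimate of a mean when some outcomes are missing."""
    num = sum(v / p for v, o, p in zip(values, observed, prob_observed) if o)
    den = sum(1 / p for o, p in zip(observed, prob_observed) if o)
    return num / den

# Toy data: 6 subjects, two unobserved (their values are ignored).
values = [2.0, 3.0, 5.0, 4.0, 6.0, 1.0]
observed = [True, True, False, True, True, False]
probs = [0.9, 0.8, 0.5, 0.8, 0.9, 0.5]  # P(observed) per subject
est = ipw_mean(values, observed, probs)
```

Subjects who were unlikely to be observed receive larger weights, so the reweighted sample mimics the full population under the missing-at-random assumption.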


Author(s):  
Patrick Taffé ◽  
Mingkai Peng ◽  
Vicki Stagg ◽  
Tyler Williamson

Bland and Altman's (1986, Lancet 327: 307–310) limits of agreement have been used in many clinical research settings to assess agreement between two methods of measuring a quantitative characteristic. However, when the variances of the measurement errors of the two methods differ, limits of agreement can be misleading. biasplot implements a new statistical methodology that Taffé (Forthcoming, Statistical Methods in Medical Research) recently developed to circumvent this issue and assess bias and precision of the two measurement methods (one is the reference standard, and the other is the new measurement method to be evaluated). biasplot produces three new plots introduced by Taffé: the “bias plot”, “precision plot”, and “comparison plot”. These help the investigator visually evaluate the performance of the new measurement method. In this article, we introduce the user-written command biasplot and present worked examples using simulated data included with the package. Note that the Taffé method assumes there are several measurements from the reference standard and possibly as few as one measurement from the new method for each individual.
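The data layout the Taffé method assumes, several reference-standard measurements per individual and possibly a single measurement from the new method, can be sketched as below. This toy summary only computes each individual's deviation of the new-method measurement from the reference mean; the actual biasplot command models differential and proportional bias, which this sketch does not:

```python
# Repeated reference-standard measurements per individual (invented data).
ref = {
    "s1": [10.1, 9.9, 10.0],
    "s2": [15.2, 14.8, 15.0],
}
# A single new-method measurement per individual.
new = {"s1": 10.6, "s2": 15.9}

# New method minus the individual's reference mean: a crude per-subject
# deviation that repeated reference measurements make estimable.
deviations = {
    sid: new[sid] - sum(ms) / len(ms)
    for sid, ms in ref.items()
}
```

Averaging the repeated reference measurements reduces the influence of the reference method's own measurement error, which is what allows the bias of the new method to be separated from noise.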


2000 ◽  
Vol 25 (3) ◽  
pp. 253-270 ◽  
Author(s):  
John G. Baker ◽  
James B. Rounds ◽  
Michael A. Zevon

Two multiple category item response theory models are compared using a data set of 52 mood terms with 713 subjects. Tellegen’s (1985) model of mood with two independent, unipolar dimensions of positive and negative affect provided a theoretical basis for the assumption of unidimensionality. Principal components analysis and item parameter tests supported the unidimensionality assumption. Comparative model data fit for the Samejima (1969) logistic model for graded responses and the Masters (1982) partial credit model favored the former model for this particular data set. Theoretical and practical aspects of the comparative application of multiple category models in the measurement of subjective well-being or mood are discussed.
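The Samejima (1969) graded response model compared above gives the probability of responding in category k as the difference of two cumulative logistic curves. A minimal implementation for one item follows; the slope and threshold parameters are invented for illustration:

```python
import math

def grm_category_probs(theta, a, thresholds):
    """P(category k) for k = 0..m, given slope a and ordered thresholds."""
    def p_star(k):
        # Cumulative probability of responding in category k or higher.
        if k == 0:
            return 1.0
        if k > len(thresholds):
            return 0.0
        return 1.0 / (1.0 + math.exp(-a * (theta - thresholds[k - 1])))

    m = len(thresholds)
    return [p_star(k) - p_star(k + 1) for k in range(m + 1)]

# A 4-category item at latent trait level theta = 0.5.
probs = grm_category_probs(theta=0.5, a=1.2, thresholds=[-1.0, 0.0, 1.5])
```

Because the thresholds are ordered, the cumulative curves are nested and every category probability is non-negative; the Masters partial credit model instead builds category probabilities from adjacent-category logits and does not require a common slope.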


2018 ◽  
Vol 11 (3) ◽  
pp. 205979911881439 ◽  
Author(s):  
Stefanie A Wind

Model-data fit indices for raters provide insight into the degree to which raters demonstrate psychometric properties defined as useful within a measurement framework. Fit statistics for raters are particularly relevant within frameworks based on invariant measurement, such as Rasch measurement theory and Mokken scale analysis. A simple approach to examining invariance is to examine assessment data for evidence of Guttman errors. I used real and simulated data to illustrate and explore a nonparametric procedure for evaluating rater errors based on Guttman errors and to examine the alignment between Guttman errors and other indices of rater fit. The results suggested that researchers and practitioners can use summaries of Guttman errors to identify raters who exhibit misfit. Furthermore, results from the comparisons between summaries of Guttman errors and parametric fit statistics suggested that both approaches detect similar problematic measurement characteristics. Specifically, raters who exhibit many Guttman errors tended to have higher-than-expected Outfit MSE statistics and lower-than-expected estimated slope statistics. I discuss implications of these results as they relate to research and practice for rater-mediated assessments.
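A simple nonparametric count of Guttman errors, in the spirit of the procedure described above (the article's own rater-level summaries are more elaborate): for dichotomous scores on items ordered from easiest to hardest, a Guttman error is any pair in which the easier item is failed while the harder item is passed:

```python
def count_guttman_errors(responses):
    """Count Guttman errors in 0/1 scores ordered easiest to hardest."""
    errors = 0
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            # Easier item failed but harder item passed: one error.
            if responses[i] == 0 and responses[j] == 1:
                errors += 1
    return errors

# A perfectly scalable (Guttman) pattern produces no errors.
clean = count_guttman_errors([1, 1, 1, 0, 0])
# Easier items failed while harder ones are passed produces many.
noisy = count_guttman_errors([0, 1, 0, 1, 1])
```

Raters whose score patterns accumulate many such errors are the ones the article finds to also show inflated Outfit MSE and deflated slope estimates.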


Author(s):  
Mera Usman Muhammed ◽  
Mayaki Abubakar Musa ◽  
Gambo Abdulrahman Abdullahi

This study was carried out to compare digital rectal (DR) thermometer measurements with non-contact infrared thermometer (IRT) measurements at two locations on the face in some large animal species. Two hundred and forty (240) animals, comprising equal numbers of three species (cattle, camels, and horses) of varying age and either sex, were used. The IR temperature was taken from two sites on the animal's face: the frontal (FIRT) and temporal (TIRT) regions. The mean IR temperatures (FIRT and TIRT) were higher than the rectal temperature (RT) in all the animal species. The two thermometers correlated poorly in all the animal species. Bland-Altman analysis showed high biases and limits of agreement not acceptable for clinical purposes. In conclusion, IRT seems to offer a quick and easy way to determine an animal's temperature, but it cannot at the moment be used clinically and interchangeably with the DR thermometer for body temperature measurement in these animal species.


Author(s):  
Anwar Bani Hani Et al.

The study aimed at developing and validating a mathematics test for 10th-grade students according to the Rasch partial credit model (PCM), using the descriptive approach as appropriate for the study aims. To achieve the study's objective, an essay-type test consisting of 25 items was constructed based on item response theory (IRT) according to the Rasch PCM. A first administration of the test was conducted to verify its validity and reliability. To verify the face validity of the test's objectives, the items were presented to a group of 12 arbitrators who work as teachers and educational supervisors; they found the contents representative of the intended objectives. The empirical reliability was calculated for the test: person reliability reached 0.91 and item reliability reached 0.93. The study population consisted of all 10th-grade students at the schools belonging to the “Directorate of Education of Irbid District,” numbering 7365, represented by 3612 male students and 3753 female students. A cluster sample stratified by sex (gender) was drawn, with the class section as the sampling unit; the sample size was 250 male and female students. According to the PCM, this study's findings addressed several issues concerning mathematics achievement by verifying the test's validity and reliability and confirming the IRT's assumptions.


2020 ◽  
pp. short17-1-short17-8
Author(s):  
Fedor Shvetsov ◽  
Anton Konushin ◽  
Anna Sokolova

In this work, we consider the applicability of face recognition algorithms to data obtained from a dynamic vision sensor. A basic method using a neural network model comprising reconstruction, detection, and recognition stages is proposed to solve this problem. Various modifications of this algorithm and their influence on the quality of the model are considered. A small test dataset recorded on a DVS sensor is collected. The relevance of using simulated data for model training, and of different approaches to its creation, was investigated. The portability of the algorithm trained on synthetic data to data obtained from the sensor, with the help of fine-tuning, was considered. All mentioned variations are compared to one another and also compared with conventional face recognition from RGB images on different datasets. The results showed that it is possible to use DVS data to perform face recognition with quality similar to that of RGB data.

