A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing

2004 ◽  
Vol 9 (3) ◽  
pp. 239-261 ◽  
Author(s):  
Richard R Sudweeks ◽  
Suzanne Reeve ◽  
William S. Bradshaw

2021 ◽
Vol 12 (1) ◽  
pp. 18
Author(s):  
Jennifer S Byrd ◽  
Michael J Peeters

Objective: There is a paucity of validation evidence for assessing clinical case-presentations by Doctor of Pharmacy (PharmD) students. Within Kane’s Framework for Validation, evidence for the inferences of scoring and generalization should be generated first. Our objectives were therefore to characterize and improve scoring, and to build initial generalization evidence, for a performance-based assessment of clinical case-presentations. Design: Third-year PharmD students worked up patient cases from a local hospital. Students orally presented and defended their therapeutic care-plan to pharmacist preceptors (evaluators) and fellow students. Evaluators scored each presentation using an 11-item instrument with a 6-point rating-scale, and also scored a single global-item with a 4-point rating-scale. Rasch measurement was used for the scoring analysis, while generalizability theory was used for the generalization analysis. Findings: Thirty students each presented five cases, which were evaluated by 15 preceptors. In the Rasch analysis, the 11-item instrument’s 6-point rating-scale did not function; it functioned only after being collapsed to a 4-point scale. The revised 11-item instrument also showed item redundancy. In contrast, the global-item performed reasonably on its own. Using multivariate generalizability theory, the g-coefficient (reliability) for the series of five case-presentations was 0.76 with the 11-item instrument and 0.78 with the global-item. Reliability depended mainly on the number of case-presentations and, to a lesser extent, on the number of evaluators per case-presentation. Conclusions: Our pilot results confirm that scoring should be kept simple, in both the rating-scale and the instrument. More specifically, the 11-item instrument measured the construct but contained redundant items, whereas the single global-item provided adequate measurement across multiple case-presentations. Further, acceptable reliability can be achieved by trading off the number of case-presentations against the number of evaluators.
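The trade-off described in the conclusions follows from the structure of a decision study. As a minimal illustration only, the Python sketch below projects a relative g-coefficient for a student × case × evaluator design under a univariate model; the study itself used multivariate generalizability theory, and the variance components here are hypothetical placeholders, not the study's estimates.

    # Minimal D-study sketch for a fully crossed person (student) x case x evaluator design.
    # Variance components below are hypothetical placeholders, not the study's estimates.

    def g_coefficient(var_p, var_pc, var_pr, var_pcr_e, n_cases, n_raters):
        """Relative g-coefficient when averaging over n_cases and n_raters."""
        error = (var_pc / n_cases
                 + var_pr / n_raters
                 + var_pcr_e / (n_cases * n_raters))
        return var_p / (var_p + error)

    # Hypothetical components: person, person x case, person x rater, residual
    components = dict(var_p=0.50, var_pc=0.60, var_pr=0.20, var_pcr_e=0.90)

    # Reliability as cases are traded against evaluators per case
    for n_cases in (3, 5, 7):
        for n_raters in (1, 2, 3):
            g = g_coefficient(n_cases=n_cases, n_raters=n_raters, **components)
            print(f"{n_cases} cases, {n_raters} evaluators per case: g = {g:.2f}")

With fixed variance components, the table this prints makes the balancing act explicit: adding cases and adding evaluators both shrink the error term, but through different interaction components.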


2007 ◽  
Vol 13 (4) ◽  
pp. 479-493 ◽  
Author(s):  
Cherdsak Iramaneerat ◽  
Rachel Yudkowsky ◽  
Carol M. Myford ◽  
Steven M. Downing

2020 ◽  
Vol 63 (6) ◽  
pp. 1947-1957
Author(s):  
Alexandra Hollo ◽  
Johanna L. Staubitz ◽  
Jason C. Chow

Purpose: Although sampling teachers' child-directed speech in school settings is needed to understand the influence of linguistic input on child outcomes, empirical guidance on the measurement procedures needed to obtain representative samples is lacking. To optimize the resources needed to transcribe, code, and analyze classroom samples, this exploratory study assessed the minimum number and duration of samples needed for a reliable analysis of conventional and researcher-developed measures of teacher talk in elementary classrooms. Method: This study applied fully crossed Person (teacher) × Session (samples obtained on 3 separate occasions) generalizability studies to analyze an extant data set of three 10-min language samples provided by 28 general and special education teachers, recorded during large-group instruction across the school year. Subsequently, a series of decision studies estimated the number and duration of sessions needed to reach the criterion g coefficient (g > .70). Results: The most stable variables were total number of words and mazes, requiring only a single 10-min sample, two 6-min samples, or three 3-min samples to reach criterion. No measured variables related to content or complexity were adequately stable, regardless of the number and duration of samples. Conclusions: The generalizability studies confirmed that a large proportion of variance in the amount and fluency of spontaneous teacher talk was attributable to individuals rather than to the sampling occasion. In general, conventionally reported outcomes were more stable than researcher-developed codes, which suggests that some categories of teacher talk are more context-dependent than others and thus require more intensive data collection to measure reliably.
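As a rough illustration of the decision-study logic described in the Method, the following Python sketch projects the relative g coefficient for a Person × Session design as the number of sessions increases; the variance components are hypothetical, not the values estimated from the 28-teacher data set.

    # Decision-study sketch for a fully crossed person (teacher) x session design.
    # Variance components are hypothetical, not estimates from this study.

    def relative_g(var_person, var_ps_e, n_sessions):
        """Relative g-coefficient when averaging over n_sessions samples."""
        return var_person / (var_person + var_ps_e / n_sessions)

    var_person, var_ps_e = 0.70, 0.50   # hypothetical person and residual components
    criterion = 0.70

    for n_sessions in range(1, 6):
        g = relative_g(var_person, var_ps_e, n_sessions)
        flag = "meets criterion" if g > criterion else "below criterion"
        print(f"{n_sessions} session(s): g = {g:.2f} ({flag})")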


Diagnostica ◽  
2004 ◽  
Vol 50 (2) ◽  
pp. 65-77 ◽  
Author(s):  
Thomas Eckes

Abstract. Performance ratings are subject to a number of rater errors that can substantially reduce their accuracy and validity. A particularly critical rater error is the tendency toward severity or leniency. The present article introduces many-facet Rasch measurement (Linacre, 1989; Linacre & Wright, 2002), an item-response model that yields a severity (or leniency) measure for each rater and places these severity measures in a common frame of reference together with the ability measures of the rated persons and the difficulty measures of the tasks or rating criteria. The model also allows performance to be measured with a correction for rater severity. Using this approach, ratings from the Test of German as a Foreign Language (Test Deutsch als Fremdsprache, TestDaF) are analyzed: the written performances of 1,359 examinees were each rated on 3 criteria by 2 of a total of 29 raters. The group of raters proves to be highly heterogeneous, so that a severity correction of the ratings is warranted. Finally, several implications of the many-facet Rasch model for the evaluation of performance ratings are discussed.
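For reference, the rating-scale form of the many-facet Rasch model underlying this kind of analysis is commonly written as follows; the notation is a standard presentation after Linacre, added here for orientation rather than reproduced from the article.

    \[
    \ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k
    \]

Here \(\theta_n\) is the ability of examinee \(n\), \(\delta_i\) the difficulty of task or criterion \(i\), \(\alpha_j\) the severity of rater \(j\), and \(\tau_k\) the threshold between rating-scale categories \(k-1\) and \(k\); all parameters are expressed on one common logit scale, which is what allows severity-corrected person measures.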


Author(s):  
Pasquale Anselmi ◽  
Michelangelo Vianello ◽  
Egidio Robusto

Two studies investigated the differential contributions of positive and negative associations to the size of the Implicit Association Test (IAT) effect. A many-facet Rasch measurement analysis was applied for this purpose. Across different IATs (Race and Weight) and different groups of respondents (White, normal-weight, and obese people), we observed that positive words increase the IAT effect, whereas negative words tend to decrease it. The results suggest that the IAT is influenced by a positive-associations primacy effect. Consequently, we argue that researchers should be cautious when interpreting IAT effects as a measure of implicit prejudice.
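The analysis in this article is Rasch-based, not a conventional scoring of latencies. Purely as an illustration of the quantity being decomposed, the Python sketch below contrasts a D-style IAT effect computed separately on positive-word and negative-word trials; all data and names are hypothetical and this is not the authors' method.

    # Illustrative only: a D-style latency contrast split by word valence,
    # not the many-facet Rasch analysis reported in the article.
    import statistics

    def iat_effect(compatible_rts, incompatible_rts):
        """Latency difference (incompatible - compatible) scaled by the pooled SD."""
        pooled_sd = statistics.stdev(compatible_rts + incompatible_rts)
        return (statistics.mean(incompatible_rts)
                - statistics.mean(compatible_rts)) / pooled_sd

    # Hypothetical reaction times (ms) on positive-word and negative-word trials
    pos = {"compatible": [620, 650, 640], "incompatible": [780, 760, 800]}
    neg = {"compatible": [700, 690, 710], "incompatible": [720, 730, 715]}

    print("effect on positive-word trials:",
          round(iat_effect(pos["compatible"], pos["incompatible"]), 2))
    print("effect on negative-word trials:",
          round(iat_effect(neg["compatible"], neg["incompatible"]), 2))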

