Angoff and Nedelsky Standard Setting Procedures: Implications for the Validity of Proficiency Test Score Interpretation

1982 ◽ Vol 42 (1) ◽ pp. 247-255
Author(s): Peter Behuniak, Francis X. Archambault, Robert K. Gable

2016 ◽ Vol 41 (6, Suppl. 2) ◽ pp. S74-S82
Author(s): Bruno D. Zumbo

A critical step in the development and use of tests of physical fitness for employment purposes (e.g., fitness for duty) is to establish 1 or more cut points, dividing the test score range into 2 or more ordered categories reflecting, for example, fail/pass decisions. Over the last 3 decades, elaborated theories and methods have evolved that focus on the process of establishing 1 or more cut-scores on a test. This process is widely referred to as “standard-setting”. As such, the validity of the test score interpretation hinges on the standard-setting, which embodies the purpose and rules according to which the test results are interpreted. The purpose of this paper is to provide an overview of standard-setting methodology. The essential features, key definitions and concepts, and various novel methods of informing standard-setting are described. The focus is on foundational issues, with an eye toward informing best practices with new methodology. Throughout, a case is made that, in terms of best practices, establishing a test standard involves, in good part, setting a cut-score and can be conceptualized as evidence/data-based policy making that is essentially tied to test validity and an evidential trail.
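The Angoff procedure named in the article title above reduces to simple arithmetic: each panelist estimates, for every item, the probability that a minimally competent candidate answers it correctly, and the cut-score is the sum of the per-item mean estimates. A minimal sketch, with invented ratings purely for illustration (not data from any of the studies listed here):

```python
# Angoff-style cut-score from panelist item ratings (illustrative values).
# Each rater estimates, per item, the probability that a minimally
# competent candidate answers correctly.
ratings = {
    "rater_1": [0.9, 0.7, 0.6, 0.8, 0.5],
    "rater_2": [0.8, 0.6, 0.7, 0.9, 0.4],
    "rater_3": [0.85, 0.65, 0.55, 0.8, 0.5],
}

n_items = len(next(iter(ratings.values())))
# Mean estimated probability per item, averaged over raters.
item_means = [
    sum(r[i] for r in ratings.values()) / len(ratings)
    for i in range(n_items)
]
# The cut-score is the expected raw score of a borderline candidate.
cut_score = sum(item_means)
print(round(cut_score, 2))
```

The Nedelsky procedure differs in how the per-item probability is obtained (raters eliminate distractors a borderline candidate could rule out), but the aggregation into a cut-score follows the same pattern.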


2021 ◽ pp. 026553222110107
Author(s): Simon Davidson

This paper investigates what matters to medical domain experts when setting standards on a language for specific purposes (LSP) English proficiency test: the Occupational English Test’s (OET) writing sub-test. The study explores what standard-setting participants value when making performance judgements about test candidates’ writing responses, and the extent to which their decisions are language-based and align with the OET writing sub-test criteria. Qualitative data are a relatively under-utilized component of standard-setting, and this type of commentary was garnered to gain a better understanding of the basis for performance decisions. Eighteen doctors were recruited for standard-setting workshops. To gain further insight, verbal reports in the form of a think-aloud protocol (TAP) were collected from five of the 18 participants. The doctors’ comments were thematically coded, and the analysis showed that participants’ standard-setting judgements often aligned with the OET writing sub-test criteria. An overarching theme, ‘Audience Recognition’, was also identified as valuable to participants. A minority of decisions were swayed by features outside the OET’s communicative construct (e.g., clinical competency). Overall, however, findings indicated that domain experts focused on textual features associated with what the test is designed to assess, and their views were vitally important in the standard-setting process.


1996 ◽ Vol 39 (4) ◽ pp. 697-713
Author(s): Marilyn E. Demorest, Lynne E. Bernstein, Gale P. DeHaven

Ninety-six adults with normal hearing viewed three types of recorded speechreading materials (consonant-vowel nonsense syllables, isolated words, and sentences) on 2 days. Responses to nonsense syllables were scored for syllables correct and syllable groups correct; responses to words and sentences were scored in terms of words correct, phonemes correct, and an estimate of visual distance between the stimulus and the response. Generalizability analysis was used to quantify sources of variability in performance. Subjects and test items were important sources of variability for all three types of materials; effects of talker and day of testing varied but were comparatively small. For each type of material, alternative models of test construction and test-score interpretation were evaluated through estimation of generalizability coefficients as a function of test length. Performance on nonsense syllables correlated about .50 with both word and sentence measures, whereas correlations between words and sentences typically exceeded .80.
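The abstract’s evaluation of generalizability coefficients as a function of test length can be sketched with the standard persons-by-items formula, in which the residual variance shrinks as more items are averaged. The variance components below are hypothetical placeholders, not estimates from the study:

```python
# Generalizability coefficient for a simple persons-by-items design.
# var_person: universe-score (person) variance; var_residual: the
# person-by-item plus error component, which averages down over items.
def g_coefficient(var_person, var_residual, n_items):
    """E-rho^2 for relative decisions with n_items averaged items."""
    return var_person / (var_person + var_residual / n_items)

# Hypothetical components, chosen only to show the trend with length.
var_person, var_residual = 0.04, 0.12
for n in (10, 20, 40):
    print(n, round(g_coefficient(var_person, var_residual, n), 2))
```

Doubling test length raises the coefficient with diminishing returns, which is why the abstract frames generalizability as a function of test length rather than a single number per instrument.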


1999 ◽ Vol 15 (3) ◽ pp. 258-269
Author(s): Norbert K. Tanzer, Catherine Q.E. Sim

Summary: To facilitate the development of valid multicultural/multilingual tests, the International Test Commission (ITC) prepared the ITC Guidelines on Test Adaptations. This paper reviews the current version (cf. Van de Vijver & Hambleton, 1996), which consists of 22 guidelines on recommended practices pertaining to context, development, administration, documentation, and test-score interpretation, by identifying key principles in test adaptations and comparing them to a content analysis of the ITC Guidelines. The content analysis revealed a number of inconsistencies and ambiguities in a few guidelines, and proposals for reformulating them are given. A checklist to supplement the more narrative guidelines would also be helpful. Nevertheless, the review clearly demonstrates that the ITC Guidelines on Test Adaptations address key principles in test adaptations and constitute a significant standard or “code of conduct” in this field.


2014
Author(s): Keith A. Markus, Denny Borsboom
