Similarity of the cut score in test sets with different item amounts using the modified Angoff, modified Ebel, and Hofstee standard-setting methods for the Korean Medical Licensing Examination

Author(s):  
Janghee Park ◽  
Mi Kyoung Yim ◽  
Na Jin Kim ◽  
Duck Sun Ahn ◽  
Young-Min Kim

Purpose: The Korea Medical Licensing Examination (KMLE) typically contains a large number of items. The purpose of this study was to investigate whether the cut score differs when standard setting is conducted on all items of the exam versus only a subset of them.

Methods: We divided the item sets from 3 recent KMLEs (covering the past 3 years) into 4 subsets per year, each containing 25% of the items, based on item content category, discrimination index, and difficulty index. The entire panel of 15 members rated all items (360 items, 100%) of the 2017 exam. Using the same method, each item set in split-half set 1 contained 184 items (51%) of the 2018 exam, and each set in split-half set 2 contained 182 items (51%) of the 2019 exam. We used the modified Angoff, modified Ebel, and Hofstee methods in the standard-setting process.

Results: For a given standard-setting method, the cut scores derived from stratified item subsets containing 25%, 51%, or 100% of the entire set differed by less than 1%. When rating fewer items, higher rater reliability was observed.

Conclusion: When the entire item set was divided into equivalent subsets, standard setting based on a portion of the item set (90 of 360 items) yielded cut scores similar to those derived from the entire item set. There was also a higher correlation between panelists' individual assessments and the overall assessments.
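
The modified Angoff computation described above is simple enough to sketch. Below is a minimal Python illustration (all values invented, not the study's data): each of 15 raters estimates, per item, the probability that a minimally competent examinee answers correctly; averaging over raters and items gives a percent cut score, and repeating the computation on 25% subsets shows how subset cut scores compare with the full-set value. The random split here stands in for the study's stratification by content category, discrimination, and difficulty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical panel data: 15 raters x 360 items, each entry a rater's
# estimate of the probability that a minimally competent examinee
# answers the item correctly (the modified Angoff judgement).
n_raters, n_items = 15, 360
ratings = rng.uniform(0.3, 0.9, size=(n_raters, n_items))

def angoff_cut_score(ratings):
    """Modified Angoff cut score as a percentage of the maximum score:
    average the probability estimates over raters, then over items."""
    return ratings.mean(axis=0).mean() * 100

# Cut score from the full item set.
full_cut = angoff_cut_score(ratings)

# Cut scores from four 25% subsets (random here; stratified in the study).
subset_ids = rng.permutation(n_items).reshape(4, -1)
subset_cuts = [angoff_cut_score(ratings[:, idx]) for idx in subset_ids]

print(f"full set (360 items): {full_cut:.2f}")
for i, cut in enumerate(subset_cuts, 1):
    print(f"subset {i} (90 items): {cut:.2f}  diff: {abs(cut - full_cut):.2f}")
```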

Author(s):  
Duck Sun Ahn ◽  
Sowon Ahn

After briefly reviewing theories of standard setting, we analyzed the problems with the current cut scores. We then reported the results of a needs assessment on standard setting among medical educators and psychometricians, along with analyses of the standard-setting methods used in developed countries. Based on these findings, we suggested the Bookmark and modified Angoff methods as alternatives for setting standards, and discussed the problems and challenges that may arise when these methods are applied to the National Medical Licensing Examination.


Author(s):  
Mi Kyoung Yim ◽  
Sujin Shin

Purpose: This study explored the possibility of using the Angoff method, in which a panel of experts determines the cut score of an exam, for the Korean Nursing Licensing Examination (KNLE). Two mock KNLE exams were analyzed using the Angoff standard-setting procedure, and we also aimed to examine the procedural validity of applying the Angoff method in this context.

Methods: For both mock exams, we set a pass-fail cut score using the Angoff method. The standard-setting panel consisted of 16 nursing professors. After the Angoff procedure, the procedural validity of the standard setting was evaluated by surveying the standard setters.

Results: Descriptions of the minimally competent person for the KNLE were developed at the levels of general and subject performance. The cut scores of the first and second mock exams were 74.4 and 76.8, respectively, both higher than the traditional cut score (60% of the total score of the KNLE). The panel survey showed very positive responses, with scores above 4 out of 5 points on a Likert scale.

Conclusion: The cut scores calculated for the two mock exams were similar and much higher than the existing cut score. In the second exam, the standard deviation of the Angoff ratings was lower than in the first. According to the survey results, procedural validity was acceptable, as shown by a high level of confidence. These results show that determining cut scores through an expert panel is an applicable method.
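
The two reported quantities, the percent cut score and the between-panelist spread, follow directly from the panel ratings. A minimal Python sketch (ratings are simulated to roughly mimic the reported cut scores, not the study's data): each panelist's implied cut score is their mean item rating, and the between-panelist standard deviation is the agreement index that dropped in the second exam.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ratings: 16 panelists x 100 items, each entry an estimated
# probability that a minimally competent nurse answers the item correctly.
# Means and spreads are invented to roughly match the reported cut scores.
exam1 = rng.normal(0.744, 0.08, size=(16, 100)).clip(0, 1)
exam2 = rng.normal(0.768, 0.05, size=(16, 100)).clip(0, 1)

for name, ratings in [("mock exam 1", exam1), ("mock exam 2", exam2)]:
    per_panelist = ratings.mean(axis=1) * 100  # each panelist's implied cut
    print(f"{name}: cut score {per_panelist.mean():.1f}, "
          f"between-panelist SD {per_panelist.std(ddof=1):.2f}")
```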


2018 ◽  
Vol 12 (4) ◽  
pp. 15
Author(s):  
Eli Moe ◽  
Hildegunn Lahlum Helness ◽  
Craig Grocott ◽  
Norman Verhelst

Standard setting for English tests for 11th grade students in Norway

This article presents the process used to determine the cut scores between three levels of the Common European Framework of Reference for Languages (A2, B1, and B2) for two formative English listening tests taken by Norwegian pupils in the 11th grade (Vg1). The aims were to establish whether agreement could be reached on the cut scores, whether the standard setters received sufficient training beforehand, and what consequences the cut scores would have for the distribution of pupils across the framework levels. The standard setting was carried out using pilot data from 3,199 Vg1 pupils, the Cito method, and 16 panel members with good knowledge of the framework levels. Several panel members were, or had been, English teachers at the 10th grade or Vg1 level. The Cito method worked well for establishing cut scores on which the standard setters largely agreed, and the final results showed a relatively small measurement error. Agreement was higher for the cut score between B1 and B2 than between A2 and B1, which may be connected to the longer preparation time devoted to B1 and B2. Teachers on the panel who know the pupil population well consider the consequences of the cut scores for the distribution of pupils across the framework levels to be consistent with their own assessments of the pupils' listening skills.

Keywords: standard setting, test-centered method, the Cito method, standard, cut score, borderline person / minimally competent user
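
The article's emphasis on panel agreement and measurement error suggests a simple quantitative check. A minimal Python sketch (panel values invented, not from the study): given each panelist's recommended cut score for one boundary, the standard error of the panel mean indicates how much the recommended cut might vary with a different but comparable panel.

```python
import numpy as np

# Hypothetical cut scores proposed by 16 panel members for one boundary
# (e.g., B1/B2); values invented for illustration.
panel_cuts = np.array([31, 32, 30, 33, 31, 32, 32, 31,
                       30, 33, 32, 31, 32, 31, 33, 32])

mean_cut = panel_cuts.mean()
# Standard error of the panel's mean cut score: a simple index of how
# stable the recommended cut is across comparable panels.
se_cut = panel_cuts.std(ddof=1) / np.sqrt(len(panel_cuts))
print(f"cut score: {mean_cut:.1f} (SE {se_cut:.2f})")
```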


PLoS ONE ◽  
2021 ◽  
Vol 16 (11) ◽  
pp. e0257871
Author(s):  
Tabea Feseker ◽  
Timo Gnambs ◽  
Cordula Artelt

To draw pertinent conclusions about persons with low reading skills, it is essential to use validated standard-setting procedures by which individuals can be assigned to their appropriate level of proficiency. Because no standard-setting procedure is free of weaknesses, external validity studies are indispensable. Traditionally, studies have assessed validity by comparing different judgement-based standard-setting procedures; only a few have used model-based approaches to validate judgement-based ones. The present study addressed this shortcoming by comparing the agreement in cut score placement between a judgement-based approach (the Bookmark procedure) and a model-based one (a constrained mixture Rasch model). This was done by differentiating between individuals with low reading proficiency and those with a functional level of reading proficiency in three independent samples of the German National Educational Panel Study: ninth-grade students (N = 13,897) and adults (Ns = 5,335 and 3,145). The analyses showed quite similar mean cut scores for the two standard-setting procedures in two of the samples, whereas the third sample showed more pronounced differences. Importantly, these findings demonstrate that model-based approaches provide a valid and resource-efficient alternative for external validation, although they can be sensitive to the ability distribution within a sample.
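
The constrained mixture Rasch model itself is estimated with specialized software, but the logic by which a two-class model implies a cut score can be sketched. Below is a simplified Monte Carlo illustration in Python, not the authors' estimation procedure (all ability distributions, difficulties, and mixing proportions are invented): two latent classes generate Rasch responses, and the model-based cut is the raw score at which membership in the functional class becomes more likely than membership in the low-proficiency class.

```python
import numpy as np

rng = np.random.default_rng(2)
n_items = 30
item_difficulty = rng.normal(0.0, 1.0, n_items)

def simulate_raw_scores(n, ability_mean, ability_sd):
    """Simulate Rasch-model raw scores for one latent class."""
    theta = rng.normal(ability_mean, ability_sd, size=(n, 1))
    p = 1.0 / (1.0 + np.exp(-(theta - item_difficulty)))
    return (rng.random((n, n_items)) < p).sum(axis=1)

# Two hypothetical latent classes, mixed in proportions 0.4 / 0.6.
low = simulate_raw_scores(4000, -1.2, 0.6)
functional = simulate_raw_scores(6000, 0.8, 0.7)

# Class-weighted probability of each raw score, and the posterior
# probability that an examinee with that score is in the low class.
scores = np.arange(n_items + 1)
p_low = np.array([np.mean(low == s) for s in scores]) * 0.4
p_fun = np.array([np.mean(functional == s) for s in scores]) * 0.6
valid = (p_low + p_fun) > 0
posterior_low = np.zeros_like(p_low)
posterior_low[valid] = p_low[valid] / (p_low + p_fun)[valid]

# Model-based cut: the lowest raw score at which the functional class
# becomes more likely than the low-proficiency class.
cut = scores[valid & (posterior_low < 0.5)].min()
print(f"model-based cut score: {cut} out of {n_items}")
```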


2021 ◽  
pp. 014662162110468
Author(s):  
Irina Grabovsky ◽  
Jesse Pace ◽  
Christopher Runyon

We model pass/fail examinations with the aim of providing a systematic tool to minimize classification errors. We use the method of cut-score operating functions to generate specific cut scores that minimize several important misclassification measures. The goal of this research is to examine the combined effects of a known distribution of examinee abilities and of uncertainty in the standard setting on the optimal choice of the cut score. In addition, we describe an online application that allows others to use the cut-score operating function for their own standard settings.
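
The idea of a cut-score operating function can be sketched directly. The following Python sketch (all distributions, the standard theta*, and loss weights are invented, and this is not the authors' exact formulation) approximates, for each candidate cut, the weighted rate of false passes and false fails under a known ability distribution and a fixed measurement error, then picks the minimizing cut. Uncertainty in the standard itself could be added by also randomizing theta* across replications.

```python
import numpy as np

# Illustrative setup (values invented): true abilities are N(0, 1), the
# intended standard is theta* = 0.2, and observed scores carry
# measurement error with standard error 0.35.
THETA_STAR, SE = 0.2, 0.35

def expected_loss(cut, w_false_pass=1.0, w_false_fail=1.0,
                  n=200_000, seed=3):
    """Cut-score operating function, approximated by Monte Carlo: the
    weighted rate of false passes and false fails at a given cut."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, n)
    observed = theta + rng.normal(0.0, SE, n)
    false_pass = np.mean((observed >= cut) & (theta < THETA_STAR))
    false_fail = np.mean((observed < cut) & (theta >= THETA_STAR))
    return w_false_pass * false_pass + w_false_fail * false_fail

cuts = np.linspace(-0.5, 1.0, 61)
losses = [expected_loss(c) for c in cuts]
best = cuts[int(np.argmin(losses))]
print(f"optimal cut score: {best:.2f} (loss {min(losses):.4f})")
```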


Author(s):  
Guemin Lee

The National Health Personnel Licensing Examination Board (hereafter NHPLEB) has used 60% correct responses on the overall test and 40% correct responses on each subject-area test as the criterion for granting physician licenses to qualified candidates. The 60%-40% criterion may seem reasonable to laypersons without psychometric or measurement knowledge, but it can cause several serious problems from a psychometrician's perspective. This paper pointed out several problematic cases that can be encountered when using the 60%-40% criterion, and provided several psychometric alternatives that could overcome these problems. A fairly new approach, the Bookmark standard-setting method, was introduced and explained in detail as an example. The paper concluded with five considerations for when the NHPLEB decides to adopt a psychometric standard-setting approach to set a cut score for a licensure test such as the medical licensing examination.
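
The kind of anomaly a fixed compound rule can produce is easy to demonstrate. A minimal Python sketch of the 60%-40% rule (the subject names and the equal weighting of subjects are illustrative assumptions; the actual overall score is a percent correct across all items):

```python
def passes_60_40(subject_pcts: dict[str, float]) -> bool:
    """NHPLEB-style compound rule: at least 60% correct overall and at
    least 40% correct in every subject area. Subjects are equally
    weighted here for simplicity."""
    overall = sum(subject_pcts.values()) / len(subject_pcts)
    return overall >= 60 and all(p >= 40 for p in subject_pcts.values())

# A strong candidate fails on one weak subject despite a high overall
# score, while a uniformly mediocre candidate passes; one of the
# anomalies a fixed percent-correct criterion can produce.
print(passes_60_40({"medicine": 85, "surgery": 80, "preventive": 38}))  # False
print(passes_60_40({"medicine": 62, "surgery": 61, "preventive": 58}))  # True
```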


2020 ◽  
Author(s):  
Majid Yousefi Afrashteh

Introduction: One of the main processes in evaluating students' performance is standard setting to determine the passing score for a test. The purpose of this study was to compare the validity of the Angoff and Bookmark methods of standard setting.

Method: Participants included 190 students who had graduated from master's programs in medical laboratory sciences within the past year. A performance test with 32 items, designed by a group of experts, was used to assess the laboratory skills of these graduates. In addition, two groups of experts voluntarily participated in this study to set the cut score. To assess procedural validity, a 5-item questionnaire was administered to the two panels of experts. To investigate internal validity, the variance of the cut scores determined by the members of the two panels was compared using the F ratio. External validity was assessed using four indices based on the correlation of the cut score with a criterion score.

Results: Comparison of the Angoff and Bookmark methods showed that the mean of the procedural validity indices was higher for the Bookmark method. For internal validity, the homogeneity of results and the agreement of the judges' scores were considered.

Conclusion: In evaluating external validity (the concordance of the cut score with the criterion score), all of the external validity indices supported the Bookmark method.
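
The internal validity comparison described here, contrasting the variance of the two panels' cut scores with an F ratio, can be sketched as follows (panel values invented for illustration; a smaller variance indicates more homogeneous judgements):

```python
import numpy as np
from scipy import stats

# Hypothetical cut scores recommended by the individual members of the
# two panels (values invented for illustration).
angoff_panel = np.array([18.5, 21.0, 17.0, 22.5, 19.0, 23.0, 16.5, 20.0])
bookmark_panel = np.array([19.5, 20.0, 20.5, 19.0, 21.0, 20.0, 19.5, 20.5])

# F ratio of the panel variances, with a one-sided p-value for the
# hypothesis that the Angoff panel's judgements are more dispersed.
f_ratio = angoff_panel.var(ddof=1) / bookmark_panel.var(ddof=1)
df1, df2 = len(angoff_panel) - 1, len(bookmark_panel) - 1
p_value = stats.f.sf(f_ratio, df1, df2)
print(f"F({df1}, {df2}) = {f_ratio:.2f}, p = {p_value:.4f}")
```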


2016 ◽  
Vol 41 (6 (Suppl. 2)) ◽  
pp. S74-S82 ◽  
Author(s):  
Bruno D. Zumbo

A critical step in the development and use of tests of physical fitness for employment purposes (e.g., fitness for duty) is to establish 1 or more cut points, dividing the test score range into 2 or more ordered categories that reflect, for example, fail/pass decisions. Over the last 3 decades, elaborated theories and methods have evolved that focus on the process of establishing 1 or more cut scores on a test. This elaborated process is widely referred to as "standard setting." As such, the validity of the test score interpretation hinges on the standard setting, which embodies the purpose and rules according to which the test results are interpreted. The purpose of this paper is to provide an overview of standard-setting methodology. The essential features, key definitions and concepts, and various novel methods of informing standard setting are described. The focus is on foundational issues, with an eye toward informing best practices with new methodology. Throughout, a case is made that, in terms of best practices, establishing a test standard involves in good part setting a cut score, and can be conceptualized as evidence/data-based policy making that is essentially tied to test validity and an evidential trail.


2015 ◽  
Vol 7 (4) ◽  
pp. 610-616 ◽  
Author(s):  
Mei Liang ◽  
Laurie S. Curtin ◽  
Mona M. Signer ◽  
Maria C. Savoia

Background: Over the past decade, the number of unfilled positions in the National Resident Matching Program (NRMP) Main Residency Match has declined by one-third, while the number of unmatched applicants has grown by more than 50%, largely due to a rise in the number of international medical school students and graduates (IMGs). Although only half of IMG participants historically have matched to a first-year position, the Match experiences of unmatched IMGs have not been studied.

Objective: We examined differences in interview and ranking behaviors between matched and unmatched IMGs participating in the 2013 Match and explored strategic errors made by unmatched IMGs when creating rank order lists.

Methods: Rank order lists of IMGs who failed to match were analyzed in conjunction with their United States Medical Licensing Examination (USMLE) Step 1 scores and their responses to the 2013 NRMP Applicant Survey. IMGs were categorized as "strong," "solid," "marginal," or "weak" based on the perceived competitiveness of their USMLE Step 1 scores relative to other IMG applicants who matched in the same specialty. We examined ranking preferences and strategies by Match outcome.

Results: Most unmatched IMGs were categorized as "marginal" or "weak." However, unmatched IMGs who were non-US citizens presented more competitive USMLE Step 1 scores than unmatched IMGs who were US citizens. Unmatched IMGs were more likely than matched IMGs to rank programs at which they did not interview and to rank programs based on their perceived likelihood of matching.

Conclusions: The interview and ranking behaviors of IMGs can have far-reaching consequences for their Match experience and outcomes.

