Examination of the Quality of Multiple-choice Items on Classroom Tests

Author(s):  
David DiBattista ◽  
Laura Kurzawa

Because multiple-choice testing is so widespread in higher education, we assessed the quality of items used on classroom tests by carrying out a statistical item analysis. We examined undergraduates’ responses to 1198 multiple-choice items on sixteen classroom tests in various disciplines. The mean item discrimination coefficient was +0.25, with more than 30% of items having unsatisfactory coefficients less than +0.20. Of the 3819 distractors, 45% were flawed either because less than 5% of examinees selected them or because their selection was positively rather than negatively correlated with test scores. In three tests, more than 40% of the items had an unsatisfactory discrimination coefficient, and in six tests, more than half of the distractors were flawed. Discriminatory power suffered dramatically when the selection of one or more distractors was positively correlated with test scores, but it was only minimally affected by the presence of distractors that were selected by less than 5% of examinees. Our findings indicate that there is considerable room for improvement in the quality of many multiple-choice tests. We suggest that instructors consider improving the quality of their multiple-choice tests by conducting an item analysis and by modifying distractors that impair the discriminatory power of items.
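The flaw criteria in this abstract (a distractor chosen by fewer than 5% of examinees, or one whose selection is positively correlated with test scores) and the +0.20 discrimination cut-off can be checked mechanically from a response matrix. The sketch below is a minimal illustration of such an item analysis, not the authors' code; the function name, the data layout, and the use of a corrected item-total correlation as the discrimination coefficient are assumptions.

```python
import numpy as np

def item_analysis(responses, key, flaw_threshold=0.05, disc_cutoff=0.20):
    """Flag weak items and flawed distractors using the criteria above.

    responses : array (n_examinees, n_items) of the option letter each examinee chose
    key       : array (n_items,) of correct option letters
    The 5% selection rule, the positive-correlation rule, and the +0.20
    discrimination cut-off come from the abstract; everything else
    (names, data layout, corrected item-total correlation) is assumed.
    """
    responses = np.asarray(responses)
    key = np.asarray(key)
    scored = (responses == key).astype(float)      # 1 = correct, 0 = incorrect
    totals = scored.sum(axis=1)                    # each examinee's test score

    report = []
    for j in range(responses.shape[1]):
        rest = totals - scored[:, j]               # total score excluding item j
        disc = (np.corrcoef(scored[:, j], rest)[0, 1]
                if scored[:, j].std() > 0 else float("nan"))

        flawed = []
        for option in np.unique(responses[:, j]):
            if option == key[j]:
                continue                           # only distractors are checked
            chose = (responses[:, j] == option).astype(float)
            freq = chose.mean()
            r_dis = (np.corrcoef(chose, rest)[0, 1]
                     if chose.std() > 0 else 0.0)
            # Flawed if chosen by fewer than 5% of examinees, or if choosing it
            # is positively rather than negatively related to the test score.
            if freq < flaw_threshold or r_dis > 0:
                flawed.append(option)

        report.append({"item": j,
                       "discrimination": disc,
                       "unsatisfactory": disc < disc_cutoff,
                       "flawed_distractors": flawed})
    return report
```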

1979 ◽  
Vol 1 (2) ◽  
pp. 24-33 ◽  
Author(s):  
James R. McMillan

Most educators agree that classroom evaluation practices need improvement. One way to improve testing is to use high-quality objective multiple-choice exams. Almost any understanding or ability that can be tested with another test format can also be tested by means of multiple-choice items. Based on a survey of 173 respondents, it appears that marketing teachers are disenchanted with multiple-choice questions and use them sparingly. Further, what use there is occurs largely in the introductory marketing course, even though there are emerging pressures for universities to take a closer look at the quality of classroom evaluation at all levels.


2010 ◽  
Vol 26 (4) ◽  
pp. 302-308 ◽  
Author(s):  
Klaus D. Kubinger ◽  
Christine Wolfsbauer

Test authors may consider adding the response options “I don’t know the solution” and “none of the other options is correct” in order to reduce the high guessing probability of multiple-choice items. This paper expected, however, that different personality types would use these response options differently, guess more or less as a result, and therefore achieve higher or lower test scores on average. In an experiment, participants were randomized into two groups, one of which was warned that it is better to admit being unable to solve an item, and participants were classified as high-, medium-, or low-scoring on each personality dimension. Multivariate analyses of variance (195 pupils aged 14 to 19) showed that only Openness to Experience had any (moderate) effect, and even then only for a single subtest (Cattell’s culture fair test).


2015 ◽  
Vol 166 (2) ◽  
pp. 278-306 ◽  
Author(s):  
Henrik Gyllstad ◽  
Laura Vilkaitė ◽  
Norbert Schmitt

In most tests of vocabulary size, knowledge is assessed through multiple-choice formats. Despite advantages such as ease of scoring, multiple-choice tests (MCTs) come with problems. One of the more central issues has to do with guessing and the presence of other construct-irrelevant strategies that can lead to overestimation of scores. A further challenge when designing vocabulary size tests is that of sampling rate: how many words constitute a representative sample of the underlying population of words that the test is intended to measure? This paper addresses these two issues through a case study based on data from a recent and increasingly used MCT of vocabulary size, the Vocabulary Size Test. Using a criterion-related validity approach, our results show that for multiple-choice items sampled from this test, there is a discrepancy between the test scores and the scores obtained from the criterion measure, and that a higher sampling rate would be needed to better represent knowledge of the underlying population of words. We offer two main interpretations of these results and discuss their implications for the construction and use of vocabulary size tests.
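Part of the sampling-rate question can be seen with a simple binomial calculation: if the test is treated as a random sample of n items from the word population it represents, the precision of the projected vocabulary size depends directly on n. The sketch below is only an illustration under that simplification; the 20,000-word population figure, the proportional projection, and the function name are assumptions, not the Vocabulary Size Test's actual design or the paper's analysis.

```python
import math

def vocab_size_estimate(correct, n_items, population=20000):
    """Illustrative binomial precision calculation (hypothetical figures).

    correct    : number of sampled items answered correctly
    n_items    : number of items sampled from the word population
    population : size of the word population the test is meant to represent (assumed)
    """
    p = correct / n_items
    estimate = p * population                              # projected vocabulary size
    se = population * math.sqrt(p * (1 - p) / n_items)     # standard error of the projection
    return estimate, se

# Doubling the sampling rate shrinks the margin of error by roughly 1/sqrt(2):
print(vocab_size_estimate(70, 100))    # projection 14000.0, SE about 917
print(vocab_size_estimate(140, 200))   # same proportion, SE about 648
```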


Author(s):  
Matthew V Pachai ◽  
David DiBattista ◽  
Joseph A Kim

Multiple choice writing guidelines are decidedly split on the use of ‘none of the above’ (NOTA), with some authors discouraging and others advocating its use. Moreover, empirical studies of NOTA have produced mixed results. Generally, these studies have utilized NOTA as either the correct response or a distractor and assessed its effect on difficulty and discrimination. In these studies, NOTA commonly yields increased difficulty when it is used as the correct response, and no change in discrimination regardless of usage. However, when NOTA is implemented as a distractor, rarely is consideration given to the distractor that could have been written in its place. Here, we systematically replaced each distractor in a series of questions with NOTA across different versions of an Introductory Psychology examination. This approach allowed us to quantify the quality of each distractor based on its relative discrimination index and assess the effect of NOTA relative to the quality of distractor it replaced. Moreover, our use of large Introductory Psychology examinations afforded highly stable difficulty and discrimination estimates. We found that NOTA increased question difficulty only when it was the correct response, with no effect on difficulty of replacing any distractor type with NOTA. Moreover, we found that NOTA decreased discrimination when it replaced the most effective distractors, with no effect on discrimination of replacing either the correct response or lowest quality distractor with NOTA. These results replicate the common finding that inclusion of NOTA as the correct response increases question difficulty by equally luring high-performing and low-performing students toward distractors. Moreover, we have shown that including NOTA as a distractor can reduce discrimination if used in lieu of a well written alternative, suggesting that multiple choice authors should avoid using NOTA on multiple choice tests.
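The design described here hinges on comparing an item's difficulty and discrimination across exam forms in which a given option is either the original distractor or NOTA. A minimal sketch of that comparison is below; the study's exact 'relative discrimination index' is not reproduced, and the data layout, names, and use of a point-biserial correlation are assumptions.

```python
import numpy as np

def point_biserial(item_correct, total_scores):
    """Item discrimination as the correlation between the 0/1 item score and the total score."""
    return np.corrcoef(item_correct, total_scores)[0, 1]

def compare_item_versions(original_form, nota_form):
    """Compare one item's difficulty and discrimination across two exam forms.

    Each argument is a dict with 'item' (0/1 correctness per examinee on that form)
    and 'total' (that examinee's overall test score). Names and layout are
    illustrative; the study's 'relative discrimination index' is not reproduced here.
    """
    out = {}
    for label, data in (("original distractor", original_form), ("NOTA", nota_form)):
        item = np.asarray(data["item"], dtype=float)
        total = np.asarray(data["total"], dtype=float)
        out[label] = {
            "p_value": item.mean(),                         # proportion correct (lower = harder)
            "discrimination": point_biserial(item, total),  # point-biserial with total score
        }
    return out
```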


2020 ◽  
Vol 1 (2) ◽  
pp. 122-138
Author(s):  
Husnani Aliah

The research aimed to find out how teachers in Enrekang prepare teacher-made tests, how well English teacher-made tests perform according to an item analysis, and which cognitive-domain levels the tests cover. Test quality was determined after the tests had been used in school examinations. The study was a descriptive survey: the data were analyzed and the findings described quantitatively. The population consisted of ninth-grade teachers at junior high schools in Enrekang, and simple random sampling was applied by selecting four different schools as the sample. The analysis shows that the preparation junior high school teachers in Enrekang follow when constructing teacher-made tests falls into five main parts. In planning the test, teachers either considered the test material and the proportion of each topic, checked the item bank against the syllabus and indicators, or prepared a test specification. In writing the test, they re-wrote items taken from the internet and textbooks, re-used previous items and had other teachers verify them, combined items from the item bank and the textbook, or wrote new items. In analyzing the test, they revised the test based on observed item difficulty, predicted item difficulty and revised accordingly, or did nothing to analyze the test. Regarding timing, three of the five teachers needed only one week to construct a multiple-choice test, while two of the five needed two weeks, and the teachers differed in how they adapted tests to students’ ability. Moreover, the item analysis shows that no test was entirely sound; almost all tests needed revision. Based on the cognitive domain, only three categories appeared across all tests, namely knowledge, comprehension, and application; no item belonged to the analysis, synthesis, or evaluation categories.


2019 ◽  
Vol 14 (26) ◽  
pp. 51-65
Author(s):  
Lotte Dyhrberg O'Neill ◽  
Sara Mathilde Radl Mortensen ◽  
Cita Nørgård ◽  
Anne Lindebo Holm Øvrehus ◽  
Ulla Glenert Friis

Construction errors in multiple-choice items are quite prevalent and threaten the validity of multiple-choice tests. Very little research currently seems to exist on the usefulness of systematic item screening by local review committees before test administration. The aim of this study was therefore to examine validity and feasibility aspects of review-committee screening for item flaws. We examined the reliability of item reviewers’ independent judgments of the presence or absence of item flaws with a generalizability study design and found only moderate reliability using five reviewers. Statistical analysis of actual exam scores could be a more efficient way of identifying flaws and improving the average item discrimination of tests in local contexts. The question of the validity of human judgments of item flaws is important, not just for sufficiently sound quality-assurance procedures in local test contexts, but also for global research on item flaws.
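The study estimated reviewer reliability with a generalizability (G-study) design, which is not reproduced here. As a simpler, related illustration of quantifying agreement among several reviewers making flawed/not-flawed judgments, chance-corrected agreement can be computed with Fleiss' kappa; the sketch below is that standard calculation applied to hypothetical counts, not the analysis reported in the paper.

```python
import numpy as np

def fleiss_kappa(ratings):
    """Chance-corrected agreement for several raters classifying the same items.

    ratings : array (n_items, n_categories); each row holds how many raters put
              the item in each category, e.g. [flawed, not flawed].
    Standard Fleiss' kappa, shown only as a simpler stand-in for the
    generalizability analysis reported in the study.
    """
    ratings = np.asarray(ratings, dtype=float)
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()

    p_j = ratings.sum(axis=0) / (n_items * n_raters)                    # category proportions
    P_i = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar = P_i.mean()                                                  # observed agreement
    P_e = (p_j ** 2).sum()                                              # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Five hypothetical reviewers judging six items as flawed / not flawed:
counts = [[5, 0], [3, 2], [1, 4], [4, 1], [2, 3], [0, 5]]
print(round(fleiss_kappa(counts), 2))
```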


1966 ◽  
Vol 19 (2) ◽  
pp. 651-654 ◽  
Author(s):  
Donald W. Zimmerman ◽  
Richard H. Williams ◽  
Hubert H. Rehm ◽  
William Elmore

College students were instructed to indicate on various multiple-choice tests whether they “knew the answer” or “guessed” each item, and the results were treated as estimated true and error components of scores. The values of the intercorrelations of these components were similar to those given by a computer program described previously. The values found for all tests were consistent with the assumption that test scores consist of both independent and non-independent components of error and that the non-independent error component is relatively large.


1971 ◽  
Vol 29 (3_suppl) ◽  
pp. 1229-1230
Author(s):  
Carrie Wherry Waters ◽  
L. K. Waters

Reactions of examinees to two scoring instructions were evaluated for 2-, 3-, and 5-alternative multiple-choice items. Examinees were more favorable toward the “reward for omitted items” instructions than toward the “penalty for wrongs” instructions across all numbers of item alternatives.


2019 ◽  
Vol 94 (5) ◽  
pp. 740
Author(s):  
Valérie Dory ◽  
Kate Allan ◽  
Leora Birnbaum ◽  
Stuart Lubarsky ◽  
Joyce Pickering ◽  
...  

