Developing rating scales for the assessment of second language performance

1996 ◽  
Vol 13 ◽  
pp. 55-79 ◽  
Author(s):  
Carolyn E. Turner ◽  
John A. Upshur

The two most common approaches to rating second language performance pose problems of reliability and validity. An alternative method uses rating scales that are empirically derived from samples of learner performance. These scales define boundaries between adjacent score levels rather than provide normative descriptions of ideal performances, and the rating process requires making two or three binary choices about the language performance being rated. A procedure consisting of a series of five explicit tasks is used to construct a rating scale. The scale is designed for use with a specific population and a specific test task. A group of primary school ESL teachers used this procedure to make two speaking tests, including elicitation tasks and rating scales, for use in their school district. The tests were administered to 255 sixth grade learners. The scales were found to be highly accurate for scoring short speech samples, and were quite efficient in the time required for scale development and rater training. The scales exhibit content relevance in the instructional setting. Development of this type of scale is recommended for use in high-stakes assessment.

2017 ◽  
Vol 7 (1) ◽  
pp. 47-60
Author(s):  
Kees De Bot ◽  
Fang Fang

Human behavior is not constant over the hours of the day, and there are considerable individual differences. Some people rise early, go to bed early, and have their peak performance early in the day (“larks”), while others tend to go to bed late, get up late, and have their best performance later in the day (“owls”). In this contribution we report on three projects on the role of chronotype (CT) in language processing and learning. The first study (de Bot, 2013) reports on the impact of CT on language learning aptitude and word learning. The second project was reported in Fang (2015) and looks at CT and executive functions, in particular inhibition as measured by variants of the Stroop test. The third project aimed at assessing lexical access in L1 and L2 at preferred and non-preferred times of the day. The data suggest that there are effects of CT on language learning and processing. There is a small effect of CT on language aptitude and a stronger effect of CT on lexical access in the first and second language. The lack of significance for other tasks is mainly caused by the large interindividual and intraindividual variation.


1998 ◽  
Vol 20 (1) ◽  
pp. 83-108 ◽  
Author(s):  
Uta Mehnert

This article reports on a study that investigated the effect of different amounts of planning time on the speech performance of L2 speakers. Subjects were 4 groups of learners of German (31 in total) performing 2 tasks each. The tasks varied in the degree of structure they contained and the familiarity of information they tapped. The control group had no planning time available; the 3 experimental groups had 1, 5, and 10 minutes of planning time, respectively, before they started speaking. Results show that fluency and lexical density of speech increase as a function of planning time. Accuracy of speech improved with only 1 minute of planning but did not increase with more planning time. Complexity of speech was significantly higher for the 10-minute planning condition only. No significant differences were found for the effect of planning on the different tasks. This study employed various general and specific constructs for measuring fluency, complexity, and accuracy of speech. The interrelationships and qualities of these measures are also investigated and discussed.


1996 ◽  
Vol 78 (3) ◽  
pp. 891-898
Author(s):  
Michael S. Trevisan ◽  
F. Leon Paulson

This study is the first empirical investigation of the 1964 Tversky condition applied to rating scales. The Tversky condition posits that the 3-response format will be optimum if testing time is proportional to the length of the test. To this end, 2-, 3-, 4-, and 5-response category forms of a 10-item measure of attitudes in science were randomly administered to 241 third grade students. Reliability and validity were computed for each form. No significant differences were found among the reliability coefficients or among the validity coefficients. The Tversky condition was not confirmed for rating scales. These findings are consistent with results from other studies regarding the lack of substantial differences among reliability and validity coefficients as the number of response categories in a rating scale is varied.


2010 ◽  
Vol 2 ◽  
pp. 61-68
Author(s):  
Hacer Hande Uysal

The present paper aims to provide a short historical overview of the theoretical developments in validity research in second language performance testing. It presents a comparative description and critical evaluation of different views, such as the “Trinitarian approach” versus the construct validity model and the “uniform approach” versus the “unified approach,” as well as alternative and critical approaches to validation in L2 performance testing. These various theoretical approaches are introduced in terms of their definitions of the validity concept, their suggested requirements for validity research, and their attitudes towards reliability and theory while making interpretations of test scores. The paper also focuses on the current problems with the applicability of these theoretical approaches, and discusses future directions in validity research.

Key words: Second language assessment, performance assessment, validity, reliability, validity research

