Establishing an Operational Model of Rating Scale Construction for English Writing Assessment

2021 · Vol 15 (1) · pp. 16
Author(s):  
Xuefeng Wu

Rating scales for writing assessment are critical because they directly determine the quality and fairness of such performance tests. In many EFL contexts, however, rating scales are built, to a certain extent, on the intuition of teachers, who badly need a feasible and principled procedure to guide scale construction. This study aims to design an operational model of rating scale construction, using English summary writing as an example. Altogether 325 university English teachers, 4 experts in language assessment, and 60 English majors in China participated in the study. Twenty textual attributes were extracted, through text analysis, from China’s Standards of English Language Ability (CSE), the theoretical construct of summary writing, comments on sample summary essays from 8 English teachers, and those teachers’ personal judgement. The textual attributes were then investigated through a large-scale questionnaire survey. Exploratory factor analysis and expert judgement were employed to determine the rating scale dimensions, and regression analysis and expert judgement to determine the weighting across those dimensions. On this basis, a tentative operational model of rating scale construction was established, one that can also be applied and adapted to develop rating scales for other writing assessments.
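The two statistical steps in that workflow can be illustrated with a short sketch. The code below is a minimal, hypothetical example on synthetic survey data, assuming an arbitrary four-factor solution; the variable names, factor count, and data are not taken from the study.

```python
# Sketch of the two steps the abstract describes: exploratory factor
# analysis to group textual attributes into scale dimensions, then
# regression to derive dimension weights. All data are synthetic.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(325, 20))        # 325 respondents x 20 textual attributes
holistic = rng.normal(size=325)       # hypothetical overall quality ratings

# Step 1: extract latent dimensions from the 20 attributes.
fa = FactorAnalysis(n_components=4, random_state=0)
scores = fa.fit_transform(X)                    # respondent scores per dimension
print("loadings shape:", fa.components_.shape)  # (4 factors, 20 attributes)

# Step 2: regress an overall rating on the dimension scores and
# normalise the coefficients into proportional weights.
reg = LinearRegression().fit(scores, holistic)
weights = np.abs(reg.coef_) / np.abs(reg.coef_).sum()
print("dimension weights:", np.round(weights, 2))
```

In a real application the factor count would come from scree or parallel analysis plus expert judgement, as the abstract indicates, rather than being fixed in advance.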

2020 · Vol 21 (3) · pp. 299-313
Author(s):  
Belinda Goodenough
Jacqueline Watts
Sarah Bartlett

Abstract
Objectives: To satisfy requirements for continuing professional education, workforce demand for access to large-scale continuing professional education and micro-credential-style online courses is increasing. This study examined the Knowledge Translation (KT) outcomes of a short (2 h) online course about support at night for people living with dementia (Bedtime to Breakfast), delivered at a national scale by Dementia Training Australia (DTA).
Methods: A sample of the first cohort of course completers was re-contacted after 3 months to complete a KT follow-up feedback survey (n = 161). In addition to potential practice impacts in three domains (Conceptual, Instrumental, Persuasive), respondents rated the level of Perceived Improvement in Quality of Care (PIQOC) using a positively packed global rating scale.
Results: Overall, 93.8% of respondents agreed that the course had made a difference to the support they had provided for people with dementia since completing it. In addition to the anticipated Conceptual impacts (e.g., changes in knowledge), a range of Instrumental and Persuasive impacts were reported, including the development of workplace guidelines and knowledge transfer to other staff. Tally counts of discrete KT outcomes were high (median 7/10) and explained 23% of the variance in PIQOC ratings.
Conclusions: Online short courses delivered at a national scale can support a range of translation-to-practice impacts, within the constraints of retrospective insight into personal practice change. Topics around self-assessed knowledge-to-practice change and the value of positively packed rating scales for increasing variance in respondent feedback are discussed.
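As a rough illustration of the variance-explained figure, the sketch below regresses a global rating on tallies of discrete outcomes and reads off R². The data are simulated; only the shape of the analysis (tallies out of 10, one global rating, R² near 0.23) mirrors the study.

```python
# Variance in a global rating explained by tallies of discrete
# knowledge-translation outcomes, via simple linear regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
kt_tally = rng.integers(0, 11, size=161)               # discrete KT outcomes per respondent
piqoc = 0.3 * kt_tally + rng.normal(0, 1.3, size=161)  # hypothetical global ratings

fit = stats.linregress(kt_tally, piqoc)
print(f"R^2 = {fit.rvalue**2:.2f}")  # share of rating variance explained by tallies
```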


2020 · Vol 36 (4)
Author(s):  
Nguyen Thi Ngoc Quynh
Nguyen Thi Quynh Yen
Tran Thi Thu Hien
Nguyen Thi Phuong Thao
Bui Thien Sao
...  

Because it plays a vital role in assuring the reliability of language performance assessment, rater training has been a topic of interest in research on large-scale testing. In the context of the VSTEP, likewise, the effectiveness of the rater training program has been of great concern. This research therefore investigated the impact of the VSTEP speaking rating scale training session in the rater training program provided by the University of Languages and International Studies - Vietnam National University, Hanoi. Data were collected from 37 rater trainees of the program, whose ratings on the VSTEP.3-5 speaking rating scale before and after the training session were compared along the dimensions of score reliability, criterion difficulty, rater severity, rater fit, rater bias, and score band separation. Positive results were detected: the post-training ratings were more reliable, consistent, and distinguishable. Improvements were most noticeable for score band separation and slighter in the other aspects. Meaningful implications for both future rater training practice and rater training research methodology can be drawn from the study.
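Two of the facets compared here, rater severity and score reliability, can be approximated with a short sketch. The snippet below is a deliberately crude proxy on synthetic data, not the many-facet Rasch analysis the study used; the cohort sizes and noise levels are invented.

```python
# Crude pre/post comparison: rater severity as each rater's mean
# deviation from the examinee-wise average, and score reliability as
# Cronbach's alpha treating raters as "items". Synthetic data only.
import numpy as np

def severity(ratings):
    """ratings: (n_examinees, n_raters). Positive = lenient rater."""
    return (ratings - ratings.mean(axis=1, keepdims=True)).mean(axis=0)

def cronbach_alpha(ratings):
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

rng = np.random.default_rng(2)
ability = rng.normal(5, 1, size=(50, 1))                # latent speaking ability
pre  = ability + rng.normal(0, 1.0, size=(50, 37))      # noisier, pre-training
post = ability + rng.normal(0, 0.5, size=(50, 37))      # tighter, post-training

print("severity spread pre/post:",
      f"{severity(pre).std():.3f} / {severity(post).std():.3f}")
print("alpha pre/post:",
      f"{cronbach_alpha(pre):.3f} / {cronbach_alpha(post):.3f}")
```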


Author(s):  
Jiuliang Li
Qian Wang

Abstract
Summary writing is essential for academic success and has attracted renewed interest in academic research and large-scale language testing. Less attention, however, has been paid to the development and evaluation of scoring scales for summary writing. This study reports on the validation of a summary rubric that represents an approach to scale development with limited resources, adopted out of consideration for practicality. Participants were 83 students and three raters. Diagnostic evaluation of the scale components and categories was based on the raters’ perceptions of their use and on the scores of the students’ summaries, which were analyzed using multifaceted Rasch measurement (MFRM). Correlation analysis revealed significant relationships among the scoring components, but the coefficients among some of the components were excessively high. The MFRM analysis provided evidence in support of the usefulness of the scoring rubric, but it also suggested the need to refine the components and categories. According to the raters, the rubric was ambiguous in addressing some crucial text features. The study has implications for summarization task design and, in particular, for scoring scale development and validation.
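The correlation check described above can be sketched in a few lines: compute pairwise correlations among rubric components and flag pairs that look redundantly high. The component names, the synthetic scores, and the 0.8 threshold are all illustrative assumptions, not values from the study.

```python
# Pairwise correlations among rubric components, flagging pairs whose
# coefficients are high enough to suggest redundant components.
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(size=(83, 1))  # shared quality signal -> induces overlap
scores = {
    "content":      (base + rng.normal(0, 0.5, (83, 1))).ravel(),
    "organization": (base + rng.normal(0, 0.5, (83, 1))).ravel(),
    "language":     rng.normal(size=83),
}
names = list(scores)
mat = np.corrcoef([scores[n] for n in names])
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        flag = "  <- overly high?" if mat[i, j] > 0.8 else ""
        print(f"r({names[i]}, {names[j]}) = {mat[i, j]:+.2f}{flag}")
```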


2020 · Vol 10 (1)
Author(s):  
Hyunwoo Kim

Abstract The halo effect is raters’ undesirable tendency to assign more similar ratings across rating criteria than they should. The impact of the halo effect on ratings has been studied in rater-mediated L2 writing assessment; little is known, however, about the extent to which the order of the criteria in analytic rating scales is associated with the magnitude of group- and individual-level halo effects. This study therefore examined that association. To select essays untainted by the effects of rating criteria order, a balanced Latin square design was implemented with four expert raters. Next, 11 trained novice Korean raters rated the 30 screened essays on the four rating criteria in three different orders: standard, reverse, and random. A three-facet rating scale model (L2 writer ability, rater severity, criterion difficulty) was fitted to estimate the group- and individual-level halo effects. Overall, a group-level halo effect of similar magnitude was detected in the standard- and reverse-order rubrics, whereas random presentation of the rating criteria decreased it. A theoretical implication of the study is the necessity of considering rating criteria order as a source of construct-irrelevant easiness or difficulty when developing analytic rating scales.
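Two pieces of this design lend themselves to a short sketch: constructing a balanced Latin square to counterbalance criterion order, and computing a simple group-level halo index (the mean within-essay spread across criteria, where flatter profiles suggest more halo). The index is an illustrative proxy, not the three-facet model estimate used in the study, and the data are synthetic.

```python
# Balanced Latin square for order counterbalancing, plus a naive
# group-level halo index on synthetic ratings.
import numpy as np

def balanced_latin_square(n):
    """First row 0, 1, n-1, 2, n-2, ...; later rows shift by 1 (even n)."""
    row = [0] + [(j + 1) // 2 if j % 2 else n - j // 2 for j in range(1, n)]
    return [[(x + r) % n for x in row] for r in range(n)]

def halo_index(ratings):
    """ratings: (n_essays, n_criteria). Lower values = flatter profiles
    across criteria, i.e., a stronger group-level halo."""
    return ratings.std(axis=1, ddof=1).mean()

print(balanced_latin_square(4))  # each criterion appears once per position

rng = np.random.default_rng(4)
shared = rng.normal(size=(30, 1))                    # overall impression per essay
halo = shared + rng.normal(0, 0.2, size=(30, 4))     # criteria move together
no_halo = rng.normal(size=(30, 4))                   # criteria rated independently
print(f"halo index: {halo_index(halo):.2f} (halo) vs "
      f"{halo_index(no_halo):.2f} (none)")
```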


2017 · Vol 52 (2) · pp. 147-172
Author(s):  
Salena Sampson Anderson

Abstract While large-scale language and writing assessments benefit from a wealth of literature on the reliability and validity of specific tests and rating procedures, comparatively little literature explores the specific language of second language writing rubrics. This paper analyzes the language of the performance descriptors in the public versions of the TOEFL and IELTS writing assessment rubrics, focusing on linguistic agency encoded by agentive verbs and on language of ability encoded by the modal verbs can and cannot. While the IELTS rubrics feature more agentive verbs than the TOEFL rubrics, both pairs of rubrics show uneven syntax across the band or score descriptors, with more agentive verbs for the highest scores, more nominalization for the lowest scores, or language of ability appearing exclusively in the lowest scores. These patterns mirror those in the language of college-level, classroom-based writing rubrics, but they differ from the patterns seen in the performance descriptors of some large-scale admissions tests. It is argued that the lack of syntactic congruity across the performance descriptors in the IELTS and TOEFL rubrics may reflect a bias in how actual student performances at different levels are characterized.
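A descriptor analysis of this kind can be approximated with simple token counting. In the sketch below the band descriptors are invented stand-ins, not quotations from the TOEFL or IELTS rubrics, and the agentive-verb list is an assumption; matching is naive exact-token lookup with no lemmatization.

```python
# Count agentive verbs and modal "can"/"cannot" per band descriptor.
import re

descriptors = {  # hypothetical descriptors, not actual rubric text
    9: "Uses a wide range of structures; conveys precise meanings.",
    5: "Presents some main ideas but organizes them unevenly.",
    1: "Can only produce isolated words; cannot convey a message.",
}
AGENTIVE = {"uses", "conveys", "presents", "organizes", "produces"}

for band, text in descriptors.items():
    tokens = re.findall(r"[a-z]+", text.lower())
    agentive = sum(t in AGENTIVE for t in tokens)
    modals = sum(t in {"can", "cannot"} for t in tokens)
    print(f"band {band}: agentive={agentive}, modal can/cannot={modals}")
```

Even on these toy descriptors the asymmetry the paper describes is visible: agentive verbs cluster in the high bands, ability language in the lowest.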


2019 · pp. 10-17
Author(s):  
Olga Kvasova
Tamara Kavytska
Viktoriya Osidak

The rating of students’ writing has been a long-standing concern in L2 large-scale standardized and classroom-based assessment, and several studies have tried to identify how raters make scoring decisions and assign scores so as to ensure the validity of writing assessment. The current paper addresses the writing assessment practices of Ukrainian university teachers and how they approach rating scales and criteria, in an attempt to understand the culturally specific challenges of teachers’ writing assessment in Ukraine. To investigate the issue, this study analyzes survey results obtained from 104 university teachers of English. The survey consisted of 13 questions covering current practices in the assessment of writing: frequency of assessment, use of rating scales, rater profile, assessment criteria, feedback and rewriting, and training in the assessment of writing. The responses show that assessment in Ukraine is not regulated by a common standard, so the approach to assessing students’ writing is often intuitive. Teachers frequently rely on errors, as observable features of the text, to justify their rating decisions. Consequently, by shifting focus onto surface features of writing, grammar mistakes in particular, they underrate criteria such as “register”, “compliance with textual features” and “layout”. Additionally, the data reveal contradictory findings about the writing assessment literacy of the teachers surveyed: even though most claim to apply scales while rating, many admit they cannot tell the difference between holistic and analytic scales. Moreover, the results indicate that feedback is not yet a meaningful interaction between Ukrainian teachers and learners. The study therefore demonstrates the need to improve writing assessment practices, which could be achieved through training and reorientation that help Ukrainian teachers develop a common understanding and interpretation of task requirements and scale features.


2012 · Vol 21 (4) · pp. 136-143
Author(s):  
Lynn E. Fox

Abstract The self-anchored rating scale (SARS) is a technique that augments collaboration between Augmentative and Alternative Communication (AAC) interventionists, their clients, and their clients' support networks. SARS is used in Solution-Focused Brief Therapy, a branch of systemic family counseling. It has been applied to treating speech and language disorders across the life span, and recent case studies show it has promise for promoting the adoption and long-term use of high- and low-tech AAC. I will describe 2 key principles of solution-focused therapy and present the 7 steps in the SARS process, illustrating how clinicians can use the SARS to involve a person with aphasia and his or her family in all aspects of the therapeutic process. I will use a case study to illustrate the SARS process and present outcomes for one individual living with aphasia.


2006 · Vol 22 (4) · pp. 259-267
Author(s):  
Eelco Olde
Rolf J. Kleber
Onno van der Hart
Victor J.M. Pop

Childbirth has been identified as a potentially traumatic experience that can lead to traumatic stress responses and even to the development of posttraumatic stress disorder (PTSD). The current study investigated the psychometric properties of the Dutch version of the Impact of Event Scale-Revised (IES-R) in a group of women who had recently given birth (N = 435), and compared the original IES with the IES-R. The scale showed high internal consistency (α = 0.88). Using confirmatory factor analysis, no support was found for a three-factor structure comprising intrusion, avoidance, and hyperarousal factors; goodness of fit was only reasonable, even after fitting one intrusion item on the hyperarousal scale. The IES-R correlated significantly with scores on depression and anxiety self-rating scales, as well as with scores on a self-rating scale of posttraumatic stress disorder. Although the IES-R can be used to study posttraumatic stress reactions in women who have recently given birth, the original IES proved to be the better instrument of the two. It is concluded that adding the hyperarousal scale to the IES-R did not make the scale stronger.
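The internal-consistency figure (α = 0.88) is a Cronbach's alpha over the scale items. A minimal sketch, assuming the pingouin library is available and using simulated 0-4 item responses shaped like the 22-item IES-R, is shown below; because the data are synthetic, the resulting alpha will not match the published value.

```python
# Cronbach's alpha on simulated 22-item Likert responses (N = 435).
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(5)
latent = rng.normal(size=(435, 1))                     # shared distress factor
items = latent + rng.normal(0, 1, size=(435, 22))      # 22 correlated items
df = pd.DataFrame(np.clip(np.round(items + 2), 0, 4))  # map onto a 0-4 scale

alpha, ci = pg.cronbach_alpha(data=df)
print(f"alpha = {alpha:.2f}, 95% CI = {ci}")
```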


Methodology · 2011 · Vol 7 (3) · pp. 88-95
Author(s):  
Jose A. Martínez
Manuel Ruiz Marín

The aim of this study is to improve measurement in marketing research by constructing a new, simple, nonparametric, consistent, and powerful test of scale invariance, called the D-test. The D-test is built from symbolic dynamics, with symbolic entropy as the measure of the difference between the response patterns that come from two measurement scales. We also give a standard asymptotic distribution for our statistic. Because the test is based on entropy measures, it avoids smoothed nonparametric estimation. We applied the D-test in a real marketing research setting to study whether scale invariance holds when measuring service quality in a sports service. Taking a free scale as the reference, we compared it with three widely used rating scales: Likert-type scales from 1 to 5 and from 1 to 7, and a semantic-differential scale from −3 to +3. Scale invariance held for the two latter scales. The test overcomes the shortcomings of other procedures for analyzing scale invariance, and it gives researchers a tool for choosing the appropriate rating scale for specific marketing problems and for assessing whether the results of prior studies should be questioned.
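The symbolic-entropy idea can be illustrated with a toy comparison: bin responses from each scale into a small set of ordinal symbols and compare the Shannon entropies of the symbol distributions. This sketch is a simplified stand-in for the authors' D-test, not their actual statistic; the data and the binning scheme are invented.

```python
# Symbolize responses from two rating scales and compare the Shannon
# entropies of the resulting symbol distributions.
import numpy as np

def symbolize(responses, lo, hi, n_symbols=3):
    """Map responses onto n_symbols equal-width bins over [lo, hi]."""
    edges = np.linspace(lo, hi, n_symbols + 1)[1:-1]
    return np.digitize(responses, edges)

def symbolic_entropy(symbols):
    """Shannon entropy of the empirical symbol distribution."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(6)
quality = rng.normal(size=200)                        # latent service quality
likert5 = np.clip(np.round(quality * 1.2 + 3), 1, 5)  # same construct, 1-5 scale
semdiff = np.clip(np.round(quality * 1.5), -3, 3)     # same construct, -3..+3 scale

h5 = symbolic_entropy(symbolize(likert5, 1, 5))
hsd = symbolic_entropy(symbolize(semdiff, -3, 3))
print(f"entropy, 1-5 Likert: {h5:.3f}; semantic differential: {hsd:.3f}")
# Similar entropies are consistent with scale invariance; a large gap
# would suggest the two scales elicit different response patterns.
```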

