The affectability of writing assessment scores: a G-theory analysis of rater, task, and scoring method contribution

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ali Khodi

Abstract The present study investigated factors that affect EFL writing scores using generalizability theory (G-theory). For this purpose, one hundred and twenty students completed one independent and one integrated writing task. Their performances were then scored by six raters: one self-rating, three peer-ratings, and two instructor-ratings. The main purpose of the study was to determine the relative and absolute contributions of different facets, such as student, rater, task, method of scoring, and background of education, to the validity of writing assessment scores. The results indicated three major sources of variance: (a) the student by task by method of scoring (nested in background of education) interaction (STM:B), contributing 31.8% of the total variance; (b) the student by rater by task by method of scoring (nested in background of education) interaction (SRTM:B), contributing 26.5%; and (c) the student by rater by method of scoring (nested in background of education) interaction (SRM:B), contributing 17.6%. With regard to the G-coefficients in the G-study (relative G-coefficient ≥ 0.86), the assessment results were found to be highly valid and reliable. The sources of error variance were identified as the student by rater (nested in background of education) interaction (SR:B) and the rater by background of education interaction, contributing 99.2% and 0.8% of the error variance, respectively. Additionally, ten separate G-studies were conducted to investigate the contribution of different facets, with rater, task, and method of scoring as differentiation facets. These studies suggested that peer rating, the analytic scoring method, and integrated writing tasks were the most reliable and generalizable writing assessment designs. Finally, five decision studies (D-studies) were conducted at the optimization level, indicating that at least four raters (G-coefficient = 0.80) are necessary for a valid and reliable assessment. Based on these results, to achieve the greatest gain in generalizability, teachers should have their students complete two writing tasks and have their performance rated with at least two scoring methods by at least four raters.
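
A note on the D-study arithmetic: the rater projection described above follows the standard relative G-coefficient formula, E(ρ²) = σ²_p / (σ²_p + σ²_pr,e / n_r). The sketch below is a minimal illustration with hypothetical variance components, not the study's actual estimates:

```python
# A minimal D-study sketch for a persons x raters design. The two
# variance components are hypothetical placeholders, not the
# estimates reported in the study.

var_person = 0.50    # universe-score variance (persons)
var_pr_err = 0.50    # person x rater interaction + residual error

def relative_g(n_raters: int) -> float:
    """Relative G-coefficient when scores are averaged over n_raters."""
    return var_person / (var_person + var_pr_err / n_raters)

for n in range(1, 7):
    print(f"{n} rater(s): G = {relative_g(n):.2f}")
# With these placeholders, G first reaches 0.80 at four raters,
# mirroring the cut-off reasoning described in the abstract.
```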

2003 ◽  
Vol 92 (3) ◽  
pp. 1015-1021 ◽  
Author(s):  
Kevin D. Crehan ◽  
Mary Curfman

The effect of timely feedback for a state writing assessment on subsequent writing performance was investigated, and agreement between teachers' scores and the state department's scores was compared. Eighth-grade English teachers (N = 8) were trained on an analytic scoring method that yielded scores on ideas, organization, voice, and conventions. September state writing assessments from the teachers' classes were scored by the teachers, who also scored assessments for a partner teacher's class. A second, parallel writing assessment was administered in February to the trained teachers' classes and to eight control classes. Analysis showed good agreement between the teachers' scores and those of the state department: there was 75% agreement on the designation of students' writing as adequate or inadequate. There was no difference in writing performance between students of the trained teachers and students in the control classes on the follow-up assessment.
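
The reported 75% figure is a simple exact-agreement rate on the adequate/inadequate designation; a minimal sketch with made-up labels shows the computation:

```python
# Percent exact agreement on a binary adequate/inadequate
# designation. The label lists are illustrative, not the study's data.

teacher = ["adequate", "inadequate", "adequate", "adequate"]
state   = ["adequate", "inadequate", "inadequate", "adequate"]

agree = sum(t == s for t, s in zip(teacher, state))
print(f"agreement: {agree / len(teacher):.0%}")  # -> agreement: 75%
```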


2001 ◽  
Vol 6 (4) ◽  
pp. 272-286 ◽  
Author(s):  
Bogdan Zawadzki ◽  
Jan Strelau ◽  
Włodzimierz Oniszczenko ◽  
Rainer Riemann ◽  
Alois Angleitner

Among the criteria for a personality paradigm, the following three are the most crucial: biological basis (e.g., genetic contribution to the phenotypic variance), universality (existence of traits in different cultures), and reality (possibility to measure traits by different methods). The present study combines all three criteria to explore the impact of genetic and environmental factors on temperamental traits, as stipulated by Strelau's regulative theory of temperament, across two culturally different samples (Polish and German) and by means of two diagnostic methods (self-report and peer-rating). The analysis was conducted on data obtained from 1009 same-sex pairs of twins (German sample) and 546 same-sex pairs of twins (Polish sample). For each subject, the self-report as well as ratings from two independent peers were recorded using the Polish and German versions of the Formal Characteristics of Behavior-Temperament Inventory. Results demonstrate substantial heritability of temperamental traits, although average peer-rating tends to provide lower heritability estimates than self-report (across six traits, M = 33% and M = 46% of the total variance, respectively). After separating the error variance from the effect of nonshared environment for both methods (self-report and peer-rating) and both samples, a joint analysis indicated a very high impact of genetic factors (the average rose to 66% of the total variance). No significant "sample" effect was found, allowing us to conclude that temperamental traits are determined to the same extent by genetic factors in both cultures.
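
The study estimated heritability by fitting biometric models to twin data; purely as a didactic shortcut, the classical Falconer approximation h² = 2(r_MZ − r_DZ) illustrates how twin correlations translate into a heritability estimate. The correlations below are invented:

```python
# Falconer's approximation: h^2 = 2 * (r_MZ - r_DZ). A didactic
# shortcut, not the biometric model fitting used in the study;
# both correlations are invented.

r_mz = 0.48  # hypothetical monozygotic twin correlation
r_dz = 0.25  # hypothetical dizygotic twin correlation

h2 = 2 * (r_mz - r_dz)   # additive genetic share
c2 = r_mz - h2           # shared-environment share
e2 = 1 - r_mz            # nonshared environment + measurement error
print(f"h2 = {h2:.2f}, c2 = {c2:.2f}, e2 = {e2:.2f}")
```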


2009 ◽  
Vol 31 (1) ◽  
pp. 81
Author(s):  
Takeaki Kumazawa

Classical test theory (CTT) has been widely used to estimate the reliability of measurements. Generalizability theory (G theory), an extension of CTT, is a powerful statistical procedure, particularly useful for performance testing, because it enables estimation of the proportion of persons variance and of multiple sources of error variance. This study focuses on a generalizability study (G study) conducted to investigate such variance components for a paper-and-pencil multiple-choice vocabulary test used as a diagnostic pretest. Further, a decision study (D study) was conducted to compute the generalizability coefficient (G coefficient) for absolute decisions. The results of the G and D studies indicated that 46% of the total variance was due to the items effect; further, the G coefficient for absolute decisions was low.
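
For absolute decisions in a persons × items design, the dependability (phi) coefficient counts the items effect as error, which is why a large items effect (46% here) depresses it. A sketch with placeholder variance components:

```python
# Phi (dependability) coefficient for absolute decisions in a
# persons x items design. The components are placeholders chosen so
# that the items effect dominates, echoing the abstract's finding.

var_p  = 0.05   # persons (universe-score) variance
var_i  = 0.46   # items effect (46% of total, per the abstract)
var_pi = 0.49   # person x item interaction + residual error

def phi(n_items: int) -> float:
    """Dependability of a mean score over n_items items."""
    return var_p / (var_p + (var_i + var_pi) / n_items)

for n in (10, 20, 40):
    print(f"{n} items: phi = {phi(n):.2f}")
# Small persons variance plus a large items effect keeps phi low
# even for fairly long tests.
```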


2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Inga Hege ◽  
Isabel Kiesewetter ◽  
Martin Adler

Abstract Background The ability to compose a concise summary statement about a patient is a good indicator of the clinical reasoning abilities of healthcare students. To assess such summary statements manually, a rubric based on five categories (use of semantic qualifiers, narrowing, transformation, accuracy, and global rating) has been published. Our aim was to explore whether computer-based methods can be applied to automatically assess summary statements composed by learners in virtual patient scenarios, based on the available rubric, in real time, to serve as a basis for immediate feedback to learners. Methods We randomly selected 125 summary statements in German and English composed by learners in five different virtual patient scenarios. We manually rated these statements based on the rubric plus an additional category for the use of the virtual patient's name. We then implemented a natural language processing approach, in combination with our own algorithm, to automatically assess the same 125 statements, and compared the results of the manual and automatic ratings in each category. Results We found moderate agreement between the manual and automatic ratings in most of the categories. However, further analysis and development are needed, especially for a more reliable assessment of factual accuracy and for the identification of patient names in the German statements. Conclusions Despite these areas for improvement, we believe that our results justify a careful display of the computer-calculated assessment scores as feedback to learners. It will be important to emphasize that the rating is an approximation and to give learners the possibility to contest supposedly incorrect assessments, which will also help us further improve the rating algorithms.
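
The abstract does not specify the algorithm; as a loose sketch of just two rubric checks (semantic qualifiers and patient-name use), a lexicon-based approach might look like the following, where the lexicon, the example name, and the scoring are all assumptions rather than the authors' method:

```python
# A naive, lexicon-based sketch of two rubric checks: semantic
# qualifiers and patient-name use. The qualifier lexicon, the name,
# and the scoring are assumptions, not the authors' algorithm.

SEMANTIC_QUALIFIERS = {"acute", "chronic", "sudden", "gradual",
                       "unilateral", "bilateral", "recurrent"}

def check_statement(statement: str, patient_name: str) -> dict:
    tokens = {t.strip(".,;").lower() for t in statement.split()}
    return {
        "uses_qualifiers": bool(tokens & SEMANTIC_QUALIFIERS),
        "uses_patient_name": patient_name.lower() in tokens,
    }

print(check_statement(
    "Anna presents with acute, unilateral chest pain.", "Anna"))
# -> {'uses_qualifiers': True, 'uses_patient_name': True}
```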


2000 ◽  
Vol 26 (4) ◽  
pp. 813-835 ◽  
Author(s):  
Winfred Arthur ◽  
David J. Woehr ◽  
Robyn Maldegen

This study notes that the lack of convergent and discriminant validity of assessment center ratings, in the presence of content-related and criterion-related validity, is paradoxical within a unitarian framework of validity. It also empirically demonstrates an application of generalizability theory to examining the convergent and discriminant validity of assessment center dimension ratings. Generalizability analyses indicated that person, dimension, and person by dimension effects contribute large proportions of the total variance in assessment center ratings. In contrast, exercise, rater, person by exercise, and dimension by exercise effects are shown to contribute little to the total variance. Correlational and confirmatory factor analysis results were consistent with the generalizability results. This provides strong evidence for the convergent and discriminant validity of the assessment center dimension ratings, a finding consistent with the conceptual underpinnings of the unitarian view of validity and inconsistent with previously reported results. Implications for future research and practice are discussed.


Author(s):  
Felix D. Schönbrodt ◽  
Caroline Zygar-Hoffmann ◽  
Steffen Nestler ◽  
Sebastian Pusch ◽  
Birk Hagemeyer

Abstract The investigation of within-person process models, often done in experience sampling designs, requires a reliable assessment of within-person change. In this paper, we focus on dyadic intensive longitudinal designs where both partners of a couple are assessed multiple times each day across several days. We introduce a statistical model for variance decomposition based on generalizability theory (extending P. E. Shrout & S. P. Lane, 2012), which can estimate the relative proportion of variability at four hierarchical levels: moments within a day, days, persons, and couples. Based on these variance estimates, four reliability coefficients are derived: between-couples, between-persons, within-persons/between-days, and within-persons/between-moments. We apply the model to two dyadic intensive experience sampling studies (n1 = 130 persons, 5 surveys each day for 14 days, ≥ 7508 unique surveys; n2 = 508 persons, 5 surveys each day for 28 days, ≥ 47764 unique surveys). Five scales in the domain of motivational processes and relationship quality were assessed with 2 to 5 items each: state relationship satisfaction, communal motivation, and agentic motivation; the latter consists of two subscales, namely power and independence motivation. The largest variance components were at the levels of persons, moments, couples, and days, and within-day variance was generally larger than between-day variance. Reliabilities ranged from .32 to .76 (couple level), .93 to .98 (person level), .61 to .88 (day level), and .28 to .72 (moment level). Scale intercorrelations reveal differential structures between and within persons, which has consequences for theory building and statistical modeling.
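
The four reliability coefficients follow the usual generalizability pattern at each level: variance at that level divided by that variance plus lower-level variance averaged over the sampling design. A schematic sketch with invented components and design sizes (not the studies' estimates):

```python
# Schematic reliability coefficients for a four-level design
# (couples > persons > days > moments), in the spirit of the
# Shrout & Lane (2012) extension described above. All variance
# components and design sizes are invented.

var = {"couple": 0.05, "person": 0.35, "day": 0.10,
       "moment": 0.20, "error": 0.30}
n_days, n_moments, n_items = 14, 5, 3

# Between-persons: person-level variance against all lower-level
# variance, averaged over a person's full sampling design.
r_person = var["person"] / (
    var["person"]
    + var["day"] / n_days
    + var["moment"] / (n_days * n_moments)
    + var["error"] / (n_days * n_moments * n_items)
)

# Within-persons/between-moments: moment-level variance against
# item-level error averaged over the items answered at one moment.
r_moment = var["moment"] / (var["moment"] + var["error"] / n_items)

print(f"between-persons: {r_person:.2f}")   # ~0.97 with these numbers
print(f"between-moments: {r_moment:.2f}")   # ~0.67 with these numbers
```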


2019 ◽  
Author(s):  
Khawater Fahad Alshalan

This study investigates how frequently Halliday and Hasan's (1976) cohesive devices were used, as well as their relationship with writing quality, in essays by 100 Saudi EFL undergraduate students at Al Imam Muhammed Ibn Saud Islamic University, Riyadh, Saudi Arabia. It uses a mixed-methods approach in which the students' essays were analyzed using systemic functional linguistics (SFL) in terms of the textual metafunction of cohesive devices. The five types of cohesive devices are lexical cohesion, reference, conjunction, substitution, and ellipsis; each of their subcategories was also analyzed in the students' texts. NVivo qualitative data analysis software and corpus analysis (conducted using AntConc) were used to calculate the frequency of each cohesive device found in the data. The IELTS writing assessment scale was used to evaluate the students' writing scores. The results show that the most frequently used device was lexical cohesion, specifically repetition: Saudi EFL undergraduate students tended to stay focused on the central idea of the topic through repetition. Furthermore, Pearson's correlation coefficient revealed relationships between the students' writing scores and the length of their essays, between the use of cohesive ties and the scores, and between cohesive ties and essay length. This study recommends that EFL teachers introduce Saudi EFL students to a range of cohesive devices in order to help them improve their writing skills and connect their ideas smoothly.
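
The frequency counts in the study came from NVivo and AntConc; purely as an assumed, simplified stand-in, a lexicon-based counter might look like this (the device lexicon is a toy subset of Halliday and Hasan's categories):

```python
# A toy frequency counter for cohesive devices. The study used NVivo
# and AntConc; this lexicon is a tiny illustrative subset of
# Halliday and Hasan's (1976) categories, not a full implementation.

from collections import Counter

DEVICES = {
    "reference": {"he", "she", "it", "they", "this", "that"},
    "conjunction": {"and", "but", "so", "because", "however"},
}

def count_devices(text: str) -> Counter:
    tokens = [t.strip(".,;!?").lower() for t in text.split()]
    counts = Counter()
    for category, lexicon in DEVICES.items():
        counts[category] = sum(t in lexicon for t in tokens)
    return counts

print(count_devices("The topic is hard, but she explains it well."))
# -> Counter({'reference': 2, 'conjunction': 1})
```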


Mindfulness ◽  
2020 ◽  
Author(s):  
Oleg N. Medvedev ◽  
Anastasia T. Dailianis ◽  
Yoon-Suk Hwang ◽  
Christian U. Krägeloh ◽  
Nirbhay N. Singh

2013 ◽  
Vol 30 (1) ◽  
pp. 59-69 ◽  
Author(s):  
Elizabeth A. Holbrook ◽  
Minsoo Kang ◽  
Don W. Morgan

As a first step toward the development of adapted physical activity (PA) programs for adults with visual impairment (VI), the purpose of this study was to determine the time frame needed to reliably estimate weekly PA in adults with VI. Thirty-three adults with VI completed 7 days of pedometer-based PA assessment. Generalizability theory analyses were conducted to quantify sources of variance within the PA estimate and to determine the number of days of PA monitoring needed for the total sample and for participants with mild-to-moderate and severe VI. A single-facet, crossed design was employed, including participants and days. Participants and days accounted for 33–55% and 0–3% of the total variance in PA, respectively. While a reliable account of PA was obtained for the total sample over a 6-day period, shorter (4-day) and longer (9-day) periods were required for persons with mild-to-moderate and severe VI, respectively.
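
The required monitoring duration follows from projecting a dependability coefficient over increasing numbers of days in the persons × days design; a sketch with placeholder variance components (not the study's estimates):

```python
# Projecting how many days of pedometer monitoring are needed for a
# dependable (phi >= 0.80) weekly PA estimate in a persons x days
# design. Variance components are placeholders, not the study's.

var_person = 0.40   # between-person variance
var_error  = 0.60   # days effect + person x day residual

def phi(n_days: int) -> float:
    return var_person / (var_person + var_error / n_days)

n = 1
while phi(n) < 0.80:
    n += 1
print(f"days needed: {n} (phi = {phi(n):.2f})")  # -> 6 days here
```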


1993 ◽  
Vol 18 (2) ◽  
pp. 197-206 ◽  
Author(s):  
George A. Marcoulides

Generalizability theory provides a framework for examining the dependability of behavioral measurements. When designing generalizability studies, two important statistical issues are generally considered: power and measurement error. Control over power and error of measurement can be obtained by manipulation of sample size and/or test reliability. In generalizability theory, the mean error variance is an estimate that takes into account both these statistical issues. When limited resources are available, determining an optimal measurement design is not a simple task. This article presents a methodology for minimizing mean error variance in generalizability studies when resource constraints are imposed.
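
The methodology can be caricatured as a constrained search: enumerate feasible designs under the budget and keep the one with the smallest mean error variance. The toy sketch below uses invented costs and variance components, not Marcoulides' worked example:

```python
# Toy version of choosing a measurement design under a budget:
# enumerate (raters, occasions) combinations, keep feasible ones,
# and minimize the mean error variance. All numbers are invented
# placeholders, not Marcoulides' worked example.

var_pr, var_po, var_pro = 0.30, 0.10, 0.20   # interaction/error components
cost_rater, cost_occasion, budget = 50, 20, 300

def mean_error_variance(n_r: int, n_o: int) -> float:
    return var_pr / n_r + var_po / n_o + var_pro / (n_r * n_o)

best = min(
    ((r, o) for r in range(1, 7) for o in range(1, 7)
     if r * cost_rater + o * cost_occasion <= budget),
    key=lambda d: mean_error_variance(*d),
)
print(best, round(mean_error_variance(*best), 3))  # -> (4, 5) 0.105
```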

