The affectability of writing assessment scores: a G-theory analysis of rater, task, and scoring method contribution

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ali Khodi

Abstract The present study investigated factors that affect EFL writing scores using generalizability theory (G-theory). For this purpose, one hundred and twenty students completed one independent and one integrated writing task. Their performances were then scored by six raters: one self-rating, three peer-ratings, and two instructor-ratings. The main purpose of the study was to determine the relative and absolute contributions of different facets, such as student, rater, task, method of scoring, and background of education, to the validity of writing assessment scores. The results indicated three major sources of variance: (a) the student by task by method of scoring (nested in background of education) interaction (STM:B), contributing 31.8% of the total variance; (b) the student by rater by task by method of scoring (nested in background of education) interaction (SRTM:B), contributing 26.5%; and (c) the student by rater by method of scoring (nested in background of education) interaction (SRM:B), contributing 17.6%. With regard to the G-coefficients in the G-study (relative G-coefficient ≥ 0.86), the assessment results were found to be highly valid and reliable. The sources of error variance were identified as the student by rater (nested in background of education) interaction (SR:B) and the rater by background of education interaction, contributing 99.2% and 0.8% of the error variance, respectively. Additionally, ten separate G-studies were conducted to investigate the contribution of different facets, with rater, task, and method of scoring as differentiation facets. These studies suggested that peer rating, the analytic scoring method, and integrated writing tasks were the most reliable and generalizable writing assessment designs. Finally, five decision studies (D-studies) were conducted at the optimization level, indicating that at least four raters (G-coefficient = 0.80) are necessary for a valid and reliable assessment. Based on these results, to achieve the greatest gain in generalizability, teachers should have their students complete two writing tasks and have their performance rated with at least two scoring methods by at least four raters.
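
A note on the D-study arithmetic: the rater projection described above follows the standard relative G-coefficient formula, E(ρ²) = σ²_p / (σ²_p + σ²_pr,e / n_r). The sketch below is a minimal illustration with hypothetical variance components, not the study's actual estimates:

```python
# A minimal D-study sketch for a persons x raters design. The two
# variance components are hypothetical placeholders, not the
# estimates reported in the study.

var_person = 0.50    # universe-score variance (persons)
var_pr_err = 0.50    # person x rater interaction + residual error

def relative_g(n_raters: int) -> float:
    """Relative G-coefficient when scores are averaged over n_raters."""
    return var_person / (var_person + var_pr_err / n_raters)

for n in range(1, 7):
    print(f"{n} rater(s): G = {relative_g(n):.2f}")
# With these placeholders, G first reaches 0.80 at four raters,
# mirroring the cut-off reasoning described in the abstract.
```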

2003 ◽  
Vol 92 (3) ◽  
pp. 1015-1021 ◽  
Author(s):  
Kevin D. Crehan ◽  
Mary Curfman

The effect of timely feedback for a state writing assessment on subsequent writing performance was investigated, and agreement between teachers' scores and the state department's scores was compared. Eighth-grade English teachers (N = 8) were trained on an analytic scoring method that yielded scores on ideas, organization, voice, and conventions. September state writing assessments from the teachers' classes were scored by the teachers, who also scored assessments for a partner teacher's class. A second, parallel writing assessment was administered in February to the trained teachers' classes and to eight control classes. Analysis showed good agreement between the teachers' scores and those of the state department: there was 75% agreement on the designation of students' writing as adequate or inadequate. There was no difference in writing performance between students of the trained teachers and students in the control classes on the follow-up assessment.
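
The reported 75% figure is a simple exact-agreement rate on the adequate/inadequate designation; a minimal sketch with made-up labels shows the computation:

```python
# Percent exact agreement on a binary adequate/inadequate
# designation. The label lists are illustrative, not the study's data.

teacher = ["adequate", "inadequate", "adequate", "adequate"]
state   = ["adequate", "inadequate", "inadequate", "adequate"]

agree = sum(t == s for t, s in zip(teacher, state))
print(f"agreement: {agree / len(teacher):.0%}")  # -> agreement: 75%
```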


2001 ◽  
Vol 6 (4) ◽  
pp. 272-286 ◽  
Author(s):  
Bogdan Zawadzki ◽  
Jan Strelau ◽  
Włodzimierz Oniszczenko ◽  
Rainer Riemann ◽  
Alois Angleitner

Among the criteria for a personality paradigm, the following three are the most crucial: biological basis (e.g., genetic contribution to the phenotypic variance), universality (existence of traits in different cultures), and reality (possibility to measure traits by different methods). The present study combines all three criteria to explore the impact of genetic and environmental factors on temperamental traits, as stipulated by Strelau's regulative theory of temperament, across two culturally different samples (Polish and German) and by means of two diagnostic methods (self-report and peer-rating). The analysis was conducted on data obtained from 1009 same-sex pairs of twins (German sample) and 546 same-sex pairs of twins (Polish sample). For each subject, the self-report as well as ratings from two independent peers were recorded using the Polish and German versions of the Formal Characteristics of Behavior-Temperament Inventory. Results demonstrate substantial heritability of temperamental traits, although average peer-rating tends to provide lower heritability estimates than self-report (across six traits, M = 33% and M = 46% of the total variance, respectively). After separating the error variance from the effect of nonshared environment for both methods (self-report and peer-rating) and both samples, a joint analysis indicated a very high impact of genetic factors (the average rose to 66% of the total variance). No significant "sample" effect was found, allowing us to conclude that temperamental traits are determined to the same extent by genetic factors in both cultures.
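
The study estimated heritability by fitting biometric models to twin data; purely as a didactic shortcut, the classical Falconer approximation h² = 2(r_MZ − r_DZ) illustrates how twin correlations translate into a heritability estimate. The correlations below are invented:

```python
# Falconer's approximation: h^2 = 2 * (r_MZ - r_DZ). A didactic
# shortcut, not the biometric model fitting used in the study;
# both correlations are invented.

r_mz = 0.48  # hypothetical monozygotic twin correlation
r_dz = 0.25  # hypothetical dizygotic twin correlation

h2 = 2 * (r_mz - r_dz)   # additive genetic share
c2 = r_mz - h2           # shared-environment share
e2 = 1 - r_mz            # nonshared environment + measurement error
print(f"h2 = {h2:.2f}, c2 = {c2:.2f}, e2 = {e2:.2f}")
```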


2009 ◽  
Vol 31 (1) ◽  
pp. 81
Author(s):  
Takeaki Kumazawa

Classical test theory (CTT) has been widely used to estimate the reliability of measurements. Generalizability theory (G theory), an extension of CTT, is a powerful statistical procedure, particularly useful for performance testing, because it enables estimation of the proportion of persons variance and of multiple sources of error variance. This study focuses on a generalizability study (G study) conducted to investigate such variance components for a paper-and-pencil multiple-choice vocabulary test used as a diagnostic pretest. Further, a decision study (D study) was conducted to compute the generalizability coefficient (G coefficient) for absolute decisions. The results of the G and D studies indicated that 46% of the total variance was due to the items effect; further, the G coefficient for absolute decisions was low.
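
For absolute decisions in a persons × items design, the dependability (phi) coefficient counts the items effect as error, which is why a large items effect (46% here) depresses it. A sketch with placeholder variance components:

```python
# Phi (dependability) coefficient for absolute decisions in a
# persons x items design. The components are placeholders chosen so
# that the items effect dominates, echoing the abstract's finding.

var_p  = 0.05   # persons (universe-score) variance
var_i  = 0.46   # items effect (46% of total, per the abstract)
var_pi = 0.49   # person x item interaction + residual error

def phi(n_items: int) -> float:
    """Dependability of a mean score over n_items items."""
    return var_p / (var_p + (var_i + var_pi) / n_items)

for n in (10, 20, 40):
    print(f"{n} items: phi = {phi(n):.2f}")
# Small persons variance plus a large items effect keeps phi low
# even for fairly long tests.
```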


2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Inga Hege ◽  
Isabel Kiesewetter ◽  
Martin Adler

Abstract Background The ability to compose a concise summary statement about a patient is a good indicator of the clinical reasoning abilities of healthcare students. To assess such summary statements manually, a rubric based on five categories (use of semantic qualifiers, narrowing, transformation, accuracy, and global rating) has been published. Our aim was to explore whether computer-based methods can be applied to automatically assess summary statements composed by learners in virtual patient scenarios, based on the available rubric, in real time, to serve as a basis for immediate feedback to learners. Methods We randomly selected 125 summary statements in German and English composed by learners in five different virtual patient scenarios. We manually rated these statements based on the rubric plus an additional category for the use of the virtual patient's name. We then implemented a natural language processing approach, in combination with our own algorithm, to automatically assess the same 125 statements, and compared the results of the manual and automatic ratings in each category. Results We found moderate agreement between the manual and automatic ratings in most of the categories. However, further analysis and development are needed, especially for a more reliable assessment of factual accuracy and for the identification of patient names in the German statements. Conclusions Despite these areas for improvement, we believe that our results justify a careful display of the computer-calculated assessment scores as feedback to learners. It will be important to emphasize that the rating is an approximation and to give learners the possibility to contest supposedly incorrect assessments, which will also help us further improve the rating algorithms.
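
The abstract does not specify the algorithm; as a loose sketch of just two rubric checks (semantic qualifiers and patient-name use), a lexicon-based approach might look like the following, where the lexicon, the example name, and the scoring are all assumptions rather than the authors' method:

```python
# A naive, lexicon-based sketch of two rubric checks: semantic
# qualifiers and patient-name use. The qualifier lexicon, the name,
# and the scoring are assumptions, not the authors' algorithm.

SEMANTIC_QUALIFIERS = {"acute", "chronic", "sudden", "gradual",
                       "unilateral", "bilateral", "recurrent"}

def check_statement(statement: str, patient_name: str) -> dict:
    tokens = {t.strip(".,;").lower() for t in statement.split()}
    return {
        "uses_qualifiers": bool(tokens & SEMANTIC_QUALIFIERS),
        "uses_patient_name": patient_name.lower() in tokens,
    }

print(check_statement(
    "Anna presents with acute, unilateral chest pain.", "Anna"))
# -> {'uses_qualifiers': True, 'uses_patient_name': True}
```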


2000 ◽  
Vol 26 (4) ◽  
pp. 813-835 ◽  
Author(s):  
Winfred Arthur ◽  
David J. Woehr ◽  
Robyn Maldegen

This study notes that the lack of convergent and discriminant validity of assessment center ratings, in the presence of content-related and criterion-related validity, is paradoxical within a unitarian framework of validity. It also empirically demonstrates an application of generalizability theory to examining the convergent and discriminant validity of assessment center dimension ratings. Generalizability analyses indicated that person, dimension, and person by dimension effects contribute large proportions of the total variance in assessment center ratings. In contrast, exercise, rater, person by exercise, and dimension by exercise effects are shown to contribute little to the total variance. Correlational and confirmatory factor analysis results were consistent with the generalizability results. This provides strong evidence for the convergent and discriminant validity of the assessment center dimension ratings, a finding consistent with the conceptual underpinnings of the unitarian view of validity and inconsistent with previously reported results. Implications for future research and practice are discussed.


Author(s):  
Felix D. Schönbrodt ◽  
Caroline Zygar-Hoffmann ◽  
Steffen Nestler ◽  
Sebastian Pusch ◽  
Birk Hagemeyer

Abstract The investigation of within-person process models, often done in experience sampling designs, requires a reliable assessment of within-person change. In this paper, we focus on dyadic intensive longitudinal designs where both partners of a couple are assessed multiple times each day across several days. We introduce a statistical model for variance decomposition based on generalizability theory (extending P. E. Shrout & S. P. Lane, 2012), which can estimate the relative proportion of variability at four hierarchical levels: moments within a day, days, persons, and couples. Based on these variance estimates, four reliability coefficients are derived: between-couples, between-persons, within-persons/between-days, and within-persons/between-moments. We apply the model to two dyadic intensive experience sampling studies (n1 = 130 persons, 5 surveys each day for 14 days, ≥ 7508 unique surveys; n2 = 508 persons, 5 surveys each day for 28 days, ≥ 47764 unique surveys). Five scales in the domain of motivational processes and relationship quality were assessed with 2 to 5 items each: state relationship satisfaction, communal motivation, and agentic motivation; the latter consists of two subscales, namely power and independence motivation. The largest variance components were at the levels of persons, moments, couples, and days, and within-day variance was generally larger than between-day variance. Reliabilities ranged from .32 to .76 (couple level), .93 to .98 (person level), .61 to .88 (day level), and .28 to .72 (moment level). Scale intercorrelations reveal differential structures between and within persons, which has consequences for theory building and statistical modeling.
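
The four reliability coefficients follow the usual generalizability pattern at each level: variance at that level divided by that variance plus lower-level variance averaged over the sampling design. A schematic sketch with invented components and design sizes (not the studies' estimates):

```python
# Schematic reliability coefficients for a four-level design
# (couples > persons > days > moments), in the spirit of the
# Shrout & Lane (2012) extension described above. All variance
# components and design sizes are invented.

var = {"couple": 0.05, "person": 0.35, "day": 0.10,
       "moment": 0.20, "error": 0.30}
n_days, n_moments, n_items = 14, 5, 3

# Between-persons: person-level variance against all lower-level
# variance, averaged over a person's full sampling design.
r_person = var["person"] / (
    var["person"]
    + var["day"] / n_days
    + var["moment"] / (n_days * n_moments)
    + var["error"] / (n_days * n_moments * n_items)
)

# Within-persons/between-moments: moment-level variance against
# item-level error averaged over the items answered at one moment.
r_moment = var["moment"] / (var["moment"] + var["error"] / n_items)

print(f"between-persons: {r_person:.2f}")   # ~0.97 with these numbers
print(f"between-moments: {r_moment:.2f}")   # ~0.67 with these numbers
```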


2019 ◽  
Author(s):  
Khawater Fahad Alshalan

This study investigates how frequently Halliday and Hasan's (1976) cohesive devices were used, as well as their relationship with writing quality, in essays by 100 Saudi EFL undergraduate students at Al Imam Muhammed Ibn Saud Islamic University, Riyadh, Saudi Arabia. It uses a mixed-methods approach in which the students' essays were analyzed using systemic functional linguistics (SFL) in terms of the textual metafunction of cohesive devices. The five types of cohesive devices are lexical cohesion, reference, conjunction, substitution, and ellipsis; each of their subcategories was also analyzed in the students' texts. NVivo qualitative data analysis software and corpus analysis (conducted using AntConc) were used to calculate the frequency of each cohesive device found in the data. The IELTS writing assessment scale was used to evaluate the students' writing scores. The results show that the most frequently used device was lexical cohesion, specifically repetition: Saudi EFL undergraduate students tended to stay focused on the central idea of the topic through repetition. Furthermore, Pearson's correlation coefficient revealed relationships between the students' writing scores and the length of their essays, between the use of cohesive ties and the scores, and between cohesive ties and essay length. This study recommends that EFL teachers introduce Saudi EFL students to a range of cohesive devices in order to help them improve their writing skills and connect their ideas smoothly.
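
The frequency counts in the study came from NVivo and AntConc; purely as an assumed, simplified stand-in, a lexicon-based counter might look like this (the device lexicon is a toy subset of Halliday and Hasan's categories):

```python
# A toy frequency counter for cohesive devices. The study used NVivo
# and AntConc; this lexicon is a tiny illustrative subset of
# Halliday and Hasan's (1976) categories, not a full implementation.

from collections import Counter

DEVICES = {
    "reference": {"he", "she", "it", "they", "this", "that"},
    "conjunction": {"and", "but", "so", "because", "however"},
}

def count_devices(text: str) -> Counter:
    tokens = [t.strip(".,;!?").lower() for t in text.split()]
    counts = Counter()
    for category, lexicon in DEVICES.items():
        counts[category] = sum(t in lexicon for t in tokens)
    return counts

print(count_devices("The topic is hard, but she explains it well."))
# -> Counter({'reference': 2, 'conjunction': 1})
```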


Mindfulness ◽  
2020 ◽  
Author(s):  
Oleg N. Medvedev ◽  
Anastasia T. Dailianis ◽  
Yoon-Suk Hwang ◽  
Christian U. Krägeloh ◽  
Nirbhay N. Singh

2013 ◽  
Vol 30 (1) ◽  
pp. 59-69 ◽  
Author(s):  
Elizabeth A. Holbrook ◽  
Minsoo Kang ◽  
Don W. Morgan

As a first step toward the development of adapted physical activity (PA) programs for adults with visual impairment (VI), the purpose of this study was to determine the time frame needed to reliably estimate weekly PA in adults with VI. Thirty-three adults with VI completed 7 days of pedometer-based PA assessment. Generalizability theory analyses were conducted to quantify sources of variance within the PA estimate and to determine the number of days of PA monitoring needed for the total sample and for participants with mild-to-moderate and severe VI. A single-facet, crossed design was employed, including participants and days. Participants and days accounted for 33–55% and 0–3% of the total variance in PA, respectively. While a reliable account of PA was obtained for the total sample over a 6-day period, shorter (4-day) and longer (9-day) periods were required for persons with mild-to-moderate and severe VI, respectively.
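
The required monitoring duration follows from projecting a dependability coefficient over increasing numbers of days in the persons × days design; a sketch with placeholder variance components (not the study's estimates):

```python
# Projecting how many days of pedometer monitoring are needed for a
# dependable (phi >= 0.80) weekly PA estimate in a persons x days
# design. Variance components are placeholders, not the study's.

var_person = 0.40   # between-person variance
var_error  = 0.60   # days effect + person x day residual

def phi(n_days: int) -> float:
    return var_person / (var_person + var_error / n_days)

n = 1
while phi(n) < 0.80:
    n += 1
print(f"days needed: {n} (phi = {phi(n):.2f})")  # -> 6 days here
```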


1993 ◽  
Vol 18 (2) ◽  
pp. 197-206 ◽  
Author(s):  
George A. Marcoulides

Generalizability theory provides a framework for examining the dependability of behavioral measurements. When designing generalizability studies, two important statistical issues are generally considered: power and measurement error. Control over power and error of measurement can be obtained by manipulation of sample size and/or test reliability. In generalizability theory, the mean error variance is an estimate that takes into account both these statistical issues. When limited resources are available, determining an optimal measurement design is not a simple task. This article presents a methodology for minimizing mean error variance in generalizability studies when resource constraints are imposed.
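
The methodology can be caricatured as a constrained search: enumerate feasible designs under the budget and keep the one with the smallest mean error variance. The toy sketch below uses invented costs and variance components, not Marcoulides' worked example:

```python
# Toy version of choosing a measurement design under a budget:
# enumerate (raters, occasions) combinations, keep feasible ones,
# and minimize the mean error variance. All numbers are invented
# placeholders, not Marcoulides' worked example.

var_pr, var_po, var_pro = 0.30, 0.10, 0.20   # interaction/error components
cost_rater, cost_occasion, budget = 50, 20, 300

def mean_error_variance(n_r: int, n_o: int) -> float:
    return var_pr / n_r + var_po / n_o + var_pro / (n_r * n_o)

best = min(
    ((r, o) for r in range(1, 7) for o in range(1, 7)
     if r * cost_rater + o * cost_occasion <= budget),
    key=lambda d: mean_error_variance(*d),
)
print(best, round(mean_error_variance(*best), 3))  # -> (4, 5) 0.105
```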

