Not Just Generalizability: A Case for Multifaceted Latent Trait Models in Teacher Observation Systems

Teacher evaluation systems often include classroom observations in which raters use rating scales to evaluate teachers’ effectiveness. Recently, researchers have promoted the use of multifaceted approaches to investigating reliability using Generalizability theory, instead of rater reliability statistics. Generalizability theory allows analysts to quantify the contribution of multiple sources of variance (e.g., raters and tasks) to measurement error. We used data from a teacher evaluation system to illustrate another multifaceted approach that provides additional indicators of the quality of observational systems. We show how analysts can use Many-Facet Rasch models to identify and control for differences in rater severity, identify idiosyncratic ratings associated with various facets, and evaluate rating scale functioning. We discuss implications for research and practice in teacher evaluation.

Download Full-text

Hostility Reduction and Performance

Psychological Reports ◽

10.2466/pr0.1969.25.2.503 ◽

1969 ◽

Vol 25 (2) ◽

pp. 503-512 ◽

Cited By ~ 6

Author(s):

Morton Goldman ◽

Jonathan W. Keck ◽

Charles J. O'Leary

Keyword(s):

Teacher Evaluation ◽

Task Performance ◽

Rating Scale ◽

Control Treatment ◽

Classroom Performance ◽

Evaluation Form ◽

And Performance ◽

And Control

The study was concerned with the adequacy of several methods for reducing or preventing hostility toward a frustrating teacher and examined whether classroom performance was affected. Two cathartic methods, Rating Scale and Mutual Expression, and two non-cathartic methods, Explanation and Control, were induced. Residual hostility toward the teacher was measured by means of a Teacher Evaluation Form. Results showed that the Explanation method was most effective and the two cathartic methods were least effective in preventing or reducing residual hostility. The two cathartic methods actually increased residual hostility as compared to the Control treatment. Task performance efficiency varied directly with the level of residual hostility. Doubt is cast upon the catharsis hypothesis and a relationship between residual hostility and performance was found.

Download Full-text

Keeping Great Teachers: A Case Study on the Impact and Implementation of a Pilot Teacher Evaluation System

Educational Policy ◽

10.1177/0895904816637685 ◽

2016 ◽

Vol 32 (3) ◽

pp. 363-394 ◽

Cited By ~ 5

Author(s):

Claire Robertson-Kraft ◽

Rosaline S. Zhang

Keyword(s):

Teacher Evaluation ◽

Teacher Retention ◽

Evaluation System ◽

School Level ◽

Evaluation Systems ◽

Teacher Survey ◽

Teacher Evaluation Systems ◽

The Individual ◽

The Impact

A growing body of research examines the impact of recent teacher evaluation systems; however, we have limited knowledge on how these systems influence teacher retention. This study uses a mixed-methods design to examine teacher retention patterns during the pilot year of an evaluation system in an urban school district in Texas. We used difference-in-differences analysis to examine the impact of the new system on school-level teacher turnover and administered a teacher survey ( N = 1,301) to investigate individual and school-level factors influencing retention. This quantitative analysis was supplemented with interview data from two case study schools. Results suggest that, overall, the new evaluation system did not have a significant effect on teacher retention, but there was significant variation at the individual and school level. This study has important implications for policymakers developing new evaluation systems and researchers interested in evaluating their impact on retention.

Download Full-text

Better integrating summative and formative goals in the design of next generation teacher evaluation systems

Education Policy Analysis Archives ◽

10.14507/epaa.28.5024 ◽

2020 ◽

Vol 28 ◽

pp. 63

Author(s):

Timothy G. Ford ◽

Kim Hewitt

Keyword(s):

Teacher Evaluation ◽

Professional Growth ◽

Evaluation System ◽

Self Determination Theory ◽

Next Generation ◽

Policy And Practice ◽

Evaluation Policy ◽

Evaluation Systems ◽

Teacher Evaluation Policy ◽

Teacher Evaluation Systems

In current teacher evaluation systems, the two main purposes of evaluation—accountability/goal accomplishment (summative) and professional growth/improvement (formative)—are often at odds with one another. However, they are not only compatible, but linking them within a unified teacher evaluation system may, in fact, be desirable. The challenge of the next generation of teacher evaluation systems will be to better integrate these two purposes in policy and practice. In this paper, we integrate the frameworks of Self-determination theory and Stronge’s Improvement-Oriented Model for Performance Evaluation. We use this integrated framework to critically examine teacher evaluation policy in Hawaii and Washington, D.C.—two distinctly different approaches to teacher evaluation—for the purposes of identifying a set of clear recommendations for improving the design and implementation of teacher evaluation policy moving forward.

Download Full-text

An Analysis of Principal Perceptions of the Primary Teaching Evaluation System Used in Eight U.S. States

International Journal of Education Policy and Leadership ◽

10.22230/ijepl.2017v12n5a773 ◽

2017 ◽

Vol 12 (5) ◽

Cited By ~ 3

Author(s):

Richard L. Dodson

Keyword(s):

Teacher Evaluation ◽

Evaluation System ◽

Online Survey ◽

Teacher Evaluations ◽

Instructional Program ◽

Evaluation Instrument ◽

Public School Principals ◽

Evaluation Systems ◽

New Instrument ◽

Teacher Evaluation Systems

This research examines how public school principals in eight U.S. states perceive their teacher evaluation systems which are based on Charlotte Danielson’s Framework for Teaching (FfT). States were selected to represent high, middle, and low scorers in the annual Education Week “Quality Counts” report (Education Week, 2016). 1,142 out of over 8,100 working principals in the eight states responded to an online survey, yielding a response rate of over 14%. Most principals were not satisfied with FfT and found implementing the system too cumbersome. Responses suggested an average of two changes to FfT desired by each principal; few wanted to keep their FfT as is. Targets for improvement included overhauling software used to enter teacher evaluations; eliminating student growth goals and student test scores (VAMs) as part of evaluations; reducing the time and paperwork required; and wanting more training for administrators and teachers on the use of FfT. Some states’ principals wanted to return control over teacher evaluation systems to local school districts. Most respondents agreed that their version of FfT has improved their school’s instructional program, and they prefer the new instrument over their previous evaluation instrument.

Download Full-text

Declining Morale, Diminishing Autonomy, and Decreasing Value: Principal Reflections on a High-Stakes Teacher Evaluation System

International Journal of Education Policy and Leadership ◽

10.22230/ijepl.2018v13n8a813 ◽

2018 ◽

Vol 13 (8) ◽

Cited By ~ 3

Author(s):

Noelle A Paufler

Keyword(s):

Student Achievement ◽

Teacher Evaluation ◽

Evaluation System ◽

Negative Impact ◽

Value Added ◽

Unintended Consequences ◽

Evaluation Systems ◽

State And Local ◽

Teacher Evaluation Systems ◽

District Culture

Since the adoption of teacher evaluation systems that rely, at least in part, on controversial student achievement measures, little research has been conducted that focuses on stakeholders’ perceptions of systems in practice, specifically the perceptions of school principals. This study was conducted in a large urban school district to better understand principals’ perceptions of evaluating teachers based on professional and instructional practices as well as student achievement (i.e., value-added scores). Principals in this study strongly expressed concerns regarding: (a) the negative impact of the teacher evaluation system on district culture and morale; (b) their lack of autonomy in evaluating teachers and making staffing decisions; and (c) their perceived lack of value as professionals in the district. Examining the implications of teacher evaluation systems, per the experiences of principals as practitioners, is increasingly important if state and local policymakers as well as the general public are to better understand the intended and unintended consequences of these systems in practice.

Download Full-text

Enacting the Rubric: Teacher Improvements in Windows of High-Stakes Observation

Education Finance and Policy ◽

10.1162/edfp_a_00295 ◽

2019 ◽

pp. 1-51

Author(s):

Aaron R. Phipps ◽

Emily A. Wiseman

Keyword(s):

Teacher Evaluation ◽

Evaluation System ◽

Teacher Practices ◽

Causal Link ◽

High Stakes ◽

Evaluation Program ◽

Evaluation Systems ◽

Post Evaluation ◽

Teacher Evaluation Systems ◽

Teacher Evaluation Program

Teacher evaluation systems that use in-class observations, particularly in high-stakes settings, are frequently understood as accountability systems intended as non-intrusive measures of teacher quality. Presumably, the evaluation system motivates teachers to improve their practice – an accountability mechanism – and provides actionable feedback for improvement – an information mechanism. No evidence exists, however, establishing the causal link between an evaluation program and daily teacher practices. Importantly, it is unknown how teachers may modify their practice in the time leading up to an unannounced in-class observation, or how they integrate feedback into their practice post-evaluation, a question that fundamentally changes the design and philosophy of teacher evaluation programs. We disentangle these two effects with a unique empirical strategy that exploits random variation in the timing of in-class observations in the Washington, D.C. teacher evaluation program IMPACT. Our key finding is that teachers work to improve during periods in which they are more likely to be observed, and they improve with subsequent evaluations. We interpret this as evidence that both mechanisms are at work, and as a result, policymakers should seriously consider both when designing teacher evaluation systems.

Download Full-text

Using Generalizability Theory in the Evaluation of L2 Writing 一般化可能性理論を用いた高校生の自由英作文評価の検討

JALT Journal - JALT Journal 24.1 ◽

10.37546/jaltjj27.2-2 ◽

2005 ◽

Vol 27 (2) ◽

pp. 169 ◽

Cited By ~ 6

Keyword(s):

High School ◽

High School Students ◽

Variance Components ◽

Rating Scales ◽

Rating Scale ◽

Generalizability Theory ◽

Composition Profile ◽

L2 Writing ◽

Sufficient Information ◽

School Students

This paper aims to investigate the characteristics of the evaluation of L2 writing—particularly free English compositions by Japanese high school students—using Generalizability Theory (G theory). Although usually considered to be a difficult topic to examine, the evaluation of free compositions can be thoroughly investigated by using G theory. It enables researchers to provide sufficient information regarding the main effects and the interactions of complicated factors within an evaluation by examining its measurement errors. I focused on two factors (more specifically, facets) in order to obtain the data on the evaluation of free compositions. These facets were: (a) the raters—10 high school teachers (expert raters) teaching English at a national high school and two public high schools, and six university students (novice raters) studying English language education at a national university; and (b) the rating scales, which were Jacobs, Zinkgraf, Wormuth, Hartfiel, and Hughey’s (1981) ESL Composition Profile, and a modified version of Kantenbetsu Hyoka of the National Institute for Educational Policy Research (2002). Using these scales, the raters (expert and novice raters) evaluated free compositions written by 20 high school students studying at a national high school in the Chugoku region of Japan. The type of G theory design used in this paper is termed a two-facet crossed design (all the raters evaluate all the compositions using all the items of the rating scales). Studies using G theory are usually comprised of two substudies: a Generalizability Study (G study) and a Decision Study (D study). A G study investigates the manner in which the facets and their interactions (termed as sources of variance) affected the evaluation results by estimating the magnitude of variance components. A D study investigates the degree of reliability of the evaluation by examining generalizability coefficients, which correspond to classical test theory’s reliability coefficients, using simulations that vary the number of raters or items of the rating scales. The G study in this paper dealt with seven sources of variance—persons (p), raters (r), rating scale items (i), and their interactions (p x r, p x i, r x i, and p x r x i). The D study in this paper particularly focused on varying the number of raters for simulations. Several observations resulting from both the G study and the D study were as follows: (a) there was a halo effect tendency in the evaluations by the expert raters because the estimated variance components of the interactions of the sources of variance p x r and r x i were large; (b) the novice raters’ rating experience was insufficient to perform reliable evaluations because the generalizability coefficients of both of the rating scales were low, while the estimated variance component of the interaction of the sources of variance p x r x i, which is regarded as unmeasured error, was large; and (c) the ESL Composition Profile was a more reliable rating scale than the Kantenbetsu Hyoka as shown by the D study simulation results. This paper presentsseveral pedagogical implications based on the results with reference to improvement in the evaluation of free compositions. In particular, I have presented possible methods of diagnostically utilizing the results of G theory to develop and modify the rating scales, and to train the raters.

Download Full-text

The Consistency of Composite Ratings of Teacher Effectiveness: Evidence From New Mexico

American Educational Research Journal ◽

10.3102/0002831219841369 ◽

2019 ◽

Vol 56 (6) ◽

pp. 2116-2146 ◽

Cited By ~ 1

Author(s):

Sy Doan ◽

Jonathan D. Schweig ◽

Kata Mihaly

Keyword(s):

Teacher Evaluation ◽

Teacher Effectiveness ◽

Teacher Quality ◽

Evaluation System ◽

High Stakes ◽

Evaluation Systems ◽

Composite Rating ◽

Teacher Evaluation Systems ◽

Measures Of Performance ◽

Original Rating

Contemporary teacher evaluation systems use multiple measures of performance to construct ratings of teacher quality. While the properties of constituent measures have been studied, little is known about whether composite ratings themselves are sufficiently reliable to support high-stakes decision making. We address this gap by estimating the consistency of composite ratings of teacher quality from New Mexico’s teacher evaluation system from 2015 to 2016. We estimate that roughly 40% of teachers would receive a different composite rating if reevaluated in the same year; 97% of teachers would receive ratings within ±1 level of their original rating. We discuss mechanisms by which policymakers can improve rating consistency, and the implications of those changes to other properties of teacher evaluation systems.

Download Full-text

Teacher Performance Appraisal Regulation: A Policy Case Analysis

NASSP Bulletin ◽

10.1177/0192636520911197 ◽

2020 ◽

Vol 104 (1) ◽

pp. 20-33

Author(s):

Ed Dandalt ◽

Stephane Brutus

Keyword(s):

Teacher Evaluation ◽

Performance Appraisal ◽

Evaluation System ◽

Teacher Performance ◽

Evaluation Systems ◽

Evaluation Practices ◽

Empirical Inquiry ◽

Technical Requirements ◽

Teacher Evaluation Systems ◽

Teachers Perspectives

This article uses an analysis of the language used in the Teacher Performance Appraisal Technical Requirements Manual in Ontario to highlight some procedural issues. Arguably, the existence of flaws in the teacher evaluation system is not only limited to evaluation practices but is also embedded in evaluation regulations. Furthermore, the article provides a novel example of how a study of teacher evaluation systems can go beyond teachers’ perspectives of evaluation practices and can also consider teacher evaluation regulations as a source of empirical inquiry and a form of knowledge.

Download Full-text

An Analysis of Critical Issues in Korean Teacher Evaluation Systems

Center for Educational Policy Studies Journal ◽

10.26529/cepsj.93 ◽

2016 ◽

Vol 6 (2) ◽

pp. 151-171

Author(s):

Hee Jun Choi ◽

Ji-Hye Park

Keyword(s):

Teacher Evaluation ◽

Evaluation System ◽

Teacher Performance ◽

Analytical Framework ◽

Systematic Analysis ◽

Efficient System ◽

Evaluation Systems ◽

Advantages And Disadvantages ◽

Performance Based Pay ◽

Teacher Evaluation Systems

Korea has used three different teacher evaluation systems since the 1960s: teacher performance rating, teacher performance-based pay and teacher evaluation for professional development. A number of studies have focused on an analysis of each evaluation system in terms of its advent, development, advantages and disadvantages, but these studies have beencritically limited in that they have focused only on the partial integration of the three current teacher evaluation systems, without addressing the problems embedded in each of them. The present study provides a systematic analysis of the three current Korean teacher evaluation systems based on a sound analytical framework and proposes appropriate directions for designing an effective and efficient system. It is found that the three systems share commonalities in terms of stakeholders, evaluators, scope, criteria and methods, further supporting the rationale for developing a single comprehensive teacher evaluation system in Korea. Finally, several steps to establish a comprehensive teacher evaluation system based on the analysis results are suggested.

Download Full-text