Classroom observation is an important component of teacher evaluation systems. Most states are implementing systems that assign a composite score to each teacher based on weights assigned to several different measures. Policy discussions often address this weighting, with many states adopting formulas with high weights for the summative scores from observations conducted by school principals or other administrators. Given the heavy weighting of this one measure, it is important to ensure the validity of observation rubrics and the equitability of the resulting teacher rankings. In this paper, we address the problem of observation scores being affected by characteristics of the students in the class being taught. We explore this in two phases. First, we demonstrate an alternative to the common (often implicit) assumption that the components or elements of the observation score measure a single underlying concept and all have the same relevance to any personnel decision that is to be based on the evaluation score. Second, we show how the multifaceted nature of observations can be used to better understand how observation scores are affected by class characteristics. Most observation rubrics in wide use, such as the Framework for Teaching (FFT), have been designed and are used as universal instruments. They are applied without any modifications in classrooms at different grade levels, in different subjects, and with students of widely different abilities, backgrounds, and resources. This implicit assumption of instrument invariance is, however, questionable. Furthermore, the nature of the invariance may be different for different components of the instrument. The goal of the analyses reported here is to provide a stronger basis for making observations a useful part of teacher evaluation by addressing these facets of variability.
Several recent studies have pointed to problems with the application of observation instruments in the context of teacher evaluation, in particular significant correlations between teachers’ observation scores and characteristics of the classes they teach. Using the data collected by the Measures of Effective Teaching (MET) project, Mihaly and McCaffrey (2014) reported negative correlations between teachers’ observation scores and grade level. They formulated several testable hypotheses concerning the causes of this but found empirical support for none of them. Lazarev and Newman (2013), using the same dataset, showed that relationships between observation and value-added scores vary by grade and subject. For example, observation items related to classroom management tend to be linearly related to value-added in elementary school, but the relationship becomes non-linear in middle range of observation scores being correlated to value-added only for lower-performing teachers. While the above-mentioned studies point to problems with the vertical alignment of observation scores, two recent studies that used data from local teacher evaluation systems elucidate issues with the use of an observation instrument within a single cohort. In particular, Whitehurst, Chingos, and Lindquist (2014) report a positive association between the teacher’s average observation score and the class-average pretest score, while Chaplin, Gill, Thompkins, and Miller (2014) report negative correlations between the score and the class shares of minority and free-lunch-eligible students. While the nature of these relationships remains unclear, these results can be interpreted as suggesting that teachers may benefit unfairly from being assigned a more able group of students. Observation scores therefore could be adjusted for the disparity in class characteristics to produce more robust results.
Whitehurst et al. (2014) show that adjusting the observation scores for class characteristics reduces what they term “observation bias,” i.e., this operation reduces the differences in average observation scores between quintiles of the classroom distribution of pretest scores. As a policy suggestion, however, such an adjustment may be inappropriate if teacher assignment is not random. If less proficient teachers are assigned to classes made up of lower-performing students, or if schools serving low-income communities are less successful in retaining effective teachers, then such an adjustment would undermine the validity of an evaluation system by obscuring the real differences among teachers. Rigorous statistical correction for non-random teacher-class matching could be technically challenging, and possibly not feasible at all, because it would require collecting data beyond the scope of a teacher evaluation system. It is also possible that the observed empirical regularities result from a measurement problem. In pre-certification training courses, observers encounter a relatively small number of cases used in observer calibration exercises, which are typically conducted in person or with video-recorded lessons used as examples of teaching practice. Adapting the underlying meaning of instrument categories to the specifics of various classrooms may require more experience than can be obtained in the course of a single academic study or in one or two rounds of annual observation for evaluation purposes.
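To make the idea of adjusting observation scores for class characteristics concrete, the following is a minimal sketch on synthetic data. It is not the procedure used by Whitehurst et al. (2014); it assumes a simple linear regression of the observation score on the class-average pretest score, and all names (`adjust`, `quintile_gap`) and numeric values are hypothetical choices made for illustration only.

```python
import numpy as np

# Synthetic, illustrative data only: none of these numbers come from any study.
rng = np.random.default_rng(0)
n_teachers = 500
pretest = rng.normal(0.0, 1.0, n_teachers)  # class-average pretest score
# Assumed data-generating process: scores rise with class-average pretest.
raw_score = 3.0 + 0.3 * pretest + rng.normal(0.0, 0.5, n_teachers)


def adjust(score, covariate):
    """Remove the linear association with a class characteristic.

    Fits score ~ intercept + covariate by ordinary least squares and
    returns the residuals shifted back to the original mean.
    """
    design = np.column_stack([np.ones_like(covariate), covariate])
    beta, *_ = np.linalg.lstsq(design, score, rcond=None)
    return score - design @ beta + score.mean()


def quintile_gap(score, covariate):
    """Mean score in the top covariate quintile minus the bottom quintile."""
    low_cut, high_cut = np.quantile(covariate, [0.2, 0.8])
    bottom = score[covariate <= low_cut].mean()
    top = score[covariate >= high_cut].mean()
    return top - bottom


adjusted_score = adjust(raw_score, pretest)
raw_gap = quintile_gap(raw_score, pretest)
adjusted_gap = quintile_gap(adjusted_score, pretest)

print(f"top-bottom quintile gap, raw scores:      {raw_gap:.3f}")
print(f"top-bottom quintile gap, adjusted scores: {adjusted_gap:.3f}")
```

In this sketch the adjustment shrinks the gap between pretest quintiles by construction. It cannot tell whether that gap reflected rater bias or real differences among teachers, which is the concern raised above when teacher assignment is not random.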