Background/Context Teacher evaluation is a major policy initiative intended to improve the quality of classroom instruction. This study documents a fundamental challenge to using teacher evaluation to improve teaching and learning.
Purpose Using an observation instrument (CLASS-S), we evaluate evidence on different aspects of instructional practice in algebra classrooms to consider how much scores vary, how well observers are able to judge practice, and how well teachers are able to evaluate their own practice.
Participants The study includes 82 Algebra I teachers in middle and high schools. Five observers completed almost all observations.
Research Design Each classroom was observed 4–5 times over the school year. Each observation was coded and scored live and by video. All videos were coded by two independent observers, as were 36% of the live observations. Observers assigned scores to each of 10 dimensions. Observer scores were also compared with those of master coders for a subset of videos. Participating teachers also completed a self-report instrument (CLASS-T) to assess their own skills on dimensions of CLASS-S.
Data Collection and Analysis For each lesson, data were aggregated into three domain scores, Emotional Support, Classroom Organization, and Instructional Support, and then averaged across lessons to create scores for each classroom.
Findings/Results Classroom Organization scores fell in the high range of the protocol. Scores for Emotional Support were in the midlevel range, and the lowest scores were for Instructional Support. Scores for each domain were clustered in narrow ranges. Observers were more consistent over time and agreed more when judging Classroom Organization than the other two domains. Teacher ratings of their own strengths and weaknesses were positively related to observation scores for Classroom Organization and unrelated to observation scores for Instructional Support.
Conclusions/Recommendations This study identifies a critical challenge for teacher evaluation policy if it is to improve teaching and learning. Aspects of teaching and learning in the observation protocol that appear most in need of improvement are those that are the hardest for observers to agree on and that teachers and external observers view most differently. Reliability is a marker of common understanding about important constructs, and observation protocols are intended to provide a common language and structure to inform teaching practice. This study suggests the need to focus our efforts on the instructional and interactional aspects of classrooms through shared conversations and clear images of what teaching quality looks like.