The procedures used to produce a research-based teaching evaluation system, which contains low-inference indicators of effective and ineffective teacher behavior, included instrument development, content validation, and field testing. An extensive reliability study produced estimates for three types of consistency: intercoder agreement (r = 0.85), stability across teaching situations (r = 0.86), and discriminant reliability among teachers (r = 0.79). The norming component of the study, conducted in 45 schools, generated over 1200 observations of current teacher practice at all grade levels and in various subject areas. The results indicated that these behaviors are substantially generic; only grade level and instructional method (interactive, lecture, and independent seatwork) produced meaningful differences among groups of teachers. In addition, the average teacher used most of the effective behaviors (70%) and few of the ineffective behaviors (8%) during observation. These results support the instrument's value in teacher training, remediation, and evaluation.