Why do we assign numerical ratings when assessing complex performances? And what is the meaning and usefulness of those ratings given the nature of these performances and the multiplicity of assessment purposes? These were the key questions we grappled with in developing an assessment system to be used for national certification of accomplished teachers. Along the way, our work became entangled with philosophical, political, technical, and practical problems that led us into relatively uncharted territories. We worked with definitions of accomplished teaching, for example, as a domain of assessment in a profession influenced by competing and often radically different ideologies. Our goal was the creation of an assessment procedure to support certification decisions, to change teaching practice, and to evaluate complex performances in professionally, technically, and administratively acceptable ways. In the process, we uncovered some of the tensions that exist between the various stakeholders involved and were provided with opportunities to examine the broader context of assessment and its instrumental nature. This work also led us to examine some key assumptions that are held in the field of measurement and their implications for the multiple assessment purposes that we had to consider.

The system we were developing was to be used by the National Board for Professional Teaching Standards (NBPTS) for the Early Adolescence / English Language Arts (EA/ELA) certification of exemplary teachers. The assessment included two components: (1) a portfolio, for which teachers documented three different teaching activities over several months, and (2) an assessment center, for which teachers participated in a number of tasks, including semistructured interviews, analyses of teaching, and essays on instructional issues. While the particulars of the assessment are not critical here and can be found elsewhere (Delandshere & Petrosky, 1993, 1994), it is important to understand its complexity. For example, just one of the portfolio tasks—the Post-Reading Interpretive Discussion Exercise (PRIDE)—required teachers to conduct and videotape a 20-minute interpretive discussion of a literature selection, to write a 3- to 10-page commentary analyzing the discussion and their understanding of interpretation, and to include the instructional artifacts used or referred to in the videotape or the commentary.

Overall, the candidates' performances involved a range of evidence: (1) sets of responses to tasks that required extensive written commentaries, lengthy videotape segments of their teaching, and videotaped oral interviews with candidates; (2) documents that were produced by the candidates and their students and documents acquired from other sources (e.g., books, instructional materials); and (3) a process through which candidates could integrate and reflect on these different perspectives, which resulted in another written document.

The assumptions we worked from were mostly related to the technical aspects of assessment and the necessity for evidence of reliability and validity. These assumptions were based on our prior experience and grounded in the measurement tradition. For most of this century, educational achievement or the status of an individual's knowledge has been judged through measurement—that is, by the assigning of numbers to test responses. The resulting scores are used to make value judgments about the quality of performances.
After working for several years to develop evaluation schemes, we considered an alternative to the practice of assigning numerical ratings, which was to formulate judgments based directly on the characteristics of the performance. Such an alternative may be unnecessary when there is a one-to-one correspondence between the assignment of points and the number of correct responses, but the complexity and breadth of responses for this assessment appeared to defy such correspondence. To further complicate the matter, the tasks developed for this assessment were grounded in a professional ideology that values knowledge as individually and socially constructed and as reflected in particular discourses and contexts. This conception of performance is quite different from those implied in many assessment contexts and seemed to require an evaluation scheme more consistent with this representation of knowledge than with traditional numerical scoring schemes. To this end, our procedure for judging used what we called interpretive summaries of performance, written records that document the salient characteristics of the performance and the judges' interpretations of those as evi-