The term Special Education (SE), as used in the United States, is currently reserved to describe a host of educational and related clinical activities for handicapped populations. As a service delivery system, SE is directed toward a wide array of mental, physical, and behavioral pupil characteristics. The administrative structures subsumed by the concept are so diverse that they defy generalization about salient defining criteria. Nevertheless, both policymakers and practitioners tend to refer to SE as if it were a unified, standardized, and replicable organizational and instructional delivery system. Similarly, the term Special Education Evaluation is frequently used to imply an attempt to draw conclusions about the SE enterprise in general, ignoring the variation that naturally exists among school districts in how SE is operationalized. It is common to refer to SE programs by the categorical labels of the pupils they are designed to serve. Hence, we find references to Learning Disabilities (LD) programs, Mental Retardation (MR) programs, and the like. Further confusion arises from the practice of identifying such programs by administrative variables while implicitly inferring instructional characteristics from them (e.g., Resource Room, Special Class, and Mainstreaming programs). This latter practice apparently ignores the overlap in instructional practices that exists both across and within administrative arrangements (Semmel, Gottlieb, & Robinson, 1979). It suffices to assert that there is currently no universal set of defining features that adequately describes the pupil characteristics and instructional interventions operationally included under the rubric of special education in the United States. Nor is there any currently accepted, empirically or theoretically based scheme for uniquely matching specific instructional activities to the unique characteristics of mentally and/or behaviorally handicapped pupils in the schools.
This paper discusses the problems inherent in attempting to evaluate the SE enterprise rather than specific, local arrangements for instructing handicapped children. It argues first that these problems arise from essentially uncritical assumptions about the relationship between rules of inference in experimental science and evaluation of activities generated by public policy. Moreover, these evaluation problems are viewed as being maintained by an invalid application of what might be termed an ed-psychometric scientific paradigm (Shapiro, 1984). Classical psychometric orientations tend to limit evaluation research to