Using generalizability theory as a guide, this study discusses statistical problems and strategies for analyzing rating data involving multiple raters, a type of data frequently encountered in social work evaluations. To disentangle raters' bias from clients' true change, the study shows the importance of examining the multifaceted structure of measurement error. To analyze data containing nonnegligible variability associated with raters, this study proposes a three-level hierarchical linear model. It demonstrates that the three-level model produces a better fit to the data, smaller sample residuals, and more accurate significance testing than the popular two-level model when analyzing rating data with nonnegligible rater influences.

Key words: generalizability; hierarchical linear model; longitudinal data; multiple raters; seriously emotionally disturbed children

Monitoring client progress and evaluating the effectiveness of a program intervention are of central interest to social work research. Studies with these objectives often involve longitudinal designs and the collection of subjective rating data. Analyzing such data, however, is not as straightforward as investigators may wish. In a longitudinal inquiry, research questions often address the change of client outcomes over time. Because change is inevitably a function of measurement, disentangling raters' bias from clients' true change is crucial to most data analyses. This article discusses statistical problems and strategies in analyzing one type of rating data, namely, subjective ratings made by multiple raters at two or more time points. By this definition, we exclude the methodologically less complicated case of self-administered data from consideration. Examples of multiple-rater data include teachers' and parents' ratings of children's behavioral functioning, caseworkers' and service consumers' ratings of a program outcome, and so forth. Longitudinal rating data are likely to be collected by more than one rater.
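The rater-variance problem described above can be made concrete with a small simulation. The sketch below is a hypothetical illustration, not the study's actual data or model: it generates ratings in a fully crossed clients x raters x times design, then recovers the rater and error variance components with a simple ANOVA-style (method-of-moments) decomposition. All sample sizes and variance values are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical fully crossed design: every rater rates every client at
# every time point.  All sizes and variances below are assumptions.
n_clients, n_raters, n_times = 200, 30, 2
sd_client, sd_rater, sd_error = 1.0, 0.8, 0.5

client_fx = rng.normal(0.0, sd_client, n_clients)  # clients' true levels
rater_fx = rng.normal(0.0, sd_rater, n_raters)     # raters' systematic bias
noise = rng.normal(0.0, sd_error, (n_clients, n_raters, n_times))

# Observed score = client effect + rater effect + residual error.
y = client_fx[:, None, None] + rater_fx[None, :, None] + noise

# ANOVA-style (method-of-moments) decomposition of observed-score variance.
grand = y.mean()
client_means = y.mean(axis=(1, 2))   # one mean per client
rater_means = y.mean(axis=(0, 2))    # one mean per rater
residual = (y
            - client_means[:, None, None]
            - rater_means[None, :, None]
            + grand)

var_error = residual.var()           # estimates sd_error**2
# Variance among rater means, minus the part attributable to residual noise.
var_rater = rater_means.var(ddof=1) - var_error / (n_clients * n_times)

print(f"estimated rater variance: {var_rater:.3f} (true {sd_rater**2:.3f})")
print(f"estimated error variance: {var_error:.3f} (true {sd_error**2:.3f})")
```

When the estimated rater component is of the same order as the error component, as in this simulated example, a model that ignores the rater facet misattributes raters' systematic bias to clients' change; this is the situation in which a model that treats raters as a separate level becomes important.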
In a longitudinal study, keeping track of participants over time is expensive. By the same token, retaining the same group of raters (interviewers) across waves requires additional resources and may be practically infeasible. Just like study participants, raters attrite: they may change jobs and leave the data-collection team. During the past two decades, the number of longitudinal studies has grown at a fast pace, primarily because more social work researchers have recognized the importance and advantages of a dynamic inquiry. (We searched the Social Work Abstracts database on the keyword longitudinal between 1977 and 1997 and found 542 journal articles in total. Among these articles, 191 [more than one third] were published during the past seven years. Longitudinal designs were often used in program evaluations, such as assessments of programs preventing drug and alcohol abuse, mental health treatment services, welfare, foster care, and adoption programs; and in correlates studies, such as factors associated with poverty, homelessness, children's aggressive behaviors, teenage pregnancy, living arrangements and service use among elderly people, the HIV epidemic, and so forth.) However, the design of such studies has not received equal attention. A typical scenario is to take an instrument primarily designed for a cross-sectional study and apply it at additional time points. Although the instrument may have proven valid and reliable at a static time point, its usefulness for dynamic inquiry remains questionable, and the researcher may underestimate the importance of choosing the optimal number of raters and time points to achieve maximum generalizability.

In this study we addressed the following questions: How can researchers analyze rating data containing multiple sources of measurement error? …