Abstract

For more than a decade, medical educators have employed standardized students and objective structured teaching examinations (OSTEs) to evaluate the clinical teaching skills of medical faculty.1,2,3,4,5 Recent studies have set more rigorous standards of validity and reliability for these performance-based assessments.5 Some have begun using OSTEs for resident physicians,6 whom the Liaison Committee on Medical Education (LCME) and others increasingly recognize as critically important teachers for medical students and peers.7

OSTEs hold great promise for rapid and rigorous evaluation of clinical teaching skills and of new approaches to teacher training. For resident teachers and clinician-educators, accumulating enough "real life" teaching evaluations for reliable teaching assessment may take years. OSTEs can shorten this timeline, producing prompt, meaningful teaching assessments to inform important decisions such as resident evaluations or faculty promotions. OSTEs also facilitate outcomes-based educational research and program evaluation of novel initiatives to improve teaching skills.

A major challenge in OSTE practice is developing accurate rating scales or checklists for assessing teaching performance on OSTE stations. Although earlier research has delineated the characteristics of exemplary clinical teachers,8,9,10,11 translating this body of knowledge into sensitive and specific assessment instruments remains difficult. Educational researchers have developed and studied numerous instruments,12 some tailored to evaluating residents' teaching skills.13 The SFDP-26, a 26-item rating scale based on the seven teaching constructs of the Stanford Faculty Development Program (SFDP),14 is one of the best-validated rating scales available for evaluating clinical teachers.15,16

The emerging OSTE literature has yet to address definitively the choice between dichotomously scored checklists and multi-point rating scales for assessing teaching performance. Research offers clearer support for using standardized students portrayed by senior medical students, who in non-OSTE studies have shown themselves to be capable evaluators of teaching.17 The related literature on objective structured clinical examinations (OSCEs) sheds light on some of these issues. Senior medical students who act as standardized patient examiners for learners18 may improve their own communication skills,19 suggesting that standardized students may likewise improve their own teaching skills. The OSCE literature shows more controversy over the choice between checklists and rating scales, with a minority of OSCEs featuring multi-point rating scales,20 although both formats can be used successfully.21

The purpose of our study was to develop an eight-station OSTE with case-specific, behaviorally anchored rating scales designed specifically for resident teachers, and to assess its reliability and validity. This OSTE is the primary outcome measure for Bringing Education & Service Together (BEST), an ongoing randomized, controlled trial of a longitudinal residents-as-teachers curriculum at the University of California, Irvine (UCI). We hypothesized that our OSTE would demonstrate acceptable reliability and validity when used to evaluate generalist residents' clinical teaching skills before and after a pilot administration of the BEST curriculum.
Method

Instruments

We reviewed the OSTE literature1,2,3,4,5,6 to guide development of a 3.5-hour, eight-station OSTE for generalist resident physicians 〈www.residentteachers.com/oste.htm〉. Its stations (Table 1) each last 15 minutes and reflect residents' learning needs for teaching skills development as reported in the educational literature,22 and particularly in a recent focus-group study of 100 medical students, generalist residents, and faculty23 completed for the BEST study's needs assessment.

TABLE 1: Descriptive Statistics and Reliability of OSTE Rating Scales

Because our literature review revealed no single instrument suitably specific for evaluating performance on each OSTE station, we adapted the published SFDP-26 rating scale15 to our OSTE. We selected by group consensus the SFDP items that fit the objectives and content of each OSTE station, retaining each item stem's original wording and five-point Likert-type rating scale, and we added only two item stems from Lesky and Wilkerson's previous OSTE study.2 Participants in the consensus process included three attending generalist physicians experienced in clinical teaching and an education specialist. The product was a set of eight case-specific rating scales,20 each featuring 14–24 items.

Because the SFDP-26 was originally designed to evaluate clinical teachers after longitudinal exposure, we honed our rating scales so that raters could complete them accurately after a single 15-minute exposure to each resident teacher. Each rating scale also needed to measure the unique, case-specific competencies its station was designed to test. We therefore wrote detailed, case-specific behavioral anchors for all item stems so that the anchored items matched their stations' unique objectives while mirroring the appropriate underlying teaching construct of the SFDP24: learning climate, control of session, communication of goals, promoting understanding and retention, evaluation, feedback, and promoting self-directed learning. The anchors allow trained raters to determine whether they "strongly agree" that a resident demonstrates a given teaching behavior (rating of 5), "strongly disagree" (rating of 1), or prefer one of three intermediate levels (2–4). Each station's rating scale ends with the same SFDP-26 global item rating "overall teaching effectiveness." See Figure 1 for sample items. On all items, higher scores indicate better teaching skills.

Figure 1: Sample OSTE rating scale items, adapted with five different behavioral anchors (used in the pilot study, 2001–2002). The OSTE that includes these items is the primary outcome measure for Bringing Education & Service Together (BEST), an ongoing randomized controlled trial of a longitudinal residents-as-teachers curriculum at the University of California, Irvine, College of Medicine.

Protocol

Fifteen fourth-year medical students staffed the OSTEs after completing 30 hours of training as standardized students and raters through a longitudinal clinical teaching elective. Family medicine PGY1s, PGY2s, and PGY3s not enrolled in the BEST study underwent practice OSTE stations during the students' training. Of the 59 second-year generalist residents in the four university-based UCI residency programs participating in the BEST study, the residency directors offered enrollment to the 31 whose rotation schedules favored participation.
A total of 23 PGY2s enrolled: 13 in internal medicine, five in pediatrics, and five in family medicine. Eighteen of these residents underwent a pretest OSTE in August 2001, and all 23 undertook a posttest OSTE in February 2002. Between the OSTEs, the residents randomly assigned to the intervention group (n = 13) attended a 13-hour teaching skills curriculum, the results of which will be reported separately.

For most stations, one or two students enacted each case while another student watched by remote camera, with all students completing the rating scales for their stations. After the OSTEs, a rater-trained research psychologist [JH] rated occasional stations that did not already have two student ratings, so that all encounters within all stations were independently rated by at least two trained raters. An attending physician with medical education training [EM] completed additional ratings, including one encounter from each pretest station, both to corroborate the students' ratings and to test the rating scales' validity.

Analysis

For each participating resident teacher, we calculated each station's summary score as the mean of all raters' scores, including the final item, "overall teaching effectiveness," within the summary score. We also computed a grand total OSTE score for each resident, summed across the eight stations. We conducted an item analysis and calculated the reliability of the summary score from each OSTE station's rating scale using Cronbach's coefficient alpha. Intra-class correlations measured the inter-rater reliability for each station.

Results

Descriptive statistics are listed in Table 1. The standard error of measurement across all stations was 9.75.

Reliability

We evaluated the internal consistency of our OSTE rating scales with Cronbach's coefficient alpha. This rating scale reliability, indicating the degree to which a resident's score on each rating scale item reflects a common underlying teaching construct, exceeded .90 (range = .91–.94) for all eight OSTE stations. The OSTE's mean overall reliability (Cronbach's alpha) was .96 across all stations and test administrations.

In our item analysis, we correlated the mean score on each item with the mean summary score for that item's entire rating scale. For only three of the 160 total items (<2%) were these item-total correlations low enough that deleting the item would even minimally increase its rating scale's alpha coefficient. Thus, virtually every individual item contributes meaningful information to total scores. Only the global item ("overall teaching effectiveness") repeats across all stations, and only 14 of the 160 items (<10%) recur across seven stations. We did not test the effect of these few repeated items on OSTE scores or on the reliability analyses.

Inter-rater reliabilities, calculated with intra-class correlations, exceeded .75 for seven of the eight OSTE stations (Table 1). One station (Station 7, teaching a procedure) had an overall inter-rater reliability of .54: .19 for the OSTE's first pretest administration, .62 for the second pretest administration, and .78 across all posttest encounters. Had a single student rated each station, the average reliability would have dropped to an estimated intra-class correlation of .61. Including or excluding the research psychologist's ratings did not appreciably alter inter-rater reliabilities.
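To make these computations concrete, the following is a minimal sketch of the reliability statistics described above: Cronbach's coefficient alpha for one station's rating scale, an intra-class correlation across raters, and the Spearman-Brown step-down that estimates single-rater reliability. The data are simulated for illustration, and the function names and the choice of the one-way ICC(1,1) variant are our assumptions, not the study's actual analysis code.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's coefficient alpha for an (encounters x items) score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of summary scores
    return (k / (k - 1.0)) * (1.0 - item_vars.sum() / total_var)

def icc_oneway(ratings: np.ndarray) -> float:
    """One-way random-effects ICC(1,1) for an (encounters x raters) matrix."""
    n, k = ratings.shape
    row_means = ratings.mean(axis=1)
    msb = k * ((row_means - ratings.mean()) ** 2).sum() / (n - 1)      # between encounters
    msw = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))  # within encounters
    return (msb - msw) / (msb + (k - 1) * msw)

def spearman_brown(r_k: float, k: int) -> float:
    """Single-rater reliability implied by the reliability of k pooled raters."""
    return r_k / (k - (k - 1) * r_k)

# Simulated data: 18 encounters on one station, 20 five-point items, two raters.
rng = np.random.default_rng(42)
true_skill = rng.normal(3.5, 0.6, size=(18, 1))
rater_a = np.clip(np.rint(true_skill + rng.normal(0, 0.7, (18, 20))), 1, 5)
rater_b = np.clip(np.rint(true_skill + rng.normal(0, 0.7, (18, 20))), 1, 5)

item_means = (rater_a + rater_b) / 2.0      # mean of the raters' scores, per item
station_summary = item_means.sum(axis=1)    # station summary score per resident

print("alpha:", round(cronbach_alpha(item_means), 2))
print("ICC:  ", round(icc_oneway(np.column_stack([rater_a.sum(axis=1),
                                                  rater_b.sum(axis=1)])), 2))
# A two-rater reliability of .76 steps down to about .61 for a single rater,
# consistent with the single-rater estimate reported above.
print("one rater:", round(spearman_brown(0.76, 2), 2))
```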
The overall correlation between the summary scores of the physician faculty rater and those of the medical student raters, sampled across all eight stations, was .62.

Validity

We assessed the instruments' content validity by several means, including the detailed literature review. A large focus-group study involving learners and faculty23 further informed the objectives and content of the OSTE stations. Immediately after each posttest OSTE station, residents also completed an anonymous written evaluation. Among these 184 evaluations (23 residents × 8 stations), 92% indicated that the OSTE station realistically represented important teaching skills for generalist resident teachers. Five residents (22%) argued that the feedback station (Station 4) featured a student with an unrealistically difficult attitude, and four residents (17%) felt that the mini-lecture station (Station 8) needed either more or less time.

We evaluated the predictive validity of our instruments with three methods. First, we assessed the incremental validity of the OSTE by examining pretest-to-posttest changes in overall OSTE scores for (1) the residents who received teaching skills instruction and (2) the residents who did not. The instructed residents' scores improved by more than two standard deviations, while the noninstructed residents' scores did not improve, substantiating the instruments' sensitivity to instruction. Second, we calculated the reproducibility of the SFDP's seven teaching constructs across all OSTE stations and test administrations; these intra-class correlations ranged from .57 to .80. Third, as expected, the experienced second-year residents in the BEST study performed better during the OSTE than did incoming family medicine interns during the practice OSTE training.

Discussion

Our results support the study's hypothesis that an OSTE tailored to generalist residents can provide valid and reliable assessment of their clinical teaching skills. Inter-rater and rating scale reliabilities were high for our instruments, meeting experts' expectations for evaluation measures.25 Because the alpha coefficient for the combined score across all eight stations slightly exceeds the individual coefficient for each station, the overall pool of 160 items meaningfully reflects a single underlying construct, which we call "teaching skills." Although Station 7 (teaching a procedure) had an overall inter-rater reliability of only .54, we believe this problem stemmed from a training issue, because its medical student raters achieved an inter-rater reliability of .78 on the posttest after an additional 90 minutes of training.

We also believe our analyses showed the OSTE and its rating scales to be valid instruments for assessing generalist residents' teaching skills. Content validity was strong, as assessed by residents and faculty before, during, and after the pilot study. We believe each rating scale successfully measures its station's unique teaching content; we used detailed behavioral anchors to achieve this case specificity so that the rating scale item stems themselves retain fidelity to the previously validated SFDP-26 instrument. Good incremental validity and the reproducibility of the SFDP's seven clinical teaching constructs supported acceptable predictive validity. Because there were few "real life" teaching assessments of our residents during the study period, we could not conduct extensive construct validation of our measures.
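As a point of reference for the predictive-validity results above, the pretest-to-posttest improvement can be written as a standardized change score. The definition below is the conventional one (with the pretest standard deviation as one common choice of denominator), not a formula taken from the paper:

\[
d \;=\; \frac{\bar{X}_{\mathrm{post}} - \bar{X}_{\mathrm{pre}}}{SD_{\mathrm{pre}}},
\qquad d_{\mathrm{instructed}} > 2, \qquad d_{\mathrm{noninstructed}} \approx 0 .
\]

Similarly, the standard error of measurement reported in the Results relates to reliability in the usual way; the back-calculation of the total-score standard deviation is our illustration only, and assumes the SEM was derived from the overall alpha:

\[
SEM \;=\; SD_{\mathrm{total}}\sqrt{1 - \alpha}
\quad\Longrightarrow\quad
SD_{\mathrm{total}} \;=\; \frac{9.75}{\sqrt{1 - 0.96}} \;\approx\; 48.8 .
\]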
Our results support prior research showing that medical students can provide reliable and valid assessments of their clinical teachers.17 Given adequate training and clear guidelines, the standardized students in our OSTE used detailed rating scales consistently and competently. Checklists would have been simpler to use but might not have permitted the fine discrimination among multiple levels of teaching performance required by outcome studies such as our ongoing BEST trial. Correlations between the students' ratings and those of the physician faculty rater were strong.

Our study has limitations. The sample was small. Even though we included residents from three specialties, an OSTE designed for generalist resident teachers might not apply well to other specialties. Our participating medical students spent many hours in training, mainly to ensure that all students consistently used the rating scales' behavioral anchors to assess each station's unique set of teaching behaviors. We believe this effort was justified in helping the students prepare for their own future roles as resident teachers.

Future research needs to continue exploring how OSTEs can best be used to help resident physicians achieve their goals as clinical teachers. Our OSTE would benefit from generalizability studies that analyze sources of variation in ratings. While our sample size in this pilot curricular study did not permit such analyses, we are currently undertaking a larger randomized, controlled trial that includes a generalizability study of the present OSTE, its primary outcome measure. Other questions deserve additional study: do OSTEs require three to four hours of testing time for acceptable reproducibility (as high-stakes OSCEs do),21 or can shorter examination formats offer adequate reliability? Can a single rater for each station achieve reproducibility comparable to that of multiple raters, as is the case with OSCEs?21 Can residents effectively use intra-OSTE feedback to improve their clinical teaching skills, as they have done using student evaluations from actual teaching situations?26

Conclusion

Trained senior medical students competently enacted and rated an OSTE for generalist resident teachers, achieving high inter-rater reliabilities with validated, case-specific rating scales that showed strong internal consistency. Future research should clarify how OSTEs can best be used to help resident teachers and others improve their clinical teaching skills.
