Introduction/Background
The need for interprofessional collaboration is clear: healthcare providers must work on cross-professional teams to coordinate patient care and reduce the propensity for errors.1 While interprofessional education (IPE) programs recently became mandatory for MD-granting education programs,* researchers have noted the dearth of published evaluation tools for directly observing and measuring collaboration competencies.2 Interprofessional simulations provide an important opportunity to meet this need for individual assessment and program evaluation. However, current limitations of the teamwork-simulation assessment literature include: 1) little data on the sources of error variance affecting reliability, and 2) a failure to include ambulatory (outpatient) teamwork scenarios (i.e., everyday teamwork outside emergency, surgical, and hospital settings). We have developed a broader-use ambulatory teamwork simulation3 and here evaluate the reliability of a competency assessment for this scenario. In particular, we used generalizability theory to estimate reliability for sub-samples of raters, thereby identifying individual raters who are less closely calibrated with the others.

*ED-19-A was approved July 1, 2013; http://www.lcme.org.

Methods
Six observers rated collaborative behaviors for each of twenty-one students on the interprofessional collaboration simulation in a fully crossed design. Each simulation involves a single student interacting with a standardized patient, caregiver, and standardized clinicians, all portrayed by live actors. The outpatient scenario challenges students to place the patient at the center of the healthcare team, to include the caregiver on that team, and to work collaboratively with other healthcare providers.3 Observers rated student performance (from video recordings) on 21 behaviors, including collaboration, communication, and acting as an advocate for the patient. The 21 student performance ratings were analyzed using generalizability theory.4

Results
Mean scores were calculated for each student across the six raters. Student scores ranged from 8.67 to 16.3 out of a possible 21, with a mean of 12.6 (SD = 2.1). The following variance components were estimated with the SPSS VARCOMP procedure: Student (1.7% of total variance), Rater (1.2%), Item (15.7%), Student x Rater (2.5%), Rater x Item (7.4%), Student x Item (36%), and the Student x Item x Rater term plus residual error (35.5%). The G coefficient for 6 raters and 21 fixed items was .78. Most of the systematic variance was attributable to differences in student performance across competencies (items); overall student scores were a relatively small source of variance, and only a small proportion of rating variance was due to rater differences. A decision study was conducted to estimate the reliability expected with different numbers of raters. Using the above variance components, G coefficients for 2 to 7 raters were estimated, ranging from .53 to .80. In addition, we estimated the G coefficient after removing each of the study's raters from the sample, one at a time, to determine whether any particular rater was less well calibrated with the others. These "reliability if rater deleted" estimates showed modest variability (.71, .71, .75, .76, .79, .81); in the latter two cases the estimate rose above the .78 value obtained with all six raters, contradicting the expected benefit of adding raters.
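As a minimal sketch of how the decision study can be reproduced from the reported variance components, the Python snippet below computes G coefficients for 2 to 7 raters. It is not the authors' SPSS VARCOMP workflow; it assumes an absolute-error (Phi-type) formulation with the 21 items treated as fixed, which is our assumption, and it uses the rounded percentages reported above, so its output matches the published .53 to .80 range only approximately.

```python
# Sketch of a decision (D) study using the variance components reported in the abstract.
# Assumption: an absolute-error (Phi-type) coefficient with the 21 items treated as fixed;
# this is not necessarily the formulation the authors used.

N_ITEMS = 21

# Variance components (% of total variance) as reported in the abstract.
var = {
    "p": 1.7,       # Student
    "r": 1.2,       # Rater
    "i": 15.7,      # Item
    "pr": 2.5,      # Student x Rater
    "ri": 7.4,      # Rater x Item
    "pi": 36.0,     # Student x Item
    "pri_e": 35.5,  # Student x Item x Rater + residual error
}

def g_coefficient(n_raters: int, n_items: int = N_ITEMS) -> float:
    """G coefficient for the mean over n_raters raters, items fixed.

    With items fixed, the universe-score variance absorbs the Student x Item
    interaction; the error term pools the rater-linked components. Under these
    assumptions the value for six raters comes out near the reported .78.
    """
    universe = var["p"] + var["pi"] / n_items
    error = (var["r"] + var["pr"]) / n_raters + (var["ri"] + var["pri_e"]) / (n_raters * n_items)
    return universe / (universe + error)

for n in range(2, 8):
    print(f"{n} raters: G = {g_coefficient(n):.2f}")
```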
Conclusion
Results indicate that the measurement protocol could yield a reliable assessment of ambulatory interprofessional collaboration; however, multiple raters were needed: at least five, and as many as seven to achieve the desired .80. While adding raters is generally expected to improve reliability, it is unlikely that all raters are equally well calibrated, and we observed increases in reliability when either of two specific raters was excluded. Other researchers may find value in using generalizability theory to diagnose rater-specific issues affecting reliability. Based on such statistics, researchers might elect to offer additional training to, or exclude, raters who are less well calibrated with the group.
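The "reliability if rater deleted" diagnostic can be sketched as follows, assuming raw ratings are available as a fully crossed student x rater x item array. The data array, function names, and the ANOVA-based variance-component estimator below are ours for illustration (the authors used SPSS VARCOMP), and the G coefficient reuses the Phi-type, fixed-items formulation assumed in the sketch above.

```python
# Leave-one-rater-out reliability sketch for a fully crossed p (student) x r (rater) x i (item)
# design. Hypothetical data and helper names; not the authors' code.
import numpy as np

def variance_components(x: np.ndarray) -> dict:
    """ANOVA (expected-mean-square) variance-component estimates; one rating per cell.

    x has shape (n_students, n_raters, n_items). Negative estimates are clipped to zero,
    a simplification relative to the options available in SPSS VARCOMP.
    """
    n_p, n_r, n_i = x.shape
    m = x.mean()
    mp = x.mean(axis=(1, 2)); mr = x.mean(axis=(0, 2)); mi = x.mean(axis=(0, 1))
    mpr = x.mean(axis=2); mpi = x.mean(axis=1); mri = x.mean(axis=0)

    ms = {  # mean squares for each effect
        "p":  n_r * n_i * np.sum((mp - m) ** 2) / (n_p - 1),
        "r":  n_p * n_i * np.sum((mr - m) ** 2) / (n_r - 1),
        "i":  n_p * n_r * np.sum((mi - m) ** 2) / (n_i - 1),
        "pr": n_i * np.sum((mpr - mp[:, None] - mr[None, :] + m) ** 2) / ((n_p - 1) * (n_r - 1)),
        "pi": n_r * np.sum((mpi - mp[:, None] - mi[None, :] + m) ** 2) / ((n_p - 1) * (n_i - 1)),
        "ri": n_p * np.sum((mri - mr[:, None] - mi[None, :] + m) ** 2) / ((n_r - 1) * (n_i - 1)),
        "pri": np.sum((x - mpr[:, :, None] - mpi[:, None, :] - mri[None, :, :]
                       + mp[:, None, None] + mr[None, :, None] + mi[None, None, :] - m) ** 2)
               / ((n_p - 1) * (n_r - 1) * (n_i - 1)),
    }
    v = {  # random-effects variance components from the expected mean squares
        "pri_e": ms["pri"],
        "pr": (ms["pr"] - ms["pri"]) / n_i,
        "pi": (ms["pi"] - ms["pri"]) / n_r,
        "ri": (ms["ri"] - ms["pri"]) / n_p,
        "p": (ms["p"] - ms["pr"] - ms["pi"] + ms["pri"]) / (n_r * n_i),
        "r": (ms["r"] - ms["pr"] - ms["ri"] + ms["pri"]) / (n_p * n_i),
        "i": (ms["i"] - ms["pi"] - ms["ri"] + ms["pri"]) / (n_p * n_r),
    }
    return {k: max(val, 0.0) for k, val in v.items()}

def g_fixed_items(v: dict, n_raters: int, n_items: int) -> float:
    """Phi-type G coefficient with items fixed (same formulation as the D-study sketch)."""
    universe = v["p"] + v["pi"] / n_items
    error = (v["r"] + v["pr"]) / n_raters + (v["ri"] + v["pri_e"]) / (n_raters * n_items)
    return universe / (universe + error)

# Hypothetical ratings: 21 students x 6 raters x 21 items; substitute real checklist scores.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(21, 6, 21)).astype(float)

print(f"G with all raters: {g_fixed_items(variance_components(ratings), 6, 21):.2f}")
for r in range(ratings.shape[1]):
    reduced = np.delete(ratings, r, axis=1)  # drop one rater and re-estimate
    v = variance_components(reduced)
    print(f"G with rater {r} deleted: {g_fixed_items(v, reduced.shape[1], reduced.shape[2]):.2f}")
```

A rater whose deletion raises the estimate above the all-rater value would be a candidate for additional training or exclusion, as suggested in the conclusion.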