Abstract

Hypothesis

The relative advantages and disadvantages of checklists (CL) and global rating scales (GRS) have long been debated.1-2 CL are favored for their ease of use and apparent objectivity, while GRS may be more sensitive to expert performance and more reliable. However, CL and GRS have never been compared in a systematic review of validity evidence. To address this gap, we conducted a systematic review and meta-analysis of validity evidence for checklists and global rating scales used to assess health professionals in the context of simulation-based medical education. We evaluated the inter-rater, inter-item, and inter-station reliabilities; between-scale correlations; and other sources of validity evidence.3-4

Methods

From the studies published in an earlier systematic review of simulation-based assessment,5 we identified all studies evaluating both a GRS and a CL. We coded new information on the GRS-CL correlation and the reliabilities for each scale, instrument (items, stations), raters (number, training, and blinding), and other validity evidence3-4 (i.e., appropriate content, internal structure, relationships with other variables, response process, and consequences). We pooled correlation and reliability coefficients using random-effects meta-analysis.

Results

We found 45 relevant studies. All studies included physicians or physicians in training; one study also included nurse anesthetists. Topics included open and laparoscopic surgery (N=22), endoscopy (N=8), resuscitation (N=7), and anesthesiology (N=4). Only 22 studies described rater training. The same rater completed both the CL and the GRS in 39 studies. The pooled GRS-CL correlation was 0.76 (95% CI, 0.69-0.81). Inter-rater reliability was similar for GRS and CL, while inter-item and inter-station reliability were higher for GRS (see Figure). Most articles provided evidence supporting content (GRS, N=38; CL, N=41). Content evidence for GRS usually referenced unchanged (N=18) or modified (N=15) previously reported instruments, whereas content evidence for CL usually involved expert consensus (N=26). Forty-four studies reported relationships with other variables, usually an evaluation of expert-novice differences (N=36). Evidence of response process and consequences was rare (two studies each).

Conclusion

We found high GRS-CL correlations, explaining on average 58% of the variance in scores (r² ≈ 0.76² ≈ 0.58), although this could be due, in part, to the same rater completing both scales. Inter-rater reliabilities for both GRS and CL were high and similar, while inter-item and inter-station reliabilities favored GRS. Content validity evidence was commonly reported, but the processes of tool development for GRS and CL appear to be quite different. While our findings are supportive of GRS and CL as useful formats, they do not necessarily generalize to all GRS or all CL. New GRS and CL, and applications of old instruments in new contexts, must be validated afresh.
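
The abstract specifies only "random-effects meta-analysis" for pooling; a common implementation for correlation coefficients is a DerSimonian-Laird random-effects model applied to Fisher z-transformed values, back-transformed to r. The minimal Python sketch below uses that assumed approach with entirely hypothetical study data (the function name and inputs are illustrative, not from the study) and shows the r-to-variance-explained conversion behind the 58% figure.

```python
import numpy as np

def pool_correlations_random_effects(rs, ns):
    """Pool correlation coefficients with a DerSimonian-Laird random-effects
    model on Fisher z-transformed values, then back-transform to r."""
    rs = np.asarray(rs, dtype=float)
    ns = np.asarray(ns, dtype=float)
    z = np.arctanh(rs)                 # Fisher z-transform of each r
    v = 1.0 / (ns - 3.0)               # within-study variance of z
    w = 1.0 / v                        # fixed-effect weights
    z_fixed = np.sum(w * z) / np.sum(w)
    q = np.sum(w * (z - z_fixed) ** 2)            # Cochran's Q
    df = len(rs) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                 # between-study variance
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    z_re = np.sum(w_star * z) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    lo, hi = z_re - 1.96 * se, z_re + 1.96 * se
    # Back-transform the pooled z and its 95% CI to the correlation scale.
    return np.tanh(z_re), (np.tanh(lo), np.tanh(hi))

# Hypothetical per-study GRS-CL correlations and sample sizes:
r_pooled, ci = pool_correlations_random_effects(
    rs=[0.81, 0.70, 0.78, 0.74], ns=[30, 45, 28, 60])
print(f"pooled r = {r_pooled:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# The abstract's pooled r of 0.76 gives r^2 = 0.76**2 ~= 0.58,
# i.e., about 58% of score variance shared between GRS and CL.
print(f"variance explained r^2 = {r_pooled**2:.0%}")
```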
