Sir: While we welcome technical skills assessment in plastic surgery and concur that Khan et al.1 have established construct validity for their three assessment tasks, we question the conclusion that “the tasks are valid and can be used … for competence assessments and revalidation.” Objective Structured Assessment of Technical Skills2 comprises a global rating score and a checklist for each task used. The global rating scale is a subjective assessment tool, and it is very difficult to train assessors to the reliability level of greater than 0.8 required for high-stakes assessment. We find it curious that the team decided to abandon half the Objective Structured Assessment of Technical Skills results (i.e., the checklist results) altogether in their final analysis.

The results of the electromagnetic tracking system when suturing was assessed are also surprising. The researchers state that the greater range of suturing times among the consultants was attributable to the fact that “consultants perform the task to a higher quality but are also more efficient with their timing.” How they drew this incorrect conclusion is unclear, especially as previous studies from one of the senior authors have concluded the reverse (i.e., senior or more experienced surgeons demonstrate a more homogeneous performance).3

In their statistical analysis, this group persists in the erroneous idea that correlation is equivalent to reliability. Correlation is merely a measure of association, not agreement. Table 1 exemplifies the problem. The performance of 12 residents was independently evaluated by two assessors (assessors 1 and 2) using the Objective Structured Assessment of Technical Skills 5-point Likert scale. They agreed on the scores for residents 1 through 3, but disagreed on the scores of residents 4 through 9.
For assessors 1 and 2, the correlation between their scores is 0.949 (p < 0.0001) and the alpha coefficient is 0.967 (p < 0.0001), indicating (erroneously) a high level of interrater reliability, when in fact they agreed on the scores of only 25 percent of the residents. The same is true for assessors 3 and 4, but the situation is slightly worse: they did not agree on the score of a single resident, although the correlation between their scores is 1.0 (p < 0.0001) and the alpha coefficient is 1.0 (p < 0.0001), indicating (erroneously) total agreement between the raters. This issue has been quantitatively explored in some detail,4 and it is therefore disappointing that high-profile groups continue to make the same errors.

Table 1. Hypothetical Data from Four Assessors of 12 Residents’ Performance on a Given Surgical Task

If the plastic surgery community is to consider using technical skills assessment as part of important, career-defining decisions, such as resident selection, competence assurance, and recredentialing, we must ensure that the test battery applied and the statistical analysis used are well validated, beyond reproach, and consistent with the methods agreed upon and documented in 1974 by the leading authorities in this area5 (i.e., the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education). Unfortunately, this is not the case in this study. We suggest that more rigorous statistical analysis and refinement of the Objective Structured Assessment of Technical Skills are required to justify the conclusion that the assessments described are a worthwhile method of testing technical skill.

DISCLOSURE

None of the authors has a financial interest in any of the products or devices mentioned in the original article.

Ann-Marie Kennedy, M.R.C.S.
National Surgical Training Centre
Royal College of Surgeons

Sean Carroll, F.R.C.S.I.(Plast.)
St. Vincent’s University Hospital

Oscar Traynor, F.R.C.S.
Anthony G. Gallagher, Ph.D.
National Surgical Training Centre
Royal College of Surgeons
Dublin, Ireland