Sir: While we welcome technical skills assessment in plastic surgery and concur that Khan et al.1 have established construct validity for their three assessment tasks, we question the conclusion that “the tasks are valid and can be used … for competence assessments and revalidation.” Objective Structured Assessment of Technical Skills2 comprises a global rating score and a checklist for each task used. The global rating scale is a subjective assessment tool, and it is very difficult to train assessors to the reliability level of greater than 0.8 required for high-stakes assessment. We find it curious that the team decided to abandon half the Objective Structured Assessment of Technical Skills results (i.e., the checklist results) altogether in their final analysis.

The results of the electromagnetic tracking system when suturing was assessed are also surprising. The researchers state that the greater range of suturing times among the consultants was attributable to the fact that “consultants perform the task to a higher quality but are also more efficient with their timing.” How they drew this incorrect conclusion is unclear, especially as previous studies from one of the senior authors have concluded the reverse (i.e., senior or more experienced surgeons demonstrate a more homogeneous performance).3

In their statistical analysis, this group persists in the erroneous idea that correlation is equivalent to reliability. Correlation is merely a measure of association, not agreement. Table 1 exemplifies the problem. The performance of 12 residents was independently evaluated by two assessors (assessors 1 and 2) using the Objective Structured Assessment of Technical Skills 5-point Likert scale. They agreed on the scores for residents 1 through 3, but disagreed on the scores of residents 4 through 9.
For assessors 1 and 2, the correlation between their scores is 0.949 (p < 0.0001) and the alpha coefficient is 0.967 (p < 0.0001), indicating (erroneously) a high level of interrater reliability, when in fact they agreed on the scores of only 25 percent of the residents. The same is true for assessors 3 and 4, but the situation is slightly worse: they did not agree on the score of a single resident, although the correlation between their scores is 1.0 (p < 0.0001) and the alpha coefficient is 1.0 (p < 0.0001), indicating (erroneously) total agreement between the raters. This issue has been quantitatively explored in some detail,4 and it is therefore disappointing that high-profile groups continue to make the same errors.

Table 1. Hypothetical Data from Four Assessors of 12 Residents’ Performance on a Given Surgical Task

If the plastic surgery community is to consider using technical skills assessment as part of important, career-defining decisions, such as resident selection, competence assurance, and recredentialing, we must ensure that the test battery applied and the statistical analysis used are well validated, beyond reproach, and consistent with the methods agreed upon and documented in 1974 by the leading authorities in this area5 (i.e., the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education). Unfortunately, this is not the case in this study. We suggest that more rigorous statistical analysis and refinement of the Objective Structured Assessment of Technical Skills are required to justify the conclusion that the assessments described are a worthwhile method of testing technical skill.

DISCLOSURE

None of the authors has a financial interest in any of the products or devices mentioned in the original article.

Ann-Marie Kennedy, M.R.C.S.
National Surgical Training Centre
Royal College of Surgeons

Sean Carroll, F.R.C.S.I.(Plast.)
St. Vincent’s University Hospital

Oscar Traynor, F.R.C.S.
Anthony G. Gallagher, Ph.D.
National Surgical Training Centre
Royal College of Surgeons
Dublin, Ireland