Sir, Multiple observers are a reality of large observational and multicentre studies and introduce the challenge of addressing inter-rater reliability. Long term Outcomes in Psoriatic Arthritis II (LOPAS II) is a multicentre prospective observational study investigating work disability in PsA. The primary endpoint is presenteeism (reduced effectiveness at work), but the secondary endpoints include tender and swollen joint counts. Clinical assessments will be undertaken at multiple sites across the UK. We therefore set out to undertake a reliability exercise to estimate joint count reliability in LOPAS II.

We invited assessors from each centre to an education day at the lead site. A 1-h seminar on the study was followed by a 45-min clinical training session on joint counts led by two trainers, each with >10 years' experience in PsA joint assessment. The session concluded with a joint count reliability exercise: four patients of differing disease duration (1–33 years) and disease activity (ranging from 22 tender and 9 swollen joints to 2 tender and 2 swollen joints, as assessed by the instructors) were assessed using a modified (asymmetrical) Latin square design. Reliability was measured using Krippendorff's α, a reliability coefficient that accommodates the incomplete data arising from the modified Latin square design [1]. Analyses were undertaken on the group as a whole and then repeated excluding those who self-reported being unconfident or who had never performed joint assessments before.

Twelve assessors from seven units attended: one doctor, seven rheumatology nurse specialists, one occupational therapist and three research nurses (of whom one had rheumatology experience). Reliability is reported in Table 1. Inter-rater reliability was low across the whole group, although it was higher among those with experience.

Table 1. Joint count reliability using Krippendorff's α

There are limited reports of joint count reliability among physicians with an interest in PsA [2–5]. Even among such experts, intraclass correlation coefficients are poor for determining peripheral joint swelling (0.13, 0.55, 0.242) and moderate for tenderness (or activity) (0.73, 0.75, 0.72). It is noteworthy that none of these studies included the wider multidisciplinary team. To our knowledge, none of the recently published large observational studies or registry reports has reported on joint count reliability [6–10]. Only the Toronto research group has done so, at the time of that cohort's inception [4]. The Toronto study involved three rheumatologists and two trainees assessing five patients in a Latin square design; observer variance was <1%, indicating good reliability of assessment. For direct comparison, analysis of variance in our study showed that the proportion of variance attributable to (all) raters was much higher: 56% for swollen joints (P = 0.094) and 60% for tender joints (P = 0.004). It is noteworthy that the Toronto study was undertaken over 20 years ago; since then the expansion of the multidisciplinary team has meant that clinical assessments are now performed by a wider clinical team of doctors, nurses and extended-scope therapists. The general lack of reporting of joint count reliability may reflect a mixture of publication bias, insufficient recognition of the potential problem or misplaced confidence.
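As an illustrative sketch of this analysis, the snippet below uses the open-source krippendorff Python package (an assumed tool; the software actually used for our analysis is not specified here) with hypothetical joint counts. The missing cells reflect assessments that the asymmetrical Latin square rotation omits by design, which Krippendorff's α tolerates.

```python
# Minimal sketch: Krippendorff's alpha for a rater-by-patient matrix with
# missing-by-design entries. The data below are hypothetical, not the
# LOPAS II results.
import numpy as np
import krippendorff

# Rows = raters, columns = patients; np.nan marks assessments a rater did
# not perform under the Latin square rotation.
tender_counts = np.array([
    [22.0,    9.0,  np.nan,  2.0],
    [20.0,  np.nan,  5.0,    3.0],
    [np.nan,  8.0,   4.0,    2.0],
    [21.0,   10.0,   6.0,  np.nan],
])

# Joint counts are treated as interval-level data here (an assumption;
# ordinal or ratio scaling could also be argued for).
alpha = krippendorff.alpha(reliability_data=tender_counts,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha (tender joints): {alpha:.2f}")
```

Values of α near 1 indicate close agreement between raters; values near 0 indicate agreement no better than chance.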
The poor reliability identified in our study matters both for LOPAS II and for assessors at centres in the UK and further afield who collect data for other large observational studies in PsA. The joint count training offered on the LOPAS II training day was minimal, as we had anticipated the need for only some fine-tuning to standardize assessment techniques. More training is required, a need the assessors themselves identified in their feedback from the training day. We are attempting to standardize assessments and improve reliability by using an instructional joint count training video and by offering one-to-one tuition at the lead site. We are also encouraging a period of mentoring within each unit for less experienced assessors, and aiming to have the same assessor perform the joint counts at serial appointments. A repeat assessment day is planned once all centres have completed the training. To our knowledge, this is the first study to investigate joint count reliability among assessors routinely contributing data in the PsA clinical and research setting. We suggest that future reports of joint count outcomes include some assessment of joint count reliability to aid the interpretation of results, particularly negative findings. Furthermore, to optimize data collection, we suggest that individual units document their joint count reliability with a view to identifying potential training needs.