Background: The prevention of joint destruction is an important goal in the management of rheumatoid arthritis (RA) and a key endpoint in drug trials. To quantify structural damage in radiographs, standardized scoring systems [1] have been developed, such as the Sharp/van der Heijde (SvdH) score [2], which separately assesses joint space narrowing (JSN) and erosions. However, applying these scores is time-consuming and requires specially trained staff, and the results are subject to considerable intra- and inter-reader variability [1]. This makes them impractical in clinical practice and limits their reliability in clinical trials.

Objectives: We aim to develop a fully automated, deep learning-based scoring system for radiographic progression in RA, in order to facilitate the introduction of quantitative joint damage assessment into daily clinical practice and to circumvent inter-reader variability in clinical trials.

Methods: A total of 5191 hand radiographs and their corresponding SvdH JSN scores from 640 adult patients with RA without visible joint surgery were extracted from the picture archive of a large tertiary hospital. The dataset was split at the patient level into training (2207 images/270 patients), validation (1150/133), and test (1834/237) sets. Joints were automatically localized using a deep learning model [3] that combines the local appearance of joints with information on the spatial relationships between joints. Small regions of interest (ROIs) were automatically extracted around each joint. Finally, different deep learning architectures were trained on the extracted ROIs using the manually assigned SvdH JSN scores as ground truth (Fig. 1). The best models were chosen based on their performance on the validation set. Their ability to assign the correct SvdH JSN scores to ROIs was assessed on the unseen data of the test set.

Fig. 1. Three-step approach to automated scoring: joint localization, ROI extraction, JSN scoring.

Results: ROI extraction was successful in 96% of joints, meaning that all structures were visible and joints were not malrotated by more than 30 degrees. For JSN scoring, modifications of the VGG16 [4] architecture seemed to outperform adaptations of DenseNet [5]. The mean accuracy (i.e., the percentage of joints to which the human reader and our system assigned the same score) was 80.5% for MCP joints and 72.3% for PIP joints. The predicted score differed from the ground truth by more than one point in only 1.8% (MCPs) and 1.7% (PIPs) of cases (Fig. 2).

Fig. 2. Confusion matrices of automatically assigned scores ("predicted score") vs. the human reader ground truth ("true score").

Conclusion: Although a number of previous efforts have been published, none has succeeded in replacing manual scoring systems at scale. To our knowledge, this is the first work that utilizes a dataset of adequate size to apply deep learning to automate JSN scoring. Even in this early version, our results are in good agreement with human reader ground truth scores. In future versions, this system can be expanded to the detection of erosions and to all joints contained in the SvdH score.
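
As an illustration of two steps described above, the following is a minimal Python sketch, not the authors' implementation. It shows one way to split images at the patient level, so that no patient contributes images to more than one set, and how the two reported agreement figures (share of joints with identical scores, and share differing by more than one point) can be computed from paired scores. The function names, the input format, the split fractions, and the example scores are all hypothetical and chosen only for illustration.

```python
import random


def split_by_patient(image_patient_ids, fractions=(0.4, 0.2, 0.4), seed=0):
    """Assign each image to 'train', 'val' or 'test' without splitting a patient.

    image_patient_ids: one patient ID per image (hypothetical input format).
    fractions: share of patients for train and validation; the remainder goes
    to test. These values are illustrative, not the abstract's actual split.
    """
    patients = sorted(set(image_patient_ids))
    random.Random(seed).shuffle(patients)  # reproducible shuffle of patients
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    set_of = {}
    for i, patient in enumerate(patients):
        if i < n_train:
            set_of[patient] = "train"
        elif i < n_train + n_val:
            set_of[patient] = "val"
        else:
            set_of[patient] = "test"
    # Every image inherits the set of its patient.
    return [set_of[patient] for patient in image_patient_ids]


def agreement_metrics(true_scores, predicted_scores):
    """Return (exact agreement, share of joints off by more than one point)."""
    if not true_scores or len(true_scores) != len(predicted_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(true_scores, predicted_scores))
    exact = sum(t == p for t, p in pairs) / len(pairs)
    off_by_more_than_one = sum(abs(t - p) > 1 for t, p in pairs) / len(pairs)
    return exact, off_by_more_than_one


if __name__ == "__main__":
    # Made-up JSN scores (0-4) for ten joints, purely for demonstration.
    true_scores = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
    predicted = [0, 1, 2, 3, 4, 1, 1, 0, 3, 3]
    exact, far_off = agreement_metrics(true_scores, predicted)
    print(f"exact agreement: {exact:.1%}, off by >1 point: {far_off:.1%}")
```

Splitting by patient rather than by image matters here because the same patient typically contributes several radiographs; an image-level split would let near-identical joints appear in both training and test data and inflate the measured agreement.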