Abstract
Purpose: Changes in cartilage thickness are predictive of radiographic joint-space loss and joint arthroplasty. While manual segmentation is the gold standard for evaluating cartilage morphology, it is time-consuming and has high inter-reader variability. Advances in deep learning and convolutional neural networks (CNNs) are promising for automatic tissue segmentation; however, the heterogeneity of the datasets used for network evaluation has limited widespread adoption of these techniques. To address these limitations, a segmentation challenge was organized at the 2019 International Workshop on Osteoarthritis Imaging (IWOAI). Here, we summarize the challenge submissions and discuss the efficacy of diverse, multi-institutional deep-learning approaches for segmenting knee cartilage and meniscus.
Methods: For the challenge, six teams trained CNNs to segment femoral cartilage, tibial cartilage, patellar cartilage, and menisci from 3D sagittal double-echo steady-state scans from the Osteoarthritis Initiative. The dataset consisted of 88 subjects scanned at two timepoints, split into cohorts of 60 for training, 14 for validation, and 14 for testing, with baseline Kellgren-Lawrence grade (KLG) 1/2/3/4 distributions of (1,22,36,1), (1,4,8,1), and (0,5,8,1), respectively. Challenge participants were blinded to all subject-identifying information. Approaches varied across teams in CNN design and data augmentation methods, and are presented in a blinded manner below. Team 1 trained a multi-class 3D U-Net with dilated convolutions using a joint weighted cross-entropy and soft-Dice loss. Team 2 used a DeeplabV3+ architecture with dense convolutional blocks and a soft-Dice loss. Team 3 designed a multi-stage network built with a cascaded ensemble of 3D and 2D V-Nets, and used intensity and geometric transforms for data augmentation. Team 4 sampled 2D slices from multiple planes in the volume to train a 2D U-Net with batch normalization and nearest-neighbor upsampling. Team 5 supervised the segmentation network with a generative adversarial framework that differentiated between real and generated 2D slices and 2D volumetric projections of segmentations. Following the challenge, a sixth submission (Team 6) utilized a simplified 2D, multi-class U-Net optimized with a soft-Dice loss. Dice overlap (Dice), volumetric overlap error (VOE), coefficient of variation (CV), and average symmetric surface distance (ASSD) were used to assess pixel-wise segmentation accuracy against expert-annotated ground truth. Cartilage thickness was computed for the automatic and manual approaches. Inter-network segmentation Dice overlaps were used to evaluate the similarity between different networks. Correlation between pixel-wise segmentation metrics (Dice, VOE, CV, and ASSD) and cartilage thickness error was measured using Pearson correlation coefficients (R). Statistical comparisons were performed using Kruskal-Wallis tests and Dunn post-hoc tests with Bonferroni correction (α=0.05).
Results: All networks showed similar segmentation performance (violin plots in Figure 1). No significant differences were observed in Dice, CV, VOE, or ASSD for femoral cartilage (p=1.0), tibial cartilage (p=1.0), patellar cartilage (p=1.0), or menisci (p=1.0) among the four top-performing networks (Teams 1, 3, 4, and 6). Inter-network Dice overlaps were highest for femoral cartilage and above 0.85 for all tissues (Figure 2). Thickness estimates showed no systematic bias and no significant differences among a majority of the networks (p=0.99; Bland-Altman plots in Figure 3). Correlation between pixel-wise segmentation accuracy metrics and cartilage thickness error ranged from very weak to moderate (highest R=0.41; thickness error vs. segmentation metrics plot in Figure 4). Highest correlations were observed with femoral cartilage thickness (R less than 0.25), while very weak correlation was observed with tibial cartilage (R less than 0.2).
Conclusions: Despite the wide variety of network approaches, most methods achieved similar segmentation and thickness accuracy across all tissues, along with high inter-network Dice overlaps. The similarity in performance and limitations suggests that independent networks, regardless of their design and training framework, may learn to represent and segment the knee similarly. While networks performed comparably on segmentation, there was variability in their thickness estimates. The correlation between standard segmentation metrics and cartilage thickness error was weak, suggesting that traditional evaluation metrics on high-performing models may not be predictive of differences in thickness accuracy. Through the segmentation challenge, we created a standardized and easy-to-use dataset to train and evaluate knee segmentation algorithms. Using deep-learning-based segmentation algorithms from multiple institutions, we showed that networks with varying training paradigms achieve similar performance and that, among models achieving high segmentation performance, current segmentation accuracy metrics are only weakly correlated with cartilage thickness endpoints.
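For readers who want to see how the overlap metrics named in Methods relate to one another, the following is a minimal sketch, not the challenge's evaluation code. It assumes each segmentation is a binary mask flattened to a sequence of 0/1 values, and it assumes CV is the population standard deviation of the two segmented volumes divided by their mean, since the abstract does not state the exact formula. The function name `overlap_metrics` is hypothetical. ASSD is omitted because it requires surface extraction and voxel spacing.

```python
from statistics import mean, pstdev


def overlap_metrics(pred, truth):
    """Return Dice, VOE, and CV for two binary masks given as flat 0/1 sequences."""
    pred = [bool(v) for v in pred]
    truth = [bool(v) for v in truth]
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")

    # Voxel counts for the intersection, the union, and each mask.
    intersection = sum(p and t for p, t in zip(pred, truth))
    union = sum(p or t for p, t in zip(pred, truth))
    n_pred, n_truth = sum(pred), sum(truth)
    if union == 0:
        raise ValueError("both masks are empty; the metrics are undefined")

    dice = 2 * intersection / (n_pred + n_truth)  # 1.0 means perfect overlap
    voe = 1 - intersection / union                # 0.0 means perfect overlap
    # Assumed definition: spread of the two volumes relative to their mean.
    cv = pstdev([n_pred, n_truth]) / mean([n_pred, n_truth])
    return {"dice": dice, "voe": voe, "cv": cv}


# Toy example: six voxels, two of which are labelled by both masks.
pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 1, 0]
metrics = overlap_metrics(pred, truth)
```

Under this CV definition, two masks with identical volumes but no overlap would score a CV of 0. That is one way a volume-based metric can disagree with Dice and VOE.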