Subchondral bone marrow lesions (BMLs) are associated with symptoms and structural progression of knee osteoarthritis (OA). Automated detection of BMLs using deep learning may help screen potential participants for clinical trials and enrich study samples for fast structural progressors or for a specific pain phenotype. The metric commonly used to evaluate deep learning binary classification, the receiver operating characteristic (ROC) curve, might not be as informative as other performance metrics, especially when the data used to train and validate the models are imbalanced, as is the case when the outcomes of interest are rare.
The objective was to compare evaluations of deep learning binary classification of BMLs, based on imbalanced data from the Osteoarthritis Initiative (OAI) study, across various performance metrics.
We used the sagittal intermediate-weighted (IW) fat-suppressed (FS) MRI data of 2,467 participants from the OAI study. We dichotomized the MRI Osteoarthritis Knee Score (MOAKS) BML grades (scored 0 to 3) into presence (grade > 0) and absence (grade = 0) classes. After the deep learning models were trained, we obtained the BML status from MRI images for each of 13 subregions in the femur and tibia (e.g., Femur Central Medial (FemCentMed), Tibia Anterior Lateral (TibAntLat), Tibia Posterior Medial (TibPostMed)). We applied ROC, precision-recall (PR), precision-recall gain (PRG), F1, and the Matthews correlation coefficient (MCC) to summarize the prediction performance of the deep learning models on the test data.
The available MOAKS data from the OAI are imbalanced. The class imbalance ratios (presence vs. absence of BMLs) are 569:2427, 49:2947, and 191:2805 in FemCentMed, TibAntLat, and TibPostMed, respectively.
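To make the effect of these ratios concrete, the sketch below computes the BML prevalence implied by the quoted counts (roughly 19.0%, 1.6%, and 6.4%) and evaluates threshold-based metrics alongside a rank-based ROC-AUC. Only the three presence:absence counts and the grade > 0 rule come from the text; the function names, the all-absent classifier, and the toy scores are hypothetical illustrations, and none of this is the study's actual pipeline or results.

```python
import math

# Presence:absence counts per subregion, as quoted in the text.
COUNTS = {"FemCentMed": (569, 2427), "TibAntLat": (49, 2947), "TibPostMed": (191, 2805)}


def dichotomize(grade):
    """Map a MOAKS BML grade (0-3) to 1 (presence, grade > 0) or 0 (absence)."""
    return 1 if grade > 0 else 0


def prevalence(n_pos, n_neg):
    """Fraction of positives; a random classifier's expected precision (PR baseline)."""
    return n_pos / (n_pos + n_neg)


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall (sensitivity), and F1; undefined ratios are reported as 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; reported as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


for name, (n_pos, n_neg) in COUNTS.items():
    print(f"{name}: prevalence = {prevalence(n_pos, n_neg):.3f}")

# Hypothetical classifier that labels every TibAntLat case as "absent".
n_pos, n_neg = COUNTS["TibAntLat"]
tp, fp, fn, tn = confusion([1] * n_pos + [0] * n_neg, [0] * (n_pos + n_neg))
accuracy = (tp + tn) / (n_pos + n_neg)
print("all-absent:", accuracy, precision_recall_f1(tp, fp, fn), mcc(tp, fp, fn, tn))

# Toy scores (not study data): 2 positives, 98 negatives, 5 of them scored fairly high.
y_toy = [1, 1] + [0] * 98
s_toy = [0.30, 0.20] + [0.25] * 5 + [0.05] * 93
pred_toy = [1 if s >= 0.2 else 0 for s in s_toy]
print("toy ROC-AUC:", roc_auc(y_toy, s_toy))
print("toy precision/recall/F1:", precision_recall_f1(*confusion(y_toy, pred_toy)[:3]))
```

In the all-absent case accuracy stays high while precision, recall, F1, and MCC are all zero. In the toy case the ranking-based ROC-AUC is high even though only 2 of the 7 predicted positives are true positives, because the large pool of negatives dominates the false-positive rate.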
When the data are this severely imbalanced, the area under the ROC curve (ROC-AUC) and the area under the PR curve (PR-AUC) give conflicting pictures of performance in TibAntLat and TibPostMed (see Table 1). By common convention, a binary classifier with a ROC-AUC of 0.8 to 0.9 is considered excellent, and one with a ROC-AUC above 0.9 outstanding. The ROC metric (ROC-AUC = 0.84) is too optimistic, since precision and sensitivity are nearly zero, indicating that almost all data are assigned to the BML-absence class. The PR curve (PR-AUC = 0.10) is more informative than the ROC curve because it is consistent with the values of precision and sensitivity. The MCC and F1 results are also consistent with those of the PR curve for both high (TibAntLat) and low (FemCentMed) class imbalance ratios.
The class imbalance ratio, together with the ROC, PR, and MCC results, should be reported for deep learning models of binary classification, particularly when the underlying data are imbalanced. To properly interpret the prediction performance of such models, an expanded set of performance metrics should be reported.
None.
AG is a consultant to Pfizer, Novartis, Regeneron, TissueGene, Merck Serono, and AstraZeneca. AG and FWR are shareholders of BICL, LLC. FWR is a consultant to Calibr (California Institute of Biomedical Research) and Grunenthal. KK is a consultant to Regeneron, LG Chem, and Express Scripts. He is principal investigator for pharma-sponsored clinical trials for AbbVie, Cumberland, and GSK and a DSMB member for Kolon TissueGene and Avalor Therapeutics.