Background
In medical device validation and verification studies, the area under the receiver operating characteristic curve (AUROC) is often used as a primary endpoint despite multiple reports of its limitations. Researchers are therefore encouraged to consider alternative metrics as primary endpoints. This manuscript presents a new metric, G4, defined as the geometric mean of sensitivity, specificity, positive predictive value, and negative predictive value. G4 belongs to a balanced metric family that also includes the Unified Performance Measure (also known as P4) and the Matthews correlation coefficient (MCC). The purpose of this manuscript is to demonstrate the benefits of using G4 together with the rest of the balanced metric family when analyzing the overall performance of binary classifiers.

Results
Simulated datasets covering a range of prevalence rates of the minority class were analyzed under a multi-reader multi-case study design. Data from an independently published study that tested the performance of an ultrasound artificial intelligence (AI) algorithm for breast cancer detection were also considered. Within each dataset, AUROC was reported alongside the balanced metric family for comparison. When the dataset prevalence and bias of the minority class approached 50%, all three balanced metrics provided equivalent interpretations of an AI's performance. As the prevalence rate increased or decreased and the data became more imbalanced, AUROC tended to overvalue or undervalue the true classifier performance, while the balanced metric family was resistant to such imbalance. Where data imbalance was strong (minority-class prevalence < 10%), MCC was preferred for standalone assessments, while P4 provided a stronger effect size in between-groups analyses. G4 acted as a middle ground, performing well in both standalone assessments and between-groups analyses.

Conclusions
Using AUROC as the primary endpoint in binary classification problems provides increasingly misleading results as the dataset becomes more imbalanced. This is particularly evident when AUROC is used in medical device validation and verification studies. G4, P4, and MCC do not share this limitation and paint a more complete picture of a medical device's performance in a clinical setting. Researchers are therefore encouraged to explore the balanced metric family when evaluating binary classification problems.
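
As an illustration of the three balanced metrics named above, the following minimal Python sketch computes them from the four cells of a confusion matrix. G4 follows the definition given in the Background (the geometric mean of the four rates). The P4 and MCC formulas are the standard ones (P4 as the harmonic mean of the same four rates), stated here as an assumption because the abstract does not spell them out. The function name `balanced_metrics`, its arguments, and the example counts are hypothetical and are not taken from the study.

```python
import math


def balanced_metrics(tp, fp, tn, fn):
    """Return G4, P4, and MCC for one binary confusion matrix.

    Assumes every marginal total (actual positives, actual negatives,
    predicted positives, predicted negatives) is non-zero; otherwise
    at least one of the four rates is undefined.
    """
    if min(tp + fn, tn + fp, tp + fp, tn + fn) == 0:
        raise ValueError("a marginal total is zero; the rates are undefined")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)  # positive predictive value
    npv = tn / (tn + fn)  # negative predictive value
    rates = [sensitivity, specificity, ppv, npv]

    # G4: geometric mean of the four rates (definition from the abstract).
    g4 = math.prod(rates) ** 0.25

    # P4: harmonic mean of the four rates (standard definition, assumed here).
    # If any rate is zero, the harmonic mean is taken to be zero.
    p4 = 0.0 if min(rates) == 0 else 4 / sum(1 / r for r in rates)

    # MCC: standard Matthews correlation coefficient, in [-1, 1].
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    return {"g4": g4, "p4": p4, "mcc": mcc}


if __name__ == "__main__":
    # Hypothetical counts with 5% minority-class prevalence (50 of 1000).
    print(balanced_metrics(tp=45, fp=95, tn=855, fn=5))
```

Because a geometric mean is never smaller than the harmonic mean of the same values, G4 is always at least as large as P4 for a given confusion matrix. Note that MCC is on a different scale, ranging from -1 to 1, whereas G4 and P4 range from 0 to 1.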