Context: The increasing use of artificial neural network (ANN) classifiers in systems, especially safety-critical systems (SCSs), requires ensuring their robustness against out-of-distribution (OOD) shifts in operation, i.e., changes in the underlying data distribution relative to the data the classifier was trained on. However, measuring the robustness of classifiers in operation with only unlabeled data is challenging. Additionally, machine learning engineers may need to compare different models, or different versions of the same model, and switch to the most robust one.

Objective: This paper explores the problem of dynamic robustness evaluation for automated model selection. We aim to find efficient and effective metrics for evaluating and comparing the robustness of multiple ANN classifiers using only unlabeled operational data.

Methods: To quantitatively measure the differences between model outputs and assess robustness under OOD shifts using unlabeled data, we select distance-based metrics. We perform an empirical comparison of five such metrics suitable for high-dimensional data such as images: Wasserstein distance (WD), maximum mean discrepancy (MMD), Hellinger distance (HL), the Kolmogorov–Smirnov statistic (KS), and Kullback–Leibler divergence (KL), all known for their efficacy in quantifying differences between distributions. We evaluate these metrics on 20 state-of-the-art models (ten CIFAR10-based, five CIFAR100-based, and five ImageNet-based models) from a widely used robustness benchmark (RobustBench), using data perturbed with various types and magnitudes of corruption to mimic real-world OOD shifts.

Results: Our findings reveal that the WD metric outperforms the others when ranking multiple ANN models trained on CIFAR10 and CIFAR100, while the KS metric performs best for ImageNet-based models. MMD can be used as a reliable second option for both datasets.

Conclusion: This study highlights the effectiveness of distance-based metrics in ranking models' robustness for automated model selection. It also underlines the importance of further research on dynamic robustness evaluation.
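To make the Methods concrete, the sketch below shows one way the five distance-based metrics could be computed from classifier softmax outputs. It is a minimal, self-contained illustration on synthetic data with assumed design choices (applying the one-dimensional WD and KS metrics to the maximum softmax score, KL and HL to mean class-probability vectors, and an RBF-kernel MMD to the full probability vectors); it is not the evaluation pipeline used in the study.

```python
import numpy as np
from scipy.stats import wasserstein_distance, ks_2samp, entropy

def kl_divergence(p, q, eps=1e-12):
    # Kullback-Leibler divergence between two discrete distributions.
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return entropy(p / p.sum(), q / q.sum())

def hellinger(p, q):
    # Hellinger distance between two discrete distributions.
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def mmd_rbf(x, y, gamma=1.0):
    # Maximum mean discrepancy with an RBF kernel (biased estimate).
    def k(a, b):
        d = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-gamma * d)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Synthetic stand-ins for softmax outputs on reference vs. operational (OOD) data.
rng = np.random.default_rng(0)
clean = rng.dirichlet(np.ones(10), size=500)
shifted = rng.dirichlet(np.ones(10) * 0.5, size=500)

# 1-D metrics on the maximum softmax score per input.
wd = wasserstein_distance(clean.max(axis=1), shifted.max(axis=1))
ks = ks_2samp(clean.max(axis=1), shifted.max(axis=1)).statistic
# Distribution-level metrics on the mean class-probability vectors.
kl = kl_divergence(clean.mean(axis=0), shifted.mean(axis=0))
hl = hellinger(clean.mean(axis=0), shifted.mean(axis=0))
# MMD directly on the full softmax vectors.
mmd = mmd_rbf(clean, shifted)

print(f"WD={wd:.4f}  KS={ks:.4f}  KL={kl:.4f}  HL={hl:.4f}  MMD={mmd:.4f}")
```

In an operational setting, `clean` would hold a model's outputs on a trusted reference set and `shifted` its outputs on unlabeled operational data; the resulting distances could then be compared across candidate models to rank their robustness under the observed shift.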