Abstract
To evaluate the performance of existing out-of-distribution (OOD) detection methods, we need to acquire labels for the test sets. In real-world deployment, labeling each new test set is laborious because there are many kinds of OOD data with varying difficulty. However, since the performance of OOD detection methods varies widely across OOD data, we still need diverse OOD data for evaluation. We therefore propose evaluating OOD detection methods on unlabeled test sets, which frees us from labeling every new OOD test set. This is a non-trivial task: without OOD labels, we cannot know which samples are correctly detected, and evaluation metrics such as AUROC cannot be computed. In this paper, we address this important yet previously untouched task for the first time. Inspired by the bimodal distribution of OOD detection test sets, we propose an unsupervised indicator named Gscore that is closely related to OOD detection performance; we can therefore use neural networks to learn this relationship and predict OOD detection performance without OOD labels. Through extensive experiments, we validate that a strong, nearly linear quantitative correlation exists between Gscore and OOD detection performance. Additionally, we introduce Gbench, a new benchmark consisting of 200 different real-world OOD datasets, to test the performance of Gscore. Our results show that Gscore achieves state-of-the-art performance compared with other unsupervised evaluation methods and generalizes well across different in-distribution (ID)/OOD datasets, OOD detection methods, backbones, and ID:OOD ratios. Furthermore, we conduct analyses on Gbench to study the effects of backbones and ID/OOD datasets on OOD detection performance. The dataset and code will be made available.
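As a rough illustration of the idea sketched above, the snippet below fits a two-component Gaussian mixture to unlabeled detection scores and measures how well the two modes separate. This is only a hypothetical sketch of the general principle (bimodal score structure implies a label-free separability signal); the function name `bimodal_separability` and the specific statistic are assumptions for illustration, not the paper's actual Gscore definition.

```python
# Hypothetical sketch: an unsupervised indicator computed from unlabeled
# OOD detection scores, exploiting their bimodal (ID vs. OOD) structure.
# This is NOT the paper's Gscore; it only mirrors the general idea.
import numpy as np
from sklearn.mixture import GaussianMixture


def bimodal_separability(scores: np.ndarray) -> float:
    """Separation between the two score modes of an unlabeled test set.

    `scores` are per-sample detection scores (e.g., max softmax probability
    or an energy score); no ID/OOD labels are used.
    """
    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(scores.reshape(-1, 1))
    mu = gmm.means_.ravel()
    sigma = np.sqrt(gmm.covariances_.ravel())
    # Distance between the two component means, normalized by their spreads.
    return abs(mu[0] - mu[1]) / (sigma[0] + sigma[1] + 1e-12)


# Toy usage: the indicator grows as the (unlabeled) ID/OOD score modes
# separate, which is the kind of quantitative relationship a regressor
# could learn to map onto AUROC.
rng = np.random.default_rng(0)
id_scores = rng.normal(1.0, 0.3, size=1000)      # hypothetical ID scores
ood_scores = rng.normal(-0.5, 0.3, size=1000)    # hypothetical OOD scores
mixed = np.concatenate([id_scores, ood_scores])  # labels are never used
print(f"separability = {bimodal_separability(mixed):.3f}")
```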