Abstract
The structure of a multi-head ensemble has been employed by many algorithms in various applications including deep metric learning. However, their structures have been empirically designed in a simple way such as using the same head structure, which leads to a limited ensemble effect due to lack of head diversity. In this paper, for an elaborate design of the multi-head ensemble structure, we establish design concepts based on three structural factors: designing the feature layer for extracting the ensemble-favorable feature vector, designing the shared part for memory savings, and designing the diverse multi-heads for performance improvement. Through rigorous evaluation of variants on the basis of the design concepts, we propose a heterogeneous double-head ensemble structure that drastically increases ensemble gain along with memory savings. In verifying experiments on image retrieval datasets, the proposed ensemble structure outperforms the state-of-the-art algorithms by margins of over 5.3%, 6.1%, 5.9%, and 1.8% in CUB-200, Car-196, SOP, and Inshop, respectively.
Highlights
Deep metric learning has been successfully used in various applications related to computer vision, such as image retrieval [1]–[10], person re-identification [11], [12], and face verification [13]–[15]
Deep metric learning refers to the design of feature extracting functions with deep neural networks so that the features of semantically similar images are close to others
The proposed Heterogeneous Double-head Ensemble (HDhE) gives a performance improvement of over 4% of Recall@1 (CUB-200) from the single-head version of HDhE (Sh-HDhE)
Summary
Deep metric learning has been successfully used in various applications related to computer vision, such as image retrieval [1]–[10], person re-identification [11], [12], and face verification [13]–[15]. Deep metric learning refers to the design of feature extracting functions with deep neural networks so that the features of semantically similar images are close to others. Ensemble is a method of ensuring robust performance by training diverse models and aggregating their prediction results. The ensemble requires two or more deep networks that extract diverse feature vectors [16]. A variety of feature vectors can be obtained by semantically diverse attention of the input image [11], [17] or by applying a loss to keep feature vectors away from others [2], [18]. The goal of the ensemble is to generate a synergy between those diverse
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.