Abstract

Weakly supervised object localization (WSOL) strives to localize objects with only image-level supervision. WSOL often faces challenges such as incomplete localization due to classifier bias and over-localization in real scenes where objects and backgrounds are strongly associated or structurally similar. While the latest Transformer-based methods effectively enhance localization performance by leveraging long-range feature dependencies, they may inadvertently amplify divergent background activation and remain susceptible to classification bias. To this end, we propose a novel Semantic-Constraint Matching (SeCM) plug-in module tailored for Transformer-based approaches. In detail, a local patch shuffle strategy is first introduced to disentangle partial contextual linkages, thereby creating image pairs. A semantic matching module then extracts co-object knowledge from the original-shuffled image pairs and drives the network to associate the foreground with its semantic label, suppressing divergent activation. Moreover, to alleviate incomplete localization and prevent excessive suppression of activation, we propose leveraging multi-modal class-specific textual representations to guide object localization by complementing diverse intra-class prior knowledge. Extensive experiments on the CUB-200-2011 and ILSVRC datasets show that our method achieves new state-of-the-art performance.
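The abstract does not specify how the local patch shuffle is implemented. As an illustration of the general idea only, the following is a minimal PyTorch sketch that builds an original-shuffled image pair by permuting a random subset of non-overlapping patches; the function name, patch size, and shuffle ratio are assumptions, not the authors' settings.

```python
import torch

def local_patch_shuffle(image, patch_size=16, shuffle_ratio=0.5, generator=None):
    """Create a shuffled counterpart of `image` by permuting a random subset of
    its non-overlapping patches, breaking part of the spatial context.

    image: tensor of shape (C, H, W); H and W must be divisible by patch_size.
    Returns (image, shuffled_image) as an original-shuffled pair.
    """
    C, H, W = image.shape
    ph, pw = H // patch_size, W // patch_size

    # Split into non-overlapping patches: (num_patches, C, patch_size, patch_size).
    patches = image.reshape(C, ph, patch_size, pw, patch_size)
    patches = patches.permute(1, 3, 0, 2, 4).reshape(ph * pw, C, patch_size, patch_size)

    # Choose a random subset of patch positions and permute them among themselves.
    num_patches = ph * pw
    num_shuffled = max(2, int(shuffle_ratio * num_patches))
    idx = torch.randperm(num_patches, generator=generator)[:num_shuffled]
    perm = idx[torch.randperm(num_shuffled, generator=generator)]
    shuffled = patches.clone()
    shuffled[idx] = patches[perm]

    # Reassemble the shuffled patches back into an image of the original shape.
    shuffled = shuffled.reshape(ph, pw, C, patch_size, patch_size)
    shuffled = shuffled.permute(2, 0, 3, 1, 4).reshape(C, H, W)
    return image, shuffled
```

In practice, such a pair would be fed to the Transformer backbone twice, with the semantic matching module comparing the two sets of token features; that downstream step is specific to the paper and is not sketched here.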
