Abstract

Multi-task learning of object detection and box-level segmentation is commonly formulated as implicit feature modulation, which suffers from a lack of interaction between the two tasks. In this paper, we propose an explicit feature modulation solution with two complementary mask infusion methods. The first is semantic feature enhancement of the backbone, achieved by a novel mask-guided mutual attention (MMA) module. MMA explicitly guides the feature maps toward more semantically informative representations that focus on the central region of each pedestrian, which significantly improves performance. The second is confidence score enhancement of the detection head, provided by our mask-guided score fusion (MSF) module. MSF combines the classification, IoU, and centerness score maps with the learned mask, which discriminates true from false positives more effectively. Qualitative results show that the modulated feature maps in both the backbone and the detection head become more semantically meaningful and more robust to scale variation and occlusion. Our method achieves a considerable gain over the state of the art on the KAIST, CVC14, and FLIR datasets. Moreover, it runs at 22 FPS in the default setting, making it practical for many real-world scenarios.
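
The abstract describes MSF only at a high level. The snippet below is a minimal, hypothetical sketch of how such a mask-guided score fusion could look, assuming an anchor-free detector whose classification, IoU, and centerness maps share the mask's spatial resolution and assuming the fusion is a learned 1x1 convolution over the concatenated score maps; the class name, signatures, and exact fusion operation are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a mask-guided score fusion (MSF) head.
# Assumes per-location score maps from an anchor-free detection head
# plus a learned box-level mask, all at the same spatial resolution.
import torch
import torch.nn as nn


class MaskGuidedScoreFusion(nn.Module):
    """Fuses classification, IoU, centerness, and mask maps into one confidence map."""

    def __init__(self, num_classes: int = 1):
        super().__init__()
        # Learned 1x1 conv over the concatenated score maps; the actual MSF
        # module may use a different fusion (e.g. element-wise products).
        self.fuse = nn.Conv2d(num_classes + 3, num_classes, kernel_size=1)

    def forward(self, cls_map, iou_map, centerness_map, mask_map):
        # cls_map:        (B, C, H, W) classification logits
        # iou_map:        (B, 1, H, W) predicted IoU quality logits
        # centerness_map: (B, 1, H, W) centerness logits
        # mask_map:       (B, 1, H, W) learned box-level mask logits
        scores = torch.cat(
            [cls_map.sigmoid(), iou_map.sigmoid(),
             centerness_map.sigmoid(), mask_map.sigmoid()], dim=1)
        # Fused confidence used to rank detections and suppress false positives.
        return self.fuse(scores).sigmoid()


if __name__ == "__main__":
    msf = MaskGuidedScoreFusion(num_classes=1)
    b, h, w = 2, 32, 32
    fused = msf(torch.randn(b, 1, h, w), torch.randn(b, 1, h, w),
                torch.randn(b, 1, h, w), torch.randn(b, 1, h, w))
    print(fused.shape)  # torch.Size([2, 1, 32, 32])
```

Under this reading, the mask map can down-weight locations that score highly on classification alone but fall outside the pedestrian region, which is one plausible way the fused score separates true from false positives.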
