Abstract
With the development of deep neural networks, object detection has made a significant leap. However, beyond intrinsic intra- and inter-class differences, objects in real-world scenarios may appear at diverse scales, with blurred appearance, or even under severe occlusion. Inspired by the human behavior of zooming in and out when looking at blurred images, we propose a Mixed-Scale Network (dubbed MSNet) to tackle these issues. Specifically, MSNet employs a zoom strategy to exploit mixed-scale contextual information, which fully unleashes the representation ability of deep neural networks. First, the global feature aggregation module and the global feature enhancement module aggregate mixed-scale features from a global perspective, learning the transformation offsets of pixels to contextually align the higher-resolution features. Moreover, the local feature aggregation module enriches instance-level features by adaptively aligning features from different receptive fields. Extensive experiments on MS COCO demonstrate the effectiveness of MSNet, which yields significant improvements of 0.7–2.8 points in APbox over baselines when paired with different detectors, backbones, and training schedules. In addition, on pixel-level prediction tasks, including semantic segmentation and instance segmentation, the proposed method achieves consistent improvements of 0.6–2.7 points in mIoU/APmask over baselines, further substantiating the effectiveness of MSNet.
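As a rough illustration of the offset-based alignment the abstract alludes to (learning per-pixel offsets so that upsampled coarse features are contextually aligned with higher-resolution ones), the PyTorch sketch below warps the upsampled feature map with a learned offset field. The module name `AlignModule`, the single 3×3 offset convolution, the normalized-coordinate offsets, and the residual fusion are assumptions for illustration only, not the paper's exact design.

```python
# Minimal sketch of offset-based feature alignment (names are hypothetical,
# not MSNet's actual modules): predict a pixel-wise offset field from the
# concatenation of an upsampled coarse feature and a high-resolution feature,
# then warp the coarse feature so the two are spatially aligned before fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Predict a 2-channel (dx, dy) offset field in normalized coordinates.
        self.offset_conv = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, low_res, high_res):
        # Upsample the coarse feature to the high-resolution spatial size.
        up = F.interpolate(low_res, size=high_res.shape[-2:],
                           mode="bilinear", align_corners=False)
        offset = self.offset_conv(torch.cat([up, high_res], dim=1))  # (B, 2, H, W)

        # Base sampling grid in normalized [-1, 1] coordinates.
        b, _, h, w = high_res.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=high_res.device),
            torch.linspace(-1, 1, w, device=high_res.device),
            indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).expand(b, -1, -1, -1)  # (B, H, W, 2)

        # Shift each sampling location by its learned offset and warp the
        # upsampled feature, then fuse it with the high-resolution feature.
        grid = grid + offset.permute(0, 2, 3, 1)
        aligned = F.grid_sample(up, grid, mode="bilinear", align_corners=False)
        return aligned + high_res
```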