Abstract

In object detection, real-world scenes often contain heavy occlusion, which readily degrades detector accuracy. Most current detectors use a convolutional neural network (CNN) as the backbone, but CNNs are not robust to detection under occlusion: when object pixels are missing, conventional convolution struggles to extract features, reducing detection accuracy. To address these two problems, we propose VFN (A Vision Enhancement and Feature Fusion Multiscale Detection Network). VFN first builds a multiscale backbone from different stages of the Swin Transformer; it then applies a vision enhancement module based on dilated convolution to enlarge the field of view of feature points at each scale and mitigate the missing-pixel problem; finally, a feature guidance module allows features at each scale to enhance one another through fusion. On both the PASCAL VOC and CrowdHuman datasets, VFN achieves higher overall accuracy than competing methods and is better at finding occluded objects, demonstrating the effectiveness of our method. The code is available at https://github.com/qcw666/vfn.
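
To make the dilated-convolution idea behind the vision enhancement module concrete, here is a minimal sketch, not the authors' released implementation: parallel 3x3 convolutions with increasing dilation rates gather context from a wider neighborhood around each feature point without reducing resolution. The module name, branch layout, and dilation rates (1, 2, 4) are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class VisionEnhancementSketch(nn.Module):
    """Illustrative sketch (not the paper's code): parallel dilated
    convolutions enlarge each feature point's receptive field so that
    context around occluded (missing) pixels can still contribute."""

    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        # One 3x3 branch per dilation rate; padding=d keeps spatial size.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, bias=False)
            for d in dilations
        ])
        # 1x1 conv fuses the concatenated branches back to `channels`.
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(x + self.fuse(ctx))  # residual connection

if __name__ == "__main__":
    feat = torch.randn(1, 96, 56, 56)  # e.g., an early Swin-stage feature map
    out = VisionEnhancementSketch(96)(feat)
    print(out.shape)  # torch.Size([1, 96, 56, 56])
```

In this sketch the same module can be applied independently to the feature map produced at each backbone scale, which matches the abstract's description of enhancing feature points at different scales before fusion.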
