Abstract

Object detection algorithms for Internet of Things (IoT) devices must typically combine high real-time performance with low computational complexity. The real-time object detection network You Only Look Once Version 3 (YOLOv3) exploits multi-scale features through a feature pyramid network structure and achieves good accuracy while maintaining fast detection speed. The feature pyramid network in YOLOv3 consists of bottom-up feature extraction, top-down up-sampling, and lateral connections between low-level detail features and high-level semantic features. However, not all features are useful for object detection. In this article, a novel object detection network, Spatial Attention based YOLOv3 (SA-YOLOv3), is proposed. The proposed method adds a spatial attention network to the top-down up-sampling path. The spatial attention network computes a feature weight matrix from the up-sampled feature map, which SA-YOLOv3 uses to filter the low-level features and retain the more valuable ones. Finally, the selected low-level feature map and the high-level feature map are concatenated, yielding feature maps that carry both spatial detail and rich semantic information. Experimental results on the PASCAL VOC2012 and RSOD datasets show that SA-YOLOv3 outperforms YOLOv3.
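As a concrete illustration of the mechanism described above, the following is a minimal PyTorch sketch, not the authors' implementation: the class name SpatialAttentionBlock and the choice of a 1 × 1 convolution followed by a sigmoid to produce the weight matrix are assumptions; the abstract only states that a weight matrix is computed from the up-sampled feature map and used to filter the low-level features before concatenation.

```python
# Minimal sketch of the spatial-attention gating described in the abstract.
# NOT the authors' code: the 1x1-conv + sigmoid gate is an assumed design.
import torch
import torch.nn as nn

class SpatialAttentionBlock(nn.Module):
    def __init__(self, high_channels: int):
        super().__init__()
        # Collapse the up-sampled high-level map to a single-channel
        # spatial weight matrix in [0, 1] (assumed form of the gate).
        self.weight = nn.Sequential(
            nn.Conv2d(high_channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, low: torch.Tensor, high_up: torch.Tensor) -> torch.Tensor:
        # low:     low-level detail features from the backbone (N, C_low, H, W)
        # high_up: up-sampled high-level semantic features     (N, C_high, H, W)
        w = self.weight(high_up)          # (N, 1, H, W) spatial weight matrix
        filtered = low * w                # suppress less useful detail features
        return torch.cat([filtered, high_up], dim=1)  # fuse detail + semantics
```

Broadcasting the single-channel weight matrix across all channels of the low-level map keeps the gate purely spatial, matching the paper's description of filtering in the spatial domain.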

Highlights

  • Object detection is a key step for Internet of Things (IoT) devices to achieve intelligent perception and recognition of objects and processes, in applications such as autonomous driving [1], [2] and robot vision [3]

  • We propose the Spatial Attention based You Only Look Once Version 3 (YOLOv3) network, SA-YOLOv3, which adds a spatial attention network to YOLOv3's top-down up-sampling path


Summary

INTRODUCTION

Object detection is a key step for IoT devices to realize intelligent perception and recognition of objects and processes, in applications such as autonomous driving [1], [2] and robot vision [3]. In the proposed spatial attention network, one branch passes the up-sampled feature map through a location network to extract a weight matrix Z, and the other branch multiplies the low-level detail feature map L by Z to generate the attention map Y in the spatial domain. In the multi-scale detection network, the A2, B2, and C2 layers correspond to the output feature maps of the three DBL × 5 modules, respectively. The weight matrix W1 is multiplied by the low-level detail feature map B1 from Darknet-53, and the resulting map SA1_out (size 26 × 26 × 512) is the output of the SA1 network. SA2_out is then concatenated with the high-level semantic feature map to generate the output feature map C2 of the SA-block module.
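To make the tensor shapes concrete, the short script below instantiates the same assumed gating at YOLOv3's 26 × 26 scale, where B1 from Darknet-53 has 512 channels and the up-sampled semantic map has 256 channels (standard YOLOv3 sizes); as before, the gate design is an assumption, not the authors' exact code.

```python
# Shape check for the SA1 stage described above, using standard YOLOv3
# sizes at the 26x26 scale. The 1x1-conv + sigmoid gate is the same
# assumed design as in the earlier sketch.
import torch
import torch.nn as nn

gate = nn.Sequential(nn.Conv2d(256, 1, kernel_size=1), nn.Sigmoid())

b1 = torch.randn(1, 512, 26, 26)       # low-level detail map B1 (Darknet-53)
high_up = torch.randn(1, 256, 26, 26)  # up-sampled high-level semantic map

w1 = gate(high_up)                     # weight matrix W1, shape (1, 1, 26, 26)
sa1_out = b1 * w1                      # SA1_out: filtered detail map, 26x26x512
fused = torch.cat([sa1_out, high_up], dim=1)
print(sa1_out.shape, fused.shape)      # (1, 512, 26, 26) (1, 768, 26, 26)
```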

EXPERIMENTS
Findings
CONCLUSION
