Abstract
Object detection is one of the core tasks in computer vision. Object detection algorithms often have difficulty detecting objects at diverse scales, especially smaller ones. To cope with this issue, Lin et al. proposed feature pyramid networks (FPNs), which aim for a feature pyramid with higher semantic content at every scale level. The FPN consists of a bottom-up pyramid and a top-down pyramid. The bottom-up pyramid is formed by the layers of feature maps of a convolutional neural network. The top-down pyramid is formed by progressive up-sampling of a highly semantic yet low-resolution feature map at the top of the bottom-up pyramid. At each up-sampling step, feature maps of the bottom-up pyramid are fused with the top-down pyramid to produce highly semantic yet high-resolution feature maps in the top-down pyramid. Despite significant improvement, the FPN still misses small-scale objects. To further improve the detection of small-scale objects, this paper proposes scale adaptive feature pyramid networks (SAFPNs). The SAFPN fuses feature maps of the bottom-up and top-down pyramids using weights chosen adaptively for each input image. The scale adaptive weights are computed by a scale attention module built into the feature map fusion computation. The scale attention module is trained end-to-end to adapt to the scale of objects contained in images of the training dataset. Experimental evaluation, using both the 2-stage detector Faster R-CNN and the 1-stage detector RetinaNet, demonstrated the proposed approach’s effectiveness.
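The top-down fusion described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: it uses single-channel feature maps stored as nested lists, omits the 1x1 lateral convolutions of the original FPN, and replaces the learned scale attention module with a hypothetical stand-in (`attention_weights`) that scores each map by its global average and normalizes the two scores with a softmax. The function names and the `params` argument are assumptions made for illustration only.

```python
import math


def upsample2x(fmap):
    """Nearest-neighbour 2x up-sampling of a 2-D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def global_avg(fmap):
    """Global average pooling over a 2-D feature map."""
    return sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(lateral, top_down, params):
    """Hypothetical stand-in for a scale attention module (an assumption).

    Scores each input by a scaled global average, then normalizes the two
    scores so the fusion weights are positive and sum to 1.
    """
    a, b = params
    return softmax([a * global_avg(lateral), b * global_avg(top_down)])


def fuse(lateral, top_down, weights):
    """Element-wise weighted sum of two feature maps of equal size."""
    w_lat, w_td = weights
    return [[w_lat * x + w_td * y for x, y in zip(r1, r2)]
            for r1, r2 in zip(lateral, top_down)]


def top_down_pyramid(bottom_up, params=None):
    """Build the top-down pyramid from bottom-up maps ordered fine to coarse.

    With params=None the fusion is the plain FPN sum (both weights 1);
    otherwise the weights depend on the input through attention_weights.
    """
    pyramid = [bottom_up[-1]]                      # start from the coarsest map
    for lateral in reversed(bottom_up[:-1]):
        up = upsample2x(pyramid[0])
        if params is None:
            weights = (1.0, 1.0)
        else:
            weights = attention_weights(lateral, up, params)
        pyramid.insert(0, fuse(lateral, up, weights))
    return pyramid


# Usage: a 4x4 fine map and a 2x2 coarse map.
c2 = [[1.0] * 4 for _ in range(4)]
c3 = [[2.0] * 2 for _ in range(2)]
plain = top_down_pyramid([c2, c3])
adaptive = top_down_pyramid([c2, c3], params=(0.0, 0.0))
```

In this toy setting the plain FPN path simply adds the up-sampled coarse map to the fine map, while the adaptive path rescales the two contributions per input. In the paper the weights come from a module trained end-to-end with the detector, which this sketch does not attempt to reproduce.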
Highlights
Object detection is one of the most important problems in computer vision
For an image containing larger-scale objects, on the other hand, the lower-resolution feature map would be weighted more than the others. The adaptive weighting is learned by the scale attention module (SAM), which is trained end-to-end with the other parts of the object detection network. The proposed approach, scale adaptive feature pyramid networks (SAFPNs), is versatile in that it applies to many different object detection architectures, including both 1-stage and 2-stage networks.
To evaluate the proposed scale adaptive feature pyramid network (SAFPN), we conduct a set of experiments over five variations of networks.
Summary
Object detection is one of the most important problems in computer vision. In recent years, object detection accuracy has improved greatly through the use of deep convolutional neural networks (CNNs) [1]. Representative single-stage detectors include SSD [6], several incarnations of YOLO [7–9], and RetinaNet [10]. Most of these methods use a DCNN such as VGG [1] or ResNet [12] as their “backbone” network to extract features, which are used to detect the object area (bounding box) and to classify the object within it. Some object detection algorithms, such as Faster R-CNN, use a feature map at a fixed resolution level for the task. The single-resolution feature map of Faster R-CNN (e.g., C5 in Figure 1(a)) has high semantic content induced by the later CNN layers of the backbone network. One could instead use earlier layers of the backbone CNN as feature maps for object area detection and classification. These earlier feature maps (e.g., C1 or C2 in Figure 1(a)) have higher resolution but lower semantic content. This leads to inaccurate object areas (bounding boxes) and inaccurate classification labels for the objects in those areas.