Real-Time Weakly Supervised Object Detection Using Center-of-Features Localization

Hatem Ibrahem, Hyun-Soo Kang, Ahmed Diefy Ahmed Salem

doi:10.1109/access.2021.3064372

Abstract

We propose a high-speed convolutional neural network approach for weakly supervised localization (WSL) and weakly supervised object detection (WSOD). The proposed method, called center-of-features localization (COFL), localizes objects in a visual scene by combining multi-label classification with regression of the number of object instances in each class. A modified Xception network architecture is used as the main feature extractor, and a classification-plus-regression loss function is used to perform the detection task. The method does not require bounding box annotations; it needs only image labels and the count of objects of each class in the image. This combination produces a clear localization of objects in the scene through a masking technique between class activation maps (CAMs) and regression activation maps (RAMs). The proposed method was trained and tested on the PASCAL VOC2007 and VOC2012 datasets; it attained a mean average precision (mAP) of 47.0% and a correct localization (CorLoc) of 64.1% on PASCAL VOC2007, and a mAP of 42.3% and a CorLoc of 65.5% on PASCAL VOC2012, while performing object detection at a speed of ~50 fps. These results demonstrate that the network can perform object detection accurately in real time using only image labels and object counts, which are inexpensive to annotate compared with the bounding box annotations typically employed in fully supervised object detection methods. The network far outperforms other weakly supervised methods and some fully supervised methods in terms of processing time while achieving fair accuracy.
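As a rough illustration of what a classification-plus-regression loss can look like, the sketch below adds a multi-label binary cross-entropy term to a mean-squared-error term on the per-class object counts. The abstract does not give the exact form of the loss, so the choice of these two terms, the function name, and the `reg_weight` parameter are all assumptions made for illustration, not the authors' formulation.

```python
import math

def classification_plus_regression_loss(class_probs, class_labels,
                                        count_preds, count_targets,
                                        reg_weight=1.0):
    """Hypothetical combined loss for one image.

    class_probs   : predicted per-class probabilities in [0, 1]
    class_labels  : 0/1 image-level labels, one per class
    count_preds   : predicted number of instances per class
    count_targets : annotated number of instances per class
    reg_weight    : assumed weighting between the two terms
    """
    eps = 1e-7  # keeps log() finite for probabilities of exactly 0 or 1
    bce = 0.0
    for p, y in zip(class_probs, class_labels):
        p = min(max(p, eps), 1.0 - eps)
        bce -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    bce /= len(class_probs)

    # Mean squared error between predicted and annotated object counts.
    mse = sum((c - t) ** 2 for c, t in zip(count_preds, count_targets))
    mse /= len(count_preds)

    return bce + reg_weight * mse
```

With correct labels and counts the loss is close to zero, and a count error on any class raises it through the regression term alone, which is the sense in which the counts act as extra supervision.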

Highlights

Object detection is a computer vision task that is in high demand today.
Weakly supervised learning can solve the problem of expensive annotations from which fully supervised methods suffer, and it enables a network to learn the semantic features of each class of interest from inexpensive image-level labels.
We treat the small regions as object centers, taking the maximum value in each region as the center of the features, whereas visualization of the multi-label class activation maps indicates the extent of objects along the horizontal and vertical axes.
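The last highlight can be sketched in a few lines of plain Python. The sketch assumes that the small regions come from thresholding a regression activation map (RAM) and that the box is grown from each center along the row and column while the class activation map (CAM) stays above a second threshold. Both thresholds, the 4-connected flood fill, and the function name `localize_centers` are illustrative assumptions rather than details taken from the paper.

```python
def localize_centers(cam, ram, ram_thresh=0.5, cam_thresh=0.3):
    """Return [((cy, cx), (left, top, right, bottom)), ...] for one class.

    cam, ram : 2-D lists of equal size (hypothetical per-class maps).
    """
    h, w = len(cam), len(cam[0])
    seen = [[False] * w for _ in range(h)]
    detections = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or ram[r][c] < ram_thresh:
                continue
            # Flood-fill one small RAM region and remember its maximum,
            # which is taken as the center of the features.
            stack, best = [(r, c)], (r, c)
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                if ram[y][x] > ram[best[0]][best[1]]:
                    best = (y, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and ram[ny][nx] >= ram_thresh):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            cy, cx = best
            # Grow the box vertically and horizontally from the center
            # for as long as the CAM remains active.
            top = bottom = cy
            left = right = cx
            while top > 0 and cam[top - 1][cx] >= cam_thresh:
                top -= 1
            while bottom < h - 1 and cam[bottom + 1][cx] >= cam_thresh:
                bottom += 1
            while left > 0 and cam[cy][left - 1] >= cam_thresh:
                left -= 1
            while right < w - 1 and cam[cy][right + 1] >= cam_thresh:
                right += 1
            detections.append(((cy, cx), (left, top, right, bottom)))
    return detections
```

Calling `localize_centers(cam, ram)` on a pair of maps yields one center and one box per RAM region, so two nearby objects of the same class are separated as long as their RAM regions do not touch.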

Summary

Introduction

Object detection is a computer vision task that is in high demand today. It is used in hundreds of applications, such as self-driving vehicles, modern robots, medical equipment, and VR/AR. Because most of the successful methods are fully supervised, they rely on bounding box annotations, which are expensive and time-consuming to produce. These drawbacks make fully supervised methods inconvenient for training on custom datasets, since convolutional neural networks require a large number of training images as well as manual annotation of every object in those images. The challenge in weakly supervised learning is to find the best way to form the bounding box for each object of each class at an appropriate processing speed; the problem becomes even more complex when objects of the same class occlude one another.

Methods

Findings

Discussion

Conclusion