Abstract

Although two-stage video object detectors cannot perform detection in real time, their accuracy is normally higher than that of one-stage video object detectors. One reason is that two-stage detectors can easily use feature information from adjacent frames to augment key-frame features. How to extract and exploit temporal features in the video stream for one-stage detectors needs further exploration. CenterNet is an anchor-free one-stage object detector that regresses bounding boxes from heatmap peaks. We propose to use the detected peaks and the regressed boxes that encompass the peak points to determine heatmap ROIs, which serve as the extracted object heatmap features. A new relation module is designed to evaluate the similarity of heatmap ROI features and output relation features that effectively augment the heatmap ROI features. In the video sequence, the heatmap ROIs of multiple adjacent frames are aggregated into the heatmap ROI of the key frame. Compared to CenterNet and other CenterNet-based video object detectors, our method achieves improved online real-time performance on the ImageNet VID dataset, with 78.8% mAP at 36 FPS.
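To fix ideas, the following is a minimal PyTorch-style sketch of a relation step of the kind described above: support-frame heatmap ROI features are weighted by their similarity to the key-frame ROI features and aggregated back onto the key frame. The module name `HeatmapROIRelation`, the feature dimensions, and the dot-product similarity weighting are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: a simplified relation module that weights heatmap ROI
# features from adjacent frames by their similarity to the key-frame ROI features
# and aggregates them onto the key frame. Names, dimensions, and the dot-product
# similarity are assumptions, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeatmapROIRelation(nn.Module):
    def __init__(self, feat_dim=64, embed_dim=32):
        super().__init__()
        # Linear embeddings used to compare key-frame and support-frame ROI features.
        self.query = nn.Linear(feat_dim, embed_dim)
        self.key = nn.Linear(feat_dim, embed_dim)
        self.value = nn.Linear(feat_dim, feat_dim)

    def forward(self, key_rois, support_rois):
        # key_rois:     (N, feat_dim) ROI features from the key frame
        # support_rois: (M, feat_dim) ROI features pooled from adjacent frames
        q = self.query(key_rois)                      # (N, embed_dim)
        k = self.key(support_rois)                    # (M, embed_dim)
        v = self.value(support_rois)                  # (M, feat_dim)

        # Pairwise similarity between key-frame and support-frame ROIs.
        sim = q @ k.t() / k.shape[-1] ** 0.5          # (N, M)
        weights = F.softmax(sim, dim=-1)              # relation weights

        # Relation features: similarity-weighted aggregation of support ROIs.
        relation = weights @ v                        # (N, feat_dim)

        # Augment the key-frame ROI features with the aggregated relation features.
        return key_rois + relation


# Example usage with random tensors standing in for pooled heatmap ROI features.
if __name__ == "__main__":
    module = HeatmapROIRelation(feat_dim=64)
    key_frame_rois = torch.randn(5, 64)    # 5 detected peaks in the key frame
    adjacent_rois = torch.randn(12, 64)    # ROIs gathered from adjacent frames
    augmented = module(key_frame_rois, adjacent_rois)
    print(augmented.shape)                 # torch.Size([5, 64])
```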
