Abstract

Ultra-high-resolution aerial videos are used to relieve the shortage of surveillance systems in sparsely populated regions. For practical applications, it is important to automatically analyze “who is doing what?” in such videos. Although atomic visual action (AVA) detection has been successfully used to recognize “who is doing what?” in movie data, it is challenging to adapt it to ultra-high-resolution aerial videos, where the target persons are relatively tiny and sparsely located. Besides, due to the lack of evaluation metrics, AVA detection has been evaluated with single-label actions; however, evaluating with multi-label actions is more reasonable, since several actions can be performed simultaneously by a person (e.g., making a phone call and walking). To tackle these issues, we propose a novel framework for multi-label AVA detection in ultra-high-resolution aerial videos and introduce novel metrics for multi-label AVA detection evaluation. The experimental results demonstrate that our framework outperforms other methods for interpreting “who is doing what?” in our target task.

Highlights

  • Surveillance cameras are commonly installed in city regions to increase public safety

  • We provide novel metrics for multi-label Atomic Visual Action (AVA) detection evaluation, which contributes to the general AVA detection studies

  • The concatenation of 2D CNN features and their attention-weighted counterparts (the features multiplied by attention maps) is used by a 3D ConvNet to estimate multi-label action classes. This is a multi-label AVA detection design tailored to aerial surveillance videos
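The fusion step described in the highlight above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the tensor shapes, the single-channel attention map, and the channel-wise concatenation are assumptions made for the sketch, and a real system would feed the fused clip tensor to a trained 3D ConvNet.

```python
import numpy as np

def fuse_frame_features(feats, attn):
    """Concatenate per-frame 2D CNN features with their attention-weighted copies.

    feats: (T, C, H, W) per-frame 2D CNN feature maps
    attn:  (T, 1, H, W) attention maps in [0, 1], broadcast over channels
    returns: (T, 2C, H, W) fused clip tensor for a 3D ConvNet
    """
    attended = feats * attn                      # element-wise attention weighting
    return np.concatenate([feats, attended], axis=1)

def multilabel_scores(logits):
    """Per-class sigmoid: each action is an independent binary decision,
    so a person can score high on several actions at once."""
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 7, 7))   # 8 frames, 16 channels (assumed sizes)
attn = rng.uniform(0.0, 1.0, (8, 1, 7, 7))
clip = fuse_frame_features(feats, attn)
print(clip.shape)  # (8, 32, 7, 7)
```

Concatenating rather than replacing the raw features lets the downstream network see both the original context and the attention-focused signal.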


Summary

INTRODUCTION

Surveillance cameras are commonly installed in city regions to increase public safety. Aerial surveillance videos have some special properties, and existing AVA detection methods may not work properly on them. These special properties include: (1) to capture visual details from the sky, each frame is recorded at ultra-high resolution; (2) target persons are relatively tiny and sparsely located. Some existing methods divide the entire aerial image into patches by a sliding window [6]–[8]. Although such methods have considerably improved object detection performance, they are inefficient when target objects are sparsely located. Since non-target objects might be included in the spatio-temporal tubes, action recognition performance could be affected. To tackle this issue, we assume the target person can be consistently observed in his/her spatio-temporal tube while others may not. Our contributions include: (1) proposing a novel framework for multi-label AVA detection on aerial surveillance videos, which outperforms other methods in our experiments; (2) providing novel metrics for multi-label AVA detection evaluation. To the best of our knowledge, existing metrics cannot be applied to multi-label AVA detection, and we are the first to introduce such metrics.
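The sliding-window patching criticized above can be sketched to make the inefficiency concrete: every patch of an ultra-high-resolution frame must be processed, even though most contain no person. The patch size, stride, and frame resolution below are assumptions for illustration, not values from the cited methods.

```python
def sliding_windows(h, w, patch, stride):
    """Return the top-left corners of overlapping patch x patch windows
    covering an h x w frame; the last row/column is snapped to the border
    so the whole frame is covered."""
    def starts(extent):
        last = max(extent - patch, 0)
        s = list(range(0, last + 1, stride))
        if s[-1] != last:
            s.append(last)           # make sure coverage reaches the border
        return s
    return [(y, x) for y in starts(h) for x in starts(w)]

# A 4K frame tiled with 512-pixel patches and 25% overlap (assumed numbers):
windows = sliding_windows(2160, 3840, patch=512, stride=384)
print(len(windows))  # 60 patches, all processed even if only a few contain people
```

With sparsely located persons, most of these patches are wasted computation, which motivates tracking each person in a spatio-temporal tube instead of exhaustively scanning every frame.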

RELATED WORKS
EVALUATION METRICS FOR MULTI-LABEL AVA DETECTION
CONCLUSION