Abstract

Since large-scale surveillance videos always contain complex visual events, how to generate video descriptions effectively and efficiently without human supervision has become mandatory. To address this problem, we propose a novel architecture for jointly recognizing multiple events in a given surveillance video, motivated by the sequence to sequence network. The proposed architecture can predict what happens in a video directly without the preprocessing of object detection and tracking. We evaluate several variants of the proposed architecture with different visual features on a novel dataset perpared by our group. Moreover, we compute a wide range of quantitative metrics to evaluate this architecture. We further compare it to the popular Support Vector Machine-based visual event detection method. The comparison results suggest that the proposal method can outperform the traditional computer vision pipelines for visual event detection.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call