Abstract

In this paper, we propose an approach for generating comprehensive video interpretations for surveillance video understanding in the Internet of Things. A key problem in many visual learning tasks is how to adaptively select and fuse diverse and complementary features for video representation. We design the attention-in-attention (AIA) network to hierarchically explore attention fusion in an end-to-end manner, and demonstrate the value of this model on the multievent recognition and video captioning challenges. Specifically, the network consists of multiple encoder attention modules (EAMs) and a fusion attention module (FAM). Each EAM highlights the space-specific features by selecting the most salient visual features or semantic attributes and averaging them into one attentive feature. The FAM suppresses or enhances the activation of the multispace attentive features and adaptively co-embeds them into a comprehensive video representation. A long short-term memory (LSTM) unit then decodes the video representation to generate multiple event labels or video captions. This architecture is capable of: 1) adaptively learning salient space-specific feature representations and 2) co-embedding multispace attentive features into one space for feature fusion. Experiments on the surveillance video dataset (concurrent event dataset) and on popular video captioning datasets (Microsoft Research Video Description Corpus and MSR-Video to Text) show that the proposed AIA achieves performance competitive with the state of the art.
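
The two-level attention described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring functions, so the additive (tanh) scoring, the softmax gating in the fusion step, the use of a decoder state as the query, and all names and dimensions (`encoder_attention`, `fusion_attention`, `rand_matrix`, `att_dim`, `shared_dim`) are assumptions chosen for illustration. Weights are random and untrained, and the LSTM decoder is omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def encoder_attention(features, query, W_f, W_q, v):
    """EAM sketch (assumed additive scoring): weight each feature of one
    space by its relevance to the query, then average into one vector."""
    q_proj = matvec(W_q, query)
    scores = []
    for f in features:
        f_proj = matvec(W_f, f)
        hidden = [math.tanh(a + b) for a, b in zip(f_proj, q_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(features[0])
    attended = [sum(w * f[d] for w, f in zip(weights, features))
                for d in range(dim)]
    return attended, weights


def fusion_attention(attended_list, query, projections, W_q, v):
    """FAM sketch (assumed softmax gating): project each space's attentive
    feature into a shared space, weight the spaces, and sum them."""
    q_proj = matvec(W_q, query)
    embedded = [matvec(P, a) for P, a in zip(projections, attended_list)]
    scores = [sum(vi * math.tanh(e + q) for vi, e, q in zip(v, emb, q_proj))
              for emb in embedded]
    weights = softmax(scores)
    shared_dim = len(embedded[0])
    fused = [sum(w * emb[d] for w, emb in zip(weights, embedded))
             for d in range(shared_dim)]
    return fused, weights


def rand_matrix(rows, cols, rng):
    """Hypothetical helper: small random matrix standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
query_dim, att_dim, shared_dim = 4, 3, 5
# Stand-in for the decoder state that would drive attention at one step.
query = [rng.uniform(-1, 1) for _ in range(query_dim)]

# Two illustrative feature spaces: (number of items, feature dimension).
spaces = {"visual": (6, 8), "semantic": (10, 7)}
attended_list, projections = [], []
for name, (n_items, feat_dim) in spaces.items():
    feats = rand_matrix(n_items, feat_dim, rng)
    attended, _ = encoder_attention(
        feats, query,
        rand_matrix(att_dim, feat_dim, rng),
        rand_matrix(att_dim, query_dim, rng),
        [rng.uniform(-1, 1) for _ in range(att_dim)],
    )
    attended_list.append(attended)
    projections.append(rand_matrix(shared_dim, feat_dim, rng))

fused, fusion_weights = fusion_attention(
    attended_list, query, projections,
    rand_matrix(shared_dim, query_dim, rng),
    [rng.uniform(-1, 1) for _ in range(shared_dim)],
)
print(len(fused), fusion_weights)
```

In this sketch the inner attention reduces each space to a single vector of that space's own dimension, and the outer attention is what maps the differently sized vectors into one shared space before fusing them. In a full model, `fused` would be fed to the LSTM decoder at each step.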
