Abstract

Recent years have witnessed the effectiveness of two-stream attention networks for video action recognition. However, most methods adopt the same structure for the spatial stream and the temporal stream, which produces a large amount of redundant information and often ignores the relevance among channels. In this paper, we propose a channel-wise spatial attention with spatiotemporal heterogeneous framework, a new approach to action recognition. First, we employ two different network structures for the spatial stream and the temporal stream to improve the performance of action recognition. Then, we design a channel-wise network and a spatial network, inspired by the self-attention mechanism, to obtain fine-grained and salient information from the video. Finally, the video feature for action recognition is generated by end-to-end training. Experimental results on the HMDB51 and UCF101 datasets show that our method can effectively recognize the actions in videos.
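To make the idea of self-attention-inspired channel-wise and spatial attention concrete, the following is a minimal NumPy sketch of one common formulation. It is an illustration under stated assumptions, not the architecture of this paper: the abstract does not specify the layers, so the affinity computation, the residual connection, the order of the two modules, and the function names `channel_attention` and `spatial_attention` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention(feat):
    """Reweight channels by their pairwise affinity (assumed formulation).

    feat: array of shape (C, H, W), a single-frame feature map.
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)              # (C, N), N = H * W positions
    affinity = softmax(x @ x.T, axis=-1)    # (C, C): channel-to-channel weights
    out = affinity @ x                      # each channel mixes related channels
    return out.reshape(c, h, w) + feat      # residual connection (assumption)


def spatial_attention(feat):
    """Reweight positions by their pairwise affinity (assumed formulation)."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)              # (C, N)
    affinity = softmax(x.T @ x, axis=-1)    # (N, N): row i weights all positions j
    out = x @ affinity.T                    # out[:, i] = sum_j affinity[i, j] * x[:, j]
    return out.reshape(c, h, w) + feat      # residual connection (assumption)


# Toy feature map: 8 channels on a 4 x 4 grid (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
refined = spatial_attention(channel_attention(feat))
print(refined.shape)
```

In this sketch the channel module captures the relevance among channels, while the spatial module highlights salient positions. Both preserve the input shape, so in a two-stream setting either stream's feature map could pass through them before classification.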
