Abstract

The two-stream convolutional network has proved to be a milestone in video-based action recognition. Many recent works modify the internal structure of the two-stream network directly and feed its top-level features into a 2D/3D convolutional fusion module or a simpler one. However, these fusion methods do not fully exploit the available features, and fusing only top-level features discards rich, vital details. To tackle these issues, we propose a novel network called the Diverse Features Fusion Network (DFFN). The fusion stream of DFFN contains two types of purpose-built modules, the diverse compact bilinear fusion (DCBF) module and the channel-spatial attention (CSA) module, which distill and refine diverse compact spatiotemporal features. The DCBF modules use the diverse compact bilinear algorithm to fuse features extracted from multiple layers of the base network, which we call diverse features in this paper. The CSA module then leverages channel attention and multi-size spatial attention to boost key information and suppress noise in the fused features. We evaluate our three-stream network DFFN on three challenging public video action benchmarks: UCF101, HMDB51 and Something-Something V1. Experimental results indicate that our method achieves state-of-the-art performance.
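
The abstract does not spell out the diverse compact bilinear algorithm, so the following is only a minimal sketch of the generic compact bilinear idea that such a fusion module might build on, and not the DCBF module itself. It assumes the standard Tensor Sketch formulation: each feature vector is projected with a random count sketch, and the two sketches are combined by circular convolution, which equals a count sketch of the full outer product. All function names, the output size d, and the seeds are hypothetical choices made for illustration.

```python
import random


def make_sketch_params(n, d, seed):
    """Draw a random bucket index and a random sign for each of n input dims."""
    rng = random.Random(seed)
    buckets = [rng.randrange(d) for _ in range(n)]
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    return buckets, signs


def count_sketch(x, buckets, signs, d):
    """Project x into d dims: each entry is added, with its sign, to one bucket."""
    out = [0.0] * d
    for i, value in enumerate(x):
        out[buckets[i]] += signs[i] * value
    return out


def circular_convolve(a, b):
    """Direct O(d^2) circular convolution; real code would typically use an FFT."""
    d = len(a)
    out = [0.0] * d
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % d] += ai * bj
    return out


def compact_bilinear(x, y, params_x, params_y, d):
    """Fuse x and y without ever forming the len(x) * len(y) outer product."""
    sketch_x = count_sketch(x, params_x[0], params_x[1], d)
    sketch_y = count_sketch(y, params_y[0], params_y[1], d)
    return circular_convolve(sketch_x, sketch_y)


def sketch_of_outer_product(x, y, params_x, params_y, d):
    """Reference: count sketch applied to the explicit outer product of x and y."""
    (bx, sx), (by, sy) = params_x, params_y
    out = [0.0] * d
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[(bx[i] + by[j]) % d] += sx[i] * sy[j] * xi * yj
    return out


if __name__ == "__main__":
    d = 8  # hypothetical size of the fused feature
    params_a = make_sketch_params(4, d, seed=0)
    params_b = make_sketch_params(3, d, seed=1)
    feat_a = [1.0, -2.0, 0.5, 3.0]  # stand-in for one layer's features
    feat_b = [2.0, 0.0, -1.0]       # stand-in for another layer's features
    print(compact_bilinear(feat_a, feat_b, params_a, params_b, d))
```

The fused vector has a fixed length d regardless of the input sizes, which is what makes a bilinear interaction between features from several layers affordable. How DFFN selects layers, and how its "diverse" variant differs from this generic form, is not stated in the abstract.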

