Abstract

Dual-stream architectures are frequently used to learn diverse features from videos. This paper introduces a Mixed Resolution Network (MixRes) that processes inputs with hybrid spatiotemporal resolutions: one input with high spatial and low temporal resolution, and one with low spatial and high temporal resolution. Mixing spatiotemporal resolutions lets the two streams focus independently on appearance encoding and motion encoding, and it also reduces the computational burden. Furthermore, exploiting the multi-layer structure of neural networks, the temporal stream of the proposed network is divided into different steps to capture short-term and long-term motion information. Finally, we design a Temporal Multiscale Motion Excitation (TMME) module, which enhances the motion-related channels of the video representation using multiscale temporal differences. We conduct extensive experiments on multiple action recognition benchmarks, including Something-Something V1 & V2 and Kinetics-400. The results show that the proposed method achieves superior action recognition performance at low computational cost compared with state-of-the-art methods.
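The mixed-resolution idea can be illustrated with a minimal sketch. It is not the MixRes implementation: the function names, the nested-list clip layout, and the stride values are all assumptions made for illustration, and plain strided subsampling stands in for whatever resizing the authors actually use. It only shows how one clip can yield a high-spatial, low-temporal input and a low-spatial, high-temporal input, and how that shrinks the amount of input data.

```python
def make_mixed_resolution_inputs(clip, temporal_stride=4, spatial_stride=2):
    """Split one clip into two inputs with complementary resolutions.

    clip is a nested list indexed as clip[t][y][x] (grayscale for brevity).
    The stride values are illustrative assumptions, not settings from the paper.
    """
    # Appearance input: full spatial size, every temporal_stride-th frame.
    appearance = clip[::temporal_stride]
    # Motion input: all frames, each subsampled by spatial_stride in y and x.
    motion = [
        [row[::spatial_stride] for row in frame[::spatial_stride]]
        for frame in clip
    ]
    return appearance, motion


def shape(video):
    """Return (frames, height, width) of a nested-list video."""
    return len(video), len(video[0]), len(video[0][0])


def num_values(video):
    """Count the pixel values in a video, a rough proxy for input cost."""
    frames, height, width = shape(video)
    return frames * height * width


# Toy clip: 16 frames of 8x8 pixels.
clip = [[[t + y + x for x in range(8)] for y in range(8)] for t in range(16)]
appearance, motion = make_mixed_resolution_inputs(clip)
```

For this toy clip, the appearance input has shape (4, 8, 8) and the motion input has shape (16, 4, 4). Together they hold 512 values, against 2048 if both streams received the full 16-frame, 8x8 clip. Input size is only a rough proxy for compute, but it conveys why mixing resolutions can lower the cost.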
