Abstract

Temporal motion information plays an important role in video understanding, human action recognition, and related fields. Optical flow, which carries rich temporal motion information, is widely used in visual tasks and achieves strong performance, but extracting it is time-consuming and costly. In this paper, we propose a Temporal Motion and Fusion (TMF) module, which consists of a motion extraction (ME) module and a temporal crossing fusion (TCF) module. The ME module replaces traditional optical flow: it establishes a matching relationship between adjacent frames on the convolutional feature maps and then extracts simple, effective short-term motion information. The TCF module crosses adjacent frames and fuses information from nonadjacent video frames to model long-term motion. Finally, the extracted motion information is fused with the appearance information captured by 2D convolution for final recognition. Experiments show that, with only a small increase in parameters and computational cost, our lightweight model achieves state-of-the-art results on Something-Something-V1&V2 and Diving-48, and competitive results on HMDB-51 and UCF-101 among single models.
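To make the two ideas concrete, the following NumPy sketch illustrates feature-level matching between adjacent frames and fusion across nonadjacent frames. It is only an illustration under stated assumptions, not the TMF implementation: the abstract does not specify the matching or fusion operators, so the local correlation window, the averaging rule, and the names `local_correlation`, `cross_frame_fusion`, `radius`, and `stride` are all hypothetical.

```python
import numpy as np


def local_correlation(f_a, f_b, radius=1):
    """Match each position of f_a against a small window of f_b.

    f_a, f_b: feature maps of two adjacent frames, shape (C, H, W).
    Returns scores of shape (K, H, W) with K = (2 * radius + 1) ** 2.
    Channel k = dy * (2 * radius + 1) + dx holds the channel-averaged dot
    product between f_a[:, y, x] and
    f_b[:, y + dy - radius, x + dx - radius] (zero outside the map).
    NOTE: an assumed stand-in for the matching step, not the paper's ME module.
    """
    channels, height, width = f_a.shape
    # Zero-pad the spatial dimensions so every displacement is defined.
    padded = np.pad(f_b, ((0, 0), (radius, radius), (radius, radius)))
    scores = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = padded[:, dy:dy + height, dx:dx + width]
            scores.append((f_a * shifted).sum(axis=0) / channels)
    return np.stack(scores)


def cross_frame_fusion(motion, stride=2):
    """Average each time step with the steps `stride` frames before and after.

    motion: per-frame motion features, shape (T, C, H, W).
    Steps without a neighbour at that distance average over fewer terms.
    NOTE: an assumed stand-in for long-term fusion, not the paper's TCF module.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    num_steps = motion.shape[0]
    fused = motion.astype(float)  # astype returns a copy
    counts = np.ones(num_steps)
    if stride < num_steps:
        fused[stride:] += motion[:-stride]   # contribution from t - stride
        counts[stride:] += 1
        fused[:-stride] += motion[stride:]   # contribution from t + stride
        counts[:-stride] += 1
    return fused / counts[:, None, None, None]


# Usage: a single active feature moves one pixel to the right between frames.
frame_a = np.zeros((1, 5, 5))
frame_a[0, 2, 2] = 1.0
frame_b = np.zeros((1, 5, 5))
frame_b[0, 2, 3] = 1.0
corr = local_correlation(frame_a, frame_b, radius=1)
best = int(corr[:, 2, 2].argmax())  # index of the best-matching displacement
```

In this sketch the correlation volume plays the role that an optical-flow field would otherwise play: the best-scoring displacement at each position indicates where the feature moved, and it is computed directly on feature maps rather than on raw pixels.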
