Abstract

Action recognition is a fundamental task in video understanding. Although deep learning has achieved impressive performance on static image-based action recognition (e.g., Stanford40), real-time video-based action recognition remains challenging due to the high complexity and computational cost of video. Motivated by humans' ability to recognize an action from only a short glance, we propose the fast, lightweight Temporal Memory Network (TMNet) for real-time video action recognition. TMNet uses a self-supervised structure to explore both spatial and temporal information from a single video frame. It has three main parts: a base backbone, a regression branch, and a classification branch. Specifically, the base backbone is a shallow 2D CNN that extracts the video's initial feature sequence. The classification branch builds on existing successful video recognition models (e.g., TSN, I3D). To let TMNet learn spatio-temporal information at a lower cost, we add a self-supervised regression branch, which is based on a lightweight 2D CNN and takes only one frame as input. In the training stage, the input to TMNet is a video sequence: the classification branch, together with the base backbone, learns the sequence's spatio-temporal features, while the self-supervised regression branch learns the same spatio-temporal features under the supervision of the classification branch's output, taking as input a single-frame feature sampled from the encoded video sequence. In this way, the regression branch is forced to infer the temporal information of adjacent frames from a single frame, so in the inference stage TMNet needs only one frame to predict each video's spatio-temporal information. As a result, TMNet achieves real-time action recognition with improved accuracy by extracting temporal information from a static image. Extensive ablation experiments demonstrate that TMNet offers a good trade-off between accuracy and speed.
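
The training scheme the abstract describes (a single-frame regression branch supervised by a sequence-level classification branch) can be sketched as follows. This is a minimal PyTorch-style illustration under assumed details: the class name TMNetSketch, the layer sizes, the TSN-style average pooling, the MSE regression loss, and the random frame-sampling strategy are all our assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TMNetSketch(nn.Module):
    """Illustrative sketch of the TMNet training setup.

    All module choices (depths, dimensions, pooling, loss weights)
    are assumptions for clarity, not the authors' exact design.
    """

    def __init__(self, num_classes=40, feat_dim=256):
        super().__init__()
        # Base backbone: a shallow 2D CNN applied to each frame.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Classification branch: here a simple TSN-style average over
        # frame features followed by a linear classifier.
        self.classifier = nn.Linear(feat_dim, num_classes)
        # Self-supervised regression branch: maps a single-frame
        # feature toward the sequence-level spatio-temporal feature.
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, clip):
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        frame_feats = self.backbone(clip.reshape(b * t, c, h, w))
        frame_feats = frame_feats.reshape(b, t, -1)

        # Sequence-level feature from the full clip (training only).
        seq_feat = frame_feats.mean(dim=1)

        # A single frame feature sampled from the encoded sequence.
        idx = torch.randint(0, t, (b,), device=clip.device)
        one_frame = frame_feats[torch.arange(b), idx]
        pred_feat = self.regressor(one_frame)
        return seq_feat, pred_feat

def training_step(model, clip, labels):
    seq_feat, pred_feat = model(clip)
    cls_loss = F.cross_entropy(model.classifier(seq_feat), labels)
    # The regression branch is supervised by the classification
    # branch's feature; detach so gradients do not flow into the
    # teacher through the regression loss.
    reg_loss = F.mse_loss(pred_feat, seq_feat.detach())
    return cls_loss + reg_loss
```

At inference, only the single-frame path is needed, e.g. logits = model.classifier(model.regressor(model.backbone(frame))) for a frame of shape (batch, 3, H, W), which is what makes the real-time single-frame prediction possible under this sketch.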
