Abstract

Despite the remarkable progress in video-based action recognition over the past several years, current state-of-the-art approaches rely heavily on optical flow as the motion representation. However, computing optical flow in advance is computationally expensive, which prevents action recognition from running in real time. In this paper, we shed light on fast action recognition by removing the reliance on optical flow. Inspired by the Persistence of Vision phenomenon in the human visual system, we design a novel motion cue called Persistence of Appearance (PA), which enables the network to distill motion information directly from adjacent RGB frames. Our PA is derived from optical flow and focuses on the small displacements at motion boundaries. Compared with other motion representations, PA enables the network to achieve competitive accuracy on UCF101, while its inference speed reaches 1855 fps, over 120x faster than that of traditional optical-flow-based methods. In addition, we devise a decision strategy called Various-timescale Inference Pooling (VIP) that endows the network with long-range temporal modeling across multiple timescales. We further combine the proposed PA and VIP into a unified framework called the Persistent Appearance Network (PAN). Compared with methods that use only RGB frames, our carefully designed PAN achieves state-of-the-art results on three benchmark datasets: UCF101, HMDB51 and Kinetics, reaching 96.2%, 74.8% and 82.5% accuracy respectively at a run-time speed of up to 595 fps. The code for this project is available at: https://github.com/zhang-can/PAN-PyTorch .
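
To make the idea concrete, below is a minimal PyTorch-style sketch of a PA-like motion cue based only on the description above: adjacent RGB frames are passed through a shared low-level convolution and their per-pixel feature differences are aggregated, so that small displacements at motion boundaries stand out. The module name, channel count and aggregation here are illustrative assumptions, not the authors' exact formulation; see the repository linked above for the actual implementation.

    # Sketch of a PA-style motion cue (assumed design, not the official PAN code).
    import torch
    import torch.nn as nn

    class PersistenceOfAppearanceSketch(nn.Module):
        def __init__(self, feat_channels: int = 8):
            super().__init__()
            # Shared shallow convolution applied to every frame (assumption).
            self.lowlevel_conv = nn.Conv2d(3, feat_channels, kernel_size=3, padding=1)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, time, 3, H, W) clip of consecutive RGB frames.
            b, t, c, h, w = frames.shape
            feats = self.lowlevel_conv(frames.view(b * t, c, h, w)).view(b, t, -1, h, w)
            # Difference of adjacent feature maps, aggregated over channels, gives one
            # map per frame pair that highlights moving edges (motion boundaries).
            diff = feats[:, 1:] - feats[:, :-1]        # (b, t-1, C, H, W)
            pa = diff.pow(2).sum(dim=2)                # (b, t-1, H, W)
            return pa

    if __name__ == "__main__":
        clip = torch.randn(2, 8, 3, 112, 112)          # an 8-frame RGB clip
        pa_maps = PersistenceOfAppearanceSketch()(clip)
        print(pa_maps.shape)                           # torch.Size([2, 7, 112, 112])

Because such a cue needs only a single shallow convolution and element-wise operations, it avoids the iterative optimization that makes dense optical flow expensive, which is the source of the speed-up reported in the abstract.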
