Abstract

Due to redundancy in the spatiotemporal neighborhood and the global dependency between video frames, video recognition remains a challenge. Prior works have mainly relied on 3D convolutional neural networks (CNNs) or 2D CNNs with well-designed modules for temporal information. However, convolution-based networks lack the capability to capture global dependency because of their limited receptive field. Alternatively, transformers have been proposed for video recognition to build long-range dependencies between frame patches. Nevertheless, most transformer-based networks incur significant computational cost because attention is calculated among all tokens. Based on these observations, we propose an efficient network which we dub LGANet. Unlike conventional CNNs and transformers for video recognition, LGANet tackles both spatiotemporal redundancy and dependency by learning local and global token affinity in the shallow and deep layers, respectively. Specifically, local attention is implemented in the shallow layers to reduce parameters and eliminate redundancy. In the deep layers, spatial-wise and channel-wise self-attention are embedded to capture the global dependency of high-level features. Moreover, several key designs are made in the multi-head self-attention (MSA) and feed-forward network (FFN). Extensive experiments are conducted on popular video benchmarks, including Kinetics-400 and Something-Something V1 & V2. Without any bells and whistles, LGANet achieves state-of-the-art performance. The code will be released soon.
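The abstract gives no implementation details (the code is not yet released), so the following is only a minimal PyTorch sketch of the described local/global split: attention restricted to local windows in the shallow stages, and global spatial-wise plus channel-wise self-attention in the deep stages. Every name, dimension, window size, and stage depth below (LocalWindowAttention, GlobalBlock, LGANetSketch, dim=96, window=49, etc.) is an illustrative assumption, not the authors' LGANet implementation.

```python
import torch
import torch.nn as nn


class LocalWindowAttention(nn.Module):
    """Shallow-stage attention restricted to non-overlapping token windows (assumed design)."""
    def __init__(self, dim, heads=4, window=49):
        super().__init__()
        self.window = window  # assumed number of tokens per local window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (batch, tokens, dim)
        b, n, d = x.shape
        w = self.window
        assert n % w == 0, "token count must be divisible by the window size"
        xw = x.reshape(b * (n // w), w, d)                 # attend only inside each window
        xw = self.attn(xw, xw, xw, need_weights=False)[0] + xw
        return xw.reshape(b, n, d)


class ChannelSelfAttention(nn.Module):
    """Channel-wise self-attention: channels act as tokens, giving a dim x dim attention map."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, tokens, dim)
        c = x.transpose(1, 2)                              # (batch, dim, tokens)
        attn = torch.softmax(c @ c.transpose(-2, -1) / c.shape[-1] ** 0.5, dim=-1)
        out = (attn @ c).transpose(1, 2)                   # back to (batch, tokens, dim)
        return self.proj(out) + x


class GlobalBlock(nn.Module):
    """Deep-stage block: global spatial MSA, channel-wise attention, then an FFN."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel = ChannelSelfAttention(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.spatial(h, h, h, need_weights=False)[0]
        x = self.channel(x)
        return x + self.ffn(self.norm2(x))


class LGANetSketch(nn.Module):
    """Toy stack: local-attention blocks first, global blocks after, then a classifier head."""
    def __init__(self, dim=96, num_classes=400, shallow=2, deep=2):
        super().__init__()
        self.shallow = nn.ModuleList([LocalWindowAttention(dim) for _ in range(shallow)])
        self.deep = nn.ModuleList([GlobalBlock(dim) for _ in range(deep)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):  # tokens: (batch, T*H*W patch tokens, dim)
        for blk in self.shallow:
            tokens = blk(tokens)
        for blk in self.deep:
            tokens = blk(tokens)
        return self.head(tokens.mean(dim=1))               # mean-pool tokens -> class logits


if __name__ == "__main__":
    video_tokens = torch.randn(2, 8 * 7 * 7, 96)           # e.g. 8 frames of 7x7 patches (assumed)
    print(LGANetSketch()(video_tokens).shape)              # -> torch.Size([2, 400])
```

The sketch only illustrates why the split saves computation: windowed attention in the shallow stages scales with the window size rather than the full token count, while the global spatial- and channel-wise attention is applied only in the deeper stages where features are already abstracted.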
