The video recognition community is undergoing a significant shift in backbone architectures from CNNs to transformers. However, vision transformers, which have proven effective in image tasks, do not directly model the spatio-temporal structure arising from the temporal dimension of video. In addition, current transformer-based video models exclusively use either class tokens or visual tokens for classification, failing to combine the merits of the two types of tokens. To address these problems, we propose a novel video second-order transformer network (ViSoT). Since video contains complex motion and appearance information, the visual tokens of the last ViSoT layer are aggregated by cross-covariance pooling to model spatio-temporal information, and the result is combined with the class token for classification. A fast singular value power normalization is further introduced to achieve effective aggregation of the visual tokens. For temporal modeling, temporal convolutions are performed in the early stage, after the convolution stem and before the ViSoT blocks, via a shortcut connection. Within the ViSoT blocks, a token shift module and space-time attention are proposed to model temporal relations and spatio-temporal interactions across adjacent frames, respectively. Evaluation of ViSoT on four video benchmarks demonstrates its effectiveness and efficiency, performing comparably to or better than state-of-the-art methods with fewer parameters and GFLOPs.
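The following is a minimal sketch (not the authors' implementation) of the second-order aggregation idea described above: visual tokens are pooled into a cross-covariance matrix, power-normalized via its singular values, and the resulting descriptor is fused with the class-token prediction. Tensor shapes, the exponent `alpha`, and the simple additive fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn


def cov_pool_power_norm(tokens: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """tokens: (B, N, C) visual tokens from the last transformer layer."""
    tokens = tokens - tokens.mean(dim=1, keepdim=True)          # center over tokens
    cov = tokens.transpose(1, 2) @ tokens / tokens.shape[1]      # (B, C, C) cross-covariance
    U, S, Vh = torch.linalg.svd(cov)                             # SVD of the symmetric PSD matrix
    cov_pn = U @ torch.diag_embed(S.clamp(min=1e-6) ** alpha) @ Vh  # singular value power normalization
    return cov_pn.flatten(1)                                     # (B, C*C) second-order descriptor


class SecondOrderHead(nn.Module):
    """Hypothetical head combining the class token with the pooled visual tokens."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.fc_cls = nn.Linear(dim, num_classes)        # class-token branch
        self.fc_cov = nn.Linear(dim * dim, num_classes)  # second-order (covariance) branch

    def forward(self, cls_token: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        logits_cls = self.fc_cls(cls_token)
        logits_cov = self.fc_cov(cov_pool_power_norm(visual_tokens))
        return logits_cls + logits_cov                   # fuse the two predictions (assumed fusion)
```

In this sketch the power normalization is computed with a full SVD; the paper's "fast" variant presumably replaces this with a cheaper approximation, which is not reproduced here.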