Compared to images, video, as an increasingly mainstream visual medium, carries richer semantic information. For this reason, the computational cost of video models is an order of magnitude larger than that of their image-level counterparts, and it grows with the square of the number of frames. Constrained by computational resources, training video models to learn long-term temporal semantics end-to-end is quite challenging. Currently, the mainstream practice is to split a raw video into clips, which fragments the temporal information flow and prevents long-term semantics from being modeled. To address this problem, in this paper we design the Markov Progressive framework (MaPro), a theoretical framework consisting of a progressive modeling method and a paradigm model tailored to it. Inspired by natural language processing techniques for handling long sentences, the core idea of MaPro is to find a paradigm model, built from the proposed Markov operators, that can be trained in multiple sequential steps while guaranteeing that this multi-step progressive modeling is equivalent to conventional end-to-end modeling. By training the paradigm model with the progressive method, we are able to model long videos end-to-end with limited resources and ensure the effective transmission of long-term temporal information. We provide detailed implementations of this theoretical system on mainstream CNN- and Transformer-based models, which are modified to conform to the Markov paradigm. The theoretical paradigm, as a basic model, serves as a lower bound on model efficiency; building on it, we further explore more sophisticated designs specific to CNN- and Transformer-based methods. As a general and robust training method, MaPro is experimentally shown to yield significant performance improvements across different backbones and datasets. As an illustrative example, the proposed method improves the SlowOnly network by 4.1 mAP on Charades and 2.5 top-1 accuracy on Kinetics, and it improves TimeSformer on Kinetics by 2.0 top-1 accuracy. Importantly, all these improvements are achieved with little parameter and computation overhead. We hope the MaPro method can provide the community with new insight into modeling long videos.
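To make the idea of multi-step progressive modeling concrete, below is a minimal conceptual sketch of clip-by-clip training with a carried-over temporal state, written in PyTorch-style Python. It is only an illustration of the general flavor described in the abstract: all names (e.g. ToyMarkovClipEncoder, progressive_train_step) and the GRU-style state update are hypothetical assumptions, not the authors' Markov operators, and the sketch does not reproduce the paper's equivalence guarantee with end-to-end training.

```python
# Hypothetical sketch: progressive, clip-by-clip training with a carried state.
# Not the MaPro implementation; names and the GRU-style update are assumptions.
import torch
import torch.nn as nn


class ToyMarkovClipEncoder(nn.Module):
    """Encodes one clip and updates a compact temporal state (illustrative only)."""

    def __init__(self, frame_dim=3 * 16 * 16, feat_dim=64, state_dim=64, num_classes=10):
        super().__init__()
        self.frame_net = nn.Sequential(nn.Flatten(1), nn.Linear(frame_dim, feat_dim), nn.ReLU())
        self.state_update = nn.GRUCell(feat_dim, state_dim)  # stand-in for a Markov operator
        self.head = nn.Linear(state_dim, num_classes)

    def forward(self, clip, state):
        # clip: (B, T, C, H, W) -> per-frame features, averaged over the clip
        b, t = clip.shape[:2]
        feats = self.frame_net(clip.flatten(0, 1)).view(b, t, -1).mean(1)
        state = self.state_update(feats, state)  # temporal info flows through `state`
        return self.head(state), state


def progressive_train_step(model, optimizer, video, label, clip_len=8, state_dim=64):
    """Train on one long video in sequential clip-level steps.

    Only one clip's activations are held in memory per step, while long-range
    temporal information is passed forward through the detached state.
    """
    criterion = nn.CrossEntropyLoss()
    state = torch.zeros(video.size(0), state_dim)
    for start in range(0, video.size(1), clip_len):
        clip = video[:, start:start + clip_len]
        logits, state = model(clip, state)
        loss = criterion(logits, label)   # toy choice: supervise every step
        optimizer.zero_grad()
        loss.backward()                   # backprop stays within the current step
        optimizer.step()
        state = state.detach()            # carry information forward, drop the graph


# Toy usage: a 32-frame "long" video processed in 4 progressive steps.
model = ToyMarkovClipEncoder()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
video = torch.randn(2, 32, 3, 16, 16)
label = torch.randint(0, 10, (2,))
progressive_train_step(model, opt, video, label)
```

The sketch shows why progressive training fits in limited memory: each backward pass covers a single clip, yet the carried state lets later steps see information from earlier ones. MaPro's contribution, per the abstract, is designing Markov operators for which this stepwise procedure is provably equivalent to end-to-end modeling, which the toy state update above does not attempt to guarantee.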