The computational complexity of video models grows linearly with the square of the number of frames. Constrained by computational resources, training video models to learn long-term temporal semantics end-to-end is therefore challenging. The current mainstream approach splits a raw video into clips, which fragments the temporal information flow and fails to model long-term semantics. In this paper, we design the Markov Progressive framework (MaPro), a theoretical framework consisting of a progressive modeling method and a paradigm model tailored to it. The core idea of MaPro is to find a paradigm model, built from the proposed Markov operators, that can be trained in multiple sequential steps while ensuring that the multi-step progressive modeling is equivalent to conventional end-to-end modeling. By training the paradigm model with the progressive method, we can model long videos end-to-end with limited resources and ensure the effective transmission of long-term temporal information. We provide implementations of this theoretical system on mainstream CNN- and Transformer-based models, which are modified to conform to the Markov paradigm. As a general and robust training method, it experimentally yields significant performance improvements across different backbones and datasets. For example, the proposed method improves the SlowOnly network by 4.1 mAP on Charades and 2.5 top-1 accuracy on Kinetics, and for TimeSformer, MaPro improves performance on Kinetics by 2.0 top-1 accuracy. Importantly, all these improvements are achieved with little parameter and computation overhead.
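To make the progressive idea concrete, the following is a minimal, hypothetical PyTorch sketch of clip-wise training in which a carried state links consecutive clips of one long video. All names here (ClipEncoder, progressive_step, the state-passing scheme) are illustrative assumptions for exposition only and are not the paper's actual MaPro Markov operators or paradigm model.

```python
# Hypothetical sketch (not the paper's MaPro operators): progressive, clip-wise
# training where a carried state transmits temporal context across clip boundaries.
import torch
import torch.nn as nn

class ClipEncoder(nn.Module):
    """Toy per-clip encoder standing in for a CNN/Transformer video backbone."""
    def __init__(self, dim=256, num_classes=400):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip_feats, prev_state):
        # Condition the current clip on the state carried from the previous clip.
        h = self.proj(clip_feats) + prev_state.unsqueeze(1)   # (B, T, dim)
        logits = self.head(h.mean(dim=1))                      # clip-level prediction
        return logits, h[:, -1]                                # new carried state

def progressive_step(model, clips, label, optimizer, dim=256):
    """Train on one long video split into clips, one clip per forward/backward pass."""
    criterion = nn.CrossEntropyLoss()
    state = torch.zeros(clips[0].size(0), dim)
    for clip in clips:                     # sequential steps over the whole video
        logits, state = model(clip, state)
        loss = criterion(logits, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Detach so each step fits in memory; temporal context still flows via `state`.
        state = state.detach()
```

In this sketch, memory cost per step depends only on the clip length, while the carried state lets information from earlier clips influence later ones; whether such a scheme is exactly equivalent to end-to-end training is precisely the property MaPro's Markov operators are designed to guarantee.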