Abstract

In this paper, we propose a weakly-supervised approach called Order-Constrained Representation Learning (OCRL) to predict future actions in instructional videos from partially observed action steps. Most conventional methods predict actions from partially observed video frames and thus mainly capture low-level semantics such as motion consistency. Unlike a single action, completing a task in an instructional video usually requires several action steps over a longer period. Motivated by the fact that the order of action steps is key to learning task semantics, we develop a new form of contrastive loss, called StepNCE, to integrate the semantic information shared between step order and task semantics within a memory-bank-based momentum-updating framework. Specifically, we learn video representations from trimmed video clips whose step order is rearranged, guided by the proposed task-consistency and order-consistency rules. The StepNCE loss is used to pre-train a video feature encoder, which is then fine-tuned to carry out the instructional video prediction task. Our approach exploits the sequential logic between the action steps of a task, pushing video understanding toward a higher semantic level. We evaluate our method on five popular instructional video and action prediction datasets: COIN, CrossTask, UT-Interaction, BIT-Interaction, and ActivityNet v1.2, and the results show that our approach improves over conventional prediction methods.
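To make the mechanism the abstract references concrete, the following is a minimal sketch (not the authors' code) of an InfoNCE-style contrastive loss with a memory bank of negatives, the general family StepNCE builds on. The class name, temperature, bank size, and the assumption that positives come from a momentum encoder (as in MoCo-style training) are all illustrative; the actual StepNCE additionally encodes the task-consistency and order-consistency rules, which are not reproduced here.

```python
# Hypothetical sketch of a memory-bank InfoNCE loss; all hyperparameters
# (dim, bank_size, temperature) are assumptions, not values from the paper.
import torch
import torch.nn.functional as F

class MemoryBankNCE(torch.nn.Module):
    def __init__(self, dim=128, bank_size=4096, temperature=0.07):
        super().__init__()
        self.t = temperature
        # Memory bank of normalized negative keys, refreshed as training proceeds.
        self.register_buffer("bank", F.normalize(torch.randn(bank_size, dim), dim=1))
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):
        # Overwrite the oldest bank entries with the newest keys (circular buffer).
        n = keys.shape[0]
        idx = torch.arange(self.ptr, self.ptr + n, device=keys.device) % self.bank.shape[0]
        self.bank[idx] = keys
        self.ptr = (self.ptr + n) % self.bank.shape[0]

    def forward(self, q, k_pos):
        # q: queries from the trainable encoder.
        # k_pos: positives from a momentum-updated encoder, e.g. a clip whose
        # step order remains consistent with the same task (the paper's rules
        # would decide positive/negative assignment here).
        q, k_pos = F.normalize(q, dim=1), F.normalize(k_pos, dim=1)
        l_pos = (q * k_pos).sum(dim=1, keepdim=True)   # (B, 1) positive logits
        l_neg = q @ self.bank.t()                      # (B, K) negative logits
        logits = torch.cat([l_pos, l_neg], dim=1) / self.t
        # The positive sits at index 0 for every query.
        labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)
        self.enqueue(k_pos.detach())
        return loss
```

In a pre-training loop of the kind the abstract describes, this loss would score the encoder's representation of an order-rearranged clip against a task-consistent positive and the bank of negatives; the pre-trained encoder is then fine-tuned for action prediction.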
