Abstract

Action segmentation aims to temporally locate and classify segments in long untrimmed videos, a task of particular interest to applications such as surveillance and robotics. While most existing methods tackle this task by predicting frame-wise probabilities and adjusting them via high-level temporal models, recent approaches classify every video frame directly with temporal convolutions. However, the quality of these predictions is limited by ambiguous information in the video frames. To address this limitation, we propose an end-to-end multi-stage architecture, the Gated Forward Refinement Network (G-FRNet). In G-FRNet, each stage makes a prediction that is progressively refined by the next stage. Specifically, the network adaptively corrects errors in the prediction from the previous stage, and an effective gate unit controls the refinement process. Moreover, to efficiently optimize G-FRNet, we design an objective function that combines a classification loss with a multi-stage sequence-level refinement loss, which incorporates the segmental edit score via policy gradient. Extensive evaluation on three challenging datasets (50Salads, Georgia Tech Egocentric Activities (GTEA), and Breakfast) shows that our method achieves state-of-the-art results.
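
For readers unfamiliar with the segmental edit score mentioned above, the following is a minimal sketch of the metric as it is commonly defined in the action segmentation literature: frame-wise labels are collapsed into an ordered list of segment labels, and the normalized Levenshtein distance between the predicted and ground-truth lists is turned into a score out of 100. The abstract does not give the exact formulation used in G-FRNet, so this is an assumption, and the function names (`to_segments`, `levenshtein`, `segmental_edit_score`) are hypothetical.

```python
from itertools import groupby


def to_segments(frame_labels):
    """Collapse consecutive identical frame labels into one segment label each."""
    return [label for label, _ in groupby(frame_labels)]


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def segmental_edit_score(pred_frames, true_frames):
    """Score in [0, 100]; assumes the common normalization by the longer segment list."""
    pred = to_segments(pred_frames)
    true = to_segments(true_frames)
    longest = max(len(pred), len(true))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - levenshtein(pred, true) / longest)
```

Because the score depends only on the order of segments and not on their exact boundaries, it penalizes over-segmentation rather than small temporal shifts. It is also not differentiable with respect to the network outputs, which is consistent with the abstract's use of policy gradient to incorporate it into the loss.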
