Abstract

Weakly supervised video action segmentation is a key problem in video content understanding. In this setting, the ground truth provides only the set of actions that occur in each video, not frame-level labels. A popular approach trains a neural network as the prediction model. Our key contribution is a new Hidden Markov Model (HMM) grounded on a Temporal Convolutional Network (TCN) that labels video frames and thereby generates a pseudo-ground truth for subsequent pseudo-supervised training. At test time, we use the Viterbi algorithm to generate candidate temporal action sequences and select the one with the maximum a posteriori probability. We evaluate action segmentation performance on the Breakfast dataset. Experiments on this dataset show that our model is effective.
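
To make the decoding step concrete, the following is a minimal sketch of generic log-space Viterbi decoding for an HMM. It is not the authors' implementation: the function name `viterbi`, the two-state toy model, and all probabilities are illustrative assumptions. In the setting described above, the per-frame scores would presumably come from the TCN, and any constraint imposed by the set of actions is not modeled here.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the highest-scoring state path and its log-score.

    log_init[s]     : log prior of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log score of state s at frame t
    (All names and shapes are assumptions made for this sketch.)
    """
    n_states = len(log_init)
    # Best log-score of any path ending in each state at frame 0.
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor state for landing in state s at frame t.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
            ptr.append(best_prev)
        backptr.append(ptr)
        score = new_score
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


# Toy example: 2 hypothetical actions, 4 frames, made-up probabilities.
log = math.log
init = [log(0.5), log(0.5)]
trans = [[log(0.8), log(0.2)],
         [log(0.2), log(0.8)]]
emit = [[log(0.9), log(0.1)],
        [log(0.9), log(0.1)],
        [log(0.2), log(0.8)],
        [log(0.2), log(0.8)]]
path, score = viterbi(init, trans, emit)
print(path)  # [0, 0, 1, 1]
```

The returned path assigns one state index per frame, so consecutive runs of the same index correspond to action segments. Working in log space avoids numerical underflow on long videos.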
