Abstract

In temporal action localization (TAL), semi-supervised learning is a promising technique to mitigate the cost of precise boundary annotations. Semi-supervised approaches employing consistency regularization (CR), encouraging models to be robust to the perturbed inputs, have achieved great success in image classification problems. The success of CR is largely depended on the perturbations, where instances are perturbed to train a robust model without altering their semantic information. However, the perturbations for image or video classification tasks are not fit to apply to TAL. Since videos in TAL are too long to train the model with raw videos in an end-to-end manner. In this paper, we devise a method named K-farthest crossover to construct perturbations based on video features and apply it to TAL. Motivated by the observation that features in the same action instance become more and more similar during the training process while those in different action instances or backgrounds become more and more divergent, we add perturbations to each feature along temporal axis and adopt CR to encourage the model to retain this observation. Specifically, for a feature, we first find the top-k dissimilar features and average them to form a perturbation. Then, similar to chromosomal crossover, we select a large part of the feature and a small part of the perturbation to recombine a perturbed feature, which preserves the feature semantics yet enough discrepancy.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call