Abstract

Temporal action localization requires a model to recognize both the temporal locations and the classes of action instances in a video. The main challenge in temporal action detection is that videos are often long and untrimmed and contain varied action content. Existing temporal action detection frameworks exhibit a gap between the training and testing phases that is detrimental to model performance. Specifically, all positive samples are treated identically during training. During testing, by contrast, only the positive samples with the best classification and localization scores are selected, while all others are suppressed. To mitigate this gap, we build an auxiliary branch that unifies the training and testing procedures. Within this branch, we design a dynamic weighting strategy based on curriculum learning, in which the weight of each training sample is a combination of its classification and localization scores. Motivated by the idea behind curriculum learning, we emphasize the two scores differently at different training stages: the classification score accounts for a higher proportion of the combined score in the early stages of training, and the proportion of the localization score gradually increases as the epochs progress. Experimental results demonstrate that this curriculum-based approach improves the performance of current action localization methods. On THUMOS14, our method outperforms the existing state-of-the-art method (57.6% vs. 55.5%). On ActivityNet v1.3, it reaches 35.4% mAP@Avg.
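
The dynamic weighting described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the linear schedule, the start and end values of 0.8 and 0.2, the convex combination of the two scores, and the function names `classification_share` and `sample_weights`. The sketch only shows the general pattern of shifting emphasis from the classification score to the localization score as the epochs progress.

```python
def classification_share(epoch, total_epochs, start=0.8, end=0.2):
    """Hypothetical linear schedule for the classification score's share.

    Returns `start` at the first epoch and `end` at the last one.
    The linear shape and the default values are assumptions, not
    taken from the paper.
    """
    if total_epochs <= 1:
        return end
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + (end - start) * progress


def sample_weights(cls_scores, loc_scores, epoch, total_epochs):
    """Combine per-sample classification and localization scores.

    Assumed form: weight = alpha * cls + (1 - alpha) * loc, where
    alpha shrinks over training so that localization gradually
    accounts for a larger proportion of the combined score.
    """
    alpha = classification_share(epoch, total_epochs)
    return [alpha * c + (1.0 - alpha) * l
            for c, l in zip(cls_scores, loc_scores)]


if __name__ == "__main__":
    cls_scores = [0.9, 0.4]   # illustrative values only
    loc_scores = [0.3, 0.8]
    for epoch in (0, 5, 9):
        print(epoch, sample_weights(cls_scores, loc_scores, epoch, 10))
```

Under these assumptions, a sample with a high classification score but poor localization receives a large weight early in training and a smaller one later, while a well-localized sample gains weight over time.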
