Abstract
Action quality assessment (AQA) is a challenging vision task due to the complexity and variability of the scoring rules embedded in videos. Recent approaches reduce the prediction difficulty of AQA by learning action differences between videos, but learning scoring rules and capturing feature differences remain difficult. To address these challenges, we propose a two-path target-aware contrastive regression (T2CR) framework that fuses direct and contrastive regression and exploits the consistency of information across multiple visual fields. Specifically, we first learn a direct mapping from global video features to scoring rules, which builds domain prior knowledge that helps capture local differences between videos. We then acquire auxiliary visual fields of the videos through sparse sampling, learning the commonality of feature representations across multiple visual fields and eliminating the subjective noise introduced by a single visual field. To demonstrate the effectiveness of T2CR, we conduct extensive experiments on four AQA datasets (MTL-AQA, FineDiving, AQA-7, and JIGSAWS). Our method outperforms state-of-the-art methods without requiring elaborate structural design or fine-grained information.
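The sketch below illustrates (in outline only, not the authors' implementation) how a two-path head could fuse direct regression on a video's own features with contrastive regression against an exemplar, as described in the abstract. All module names, dimensions, and the simple averaging fusion are hypothetical assumptions for illustration; the paper's actual architecture and fusion strategy may differ.

```python
# Minimal sketch of the two-path idea: direct regression on global features
# plus contrastive regression on the feature difference to an exemplar video.
# Assumes pre-extracted clip-level features; names are hypothetical.
import torch
import torch.nn as nn

class TwoPathRegressor(nn.Module):
    def __init__(self, feat_dim: int = 1024, hidden: int = 256):
        super().__init__()
        # Path 1: map global video features directly to a quality score.
        self.direct_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Path 2: map the query-exemplar feature difference to a score offset.
        self.contrast_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, query_feat, exemplar_feat, exemplar_score):
        direct_score = self.direct_head(query_feat).squeeze(-1)
        delta = self.contrast_head(query_feat - exemplar_feat).squeeze(-1)
        contrastive_score = exemplar_score + delta
        # Fuse the two paths; a simple average is used here for illustration.
        return 0.5 * (direct_score + contrastive_score)

# Example usage with random tensors standing in for backbone features.
model = TwoPathRegressor()
query = torch.randn(4, 1024)            # query video features
exemplar = torch.randn(4, 1024)         # exemplar video features
exemplar_score = torch.rand(4) * 100.0  # exemplar ground-truth scores
pred = model(query, exemplar, exemplar_score)  # predicted scores, shape (4,)
```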