Learning Temporal Cues for Fine-Grained Action Recognition

Abstract

Fine-grained action recognition aims to distinguish fine-grained action categories within a coarse action class. It is challenging because of varying temporal scales and subtle differences between categories, which place heavy demands on the temporal perception capabilities of current models. In this paper, we propose a novel Temporal Cues Transformer (TCT) to guide fine-grained action recognition by exploiting detailed temporal cues from the video duration and the action sequence. The proposed TCT consists of a duration-aware encoder and a hierarchical sequence aggregation decoder. In the encoder, we extract duration-aware representations from the video and its duration. In the decoder, we reinforce the learning of the action sequence by first searching for the fine-grained elements of the action with a hierarchical element query module, and then aggregating these elements with a sequence aggregation module to predict the action category. Extensive experiments on the widely used Diving48 and FineGym datasets demonstrate the superiority of the proposed method over state-of-the-art methods.
