Temporal refinement network: Combining dynamic convolution and multi-scale information for fine-grained action recognition

Jirui Di, Zhengping Hu, Shuai Bi, Hehao Zhang, Yulu Wang, Zhe Sun

doi:10.1016/j.imavis.2024.105058

Abstract

Fine-grained action recognition is challenging because, compared with coarse-grained actions, the classes share nearly identical context, offer limited background information, and differ only subtly from one another. Effectively capturing spatio-temporal information is therefore crucial for fine-grained action recognition models. To address the limitations of coarse-grained models in describing spatio-temporal context, we propose a Temporal Refinement Block (TRB) as an efficient component for fine-grained action recognition. The TRB enables our model to capture underlying semantics and global dependencies by generating spatio-temporal kernels of different scales and performing fully connected operations within the temporal dimension. Our experiments demonstrate the effectiveness of TRB in learning latent semantics and global dependencies. To further enhance the framework's performance, we incorporate an enhanced spatio-temporal pyramidal network (TPN) that collects beat information and utilizes dilated convolutions to boost multi-scale features. We refer to the proposed framework as the Temporal Refinement Network, abbreviated as TRN. Our TRN achieves competitive performance on the FineGym and Diving48 benchmarks.
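
To make the idea of dilated convolutions over multiple temporal scales concrete, the following is a minimal sketch in plain Python. It is not the TRN implementation: the function names `dilated_conv1d` and `multi_scale_features`, the fixed kernel, the zero padding, and the dilation rates are all assumptions chosen for illustration, and a single scalar per frame stands in for a real feature map.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float],
                   kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Cross-correlate `signal` with `kernel` along time.

    Kernel taps are spaced `dilation` frames apart, and the input is
    implicitly zero-padded so the output has the same length as the input.
    (Hypothetical helper, for illustration only.)
    """
    half = (len(kernel) // 2) * dilation  # reach of the kernel on each side
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - half + j * dilation
            if 0 <= idx < len(signal):    # positions outside the clip count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def multi_scale_features(signal: Sequence[float],
                         kernel: Sequence[float],
                         dilations: Sequence[int] = (1, 2, 3)) -> List[List[float]]:
    """Apply the same kernel at several dilation rates (assumed values),
    giving one response per temporal scale."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]


if __name__ == "__main__":
    # A single "event" at frame 3 of a 7-frame clip.
    clip = [0, 0, 0, 1, 0, 0, 0]
    for d, feat in zip((1, 2, 3), multi_scale_features(clip, [1, 1, 1])):
        print(f"dilation={d}: {feat}")
```

With a three-tap kernel, a dilation of 1 responds only to adjacent frames, while larger dilations reach frames further apart without adding parameters, which is the general reason dilated convolutions are used to widen temporal context.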

Full Text