Abstract
Boundary detection is a challenging problem in Temporal Action Detection (TAD). While transformer-based methods achieve satisfactory results by incorporating self-attention to model global dependencies for boundary detection, they face two key issues. First, they lack explicit learning of local relationships, which leads to imprecise boundaries when only subtle appearance changes occur between adjacent clips. Second, because self-attention tends to distribute focus across the entire input video, features of multiple actions converge, resulting in the prediction of imprecisely overlapping actions. To address these challenges, we introduce the ConvTransformer Attention Network (CTAN), a novel framework comprising two primary components: (1) the Temporal Attention Block (TAB), a temporal attention mechanism designed to emphasize critical temporal positions enriched with essential action-related features, and (2) the ConvTransformer Block (CTB), which employs a hybrid structure to capture nuanced appearance changes locally and action transitions globally. Equipped with these components, CTAN focuses on motion features that distinguish overlapping actions and precisely captures both local differences between adjacent clips and global action transitions. Extensive experiments on multiple datasets, including THUMOS14, MultiTHUMOS, and ActivityNet, confirm the effectiveness of CTAN.
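To make the hybrid local/global idea concrete, the following is a minimal sketch of a ConvTransformer-style block: a depth-wise temporal convolution models relationships between adjacent clips, while multi-head self-attention models global action transitions. The layer sizes, the depth-wise convolution choice, and fusion by residual addition are illustrative assumptions, not the paper's exact CTB design.

```python
import torch
import torch.nn as nn


class ConvTransformerBlock(nn.Module):
    """Illustrative hybrid block: local conv branch + global attention branch.

    Assumed structure for illustration only; the paper's CTB may differ.
    """

    def __init__(self, dim: int, num_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        # Local branch: depth-wise 1-D convolution over neighboring clips,
        # intended to capture subtle appearance changes between adjacent clips.
        self.local_conv = nn.Conv1d(dim, dim, kernel_size,
                                    padding=kernel_size // 2, groups=dim)
        # Global branch: multi-head self-attention over all clips,
        # intended to capture action transitions across the whole video.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_clips, dim) clip-level features.
        local = self.local_conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(x + local)            # residual fusion of local cues
        global_feat, _ = self.attn(x, x, x)
        return self.norm2(x + global_feat)   # residual fusion of global cues


# Example: 2 videos, 128 clips each, 256-dim clip features.
feats = torch.randn(2, 128, 256)
out = ConvTransformerBlock(dim=256)(feats)
print(out.shape)  # torch.Size([2, 128, 256])
```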