With the development of urban intelligence, intelligent recognition of human activities has become an urgent need. Temporal Action Detection (TAD) aims to identify human activities in real-world videos and is a challenging task in video understanding. Current methods mainly use global features for boundary matching or predefine all possible proposals, while ignoring background interference and the causal relevance among temporal actions, which results in redundant proposals and reduced detection accuracy. To fill this gap, we propose a novel Dilated Convolution Locate and Action Relevant Score model called DCAR. Specifically, DCAR consists of a Dilated Location Network (DL-Net) and an Action Relevance Calculation (ARC) block. For the DL-Net, we design a Boundary Feature Enhancement (BFE) block, which enhances action boundary features and fuses similar features across channels via pooling and channel squeezing, reducing background interference. We also design multiple dilated convolutional structures that aggregate long-range contextual information at each time point/interval after boundary enhancement. For the ARC block, we use hyperbolic-space distance and cross-attention to compute the causal correlation among action proposals, which reduces the misclassification of action proposals. Extensive experiments on Thumos14 and ActivityNet-1.3 show that our method significantly improves performance and achieves state-of-the-art results. On Thumos14, it achieves mAPs of 63.2%, 57.0%, and 48.5% at tIoU 0.3, 0.4, and 0.5, respectively; on ActivityNet-1.3, it reaches 9.42% at tIoU 0.95.
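To make the dilated aggregation concrete, the sketch below shows how a stack of parallel dilated 1-D convolutions can widen the temporal receptive field over snippet features, in the spirit of DL-Net's description. The channel count, dilation rates, residual fusion, and the `DilatedContextBlock` name are illustrative assumptions; the abstract does not specify the exact architecture.

```python
# Minimal sketch: multi-rate dilated 1-D convolutions aggregating long-range
# temporal context over snippet features after boundary enhancement.
# Channel counts, dilation rates, and the fusion scheme are assumptions.
import torch
import torch.nn as nn

class DilatedContextBlock(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, channels: int = 256, dilations=(1, 2, 4, 8)):
        super().__init__()
        # Parallel dilated branches; padding = dilation keeps the temporal
        # length fixed for a kernel of size 3.
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv1d(channels * len(dilations), channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T) snippet features.
        ctx = torch.cat([self.act(b(x)) for b in self.branches], dim=1)
        return self.act(self.fuse(ctx)) + x  # residual connection (assumed)

feats = torch.randn(2, 256, 100)       # 100 temporal snippets
out = DilatedContextBlock()(feats)     # same shape, wider receptive field
```

Larger dilation rates let later branches see distant time points without extra parameters, which is why dilated stacks are a common way to capture long contextual dependencies in TAD.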
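The ARC block's hyperbolic-space distance can be illustrated with the standard Poincaré-ball metric. Whether DCAR uses this exact formulation, and how proposal embeddings are projected into the unit ball, are assumptions not stated in the abstract.

```python
# Minimal sketch: Poincaré-ball distance as a relevance score between
# proposal embeddings. The embedding pipeline and the sign convention for
# "relevance" are assumptions for illustration only.
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    sq_u = u.pow(2).sum(dim=-1).clamp(max=1 - eps)
    sq_v = v.pow(2).sum(dim=-1).clamp(max=1 - eps)
    sq_diff = (u - v).pow(2).sum(dim=-1)
    x = 1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v))
    return torch.acosh(x.clamp(min=1 + eps))  # clamp guards against NaNs

# Proposal embeddings assumed to lie inside the unit ball.
p1 = torch.randn(8, 64) * 0.1
p2 = torch.randn(8, 64) * 0.1
relevance = -poincare_distance(p1, p2)  # smaller distance -> higher relevance
```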