Recent temporal action detection models have focused on end-to-end trainable approaches to exploit the representational power of backbone networks. Despite the advantages of end-to-end training, these models still use a small spatial resolution (e.g., 96 × 96) because of the inefficient trade-off between computational cost and spatial resolution. In this study, we argue that a simple pooling method (e.g., adaptive average pooling) acts as a bottleneck at the spatial aggregation stage, restricting representational power. To address this issue, we propose temporal-wise spatial attentive pooling (TSAP), which alleviates the bottleneck between the backbone and the detection head using a temporal-wise attention mechanism. Our approach mitigates the inefficient trade-off between spatial resolution and computational cost, thereby enhancing spatial scalability in temporal action detection. Moreover, TSAP can be adopted in previous end-to-end approaches by simply replacing their spatial pooling stage. Our experiments demonstrate the essential role of spatial aggregation and show consistent improvements when TSAP is incorporated into previous end-to-end methods.
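To make the contrast between average pooling and attentive spatial pooling concrete, the following is a minimal sketch in plain Python. It is not the TSAP architecture itself, which the abstract does not specify: it assumes a generic softmax attention over the spatial positions of each frame, and the names `average_pool`, `attentive_pool`, and the `query` vector (a stand-in for learned parameters) are hypothetical.

```python
import math


def average_pool(features):
    """Baseline: average the spatial positions of each frame.

    features: list of T frames; each frame is a list of S position
    vectors, each of length C. Returns T vectors of length C.
    """
    pooled = []
    for frame in features:
        channels = len(frame[0])
        pooled.append(
            [sum(pos[c] for pos in frame) / len(frame) for c in range(channels)]
        )
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(features, query):
    """Assumed attention pooling: one softmax per frame over its positions.

    query: hypothetical vector of length C standing in for learned weights.
    Each frame gets its own attention weights, so the spatial aggregation
    can differ from one time step to the next.
    """
    pooled = []
    for frame in features:
        # Score each spatial position by its dot product with the query.
        scores = [sum(q * x for q, x in zip(query, pos)) for pos in frame]
        weights = softmax(scores)
        channels = len(frame[0])
        pooled.append(
            [sum(w * pos[c] for w, pos in zip(weights, frame)) for c in range(channels)]
        )
    return pooled


# Toy input: T = 2 frames, S = 2 spatial positions, C = 2 channels.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 0.0]],
]
uniform = attentive_pool(features, [0.0, 0.0])   # zero query: uniform weights
focused = attentive_pool(features, [5.0, 0.0])   # favors high first channel
```

With a zero query the attention weights are uniform, so the sketch reduces to average pooling; a non-zero query lets each frame emphasize different spatial positions, which is the kind of flexibility a fixed average cannot provide.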