Temporal action detection in untrimmed videos is a challenging task that aims to predict the temporal boundaries and category of each action instance, with potential applications in transportation. In this study, we propose a two-stage framework, the Malleable Boundary Network (MB-Net), which adaptively regresses proposals based on finer-grained scores. MB-Net consists of a Potential Boundary Generator in the first stage and an Adaptive Proposal Detector in the second stage. First, the Potential Boundary Generator fuses multiple sets of flexible score sequences, computed from frame-level features, to obtain tentative proposals in an anchor-free way. Then, the Adaptive Proposal Detector employs parallel modules to filter, classify, and regress the proposals adaptively. In addition, we propose an easy-to-implement feature augmentation method, Structured Temporal Segment Pooling, which makes full use of the information across the whole proposal. Experiments show that MB-Net achieves state-of-the-art performance on the popular benchmarks THUMOS-14 and ActivityNet-1.3, with improvements of 1.9% and 1.2%, respectively.