Abstract

Temporal action detection aims to find every action in an untrimmed video: it localizes the start and end of each action and identifies its category at the same time. Unlike a trimmed video, which always contains a single action instance, an untrimmed video is much more complicated: it contains not only multiple action instances but also multiple background clips between them. This complexity makes temporal action detection highly challenging. Structured Segment Networks (SSN), a recently proposed temporal action detection method, constructs a two-stage pyramid structure to obtain the temporal features of an action instance and uses them for classification and localization. SSN works well except when a video contains multiple action instances that vary greatly in amplitude and duration. This paper introduces a feature pyramid network into the feature extraction phase of SSN, expanding the receptive field of the network and yielding features at different scales, which are used to predict action completeness, category, and boundary, respectively. Experimental results on the THUMOS14 dataset show the effectiveness of our method compared with the original SSN and other existing models.
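
As a rough illustration of what "features at different scales" can mean along the time axis, the sketch below builds a simple temporal pyramid from per-snippet feature vectors. It is not the network described in the paper: plain average pooling stands in for learned layers, the top-down and lateral connections of a full feature pyramid network are omitted, and the names `avg_pool_pairs` and `temporal_pyramid` are hypothetical.

```python
from typing import List

Feature = List[float]  # one feature vector per video snippet (assumed layout)


def avg_pool_pairs(seq: List[Feature]) -> List[Feature]:
    """Halve the temporal resolution by averaging adjacent snippet features."""
    pooled = []
    for i in range(0, len(seq) - 1, 2):
        pooled.append([(a + b) / 2.0 for a, b in zip(seq[i], seq[i + 1])])
    if len(seq) % 2 == 1:
        # An unpaired last snippet is carried forward unchanged.
        pooled.append(list(seq[-1]))
    return pooled


def temporal_pyramid(seq: List[Feature], num_levels: int) -> List[List[Feature]]:
    """Return up to num_levels feature sequences, finest first.

    Each coarser level covers twice the temporal extent per position,
    which is one simple way to widen the receptive field.
    """
    levels = [seq]
    while len(levels) < num_levels and len(levels[-1]) > 1:
        levels.append(avg_pool_pairs(levels[-1]))
    return levels


if __name__ == "__main__":
    # Four snippets with a single (made-up) feature channel each.
    snippets = [[0.0], [2.0], [4.0], [6.0]]
    for level, feats in enumerate(temporal_pyramid(snippets, num_levels=3)):
        print(level, feats)
```

In a sketch like this, the fine levels keep enough resolution to separate short actions, while the coarse levels summarize longer spans, which is the general motivation for using multi-scale features when action durations vary widely.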
