Abstract

Extracting good action representations from video frames is a challenging problem because moving objects appear at widely varying scales across current action recognition datasets. Most existing action recognition methods pay scant attention to this characteristic and rely on deep learning models to handle it implicitly. In this paper, we introduce a multi-scale receptive fields convolutional network (MSRFNet), built on the pseudo-3D residual network architecture, to mitigate the impact of scale variation in moving objects. The crux of MSRFNet is a multi-scale receptive fields block that integrates multiple dilated convolution layers sharing identical convolutional parameters but featuring different receptive fields. MSRFNet leverages three scales of receptive fields to extract features from moving objects of diverse sizes, aiming to produce scale-specific feature maps with uniform representational power. By visualizing the attention of MSRFNet, we analyze how the model reallocates its attention to moving objects after introducing the multi-scale receptive fields. Experimental results on benchmark datasets show that MSRFNet improves on the baseline by 3.2% on UCF101, 5.8% on HMDB51, and 7.7% on Kinetics-400. Compared with state-of-the-art techniques, MSRFNet achieves comparable or superior results, affirming the effectiveness of the proposed approach.
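The core mechanism described here, one convolution kernel reused at several dilation rates so that the branches share identical parameters while covering different receptive fields, can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3x3 spatial kernel, the dilation rates (1, 2, 3), the equal input/output channel count, the absence of bias, and element-wise summation as the fusion step are all assumptions, since the abstract specifies only shared parameters and three receptive-field scales.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSRFBlock(nn.Module):
    """Sketch of a multi-scale receptive fields block: one shared 3x3
    kernel is applied at several dilation rates, so every branch has
    identical parameters but a different receptive field."""

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.dilations = dilations
        # Single weight tensor shared by all branches (assumption:
        # 3x3 spatial kernel, same in/out channels, no bias).
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # padding = dilation keeps the spatial size fixed for a 3x3
        # kernel, so the scale-specific maps can be fused directly.
        branches = [
            F.conv2d(x, self.weight, padding=d, dilation=d)
            for d in self.dilations
        ]
        return sum(branches)  # assumed fusion: element-wise sum


if __name__ == "__main__":
    feat = torch.randn(2, 64, 56, 56)  # (batch, channels, H, W)
    out = MSRFBlock(64)(feat)
    print(out.shape)                   # torch.Size([2, 64, 56, 56])
```

Because padding equals dilation for a 3x3 kernel, each branch preserves the spatial size of its input, which is what makes a simple element-wise fusion of the three scale-specific feature maps well defined.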
