Abstract
In the field of action recognition from RGB videos, it is infeasible to train deep networks on dozens or hundreds of frames because of limits on computational complexity and memory. Previous works commonly adopt a sparse sampling strategy, which unfortunately misses crucial frames and insufficiently models short-range motions. In this letter, we propose an effective motion information aggregation module (MIAM) that utilises convolutional neural networks to aggregate motion information from multiple frames into a single frame. This makes it possible to train deep networks end-to-end on densely sampled frames efficiently. The MIAM enables the model to gather motion information at every instant of an action, avoiding missed subtle movements. Experiments on the NTU RGB+D 60 and NTU RGB+D 120 datasets verify that the MIAM significantly improves recognition accuracy with very limited extra computational cost and exhibits unique advantages in recognising subtle actions.
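The abstract gives no implementation details, so the following is only a minimal sketch of the aggregation idea, assuming a PyTorch-style module in which each densely sampled segment of T frames is channel-stacked and reduced to a single frame-like tensor by 2D convolutions. The class name MIAM, the parameter frames_per_segment, and the two-convolution head are illustrative assumptions, not the authors' architecture.

```python
# Hypothetical sketch of a motion information aggregation module.
# Layer sizes and the channel-stacking scheme are assumptions.
import torch
import torch.nn as nn

class MIAM(nn.Module):
    """Aggregate motion information from T densely sampled frames
    into a single frame-like tensor via 2D convolutions."""
    def __init__(self, frames_per_segment: int = 8, channels: int = 3):
        super().__init__()
        in_ch = frames_per_segment * channels  # stack frames along the channel axis
        self.aggregate = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            # project back to one RGB-like frame per segment
            nn.Conv2d(64, channels, kernel_size=3, padding=1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, T, C, H, W) densely sampled frames
        b, t, c, h, w = segment.shape
        x = segment.reshape(b, t * c, h, w)  # channel-stack the T frames
        return self.aggregate(x)             # (batch, C, H, W) aggregated frame

# Usage: aggregate each dense segment, then feed the single aggregated
# frame per segment to any 2D backbone, keeping the per-segment cost
# close to that of processing one sparsely sampled frame.
miam = MIAM(frames_per_segment=8)
segment = torch.randn(2, 8, 3, 224, 224)
aggregated = miam(segment)  # shape: (2, 3, 224, 224)
```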