Abstract

Action recognition in video is widely applied in video indexing, intelligent surveillance, multimedia understanding, and other fields. A typical human action contains spatiotemporal information at various scales, and learning and fusing this multi-temporal-scale information makes action recognition more reliable in terms of recognition accuracy. To demonstrate this, we use Res3D, a 3D Convolutional Neural Network (CNN) architecture, to extract information at multiple temporal scales. At each temporal scale, we transfer the knowledge learned from RGB to 3-channel optical flow (OF) and learn from both the RGB and OF fields. We also propose Parallel Pair Discriminant Correlation Analysis (PPDCA) to fuse the multi-temporal-scale information into a lower-dimensional action representation. Experimental results show that, compared with the single-temporal-scale method, the proposed multi-temporal-scale method achieves higher recognition accuracy; it spends more time on feature extraction but less time on classification because of the lower-dimensional representation. Moreover, the proposed method achieves recognition performance comparable to that of state-of-the-art methods. The source code and 3D filter animations are available online: https://github.com/JerryYaoGl/multi-temporal-scale .
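To illustrate the multi-temporal-scale idea described above, the sketch below samples clips of several temporal lengths from one video and extracts a feature per scale. This is a minimal illustration, not the authors' implementation: `sample_clip`, `multi_scale_features`, and the averaging `feature_fn` (a stand-in for a 3D CNN such as Res3D) are hypothetical names introduced here.

```python
import numpy as np

def sample_clip(video, length):
    """Uniformly sample `length` frames from a video of shape (T, H, W, C)."""
    t = video.shape[0]
    idx = np.linspace(0, t - 1, num=length).round().astype(int)
    return video[idx]

def multi_scale_features(video, scales=(8, 16, 32), feature_fn=None):
    """Return one feature vector per temporal scale.

    `feature_fn` stands in for a 3D CNN backbone (e.g., Res3D); by default
    it is a placeholder that averages each clip over time and space.
    """
    if feature_fn is None:
        feature_fn = lambda clip: clip.mean(axis=(0, 1, 2))  # (C,) per clip
    return [feature_fn(sample_clip(video, s)) for s in scales]

video = np.random.rand(64, 112, 112, 3)  # 64 frames of 112x112 RGB
feats = multi_scale_features(video)      # one feature vector per scale
```

In the paper's pipeline, the per-scale features would then be fused (via PPDCA) into a single lower-dimensional representation rather than simply collected in a list.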
