Abstract

The human eye needs only a few frames to recognize an action lasting a few seconds, whereas an action recognition network requires hundreds of input frames per action. This results in a large number of floating-point operations (16 to 100 GFLOPs) to process a single sample, which hampers the deployment of graph convolutional network (GCN)-based action recognition methods when computational capability is restricted. A common strategy is to retain only a subset of the frames, but this loses the important information carried by the discarded frames. Furthermore, key frames are typically selected independently, without modeling their connections to the other frames. To solve these two problems, we propose a fusion sampling network that generates fused frames from which key frames are extracted. Temporal aggregation is used to fuse adjacent similar frames, thereby reducing information loss and redundancy, and the concept of self-attention is introduced to strengthen the long-term associations among key frames. Experimental results on three benchmark datasets show that the proposed method achieves performance competitive with state-of-the-art methods while using only 16.7% of the frames (∼50 of 300 frames in total). On the NTU 60 dataset, the FLOPs and parameter count with a single-channel input are 3.776 G and 3.53 M, respectively. This greatly reduces the excessive computational cost that the large data volumes of action recognition impose in practical applications.
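The abstract does not give implementation details, so the following is a minimal, illustrative sketch of the two ideas it names: temporal aggregation that fuses adjacent similar frames, and a self-attention score used to pick key frames. The function names, the cosine-similarity threshold, and the tensor shapes are all assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def temporal_aggregate(frames: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    """Fuse runs of adjacent, highly similar frames by averaging them.

    frames: (T, D) tensor of per-frame features (e.g., flattened skeleton joints).
    Returns a shorter (T', D) tensor of fused frames.
    Threshold and averaging rule are assumptions, not the paper's exact method.
    """
    fused, run = [], [frames[0]]
    for t in range(1, frames.shape[0]):
        sim = F.cosine_similarity(frames[t], run[-1], dim=0)
        if sim >= sim_threshold:
            run.append(frames[t])                       # redundant frame: merge into run
        else:
            fused.append(torch.stack(run).mean(dim=0))  # close the run with its mean
            run = [frames[t]]
    fused.append(torch.stack(run).mean(dim=0))
    return torch.stack(fused)

def attention_keyframe_scores(fused: torch.Tensor) -> torch.Tensor:
    """Score fused frames by how strongly other frames attend to them.

    Plain scaled dot-product self-attention; a frame receiving high average
    attention from the others is treated as a key-frame candidate.
    """
    d = fused.shape[-1]
    attn = torch.softmax(fused @ fused.T / d ** 0.5, dim=-1)  # (T', T')
    return attn.mean(dim=0)  # average attention each frame receives

# Usage sketch: fuse 300 frames, then keep the ~50 highest-scoring fused frames.
frames = torch.randn(300, 75)        # e.g., 300 frames of 25 joints x 3 coordinates
fused = temporal_aggregate(frames)
scores = attention_keyframe_scores(fused)
k = min(50, fused.shape[0])
key_frames = fused[scores.topk(k).indices.sort().values]  # restore temporal order
```

Averaging within a run keeps information from frames that a hard key-frame sampler would simply drop, which is the information-loss problem the abstract points to, while the attention score ties each selected frame to the rest of the sequence rather than scoring it in isolation.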
