Owing to its important industrial applications, action recognition has attracted much research attention. In particular, skeleton-based human action recognition methods have become increasingly competitive. Nevertheless, occlusion remains a very challenging problem in human action recognition. Existing works are usually built on complete skeleton data, and few consider action recognition under occlusion. To improve recognition accuracy on occluded skeleton data, we propose an occlusion-aware multi-stream fusion graph convolutional network (dubbed MSFGCN). MSFGCN comprises multiple streams, so that different occlusion cases can be handled by different streams. In addition, to construct more discriminative features, multi-modal features are extracted simultaneously, including joint coordinates, relative coordinates, small-scale temporal differences and large-scale temporal differences. In particular, this is the first work to exploit large-scale and small-scale motion features at the same time, which helps to distinguish actions with different motion amplitudes. Moreover, since different parts of the human body contribute differently to recognizing different actions, a content adaptation operation is used to further optimize recognition performance. Effective training strategies are also presented to further improve the performance of the model. Experimental results show that the proposed MSFGCN has a clear advantage over other methods on occluded skeleton datasets. To demonstrate the effectiveness of the different modules, extensive ablation experiments are performed on various skeleton action recognition datasets.
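The four input modalities and the idea of fusing several streams can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions: the skeleton sequence is stored as a NumPy array of shape (T, V, C) (frames, joints, coordinates); "relative coordinates" are taken relative to one hypothetical center joint (index 0 here); the small-scale and large-scale temporal differences use hypothetical frame steps of 1 and 5, with the tail zero-padded so every modality keeps T frames; and streams are combined by a weighted average of per-stream softmax scores, which is a common score-level fusion scheme rather than necessarily the one used in MSFGCN. All function names are hypothetical.

```python
import numpy as np


def relative_coordinates(seq, center=0):
    """Coordinates of every joint relative to a chosen center joint (assumed index)."""
    return seq - seq[:, center:center + 1, :]


def temporal_difference(seq, step):
    """Motion feature: position at frame t+step minus position at frame t.

    The last `step` frames have no partner frame and are left as zeros,
    so the output keeps the input shape (T, V, C).
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    diff = np.zeros_like(seq)
    diff[:-step] = seq[step:] - seq[:-step]
    return diff


def multi_modal_features(seq, center=0, small_step=1, large_step=5):
    """Build the four hypothetical input modalities, one per stream."""
    return {
        "joint": seq.copy(),                                 # raw joint coordinates
        "relative": relative_coordinates(seq, center),       # relative coordinates
        "motion_small": temporal_difference(seq, small_step),  # small-scale motion
        "motion_large": temporal_difference(seq, large_step),  # large-scale motion
    }


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_stream_scores(stream_logits, weights=None):
    """Score-level fusion (assumed scheme): weighted mean of per-stream softmax scores.

    stream_logits: list of arrays, each of shape (N, K) for N samples and K classes.
    Returns an (N, K) array of fused class probabilities.
    """
    probs = np.stack([softmax(np.asarray(l, dtype=float)) for l in stream_logits])
    w = np.ones(len(stream_logits)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, probs, axes=1)  # sum over the stream axis
```

For example, `multi_modal_features(seq)` on a (T, V, C) array returns four arrays of the same shape, one per hypothetical stream, and `fuse_stream_scores([logits_a, logits_b])` averages the two streams' class probabilities. Using two step sizes lets one stream respond to subtle, slow movements and another to large, fast ones, which mirrors the motivation given in the abstract for combining small-scale and large-scale motion.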