Research in skeleton-based human action recognition has attracted great interest from the computer vision community, owing to the availability and popularity of skeleton data. Widely used 2D pose estimation methods can generate high-quality skeletons in real time on low-power hardware, which benefits 2D skeleton-based action recognition. However, view variations in action representations degrade recognition performance. In this paper, a view-agnostic network for 2D skeleton-based action recognition is proposed, which narrows the representation differences caused by view variations by applying an adaptive frame-level skeleton deformation before feature extraction. Specifically, the deformation is realized by a body-level affine transformation and a joint-level offset compensation, which together approximate a new observation viewpoint for better intragroup consistency in a learning-based, data-driven manner. In addition, an adaptive spatial–temporal graph convolutional LSTM (ASTGCN-LSTM) is introduced, which models the co-occurrence relationships of skeleton sequences in the spatial–temporal domain more effectively. Experiments on the Kinetics-Skeleton and NTU RGB+D 120 action recognition datasets demonstrate the effectiveness of the proposed method.
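The abstract gives no implementation details, so the following PyTorch sketch is only a rough illustration of the described frame-level deformation: a body-level 2D affine transform plus per-joint offset compensation, both predicted from each frame's skeleton. All module names, layer sizes, and the 18-joint layout are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FrameLevelDeformation(nn.Module):
    """Hypothetical sketch of an adaptive frame-level skeleton deformation:
    a body-level affine transform and a joint-level offset are regressed
    from the input skeleton (all names and sizes are assumptions)."""

    def __init__(self, num_joints: int, hidden: int = 64):
        super().__init__()
        in_dim = num_joints * 2  # (x, y) per joint, flattened per frame
        # Predicts the 6 parameters of a 2x3 affine matrix per frame.
        self.affine_head = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),
        )
        # Predicts a small (dx, dy) offset for every joint per frame.
        self.offset_head = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints * 2),
        )
        # Initialize the affine head to the identity transform so training
        # starts from the original viewpoint.
        nn.init.zeros_(self.affine_head[-1].weight)
        self.affine_head[-1].bias.data = torch.tensor(
            [1.0, 0.0, 0.0, 0.0, 1.0, 0.0])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, 2) 2D joint coordinates
        b, t, j, _ = x.shape
        flat = x.reshape(b, t, j * 2)
        theta = self.affine_head(flat).reshape(b, t, 2, 3)     # body-level affine
        offsets = self.offset_head(flat).reshape(b, t, j, 2)   # joint-level offsets
        ones = torch.ones(b, t, j, 1, device=x.device)
        homo = torch.cat([x, ones], dim=-1)                    # homogeneous coords
        warped = torch.einsum('btrc,btjc->btjr', theta, homo)  # apply affine
        return warped + offsets


# Usage: deform a batch of 2 sequences, 30 frames, 18 joints
# (an OpenPose-style layout is assumed here).
if __name__ == "__main__":
    deform = FrameLevelDeformation(num_joints=18)
    seq = torch.randn(2, 30, 18, 2)
    print(deform(seq).shape)  # torch.Size([2, 30, 18, 2])
```

Initializing the affine head to the identity is a common design choice for learned spatial transforms, so the deformed skeleton starts out unchanged and the network gradually learns view normalization during training.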