To eliminate the influence of camera viewpoint and human skeleton differences on action similarity evaluation, and to address the problem of evaluating human action similarity across different viewpoints, this article proposes a method based on deep metric learning. The method trains an encoder-decoder autoencoder deep neural network on a self-built synthetic dataset, mapping the 2D human skeletal keypoint sequences extracted from motion videos into three latent low-dimensional dense spaces. Action feature vectors independent of camera viewpoint and human skeleton structure are extracted in these latent spaces, and motion similarity is then measured on these features, effectively removing the effects of camera viewpoint and skeleton size differences on the evaluation. Specifically, when extracting action feature vectors with the encoder-decoder network, a sliding-window scheme divides the keypoint sequence of each limb part into sequence patches, so that viewpoint- and skeleton-independent action features are extracted over smaller time units and a finer-grained similarity evaluation is obtained. In addition, the dynamic time warping (DTW) algorithm is used to align the sequences of action feature vectors temporally, resolving the time-axis discrepancies that arise when computing similarity from these features. A loss function composed of three components, namely cross-reconstruction loss, reconstruction loss, and triplet loss, yields more accurate and reliable similarity results. Finally, the method is evaluated on a self-built dataset. The experimental results show that it effectively eliminates the influence of camera viewpoint and human skeleton size differences on action similarity evaluation, and produces similarity results that are more reliable and closer to human subjective perception for actions captured from different viewpoints or performed by subjects with different skeleton sizes.
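To illustrate the patching step described in the abstract, below is a minimal sketch of how a per-limb keypoint sequence could be split into overlapping temporal patches. The function name, the window and stride values, and the (frames, joints, 2) array layout are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def sliding_window_patches(keypoints, window, stride):
    """Split one limb part's 2D keypoint sequence into overlapping patches.

    keypoints: array of shape (T, J, 2), i.e. T frames, J joints, (x, y) coords.
    Returns an array of shape (num_patches, window, J, 2).
    """
    T = keypoints.shape[0]
    starts = range(0, T - window + 1, stride)
    return np.stack([keypoints[s:s + window] for s in starts])

# Example: a 120-frame clip of a 4-joint limb, 16-frame patches, stride 8.
seq = np.random.rand(120, 4, 2)
patches = sliding_window_patches(seq, window=16, stride=8)
print(patches.shape)  # (14, 16, 4, 2)
```

Each patch is then encoded independently, which is what allows similarity to be judged on smaller time units rather than on the whole clip at once.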
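The temporal alignment relies on standard dynamic time warping; a plain NumPy version of the textbook cumulative-cost recursion is sketched below. The Euclidean frame-to-frame cost and the `dtw_distance` name are illustrative choices, not specifics from the paper.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two feature-vector sequences.

    a: array (Ta, D); b: array (Tb, D). Uses Euclidean frame-to-frame cost
    and the standard cumulative-cost recursion D[i, j] = cost + min of the
    three predecessor cells.
    """
    Ta, Tb = len(a), len(b)
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (Ta, Tb)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Ta, Tb]

# Two embeddings of the same action performed at different speeds.
x = np.random.rand(40, 64)
y = np.random.rand(55, 64)
print(dtw_distance(x, y))
```

This makes the similarity score insensitive to performers executing the same action faster or slower than one another.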
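The three-part objective could be combined as in the sketch below, assuming the three latent spaces separate motion, skeleton, and view codes, and that cross-reconstruction swaps the skeleton/view codes between two clips showing the same action. The decoder signature, the equal weighting of the terms, and all names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def combined_loss(decoder, m1, s1, v1, m2, s2, v2, x1, x2,
                  anchor, positive, negative, margin=1.0):
    """Sketch of a reconstruction + cross-reconstruction + triplet objective.

    m*, s*, v*: motion, skeleton, and view latent codes of two clips x1, x2
    that show the same action from different views / with different skeletons.
    anchor/positive/negative: motion embeddings for the triplet term.
    """
    # Reconstruction: each clip rebuilt from its own latent codes.
    l_rec = F.mse_loss(decoder(m1, s1, v1), x1) + F.mse_loss(decoder(m2, s2, v2), x2)
    # Cross-reconstruction: swap skeleton/view codes between the two clips;
    # since they depict the same action, the swap should still rebuild each clip.
    l_cross = F.mse_loss(decoder(m1, s2, v2), x2) + F.mse_loss(decoder(m2, s1, v1), x1)
    # Triplet: pull same-action motion codes together, push different ones apart.
    l_tri = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return l_rec + l_cross + l_tri

# Toy check with random tensors and a dummy decoder (projected code sum).
D, T = 32, 16
proj = torch.nn.Linear(3 * D, T * 2)
decoder = lambda m, s, v: proj(torch.cat([m, s, v], dim=-1)).view(-1, T, 2)
m1, s1, v1, m2, s2, v2 = (torch.randn(8, D) for _ in range(6))
x1, x2 = torch.randn(8, T, 2), torch.randn(8, T, 2)
a, p, n = torch.randn(8, D), torch.randn(8, D), torch.randn(8, D)
print(combined_loss(decoder, m1, s1, v1, m2, s2, v2, x1, x2, a, p, n))
```

The cross-reconstruction term is what forces the motion code to carry no viewpoint or skeleton information, while the triplet term shapes the metric used for the final similarity evaluation.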