Abstract

Many fusion methods have been developed to improve RGB-D action recognition, but learning a conjoint representation of the heterogeneous modalities within a single network has received little attention. We present an associated representation method for RGB-D action recognition based on a siamese network with a contrastive-center loss. First, samples of each class and modality are selected as references to construct positive and negative pairs: each positive pair consists of a training sample and a reference of its own class, whereas a negative pair couples a training sample with a reference of a different class. These pairs are then fed into a two-stream siamese network to learn a collaborative representation of the RGB and depth data. Two ranking losses, an intramodal and a cross-modal contrastive-center loss, impose a similarity/dissimilarity metric on these pairs. Specifically, the intramodal contrastive-center loss measures the relationship between samples and references within the RGB or the depth modality, while the cross-modal contrastive-center loss measures the relationship between visual and depth features in the same low-dimensional space. Finally, the ranking losses are optimized jointly with a softmax loss for action recognition. The proposed method is evaluated on two large action datasets, LAP IsoGD and NTU RGB+D, and on the smaller Sheffield Kinect Gesture dataset. The experimental results demonstrate that the proposed method surpasses most state-of-the-art methods.
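
To make the loss design concrete, the following is a minimal PyTorch sketch of the kind of contrastive-center ranking objective the abstract describes. It is an illustration, not the authors' implementation: the function names, the weighting factors lam_intra and lam_cross, and the use of learnable per-class centers in place of the paper's sampled class references are all assumptions; the per-sample form (pull toward the own-class center, push away from all other centers) follows the standard contrastive-center loss from the literature.

    import torch
    import torch.nn.functional as F

    def contrastive_center_loss(features, labels, centers, delta=1e-6):
        # Squared Euclidean distance from every sample to every class
        # center: shape (N, K) for N samples and K classes.
        dists = torch.cdist(features, centers).pow(2)
        # Distance to each sample's own-class center: shape (N,).
        pos = dists.gather(1, labels.unsqueeze(1)).squeeze(1)
        # Summed distance to all other-class centers: shape (N,).
        neg = dists.sum(dim=1) - pos
        # Pull toward the own-class center, push away from the rest;
        # delta guards against division by zero.
        return 0.5 * (pos / (neg + delta)).mean()

    def total_loss(logits, labels, rgb_f, depth_f, rgb_c, depth_c,
                   lam_intra=0.01, lam_cross=0.01):
        # Softmax (cross-entropy) classification loss.
        ce = F.cross_entropy(logits, labels)
        # Intramodal term: each modality measured against references
        # (here, centers) of the same modality.
        intra = (contrastive_center_loss(rgb_f, labels, rgb_c) +
                 contrastive_center_loss(depth_f, labels, depth_c))
        # Cross-modal term: RGB features against depth centers and vice
        # versa, assuming both streams embed into the same
        # low-dimensional space (one plausible reading of the abstract).
        cross = (contrastive_center_loss(rgb_f, labels, depth_c) +
                 contrastive_center_loss(depth_f, labels, rgb_c))
        return ce + lam_intra * intra + lam_cross * cross

    # Toy usage: 8 samples, 5 classes, 128-d embeddings from each stream.
    rgb_f, depth_f = torch.randn(8, 128), torch.randn(8, 128)
    rgb_c, depth_c = torch.randn(5, 128), torch.randn(5, 128)
    logits, labels = torch.randn(8, 5), torch.randint(0, 5, (8,))
    print(total_loss(logits, labels, rgb_f, depth_f, rgb_c, depth_c))

Under this reading, the three terms are simply summed and optimized jointly, matching the abstract's statement that the ranking losses and the softmax loss are jointly optimized for recognition.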
