Abstract

In this paper, we propose a novel binary local representation for RGB-D video data fusion with a structure-preserving projection. Our contribution consists of two aspects. To acquire a general feature for the video data, we convert the problem to describing the gradient fields of the RGB and depth information of video sequences. With the local fluxes of the gradient fields, which include the orientation and the magnitude of the neighborhood of each point, a new kind of continuous local descriptor called the Local Flux Feature (LFF) is obtained. Then the LFFs from the RGB and depth channels are fused into a Hamming space via the Structure Preserving Projection (SPP). Specifically, an orthogonal projection matrix is applied to preserve the pairwise structure, with a shape constraint to avoid the collapse of the data structure in the projected space. Furthermore, a bipartite graph structure of the data is taken into consideration, which is regarded as a higher-level connection between samples and classes than the pairwise structure of local features. Extensive experiments show not only the high efficiency of binary codes and the effectiveness of combining LFFs from RGB-D channels via SPP on various action recognition benchmarks of RGB-D data, but also the potential power of LFF for general action recognition.
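As a rough illustration of why codes in a Hamming space are cheap to compare, the sketch below projects a fused descriptor with an orthogonal matrix, keeps only the signs, and compares the resulting codes with XOR. It is a minimal sketch under stated assumptions, not the SPP method itself: the matrix `P` is a fixed, hand-written orthogonal matrix rather than a learned projection, fusion is assumed to be plain concatenation of the RGB and depth descriptors, and all names and toy values are hypothetical.

```python
def project(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def binarize(values):
    """Pack the signs of the projected values into an integer code."""
    code = 0
    for v in values:
        code = (code << 1) | (1 if v >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


# Assumption: a fixed 4x4 orthogonal matrix (scaled Hadamard matrix).
# In the paper the projection is learned; this one is only a stand-in.
P = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]

# Hypothetical 2-D descriptors from the RGB and depth channels.
rgb_a, depth_a = [1.0, 0.2], [0.1, -0.3]
rgb_b, depth_b = [0.9, 0.3], [0.0, -0.2]   # close to sample a
rgb_c, depth_c = [-1.0, 0.4], [-0.2, 0.5]  # far from sample a

# Assumption: fusion by simple concatenation before projection.
code_a = binarize(project(P, rgb_a + depth_a))
code_b = binarize(project(P, rgb_b + depth_b))
code_c = binarize(project(P, rgb_c + depth_c))

print(hamming(code_a, code_b), hamming(code_a, code_c))
```

With these toy values the two similar descriptors receive codes that are closer in Hamming distance than the dissimilar pair, and each comparison costs one XOR and a bit count.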

Highlights

  • RGB-D sensors such as Kinect are receiving increasing attention in the computer vision community [1]

  • All these sequences are synchronously captured with a Kinect sensor. This dataset collects 10 categories of hand gestures in total: circle, triangle, up-down, right-left, wave, “Z”, cross, comehere, turnaround and pat. All these ten categories are performed with three hand postures: fist, index and flat

  • We illustrate the effectiveness of all the three terms used in Structure Preserving Projection (SPP), i.e., the pairwise label preserving term, the pairwise angle preserving term and the bigraph regularization


Summary

INTRODUCTION

RGB-D sensors such as Kinect are receiving increasing attention in the computer vision community [1]. To gain a more robust and accurate representation of samples, local feature descriptors such as SIFT [8], HOG3D [9], HOG [10], HOF [11] and MBH [12] have been proposed and have achieved notable success in classification and recognition. Based on these local features, the Bag-of-Words (BoW) model [13] and the Sparse Coding (SC) algorithm [14] have shown their effectiveness for both image classification and action recognition. It represents the sum of all distances from the local features of an image to their corresponding nearest neighbors in each class. Although it was proposed for image classification, it can be applied to any kind of sample represented by local feature descriptors. This makes it extremely fast and useful for many practical applications.
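The quantity described above, the sum of distances from each local feature of a sample to its nearest neighbor in a class, can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a brute-force nearest-neighbor search; the function names and the toy 2-D features are hypothetical and are not taken from the paper.

```python
import math


def sample_to_class_distance(query_features, class_features):
    """Sum, over the query's local features, of the distance to the
    nearest local feature belonging to one class (brute-force search)."""
    return sum(
        min(math.dist(q, c) for c in class_features)
        for q in query_features
    )


def classify(query_features, features_by_class):
    """Return the class label with the smallest summed distance."""
    return min(
        features_by_class,
        key=lambda label: sample_to_class_distance(
            query_features, features_by_class[label]
        ),
    )


# Hypothetical local features pooled per class (labels are illustrative).
features_by_class = {
    "wave": [(0.0, 0.0), (1.0, 0.0)],
    "circle": [(5.0, 5.0), (6.0, 5.0)],
}

# Local features extracted from one query sample.
query = [(0.1, 0.0), (0.9, 0.1)]

predicted = classify(query, features_by_class)
print(predicted)
```

Because the score depends only on sets of local descriptors, the same routine applies to any sample type represented that way, whether an image or a video sequence.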

RELATED WORK
LOCAL FLUX FEATURE
Flux Computation
Pairwise Structure Preserving
STRUCTURE PRESERVING PROJECTION
Pairwise Angle
Bigraph Regularization
Objective Function and Optimization
Complexity Analysis
EXPERIMENTS AND RESULTS
Datasets and Settings
Compared Results
Comparison with Other Methods
Methods
Statistical Significance Test
Results on RGB Video dataset
