The ability to recognize human actions from a single viewpoint is hindered by phenomena such as self-occlusion or occlusion by other objects. Incorporating multiple cameras can help overcome these issues, but the question remains how to efficiently use the information from all viewpoints to increase performance. Researchers have reconstructed 3D models from multiple views to reduce viewpoint dependency, but this 3D approach is often computationally expensive; moreover, the quality of each view influences the overall model, and the reconstruction is limited to volumes where the views overlap. In this paper, we propose a novel method to efficiently combine 2D data from different viewpoints. Spatio-temporal features are extracted from each viewpoint and then used in a bag-of-words framework to form histograms. Two different codebook sizes are used. The similarity between the resulting histograms is represented via the Histogram Intersection kernel as well as the RBF kernel with the $\chi^2$ distance. Finally, we combine all the base kernels generated by the different choices of viewpoint, feature type, codebook size, and kernel type. The final kernel is a weighted linear combination of these base kernels, with the weights determined through an optimization process. For higher accuracy, a separate set of kernel weights is computed for each binary SVM classifier. Our method not only combines information from multiple viewpoints efficiently, but also improves performance by mapping features into various kernel spaces. The effectiveness of the proposed method is demonstrated on two commonly used multi-view human action datasets. Moreover, several experiments show the contribution of each part of the method to overall performance.
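To make the kernel-combination step concrete, the sketch below shows one way the base kernels over bag-of-words histograms could be built and linearly combined. It is not the authors' released code: the function names, the `histograms` list (one histogram matrix per viewpoint/feature/codebook combination), and the `weights` vector (assumed to come from the paper's optimization process) are all hypothetical, while `chi2_kernel` and the precomputed-kernel `SVC` are standard scikit-learn utilities.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def histogram_intersection_kernel(X, Y):
    # K[i, j] = sum_k min(X[i, k], Y[j, k])
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=-1)

def combined_kernel(histograms, weights, gamma=1.0):
    """Weighted sum of base kernels.

    histograms: list of (n_samples, codebook_size) arrays, one per
        (viewpoint, feature type, codebook size) combination (assumed input).
    weights: one weight per base kernel, i.e. 2 * len(histograms) values,
        assumed to be produced by the paper's optimization process.
    """
    base_kernels = []
    for H in histograms:
        base_kernels.append(histogram_intersection_kernel(H, H))
        base_kernels.append(chi2_kernel(H, gamma=gamma))  # RBF kernel over the chi^2 distance
    return sum(w * K for w, K in zip(weights, base_kernels))

# One binary SVM per class is then trained on the precomputed combined kernel,
# each with its own weight set, e.g.:
# K_train = combined_kernel(train_histograms, weights_for_this_class)
# clf = SVC(kernel="precomputed").fit(K_train, y_binary)
```

The one-vs-rest structure mirrors the abstract's statement that kernel weights are computed separately for each binary SVM classifier; in practice this means recomputing `combined_kernel` with a different `weights` vector per class.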