Abstract

Problem. This paper addresses the problem of recognising realistic human actions captured in unconstrained environments (Fig. 1). Existing approaches to action recognition have focused on improving the visual feature representation using either spatio-temporal interest points or key-point trajectories. However, these methods are insufficient when action videos are recorded in unconstrained environments because: (1) reliable visual features are hard to extract due to occlusion, illumination changes, scale variation and background clutter; (2) the effectiveness of visual features depends strongly on the unpredictable characteristics of camera movement; (3) complex visual actions result in visual features of unequal discriminative power.

Our Solutions. In this paper, we present a novel framework for recognising realistic human actions in unconstrained environments. The novelty of our work lies in three aspects. First, we propose a new action representation based on computing a rich set of descriptors from key-point trajectories. Second, in order to cope with the drastic changes in motion characteristics with and without camera movement, we develop an adaptive feature fusion method that combines different local motion descriptors to improve model robustness against feature noise and background clutter. Finally, we propose a novel Multi-Class Delta Latent Dirichlet Allocation (MC-∆LDA) model for feature selection: the most informative features in a high-dimensional feature space are selected collaboratively rather than independently.

Motion Descriptors. We first compute key-point trajectories using a KLT tracker and SIFT matching. After pruning trajectories by identifying the Region of Interest (ROI), we compute three types of motion descriptors from the surviving trajectories. First, the Orientation-Magnitude Descriptor is extracted by quantising the orientation and magnitude of motion between consecutive points of the same trajectory. Second, the Trajectory Shape Descriptor is extracted by computing the Fourier coefficients of a single trajectory. Finally, the Appearance Descriptor is extracted by computing SIFT features at all points of a trajectory. (The two trajectory-geometry descriptors are sketched in code below.)

Interest Point Features. We also detect spatio-temporal interest points, as they contain information complementary to the trajectory features. Around each interest point, a 3D cuboid is extracted. We use gradient vectors to describe these cuboids and PCA to reduce the descriptor dimensionality.

Adaptive Feature Fusion. We adaptively fuse trajectory-based descriptors with 3D interest-point-based descriptors according to the presence of camera movement. A moving camera is detected by computing the global optical flow over all frames in a clip; if the majority of frames contain global motion, we regard the clip as recorded by a moving camera. For clips without camera movement, both interest-point and trajectory-based descriptors can be computed reliably, and thus both types of descriptors are used for recognition. In contrast, when camera motion is detected, interest-point-based descriptors are less meaningful, so only trajectory descriptors are employed.
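To make the two trajectory-geometry descriptors concrete, here is a minimal sketch of how an Orientation-Magnitude histogram and a Fourier-based Trajectory Shape descriptor could be computed from a single (x, y) trajectory. The bin counts, the magnitude cap and the number of Fourier coefficients are illustrative assumptions, not values from the paper.

```python
import numpy as np

def orientation_magnitude_descriptor(traj, n_orient=8, n_mag=4, max_mag=20.0):
    """Joint orientation/magnitude histogram of the motion between
    consecutive trajectory points (hypothetical bin counts)."""
    traj = np.asarray(traj, dtype=float)                 # shape (T, 2): (x, y) per frame
    d = np.diff(traj, axis=0)                            # displacement between consecutive points
    mag = np.linalg.norm(d, axis=1)
    ang = np.mod(np.arctan2(d[:, 1], d[:, 0]), 2 * np.pi)
    o_bin = np.minimum((ang / (2 * np.pi) * n_orient).astype(int), n_orient - 1)
    m_bin = np.minimum((mag / max_mag * n_mag).astype(int), n_mag - 1)
    hist = np.zeros((n_orient, n_mag))
    np.add.at(hist, (o_bin, m_bin), 1)                   # accumulate quantised motion steps
    return hist.ravel() / max(hist.sum(), 1)             # L1-normalised descriptor

def trajectory_shape_descriptor(traj, n_coeffs=8):
    """Low-order Fourier coefficients of the centred x(t) and y(t) signals."""
    traj = np.asarray(traj, dtype=float)
    fx = np.fft.rfft(traj[:, 0] - traj[:, 0].mean())
    fy = np.fft.rfft(traj[:, 1] - traj[:, 1].mean())
    return np.abs(np.concatenate([fx[:n_coeffs], fy[:n_coeffs]]))
```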
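The interest-point pipeline can be approximated in the same spirit: the sketch below describes a 3D cuboid by its stacked spatio-temporal gradients and reduces the descriptor dimensionality with plain SVD-based PCA. The cuboid shape and the number of retained components are assumptions made for illustration.

```python
import numpy as np

def cuboid_gradient_descriptor(cuboid):
    """Describe a 3D cuboid around a spatio-temporal interest point by its
    stacked temporal and spatial gradients (cuboid shape: (t, y, x))."""
    v = np.asarray(cuboid, dtype=float)
    gt, gy, gx = np.gradient(v)                   # gradients along t, y, x
    return np.concatenate([gt.ravel(), gy.ravel(), gx.ravel()])

def pca_reduce(descriptors, n_components=100):
    """Project the high-dimensional cuboid descriptors onto their leading
    principal components (plain SVD-based PCA)."""
    X = np.asarray(descriptors, dtype=float)      # shape (n_points, dim)
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T               # shape (n_points, n_components)
```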
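One possible implementation of the camera-movement test and the adaptive fusion rule is sketched below using dense optical flow from OpenCV. The flow-magnitude threshold and the majority fraction are hypothetical parameters, and the paper's exact global-motion measure may differ.

```python
import cv2
import numpy as np

def has_moving_camera(frames, flow_thresh=1.0, majority=0.5):
    """Majority vote over frame pairs: a pair counts as 'globally moving'
    when the median dense optical-flow magnitude exceeds a threshold."""
    moving = 0
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        if np.median(np.linalg.norm(flow, axis=2)) > flow_thresh:
            moving += 1
        prev = curr
    return moving > majority * (len(frames) - 1)

def fuse_descriptors(traj_desc, stip_desc, camera_moving):
    """Adaptive fusion rule: drop interest-point descriptors when the camera moves."""
    if camera_moving:
        return traj_desc                            # trajectory descriptors only
    return np.concatenate([traj_desc, stip_desc])   # use both descriptor types
```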
Collaborative Feature Selection. We propose an MC-∆LDA model (Fig. 2) for collaboratively selecting dominant features for classification. We consider each video clip $x_j$ to be a mixture of $N_t$ topics $\Phi = \{\phi_t\}_{t=1}^{N_t}$ (to be discovered), where each topic $\phi_t$ is a multinomial distribution over $N_w$ words (visual features). The MC-∆LDA model constrains the topic proportions non-uniformly and on a per-clip basis. Each video clip belonging to action category $A_c$ is modelled as a mixture of (1) $N_t^s$ topics shared by all $N_c$ categories of actions, and (2) $N_{t,c}$ topics uniquely associated with action category $A_c$. In MC-∆LDA, the non-uniform topic mixture proportion for a single clip $x_j$ is enforced by its action class label $c_j$ and the hyperparameter $\alpha_c$ of the corresponding action class $c$. Given the total number of topics $N_t = N_t^s + \sum_{c=1}^{N_c} N_{t,c}$, the structure of the MC-∆LDA model, and the observable variables (clips $x_j$ and action labels $c_j$), we can learn the $N_t^s$ shared topics as well as all $\sum_{c=1}^{N_c} N_{t,c}$ unique topics for the $N_c$ classes of actions. We use the $N_t^s$ topics shared by all actions for selecting discriminative features. The $N_t^s$ shared topics are represented as an $N_w \times N_t^s$ matrix $\Phi^s$. The feature selection can be summarised in two steps: (1) for each feature $v_k$, $k = \dots$

Figure 1: Actions captured in unconstrained environments (YouTube dataset). From left to right: cycling, diving, soccer juggling, and walking with a dog.
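The abstract breaks off before spelling out the two selection steps, so the sketch below only illustrates the topic bookkeeping $N_t = N_t^s + \sum_{c=1}^{N_c} N_{t,c}$ and one plausible way to rank features with the shared-topic matrix $\Phi^s$. The scoring rule (keeping features with low shared-topic mass as the more class-specific ones) and all function names are our assumptions, not the paper's method.

```python
import numpy as np

def total_topics(n_shared, per_class_topics):
    """N_t = N_t^s + sum_{c=1}^{N_c} N_{t,c}: total number of topics."""
    return n_shared + sum(per_class_topics)

def select_features(phi_shared, keep_ratio=0.5):
    """phi_shared: (N_w, N_t^s) matrix Phi^s whose columns are the shared
    topics (multinomials over the N_w visual words).  Hypothetical rule:
    score each word by its total mass under the shared topics and keep the
    words with the LOWEST shared-topic mass, treating them as the more
    discriminative features."""
    score = phi_shared.sum(axis=1)                  # shared-topic mass per visual word
    n_keep = int(keep_ratio * phi_shared.shape[0])
    return np.argsort(score)[:n_keep]               # indices of selected features
```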
