Human activity prediction is defined as inferring the high-level activity category from the observation of only a few action units. It is of great value for time-critical applications such as emergency surveillance. For efficient prediction, we represent the ongoing human activity using body part movements, taking full advantage of their inherent sequentiality, and then find the best-matching activity template with a suitable alignment measure. In streaming videos, dense spatio-temporal interest points (STIPs) are first extracted as low-level descriptors owing to their high detection efficiency. Then, sparse grouplets, i.e., clustered point groups, are located to represent body part movements; for this we propose a scale-adaptive mean shift method that adaptively determines the number and scale of grouplets in each frame. To learn the sequentiality, the located grouplets are successively mapped onto a Recurrent Self-Organizing Map (RSOM) that has been pre-trained to preserve the temporal topology of grouplet sequences. During this mapping, a growing RSOM trajectory representing the ongoing activity is obtained. To handle the special structure of RSOM trajectories, a combination of dynamic time warping (DTW) distance and edit distance, called the DTW-E distance, is designed for similarity measurement. Four activity datasets with different characteristics, such as complex scenes and inter-class ambiguities, are used for performance evaluation. Experimental results confirm that our method predicts human activity efficiently and outperforms state-of-the-art approaches.
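To illustrate the matching step, the sketch below combines a classic DTW distance and an edit (Levenshtein) distance over two sequences of RSOM node indices. It is a minimal sketch under stated assumptions, not the paper's exact DTW-E formulation: the local cost, the mixing weight alpha, and the example node-index sequences are all hypothetical.

```python
# Sketch only: blend DTW and edit distance over discrete RSOM node-index
# sequences. The local cost, weighting, and example data are assumptions.

def dtw_distance(a, b, cost=lambda x, y: 0.0 if x == y else 1.0):
    """Classic dynamic time warping over discrete symbol sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(a[i - 1], b[j - 1])
            d[i][j] = c + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[n][m]

def dtw_e_distance(a, b, alpha=0.5):
    """Hypothetical weighted blend of DTW and edit distance (alpha assumed)."""
    return alpha * dtw_distance(a, b) + (1.0 - alpha) * edit_distance(a, b)

# Usage: compare a partial (ongoing) RSOM trajectory against a full template.
observed = [3, 3, 7, 7, 12]            # node indices of the ongoing activity
template = [3, 7, 7, 12, 12, 18, 25]   # node indices of a learned activity
print(dtw_e_distance(observed, template, alpha=0.5))
```

The two dynamic-programming tables make the intuition concrete: DTW tolerates repeated or stretched nodes along a trajectory, while the edit distance penalizes nodes that are missing or spurious, which is why a blend of the two can suit partially observed trajectories.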