Abstract

This paper presents an approach to human action recognition that finds the discriminative key frames of a video sequence and represents them by the distribution of local motion features and their spatiotemporal arrangements. In this approach, the key frames of the video sequence are selected for their discriminative power and represented by the local motion features detected in them and integrated from their temporal neighbors. In the key frame representation, the spatial arrangement of the motion features is captured in a hierarchical spatial pyramid structure. Using frame-by-frame voting for recognition, experiments demonstrate improved performance over most other known methods on popular benchmark data sets.

Recognizing human action from image sequences is an appealing yet challenging problem in computer vision, with many applications including motion capture, human-computer interaction, environment control, and security surveillance. In this paper, we focus on recognizing the activities of a person in an image sequence from local motion features and their spatiotemporal arrangements. Our approach is motivated by the recent success of the "bag-of-words" model for general object recognition in computer vision [21, 14]. This representation, adapted from the text retrieval literature, models an object by the distribution of words from a fixed visual code book, which is usually obtained by vector quantization of local image features. However, this method discards the spatial and temporal relations among the visual features, which could be helpful in object recognition. To address this problem, our approach uses a hierarchical representation of the key frames of a given video sequence to integrate information from both the spatial and the temporal domains.

We first apply a spatiotemporal feature detector to the video sequence to obtain local motion features. Then we generate a visual word code book by quantizing the local motion features and assign a word label to each of them. Next, we select key frames of the video sequence by their discriminative power. For each key frame, we integrate the visual words from its nearby frames, divide the key frame spatially into increasingly fine subdivisions, and compute in each cell a histogram of the visual words detected in the key frame and its temporal neighbors. Finally, we concatenate the histograms from all cells and use the concatenated histogram as the representation of the key frame.
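The code book step amounts to clustering local motion descriptors and labeling each detected feature with its nearest cluster center. The excerpt does not name the clustering algorithm; the following is a minimal sketch assuming k-means, a common choice for building visual vocabularies. The descriptor dimensionality and the vocabulary size of 200 are illustrative assumptions, not values from the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(descriptors, vocab_size=200):
        # descriptors: (n_features, d) array of local motion descriptors
        # pooled over the training videos; vocab_size is an assumption.
        kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
        kmeans.fit(descriptors)
        return kmeans

    def assign_words(descriptors, codebook):
        # Label each local motion feature with its nearest code word.
        return codebook.predict(descriptors)

After this step, each detected feature carries a word label in addition to its image position and frame index, which the later steps rely on.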
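The excerpt states that key frames are selected by their discriminative power but does not define the criterion. As a purely illustrative stand-in, the sketch below scores each training frame's word histogram by how much more similar it is, under histogram intersection, to frames of its own action class than to frames of other classes, and keeps the top-scoring frames; the actual criterion used in the paper may differ.

    import numpy as np

    def discriminative_scores(hists, labels):
        # hists: (n_frames, vocab_size) L1-normalized word histograms
        # labels: (n_frames,) integer action labels
        # Pairwise histogram-intersection similarities between frames.
        sim = np.minimum(hists[:, None, :], hists[None, :, :]).sum(axis=2)
        scores = np.zeros(len(hists))
        for i in range(len(hists)):
            same = labels == labels[i]
            other = ~same
            same[i] = False  # exclude the frame itself
            scores[i] = sim[i, same].mean() - sim[i, other].mean()
        return scores

    def select_keyframes(hists, labels, k=10):
        # Keep the k frames whose histograms best separate their class.
        return np.argsort(-discriminative_scores(hists, labels))[:k]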
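The key frame representation is a spatial pyramid over the visual words pooled from the key frame and its temporal neighbors. A minimal sketch follows, assuming each feature carries an (x, y) position normalized to [0, 1), a frame index t, and a word label; the pyramid depth of three levels (1x1, 2x2, 4x4 grids) and the size of the temporal window are assumptions made for illustration, as the excerpt does not give them.

    import numpy as np

    def keyframe_descriptor(features, key_t, vocab_size=200, levels=2, window=5):
        # features: iterable of (x, y, t, word) with x, y normalized to [0, 1).
        # Pool words from frames within +/- window of the key frame key_t.
        pooled = [(x, y, w) for (x, y, t, w) in features if abs(t - key_t) <= window]
        hists = []
        for level in range(levels + 1):
            n = 2 ** level  # n x n grid of cells at this pyramid level
            grid = np.zeros((n, n, vocab_size))
            for x, y, w in pooled:
                i = min(int(y * n), n - 1)
                j = min(int(x * n), n - 1)
                grid[i, j, w] += 1
            hists.append(grid.ravel())
        # Concatenate per-cell histograms from all levels and normalize.
        desc = np.concatenate(hists)
        total = desc.sum()
        return desc / total if total > 0 else desc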
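Recognition then proceeds by classifying each key frame independently and letting the frames vote on the sequence label. The excerpt does not specify the per-frame classifier; the sketch below assumes simple nearest-neighbor matching of key frame descriptors under Euclidean distance, followed by a majority vote.

    import numpy as np

    def classify_sequence(test_descs, train_descs, train_labels):
        # Classify each key frame by its nearest training descriptor,
        # then take a majority vote over the frames' predicted labels.
        votes = []
        for d in test_descs:
            dists = np.linalg.norm(train_descs - d, axis=1)
            votes.append(train_labels[int(np.argmin(dists))])
        values, counts = np.unique(votes, return_counts=True)
        return values[int(np.argmax(counts))]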
