Despite a large volume of research on action recognition, little attention has been given to recognizing the actions of individuals in poor-quality spectator crowd scenes. This scenario is important because most surveillance systems produce poor-quality video, on which current state-of-the-art methods may not be effective. Recognizing actions performed by individuals in poor-quality spectator crowd scenes therefore remains an unsolved problem. The main challenge in such scenes is localizing a person proposal for each actor in the crowd, and it becomes harder as occlusion grows severe. In this work, we propose a novel approach to finding person proposals in poor-quality spectator crowds using crowd-based constraints. First, we locate persons in the crowd using efficient head detectors, exploiting head size to estimate each person's bounding box via linear regression. We then use the distribution of heads in the crowd image to refine the person proposals. Motion trajectories are computed independently in the video, without reference to persons, and are then assigned to each person through a novel distance measure between a trajectory and a person proposal. The set of trajectories, together with associated motion- and texture-based features in overlapping time windows, forms the final feature vector. For each time window, cumulative feature vectors encoding action information are computed using early information fusion in a bag-of-visual-words framework. Experiments are performed on a publicly available real-world spectator crowd dataset containing as many as 150 actors performing multiple actions simultaneously, and they demonstrate the excellent performance of the proposed technique.
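The head-to-body step above can be illustrated with a minimal sketch. This is not the authors' implementation: the regressor form (box height as a linear function of head height), the fixed width-to-height ratio, and all function names are assumptions introduced here purely for illustration.

```python
import numpy as np

def fit_head_to_body_regressor(head_heights, body_heights):
    """Least-squares fit of a 1-D linear map: head height -> person box height.

    head_heights, body_heights: 1-D arrays of paired training measurements.
    Returns (slope, intercept).
    """
    A = np.stack([head_heights, np.ones_like(head_heights)], axis=1)
    slope, intercept = np.linalg.lstsq(A, body_heights, rcond=None)[0]
    return slope, intercept

def person_box_from_head(head_box, slope, intercept, width_ratio=0.5):
    """Extrapolate a full-person bounding box from a detected head box.

    head_box: (x, y, w, h) with (x, y) the top-left corner.
    The person box keeps the head's top edge and horizontal centre; its
    height is predicted by the fitted regressor, and its width is a fixed
    fraction of that height (width_ratio is an assumed heuristic).
    """
    x, y, w, h = head_box
    body_h = slope * h + intercept
    body_w = width_ratio * body_h
    cx = x + w / 2.0  # horizontal centre of the head
    return (cx - body_w / 2.0, y, body_w, body_h)
```

A regressor fitted on annotated (head height, person height) pairs would then turn each head detection directly into a person proposal, which the crowd-level head distribution can subsequently refine.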