Abstract

Video-based action recognition is a challenging task that demands careful consideration of the temporal properties of videos in addition to their appearance attributes. In particular, the temporal domain of raw videos usually contains significantly more redundant or irrelevant information than still images. To address this, this paper proposes an unsupervised video-based action recognition approach based on imagining motion and perceiving appearance, called IMPA, which comprehensively learns the spatio-temporal characteristics inherent in videos, with a particular emphasis on the moving object. Specifically, a self-supervised Motion Extracting Block (MEB) is designed to extract the principal motion features by focusing on the large movements of the moving object, based on the observation that humans can infer complete motion trajectories from partially observed moving objects. To further account for the indispensable appearance attributes of videos, an unsupervised Appearance Learning Block (ALB) is developed to perceive static appearance, which is combined with the MEB to recognize actions. Extensive validation experiments and ablation studies on multiple datasets demonstrate that the proposed IMPA approach achieves superior performance and surpasses both classical and state-of-the-art unsupervised action recognition methods.
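To make the two-block design concrete, below is a minimal PyTorch sketch of how a motion branch (MEB) and an appearance branch (ALB) might be combined for recognition. Only the block names come from the abstract; the layer choices, the frame-difference motion cue, the middle-frame appearance cue, and the concatenation fusion are illustrative assumptions, and the self-supervised and unsupervised training objectives are not shown.

```python
# Minimal sketch of a two-branch IMPA-style model. The block names MEB and
# ALB come from the abstract; their internal layers, feature dimensions, and
# the fusion scheme below are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn


class MotionExtractingBlock(nn.Module):
    """Hypothetical MEB: encodes motion features from frame differences."""
    def __init__(self, in_channels=3, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, clip):  # clip: (B, C, T, H, W)
        # Frame differences emphasize large movements of the moving object.
        motion = clip[:, :, 1:] - clip[:, :, :-1]
        return self.encoder(motion)


class AppearanceLearningBlock(nn.Module):
    """Hypothetical ALB: encodes static appearance from a single frame."""
    def __init__(self, in_channels=3, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, clip):
        # Use the middle frame as a static appearance cue (an assumption).
        frame = clip[:, :, clip.shape[2] // 2]
        return self.encoder(frame)


class IMPA(nn.Module):
    """Combines motion (MEB) and appearance (ALB) features for recognition."""
    def __init__(self, num_classes=101, feat_dim=128):
        super().__init__()
        self.meb = MotionExtractingBlock(feat_dim=feat_dim)
        self.alb = AppearanceLearningBlock(feat_dim=feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, clip):
        fused = torch.cat([self.meb(clip), self.alb(clip)], dim=1)
        return self.classifier(fused)


# Usage: a batch of 8 clips, 3 channels, 16 frames, 112x112 resolution.
model = IMPA()
logits = model(torch.randn(8, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([8, 101])
```

In an unsupervised setting such as the one the abstract describes, the classifier head would typically be trained separately (e.g., a linear probe) on top of representations learned without action labels.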
