Abstract

Human activity recognition from video is an important problem due to its potential applications in remote surveillance, content-based video retrieval, and humanoid robots. Most visual activity recognition research in the last decade has focused on recognizing basic human actions in well-constrained laboratory videos and depends on fully annotated video datasets for model training. Activity recognition from unconstrained videos, in contrast, remains very challenging due to large variations in object appearance and pose, occlusion, and inter- and intra-class variations. Preparing a large-scale realistic video activity dataset with detailed annotations of humans, objects, and their mutual interactions in each frame is an extremely laborious task. Although it is intuitive to model contextual relationships from a fully annotated dataset, it is unknown how reliably multilevel contextual features can be extracted in the absence of such annotations. To mitigate these challenges, we propose a weakly supervised approach for complex human activity recognition from realistic videos. The proposed approach requires only an activity label for each video to train the model. Novel multilevel contextual features and a context estimation procedure for un-annotated datasets are also introduced. A restricted Boltzmann machine is used to systematically integrate the multilevel contextual features. We evaluate the proposed approach on benchmark realistic surveillance video datasets for human-human and human-object interaction activity recognition. The experimental results show improved accuracies on these benchmark datasets without using fully annotated data.
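The abstract states that a restricted Boltzmann machine integrates the multilevel contextual features. As a rough illustration, the following is a minimal binary RBM trained with one-step contrastive divergence (CD-1); the dimensions, learning rate, and synthetic feature vectors are illustrative assumptions, not the authors' actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Binary RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)  # visible bias
        self.c = np.zeros(n_hidden)   # hidden bias
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def cd1_step(self, v0):
        # Positive phase: hidden activations driven by the data.
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step back to the visible layer.
        pv1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(pv1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b += self.lr * (v0 - pv1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)

# Toy stand-in for binarized multilevel contextual feature vectors
# (the concatenated layout is assumed for illustration only).
data = (rng.random((64, 12)) < 0.5).astype(float)
rbm = RBM(n_visible=12, n_hidden=6)
for _ in range(50):
    rbm.cd1_step(data)

# The hidden-unit probabilities act as the fused feature representation.
fused = rbm.hidden_probs(data)
print(fused.shape)  # (64, 6)
```

After training, downstream activity classification would operate on `fused` rather than on the raw concatenated features.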

Highlights

  • Video content is ubiquitous and is an essential part of our daily life

  • In order to capture structural, co-occurrence, and interaction motion patterns among the spatiotemporal bins (STBs) of a video, we introduce midlevel contextual features based on a novel context estimation procedure

  • Since we do not assume the availability of human or object bounding boxes in the training dataset, we introduce a procedure to automatically measure the discriminative ability of each STB in recognizing a given activity
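The last highlight describes scoring STBs by discriminative ability using only video-level labels. The highlights do not spell out the measure, so the following is a purely hypothetical sketch that ranks bins by how much their mean response varies across activity classes; the function name, data, and scoring rule are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def stb_discriminativeness(features, labels):
    """Score each STB using only video-level activity labels.

    features: (n_videos, n_bins) array of per-bin responses.
    labels:   (n_videos,) array of activity labels.
    Returns an (n_bins,) score: the variance of per-class mean
    responses, so bins that respond differently across activities
    score higher than bins that respond uniformly.
    """
    classes = np.unique(labels)
    class_means = np.stack(
        [features[labels == c].mean(axis=0) for c in classes]
    )
    return class_means.var(axis=0)

# Toy data: bin 0 fires mainly for activity class 1; the rest are noise.
labels = np.array([0] * 20 + [1] * 20)
features = rng.random((40, 5))
features[20:, 0] += 2.0

scores = stb_discriminativeness(features, labels)
print(int(scores.argmax()))  # 0
```

In a weakly supervised pipeline, such scores could weight or prune STBs before the contextual features are pooled, without ever requiring bounding-box annotation.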



Introduction

Video content is ubiquitous and is an essential part of our daily life. A huge number of video clips are recorded round the clock by surveillance cameras and uploaded to the internet by users. The biggest motivation for developing automated human activity recognition systems is their wide range of applications [1], [2]. These applications include intelligent ground and aerial video surveillance, monitoring of elderly and disabled people, content-based video retrieval and compression, human-computer interfaces, socially assistive robots, behavioral biometrics, medical diagnosis, assessment, and treatment (e.g., of musculoskeletal disorders), sports video analysis and highlights, character animation and synthesis, and many more. Most of the initial datasets released were recorded in a controlled environment with a simple background, in which a human alone was present in the video in a frontal or side pose [3], [4]. Satisfactory results have been reported by a few researchers on such datasets.
