Abstract

Human action localization in long, untrimmed videos requires determining both where an action takes place in a given video segment and what that action is. The main hurdle is the spatiotemporal variability of actions: the location (the particular set of frames containing action instances) and the duration of an action in real-life video sequences are generally not fixed. In addition, uncontrolled conditions such as occlusion, viewpoint changes, and motion at the boundaries of action sequences call for a fast deep network that can be trained on unlabeled samples of complex video sequences. Motivated by these facts, we propose a weakly supervised deep network model for human action localization. The model is trained on unlabeled action samples from the UCF50 action benchmark. Five-channel data, obtained by concatenating RGB frames (three channels) with optical flow vectors (two channels), are fed to the proposed convolutional neural network, and an LSTM network then yields the region in which the action occurs. The performance of the model is evaluated on the UCF Sports dataset. The observations and comparative results show that our model can localize actions from annotation-free data samples captured under uncontrolled conditions.
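The abstract does not give the exact architecture, so the following is a minimal sketch of the described pipeline under stated assumptions: RGB and optical flow are concatenated into a five-channel per-frame input, a small CNN extracts per-frame features, and an LSTM models the temporal sequence. The layer sizes and the per-frame actionness/bounding-box head are illustrative assumptions, not the authors' design.

```python
# Hypothetical sketch of a five-channel CNN + LSTM localizer (PyTorch).
# Channel layout per frame: RGB (3) + optical flow u, v (2) = 5 channels.
import torch
import torch.nn as nn

class FiveChannelCNNLSTM(nn.Module):
    def __init__(self, hidden_size=256):
        super().__init__()
        # Per-frame CNN over the 5-channel input (sizes are assumptions).
        self.cnn = nn.Sequential(
            nn.Conv2d(5, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> one feature per frame
        )
        self.lstm = nn.LSTM(64, hidden_size, batch_first=True)
        # Illustrative head: per-frame actionness score + bounding box (x, y, w, h).
        self.head = nn.Linear(hidden_size, 1 + 4)

    def forward(self, clip):
        # clip: (batch, time, 5, H, W) -- RGB and flow concatenated per frame.
        b, t, c, h, w = clip.shape
        feats = self.cnn(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        seq, _ = self.lstm(feats)           # temporal modeling across frames
        out = self.head(seq)                # (batch, time, 5)
        return out[..., 0], out[..., 1:]    # actionness scores, boxes per frame

# Usage: a batch of 2 clips, 16 frames each, at 112x112 resolution.
model = FiveChannelCNNLSTM()
rgb = torch.rand(2, 16, 3, 112, 112)        # RGB frames
flow = torch.rand(2, 16, 2, 112, 112)       # optical flow (u, v) per frame
scores, boxes = model(torch.cat([rgb, flow], dim=2))
print(scores.shape, boxes.shape)            # (2, 16) and (2, 16, 4)
```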
