Abstract

Herein, a novel methodology is proposed for real-time recognition of human activity in the compressed domain of videos, based on motion vectors and the self-attention mechanism of vision transformers; it is termed Motion Vectors and Vision Transformers (MVViT). Videos in the MPEG-4 and H.264 compression formats are considered in this study. Any video source can be handled without prior setup by adapting the proposed method to the corresponding video codec and camera settings. Existing algorithms for human action recognition in compressed video have limitations in this regard: (i) they require keyframes at a fixed interval, (ii) they use P frames only, and (iii) they typically support a single codec. The proposed method overcomes these limitations by allowing arbitrary keyframe intervals, using both P and B frames, and supporting both the MPEG-4 and H.264 codecs. Experiments are carried out on the benchmark datasets UCF101, HMDB51, and THUMOS14, and the recognition accuracy in the compressed domain is found to be comparable to that obtained on raw video data, but at a reduced computational cost. The proposed MVViT method outperforms other recent methods with a 61.0% reduction in the number of parameters and a 63.7% reduction in Giga Floating Point Operations Per Second (GFLOPS), while improving accuracy by 0.8%, 5.9%, and 16.6% on UCF101, HMDB51, and THUMOS14, respectively. Further, an 8% speed increase is observed on UCF101 compared with the highest speed reported in the literature for that dataset. An ablation study of the proposed method is performed using MVViT variants for different codecs, and the performance is analysed in comparison with state-of-the-art network models.
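To make the core idea concrete, the sketch below treats the decoded motion-vector field of a P or B frame as a low-resolution two-channel (dx, dy) grid and classifies it with a small vision transformer. This is a minimal illustration only, not the paper's implementation: the class name, grid size, patch size, and model dimensions are hypothetical assumptions, and motion-vector extraction from the bitstream is taken as given.

```python
import torch
import torch.nn as nn

class MotionVectorViT(nn.Module):
    """Minimal ViT-style classifier over a per-frame motion-vector field.

    Input: a (batch, 2, H, W) tensor of (dx, dy) motion vectors decoded
    from P/B frames. All sizes here are illustrative, not the paper's.
    """

    def __init__(self, grid=(28, 28), patch=4, dim=192, depth=4,
                 heads=3, num_classes=101):
        super().__init__()
        n_patches = (grid[0] // patch) * (grid[1] // patch)
        # Patch embedding: non-overlapping patches of the MV grid.
        self.embed = nn.Conv2d(2, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, mv):                             # mv: (B, 2, H, W)
        x = self.embed(mv).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos      # prepend CLS token
        x = self.encoder(x)                            # self-attention over patches
        return self.head(x[:, 0])                      # classify via CLS token

# Example: a batch of 8 motion-vector fields on a 28x28 macroblock grid.
mv = torch.randn(8, 2, 28, 28)
logits = MotionVectorViT()(mv)
print(logits.shape)  # torch.Size([8, 101]) -- e.g., UCF101 classes
```

Because the motion-vector grid is far smaller than the decoded RGB frames, a transformer of this size processes it cheaply, which is consistent with the compressed-domain cost savings the abstract reports.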
