Real-time multiple spatiotemporal action localization and prediction approach using deep learning

Ahmed Ali Hammam,Mona M Soliman,Aboul Ella Hassanien

doi:10.1016/j.neunet.2020.05.017

Abstract

Detecting the locations of multiple actions in videos and classifying them in real-time are challenging problems termed ”action localization and prediction” problem. Convolutional neural networks (ConvNets) have achieved great success for action localization and prediction in still images. A major advance occurred when the AlexNet architecture was introduced in the ImageNet competition. ConvNets have since achieved state-of-the-art performances across a wide variety of machine vision tasks, including object detection, image segmentation, image classification, facial recognition, human pose estimation, and tracking. However, few works exist that address action localization and prediction in videos. The current action localization research primarily focuses on the classification of temporally trimmed videos in which only one action occurs per frame. Moreover, nearly all the current approaches work only offline and are too slow to be useful in real-world environments.In this work, we propose a fast and accurate deep-learning approach to perform real-time action localization and prediction. The proposed approach uses convolutional neural networks to localize multiple actions and predict their classes in real time. This approach starts by using appearance and motion detection networks (known as ”you only look once” (YOLO) networks) to localize and classify actions from RGB frames and optical flow frames using a two-stream model. We then propose a fusion step that increases the localization accuracy of the proposed approach. Moreover, we generate an action tube based on frame level detection. The frame by frame processing introduces an early action detection and prediction with top performance in terms of detection speed and precision.The experimental results demonstrate this superiority of our proposed approach in terms of both processing time and accuracy compared to recent offline and online action localization and prediction approaches on the challenging UCF-101-24 and J-HMDB-21 benchmarks.

Full Text