Abstract
Recent 3D convolutional neural networks (3D-CNNs) are promising candidates for action recognition because they deliver attractive algorithm-level performance. Because of their excessive computational cost, however, it is almost impractical to deploy an advanced 3D-CNN architecture on a resource-limited real-time embedded system. In this work, we present several optimization schemes that relax the complexity of 3D-CNN processing without sacrificing recognition accuracy. More precisely, we first develop several 3D-CNN architectures that exploit the trade-off between network complexity and recognition performance. By evaluating the current confidence level, the proposed method then dynamically changes the network structure used for the next clip-level inference. In addition, we introduce a systematic way of managing the network sequence to minimize computing overhead while maintaining acceptable algorithm-level performance. As a result, compared to previous works, the proposed approaches drastically reduce processing cost as well as energy consumption by selecting the simplest 3D-CNN architecture at run time, enabling cost-effective action recognition on embedded edge devices.
Highlights
Recently, action recognition has been gaining popularity due to increased demand from surveillance systems [1]–[6], disaster monitoring solutions [7]–[9], broadcasting platforms [10]–[14], and even sports analytics [15], [16].
In contrast to image recognition, a conventional 2D convolutional neural network (2D-CNN) cannot provide sufficient accuracy, as it focuses only on the spatial features within an image frame [23].
For standalone action recognition systems, which are expected to be a market changer [5], it is necessary to implement high-performance yet cost-effective 3D convolutional neural network (3D-CNN) processing on resource-limited embedded computing platforms.
Summary
Recently, action recognition has been gaining popularity due to increased demand from surveillance systems [1]–[6], disaster monitoring solutions [7]–[9], broadcasting platforms [10]–[14], and even sports analytics [15], [16]. Simulation results reveal that the proposed dynamic network scheduling significantly reduces the number of MAC operations as well as the number of memory accesses while achieving a similar level of recognition accuracy, potentially providing cost-effective 3D-CNN processing compared to the conventional system. The sound information or pre-calculated features, which normally exist in compressed video data, can be utilized to provide additional information for enabling clip-level processing [37] or selecting the dominant clips strongly related to the actions [38], [39], increasing recognition accuracy while even reducing computational complexity. In contrast to the conventional 3D-CNN operation described in Algorithm 1, which applies identical clip-level processing to recognize actions in a video, the proposed method dynamically changes the processing mode of each clip to minimize energy consumption. For efficient network scheduling, it is important to carefully define a metric that accurately measures the confidence level of each clip-level inference.
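The confidence-driven scheduling described above can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the confidence metric (gap between the two largest averaged class probabilities), the single `threshold`, the step-down/step-up policy, the choice to start with the most complex network, and the names `softmax`, `confidence`, and `schedule_and_recognize` are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A "network" here is any callable mapping a clip to a list of class logits.
Network = Callable[[object], Sequence[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(probs: Sequence[float]) -> float:
    """Assumed metric: gap between the two largest class probabilities."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def schedule_and_recognize(
    clips: Sequence[object],
    networks: Sequence[Network],  # ordered from simplest to most complex
    threshold: float = 0.5,       # hypothetical confidence threshold
) -> Tuple[int, List[int]]:
    """Return (predicted class, network index used for each clip)."""
    if not clips or not networks:
        raise ValueError("need at least one clip and one network")
    level = len(networks) - 1     # assumption: start with the most complex network
    running: List[float] = []
    used: List[int] = []
    avg: List[float] = []
    for clip in clips:
        probs = softmax(networks[level](clip))
        used.append(level)
        # Accumulate clip-level scores into a video-level average.
        running = probs if not running else [r + p for r, p in zip(running, probs)]
        avg = [r / len(used) for r in running]
        # Confident: try a simpler network for the next clip; otherwise step up.
        if confidence(avg) >= threshold:
            level = max(level - 1, 0)
        else:
            level = min(level + 1, len(networks) - 1)
    label = max(range(len(avg)), key=avg.__getitem__)
    return label, used
```

In this sketch the cost saving comes from the `used` sequence: the more clips that are handled at low indices, the fewer MAC operations and memory accesses the video requires. A real system would need to tune the metric and threshold against recognition accuracy.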