Abstract

Still image human action recognition (HAR) is a challenging problem owing to limited sources of information and to large intra-class and small inter-class variations, which require highly discriminative features. Transfer learning offers the necessary capabilities for producing such features by preserving prior knowledge while learning new representations. However, optimally identifying the dynamic number of re-trainable layers in the transfer learning process poses a challenge. In this study, we aim to automate the identification of the optimal configuration. Specifically, we propose a novel particle swarm optimisation (PSO) variant, denoted as EnvPSO, for optimal hyper-parameter selection in the transfer learning process for HAR tasks with still images. EnvPSO incorporates Gaussian fitness surface prediction and exponential search coefficients to overcome stagnation. It optimises the learning rate, batch size, and number of re-trained layers of a pre-trained convolutional neural network (CNN). To overcome the bias of a single optimised network, an ensemble model with three optimised CNN streams is introduced. The first and second streams employ raw images and segmentation masks yielded by Mask R-CNN as inputs, respectively, while the third stream fuses a pair of networks that take raw images and saliency maps as inputs, respectively. The final prediction is obtained by averaging the class predictions from all three streams. By leveraging differences between the representations learned within the optimised streams, our ensemble model outperforms counterparts devised by PSO and other state-of-the-art methods for HAR. In addition, when evaluated using diverse artificial landscape functions, EnvPSO performs better than other search methods, with statistically significant differences in performance.
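To make the search setting concrete, the following is a minimal sketch of a standard global-best PSO over the three hyper-parameters named in the abstract. It is not EnvPSO: the Gaussian fitness surface prediction and the exponential search coefficients are not reproduced. The search ranges, the coefficient values, and the `toy_fitness` surrogate (a stand-in for the validation error of a fine-tuned CNN) are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical search ranges (not taken from the paper):
# log10(learning rate), batch size, number of re-trained layers.
LOWER = np.array([-5.0, 8.0, 1.0])
UPPER = np.array([-1.0, 64.0, 20.0])


def decode(position):
    """Map a continuous particle position to a concrete configuration."""
    log_lr, batch, layers = position
    return {
        "learning_rate": float(10.0 ** log_lr),
        "batch_size": int(round(float(batch))),
        "retrained_layers": int(round(float(layers))),
    }


def toy_fitness(position):
    """Stand-in for validation error after fine-tuning (lower is better).

    A real run would fine-tune the CNN with decode(position) and return its
    validation error; this quadratic bowl only keeps the sketch runnable.
    """
    target = np.array([-3.0, 32.0, 10.0])  # arbitrary optimum of the toy surface
    scaled = (position - target) / (UPPER - LOWER)
    return float(np.sum(scaled ** 2))


def pso(fitness, n_particles=10, n_iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard global-best PSO with fixed inertia and acceleration coefficients."""
    rng = np.random.default_rng(seed)
    dim = LOWER.size
    pos = rng.uniform(LOWER, UPPER, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    g = int(np.argmin(pbest_fit))
    gbest, gbest_fit = pbest[g].copy(), float(pbest_fit[g])

    for _ in range(n_iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Velocity update: inertia + pull to personal best + pull to global best.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, LOWER, UPPER)  # keep particles inside the ranges
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = int(np.argmin(pbest_fit))
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), float(pbest_fit[g])
    return gbest, gbest_fit
```

Calling `decode(pso(toy_fitness)[0])` yields one candidate configuration. Searching the learning rate on a log scale is a design choice of this sketch, since plausible values span several orders of magnitude.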

Highlights

  • Human action recognition (HAR) aims to identify human actions from visual data

  • Motivated by the well-known two-stream convolutional neural network (CNN) architecture proposed by [31], where spatial and temporal information was extracted by separate streams for action classification, we propose an ensemble model consisting of three EnvPSO-optimised CNN streams, as shown in Fig. 1, to diversify action recognition

  • The mean average precision (MAP) metric is computed to determine the effectiveness of the EnvPSO-optimised CNN ensemble model
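The fusion and evaluation steps mentioned above can be sketched as follows: per-class scores from the three streams are averaged, and MAP is then computed over classes. This is an illustrative sketch only. It uses non-interpolated average precision, whereas the evaluation protocol actually used in the paper is not stated here, and the function names are hypothetical.

```python
import numpy as np


def ensemble_average(stream_scores):
    """Average per-class scores over streams.

    stream_scores: list of arrays, each of shape (n_samples, n_classes).
    """
    return np.mean(np.stack(stream_scores, axis=0), axis=0)


def average_precision(labels, scores):
    """Non-interpolated AP for one class.

    labels: binary array (1 = sample belongs to the class).
    scores: predicted scores for that class, higher means more confident.
    """
    order = np.argsort(-scores, kind="stable")  # rank samples by descending score
    hits = labels[order].astype(float)
    n_pos = hits.sum()
    if n_pos == 0:
        return float("nan")  # AP is undefined without positive samples
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    # Mean of the precision values taken at the rank of each positive sample.
    return float(np.sum(precision_at_k * hits) / n_pos)


def mean_average_precision(label_matrix, score_matrix):
    """Mean of per-class AP; classes without positives are ignored."""
    aps = [
        average_precision(label_matrix[:, c], score_matrix[:, c])
        for c in range(label_matrix.shape[1])
    ]
    return float(np.nanmean(aps))
```

In use, the outputs of the three streams would be passed as a list to `ensemble_average`, and the fused scores would be compared against a binary label matrix with `mean_average_precision`.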


Summary

Introduction

In this respect, video action recognition, which takes both spatial and temporal information into account for action classification, has attracted significant attention. However, the extraction of optical flow information requires substantial additional effort, with significant computational cost and complexity. Some of these issues can be overcome by using still images. Desai et al. [4] and Shapovalova et al. [5] extracted human body, objects, and human–object interaction, while Li and Fei-Fei [6] and Gupta et al. [7] derived human body, objects, and scene contexts for HAR. Body parts, objects, and human–object interaction were used in Maji et al. [8], Desai and Ramanan [9], and Delaitre et al. [10], whereas Sener et al. [11], Yao and Fei-Fei [12], and Yao et al. [13] adopted human body, body parts, objects, and scene contexts.

