Fusion of Global and Local Deep Features Using Bag of Words and VLAD Models for Human Activity Recognition

Amany Abdelbaky, Saleh Aly

doi:10.1109/smart-tech49988.2020.00035

Abstract

Human activity recognition is an important and challenging topic for the computer vision research community. Action representation using deep learning models is currently the dominant technique. However, supervised convolutional neural networks (CNNs) require large computational and memory resources to optimize their parameters. Recently, a simple unsupervised deep learning architecture, the Principal Component Analysis Network (PCANet), has emerged as an alternative to CNNs and has achieved significant results in various vision applications. Meanwhile, encoding and representation techniques based on Bag of Words (BoW) and Vector of Locally Aggregated Descriptors (VLAD) have demonstrated great success in several visual tasks, particularly activity recognition. This work presents a novel human activity recognition technique that combines global and local PCANet features with BoW and VLAD encoding schemes. Both global and local features are learned by PCANet from selected frames of each action video. The dimensionality of these features is then reduced via Whitening PCA (WPCA). Next, the encoding schemes are applied to both feature types to produce the final descriptor for each action. Finally, a Support Vector Machine (SVM) classifier is trained for recognition. Several experiments are conducted on the UCF Sports dataset to evaluate our method. All experimental results, obtained with a leave-one-out cross-validation (LOOCV) strategy, are satisfactory and comparable.
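
To make the two encoding schemes named above concrete, the following is a minimal sketch of BoW and VLAD encoding for one video's set of feature vectors. It is illustrative only and is not the authors' implementation: the function names (`nearest_codeword`, `bow_encode`, `vlad_encode`), the hard nearest-codeword assignment, and the L1/L2 normalization choices are assumptions, and the codebook is assumed to have been learned beforehand (typically by k-means on training features). Plain Python lists are used so the example has no dependencies.

```python
import math

def nearest_codeword(x, codebook):
    """Index of the codeword closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])),
    )

def bow_encode(descriptors, codebook):
    """BoW: L1-normalized histogram of hard codeword assignments (assumed normalization)."""
    counts = [0.0] * len(codebook)
    for x in descriptors:
        counts[nearest_codeword(x, codebook)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total > 0 else counts

def vlad_encode(descriptors, codebook):
    """VLAD: per-codeword sums of residuals, flattened and L2-normalized (assumed normalization)."""
    k, d = len(codebook), len(codebook[0])
    residual_sums = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        j = nearest_codeword(x, codebook)
        for i in range(d):
            # Accumulate the difference between the descriptor and its codeword.
            residual_sums[j][i] += x[i] - codebook[j][i]
    flat = [v for row in residual_sums for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Toy data (hypothetical): two 2-D codewords and three descriptors from one video.
codebook = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
bow = bow_encode(descriptors, codebook)    # length k
vlad = vlad_encode(descriptors, codebook)  # length k * d
```

BoW keeps only how often each codeword is used, giving a descriptor of length k, whereas VLAD also records how the features deviate from their assigned codewords, giving a descriptor of length k times d. Either descriptor could then be passed to an SVM classifier as described in the abstract.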

Full Text