Abstract
The encoding method is an important component of an action recognition pipeline, and a key element of any encoding method is the assignment step. A widely used super-vector encoding method is the vector of locally aggregated descriptors (VLAD), which achieves very competitive results in many tasks. However, VLAD considers only hard assignment, and the assignment criterion is applied only from the feature side: each feature votes for its nearest visual word. In this work we propose to encode deep features for videos using a double-assignment VLAD (DA-VLAD). In addition to the traditional VLAD assignment, we perform a second assignment from the codebook perspective: we determine which features are nearest to each visual word, rather than only which centroid is nearest to each feature as in the standard assignment. Another important factor for the performance of an action recognition system is the feature extraction step. Recently, deep features have obtained state-of-the-art results in many tasks and have also been adopted for action recognition, with competitive results over hand-crafted features. This work includes a pipeline to extract local deep features for videos using any available network as a black box, and we show competitive results even when the network was trained for another task or on another dataset. Our DA-VLAD encoding method outperforms traditional VLAD, and we obtain state-of-the-art results on the UCF50 dataset and competitive results on the UCF101 dataset.
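The double assignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `da_vlad`, the parameter `n_nearest` (how many nearest features each visual word additionally claims in the second assignment), and the power/L2 normalization at the end are assumptions chosen to match common VLAD practice.

```python
import numpy as np

def da_vlad(features, codebook, n_nearest=5):
    """Sketch of double-assignment VLAD (DA-VLAD).

    features: (N, D) array of local descriptors.
    codebook: (K, D) array of visual words (centroids).
    n_nearest: hypothetical parameter controlling how many nearest
        features each visual word claims in the second assignment.
    Returns a (K*D,) normalized super-vector.
    """
    N, D = features.shape
    K = codebook.shape[0]
    V = np.zeros((K, D))

    # Pairwise distances between every feature and every visual word.
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)

    # First (standard) assignment: each feature votes for its nearest
    # visual word and contributes its residual to that word.
    nearest_word = dists.argmin(axis=1)
    for i in range(N):
        k = nearest_word[i]
        V[k] += features[i] - codebook[k]

    # Second assignment, from the codebook side: each visual word claims
    # its n_nearest features and accumulates their residuals as well.
    for k in range(K):
        for i in np.argsort(dists[:, k])[:n_nearest]:
            V[k] += features[i] - codebook[k]

    # Common VLAD post-processing (assumed here): power normalization
    # followed by global L2 normalization.
    v = V.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Note that under double assignment a feature may contribute to more than one visual word (once through its own vote and again if a word claims it), which is what distinguishes DA-VLAD from the standard hard-assignment VLAD.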