Abstract

One of the main principles of Deep Convolutional Neural Networks (CNNs) is the extraction of useful features through a hierarchy of kernel operations. The kernels are not explicitly tailored to address specific target classes but are rather optimized as general feature extractors. Distinguishing between classes is typically left to the very last fully-connected layers. Consequently, variations between relatively similar classes are treated in the same way as variations between highly dissimilar classes. To address this problem directly, we introduce Class Regularization, a novel method that regularizes feature-map activations based on the classes of the examples used. Essentially, we amplify or suppress activations based on an educated guess of the given class. This step can be applied to each mini-batch of activation maps, at different depths in the network. We demonstrate that this improves feature search during training, leading to systematic performance gains on the Kinetics, UCF-101, and HMDB-51 datasets. Moreover, Class Regularization establishes an explicit correlation between features and class, which makes it a useful tool for visualizing class-specific features at various network depths.
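As a concrete illustration of this gating idea, the sketch below shows one way such a layer could look in PyTorch for a spatiotemporal CNN. It is a minimal sketch only, not the published implementation: the class estimate from globally pooled features (class_proj), the per-class channel weights (class_weights), and the strength parameter are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClassRegularization(nn.Module):
    """Illustrative class-conditional gating of intermediate activations.

    Hypothetical sketch, not the paper's reference implementation: the class
    estimate, the per-class channel weights, and the gating formula are
    assumptions made for demonstration purposes only.
    """

    def __init__(self, num_channels: int, num_classes: int, strength: float = 0.5):
        super().__init__()
        # Maps globally pooled features to class scores ("educated guess" of the class).
        self.class_proj = nn.Linear(num_channels, num_classes)
        # Learnable per-class channel affinities used to re-weight feature maps.
        self.class_weights = nn.Parameter(0.01 * torch.randn(num_classes, num_channels))
        self.strength = strength

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width) activations of a spatiotemporal CNN.
        pooled = x.mean(dim=(2, 3, 4))                            # global average pooling
        class_probs = F.softmax(self.class_proj(pooled), dim=1)   # soft class estimate
        # Expected channel affinity under the predicted class distribution.
        affinity = torch.sigmoid(class_probs @ self.class_weights)  # (batch, channels)
        # Amplify channels with affinity above 0.5, suppress those below it.
        gate = 1.0 + self.strength * (affinity - 0.5)
        return x * gate.view(x.size(0), -1, 1, 1, 1)


# Example: gate a mini-batch of intermediate activation maps.
reg = ClassRegularization(num_channels=256, num_classes=400)
feats = torch.randn(2, 256, 8, 14, 14)   # (batch, C, T, H, W)
out = reg(feats)                          # same shape, class-conditionally re-weighted
```

In use, a module of this kind would sit after an intermediate convolutional block, so that channels judged relevant to the estimated class are amplified and the rest suppressed before the next block.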

Highlights

  • Video-based action recognition has seen tremendous progress since the introduction of Convolutional Neural Networks (CNNs) [1,2]

  • We propose Class Regularization, a regularization method for spatiotemporal CNNs

  • We demonstrate the merits of Class Regularization on action recognition performance across three benchmark datasets and several widely used CNN architectures

Introduction

Video-based action recognition has seen tremendous progress since the introduction of Convolutional Neural Networks (CNNs) [1,2]. Kernels in early layers capture simple textures and patterns, while deeper layers respond to more complex parts of objects or scenes. As these features become more dependent on the weighting of neural connections in previous layers, only a portion of them becomes descriptive of a specific class [3,4]. Much of the discriminative nature of CNNs is achieved only in the very last fully-connected layers. This hinders easy interpretation of the part of the network that is informative for a specific class.
