Abstract

Efficiency plays a key role in video understanding modeling, and developing more efficient spatiotemporal deep networks is a key ingredient for enabling their usage in production scenarios. In this work, we propose a methodology for reducing the computational complexity of a video understanding backbone while limiting the drop in accuracy caused by architectural changes. Our approach, named Progressive Architecture Shrinkage, applies a sequence of reduction operators to the hyperparameters of a network to reduce its computational footprint. The choice of the sequence of operations is automatically optimized in a coordinate‐descent schema, and the approach transfers knowledge from both the initial network and previous stages of the shrinking process by employing a knowledge distillation and an adaptive fine‐tuning strategy. As each iteration of the shrinking algorithm requires training a large‐scale video understanding network, we perform experiments on MARCONI 100—a supercomputer equipped with an IBM Power9 architecture and Volta NVIDIA GPUs. Experimental evaluations are conducted using two backbones and three different action recognition benchmarks. We show that, through our approach, high accuracy levels can be maintained while reducing the number of multiply–add operations by a factor of four with respect to the original architectures. Code will be made available.
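The coordinate‐descent shrinking loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the hyperparameter names, the `evaluate` callback (which would wrap training and validating a candidate network), and the reduction operators are all assumptions introduced for illustration.

```python
# Hypothetical sketch of coordinate-descent architecture shrinkage.
# `evaluate` is assumed to return (accuracy, macs) for a hyperparameter
# configuration; in the paper each such call means training a network.

def shrink_architecture(hparams, evaluate, shrink_ops, target_macs):
    """Repeatedly apply one reduction operator per iteration, chosen by
    coordinate descent over hyperparameters: try shrinking each one in
    turn and keep the candidate that retains the highest accuracy."""
    accuracy, macs = evaluate(hparams)
    while macs > target_macs:
        best = None
        for name, op in shrink_ops.items():
            candidate = dict(hparams)
            candidate[name] = op(candidate[name])  # e.g. halve the width
            acc_c, macs_c = evaluate(candidate)
            # keep only candidates that actually reduce cost
            if macs_c < macs and (best is None or acc_c > best[1]):
                best = (candidate, acc_c, macs_c)
        if best is None:  # no operator reduces cost any further
            break
        hparams, accuracy, macs = best
    return hparams, accuracy, macs
```

In practice each accepted reduction would be followed by knowledge‐distillation fine‐tuning from the previous (larger) network, which the toy `evaluate` above does not model.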
