Abstract

Any deep learning task raises two key optimization questions. First, when should we stop training, or, alternatively, how long should we train before further gains are not worth the continued cost (i.e., early or optimal stopping)? Second, what is the “right” or best model: which training settings, hyperparameters, and model architecture maximize performance on the task at hand (i.e., architecture search)? Though essential, answering these questions is arguably the most expensive and least well-understood part of deep learning experimentation. Moreover, the exhaustive searches they entail require large computational budgets, with correspondingly significant energy expenditure and environmental footprint. In this paper, we introduce a new method, Loss Curve Gradient Approximation (LCGA), that ranks model performance with minimal training. Using a wide variety of popular deep vision models, we test its predictive power across different neural architectures and training settings. For comparison, we benchmark LCGA against Training Speed Estimation (TSE), an existing technique for architecture search and performance ranking, and show that LCGA can significantly outperform TSE while retaining the same advantages in ease, speed, and efficiency. Lastly, we describe potential applications of LCGA beyond performance ranking: namely, (1) combining collected experimental data with LCGA to develop train-less neural architecture search (NAS), and (2) a framework that more rigorously guides early stopping in training by borrowing the concept of demand elasticity from economics.
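To make the two estimators concrete, the Python sketch below shows one plausible way such ranking scores could be computed from partial training loss curves. TSE is typically computed as a sum of early training losses (Ru et al., 2021); the `lcga_score` shown here is only an assumed finite-difference approximation of the loss curve's gradient, since the abstract does not give LCGA's exact formulation. The function names and example data are hypothetical.

```python
import numpy as np

def tse_score(losses: list[float], budget: int = 5) -> float:
    """TSE-style score: sum of training losses over the first `budget`
    epochs (lower is better), following the general idea of Training
    Speed Estimation (Ru et al., 2021)."""
    return float(np.sum(losses[:budget]))

def lcga_score(losses: list[float], budget: int = 5) -> float:
    """Hypothetical LCGA-style score: a finite-difference approximation
    of the loss curve's gradient over the first `budget` epochs.
    More negative means a steeper early descent (lower is better).
    This is an illustrative assumption, not the paper's exact formula."""
    early = np.asarray(losses[:budget], dtype=float)
    return float(np.mean(np.diff(early)))

# Usage: rank two candidate models from their partial training loss curves.
curves = {
    "model_a": [2.30, 1.90, 1.60, 1.40, 1.30],  # steep early descent
    "model_b": [2.30, 2.10, 2.00, 1.90, 1.85],  # shallow early descent
}
ranking = sorted(curves, key=lambda name: lcga_score(curves[name]))
print(ranking)  # ['model_a', 'model_b']: steepest descender ranked first
```

Either score needs only a few epochs of training per candidate, which is what makes this family of estimators cheap relative to full training runs.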
