Understanding Decoupled and Early Weight Decay

Johan Bjorck, Kilian Q Weinberger, Carla Gomes

doi:10.1609/aaai.v35i8.16837

Open Access

https://doi.org/10.1609/aaai.v35i8.16837

Abstract

Weight decay (WD) is a traditional regularization technique in deep learning, but despite its ubiquity, its behavior is still an area of active research. Golatkar et al. have recently shown that WD only matters at the start of training in computer vision, upending traditional wisdom. Loshchilov et al. show that for adaptive optimizers, manually decaying weights can outperform adding an l2 penalty to the loss. This technique has become increasingly popular and is referred to as decoupled WD. The goal of this paper is to investigate these two recent empirical observations. We demonstrate that by applying WD only at the start, the network norm stays small throughout training. This has a regularizing effect as the effective gradient updates become larger. However, traditional generalization metrics fail to capture this effect of WD, and we show how a simple scale-invariant metric can. We also show how the growth of network weights is heavily influenced by the dataset and its generalization properties. For decoupled WD, we perform experiments in NLP and RL, where adaptive optimizers are the norm. We demonstrate that the primary issue that decoupled WD alleviates is the mixing of gradients from the objective function and the l2 penalty in the buffers of Adam (which store the estimates of the first-order moment). Adaptivity itself is not problematic, and decoupled WD ensures that the gradients from the l2 term cannot "drown out" the true objective, facilitating easier hyperparameter tuning.
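
The difference between an l2 penalty and decoupled WD in Adam can be illustrated with a small sketch. This is not the authors' code: the function names `adam_step` and `new_state`, the plain-Python list representation, and the default hyperparameters (the commonly used Adam defaults) are assumptions made for illustration. With `decoupled=False` the l2 gradient is added to the objective gradient before it enters Adam's moment buffers; with `decoupled=True` the buffers see only the objective gradient and the weights are shrunk directly.

```python
import math


def new_state(n):
    """Fresh Adam state for n parameters (hypothetical helper)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam update on a list of floats (illustrative sketch only).

    decoupled=False: weight_decay * w is added to the gradient, so the
                     l2 term is mixed into the moment buffers m and v.
    decoupled=True:  the buffers hold only the objective gradient and
                     the decay is applied directly to the weights.
    """
    state["t"] += 1
    t = state["t"]
    new_w = []
    for i, (wi, gi) in enumerate(zip(w, grad)):
        if not decoupled:
            # Coupled l2 penalty: the penalty gradient joins the objective's.
            gi = gi + weight_decay * wi
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * gi
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * gi * gi
        # Bias-corrected moment estimates.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        step = lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            # Decoupled decay: shrink the weight outside the adaptive step.
            step += lr * weight_decay * wi
        new_w.append(wi - step)
    return new_w


# With a zero objective gradient, only the decay term acts on the weight.
s_l2, s_dec = new_state(1), new_state(1)
w_l2 = adam_step([1.0], [0.0], s_l2, weight_decay=0.1, decoupled=False)
w_dec = adam_step([1.0], [0.0], s_dec, weight_decay=0.1, decoupled=True)
print("l2 penalty:", w_l2, "first-moment buffer:", s_l2["m"])
print("decoupled: ", w_dec, "first-moment buffer:", s_dec["m"])
```

In the coupled call the first-moment buffer becomes nonzero even though the objective gradient is zero, which is the mixing the abstract describes. In the decoupled call the buffer stays at zero and the weight shrinks only by `lr * weight_decay * w`.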

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: May 18, 2021
Citations: 8
