Abstract
L2 regularization for weights in neural networks is widely used as a standard training trick. In addition to the weights, batch normalization introduces an additional trainable parameter γ, which acts as a scaling factor. However, L2 regularization for γ remains largely undiscussed and is applied in different ways depending on the library and practitioner. In this article, we study whether L2 regularization for γ is valid. To explore this issue, we consider two approaches: (1) variance control to make the residual network behave like an identity mapping and (2) stable optimization through improvement of the effective learning rate. Through these two analyses, we identify the γ for which L2 regularization is desirable or undesirable and propose four guidelines for managing them. In several experiments, we observed that applying L2 regularization to applicable γ increased classification accuracy by 1% to 4%, whereas applying L2 regularization to inapplicable γ decreased classification accuracy by 1% to 3%, consistent with our four guidelines. The proposed guidelines were further validated across various tasks and architectures, including variants of residual networks and transformers.
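To illustrate the kind of per-parameter choice the abstract refers to, the following is a minimal sketch, assuming a PyTorch setup, of how weight decay (L2 regularization) is commonly applied to some parameters and withheld from others, such as batch-normalization γ and β. The split shown here is only an example; the paper's four guidelines for deciding which γ to regularize are not encoded in this snippet.

```python
import torch
import torch.nn as nn

def split_weight_decay_params(model, weight_decay=1e-4):
    """Build optimizer parameter groups with and without L2 regularization.

    Here batch-norm parameters (gamma and beta) and biases are placed in the
    no-decay group as an illustrative default; the paper's guidelines may
    assign individual gammas differently.
    """
    decay, no_decay = [], []
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            # BN gamma (weight) and beta (bias): excluded from decay in this sketch
            no_decay.extend(p for p in module.parameters(recurse=False) if p.requires_grad)
        else:
            for name, p in module.named_parameters(recurse=False):
                if not p.requires_grad:
                    continue
                (no_decay if name == "bias" else decay).append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Usage example with a toy model
model = nn.Sequential(nn.Conv2d(3, 16, 3, bias=False), nn.BatchNorm2d(16), nn.ReLU())
optimizer = torch.optim.SGD(split_weight_decay_params(model), lr=0.1, momentum=0.9)
```

Because libraries differ in whether their default weight-decay setting covers γ, making the grouping explicit, as above, is what allows the guidelines studied in the paper to be applied at all.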