Abstract

A key strategy for enabling the training of deep neural networks is to use non-saturating activation functions, which mitigate the vanishing gradient problem. Popular choices that saturate only in the negative domain are the rectified linear unit (ReLU), its smooth, non-linear variant, Softplus, and the exponential linear units (ELU and SELU). Other functions, such as the piecewise-linear parametric ReLU (PReLU), are non-saturating across the entire real domain. Here we introduce a non-linear activation function called Soft++ that extends PReLU and Softplus by parametrizing both the slope in the negative domain and the exponent. We test identical network architectures with ReLU, PReLU, Softplus, ELU, SELU, and Soft++ on several machine learning problems and find that: i) convergence of networks with any activation function depends critically on the particular dataset and network architecture, underscoring the need for parametrization, which allows the activation function to be adapted to the problem at hand; ii) non-linearity around the origin improves learning and generalization; iii) in many cases, non-saturation across the entire real domain further improves performance. On very difficult learning problems with deep fully-connected and convolutional networks, Soft++ outperforms all other activation functions, accelerating learning and improving generalization. Its main advantage lies in its dual parametrization, which offers flexible control of the shape and gradient of the function.
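The abstract does not give the closed form of Soft++; the sketch below is a minimal illustration, assuming a Softplus-like term with an exponent parameter k combined with a linear term of slope 1/c (the two parametrized quantities mentioned above), shifted so the function passes through the origin. The parameter names k and c, and this exact functional form, are assumptions for illustration; the authoritative definition is in the full paper.

```python
import numpy as np

def soft_plus_plus(x, k=1.0, c=2.0):
    """Sketch of a Soft++-style activation (assumed form, not the paper's
    authoritative definition): a Softplus-like term with exponent k plus a
    linear term with slope 1/c, shifted by -ln(2) so that f(0) = 0.
    The linear term keeps the function non-saturating for large negative x."""
    return np.log1p(np.exp(k * x)) + x / c - np.log(2.0)

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    return np.log1p(np.exp(x))

# Compare the three activations on a small grid of inputs.
x = np.linspace(-5.0, 5.0, 11)
print("x       :", np.round(x, 2))
print("ReLU    :", np.round(relu(x), 3))
print("Softplus:", np.round(softplus(x), 3))
print("Soft++  :", np.round(soft_plus_plus(x, k=1.0, c=2.0), 3))
```

Under this assumed form, the Softplus-like term vanishes for large negative inputs, leaving the linear term x/c, so the gradient never saturates; for large positive inputs the slope approaches k + 1/c, and k and c together control the curvature around the origin.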
