Abstract

The rectified linear unit (ReLU) is the most widely used activation function in current deep feed-forward neural networks, although many alternatives have been proposed. None of these alternatives has managed to consistently outperform the others, and there is no unified theory connecting properties of the task and the network with properties of activation functions with respect to training efficiency. To understand these problems fundamentally, it is necessary to analyze the possible causes of gradient instability mathematically and to determine how different activation functions can be adopted to improve performance. This paper presents a theoretical analysis of gradient instability, a fundamental explanation of the exploding/vanishing gradient problem, and possible solutions. The performance of different activation functions in an example network is investigated. Numerical simulations suggest that the convergence rate of the gradient varies with the activation function, and that no single activation function performs well for all network structures. The findings provide a reference for selecting activation functions in the design of deep neural network models.
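
As an illustration of the vanishing/exploding-gradient behaviour discussed above, the following is a minimal NumPy sketch (not taken from the paper) that backpropagates a unit error signal through a deep stack of randomly initialized layers and compares the magnitude of the gradient reaching the first layer under sigmoid, tanh, and ReLU activations. The depth, width, and 1/sqrt(width) Gaussian weight initialization are illustrative assumptions, not the paper's experimental setup.

# Minimal sketch (assumed setup, not the paper's experiment): compare how the
# backpropagated gradient magnitude depends on the activation function.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

ACTIVATIONS = {
    # name: (forward function, derivative as a function of the pre-activation z)
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "tanh":    (np.tanh, lambda z: 1.0 - np.tanh(z) ** 2),
    "relu":    (lambda z: np.maximum(z, 0.0), lambda z: (z > 0.0).astype(float)),
}

def first_layer_gradient_norm(act_name, depth=30, width=100):
    """Forward a random input through `depth` layers, then backpropagate a
    unit error signal and return the norm of the gradient at the input."""
    f, df = ACTIVATIONS[act_name]
    x = rng.standard_normal(width)
    weights, pre_acts = [], []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)  # assumed init
        z = W @ x
        weights.append(W)
        pre_acts.append(z)
        x = f(z)
    # Backward pass: delta_{l-1} = W_l^T (delta_l * f'(z_l))
    delta = np.ones(width)
    for W, z in zip(reversed(weights), reversed(pre_acts)):
        delta = W.T @ (delta * df(z))
    return np.linalg.norm(delta)

for name in ACTIVATIONS:
    print(f"{name:8s} gradient norm at first layer: {first_layer_gradient_norm(name):.3e}")

Because the sigmoid derivative is bounded by 0.25, the backpropagated signal typically shrinks by many orders of magnitude over 30 layers, whereas ReLU passes the signal through unattenuated wherever the pre-activation is positive; this is the kind of activation-dependent gradient behaviour the abstract refers to.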
