Abstract

Recently, research in machine learning has become increasingly reliant on data-driven approaches. However, understanding the general theory behind optimal neural network architecture is, arguably, just as important. With the proliferation of deep learning and neural networks, finding an optimal neural network architecture is vital for both accuracy and performance, and extensive research on neural network architecture has been performed in recent years [3], [4]. Yet while there has been plenty of research on hidden-layer architecture [2], activation functions are often not considered. In a network, an activation function defines the output of a neuron and introduces non-linearities into the neural network, enabling it to act as a universal function approximator [12]. In terms of activation functions, one significant paper is Krizhevsky's seminal work on ImageNet classification and its popularization of the ReLU activation function [1]. In that paper, Krizhevsky outlines the construction of an image recognition model using the Rectified Linear Unit (ReLU) activation function for the ImageNet LSVRC-2010 competition, which outperformed the state-of-the-art image recognition systems of the time [1]. Since then, ReLU has grown in popularity. In their 2018 conference paper, Bircanoglu and Arica, drawing on 231 distinct training procedures, found ReLU to be the best general-purpose activation function [12]. In addition to comparisons of activation functions, Nwankpa, Ijomah, Gachagan, and Marshall conducted a meta-analysis of the research on activation functions and found ReLU to be the most popular choice [5]. In terms of optimizers, gradient descent has historically been the most popular loss optimization algorithm, but since Kingma and Ba's 2014 paper, Adam: A Method for Stochastic Optimization [8], the Adam optimizer has steadily become the industry standard [11]. In their paper, Kingma and Ba combine ideas from momentum-based gradient descent, RMSprop, and Adagrad into a single algorithm, Adam (adaptive moment estimation) [8]. Beyond Adam, there are plenty of other optimizers to choose from, including gradient descent, RMSprop [9], Adagrad [10], and Adadelta [7].

Many breakthroughs have recently been made in neural network performance; improved GPUs and their adaptation to deep learning tasks have produced large efficiency gains across the field of machine learning. Furthermore, as machine learning becomes increasingly optimized, the importance of efficiency improvements will continue to rise. Thus, understanding the optimal choice of activation function and optimizer for a neural network is highly relevant.

The goal of this paper is to compare activation functions, optimizers, and, more generally, entire neural network architectures, through measured error in a training environment. We examine the performance of a wide variety of neural network configurations on randomly generated polynomial datasets of fixed degree, comparing various activation functions and optimizers while controlling for hidden-layer configuration and the degree of the underlying polynomial dataset. Curiously, we find that the Sigmoid activation function is more accurate than ReLU and Tanh for regression tasks on low-featured polynomial data. We reach a similar conclusion for Stochastic Gradient Descent (SGD) in comparison to the Adam optimizer and Root Mean Square Propagation (RMSprop).
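As a concrete illustration of the kind of comparison described above, the following is a minimal sketch of an activation function / optimizer sweep on randomly generated polynomial regression data. It assumes a TensorFlow/Keras-style setup; the layer widths, polynomial degree, sample count, noise level, and training budget are illustrative placeholders, not the configuration used in our experiments.

    # Minimal sketch (not the paper's exact configuration): compare activation
    # function / optimizer combinations on randomly generated polynomial data.
    import numpy as np
    import tensorflow as tf

    rng = np.random.default_rng(0)

    def make_polynomial_data(degree=3, n_samples=2000, noise=0.05):
        """Sample x uniformly and evaluate a random polynomial of the given degree."""
        coeffs = rng.normal(size=degree + 1)          # random polynomial coefficients
        x = rng.uniform(-1.0, 1.0, size=(n_samples, 1))
        y = np.polyval(coeffs, x) + noise * rng.normal(size=(n_samples, 1))
        return x.astype("float32"), y.astype("float32")

    def build_model(activation, optimizer, hidden_layers=(32, 32)):
        """Fixed hidden-layer configuration; only the activation and optimizer vary."""
        model = tf.keras.Sequential(
            [tf.keras.Input(shape=(1,))]
            + [tf.keras.layers.Dense(w, activation=activation) for w in hidden_layers]
            + [tf.keras.layers.Dense(1)]              # linear output for regression
        )
        model.compile(optimizer=optimizer, loss="mse")
        return model

    x, y = make_polynomial_data(degree=3)
    results = {}
    for activation in ["sigmoid", "relu", "tanh"]:
        for optimizer in ["sgd", "adam", "rmsprop"]:
            model = build_model(activation, optimizer)
            history = model.fit(x, y, epochs=200, batch_size=64,
                                validation_split=0.2, verbose=0)
            results[(activation, optimizer)] = history.history["val_loss"][-1]

    for config, mse in sorted(results.items(), key=lambda kv: kv[1]):
        print(config, f"validation MSE = {mse:.5f}")

The measured error for each (activation, optimizer) pair, with the hidden-layer configuration held fixed, is the quantity on which the comparisons in this paper are based.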
Additionally, we observe that SGD is more efficient in the short term at finding local minima than Adam or RMSprop; however, after sufficiently many epochs, the performance differences between the optimizers vanish.
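The short-term versus long-term convergence comparison behind this observation can be sketched as a continuation of the illustration above (reusing the hypothetical build_model, x, and y definitions): record the per-epoch training loss for each optimizer with the activation function held fixed, then compare an early-epoch snapshot against the final value. The epoch counts below are illustrative, not the values used in our experiments.

    # Illustrative continuation of the sketch above (reuses build_model, x, y):
    # track per-epoch training loss per optimizer and compare early vs. final loss.
    EARLY_EPOCH, TOTAL_EPOCHS = 10, 500

    histories = {}
    for optimizer in ["sgd", "adam", "rmsprop"]:
        model = build_model(activation="sigmoid", optimizer=optimizer)
        history = model.fit(x, y, epochs=TOTAL_EPOCHS, batch_size=64, verbose=0)
        histories[optimizer] = history.history["loss"]

    for optimizer, losses in histories.items():
        print(f"{optimizer:8s} loss @ epoch {EARLY_EPOCH}: {losses[EARLY_EPOCH - 1]:.5f}   "
              f"final loss: {losses[-1]:.5f}")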
