Investigations on Hessian-free optimization for cross-entropy training of deep neural networks
Context-dependent deep neural network HMMs have been shown to achieve recognition accuracy superior to Gaussian mixture models in a number of recent works. Typically, neural networks are optimized with stochastic gradient descent. On large datasets, stochastic gradient descent improves quickly at the beginning of the optimization. But since it does not make use of second-order information, its asymptotic convergence is slow. In regions with pathological curvature, stochastic gradient descent may almost stagnate and thereby falsely indicate convergence. Another drawback of stochastic gradient descent is that it can only be parallelized within minibatches. The Hessian-free algorithm is a second-order batch optimization algorithm that does not suffer from these problems. In a recent work, Hessian-free optimization has been applied to the training of deep neural networks according to a sequence criterion. In that work, improvements in accuracy and training time have been reported. In this paper, we analyze the properties of the Hessian-free optimization algorithm and investigate whether it is suited for cross-entropy training of deep neural networks as well.
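To make the idea of a second-order method that never forms the Hessian concrete, here is a minimal sketch of one Hessian-free step on a toy problem. It is an illustration under stated assumptions, not the implementation studied in the paper: the curvature product is approximated with a finite difference of the gradient (practical systems typically use exact Gauss-Newton products), and the names `hessian_vector_product`, `cg_solve`, `hf_step`, and `toy_grad` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    """Approximate H(w) @ v with a central finite difference of the gradient.
    Assumption: a stand-in for the exact curvature product used in practice."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_fn(w_plus), grad_fn(w_minus)
    return [(a - b) / (2 * eps) for a, b in zip(g_plus, g_minus)]

def cg_solve(hvp, b, iters=50, tol=1e-10):
    """Solve H x = b by conjugate gradient using only products H @ v."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - H x, with x = 0
    p = list(r)          # search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Hp = hvp(p)
        alpha = rs / dot(p, Hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def hf_step(grad_fn, w):
    """One Newton-like update: solve H d = -g with CG, then move to w + d."""
    g = grad_fn(w)
    d = cg_solve(lambda v: hessian_vector_product(grad_fn, w, v),
                 [-gi for gi in g])
    return [wi + di for wi, di in zip(w, d)]

# Toy objective f(w) = 0.5 * (100 * w0^2 + w1^2): badly conditioned curvature.
def toy_grad(w):
    return [100.0 * w[0], 1.0 * w[1]]

w_new = hf_step(toy_grad, [1.0, 1.0])
```

On this quadratic a single step lands close to the minimizer at the origin, whereas plain gradient descent would need a step size small enough for the steep direction and would therefore crawl along the flat one.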
- Conference Article
3
- 10.1109/iscslp.2014.6936597
- Sep 1, 2014
Effective training of deep neural networks (DNNs) is very important for DNN-based speech recognition systems. Stochastic gradient descent (SGD) is the most popular method for training DNNs, and it often yields solutions that generalize well to held-out data. Recently, Hessian-free (HF) optimization has proved to be an alternative algorithm for training DNNs, and it can be used to solve pathological tasks. Stochastic Hessian-free (SHF) is a variation of HF that combines the generalization advantages of SGD with the second-order information of HF. This paper focuses on investigating the SHF algorithm for DNN training. We evaluate the algorithm on a 100-hour Mandarin Chinese recorded speech recognition task. The first experiment shows that choosing proper sizes for the gradient and curvature minibatches results in less training time and good performance. Next, we observe that the performance of SHF does not depend on the initial parameters. Furthermore, experimental results show that SHF performs comparably to SGD and better than traditional HF. Finally, we find that an additional performance improvement is obtained with a dropout algorithm.
- Conference Article
3
- 10.1145/3613424.3623779
- Oct 28, 2023
Neural network training is inherently sequential: the layers finish the forward propagation in succession, followed by the calculation and back-propagation of gradients (based on a loss function) starting from the last layer. The sequential computations significantly slow down neural network training, especially for deeper networks. Prediction has been successfully used in many areas of computer architecture to speed up sequential processing. Therefore, we propose ADA-GP, which uses gradient prediction adaptively to speed up deep neural network (DNN) training while maintaining accuracy. ADA-GP works by incorporating a small neural network to predict gradients for different layers of a DNN model. ADA-GP uses a novel tensor reorganization method to make it feasible to predict a large number of gradients. ADA-GP alternates between DNN training using backpropagated gradients and DNN training using predicted gradients. ADA-GP adaptively adjusts when and for how long gradient prediction is used to strike a balance between accuracy and performance. Last but not least, we provide a detailed hardware extension in a typical DNN accelerator to realize the speedup potential of gradient prediction. Our extensive experiments with fifteen DNN models show that ADA-GP can achieve an average speedup of 1.47× with similar or even higher accuracy than the baseline models. Moreover, it consumes, on average, 34% less energy due to reduced off-chip memory accesses compared to the baseline accelerator.
- Research Article
172
- 10.1109/tai.2021.3067574
- May 4, 2021
- IEEE Transactions on Artificial Intelligence
A variety of methods have been applied to the architectural configuration and learning or training of artificial deep neural networks (DNNs). These methods play a crucial role in the success or failure of the DNN for most problems and applications. Evolutionary algorithms (EAs) are gaining momentum as a computationally feasible method for the automated optimization of DNNs. Neuroevolution is a term that describes these processes of automated configuration and training of DNNs using EAs. While many works exist in the literature, no comprehensive surveys currently exist focusing exclusively on the strengths and limitations of using neuroevolution approaches in DNNs. The absence of such surveys can lead to a disjointed and fragmented field, preventing DNN researchers from potentially adopting neuroevolutionary methods in their own research and resulting in lost opportunities for wider application within real-world deep learning problems. This article presents a comprehensive survey, discussion, and evaluation of the state of the art in using EAs for architectural configuration and training of DNNs. This article highlights the most pertinent current issues and challenges in neuroevolution and identifies multiple promising future research directions.
Impact Statement: The concept of deep learning originated from the study of artificial neural networks (ANNs). ANNs have achieved extraordinary results in a variety of diverse application areas. Numerous methods have been applied to the architectural configuration and learning or training of artificial DNNs, and these methods play a crucial role in the success or failure of the DNN for most problems and applications. Recently, EAs have been gaining momentum as a computationally feasible method (called neuroevolution) for the automated configuration and learning or training of DNNs. This article reviews over 170 recent scientific papers describing how major EA paradigms are being applied by researchers to the configuration and optimization of multiple DNNs. By articulating a clear understanding of the context, state of the art, and feasibility of neuroevolution, this article will benefit researchers in AI, EAs, and DNNs. The impact of this article comes from contributing toward enhancing research capacity, knowledge, and skills for researchers currently working in neuroevolution and actively engaging those considering becoming involved in this area.
- Research Article
42
- 10.1016/j.jco.2020.101540
- Nov 27, 2020
- Journal of Complexity
Non-convergence of stochastic gradient descent in the training of deep neural networks
- Conference Article
8
- 10.1109/icpads47876.2019.00068
- Dec 1, 2019
Deep neural network (DNN) training is generally performed by cloud computing platforms. However, cloud-based training has several problems such as network bottleneck, server management cost, and privacy. To overcome these problems, one of the most promising solutions is distributed DNN model training which trains the model with not only high-performance servers but also low-end power-efficient mobile edge or user devices. However, due to the lack of a framework which can provide an optimal cluster configuration (i.e., determining which computing devices participate in DNN training tasks), it is difficult to perform efficient DNN model training considering DNN service providers' preferences such as training time or energy efficiency. In this paper, we introduce a novel framework for distributed DNN training that determines the best training cluster configuration with available heterogeneous computing resources. Our proposed framework utilizes pre-training with a small number of training steps and estimates training time, power, energy, and energy-delay product (EDP) for each possible training cluster configuration. Based on the estimated metrics, our framework performs DNN training for the remaining steps with the chosen best cluster configurations depending on DNN service providers' preferences. Our framework is implemented in TensorFlow and evaluated with three heterogeneous computing platforms and five widely used DNN models. According to our experimental results, in 76.67% of the cases, our framework chooses the best cluster configuration depending on DNN service providers' preferences with only a small training time overhead.
- Research Article
11
- 10.1109/tnnls.2021.3130991
- Sep 1, 2023
- IEEE Transactions on Neural Networks and Learning Systems
Deep neural network (DNN) training is an iterative process of updating network weights, called gradient computation, for which the (mini-batch) stochastic gradient descent (SGD) algorithm is generally used. Since SGD inherently allows gradient computations with noise, a proper approximation of the weight gradients within the SGD noise can be a promising technique to save energy and time during DNN training. This article proposes two novel techniques to reduce the computational complexity of the gradient computations for the acceleration of SGD-based DNN training. First, considering that the output predictions of a network (confidence) change with training inputs, the relation between the confidence and the magnitude of the weight gradient can be exploited to skip the gradient computations without seriously sacrificing accuracy, especially for high-confidence inputs. Second, angle diversity-based approximations of intermediate activations for weight gradient calculation are also presented. Based on the fact that the angle diversity of gradients is small (highly uncorrelated) in the early training epochs, the bit precision of activations can be reduced to 2, 4, or 8 bits depending on the resulting angle error between the original gradient and the quantized gradient. The simulations show that the proposed approach can skip up to 75.83% of gradient computations with negligible accuracy degradation for the CIFAR-10 dataset using ResNet-20. Hardware implementation results using 65-nm CMOS technology also show that the proposed training accelerator achieves up to 1.69x energy efficiency compared with other training accelerators.
- Conference Article
8
- 10.1109/asru.2013.6707750
- Dec 1, 2013
Minimum phone error (MPE) training of deep neural networks (DNNs) is an effective technique for reducing the word error rate of automatic speech recognition tasks. This training is often carried out using a Hessian-free (HF) quasi-Newton approach, although other methods such as stochastic gradient descent have also been applied successfully. In this paper we present a novel stochastic approach to HF sequence training inspired by the recently proposed stochastic average gradient (SAG) method. SAG reuses gradient information from past updates, and consequently simulates the presence of more training data than is really observed for each model update. We extend SAG by dynamically weighting the contribution of previous gradients, and by combining it with stochastic HF optimization. We term the resulting procedure DSAG-HF. Experimental results for training DNNs on 1500 hours of audio data show that, compared to baseline HF training, DSAG-HF leads to better held-out MPE loss after each model parameter update, and converges to an overall better loss value. Furthermore, since each update in DSAG-HF takes place over a smaller amount of data, this procedure converges in about half the time of baseline HF sequence training.
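The gradient reuse that SAG relies on can be sketched in a few lines. This is plain SAG on a scalar toy problem only, as commonly described in the optimization literature. It does not include the dynamic weighting or the HF combination that define DSAG-HF, and the function name `sag`, the toy targets, and the step size are assumptions chosen for illustration.

```python
import random

def sag(grad_i, n, w0, lr, steps, seed=0):
    """Stochastic average gradient on a scalar parameter.
    grad_i(i, w) returns the gradient of sample i's loss at w."""
    rng = random.Random(seed)
    w = w0
    table = [0.0] * n   # most recent gradient seen for each sample
    total = 0.0         # running sum of the table entries
    for _ in range(steps):
        i = rng.randrange(n)
        g = grad_i(i, w)
        total += g - table[i]   # refresh only sample i's contribution
        table[i] = g
        w -= lr * total / n     # step along the average of stored gradients
    return w

# Toy problem: minimize the mean of 0.5 * (w - a_i)^2, whose minimizer is mean(a) = 2.
targets = [1.0, 2.0, 3.0]
w_final = sag(lambda i, w: w - targets[i], n=len(targets), w0=0.0,
              lr=1 / 16, steps=2000)
```

Each update touches one sample but steps along an average over all samples, which is the sense in which the method behaves as if more data had been observed per update.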
- Conference Article
93
- 10.5555/3018874.3018877
- Nov 13, 2016
This paper presents a theoretical analysis and practical evaluation of the main bottlenecks towards a scalable distributed solution for the training of Deep Neural Networks (DNNs). The presented results show that the current state-of-the-art approach, data-parallelized Stochastic Gradient Descent (SGD), is quickly turning into a vastly communication-bound problem. In addition, we present simple but fixed theoretical constraints that prevent effective scaling of DNN training beyond only a few dozen nodes. This leads to poor scalability of DNN training in most practical scenarios.
- Conference Article
51
- 10.24963/ijcai.2020/452
- Dec 21, 2018
Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, have the nice property of fast convergence but have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves how to close the generalization gap of adaptive gradient methods an open problem. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes "over adapted". We design a new algorithm, called the Partially adaptive momentum estimation method, which unifies Adam/Amsgrad with SGD by introducing a partial adaptive parameter $p$, to achieve the best of both worlds. We also prove the convergence rate of our proposed algorithm to a stationary point in the stochastic nonconvex optimization setting. Experiments on standard benchmarks show that our proposed algorithm can maintain a fast convergence rate like Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results suggest that practitioners pick up adaptive gradient methods once again for faster training of deep neural networks.
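How a single exponent can interpolate between the two families is easiest to see in code. The sketch below assumes the commonly cited form of the update, in which the Amsgrad second-moment estimate is raised to the power $p$ instead of 1/2. The function name `padam_step`, the `eps` term, and all default values are assumptions for illustration, not settings taken from the paper.

```python
def padam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
               p=0.125, eps=1e-8):
    """One partially adaptive update on lists of floats.
    p = 0.5 gives an Amsgrad-style step; p = 0 leaves only the momentum term."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(state["m"], grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(state["v"], grad)]
    # Amsgrad-style running maximum of the second-moment estimate
    v_hat = [max(a, b) for a, b in zip(state["v_hat"], v)]
    new_theta = [t - lr * mi / (vh ** p + eps)
                 for t, mi, vh in zip(theta, m, v_hat)]
    return new_theta, {"m": m, "v": v, "v_hat": v_hat}

def zero_state(n):
    return {"m": [0.0] * n, "v": [0.0] * n, "v_hat": [0.0] * n}

# Same gradient, two extremes of the partial adaptive parameter.
theta_sgd_like, _ = padam_step([1.0], [2.0], zero_state(1), p=0.0)
theta_amsgrad_like, _ = padam_step([1.0], [2.0], zero_state(1), p=0.5)
```

With `p=0.0` the denominator is constant and the step depends only on the momentum term, while `p=0.5` rescales it by the square root of the second-moment estimate. Intermediate values of `p` sit between these two behaviours.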
- Conference Article
4
- 10.1145/3629526.3645035
- May 7, 2024
First-come first-serve scheduling can leave a substantial fraction (up to 10%) of supercomputer nodes transiently idle. Recognizing that such unfilled nodes are well-suited for deep neural network (DNN) training, due to the flexible nature of DNN training tasks, Liu et al. proposed that re-scaling DNN training tasks to fit gaps in schedules be formulated as a mixed-integer linear programming (MILP) problem, and demonstrated via simulation the potential benefits of the approach. Here, we introduce MalleTrain, a system that provides the first practical implementation of this approach and that furthermore generalizes it by allowing it to be used even for DNN training applications for which model information is unknown before runtime. Key to this latter innovation is the use of a lightweight online job profiling advisor (JPA) to collect critical scalability information for DNN jobs, information that it then employs to optimize resource allocations dynamically, in real time. We describe the MalleTrain architecture and present the results of a detailed experimental evaluation on a supercomputer GPU cluster and several representative DNN training workloads, including neural architecture search and hyperparameter optimization. Our results not only confirm the practical feasibility of leveraging idle supercomputer nodes for DNN training but also improve significantly on prior results, improving training throughput by up to 22.3% without requiring users to provide job scalability information.
- Research Article
1
- 10.14429/dsj.74.19475
- Nov 25, 2024
- Defence Science Journal
Deep learning techniques have shown remarkable success in radar identification. However, deep neural network training can be time and resource intensive. Batch normalization is a popular approach for speeding up deep feed-forward neural network training. The training of deep neural networks is accelerated by minimizing the internal covariate shift and stabilizing the training process by normalizing the intermediate activations within each mini-batch. In this research, the convergence behavior of networks with and without batch normalization is compared. Batch normalization standardizes the input to a layer for each mini-batch, applied either to the activations of a prior layer or to the inputs directly. Our experiments indicate that batch normalization is effective in improving a variety of neural network properties. The results show that batch-normalized models have higher test and validation accuracies across all datasets, which we attribute to their regularizing impact and steadier gradient propagation. This research also examines the impact of several parameters, such as batch size, momentum, and the beta and gamma parameters, on the effectiveness of DNNs with batch normalization. The radar dataset used for training is the fused emitter set obtained after feature-level fusion of the tracks intercepted by ESM (Electronic Support) and ELINT (Electronic Intelligence) systems.
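The per-mini-batch standardization described above, together with the gamma and beta parameters, can be illustrated with a minimal training-time sketch for a single feature. It is a generic textbook form rather than the configuration used in the study: the function name `batch_norm` and the `eps` default are assumptions, and the running statistics governed by the momentum parameter are omitted.

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift.
    gamma and beta are the learnable scale and shift parameters."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n   # biased batch variance
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)
rescaled = batch_norm(activations, gamma=2.0, beta=3.0)

out_mean = sum(normalized) / len(normalized)
out_var = sum((x - out_mean) ** 2 for x in normalized) / len(normalized)
```

With the defaults the output has approximately zero mean and unit variance. Setting `gamma` and `beta` lets the layer restore any other scale and offset if that turns out to be useful during training.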
- Conference Article
14
- 10.1109/icassp.2015.7178917
- Apr 1, 2015
The algorithm of choice for cross-entropy training of deep neural network (DNN) acoustic models is mini-batch stochastic gradient descent (SGD). One of the important decisions for this algorithm is the learning rate strategy (also called stepsize selection). We investigate several existing schemes and propose a new learning rate strategy which is inspired by nonmonotone linesearch techniques in nonlinear optimization and the NewBob algorithm. This strategy was found to be relatively insensitive to poorly tuned parameters and resulted in lower word error rates compared to NewBob on two different LVCSR tasks (50 hours of English broadcast news transcription and 300 hours of Switchboard telephone conversations). Further, we discuss some justifications for the method by briefly linking it to results in optimization theory.
- Research Article
6
- 10.1109/tifs.2023.3273169
- Jan 1, 2023
- IEEE Transactions on Information Forensics and Security
Recently, deep-learning (DL) techniques have been widely adopted in side-channel power analysis. A DL-assisted SCA generally consists of two phases: a deep neural network (DNN) training phase and a follow-on attack phase using the trained DNN. However, the two phases are currently not well aligned, as there is no consensus on which training metric results in the most effective attack in the second phase. When traditional loss functions such as negative log-likelihood (NLL) are used in training a DNN, the trained model does not yield an optimal follow-on attack. Recently, some information-theoretic SCA leakage metrics have been proposed, either as the validation metric to stop the DNN training with traditional loss functions, or as both the validation metric and the training loss function. None of those proposed metrics, however, directly measures the SCA effectiveness. We propose to conduct DNN training directly with a common SCA effectiveness metric, Guessing Entropy (GE). We overcome the prior practical difficulty of using GE in DNN training by utilizing the GEEA estimation algorithm introduced in CHES 2020. We show that using GEEA as either the validation metric or the loss function produces DNN models that lead to much more effective follow-on attacks. Our work consolidates the DL-assisted SCA framework with a consistent metric, which shows great potential to be adopted as the universal SCA-oriented DNN training framework.
- Research Article
13
- 10.3390/a14040107
- Mar 28, 2021
- Algorithms
The accurate identification of intrinsically disordered proteins or protein regions is of great importance, as they are involved in critical biological processes and related to various human diseases. In this paper, we develop a deep neural network that is based on the well-known VGG16. Our deep neural network is trained using 1450 proteins from the dataset DIS1616, and the trained neural network is tested on the remaining 166 proteins. Our trained neural network is also tested on the blind test sets R80 and MXD494 to further demonstrate the performance of our model. The MCC value of our trained deep neural network is 0.5132 on the test set DIS166, 0.5270 on the blind test set R80, and 0.4577 on the blind test set MXD494. All of these MCC values exceed the corresponding values of existing prediction methods.
- Book Chapter
9
- 10.1007/978-3-030-03493-1_83
- Jan 1, 2018
Artificial neural networks (ANNs) are again playing a leading role in machine learning, especially in classification and regression processes, due to the emergence of deep learning (ANNs with more than four hidden layers), which allows them to encode more and more complex features. The increase in the number of hidden layers in ANNs has posed important challenges in their training. Variations (e.g. RMSProp) of classical algorithms such as backpropagation with stochastic gradient descent are the state of the art for training deep ANNs. However, other research has shown that the advantages of metaheuristics need more detailed study in this area. We summarize the design and use of a framework to optimize learning of deep neural networks in TensorFlow using metaheuristics. The framework is implemented in Python, allows training of the networks on CPU or GPU depending on the TensorFlow configuration, and allows easy integration of diverse classification and regression problems solved with different neural network architectures (conventional, convolutional and recurrent) and new metaheuristics. The framework initially includes Particle Swarm Optimization, Global-best Harmony Search, and Differential Evolution. It further enables the conversion of metaheuristics into memetic algorithms, including exploitation processes that use the algorithms available in TensorFlow: RMSProp, Adam, Adadelta, Momentum, and Adagrad.