Sequence-discriminative training of deep neural networks
Sequence-discriminative training of deep neural networks (DNNs) is investigated on a standard 300 hour American English conversational telephone speech task. Different sequence-discriminative criteria are compared: maximum mutual information (MMI), minimum phone error (MPE), state-level minimum Bayes risk (sMBR), and boosted MMI (BMMI). Two different heuristics are investigated to improve the performance of the DNNs trained using sequence-based criteria: lattices are regenerated after the first iteration of training; and, for MMI and BMMI, the frames where the numerator and denominator hypotheses are disjoint are removed from the gradient computation. Starting from a competitive DNN baseline trained using cross-entropy, different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative, on average. Little difference is noticed between the different sequence-based criteria that are investigated. The experiments are done using the open-source Kaldi toolkit, which makes it possible for the wider community to reproduce these results.
Index Terms: speech recognition, deep learning, sequence-criterion training, neural networks, reproducible research
- Research Article
172
- 10.1109/tai.2021.3067574
- May 4, 2021
- IEEE Transactions on Artificial Intelligence
A variety of methods have been applied to the architectural configuration and learning or training of artificial deep neural networks (DNNs). These methods play a crucial role in the success or failure of the DNN for most problems and applications. Evolutionary algorithms (EAs) are gaining momentum as a computationally feasible method for the automated optimization of DNNs. Neuroevolution is the term that describes these processes of automated configuration and training of DNNs using EAs. While many works exist in the literature, no comprehensive surveys currently focus exclusively on the strengths and limitations of using neuroevolution approaches in DNNs. The absence of such surveys can lead to a disjointed and fragmented field, potentially preventing DNN researchers from adopting neuroevolutionary methods in their own research and resulting in lost opportunities for wider application within real-world deep learning problems. This article presents a comprehensive survey, discussion, and evaluation of the state of the art in using EAs for architectural configuration and training of DNNs. It highlights the most pertinent current issues and challenges in neuroevolution and identifies multiple promising future research directions.
Impact Statement: The concept of deep learning originated from the study of artificial neural networks (ANNs). ANNs have achieved extraordinary results in a variety of diverse application areas. Numerous methods have been applied to the architectural configuration and learning or training of artificial DNNs, and these methods play a crucial role in the success or failure of the DNN for most problems and applications. Recently, EAs have been gaining momentum as a computationally feasible method (called neuroevolution) for the automated configuration and learning or training of DNNs. This article reviews over 170 recent scientific papers describing how major EA paradigms are being applied by researchers to the configuration and optimization of multiple DNNs. By articulating a clear understanding of the context, state of the art, and feasibility of neuroevolution, this article will benefit researchers in AI, EAs, and DNNs. The impact of this article comes from contributing toward enhancing research capacity, knowledge, and skills for researchers currently working in neuroevolution and actively engaging those considering becoming involved in this area.
- Conference Article
3
- 10.1145/3613424.3623779
- Oct 28, 2023
Neural network training is inherently sequential: the layers finish forward propagation in succession, followed by the calculation and back-propagation of gradients (based on a loss function) starting from the last layer. These sequential computations significantly slow down neural network training, especially for deeper networks. Prediction has been successfully used in many areas of computer architecture to speed up sequential processing. Therefore, we propose ADA-GP, which uses gradient prediction adaptively to speed up deep neural network (DNN) training while maintaining accuracy. ADA-GP works by incorporating a small neural network to predict gradients for different layers of a DNN model. ADA-GP uses a novel tensor reorganization method to make it feasible to predict a large number of gradients. ADA-GP alternates between DNN training using backpropagated gradients and DNN training using predicted gradients. ADA-GP adaptively adjusts when and for how long gradient prediction is used to strike a balance between accuracy and performance. Last but not least, we provide a detailed hardware extension in a typical DNN accelerator to realize the speedup potential of gradient prediction. Our extensive experiments with fifteen DNN models show that ADA-GP can achieve an average speedup of 1.47× with similar or even higher accuracy than the baseline models. Moreover, it consumes, on average, 34% less energy due to reduced off-chip memory accesses compared to the baseline accelerator.
- Book Chapter
1
- 10.1007/978-981-10-8438-6_24
- Jan 1, 2018
This paper introduces work on automatic speech recognition (ASR) of Myanmar spontaneous speech. The recognizer is based on the Gaussian Mixture and Hidden Markov Model (GMM-HMM). A baseline ASR system is developed with a 20.5 h spontaneous speech corpus and refined with several speaker adaptation methods. Five kinds of adapted acoustic models are explored: Maximum A Posteriori (MAP), Maximum Mutual Information (MMI), Minimum Phone Error (MPE), Maximum Mutual Information including feature space and model space (fMMI), and Subspace GMM (SGMM). We evaluate these adapted models using a spontaneous evaluation set that consists of 100 utterances from 61 speakers, totaling 23 min and 19 s. Experiments on this speech corpus show significant improvement from speaker adaptive training models, and the SGMM-based acoustic model performs better than the other adapted models. It reduces WER by 3.16% compared with the baseline GMM model. Deep Neural Network (DNN) training on the same corpus is also investigated and evaluated with the same evaluation set. With DNN training, the result reaches 31.5% WER.
- Conference Article
8
- 10.1109/icpads47876.2019.00068
- Dec 1, 2019
Deep neural network (DNN) training is generally performed by cloud computing platforms. However, cloud-based training has several problems such as network bottleneck, server management cost, and privacy. To overcome these problems, one of the most promising solutions is distributed DNN model training which trains the model with not only high-performance servers but also low-end power-efficient mobile edge or user devices. However, due to the lack of a framework which can provide an optimal cluster configuration (i.e., determining which computing devices participate in DNN training tasks), it is difficult to perform efficient DNN model training considering DNN service providers' preferences such as training time or energy efficiency. In this paper, we introduce a novel framework for distributed DNN training that determines the best training cluster configuration with available heterogeneous computing resources. Our proposed framework utilizes pre-training with a small number of training steps and estimates training time, power, energy, and energy-delay product (EDP) for each possible training cluster configuration. Based on the estimated metrics, our framework performs DNN training for the remaining steps with the chosen best cluster configurations depending on DNN service providers' preferences. Our framework is implemented in TensorFlow and evaluated with three heterogeneous computing platforms and five widely used DNN models. According to our experimental results, in 76.67% of the cases, our framework chooses the best cluster configuration depending on DNN service providers' preferences with only a small training time overhead.
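The selection step described in this abstract (estimate time, power, energy, and EDP per candidate cluster, then pick by preference) can be illustrated with a minimal Python sketch. It is not the authors' framework: the class name `ClusterEstimate`, the function `choose_configuration`, the configuration names, and all numbers are hypothetical. It only assumes the standard definitions energy = power × time and EDP = energy × time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterEstimate:
    """Estimated cost of one candidate cluster (hypothetical values)."""
    name: str
    time_s: float    # estimated training time in seconds
    power_w: float   # estimated average power draw in watts

    @property
    def energy_j(self) -> float:
        # Energy is average power multiplied by elapsed time.
        return self.power_w * self.time_s

    @property
    def edp(self) -> float:
        # Energy-delay product: energy multiplied by delay.
        return self.energy_j * self.time_s


def choose_configuration(estimates, preference):
    """Return the estimate that minimizes the preferred metric."""
    metrics = {
        "time": lambda e: e.time_s,
        "power": lambda e: e.power_w,
        "energy": lambda e: e.energy_j,
        "edp": lambda e: e.edp,
    }
    return min(estimates, key=metrics[preference])


# Hypothetical estimates, as if extrapolated from a short pre-training run.
candidates = [
    ClusterEstimate("server-only", time_s=100.0, power_w=300.0),
    ClusterEstimate("server+edge", time_s=80.0, power_w=420.0),
    ClusterEstimate("edge-only", time_s=400.0, power_w=30.0),
]

for pref in ("time", "power", "energy", "edp"):
    print(pref, "->", choose_configuration(candidates, pref).name)
```

The sketch shows why the preference matters: the fastest configuration is not necessarily the one with the lowest energy, so the same set of estimates can yield different choices.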
- Conference Article
3
- 10.1109/hipc56025.2022.00017
- Dec 1, 2022
Deep Learning (DL) has become a prominent machine learning technique due to the availability of efficient computational resources in the form of Graphics Processing Units (GPUs), large-scale datasets, and a variety of models. The newer generation of GPUs is being designed with special emphasis on optimizing performance for DL applications. Also, the availability of easy-to-use DL frameworks like PyTorch and TensorFlow has enhanced the productivity of domain experts working on custom DL applications from diverse domains. However, existing Deep Neural Network (DNN) training approaches may not fully utilize newly emerging powerful GPUs like the NVIDIA A100; this is the primary issue that we address in this paper. Our motivating analyses show that the GPU utilization on NVIDIA A100 can be as low as 43% using traditional DNN training approaches for small-to-medium DL models and input data sizes. This paper proposes AccDP, a data-parallel distributed DNN training approach, to accelerate GPU-based DL applications. AccDP exploits the Message Passing Interface (MPI) communication library coupled with NVIDIA's Multi-Process Service (MPS) to increase the amount of work assigned to parallel GPUs, resulting in higher utilization of compute resources. We evaluate our proposed design on different small-to-medium DL models and input sizes on state-of-the-art HPC clusters. By injecting more parallelism into DNN training using our approach, the evaluation shows up to 58% improvement in training performance on a single GPU and up to 62% on 16 GPUs compared to regular DNN training. Furthermore, we conduct an in-depth characterization to determine the impact of several DNN training factors and best practices (including the batch size and the number of data loading workers) to optimally utilize GPU devices. To the best of our knowledge, this is the first work that explores the use of MPS and MPI to maximize the utilization of GPUs in distributed DNN training.
- Research Article
6
- 10.1109/tifs.2023.3273169
- Jan 1, 2023
- IEEE Transactions on Information Forensics and Security
Recently, deep-learning (DL) techniques have been widely adopted in side-channel power analysis. A DL-assisted side-channel analysis (SCA) generally consists of two phases: a deep neural network (DNN) training phase and a follow-on attack phase using the trained DNN. However, the two phases are currently not well aligned, as there is no conclusion on which training metric results in the most effective attack in the second phase. When traditional loss functions such as negative log-likelihood (NLL) are used in training a DNN, the trained model does not yield an optimal follow-on attack. Recently, some information-theoretical SCA leakage metrics have been proposed, either as the validation metric to stop DNN training with traditional loss functions, or as both the validation metric and the training loss function. None of those proposed metrics, however, directly measures the SCA effectiveness. We propose to conduct DNN training directly with a common SCA effectiveness metric, Guessing Entropy (GE). We overcome the prior practical difficulty of using GE in DNN training by utilizing the GEEA estimation algorithm introduced in CHES 2020. We show that using GEEA as either the validation metric or the loss function produces DNN models that lead to much more effective follow-on attacks. Our work consolidates the DL-assisted SCA framework with a consistent metric, which shows great potential to be adopted as the universal SCA-oriented DNN training framework.
- Conference Article
8
- 10.1109/asru.2013.6707750
- Dec 1, 2013
Minimum phone error (MPE) training of deep neural networks (DNNs) is an effective technique for reducing the word error rate of automatic speech recognition tasks. This training is often carried out using a Hessian-free (HF) quasi-Newton approach, although other methods such as stochastic gradient descent have also been applied successfully. In this paper we present a novel stochastic approach to HF sequence training inspired by the recently proposed stochastic average gradient (SAG) method. SAG reuses gradient information from past updates, and consequently simulates the presence of more training data than is really observed for each model update. We extend SAG by dynamically weighting the contribution of previous gradients, and by combining it with a stochastic HF optimization. We term the resulting procedure DSAG-HF. Experimental results for training DNNs on 1500h of audio data show that, compared to baseline HF training, DSAG-HF leads to better held-out MPE loss after each model parameter update, and converges to an overall better loss value. Furthermore, since each update in DSAG-HF takes place over a smaller amount of data, this procedure converges in about half the time of baseline HF sequence training.
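To make the "reuses gradient information from past updates" idea concrete, the following is a minimal Python sketch of plain SAG on a toy one-dimensional least-squares problem. It is not DSAG-HF: there is no dynamic weighting of old gradients and no Hessian-free step. The function name `sag_least_squares`, the toy data, and the step size are illustrative assumptions.

```python
import random


def sag_least_squares(xs, ys, step, iterations, seed=0):
    """Fit y ~ w * x with stochastic average gradient (SAG).

    SAG keeps the most recent gradient seen for every example and
    steps along the average of all stored gradients, so each update
    reuses information computed in earlier iterations.
    """
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0
    stored = [0.0] * n   # last gradient computed for each example
    grad_sum = 0.0       # running sum of the stored gradients
    for _ in range(iterations):
        i = rng.randrange(n)
        # Gradient of 0.5 * (w * x_i - y_i)**2 with respect to w.
        g = (w * xs[i] - ys[i]) * xs[i]
        # Swap the stale gradient of example i for the fresh one.
        grad_sum += g - stored[i]
        stored[i] = g
        w -= step * grad_sum / n
    return w


# Toy data generated from y = 2 * x (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]

# Conservative step 1 / (16 * L), with L = max(x_i ** 2) = 16.
w_hat = sag_least_squares(xs, ys, step=1.0 / 256.0, iterations=5000)
print(w_hat)
```

Only one example's gradient is recomputed per iteration, yet the update direction averages over all examples, which is the sense in which SAG simulates seeing more data per update.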
- Conference Article
63
- 10.1109/hpca53966.2022.00067
- Apr 1, 2022
Block Floating Point (BFP) can efficiently support quantization for Deep Neural Network (DNN) training by providing a wide dynamic range via a shared exponent across a group of values. In this paper, we propose a Fast First, Accurate Second Training (FAST) system for DNNs, where the weights, activations, and gradients are represented in BFP. FAST supports matrix multiplication with variable-precision BFP input operands, enabling incremental increases in DNN precision throughout training. By increasing the BFP precision across both training iterations and DNN layers, FAST can greatly shorten the training time while reducing overall hardware resource usage. Our FAST Multiplier-Accumulator (fMAC) supports dot product computations under multiple BFP precisions. We validate our FAST system on multiple DNNs with different datasets, demonstrating a 2-6× speedup in training on a single-chip platform over prior work based on mixed-precision or block floating point number systems while achieving similar performance in validation accuracy.
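The shared-exponent idea behind BFP can be sketched in a few lines of Python. This is a generic illustration, not the FAST number format: the group size, round-to-nearest rounding, the clamping rule, and the function names `bfp_quantize` and `bfp_dequantize` are all assumptions made for the example.

```python
import math


def bfp_quantize(block, mantissa_bits):
    """Quantize a block of floats to one shared exponent plus
    signed integer mantissas of `mantissa_bits` bits each."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so every
    # value in the block has magnitude below 2**e.
    shared_exp = math.frexp(max_abs)[1]
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round to the nearest step and clamp to the mantissa range.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in block]
    return shared_exp, mantissas


def bfp_dequantize(shared_exp, mantissas, mantissa_bits):
    """Reconstruct approximate floats from the BFP representation."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


block = [0.75, -0.1, 0.02, 0.5]
for bits in (4, 8):
    exp, mants = bfp_quantize(block, bits)
    restored = bfp_dequantize(exp, mants, bits)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(bits, exp, mants, worst)
```

Because the exponent is stored once per block, small values next to a large one lose precision at low mantissa widths; widening the mantissa later in training, as the abstract describes, shrinks that error.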
- Research Article
- 10.1016/j.csl.2015.08.001
- Aug 19, 2015
- Computer Speech & Language
Differenced maximum mutual information criterion for robust unsupervised acoustic model adaptation
- Research Article
179
- 10.1109/tnnls.2018.2876179
- Nov 9, 2018
- IEEE Transactions on Neural Networks and Learning Systems
Batch normalization (BN) has recently become a standard component for accelerating and improving the training of deep neural networks (DNNs). However, BN brings in additional calculations, consumes more memory, and significantly slows down the training iteration. Furthermore, the nonlinear square and sqrt operations in the normalization process impede low bit-width quantization techniques, which draw much attention in the deep learning hardware community. In this paper, we propose an L1-norm BN (L1BN) with only linear operations in both forward and backward propagations during training. L1BN is approximately equivalent to the conventional L2-norm BN (L2BN) by multiplying a scaling factor that equals (π/2)^(1/2). Experiments on various convolutional neural networks and generative adversarial networks reveal that L1BN can maintain the same performance and convergence rate as L2BN but with higher computational efficiency. In real application-specified integrated circuit synthesis with reduced resources, L1BN achieves 25% speedup and 37% energy saving compared to the original L2BN. Our hardware-friendly normalization method not only surpasses L2BN in speed but also simplifies the design of deep learning accelerators. Last but not least, L1BN promises a fully quantized training of DNNs, which empowers future artificial intelligence applications on mobile devices with transfer and continual learning capability.
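The scaling factor mentioned in this abstract can be checked numerically. The sketch below is a simplified Python illustration, not the paper's implementation: it normalizes a single scalar feature, omits the learnable scale and shift, assumes Gaussian-distributed activations, and places the `eps` term as a guess. For a Gaussian, the mean absolute deviation equals σ·(2/π)^(1/2), so the standard deviation is about (π/2)^(1/2) times the L1 deviation.

```python
import math
import random


def l1_deviation(xs):
    """Mean absolute deviation from the batch mean (no square or sqrt)."""
    mean = sum(xs) / len(xs)
    return sum(abs(x - mean) for x in xs) / len(xs)


def l2_deviation(xs):
    """Standard deviation of the batch, as used by conventional BN."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))


def l1_batch_norm(xs, eps=1e-5):
    """Normalize a batch by its L1 deviation (eps placement assumed)."""
    mean = sum(xs) / len(xs)
    dev = l1_deviation(xs)
    return [(x - mean) / (dev + eps) for x in xs]


random.seed(0)
batch = [random.gauss(2.0, 3.0) for _ in range(100_000)]

ratio = l2_deviation(batch) / l1_deviation(batch)
print(ratio, math.sqrt(math.pi / 2))
```

The L1 path uses only additions, absolute values, and one division per element, which is the property the abstract credits for easier low bit-width quantization.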
- Conference Article
4
- 10.1145/3629526.3645035
- May 7, 2024
First-come first-serve scheduling can leave a substantial fraction (up to 10%) of supercomputer nodes transiently idle. Recognizing that such unfilled nodes are well-suited for deep neural network (DNN) training, due to the flexible nature of DNN training tasks, Liu et al. proposed that re-scaling DNN training tasks to fit gaps in schedules be formulated as a mixed-integer linear programming (MILP) problem, and demonstrated via simulation the potential benefits of the approach. Here, we introduce MalleTrain, a system that provides the first practical implementation of this approach and that furthermore generalizes it by allowing it to be used even for DNN training applications for which model information is unknown before runtime. Key to this latter innovation is the use of a lightweight online job profiling advisor (JPA) to collect critical scalability information for DNN jobs, information that it then employs to optimize resource allocations dynamically, in real time. We describe the MalleTrain architecture and present the results of a detailed experimental evaluation on a supercomputer GPU cluster and several representative DNN training workloads, including neural architecture search and hyperparameter optimization. Our results not only confirm the practical feasibility of leveraging idle supercomputer nodes for DNN training but also improve significantly on prior results, improving training throughput by up to 22.3% without requiring users to provide job scalability information.
- Conference Article
5
- 10.1109/icassp.2012.6288981
- Mar 1, 2012
Recently, feature compensation techniques that train feature transforms using a discriminative criterion have attracted much interest in the speech recognition community. Typically, the acoustic feature space is modeled by a Gaussian mixture model (GMM), and a feature transform is assigned to each Gaussian of the GMM. Feature compensation is then performed by transforming features using the transformation associated with each Gaussian, then summing up the transformed features weighted by the posterior probability of each Gaussian. Several discriminative criteria have been investigated for estimating the feature transformation parameters, including maximum mutual information (MMI) and minimum phone error (MPE). Recently, the differenced MMI (dMMI) criterion, which generalizes MMI and MPE, has been shown to provide competitive performance for acoustic model training. In this paper, we investigate the use of the dMMI criterion for discriminative feature transforms and demonstrate in a noisy speech recognition experiment that dMMI achieves recognition performance superior to that of MMI or MPE.
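The posterior-weighted compensation described above, y = Σ_k p(k|x)·T_k(x), can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the paper's system: the GMM has diagonal covariances, each transform T_k is a per-dimension scale and bias, and every parameter value and function name is hypothetical. How the transforms are trained (MMI, MPE, or dMMI) is outside the sketch.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Posterior probability p(k | x) of each Gaussian component."""
    log_joint = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_joint)  # subtract the max for numerical stability
    unnorm = [math.exp(lj - peak) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def compensate(x, weights, means, variances, scales, biases):
    """Sum the per-Gaussian transformed features, weighted by p(k | x)."""
    post = posteriors(x, weights, means, variances)
    out = [0.0] * len(x)
    for p, a, b in zip(post, scales, biases):
        for d, xd in enumerate(x):
            out[d] += p * (a[d] * xd + b[d])
    return out


# Hypothetical two-component, one-dimensional model.
weights = [0.5, 0.5]
means = [[0.0], [10.0]]
variances = [[1.0], [1.0]]
scales = [[1.0], [1.0]]
biases = [[1.0], [-1.0]]

print(compensate([0.0], weights, means, variances, scales, biases))
print(compensate([5.0], weights, means, variances, scales, biases))
```

A feature close to one Gaussian is transformed almost entirely by that Gaussian's transform, while a feature midway between components receives a blend of both.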
- Conference Article
11
- 10.1109/iws52775.2021.9499638
- May 23, 2021
- 2021 IEEE MTT-S International Wireless Symposium (IWS)
This paper discusses the training of deep neural networks (DNNs) for electromagnetic (EM) problems. The main concerns include how to modify EM problems to take advantage of deep learning techniques and how to tailor conventional deep learning concepts with electromagnetic domain knowledge, which has been overlooked by most existing DNN-based EM research. A 1×8 patch antenna array has been adopted as the test vehicle for investigation, with the aim of using deep learning for radiation pattern synthesis. It is first analyzed via electromagnetic simulation to collect sufficient training data sets containing different combinations of excitation signals and corresponding radiation patterns. These data are then pre-processed and passed to DNNs for training to imitate the mapping between excitation signals and radiation patterns. With careful feature selection and DNN architecture optimizations, two DNN models are eventually obtained. One of them aims at forward radiation synthesis under any given excitation condition, and the other seeks out the backward excitation signals needed for a given radiation pattern; both achieve an accuracy over 80%. This paper may provide insight and reference for applying deep learning to electromagnetic problems in terms of feature selection and architecture modification.
- Research Article
1
- 10.14429/dsj.74.19475
- Nov 25, 2024
- Defence Science Journal
Deep learning techniques have shown remarkable success in radar identification. However, deep neural network training can be time and resource intensive. Batch normalization is a popular approach for accelerating deep feed-forward neural network training. It accelerates training by minimizing the internal covariate shift and stabilizes the training process by normalizing the intermediate activations within each mini-batch. In this research, the convergence behavior of networks with and without batch normalization is compared. Batch normalization standardizes the input to a layer for each mini-batch, and is applied either to the activations of a prior layer or to the inputs directly. Our experiments indicate that batch normalization is effective in improving a variety of neural network properties. The results show that batch-normalized models have higher test and validation accuracies across all datasets, which we attribute to their regularizing impact and steadier gradient propagation. This research also examines the impact of several parameters, such as batch size, momentum, and the beta and gamma parameters, on the effectiveness of DNNs with batch normalization. The radar dataset used for training is the fused emitter set obtained after feature-level fusion of the tracks intercepted by ESM (Electronic Support) and ELINT (Electronic Intelligence) systems.
- Research Article
52
- 10.1109/tvlsi.2021.3063543
- Mar 31, 2021
- IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Deep neural networks (DNNs) have gained tremendous popularity in recent years due to their ability to achieve superhuman accuracy in a wide variety of machine learning tasks. However, the compute and memory requirements of DNNs have grown rapidly, creating a need for energy-efficient hardware. Resistive crossbars have attracted significant interest in the design of the next generation of DNN accelerators due to their ability to natively execute massively parallel vector-matrix multiplications within dense memory arrays. However, crossbar-based computations face a major challenge due to device and circuit-level nonidealities, which manifest as errors in the vector-matrix multiplications and eventually degrade DNN accuracy. To address this challenge, there is a need for tools that can model the functional impact of nonidealities on DNN training and inference. Existing efforts toward this goal are either limited to inference or are too slow to be used for large-scale DNN training. We propose TxSim, a fast and customizable modeling framework to functionally evaluate DNN training on crossbar-based hardware considering the impact of nonidealities. The key features of TxSim that differentiate it from prior efforts are: 1) it comprehensively models nonidealities during all training operations (forward propagation, backward propagation, and weight update) and 2) it achieves computational efficiency by mapping crossbar evaluations to well-optimized Basic Linear Algebra Subprograms (BLAS) routines and incorporates speedup techniques to further reduce simulation time with minimal impact on accuracy. TxSim achieves 6× to 108× improvement in simulation speed over prior works, and thereby makes it feasible to evaluate the training of large-scale DNNs on crossbars.
Our experiments using TxSim reveal that the accuracy degradation in DNN training due to nonidealities can be substantial (3%-36.4%) for large-scale DNNs and data sets, underscoring the need for further research in mitigation techniques. We also analyze the impact of various device and circuit-level parameters and the associated nonidealities to provide key insights that can guide the design of crossbar-based DNN training accelerators.