Abstract

Automatic speech recognition is becoming more ubiquitous as recognition performance improves, capable devices increase in number, and new areas of application open up. Neural network acoustic models that utilize speaker-adaptive features, deep and wide layers, or computationally expensive architectures often obtain the best recognition accuracy, but may not fit the computational, storage, or latency budget of the deployed system. We explore a straightforward training approach that takes advantage of highly accurate but expensive-to-evaluate neural network acoustic models by using their outputs to relabel training examples for easier-to-deploy models. Experiments on a large vocabulary continuous speech recognition task give relative reductions in word error rate of up to 16.7% over training with the hard aligned labels, by effectively making use of large amounts of additional untranscribed data. Somewhat remarkably, the approach works well even when only two output classes are present. Experiments on a voice activity detection task give relative reductions in equal error rate of up to 11.5% when using a convolutional neural network to relabel training examples for a feedforward neural network. An investigation into the hidden layer weight matrices finds that soft target-trained networks tend to produce weight matrices with fuller rank and slower decay in singular values than their hard target-trained counterparts, suggesting that more of the network's capacity is used to learn additional information, giving better accuracy.
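The relabeling approach described above amounts to training the student on the teacher's output posteriors (soft targets) rather than one-hot aligned labels. A minimal sketch of the soft-target loss, assuming a simple NumPy setting (the function names here are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over class logits."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_target_loss(student_logits, teacher_logits):
    """Cross-entropy of the student's predictions against the teacher's
    posteriors, averaged over frames. With one-hot teacher posteriors this
    reduces to ordinary hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1))
```

Because the loss needs only the teacher's outputs, not transcriptions, it can be computed on untranscribed audio as well, which is how the additional data mentioned in the abstract is exploited.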

Highlights

  • Neural network (NN) acoustic models have become an essential component in many state-of-the-art automatic speech recognition (ASR) systems, with the most accurate NN acoustic models being considerably complex in size and architecture

  • Soft target training uses the feature-space maximum likelihood linear regression (fMLLR) deep neural network (DNN) baseline as a teacher DNN to provide labels for a student DNN to learn from

  • All student DNNs start from random initialization because restricted Boltzmann machine (RBM) generative pretraining was not beneficial


Summary

Introduction

Over the last several years, neural network (NN) acoustic models have become an essential component in many state-of-the-art automatic speech recognition (ASR) systems, with the most accurate NN acoustic models being considerably complex in size and architecture. Techniques developed for Gaussian mixture model (GMM)-based large vocabulary continuous speech recognition (LVCSR), such as discriminative training, have also been carried over to NN acoustic models. Improving DNN noise robustness was explored in [10] by augmenting DNN input vectors with an estimate of the noise present in the signal. These approaches require the speaker transforms or augmented features to be present at both training and test time. Statistics used for cepstral mean and variance normalization, as well as fMLLR transform estimation, are computed in batch over a whole conversation side. This excludes many applications of interest where minimum latency is important and decoding cannot wait until all speech has been received.
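The batch statistics at issue can be made concrete with a short sketch of cepstral mean and variance normalization (CMVN) computed over a whole conversation side. The function below is an illustrative assumption, not code from the paper; the point is that it needs every frame before it can emit any normalized output, which is exactly the latency problem the passage describes:

```python
import numpy as np

def cmvn_batch(features):
    """Batch CMVN over a whole utterance or conversation side.

    features: array of shape (num_frames, num_coeffs).
    All frames must be available up front, so normalized features
    cannot be produced until the audio has been fully received.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8  # guard against constant coefficients
    return (features - mean) / std
```

Low-latency systems instead maintain running estimates of the mean and variance, trading some normalization accuracy for the ability to start decoding immediately.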

