Abstract

We investigate the concept of speaker adaptive training (SAT) in the context of deep neural network (DNN) acoustic models. Previous studies have shown success of performing speaker adaptation for DNNs in speech recognition. In this paper, we apply SAT to DNNs by learning two types of feature mapping neural networks. Given an initial DNN model, these networks take speaker i-vectors as additional information and project DNN inputs into a speaker-normalized space. The final SAT model is obtained by updating the canonical DNN in the normalized feature space. Experiments on a Switchboard 110-hour setup show that compared with the baseline DNN, the SAT-DNN model brings 7.5% and 6.0% relative improvement when DNN inputs are speaker-independent and speaker-adapted features respectively. Further evaluations on the more challenging BABEL datasets reveal significant word error rate reduction achieved by SAT-DNN.

Highlights

  • Deep neural networks (DNNs) have been applied widely to automatic speech recognition (ASR), showing superior performance over traditional Gaussian mixture model (GMM)-HMM systems [1, 2]

  • Training of SAT-DNN models starts from an initial DNN which has been trained over all the speakers

  • The first variation lies in the training of i-vector extractors [10], and we study the impact of i-vector training data on the performance of SAT-DNN

Introduction

DNNs have been applied widely to automatic speech recognition (ASR), showing superior performance over traditional GMM-HMM models [1, 2]. However, DNN acoustic models remain sensitive to speaker variability, which has motivated a range of speaker adaptation methods. Examples of these solutions include augmenting the speaker-independent DNN with additional layers [3, 4], adapting the activation functions [6], and using a speaker-adapted feature space [2, 7, 8]. To further resolve this issue, our recent study [9] ported the concept of SAT to DNNs. Training of SAT-DNN models starts from an initial DNN which has been trained over all the speakers. The goal of this paper is to analyze appropriate settings for the SAT-DNN architecture and to explore possible improvements to it.
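The SAT-DNN idea above can be sketched as a small numpy example. This is a minimal illustration, not the paper's implementation: the dimensions, the additive-shift form of the feature mapping, and the single hidden layer of the canonical DNN are all assumptions made for brevity, whereas the paper learns two types of feature-mapping neural networks [9].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 40-dim acoustic
# features, 100-dim speaker i-vectors, 512 hidden units.
FEAT_DIM, IVEC_DIM, HIDDEN = 40, 100, 512

def relu(x):
    return np.maximum(x, 0.0)

def init_layer(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) * 0.01, np.zeros(n_out)

# Feature-mapping (adaptation) network: predicts a speaker-dependent
# transform of the inputs from the speaker i-vector. Here it is a
# simple additive shift, one plausible instantiation.
W_a, b_a = init_layer(IVEC_DIM, FEAT_DIM)

# Canonical DNN (one hidden layer shown); in SAT it is re-estimated
# on the speaker-normalized features.
W_h, b_h = init_layer(FEAT_DIM, HIDDEN)

def normalize_features(frames, ivector):
    """Project DNN inputs into a speaker-normalized space using a
    shift predicted from the speaker i-vector."""
    shift = relu(ivector @ W_a + b_a)   # shape (FEAT_DIM,)
    return frames + shift               # broadcast over all frames

def canonical_forward(norm_frames):
    return relu(norm_frames @ W_h + b_h)

# One speaker: a few acoustic frames plus a single speaker i-vector.
frames = rng.standard_normal((5, FEAT_DIM))
ivec = rng.standard_normal(IVEC_DIM)

hidden = canonical_forward(normalize_features(frames, ivec))
print(hidden.shape)  # (5, 512)
```

In training, the i-vector is fixed per speaker, so gradients flow into the feature-mapping network across that speaker's data; the canonical DNN is then updated in the normalized feature space, as the abstract describes.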
