Abstract

In this paper, we propose the factorized hidden layer (FHL) approach to adapt the deep neural network (DNN) acoustic models for automatic speech recognition (ASR). FHL aims at modeling speaker dependent (SD) hidden layers by representing an SD affine transformation as a linear combination of bases. The combination weights are low-dimensional speaker parameters that can be initialized using speaker representations like i-vectors and then reliably refined in an unsupervised adaptation fashion. Therefore, our method provides an efficient way to perform both adaptive training and (test-time) adaptation. Experimental results have shown that the FHL adaptation improves the ASR performance significantly, compared to the standard DNN models, as well as other state-of-the-art DNN adaptation approaches, such as training with the speaker-normalized CMLLR features, speaker-aware training using i-vector and learning hidden unit contributions (LHUC). For Aurora 4, FHL achieves 3.8% and 2.3% absolute improvements over the standard DNNs trained on the LDA + STC and CMLLR features, respectively. It also achieves 1.7% absolute performance improvement over a system that combines the i-vector adaptive training with LHUC adaptation. For the AMI dataset, FHL achieved 1.4% and 1.9% absolute improvements over the sequence-trained CMLLR baseline systems, for the IHM and SDM tasks, respectively.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call