Factored front-end CMLLR for joint speaker and environment normalization under DNN-HMM

Shakti P Rath

doi:10.1007/s10772-017-9453-x

Abstract

In this paper, factored front-end CMLLR (F-FE-CMLLR) is investigated for joint speaker and environment normalization in the framework of DNN-HMM acoustic modeling. F-FE-CMLLR is a feature-space transform formed by composing a front-end CMLLR transform for environment normalization with a global CMLLR transform for speaker normalization. The transform is applied to the input noisy, speaker-independent features, and the resulting canonical features are passed to the DNN-HMM for training and decoding. Two estimation procedures for F-FE-CMLLR are investigated: sequential training and iterative training. A key attribute of F-FE-CMLLR is that, in the iterative training paradigm, it is likely to foster acoustic factorization, which enables more effective transfer of the environment transform from one condition to another. Moreover, because it is a feature-space transform, it is straightforward to use with DNN-HMM acoustic models. The performance of the proposed scheme is evaluated on the Aurora-4 noisy speech recognition task, in which the dominant acoustic factors are microphone variability, additive noise at varying SNRs, and speaker variability. F-FE-CMLLR is shown to yield a large improvement over the baseline features, which are processed with CMLLR for speaker adaptive training (SAT). The improvement is observed in all acoustic conditions present in the test sets, and iterative training of F-FE-CMLLR outperforms sequential training under all test conditions. Specifically, when all three types of acoustic variability co-exist, sequential training yields a 13% relative improvement over SAT features. Iterative training provides an additional improvement on top of this, amounting to an 18% relative gain overall. It is argued that the improvement over sequential training arises because acoustic factorization holds in an implicit sense.
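The idea of composing two feature-space transforms can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not taken from the paper: each transform is modeled as a single affine map (A x + b), the environment transform is applied before the speaker transform, and all names and numeric values (`A_env`, `b_env`, `A_spk`, `b_spk`, `x`) are hypothetical toy values. The sketch only shows that applying the two maps in sequence is equivalent to applying one fused affine map; it does not reproduce the estimation procedures described above.

```python
def affine(A, b, x):
    """Return A x + b, with A given as a list of rows and b, x as lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]


def compose(A2, b2, A1, b1):
    """Return (A, b) such that A x + b == A2 (A1 x + b1) + b2."""
    n = len(A1)
    A = [[sum(A2[i][k] * A1[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    b = [sum(A2[i][k] * b1[k] for k in range(n)) + b2[i] for i in range(n)]
    return A, b


# Hypothetical toy transforms (assumed values, 2-dimensional features).
A_env, b_env = [[2.0, 0.0], [0.0, 0.5]], [1.0, -1.0]   # environment transform
A_spk, b_spk = [[1.0, 1.0], [0.0, 1.0]], [0.5, 0.0]    # speaker transform
x = [1.0, 2.0]                                          # one input feature vector

# Apply environment normalization first, then speaker normalization (assumed order).
stepwise = affine(A_spk, b_spk, affine(A_env, b_env, x))

# Fuse both transforms into one affine map and apply it once.
A_joint, b_joint = compose(A_spk, b_spk, A_env, b_env)
fused = affine(A_joint, b_joint, x)
```

Because both results coincide, the composed transform can be applied to the features in a single pass before they are passed to the acoustic model, which is what makes a feature-space formulation convenient in this setting.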
