Although context-dependent DNN-HMM systems have achieved significant improvements over GMM-HMM systems, their performance degrades substantially when the acoustic condition of the test data mismatches that of the training data. Hence, adaptation and adaptive training of DNNs are of great research interest. Previous DNN adaptation work has mainly focused on adapting the parameters of a single DNN, either by applying linear transformations to the features or to hidden-layer outputs, or by introducing a vector representation of non-speech variability into the input. In these methods, a large number of parameters must be estimated during adaptation. In this paper, the cluster adaptive training (CAT) framework is employed for DNN adaptive training: multiple weight matrices are constructed to form the basis of a canonical parametric space. During adaptation to a new acoustic condition, an interpolation vector is estimated to combine the weight basis into a single adapted weight matrix. Since only the interpolation vector needs to be estimated during adaptation, the number of updated parameters is much smaller than in existing DNN adaptation methods. The CAT-DNN approach was evaluated on an English Switchboard task in unsupervised adaptation mode. It achieved significant WER reductions over the unadapted DNN-HMM baseline, 7.6% to 10.6% relative, with only 10 adapted parameters.
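To make the core idea concrete, below is a minimal NumPy sketch of the interpolated weight matrix described above. The shapes, variable names, and the toy least-squares objective are illustrative assumptions, not the paper's exact recipe; in practice the interpolation vector would be estimated by backpropagating a frame-level training objective through the frozen network.

```python
import numpy as np

# Sketch of the CAT-DNN adapted layer: W(lam) = sum_p lam[p] * W_p.
# All names and the least-squares objective below are assumptions
# made for illustration, not the paper's exact setup.

rng = np.random.default_rng(0)
P, d_in, d_out = 10, 64, 32                    # P basis matrices; matches the
                                               # 10-parameter setup in the abstract
bases = rng.standard_normal((P, d_out, d_in))  # canonical weight basis W_1..W_P

def adapted_weight(lam):
    """Combine the weight basis into one adapted matrix: sum_p lam[p] * W_p."""
    return np.tensordot(lam, bases, axes=1)    # -> (d_out, d_in)

# Adaptation: with the basis (and the rest of the network) frozen, estimate
# only the P-dimensional interpolation vector for the new acoustic condition.
# A closed-form least-squares fit on synthetic targets stands in here for the
# gradient-based estimation used in practice.
x = rng.standard_normal((d_in, 100))           # 100 adaptation frames
lam_true = rng.standard_normal(P)
y = adapted_weight(lam_true) @ x               # synthetic adaptation targets

A = np.stack([(bases[p] @ x).ravel() for p in range(P)], axis=1)
lam_hat, *_ = np.linalg.lstsq(A, y.ravel(), rcond=None)

print(np.allclose(lam_hat, lam_true))          # True: only 10 numbers estimated
```

The sketch shows why adaptation is cheap: regardless of the layer's size (here 32 x 64 = 2048 weights), only the P = 10 interpolation coefficients are updated for a new condition.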