Abstract

Eigenphone-based speaker adaptation outperforms conventional maximum likelihood linear regression (MLLR) and eigenvoice methods when there is sufficient adaptation data. However, it suffers from severe over-fitting when only a few seconds of adaptation data are provided. In this paper, various regularization methods are investigated to obtain a more robust speaker-dependent eigenphone matrix estimation. Element-wise l1 norm regularization (known as lasso) encourages the eigenphone matrix to be sparse, which reduces the number of effective free parameters and improves generalization. Squared l2 norm regularization promotes an element-wise shrinkage of the estimated matrix towards zero, thus alleviating over-fitting. Column-wise unsquared l2 norm regularization (known as group lasso) acts like the lasso at the column level, encouraging column sparsity in the eigenphone matrix, i.e., preferring an eigenphone matrix with many zero columns as the solution. Each column corresponds to an eigenphone, which is a basis vector of the phone variation subspace. Thus, group lasso tries to prevent the dimensionality of the subspace from growing beyond what is necessary. For nonzero columns, group lasso acts like a squared l2 norm regularization with an adaptive weighting factor at the column level. Two combinations of these methods are also investigated, namely elastic net (applying l1 and squared l2 norms simultaneously) and sparse group lasso (applying l1 and column-wise unsquared l2 norms simultaneously). Furthermore, a simplified method for estimating the eigenphone matrix in the case of diagonal covariance matrices is derived, and a unified framework for solving various regularized matrix estimation problems is presented. Experimental results show that these methods improve the adaptation performance substantially, especially when the amount of adaptation data is limited. The best results are obtained when using the sparse group lasso method, which combines the advantages of both the lasso and group lasso methods. Using speaker-adaptive training, performance can be further improved.
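For reference, the penalties above can be written compactly. This is a minimal sketch, assuming V = [v_1, ..., v_N] denotes the speaker-dependent eigenphone matrix with entries v_ij and columns v_j (the eigenphones); this notation is assumed here for illustration, not taken from the paper:

    lasso (element-wise l1):         R(V) = λ Σ_{i,j} |v_ij|
    squared l2:                      R(V) = λ Σ_{i,j} v_ij^2
    group lasso (column-wise l2):    R(V) = λ Σ_j ||v_j||_2
    elastic net:                     R(V) = λ1 Σ_{i,j} |v_ij| + λ2 Σ_{i,j} v_ij^2
    sparse group lasso:              R(V) = λ1 Σ_{i,j} |v_ij| + λ2 Σ_j ||v_j||_2

Each penalty R(V) is subtracted from the maximum likelihood objective during estimation, with λ (or λ1, λ2) controlling the regularization strength.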

Highlights

  • Model space speaker adaptation is an important technique in modern speech recognition systems

  • We focus on the speaker adaptation of a conventional hidden Markov model Gaussian mixture model (HMM-GMM)-based speech recognition system

  • We investigate various regularization methods to improve the robustness of the estimation of the eigenphone matrix in eigenphone-based speaker adaptation


Summary

Introduction

Model space speaker adaptation is an important technique in modern speech recognition systems. In eigenphone-based adaptation, a speaker-dependent eigenphone matrix representing the main phone variation patterns of a specific speaker is estimated: a speaker-independent phone coordinate matrix is first obtained by principal component analysis (PCA), and adaptation is then performed by estimating a set of eigenphones for each unknown speaker using the maximum likelihood criterion. This phone-space adaptation method obtains good performance when the adaptation data is sufficient. In [6], similar regularization methods were adopted to improve the estimation of state-specific parameters in the subspace Gaussian mixture model (SGMM). In this paper, we investigate the regularized estimation of the speaker-dependent eigenphone matrix for speaker adaptation.
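To make the effect of these penalties concrete, below is a minimal NumPy sketch of the proximal operators that a proximal-gradient solver would apply to the eigenphone matrix at each iteration. The function names and the choice of a proximal-gradient scheme are illustrative assumptions; this is not a reproduction of the paper's unified estimation framework.

    import numpy as np

    def prox_lasso(V, lam):
        # Element-wise soft-thresholding: proximal operator of lam * sum_ij |v_ij|.
        # Drives individual entries of the eigenphone matrix exactly to zero.
        return np.sign(V) * np.maximum(np.abs(V) - lam, 0.0)

    def prox_group_lasso(V, lam):
        # Column-wise shrinkage: proximal operator of lam * sum_j ||v_j||_2.
        # Columns (eigenphones) whose l2 norm falls below lam become exactly zero,
        # shrinking the effective dimensionality of the phone variation subspace.
        norms = np.linalg.norm(V, axis=0, keepdims=True)
        scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
        return V * scale

    def prox_sparse_group_lasso(V, lam1, lam2):
        # Sparse group lasso: element-wise soft-thresholding followed by
        # column-wise shrinkage (this composition is the exact proximal operator).
        return prox_group_lasso(prox_lasso(V, lam1), lam2)

In such a scheme, each gradient step on the negative likelihood is followed by the appropriate proximal step, so element and column sparsity appear directly in the iterates.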

Review of the eigenphone-based speaker adaptation method
Methods
Eigenphone-based speaker adaptation using sparse group lasso
Conclusion
