Abstract

In Vocal Tract Length Normalization (VTLN), a linear or nonlinear frequency transformation compensates for different vocal tract lengths. Finding good estimates of the speaker-specific warp parameters is a critical issue. Despite good results using the Maximum Likelihood criterion to find the parameters of a linear warping, there are concerns about this method. We therefore searched for a new criterion that enhances the inter-class separability in addition to optimizing the distribution of each phonetic class. Using such a criterion, Linear Discriminant Analysis determines a linear transformation into a lower-dimensional space. For VTLN, we keep the dimension constant and warp the training samples of each speaker such that the Linear Discriminant criterion is optimized. Although this criterion depends on all training samples of all speakers, it can iteratively provide speaker-specific warp factors. We discuss how this approach can be applied in speech recognition and present first results on two different recognition tasks.

1 Speaker Normalization using VTLN

Vocal Tract Length Normalization (VTLN) has been shown to decrease the word error rate of a speech recognition system compared to systems that do not use such an approach to reduce the variability introduced by different speakers. The main effect addressed here is a shift of the formant frequencies caused by the speakers' different vocal tract lengths. Two issues have been investigated. The first is how to map one speaker's spectrum onto that of a "standard" or average speaker, depending on a warp parameter that is correlated with the vocal tract length. The other issue is how to find an appropriate warp parameter for each speaker. Most studies assume that the same algorithm is used for training and testing, but this is not always necessary.

[Acero (1990)] used a bilinear transform with one speaker-dependent parameter. In a first attempt he observed that the algorithm chose a degenerate case in which all input frames are transformed into a constant. Therefore, he enforced a constant average warping parameter over all speakers.

Modeling the vocal tract as a uniform tube of length L, the formant frequencies are proportional to 1/L. Therefore, some approaches use a linear warp of the frequency scale to normalize speakers. The warp can be performed in the time domain or in the spectral domain. In the latter case, a new spectrum is derived by interpolation or by modifying the Mel frequency filter bank. When the warp is applied in the spectral domain, the problem of mismatching frequency ranges occurs. [Wegmann et al. (1996)] used a piecewise linear spectral mapping to avoid this problem; they estimated the slope of the transformation function based on a maximum likelihood criterion. [Eide and Gish (1996)] proposed a compromise between different vowel models, namely the uniform tube model and the Helmholtz resonator, and warped the frequency axis f of a speaker according to a parametric function derived from this compromise.
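To illustrate the spectral-domain warping discussed above, the following sketch (not taken from the paper) applies a piecewise-linear frequency warp to a magnitude spectrum by interpolation. The breakpoint at 80% of the Nyquist frequency, the function names, and the direction of the mapping are illustrative assumptions, not the exact transformation of [Wegmann et al. (1996)]; the second segment is only chosen so that the Nyquist frequency maps onto itself, which avoids the mismatching-frequency-range problem.

```python
# Minimal sketch of a piecewise-linear frequency warp applied in the spectral
# domain. All parameter values are illustrative assumptions.
import numpy as np

def piecewise_linear_warp(freqs, alpha, f_nyquist, f_break_ratio=0.8):
    """Map each frequency (Hz) to its warped position.

    alpha         : speaker-specific warp factor (slope of the first segment)
    f_nyquist     : highest analysis frequency (sample_rate / 2)
    f_break_ratio : breakpoint as a fraction of Nyquist (illustrative value)
    """
    f_break = f_break_ratio * f_nyquist
    return np.where(
        freqs <= f_break,
        alpha * freqs,                                   # linear segment with slope alpha
        alpha * f_break                                  # second segment, chosen so that
        + (freqs - f_break)                              # f_nyquist maps onto f_nyquist
        * (f_nyquist - alpha * f_break) / (f_nyquist - f_break),
    )

def warp_spectrum(spectrum, alpha, sample_rate):
    """Resample a magnitude spectrum onto the warped frequency axis:
    the normalized value at bin f is read from the original spectrum
    at the warped frequency, using linear interpolation."""
    n_bins = len(spectrum)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    warped_freqs = piecewise_linear_warp(freqs, alpha, sample_rate / 2.0)
    return np.interp(warped_freqs, freqs, spectrum)
```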

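The abstract's idea of choosing each speaker's warp factor so that a discriminative criterion over all training data is optimized might be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes hypothetical front-end hooks extract_features(speaker, alpha) and labels_of(speaker), an illustrative warp-factor grid, and uses the Fisher criterion trace(Sw^{-1} Sb) over phonetic classes as the Linear Discriminant objective. Because the criterion depends on all speakers' warped training samples, the speaker-specific factors are re-estimated in alternating passes, keeping the other speakers' factors fixed.

```python
# Minimal sketch of warp-factor estimation driven by an LDA-style criterion.
# extract_features, labels_of, and the grid values are assumptions for
# illustration only.
import numpy as np

def lda_criterion(features, labels):
    """Fisher criterion trace(Sw^{-1} Sb) for class-labelled feature vectors."""
    mean_total = features.mean(axis=0)
    dim = features.shape[1]
    s_within = np.zeros((dim, dim))
    s_between = np.zeros((dim, dim))
    for c in np.unique(labels):
        x = features[labels == c]
        mean_c = x.mean(axis=0)
        s_within += (x - mean_c).T @ (x - mean_c)
        diff = (mean_c - mean_total).reshape(-1, 1)
        s_between += len(x) * (diff @ diff.T)
    # small regularization keeps the within-class scatter invertible
    s_within += 1e-6 * np.eye(dim)
    return np.trace(np.linalg.solve(s_within, s_between))

def estimate_warp_factors(speakers, extract_features, labels_of,
                          grid=np.arange(0.88, 1.13, 0.02), n_iterations=3):
    """For each speaker, pick the warp factor from `grid` that maximizes the
    global LDA criterion while the other speakers' factors stay fixed;
    repeat until the factors stabilize (here: a fixed number of passes)."""
    alphas = {spk: 1.0 for spk in speakers}
    for _ in range(n_iterations):
        for spk in speakers:
            best_alpha, best_score = alphas[spk], -np.inf
            for alpha in grid:
                feats, labs = [], []
                for s in speakers:
                    a = alpha if s == spk else alphas[s]
                    feats.append(extract_features(s, a))
                    labs.append(labels_of(s))
                score = lda_criterion(np.vstack(feats), np.concatenate(labs))
                if score > best_score:
                    best_alpha, best_score = alpha, score
            alphas[spk] = best_alpha
    return alphas
```

The grid search mirrors common VTLN practice of evaluating a small set of discrete warp factors per speaker; a gradient-based search would be an alternative design choice.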