A time/speaker normalization technique for word verification

D L Heisey, K P Li, C K Kau

doi:10.1121/1.2016876

Abstract

The present approach to the word verification problem matches an input template against a known word reference template. The multidimensional template consists of formant values over the duration of the utterance. Because of speaking rate and idiosyncratic variations among speakers, both temporal and spectral normalizations are required. The time normalization technique employs a piecewise linear time warping function to map the unnormalized utterance into the normalized reference space. Pivot points used to align the input data template with the reference template are located in the utterance using maximum a posteriori (MAP) estimation. Statistics required to define the multidimensional density function for the estimator are obtained by training on known pivot point locations. Frequency normalization is then performed by linearly scaling the input template (with a gain and bias) to minimize the weighted mean-square error between the scaled input template and the reference template. This error, together with the corresponding gain and bias terms, is used in a metric for word verification.
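
The two normalization steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the pivot points are taken as already located (the MAP estimator is not reproduced), a single formant track stands in for the multidimensional template, the weights are uniform, and the function names `piecewise_linear`, `time_normalize`, and `fit_gain_bias` are hypothetical. The abstract does not say how the error, gain, and bias are combined into the verification metric, so the sketch stops at computing the three quantities.

```python
from bisect import bisect_right


def piecewise_linear(x, xs, ys):
    """Evaluate the piecewise linear function through (xs[i], ys[i]) at x.

    xs must be strictly increasing; x is clamped to [xs[0], xs[-1]].
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1
    frac = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + frac * (ys[i + 1] - ys[i])


def time_normalize(track, input_pivots, ref_pivots, n_ref):
    """Resample one formant track onto n_ref reference frames.

    input_pivots and ref_pivots are matching frame indices
    (assumed here to be given rather than estimated).
    """
    frames = list(range(len(track)))
    out = []
    for t_ref in range(n_ref):
        # Reference frame -> input time via the piecewise linear warp.
        t_in = piecewise_linear(t_ref, ref_pivots, input_pivots)
        # Linearly interpolate the input track at that (fractional) time.
        out.append(piecewise_linear(t_in, frames, track))
    return out


def fit_gain_bias(x, r, w):
    """Gain and bias minimizing sum(w * (gain*x + bias - r)**2) / sum(w)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mr = sum(wi * ri for wi, ri in zip(w, r)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    sxr = sum(wi * (xi - mx) * (ri - mr) for wi, xi, ri in zip(w, x, r)) / sw
    gain = sxr / sxx
    bias = mr - gain * mx
    err = sum(wi * (gain * xi + bias - ri) ** 2
              for wi, xi, ri in zip(w, x, r)) / sw
    return gain, bias, err


# Toy data (made up for illustration): a 5-frame reference track and a
# 7-frame input whose first half is spoken at half the reference rate.
reference = [500.0, 600.0, 700.0, 800.0, 900.0]
ref_pivots = [0, 2, 4]
input_track = [200.0, 225.0, 250.0, 275.0, 300.0, 350.0, 400.0]
input_pivots = [0, 4, 6]

warped = time_normalize(input_track, input_pivots, ref_pivots, len(reference))
weights = [1.0] * len(reference)  # uniform weights, an assumption
gain, bias, err = fit_gain_bias(warped, reference, weights)
```

In this toy case the warped input lines up frame by frame with the reference, and the fit recovers the gain and bias that were used to build the input. With real formant data the residual error would be nonzero and, per the abstract, would feed into the verification metric along with the gain and bias.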
