On local time–frequency features of speech and their employment in speaker verification

Robert M Nickel, William J Williams

doi:10.1016/s0016-0032(00)00032-6

Abstract

Commonly used robust speaker verification systems are based on time-varying autoregressive spectral estimation (AR) combined with hidden Markov modeling (HMM) or dynamic time warping (DTW). (We are concerned only with text-dependent verification with cooperative speakers in a low-noise environment.) Exhaustive optimization of these methods in the past has culminated in quite reliable verification schemes. It seems unlikely, though, that further significant improvements can readily be obtained along the same path. While short-time AR modeling focuses on the time-varying spectral envelope of an utterance, we introduce a new method that focuses on high-resolution estimates of the time-varying spectral structure of individual pitch periods. The new method employs reduced interference time–frequency distributions (RIDs) in combination with a scale- and translation-invariant pattern recognition technique (STIR). By itself, the new method does not deliver better results than commonly used techniques; however, it is shown that an acceptance/rejection decision derived from both AR-DTW and RID–STIR features greatly improves the performance of the verification system.

Full Text