The performance of automatic speech recognition systems is commonly measured by using large quantities of natural speech as a benchmark. For recognizers that accept large vocabularies, or when many alternative vocabularies are of interest, this method requires an unreasonably large natural speech corpus. Synthetic speech which incorporates variations in pronunciation is an alternative in these cases. This approach was used to evaluate the performance of a speaker-trained, isolated-word speech recognition system [H. Murveit, M. Lowy, and R. W. Brodersen, J. Acoust. Soc. Am. Suppl. 1 69, S8 (1981)]. Each word to be recognized was modeled as a sequence of segments, each segment having a variable length and a uniform spectrum with added noise. A few tokens of natural speech were used to estimate the parameters of the model, which was then used to produce synthetic tokens for testing the recognizer. Good correlation was obtained between the confusion matrices produced by recognizing synthetic and natural speech. [Work supported in part by DARPA.]
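The abstract does not give the generation procedure in detail; the following is a minimal sketch, under the assumption that each segment is described by a mean spectrum, a noise level, and a duration range, and that a synthetic token is built frame by frame. All names and parameter values here are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_token(segments):
    """Generate one synthetic word token.

    segments: list of (mean_spectrum, noise_std, min_frames, max_frames),
    i.e. each segment has a uniform (constant) spectrum, additive noise,
    and a variable duration. (Hypothetical parameterization.)
    """
    frames = []
    for mean_spectrum, noise_std, min_frames, max_frames in segments:
        n_frames = rng.integers(min_frames, max_frames + 1)  # variable segment length
        for _ in range(n_frames):
            # constant spectrum within the segment, perturbed by added noise
            frames.append(mean_spectrum + rng.normal(0.0, noise_std, mean_spectrum.shape))
    return np.array(frames)

# Hypothetical two-segment word model over 16 spectral channels.
word_model = [
    (np.linspace(0.0, 1.0, 16), 0.05, 5, 9),
    (np.linspace(1.0, 0.0, 16), 0.05, 3, 7),
]
token = synthesize_token(word_model)
print(token.shape)  # (total_frames, 16)
```

Repeated calls to such a generator would yield many test tokens per word from only the few natural tokens used to fit the segment parameters.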