On the application of embedded digit training to speaker independent connected digit recognition

L Rabiner, A Quinn, J Wilpon, S Terrace

doi:10.1109/tassp.1984.1164298

Abstract

In recent years, several algorithms have been proposed for recognizing a string of connected words (typically digits) by optimally piecing together reference patterns corresponding to the words in the string. Although the algorithms differ greatly in details of implementation, storage requirements, etc., they all have essentially the same performance, in that their ability to match the unknown string depends on how well words spoken in isolation match their counterparts in connected speech. For low rates of articulation (i.e., about 100-130 words per minute), the performance of such connected word recognition systems is excellent. However, as the articulation rate approaches that of continuous discourse (180-300 words per minute), the performance of such connected word recognizers falls dramatically. To partially alleviate these problems, a modified training procedure was devised in which multiple versions of each reference word were used. The multiple versions included an isolated form of each word and 2 versions of the word extracted from the middle of 3-word sequences. One of these embedded reference patterns represented a noncontextual token of the word (i.e., spoken in a format where the words on either side had minimal effect on the acoustic properties at the boundaries), and the second represented a highly contextual token of the word. It was shown that a training algorithm could be devised to obtain these embedded reference tokens, and that when the multiple reference patterns were used, the performance of a speaker-trained system was greatly improved at faster talking rates. In this paper we show how the embedded training technique can be extended to the case of speaker-independent connected word recognizers. In particular, we show that improved recognition performance on connected digit strings is obtained by using standard clustering procedures on the embedded tokens to give a speaker-independent embedded reference set. We also show that the use of the K-nearest neighbor (KNN) rule leads to additional real improvements in performance for recognizing strings of connected digits. A discussion of the types of problems that remain is given.
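The KNN rule mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the exact form of the rule, so the version below is an assumption: each word is scored by the average of its K smallest template distances, and the word with the lowest score wins. The function names, the example distances, and the restriction to a single word segment (rather than a full connected string) are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def knn_word_scores(distances, k=2):
    """Score each word by the mean of its k smallest template distances.

    distances: iterable of (word_label, distance) pairs, one pair per
    reference template (e.g. isolated and embedded tokens of each digit).
    Assumption: this averaging form of the KNN rule is illustrative only.
    """
    by_word = defaultdict(list)
    for word, dist in distances:
        by_word[word].append(dist)

    scores = {}
    for word, dists in by_word.items():
        best = sorted(dists)[:k]  # the k closest templates (or fewer if scarce)
        scores[word] = sum(best) / len(best)
    return scores


def classify(distances, k=2):
    """Return the word whose KNN score is smallest."""
    scores = knn_word_scores(distances, k)
    return min(scores, key=scores.get)


# Hypothetical distances from one unknown segment to three templates per word.
example = [
    ("one", 0.30), ("one", 0.90), ("one", 0.50),
    ("nine", 0.28), ("nine", 1.10), ("nine", 1.00),
]

print(classify(example, k=1))  # nearest single template decides
print(classify(example, k=2))  # average of the two nearest templates decides
```

With these made-up numbers, K = 1 picks "nine" because of a single very close template, while K = 2 picks "one" because two of its templates are consistently close. This shows how averaging over several templates per word can reduce the influence of one spuriously good match.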

Full Text