In May 1988, a speaker‐independent recognition algorithm was described at the 115th Meeting of the Acoustical Society of America [Scott et al., J. Acoust. Soc. Am. Suppl. 1 83, S55 (1988)]. The algorithm yielded 95.4% recognition accuracy on the 20‐word TI database obtained from the National Bureau of Standards (NBS). The system has since been adapted for use over the telephone. For this purpose, a new database was developed consisting of 16 words (zero through nine, oh, yes, no, cancel, terminate, and help) spoken by 11 males and 11 females from various locations across the country. Although small (1001 utterances), the database represented a significant challenge compared with the one obtained from the NBS: it had fewer training passes per word, more speakers, and considerably more noise. Initial tests on this database yielded accuracies of approximately 60%. Four major enhancements to the algorithm improved the accuracy on this database to 95.3%. Two of the enhancements compensate for time alignment problems inherent in both linear and nonlinear time normalization routines through the use of prosody and redundancy. The other two enhancements are the extraction of additional features and better use of variances. Results indicate that acceptable speaker‐independent recognition can be obtained with minimal training and processor requirements, given effective normalization routines during the front‐end signal processing.