Experiments comparing isolated word recognition by human listeners with automatic speech recognition systems are valuable because error analyses may lead to improvements in speech recognition technology. Isolated word recognition by adult human listeners has been compared with recognition performance by two commercially available speech-recognition systems. The test stimuli were drawn from the Lincoln Laboratory Stressed-Speech database, which consists of 6930 stimuli (two iterations of each of 35 words spoken by nine different talkers in 11 different speaking styles). The vocabulary contains confusable words (e.g., go, hello, oh, no, and zero), and the speaking styles span a wide range of naturally occurring variations (e.g., normal, slow, fast, soft, loud, and angry). Analyses show that the acoustic characteristics of individual words vary considerably across talkers, and across styles within talkers. Performance of the human listeners and the two machine-based recognition systems was tested in a single-talker, multistyle condition and in a multitalker, multistyle condition. All tests were conducted under two listening conditions: normal and in the presence of masking noise. The data to be presented are the error patterns exhibited by the human listeners and by the machine-recognition systems across talkers, across speaking styles, and across training conditions (multitalker, multistyle training versus single-talker, single-style training). [Work supported by Boeing Aerospace and Electronics.]