Abstract

The gap between human and machine performance on speech recognition tasks remains large. Word recognition accuracy in telephone conversations is only slightly better than 50%, based on results reported by leading researchers on the Switchboard corpus using state-of-the-art HMM systems, yet everyday experience shows that human perception delivers far more accurate word recognition over the telephone. Why is the gap between machine and human performance so large, and what can be done to close it? One way to address this question is to study the sources of linguistic information in the speech signal that are known to be important for word recognition, and to measure how well machine systems exploit this information relative to humans. We measured the word recognition performance of listeners presented with words from the Switchboard corpus. Stimuli consisted of actual utterances excised from the Switchboard corpus, high-quality recordings of utterances that occurred in Switchboard conversations, and recordings of word sequences with zero, medium, and high bigram probabilities under a language model computed from transcriptions of the Switchboard corpus. The results show that human listeners are very good at recognizing words even in the absence of word-sequence constraints, and that statistical language models fail to capture much of the high-level linguistic information needed to recognize words in fluent speech. We discuss the implications of these results for current approaches to acoustic and language modeling in computer speech recognition.
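The stimulus conditions above rely on bigram probabilities from a language model trained on Switchboard transcriptions. As a minimal sketch of how such probabilities are assigned, the following toy maximum-likelihood bigram model (an illustration with invented example sentences, not the paper's actual model, which would also use smoothing and a much larger corpus) scores a word sequence as the product of its conditional bigram probabilities:

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences,
    padding each sentence with a start symbol <s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); 0.0 if w1 is unseen."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

def sequence_prob(unigrams, bigrams, sentence):
    """Probability of a word sequence as the product of its bigram
    probabilities; any unseen bigram drives the product to zero."""
    tokens = ["<s>"] + sentence.lower().split()
    p = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p *= bigram_prob(unigrams, bigrams, w1, w2)
    return p

# Hypothetical training corpus for illustration only.
uni, bi = train_bigram_model(["i like speech", "i like recognition"])
print(sequence_prob(uni, bi, "i like speech"))   # attested order: high probability
print(sequence_prob(uni, bi, "speech like i"))   # unattested order: zero probability
```

Under such a model, "zero bigram probability" sequences are word strings containing at least one word pair never observed in the training transcriptions, which is how stimuli with no word-sequence constraints can be constructed.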
