Abstract

This paper is concerned with small-vocabulary speech recognition from conversational utterances over the telephone network. Modeling techniques are investigated for dealing with the large numbers of out-of-vocabulary words and artifacts that arise in these utterances. A hidden Markov model (HMM)-based continuous speech recognition system with a frame-synchronous Viterbi beam search decoder is used for recognition. Keyword models compete in the finite-state network with "filler" models of non-keyword speech. Three issues relating to the quality of the acoustic and language representations for this task were investigated. The first was the definition of acoustic subword units using allophone clustering procedures. The second was the size of the vocabulary used for modeling non-keyword utterances. The third was the use of language models in unconstrained speech tasks. Experimental results are presented for a 20-keyword recognition task in which performance was evaluated on continuous utterances from 22 speakers. The results show that all of the procedures, including decision-tree-based allophone clustering, better out-of-vocabulary speech representations, and language models, contributed to overall recognition performance. The best-performing system provided 76% average probability of keyword detection at 5.8 false alarms per keyword per hour.
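The operating point quoted above combines two quantities: the fraction of true keyword occurrences that are detected, and the number of false alarms normalized by vocabulary size and by hours of speech. The following is a minimal sketch of how such figures are commonly computed. It assumes counts pooled over all keywords (the exact averaging used in the paper is not stated here), and the function names and the counts in the example are hypothetical, chosen only to illustrate the arithmetic.

```python
def detection_probability(hits: int, true_occurrences: int) -> float:
    """Fraction of true keyword occurrences that were correctly detected."""
    if true_occurrences <= 0:
        raise ValueError("true_occurrences must be positive")
    return hits / true_occurrences


def false_alarms_per_keyword_per_hour(
    false_alarms: int, num_keywords: int, speech_hours: float
) -> float:
    """False alarms normalized by vocabulary size and duration of speech."""
    if num_keywords <= 0 or speech_hours <= 0:
        raise ValueError("num_keywords and speech_hours must be positive")
    return false_alarms / (num_keywords * speech_hours)


# Hypothetical counts, NOT taken from the paper: 20 keywords, 2 hours of
# speech, 232 false alarms, and 76 of 100 keyword occurrences detected.
p_detect = detection_probability(hits=76, true_occurrences=100)
fa_rate = false_alarms_per_keyword_per_hour(
    false_alarms=232, num_keywords=20, speech_hours=2.0
)
print(f"P(detect) = {p_detect:.2f}, FA/keyword/hour = {fa_rate:.1f}")
```

Normalizing by the number of keywords makes false alarm rates comparable across tasks with different vocabulary sizes.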
