Abstract

Robustness of speech recognition can be significantly improved by multi-stream and especially by audiovisual speech recognition. This is of interest, for example, for human-machine interaction in noisy, reverberant environments, and for transcription of or search in multimedia data. The most robust implementations of audiovisual speech recognition often utilize Coupled Hidden Markov Models (CHMMs), which allow both modalities to be asynchronous to a certain degree. In contrast to conventional speech recognition, this significantly increases the search space, so current CHMM implementations are often not real-time capable. Thus, for real-time constrained applications such as online transcription of VoIP communication or responsive multi-modal human-machine interaction, exploiting current multiprocessor computing capability is vital. This paper describes how general-purpose graphics processors can be used to obtain a real-time implementation of audiovisual and multi-stream speech recognition. The design has been integrated with both a WFST decoder and a token passing system, with parallelization leading to maximum speedup factors of 32 and 25, respectively.
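The search-space growth mentioned above can be illustrated with a small sketch. A CHMM decodes over composite states, i.e. pairs of per-stream states, and the permitted asynchrony between the audio and video streams determines how many such pairs are reachable. The function below is purely illustrative (the state counts and the asynchrony measure are hypothetical, not taken from the paper's system):

```python
from itertools import product

def chmm_states(n_audio, n_video, max_async):
    """Composite CHMM states: pairs of per-stream state indices whose
    difference is at most `max_async` (the allowed asynchrony)."""
    return [(a, v) for a, v in product(range(n_audio), range(n_video))
            if abs(a - v) <= max_async]

# A single-stream HMM with 5 states searches over 5 states per frame;
# a fully coupled 5x5 CHMM searches over up to 25 composite states.
print(len(chmm_states(5, 5, 0)))  # fully synchronous: 5
print(len(chmm_states(5, 5, 1)))  # one state of slack: 13
print(len(chmm_states(5, 5, 4)))  # unconstrained: 25
```

Even with a tight asynchrony constraint, the composite space is several times larger than either stream alone, which is why parallel hardware such as GPUs becomes attractive for real-time decoding.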
