Abstract

The robustness of speech recognition can be improved significantly by multi-stream, and especially by audiovisual, speech recognition. This is of interest, for example, for human-machine interaction in noisy, reverberant environments and for transcribing or searching multimedia data. The most robust implementations of audiovisual speech recognition often use Coupled Hidden Markov Models (CHMMs), which allow the two modalities to be asynchronous to a certain degree. Compared with conventional speech recognition, this increases the search space significantly, so current implementations of CHMM systems are often not real-time capable. Thus, for real-time-constrained applications such as online transcription of VoIP communication or responsive multi-modal human-machine interaction, exploiting current multiprocessor computing capability is vital. This paper describes how general-purpose graphics processors can be used to obtain a real-time implementation of audiovisual and multi-stream speech recognition. The design has been integrated with both a WFST decoder and a token-passing system, with parallelization leading to maximum speedup factors of 32 and 25, respectively.
