Abstract

In recent years, a number of techniques have been proposed to improve the accuracy and robustness of automatic speech recognition in noisy environments. Among these, supplementing the acoustic information with visual data, mostly extracted from the speaker's lip shapes, has proved successful. We have already demonstrated the effectiveness of integrating visual data at two different levels during speech decoding, according to both direct and separate identification strategies (DI+SI). This paper outlines methods for reinforcing visible speech recognition in the framework of separate identification. First, we define visual-specific units using a self-organizing mapping technique. Second, we complete a stochastic learning of these units with a discriminative neural-network-based technique for speech recognition purposes. Finally, we show, on a connected-letter speech recognition task, that these methods improve the performance of the DI+SI-based system under varying noise-level conditions.
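To make the first step concrete, the following is a minimal sketch of how visual-specific units could be derived with a self-organizing map: lip-shape feature vectors are clustered onto a small 2-D grid, and each trained codebook vector then stands for one visual unit. This is an illustration only, not the paper's implementation; the function name `train_som`, the grid size, the feature dimensionality, and the decay schedules are all assumptions.

```python
import numpy as np

def train_som(features, grid_h=4, grid_w=4, epochs=20,
              lr0=0.5, sigma0=2.0, seed=0):
    """Train a small 2-D self-organizing map on visual feature vectors.

    features: (n_samples, n_dims) array of lip-shape parameters.
    Returns a codebook of shape (grid_h * grid_w, n_dims); each
    codebook vector can serve as one visual-specific unit.
    (Hypothetical sketch; not the authors' actual configuration.)
    """
    rng = np.random.default_rng(seed)
    n_units = grid_h * grid_w
    codebook = rng.normal(size=(n_units, features.shape[1]))
    # Grid coordinates of each unit, used by the neighborhood function.
    coords = np.array([(i, j) for i in range(grid_h) for j in range(grid_w)])

    n_steps = epochs * len(features)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(features):
            t = step / n_steps
            lr = lr0 * (1.0 - t)              # linearly decaying learning rate
            sigma = sigma0 * (1.0 - t) + 0.5  # shrinking neighborhood radius
            # Best-matching unit: codebook vector closest to the input.
            bmu = np.argmin(np.linalg.norm(codebook - x, axis=1))
            # Gaussian neighborhood on the grid around the BMU.
            d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))
            codebook += lr * h[:, None] * (x - codebook)
            step += 1
    return codebook

# Example: cluster synthetic 6-dimensional lip-shape vectors into 16 units.
lip_features = np.random.default_rng(1).normal(size=(500, 6))
units = train_som(lip_features)
print(units.shape)  # (16, 6): one prototype vector per visual unit
```

In a setup like the one the abstract describes, these prototypes would subsequently be refined with a discriminative, neural-network-based training stage rather than used as-is.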
