Abstract

We present our work on visual pruning for audio-visual (AV) speech recognition. Visual speech information has been used successfully in conditions where audio-only recognition degrades (e.g., noisy environments). Tracking and extracting a region of interest (ROI), such as the speaker's mouth region, from video is an essential component of such systems. The visual front-end must therefore handle tracking errors, which produce noisy visual data and hamper recognition performance. We describe our robust visual front-end, investigate methods for pruning visual noise, and evaluate the effect of pruning on the performance of AV speech recognition systems. Specifically, we estimate the goodness of the extracted ROI using Gaussian mixture models; our experiments indicate that significant performance gains are achieved with good-quality visual data.
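
As an illustration of the general idea of scoring an ROI with a Gaussian mixture model and discarding low-scoring frames, a minimal sketch follows. It is not the method of this work: the abstract does not specify the features, the number of mixture components, how the models are trained, or how the pruning decision is made. The diagonal covariances, the hand-set model parameters, the two-dimensional feature vectors, the threshold value, and the names `gmm_log_likelihood` and `prune_frames` are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

# One diagonal-covariance Gaussian component: (weight, means, variances).
# Hypothetical representation; a real system would train these parameters.
Component = Tuple[float, Sequence[float], Sequence[float]]


def gmm_log_likelihood(x: Sequence[float], gmm: Sequence[Component]) -> float:
    """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
    log_terms = []
    for weight, means, variances in gmm:
        log_p = math.log(weight)
        for xi, mu, var in zip(x, means, variances):
            log_p += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
        log_terms.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def prune_frames(
    frames: Sequence[Sequence[float]],
    gmm: Sequence[Component],
    threshold: float,
) -> List[int]:
    """Return indices of frames whose ROI score is at least `threshold`."""
    return [
        i for i, x in enumerate(frames)
        if gmm_log_likelihood(x, gmm) >= threshold
    ]


# Assumed toy model of "good" ROI features (two components, two dimensions).
good_roi_gmm: List[Component] = [
    (0.6, [0.0, 0.0], [1.0, 1.0]),
    (0.4, [2.0, 2.0], [1.0, 1.0]),
]

# Assumed per-frame ROI feature vectors; the last one mimics a tracking failure.
frames = [[0.1, -0.2], [2.1, 1.9], [9.0, -9.0]]

kept = prune_frames(frames, good_roi_gmm, threshold=-6.0)
print(kept)  # [0, 1]
```

In this sketch the third frame lies far from both mixture components, so its log-likelihood falls well below the assumed threshold and it is pruned, while the first two frames are kept.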
