Abstract

Current voice activity detection methods generally utilise only acoustic information. They are therefore susceptible to false classification in the presence of other acoustic sources, such as another speaker or non-stationary noise. To address this issue, the authors propose a new method of voice activity detection that uses solely visual information, in the form of a speaker's mouth region. Such video information is not affected by the acoustic environment. Simulations show that a high percentage of correct silence detection (CSD) can be obtained with a low percentage of false silence detection (FSD). Comparisons with two other visual voice activity detectors show the proposed method to be consistently more accurate, yielding on average a 4% improvement in CSD. The usefulness of the method is confirmed by applying it to a previously published audio–visual convolutive blind source separation algorithm to increase the intelligibility of a speaker.
