Abstract

Voice activity detection (VAD) is a vital module in applications such as speech recognition, speech enhancement, and dominant speaker identification. The performance of voice activity detectors that rely on audio cues declines considerably under low-SNR conditions. One way to improve VAD performance is to use the video signal, which is independent of the acoustic background. Video calls have become a popular means of communication, and recent devices such as laptops and smartphones have built-in cameras and microphones. The availability of a video signal alongside the audio signal can therefore be exploited for voice activity detection, particularly in noisy environments. This paper develops a binary mask for voice activity detection that separates the target speech from the noisy speech mixture using visual cues. The visual cues are extracted by detecting the mouth with the Viola–Jones algorithm and tracking lip movement with the Kanade–Lucas–Tomasi (KLT) algorithm. Finally, the mask extracted from visual cues is compared with the mask obtained from audio cues under low-SNR conditions. The performance of the proposed system is evaluated using PESQ. Experimental results show that the proposed system performs well under low-SNR conditions and improves the average PESQ score by 0.57 compared with existing systems that use only auditory cues for voice activity detection.

Keywords: Voice activity detector · Visual cues · Binary mask · Visual-VAD (V-VAD) · Audio-VAD (A-VAD)
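The final step of the pipeline described above, turning tracked lip movement into a binary voice-activity mask, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes lip feature points have already been obtained per frame (e.g. from a KLT tracker after Viola–Jones mouth detection), and the function name, smoothing window, and relative threshold are all illustrative assumptions.

```python
import numpy as np

def visual_vad_mask(lip_points, threshold=0.5, win=3):
    """Derive a binary voice-activity mask from tracked lip points.

    lip_points: array of shape (frames, n_points, 2) holding the (x, y)
    coordinates of lip feature points per video frame (assumed to come
    from a KLT tracker). Returns a 0/1 mask with one value per frame
    transition (length frames - 1): 1 = speech activity, 0 = silence.
    """
    # Per-transition lip motion: mean Euclidean displacement of the points.
    disp = np.linalg.norm(np.diff(lip_points, axis=0), axis=2)
    motion = disp.mean(axis=1)
    # Moving-average smoothing to suppress tracker jitter.
    smooth = np.convolve(motion, np.ones(win) / win, mode="same")
    # Threshold relative to the clip's peak motion to get the binary mask.
    return (smooth > threshold * smooth.max()).astype(int)
```

A quick synthetic check: feeding in 20 static frames followed by 20 frames in which the lip points move yields a mask that is 0 over the static segment and 1 over the moving one.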
