Abstract
Voice activity detection (VAD) is an essential segmentation process in speaker recognition systems, separating the speech and non-speech segments of voice samples. In speaker recognition, reference models are built from speech segments only. Different VAD segmentations therefore lead to variations in biometric models, and consequently in system performance, so VAD decisions need to be robust across different conditions. In this paper, the decision robustness of different VAD algorithms is examined on mobile data under simulated environmental noise conditions, using a Hamming distance based analysis that we propose. Based on an examination of VADs from speech and speaker recognition, we further propose to extend a well-performing VAD algorithm, which is based on a likelihood ratio comparison of speech and non-speech models, by including most dominant frequency component (MDFC) features for the selection of model training segments. This improves the robustness of VAD decisions by 7%, while sustaining an average equal error rate (EER) SNR-sensitivity of 0.76% per dB SNR.
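To illustrate the idea behind a Hamming distance based comparison of VAD decisions, the following minimal sketch counts the frames on which two binary decision sequences disagree. It is only an assumption about how such an analysis could be set up, not the procedure used in the paper: the function names `hamming_distance` and `normalized_hamming`, the frame-wise 0/1 encoding (1 = speech, 0 = non-speech), and the example sequences are all hypothetical.

```python
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the frames on which two binary VAD decision sequences differ.

    Assumed encoding (not taken from the paper): 1 = speech, 0 = non-speech,
    one decision per frame, both sequences aligned frame by frame.
    """
    if len(a) != len(b):
        raise ValueError("decision sequences must have the same number of frames")
    return sum(1 for x, y in zip(a, b) if bool(x) != bool(y))


def normalized_hamming(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of frames with differing decisions (0.0 = identical)."""
    if len(a) == 0:
        raise ValueError("decision sequences must not be empty")
    return hamming_distance(a, b) / len(a)


# Hypothetical decisions of one VAD on a clean sample and on the same
# sample with simulated noise added.
clean = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
noisy = [0, 1, 1, 1, 0, 1, 0, 0, 0, 1]

print(hamming_distance(clean, noisy))    # number of differing frames
print(normalized_hamming(clean, noisy))  # share of differing frames
```

Under this reading, a lower normalized distance between the decisions on the clean and the noisy version of a sample would indicate a VAD whose decisions are less affected by the noise condition.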