Abstract

Constructing models for speech/nonspeech discrimination is a central problem in voice activity detectors (VADs). In conventional VADs, semi-supervised learning is the most common approach to model construction. In this correspondence, we propose an unsupervised learning framework for constructing statistical models for VAD. The framework is realized by a sequential Gaussian mixture model (GMM) and comprises an initialization process and an updating process. At each subband, the GMM is first initialized using the EM algorithm and then updated sequentially, frame by frame. From the GMM, a self-regulatory threshold for discrimination is derived at each subband. Constraints are imposed on the GMM for reliability. Because the learning is unsupervised, the proposed VAD does not rely on the assumption, common in most VADs, that the first several frames of an utterance are nonspeech. Moreover, the speech presence probability in the time-frequency domain is obtained as a byproduct. We tested the VAD on speech from the TIMIT database and noise from the NOISEX-92 database. The evaluations showed promising performance in comparison with VADs such as ITU G.729B, GSM AMR, and a typical semi-supervised VAD.
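
The initialize-then-update structure described in the abstract can be illustrated with a minimal sketch for a single subband. The abstract does not give the update equations, the threshold derivation, or the constraints, so everything specific here is an assumption made for illustration: the two-component 1-D GMM over per-frame log-energy, the forgetting factor `alpha`, the variance floor, the mean-ordering constraint, the 0.5 posterior decision that stands in for the self-regulatory threshold, and all function names (`init_em`, `sequential_update`, `run_vad`).

```python
import numpy as np


def gauss(x, mean, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def init_em(x, n_iter=50, var_floor=1e-3):
    """Fit a two-component 1-D GMM to the frames in x with batch EM."""
    x = np.asarray(x, dtype=float)
    # Assumed initialization: quartiles, so component 0 starts as the
    # low-energy ("noise") component and component 1 as "speech".
    mean = np.percentile(x, [25, 75]).astype(float)
    var = np.full(2, max(x.var(), var_floor))
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame.
        lik = weight * gauss(x[:, None], mean, var)
        resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        weight = nk / nk.sum()
        mean = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mean) ** 2).sum(axis=0) / nk
        var = np.maximum(var, var_floor)
    return weight, mean, var


def sequential_update(params, x_t, alpha=0.98, var_floor=1e-3):
    """Update the GMM with one new frame; return new params and the
    posterior of the high-energy component for that frame."""
    weight, mean, var = params
    lik = weight * gauss(x_t, mean, var)
    resp = lik / (lik.sum() + 1e-300)
    # Assumed recursive update with exponential forgetting.
    weight = alpha * weight + (1.0 - alpha) * resp
    weight = weight / weight.sum()
    step = (1.0 - alpha) * resp / weight
    mean = mean + step * (x_t - mean)
    var = np.maximum(var + step * ((x_t - mean) ** 2 - var), var_floor)
    # Assumed reliability constraint: keep the noise mean below the speech mean.
    order = np.argsort(mean)
    return (weight[order], mean[order], var[order]), resp[1]


def run_vad(x, n_init=200):
    """Initialize on the first n_init frames, then process the rest one by one."""
    params = init_em(x[:n_init])
    spp = np.empty(len(x) - n_init)
    for i, x_t in enumerate(x[n_init:]):
        params, spp[i] = sequential_update(params, x_t)
    # Assumed decision rule: speech when the posterior exceeds 0.5.
    return spp, (spp > 0.5).astype(int)
```

In this sketch, `run_vad` takes a 1-D array of per-frame log-energies for one subband and returns the per-frame posterior of the high-energy component together with binary decisions. A multi-subband detector would run one such model per subband and combine the results, a step the abstract does not specify.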

