Abstract

Most voice activity detection (VAD) schemes operate in the discrete Fourier transform (DFT) domain, classifying each sound frame as speech or noise based on its DFT coefficients. Because these coefficients serve as the VAD features, their robustness strongly affects the performance of the VAD scheme. However, shortcomings of modeling a signal in the DFT domain can easily degrade VAD performance in noisy environments. Instead of using DFT coefficients, this article presents a novel approach that uses the complex coefficients derived from a complex exponential atomic decomposition of the signal. Using a goodness-of-fit test, we show that these coefficients are suitably modeled by a Gaussian probability distribution. A statistical model is then employed to derive the decision rule from the likelihood ratio test. Experimental results show that the proposed VAD method outperforms VADs based on DFT coefficients in various noise environments.

Highlights

  • Voice activity detection (VAD) refers to the problem of distinguishing active speech from non-speech regions in a given audio stream, and it has become an indispensable component of many speech processing applications and modern speech communication systems [1,2,3], such as robust speech recognition, speech enhancement, and coding systems

  • We present an approach to VAD based on conjugate subspace matching pursuit (MP) and a statistical model

  • Using receiver operating characteristic (ROC) curves, we evaluated the performance of the proposed likelihood ratio test (LRT) VAD based on the MP coefficients (LRT-MP) by comparing it with popular LRT VADs based on discrete Fourier transform (DFT) coefficients, including Gaussian (LRT-Gaussian) [7], Laplacian (LRT-Laplacian) [8], and Gamma (LRT-Gamma) [10]


Summary

Introduction

Voice activity detection (VAD) refers to the problem of distinguishing active speech from non-speech regions in a given audio stream, and it has become an indispensable component of many speech processing applications and modern speech communication systems [1,2,3], such as robust speech recognition, speech enhancement, and coding systems. Earlier literature proposed various traditional VAD algorithms based on energy, zero-crossing rate, and spectral difference [1,4,5]. Much work has since been carried out to improve VAD performance in high-noise environments by incorporating a statistical model and a likelihood ratio test (LRT) [6]. These algorithms assume that the distributions of the noise and noisy speech spectra follow a parametric model such as the complex Gaussian [7], complex Laplacian [8], generalized Gaussian [9], or generalized Gamma distribution [10].
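
To make the statistical-model idea concrete, the following is a minimal sketch of a per-frame LRT decision under the complex Gaussian assumption on DFT coefficients. It is not the MP-based method proposed in this article. The function name `gaussian_lrt_frame`, the default threshold of 0.5, and the use of a simple maximum-likelihood estimate of the a priori SNR are assumptions made for illustration; the noise variance per bin is assumed to be already estimated.

```python
import math


def gaussian_lrt_frame(power_spectrum, noise_variance, threshold=0.5):
    """Classify one frame as speech or noise with a Gaussian LRT.

    power_spectrum: |X_k|^2 for each frequency bin k of the frame.
    noise_variance: noise variance estimate per bin (assumed known).
    threshold: decision threshold on the mean log-likelihood ratio
               (illustrative value, not taken from the article).
    Returns (is_speech, score).
    """
    log_ratios = []
    for power, lam_n in zip(power_spectrum, noise_variance):
        gamma = power / lam_n            # a posteriori SNR
        xi = max(gamma - 1.0, 0.0)       # ML estimate of the a priori SNR (assumption)
        # log of Lambda_k = exp(gamma * xi / (1 + xi)) / (1 + xi)
        log_ratios.append(gamma * xi / (1.0 + xi) - math.log1p(xi))
    # Average the per-bin log-likelihood ratios and compare with the threshold.
    score = sum(log_ratios) / len(log_ratios)
    return score > threshold, score


if __name__ == "__main__":
    noise_frame = [1.0, 1.0, 1.0, 1.0]
    speech_frame = [10.0, 8.0, 12.0, 9.0]
    noise_var = [1.0, 1.0, 1.0, 1.0]
    print(gaussian_lrt_frame(noise_frame, noise_var))
    print(gaussian_lrt_frame(speech_frame, noise_var))
```

In practice the threshold trades false alarms against missed speech, which is what an ROC curve sweeps over; swapping the Gaussian model for a Laplacian or Gamma model changes only the per-bin likelihood ratio.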
