Abstract

This paper proposes a novel and robust voice activity detection (VAD) algorithm based on the long-term spectral flatness measure (LSFM) that can operate at signal-to-noise ratios (SNRs) of 10 dB and below. The new LSFM-based VAD improves the robustness of speech detection in various noisy environments by employing a low-variance spectrum estimate and an adaptive threshold. The discriminative power of the new LSFM feature is shown through an analysis of the speech and non-speech LSFM distributions. The proposed algorithm was evaluated under 12 types of noise (11 from NOISEX-92, plus speech-shaped noise) and five SNR levels on the core TIMIT test corpus. Comparisons with three modern standardized algorithms (ETSI adaptive multi-rate (AMR) options AMR1 and AMR2, and ITU-T G.729) demonstrate that the proposed LSFM-based VAD scheme achieves the best average accuracy rate. A long-term signal variability (LTSV)-based VAD scheme is also compared with the proposed method. The results show that the proposed algorithm outperforms the LTSV-based VAD scheme for most of the noises considered, including difficult noises such as machine gun noise and speech babble noise.

Highlights

  • Voice activity detection (VAD) is a method to discriminate speech segments from input noisy speech

  • The main contribution of this article is an efficient VAD algorithm based on the long-term spectral flatness measure

  • The motivation for measuring flatness across time frames with a long window is clarified by the distributions of the long-term spectral flatness measure (LSFM) feature as a function of the long-term window length R

Summary

Introduction

Voice activity detection (VAD) is a method to discriminate speech segments from input noisy speech. Its performance degrades when faced with a low signal-to-noise ratio (SNR) or non-stationary background noise. To solve this problem, robust acoustic features such as spectrum [8], autocorrelation [9], power in the band-limited region [10], and higher-order statistics [11] have been proposed. In contrast with the use of frame-level features, Ramirez et al. [12] proposed the use of a long-term spectral divergence feature to discriminate speech from noise. That feature requires average noise spectrum magnitude information, which is not accurately available in practice. The discriminative power of the proposed long-term spectral flatness measure (LSFM) feature will be verified by examining the distributions of LSFM values for speech and non-speech in terms of their misclassification rate for various noises.
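To make the idea of a flatness measure taken along time frames more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the feature is formed, for each frequency bin, as the log ratio of the geometric mean to the arithmetic mean of the power spectrum over the last R frames, summed over bins. This summary does not state that formula, so it should be read as an assumption. The function name `lsfm`, the `floor` parameter, and the toy spectra are hypothetical, and the low-variance spectrum estimate and adaptive threshold mentioned in the abstract are not modeled.

```python
import math


def lsfm(power_frames, R, floor=1e-12):
    """Return one long-term flatness value per frame index m >= R - 1.

    power_frames: list of frames, each a list of power-spectrum values
                  (one value per frequency bin).
    R:            number of consecutive frames in the long-term window.
    floor:        hypothetical lower clamp that keeps log10 defined.

    Assumed definition (not stated in the source summary):
        sum over bins k of log10(GM_k / AM_k),
    where GM_k and AM_k are the geometric and arithmetic means of the
    power in bin k over the last R frames.
    """
    if R < 1 or len(power_frames) < R:
        raise ValueError("need at least R frames and R >= 1")
    n_bins = len(power_frames[0])
    values = []
    for m in range(R - 1, len(power_frames)):
        window = power_frames[m - R + 1 : m + 1]
        total = 0.0
        for k in range(n_bins):
            column = [max(frame[k], floor) for frame in window]
            # log10 of the geometric mean is the mean of the log10 values.
            log_gm = sum(math.log10(p) for p in column) / R
            log_am = math.log10(sum(column) / R)
            total += log_gm - log_am
        values.append(total)
    return values


# Toy input 1: every bin keeps the same power in every frame.
stationary = [[1.0, 2.0, 4.0] for _ in range(6)]

# Toy input 2: power alternates strongly from frame to frame.
varying = [[1.0, 1.0, 1.0], [100.0, 100.0, 100.0]] * 3

print(lsfm(stationary, R=3))
print(lsfm(varying, R=2))
```

Under this assumed definition, the arithmetic mean is never smaller than the geometric mean, so the value is never positive. It is zero when the power in each bin is constant across the window and becomes more negative as the power varies more over time. A detector would then compare the value against a threshold, which the abstract describes as adaptive.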

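[Figures omitted. Recoverable labels: axis "Normalized Count"; legend "Non speech"; section title "Computation of the LSFM feature"; panels "Machine gun", "Speech babble", "Speech shaped"; traces "VAD outputs" and "VAD Decision".]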
