Abstract

This paper presents a novel, computationally efficient voice activity detection (VAD) algorithm and emphasizes the importance of such algorithms in distributed speech recognition (DSR) systems. When VAD algorithms are used in telecommunication systems, the required capacity of the speech transmission channel can be reduced by transmitting only the speech parts of the signal. A similar objective can be adopted in DSR systems, where the nonspeech parameters are not sent over the transmission channel. A novel approach is proposed for VAD decisions based on mel-filter bank (MFB) outputs combined with a so-called hangover criterion. Comparative tests are presented between the proposed MFB VAD algorithm and three VAD algorithms used in the G.729, G.723.1, and DSR (advanced front-end) standards. These tests were performed on the Aurora 2 database at different signal-to-noise ratios (SNRs). In the speech recognition tests, the proposed MFB VAD outperformed all three standard VAD algorithms across all SNRs: by 14.19% relative over the G.723.1 VAD, by 12.84% relative over the G.729 VAD, and by 4.17% relative over the DSR VAD.

Highlights

  • Voice activity detection (VAD) is an algorithm that is able to distinguish speech from noise and is an integral part of a variety of speech communication systems, such as speech recognition, speech coding, hands-free telephony, and audio conferencing

  • Comparative tests are presented between the proposed mel-filter bank (MFB) voice activity detection (VAD) algorithm and three VAD algorithms used in the G.729, G.723.1, and distributed speech recognition (DSR) standards



Introduction

Voice activity detection (VAD) is an algorithm that distinguishes speech from noise and is an integral part of a variety of speech communication systems, such as speech recognition, speech coding, hands-free telephony, and audio conferencing. An input signal often contains many nonspeech parts, which can reduce the speech recognition performance of automatic speech recognition (ASR) systems. This is especially true when the ASR system operates under adverse conditions. Most practical recognizers have to work at much lower SNRs (typically between 25 dB and 15 dB, and as low as 5 dB). Under such conditions, it becomes very difficult to detect weak fricatives, weak nasals, and low-amplitude voiced sounds.
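To make the hangover idea concrete, the sketch below shows a minimal frame-level VAD that thresholds a per-frame feature and extends each detected speech run by a few trailing frames, so that weak word endings (such as the faint fricatives mentioned above) are not clipped. This is an illustrative simplification, not the paper's algorithm: the paper bases decisions on mel-filter bank outputs, whereas `frame_log_energy`, the threshold value, and the hangover length here are all hypothetical choices.

```python
# Hedged sketch of a VAD with a hangover criterion.
# NOT the paper's MFB VAD: the feature here is plain log energy,
# and all parameter values are illustrative assumptions.
import math

def frame_log_energy(frame):
    """Log energy of one frame of samples (hypothetical feature;
    the paper uses mel-filter bank outputs instead)."""
    return math.log10(sum(x * x for x in frame) + 1e-12)

def vad_with_hangover(frames, threshold, hangover=3):
    """Return one speech/nonspeech flag per frame.

    threshold : log-energy level above which a frame counts as speech
    hangover  : number of trailing frames still flagged as speech
                after the last frame that exceeded the threshold
    """
    decisions = []
    countdown = 0
    for frame in frames:
        if frame_log_energy(frame) > threshold:
            countdown = hangover      # active frame: reset the hangover window
            decisions.append(True)
        elif countdown > 0:
            countdown -= 1            # inside the hangover window: keep "speech"
            decisions.append(True)
        else:
            decisions.append(False)   # outside the window: nonspeech
    return decisions

# Tiny demo: two loud frames followed by silence; the hangover keeps
# the decision at "speech" for three extra frames.
frames = [[0.5] * 160, [0.4] * 160] + [[0.0] * 160] * 6
print(vad_with_hangover(frames, threshold=-1.0))
# → [True, True, True, True, True, False, False, False]
```

The hangover window is what lets such a detector bridge brief low-energy stretches inside an utterance instead of cutting the segment off at the first quiet frame.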

