Abstract
This paper presents a novel computationally efficient voice activity detection (VAD) algorithm and emphasizes the importance of such algorithms in distributed speech recognition (DSR) systems. When using VAD algorithms in telecommunication systems, the required capacity of the speech transmission channel can be reduced if only the speech parts of the signal are transmitted. A similar objective can be adopted in DSR systems, where the nonspeech parameters are not sent over the transmission channel. A novel approach is proposed for VAD decisions based on mel-filter bank (MFB) outputs with the so-called Hangover criterion. Comparative tests are presented between the proposed MFB VAD algorithm and three VAD algorithms used in the G.729, G.723.1, and DSR (advanced front-end) Standards. These tests were made on the Aurora 2 database, with different signal-to-noise ratios (SNRs). In the speech recognition tests, the proposed MFB VAD outperformed all three VAD algorithms used in the standards by 14.19% relative (G.723.1 VAD), by 12.84% relative (G.729 VAD), and by 4.17% relative (DSR VAD) over all SNRs.
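The idea of an MFB-based VAD with a hangover criterion can be illustrated with a minimal sketch: each frame's mel-filter-bank energy is compared against a tracked noise floor, and a hangover counter keeps a few trailing frames labeled as speech so that weak word endings are not clipped. All parameter values (threshold, hangover length, filter count) below are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Build a simple triangular mel filter bank (illustrative design)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def mfb_vad(signal, sr=8000, frame_len=200, hop=80,
            threshold_db=3.0, hangover=8):
    """Frame-level VAD from mel-filter-bank energies with a hangover counter.
    threshold_db and hangover are assumed example values, not from the paper."""
    fb = mel_filterbank(23, frame_len, sr)
    noise_est = None
    hang = 0
    decisions = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spec = np.abs(np.fft.rfft(frame)) ** 2
        e = 10.0 * np.log10(np.sum(fb @ spec) + 1e-12)  # total MFB energy (dB)
        if noise_est is None:
            noise_est = e                  # initialize noise floor from first frame
        speech = e > noise_est + threshold_db
        if speech:
            hang = hangover                # reset hangover on detected speech
        elif hang > 0:
            hang -= 1                      # hangover: extend the speech region
            speech = True
        else:
            noise_est = 0.98 * noise_est + 0.02 * e  # slowly track noise floor
        decisions.append(speech)
    return decisions
```

Without the hangover counter, low-energy word endings would be cut off as soon as the frame energy dips below the threshold; the counter trades a small amount of extra transmitted data for fewer clipped utterances.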
Highlights
Voice activity detection (VAD) is an algorithm that is able to distinguish speech from noise and is an integral part of a variety of speech communication systems, such as speech recognition, speech coding, hands-free telephony, and audio conferencing
Comparative tests are presented between the proposed mel-filter bank (MFB) voice activity detection (VAD) algorithm and three VAD algorithms used in the G.729, G.723.1, and distributed speech recognition (DSR) Standards
Summary
Voice activity detection (VAD) is an algorithm that is able to distinguish speech from noise and is an integral part of a variety of speech communication systems, such as speech recognition, speech coding, hands-free telephony, and audio conferencing. An input signal often contains many nonspeech parts, which can reduce the speech recognition performance of automatic speech recognition (ASR) systems. This is especially true when the ASR system operates under adverse conditions. Most recognizers deployed in practice have to work with a much lower SNR (typically between 25 dB and 15 dB, and as low as 5 dB). Under such conditions, it becomes very difficult to detect weak fricatives, weak nasals, and low-amplitude voiced sounds.
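The SNR figures quoted above follow the standard definition, the ratio of speech power to noise power expressed in decibels. A minimal sketch:

```python
import numpy as np

def snr_db(speech, noise):
    """SNR in decibels: 10 * log10(P_speech / P_noise),
    where P is the mean squared amplitude of each signal."""
    p_speech = np.mean(np.asarray(speech, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_speech / p_noise)
```

For example, a speech signal with ten times the amplitude of the background noise has 100 times its power, i.e. an SNR of 20 dB, which is squarely in the 25 dB to 15 dB range that practical recognizers must handle.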