Abstract

This paper examines the behavior of two energy-based voice activity detector (VAD) algorithms on noisy input signals. Both detectors use time-domain methods to find speech boundaries. Short-time energy (STE) and/or zero-crossing rate (ZCR) features of the speech signal, both computed in the time domain, are used to evaluate the performance of the methods. In the first stage of both algorithms, STE features are calculated for each speech segment. Energy ratios and threshold values are then used to detect voice activity in the signal. The decision threshold is calculated from the average STE of an initial silence period. The effectiveness of the selected methods is tested on clean speech samples and on noisy speech samples at different SNR levels. The results indicate that both methods maintain reasonable accuracy down to an SNR of nearly 0 dB, with performance decreasing slowly as the SNR falls. Below 0 dB SNR, however, both methods lose their effectiveness.
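The STE-and-threshold stage described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the frame length, the number of initial silence frames, the threshold multiplier `ratio`, and the function names are all assumptions chosen for illustration, and the input is a synthetic signal rather than recorded speech.

```python
import math
import random


def frame_signal(samples, frame_len):
    """Split samples into non-overlapping frames, dropping any short tail."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def energy_vad(samples, frame_len=160, n_silence_frames=10, ratio=4.0):
    """Label each frame True (voice-active) or False (inactive).

    Assumption: the first n_silence_frames frames contain only background
    noise. The threshold is `ratio` times their average STE; `ratio` is a
    hypothetical tuning parameter, not a value taken from the paper.
    """
    energies = [short_time_energy(f) for f in frame_signal(samples, frame_len)]
    if len(energies) <= n_silence_frames:
        raise ValueError("signal is too short for the assumed silence period")
    noise_floor = sum(energies[:n_silence_frames]) / n_silence_frames
    threshold = ratio * noise_floor
    return [e > threshold for e in energies]


# Synthetic demo at 8 kHz: 0.2 s of low-level noise, then 0.2 s of a
# 200 Hz tone (a stand-in for voiced speech) with the same noise added.
rng = random.Random(0)
fs = 8000
noise = [rng.gauss(0.0, 0.01) for _ in range(1600)]
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) + rng.gauss(0.0, 0.01)
        for n in range(1600)]
decisions = energy_vad(noise + tone)
```

With these settings the signal splits into 20 frames of 20 ms each; the ten noise-only frames fall below the threshold and the ten tone frames exceed it. As the noise level approaches the signal level, the gap between the two groups narrows, which is consistent with the loss of effectiveness reported at low SNR.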

Highlights

  • Digital speech processing applications try to separate voice-active speech periods from inactive ones to minimize processing time

  • This paper examines the behavior of two different energy-based voice activity detector (VAD) algorithms for noisy input signals

  • The methods have been tested at different SNR levels of background noise and on a large set of input test data


Summary

Introduction

Digital speech processing applications try to separate voice-active speech periods from inactive (silence) ones to minimize processing time. For time-domain VAD algorithms, the amplitude of the speech in a frame is an important parameter for classifying the frame as voice-active or inactive. Although unvoiced and plosive sounds have lower amplitude than voiced sounds, they carry important information, especially for detecting the beginning and end points of an utterance. For unvoiced speech at the beginning or end of an utterance, the energy is close to the silence energy, but there is a sharp increase in the zero-crossing rate (ZCR). It is difficult for a VAD algorithm to locate the boundaries accurately for utterances that begin or end with weak fricatives (/f, th, h/), voiced fricatives becoming devoiced, or weak plosive bursts (/p, t, k/); that end with nasal sounds (/n, m/); or in which a final voiced sound becomes unvoiced, such as the final /i/ in "three" (/th-r-i/) or "binary" (/b-al-n-e-r-i/) [1,2,3].
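The zero-crossing rate mentioned above can be computed per frame as the fraction of adjacent sample pairs whose signs differ. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the two test frames are pure tones standing in for a voiced sound (low frequency) and a fricative-like sound (high frequency), not real speech.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (0.0 to 1.0)."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def tone(freq_hz, n_samples, fs=8000, phase=0.5):
    """Synthetic sine frame; the phase offset avoids samples exactly at zero."""
    return [math.sin(2 * math.pi * freq_hz * n / fs + phase)
            for n in range(n_samples)]


# Assumed stand-ins: 100 Hz for a voiced sound, 3000 Hz for a
# fricative-like sound, each one 20 ms frame at 8 kHz.
zcr_low = zero_crossing_rate(tone(100, 160))
zcr_high = zero_crossing_rate(tone(3000, 160))
```

Here `zcr_high` is much larger than `zcr_low`, which is the property a detector can exploit: a frame whose energy is close to the silence level but whose ZCR is high is a candidate for unvoiced speech rather than silence.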

