A novel voice activity detection algorithm using modified global thresholding

R Johny Elton, P Vasuki, J Mohanalin

doi:10.1007/s10772-020-09777-w

Abstract

Voice activity detection (VAD) remains a challenging task in real-time applications such as speech coding and speech recognition. The difficulty arises because a low signal-to-noise ratio (SNR) degrades the structural properties of the speech signal. VAD locates the speech regions in a signal corrupted by various non-stationary noises. The literature on VAD suggests that many existing methods classify in an unbalanced way, with high speech detection rates but poor non-speech detection rates, so that the majority of noisy segments are categorized as speech. To overcome this issue, we propose a novel modified global thresholding scheme that incorporates a fuzzy entropy tool. Our proposal can effectively identify both regions by locating the transitions from non-speech to speech areas and vice versa. This improves the detection rates because the misclassification of noisy segments as speech segments is minimized. The performance of the proposed algorithm is tested on various additive non-stationary noises at different SNR levels. Most existing research assumes that the noise is stationary over a particular interval in order to estimate the noise information, but this assumption does not hold in real scenarios. Our significant contribution is an algorithm that handles signals containing non-stationary noises and various complex events, which can be a mixture of different noises. Because the characteristics of speech themselves vary over time (speech is non-stationary), detection becomes more challenging when speech is additively mixed with non-stationary noises, especially at low SNR levels (−5 dB, −10 dB). The problem therefore becomes as complicated as it is in a real-time scenario. Our proposed method achieves speech and non-speech detection rates of 91.98% and 87.38%, respectively, at low SNR levels. It also obtains an accuracy of 93.39% for speech babble noise, whereas the state-of-the-art algorithms range only between 50% and 80%. Similarly, the rate of noise detected as speech (NDS) for the proposed algorithm is minimal, at less than 10%, whereas the benchmark algorithms had at least 50% of the noise detected as speech segments. The significance of our invention lies in precisely locating where speech begins and ends in a given noisy speech signal. We believe that we have produced a path-breaking approach that can be helpful in real-time speech processing applications.
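The abstract reports four frame-level scores: speech detection rate, non-speech detection rate, NDS rate, and accuracy. The sketch below shows one common way such scores could be computed from reference and predicted frame labels. The function name `detection_rates`, the 0/1 label encoding, and the toy label sequences are illustrative assumptions rather than the authors' evaluation code, and the paper's exact definitions may differ.

```python
def detection_rates(reference, predicted):
    """Frame-level VAD scores in percent.

    Assumed encoding (hypothetical, not from the paper):
    1 = speech frame, 0 = non-speech frame.
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")

    speech_total = sum(1 for r in reference if r == 1)
    nonspeech_total = len(reference) - speech_total
    if speech_total == 0 or nonspeech_total == 0:
        raise ValueError("reference must contain both classes")

    pairs = list(zip(reference, predicted))
    # Speech frames correctly labelled as speech.
    speech_hits = sum(1 for r, p in pairs if r == 1 and p == 1)
    # Non-speech frames correctly labelled as non-speech.
    nonspeech_hits = sum(1 for r, p in pairs if r != 1 and p != 1)

    return {
        "speech_rate": 100.0 * speech_hits / speech_total,
        "nonspeech_rate": 100.0 * nonspeech_hits / nonspeech_total,
        # Noise detected as speech: the complement of the non-speech rate.
        "nds_rate": 100.0 * (nonspeech_total - nonspeech_hits) / nonspeech_total,
        "accuracy": 100.0 * (speech_hits + nonspeech_hits) / len(reference),
    }


# Toy labels for illustration only (not data from the paper).
reference = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predicted = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
rates = detection_rates(reference, predicted)
print(rates)
```

Under these definitions the NDS rate and the non-speech detection rate always sum to 100%, which is why a detector biased toward labelling frames as speech shows a high speech rate alongside a high NDS rate.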

Full Text