Abstract
In this article, a novel voice activity detection (VAD) approach based on phoneme recognition using a Gaussian mixture model based hidden Markov model (HMM/GMM) is proposed. Speech features such as high order statistics (HOS), harmonic structure information and Mel-frequency cepstral coefficients (MFCCs) are employed to represent each speech/non-speech segment. The main idea of the method is to treat non-speech as an additional phoneme alongside the conventional phonemes of Mandarin; all of them are then trained under the maximum likelihood principle with the Baum-Welch algorithm using the HMM/GMM model. The Viterbi decoding algorithm is finally used to search for the maximum likelihood state sequence of the observed signals. The proposed method shows higher speech/non-speech detection accuracy over a wide range of SNR regimes compared with some existing VAD methods. We also propose a different method to demonstrate that conventional speech enhancement relying only on accurate VAD is not effective enough for automatic speech recognition (ASR) at low SNR regimes.
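The decoding step can be illustrated with a minimal log-domain Viterbi sketch. It is not the authors' implementation: the two-state layout (0 = non-speech, 1 = speech) and every probability below are hypothetical values chosen for illustration, whereas the article uses a full phoneme set with non-speech as one extra model.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path for one utterance.

    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log transition probabilities (row = from, column = to)
    log_emit:  (T, S) per-frame log-likelihood of each state
    """
    T, S = log_emit.shape
    score = np.empty((T, S))            # best log score ending in each state
    back = np.zeros((T, S), dtype=int)  # best predecessor of each state
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        # cand[i, j]: best path ending in state i at t-1, then moving to j
        cand = score[t - 1][:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(S)] + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(score[-1]))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Hypothetical toy model: state 0 = non-speech, state 1 = speech.
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1],
                    [0.1, 0.9]])
log_emit = np.log([[0.80, 0.20],
                   [0.70, 0.30],
                   [0.45, 0.55],
                   [0.20, 0.80],
                   [0.10, 0.90]])
labels = viterbi(log_init, log_trans, log_emit)
```

Because the transition probabilities favour staying in the same state, the decoded labels change only when the evidence across several frames supports it, rather than flipping on every ambiguous frame.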
Highlights
Voice activity detection (VAD), which is a scheme to detect the presence of speech in the observed signals automatically, plays an important role in speech signal processing [1,2,3,4]
Our experiments show a higher detection accuracy compared with the existing VAD methods on the same Microsoft Research Asia (MSRA) Mandarin speech corpus
The feature parameters for the hybrid Hidden Markov Model (HMM)/Gaussian Mixture Model (GMM) based VAD are extracted with a 20 ms frame length and a 10 ms frame shift, and are composed of 13th order harmonic structure information features, 1st order skewness, 1st order kurtosis, and 12th order log-Mel spectra with energy and its Δ, leading to an HMM set with 5 states
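As a rough sketch of the framing and the high order statistics mentioned above, the snippet below cuts a signal into 20 ms frames with a 10 ms shift and computes per-frame skewness and kurtosis. The function names, the 16 kHz sampling rate used in the usage lines, and the choice of excess kurtosis (zero for a Gaussian) are assumptions made for illustration; the source does not state them. The harmonic structure and log-Mel features are not covered here.

```python
import numpy as np

def frame_signal(x, sample_rate, frame_ms=20.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping frames of shape (n_frames, frame_len)."""
    x = np.asarray(x, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // shift
    # Row k holds samples [k * shift, k * shift + frame_len).
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return x[idx]

def skewness_kurtosis(frames, eps=1e-12):
    """Per-frame skewness and excess kurtosis from central moments."""
    centered = frames - frames.mean(axis=1, keepdims=True)
    m2 = np.mean(centered ** 2, axis=1)
    m3 = np.mean(centered ** 3, axis=1)
    m4 = np.mean(centered ** 4, axis=1)
    skew = m3 / (m2 ** 1.5 + eps)
    kurt = m4 / (m2 ** 2 + eps) - 3.0  # assumption: excess kurtosis
    return skew, kurt

# Usage on one second of synthetic noise at an assumed 16 kHz sampling rate.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
frames = frame_signal(signal, sample_rate=16000)
skew, kurt = skewness_kurtosis(frames)
hos_features = np.stack([skew, kurt], axis=1)  # shape (n_frames, 2)
```

At 16 kHz these settings give 320-sample frames that advance by 160 samples, so consecutive frames overlap by half.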
Summary
Voice activity detection (VAD), which is a scheme to detect the presence of speech in the observed signals automatically, plays an important role in speech signal processing [1,2,3,4]. Gorriz et al. [15] incorporated contextual information in a multiple observation likelihood ratio test (LRT) to overcome non-stationary noise. In these studies, the estimation error of the signal-to-noise ratio (SNR) seriously affects the accuracy of VAD. Fukuda et al. [11] used a large vocabulary with high order GMMs for discriminating non-speech from speech, which significantly improved the recognition rate of the ASR system. They are not suitable for some cases. To handle these problems, using the GMM based HMM recognizer for discriminating non-speech from speech can reduce the number of mixtures and can improve the accuracy of VAD without an experimental threshold.
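To make the threshold-free idea concrete, here is a minimal sketch of a diagonal-covariance GMM log-likelihood, with a frame assigned to whichever model scores higher. The diagonal covariance, the one-dimensional features, and all parameter values are hypothetical; in the article the GMMs serve as HMM state emission densities trained with Baum-Welch, which this sketch does not reproduce.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Per-frame log p(x) under a diagonal-covariance GMM.

    x: (T, D) feature vectors; weights: (K,); means, variances: (K, D)
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    diff = x[:, None, :] - means[None, :, :]                           # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, K)
    weighted = np.log(weights) + log_comp
    # Log-sum-exp over the mixture components for numerical stability.
    peak = weighted.max(axis=1, keepdims=True)
    return (peak + np.log(np.exp(weighted - peak).sum(axis=1, keepdims=True)))[:, 0]

# Hypothetical 1-D models: non-speech centred at 0, speech with two modes.
nonspeech = dict(weights=[1.0], means=[[0.0]], variances=[[1.0]])
speech = dict(weights=[0.5, 0.5], means=[[3.0], [5.0]], variances=[[1.0], [1.0]])

frames = np.array([[0.2], [2.8], [4.9]])
ll_nonspeech = gmm_log_likelihood(frames, **nonspeech)
ll_speech = gmm_log_likelihood(frames, **speech)
is_speech = ll_speech > ll_nonspeech  # model comparison, no tuned threshold
```

The decision comes from comparing the two model likelihoods directly, so no SNR-dependent threshold has to be chosen by experiment.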