Abstract
Robustness against background noise is a major research area for speech-related applications such as speech recognition and speaker recognition. One of the many solutions to this problem is to detect speech-dominant regions with a voice activity detector (VAD). In this paper, a second-order polynomial regression-based algorithm is proposed that serves a function similar to a VAD for text-independent speaker verification systems. The proposed method aims to separate steady noise/silence regions, steady speech regions, and speech onset/offset regions. The regression is applied independently to each filter band of the mel spectrum, so the algorithm fits seamlessly into the conventional extraction process of the mel-frequency cepstral coefficients (MFCCs). The k-means algorithm is also applied to estimate the average noise energy in each band for spectral subtraction. A pseudo SNR-dependent linear thresholding for the final VAD output decision is introduced based on the k-means energy centers; this thresholding takes the speech presence in each band into account. Conventional VADs usually neglect the deteriorative effects of additive noise in the speech regions. In contrast, the proposed method decides not only on speech presence but also on whether a frame is dominated by speech or by noise. The performance of the proposed algorithm is compared with that of a continuous noise tracking method and another VAD method in speaker verification experiments, where five different noise types at five different SNR levels were considered. The proposed algorithm showed superior verification performance both with the conventional GMM-UBM method and with the state-of-the-art i-vector method.
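The following is a minimal sketch, not the authors' implementation, of the two core ideas summarized above: fitting a second-order polynomial to each mel-band energy trajectory over a short frame window, and estimating a per-band noise-energy center with two-class k-means to drive a pseudo SNR-dependent linear threshold. The function names, window length, and threshold constants are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def band_polynomial_features(log_mel_band, win=9):
    """Fit a 2nd-order polynomial to each sliding window of one mel band.

    Returns the coefficients (a2, a1, a0) per frame: near-zero slope and
    curvature suggest a steady region (noise or steady speech), while large
    coefficients suggest speech onsets/offsets.
    """
    half = win // 2
    padded = np.pad(log_mel_band, half, mode="edge")
    t = np.arange(win) - half                        # centred time axis
    coeffs = np.empty((len(log_mel_band), 3))
    for i in range(len(log_mel_band)):
        coeffs[i] = np.polyfit(t, padded[i:i + win], deg=2)
    return coeffs                                    # columns: a2, a1, a0


def band_noise_center(band_energy, iters=20):
    """Estimate the noise-energy center of one band with 2-class k-means."""
    centers = np.array([band_energy.min(), band_energy.max()], dtype=float)
    for _ in range(iters):
        labels = np.abs(band_energy[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = band_energy[labels == k].mean()
    return centers.min()                             # lower center ~ noise floor


def speech_dominance_mask(mel_energy, alpha=1.0, beta=3.0):
    """Pseudo SNR-dependent linear threshold applied per band, then pooled.

    mel_energy: (num_frames, num_bands) mel filterbank energies.
    A frame is marked speech-dominated if a majority of bands exceed a
    threshold that grows linearly with that band's estimated noise center.
    """
    num_frames, num_bands = mel_energy.shape
    band_votes = np.zeros((num_frames, num_bands), dtype=bool)
    for b in range(num_bands):
        noise_c = band_noise_center(mel_energy[:, b])
        threshold = alpha * noise_c + beta           # linear in the noise center
        band_votes[:, b] = mel_energy[:, b] > threshold
    return band_votes.mean(axis=1) > 0.5
```

In this sketch the k-means step is a plain two-class Lloyd iteration on scalar band energies; the lower cluster center would play the role of the per-band noise estimate used for spectral subtraction, and the constants alpha and beta stand in for the paper's SNR-dependent thresholding, which would be tuned rather than fixed as here.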
Highlights
Automatic speaker recognition systems' performances have greatly improved over the last two decades, especially with the introduction of modeling methods such as the universal background model (UBM) [1] and i-vectors [2]
Instead of the mel-frequency cepstral coefficients (MFCCs), other types of features have been proposed by researchers to increase the robustness of the recognizers [4,5,6,7,8]
Since the MFCCs are widely adopted, many researchers have made efforts to improve their robustness under noise by modifying or replacing some processes in the conventional scheme [9,10,11,12,13]
Summary
Automatic speaker recognition systems' performances have greatly improved over the last two decades, especially with the introduction of modeling methods such as the universal background model (UBM) [1] and i-vectors [2]. Mel-frequency cepstral coefficients (MFCCs) [3] are extensively preferred by researchers in speaker and speech recognition systems, but their performance degrades in the presence of background noise. Many different techniques have been developed to overcome this issue, such as using a voice activity detector (VAD), extracting robust features, and enhancing the speech. Instead of the MFCCs, other types of features have been proposed by researchers to increase the robustness of the recognizers [4,5,6,7,8]. Since the MFCCs are widely adopted, many researchers have made efforts to improve their robustness under noise by modifying or replacing some processes in the conventional scheme [9,10,11,12,13]. Interested readers may refer to [14] for recent progress in feature extraction techniques for robust speaker recognition.