Abstract

This paper concerns the problem of automatic speech recognition in noise-intense and adverse environments. The main goal of the proposed work is the definition, implementation, and evaluation of a novel noise robust speech signal parameterization algorithm. The proposed procedure is based on time-frequency speech signal representation using wavelet packet decomposition. A new modified soft thresholding algorithm based on time-frequency adaptive threshold determination was developed to efficiently reduce the level of additive noise in the input noisy speech signal. A two-stage Gaussian mixture model (GMM)-based classifier was developed to perform speech/nonspeech as well as voiced/unvoiced classification. An adaptive topology of the wavelet packet decomposition tree based on voiced/unvoiced detection was introduced to analyze voiced and unvoiced segments of the speech signal separately. The main feature vector consists of a combination of log-root compressed wavelet packet parameters and autoregressive parameters. The final output feature vector is produced using a two-stage feature vector postprocessing procedure. In the experimental framework, the noisy speech databases Aurora 2 and Aurora 3 were used together with the corresponding standardized acoustical model training/testing procedures. The automatic speech recognition performance achieved using the proposed noise robust speech parameterization procedure was compared to the standardized mel-frequency cepstral coefficient (MFCC) feature extraction procedures ETSI ES 201 108 and ETSI ES 202 050.

Highlights

  • Automatic speech recognition (ASR) systems have become indispensable integral parts of modern multimodal man-machine communication dialog applications such as voice-driven service portals, speech interfaces in automotive navigational and guidance systems, or speech-driven applications in modern offices [1]

  • The automatic speech recognition performance of the WPDAM was evaluated by a comparison with the standard baseline mel-frequency cepstral coefficient (MFCC) front ends, which were determined by the Aurora distributed speech recognition (DSR) group [29, 30, 35]

  • This article presents a novel noise robust speech parameterization procedure, WPDAM, based on wavelet packet decomposition

Summary

INTRODUCTION

Automatic speech recognition (ASR) systems have become indispensable integral parts of modern multimodal man-machine communication dialog applications such as voice-driven service portals, speech interfaces in automotive navigational and guidance systems, or speech-driven applications in modern offices [1]. Methods for robust speech recognition fall into two groups: the first comprises noise robust speech parameterization techniques, and the second consists of acoustical model compensation approaches. In both cases, the methods focus on minimizing the acoustical mismatch between training and testing (recognition) environments. The proposed noise robust front-end procedure produces solutions for all four noise robust speech parameterization issues mentioned above and should achieve better automatic speech recognition performance in comparison with the standardized mel-frequency cepstral coefficient (MFCC) feature extraction procedure [5, 6]. One of the objectives of the proposed noise robust speech parameterization procedure is the development of a computationally efficient, improved alternative: a denoising algorithm based on a modified soft thresholding strategy that applies a time-frequency adaptive threshold and adaptive thresholding strength.
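To make the denoising idea concrete, the following is a minimal sketch of standard soft thresholding applied to wavelet packet coefficients with a separate threshold per frequency node. It is not the modified algorithm of the paper, whose threshold and thresholding-strength rules are not given in this summary. The function name `soft_threshold`, the coefficient values, and the per-node thresholds are all assumptions chosen for illustration.

```python
import numpy as np


def soft_threshold(coeffs, thresholds):
    """Standard soft thresholding (not the paper's modified variant).

    Each coefficient is shrunk toward zero by its threshold; coefficients
    whose magnitude is below the threshold are set to zero.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    # Broadcasting lets the threshold vary per node, per time position, or both.
    thresholds = np.broadcast_to(np.asarray(thresholds, dtype=float), coeffs.shape)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thresholds, 0.0)


# Hypothetical wavelet packet coefficients for one frame:
# rows are frequency nodes (subbands), columns are time positions.
coeffs = np.array([[0.9, -0.2, 0.05],
                   [-1.5, 0.4, -0.1]])

# Hypothetical thresholds, one per frequency node (assumed values).
node_thresholds = np.array([[0.1],
                            [0.5]])

denoised = soft_threshold(coeffs, node_thresholds)
print(denoised)
```

In this sketch the threshold array stands in for a time-frequency adaptive threshold: passing an array with the same shape as `coeffs` would let the threshold change with both frequency node and time position.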

DEFINITION OF PROPOSED ALGORITHM WPDAM
INPUT SPEECH SIGNAL PREPROCESSING PROCEDURE
WPD-BASED SPEECH SIGNAL DENOISING PROCEDURE
Definition of the WPD applied in the proposed denoising procedure
The definition of proposed time-frequency adaptive threshold
Modified soft thresholding algorithm
Feature vector definitions for speech activity and voicing detection
Statistical classifier for speech activity and voicing detection
THE ADAPTIVE TOPOLOGY OF THE WAVELET PACKET DECOMPOSITION TREE
WPD-BASED SPEECH PARAMETERS
The combined root-log compression characteristics
PRIMARY FEATURE VECTOR BASED ON JOINT WPD AND AUTOREGRESSIVE MODELING
FEATURE VECTOR POSTPROCESSING PROCEDURE
EXPERIMENTAL FRAMEWORK AND RESULTS
Separate evaluation of particular WPDAM processing steps
WPDAM Aurora 3 performance evaluation
WPDAM Aurora 2 performance evaluation
WPDAM computational complexity and real-time deployment feasibility
CONCLUSION