Abstract

Time-frequency masking has emerged as a powerful technique for the source separation of noisy and convolved speech mixtures, and it has also been applied successfully to noisy speech recognition. But while significant SNR gains are possible with adequate masking functions, recognition performance suffers from the nonlinear operations involved, so that a greatly improved SNR often contrasts with only slight improvements in the recognition rate. To address this problem, marginalization techniques have been used for speech recognition, but they require recognition and source separation to be carried out in the same domain. However, source separation and denoising are typically carried out in the short-time Fourier transform (STFT) domain, whereas the most useful speech recognition features are, e.g., mel-frequency cepstral coefficients (MFCCs), LPC-cepstral coefficients, and VQ features. In these cases, marginalization techniques are not directly applicable. Here, another approach is suggested, which estimates sufficient statistics for speech features in the preprocessing (e.g., STFT) domain, propagates these statistics through the transforms from the spectrum to, e.g., the MFCCs of a speech recognition system, and uses the estimated statistics for missing-data speech recognition. With this approach, significant gains can be achieved in speech recognition rates; in this context, time-frequency masking yields recognition-rate improvements of more than 35% compared to TF-masking-based source separation.
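The abstract only outlines the propagation step, so the following is a minimal sketch of the general idea rather than the authors' exact method: per-bin means and variances of the masked power spectrum are pushed through the mel filterbank (linear), the log compression (here moment-matched under a log-normal assumption), and the DCT (linear again) to obtain MFCC-domain means and variances. All names (propagate_stats, mel_fb) are hypothetical, and the diagonal (independent-bin) variance propagation is a simplifying assumption.

```python
import numpy as np

def dct_matrix(n_ceps, n_mels):
    """Orthonormal DCT-II matrix mapping n_mels log-mel values to n_ceps cepstra."""
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_mels)[None, :]
    C = np.sqrt(2.0 / n_mels) * np.cos(np.pi * k * (2 * m + 1) / (2 * n_mels))
    C[0, :] /= np.sqrt(2.0)
    return C

def propagate_stats(mu_spec, var_spec, mel_fb, n_ceps=13):
    """
    Propagate per-bin mean/variance of the (masked) power spectrum
    to MFCC-domain means and variances.

    mu_spec, var_spec : (n_fft_bins,) moments of the spectral estimate
                        (e.g., high variance in masked/unreliable bins)
    mel_fb            : (n_mels, n_fft_bins) mel filterbank weights
    """
    # Mel filterbank is linear: exact mean propagation; the variance
    # formula assumes statistically independent spectral bins.
    mu_mel = mel_fb @ mu_spec
    var_mel = (mel_fb ** 2) @ var_spec

    # Log compression is nonlinear: moment-match under a log-normal
    # assumption (one common choice, assumed here for illustration).
    var_log = np.log1p(var_mel / np.maximum(mu_mel ** 2, 1e-12))
    mu_log = np.log(np.maximum(mu_mel, 1e-12)) - 0.5 * var_log

    # DCT is linear again (variance assumes independent mel bands).
    C = dct_matrix(n_ceps, mel_fb.shape[0])
    mu_mfcc = C @ mu_log
    var_mfcc = (C ** 2) @ var_log
    return mu_mfcc, var_mfcc
```

The resulting var_mfcc would then serve as the per-coefficient uncertainty consumed by a missing-data or uncertainty-decoding recognizer, in place of treating the denoised MFCCs as exact observations.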
