Conventional monaural speech enhancement methods usually enhance the magnitude spectrum of noisy speech and leave the phase unchanged. Recent studies suggest that phase is also important for both speech intelligibility and perceptual quality. Although deep learning exhibits great potential for enhancing the magnitude and phase spectra in the complex spectrogram and waveform domains, the complex spectrogram and the waveform are generally more difficult to predict than the magnitude spectrum because they lack clear structure. In this study, a Mel-domain denoising autoencoder and a deep generative vocoder are stacked to form a joint framework for monaural speech enhancement, in which the clean speech waveform is reconstructed without using the noisy phase. Specifically, a convolutional recurrent network (CRN) is employed as the denoising autoencoder to enhance the Mel power spectrum of noisy speech. The enhanced Mel power spectrum is then fed to a deep generative vocoder to synthesize the speech waveform. Furthermore, the denoising autoencoder and the generative vocoder are jointly fine-tuned. Experimental results show that the proposed method significantly improves speech intelligibility and perceptual quality. More importantly, it generalizes much better to noises unseen during training than previous methods.
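To make the two-stage pipeline concrete, below is a minimal PyTorch-style sketch of the inference path: a Mel power spectrum is computed from the noisy waveform, denoised by a simplified CRN, and handed to a vocoder that synthesizes the waveform. The layer sizes, the 16 kHz / 80-band Mel front end, and the placeholder vocoder interface are illustrative assumptions, not the authors' exact configuration; a real system would use a trained deep generative vocoder and a full CRN with skip connections in place of these stand-ins.

```python
import torch
import torch.nn as nn
import torchaudio


class CRNDenoiser(nn.Module):
    """Simplified CRN: 2-D conv encoder over (mel, time), LSTM bottleneck, conv decoder.
    Sizes are illustrative assumptions, not the paper's configuration."""

    def __init__(self, n_mels=80, channels=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ELU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ELU(),
        )
        self.rnn = nn.LSTM(channels * n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, channels * n_mels)
        self.dec = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ELU(),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, log_mel):                       # (batch, n_mels, frames)
        x = self.enc(log_mel.unsqueeze(1))            # (batch, C, n_mels, frames)
        b, c, m, t = x.shape
        seq = x.permute(0, 3, 1, 2).reshape(b, t, c * m)   # one feature vector per frame
        seq, _ = self.rnn(seq)                        # temporal modeling
        x = self.proj(seq).reshape(b, t, c, m).permute(0, 2, 3, 1)
        return self.dec(x).squeeze(1)                 # enhanced log-Mel power spectrum


# Assumed front end: 80-band Mel power spectrum at 16 kHz with 10 ms hop.
mel_frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, hop_length=160, n_mels=80, power=2.0)


def enhance(noisy_wav, denoiser, vocoder):
    """noisy_wav: (batch, samples); vocoder maps an enhanced log-Mel spectrum to a waveform."""
    log_mel = torch.log(mel_frontend(noisy_wav) + 1e-8)   # (batch, n_mels, frames)
    enhanced = denoiser(log_mel)
    return vocoder(enhanced)


if __name__ == "__main__":
    denoiser = CRNDenoiser()
    # Stand-in for the deep generative vocoder (not shown here); returns silence of the right length.
    dummy_vocoder = lambda m: torch.zeros(m.shape[0], m.shape[-1] * 160)
    out = enhance(torch.randn(1, 16000), denoiser, dummy_vocoder)
    print(out.shape)
```

In this sketch the denoiser and the vocoder are separate modules composed in `enhance`, which mirrors the paper's structure: the two stages can be pretrained independently and then fine-tuned jointly by backpropagating a waveform-domain loss through both.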