Improving generative adversarial networks for speech enhancement through regularization of latent representations

Fan Yang, Ziteng Wang, Junfeng Li, Risheng Xia, Yonghong Yan

doi:10.1016/j.specom.2020.02.001

Abstract

Speech enhancement aims to improve the quality and intelligibility of speech signals, a challenging task in adverse environments. The speech enhancement generative adversarial network (SEGAN), which applies a generative adversarial network (GAN) to speech enhancement, has achieved promising results. In this paper, a new network architecture and loss function based on SEGAN are proposed for speech enhancement. Unlike most network structures applied in this field, the new network, called high-level GAN (HLGAN), takes parallel noisy and clean speech signals as input in the training phase instead of only noisy speech signals, which makes full use of the information carried by the clean speech signals. Additionally, we introduce a new supervised speech representation loss, also referred to as the high-level loss, at the middle hidden layer of the generative network. The high-level loss benefits HLGAN in low signal-to-noise ratio (SNR) environments and in low-resource environments. We evaluate HLGAN over a wide range of experiments, in which our model produces significant improvements. Extensive experiments further demonstrate the generality of our model across a variety of speech enhancement cases. In particular, the tendency of SEGAN to lose speech components while removing noise in low-SNR environments is mitigated. In addition, HLGAN can effectively enhance the speech signals of two low-resource languages simultaneously. The reasons for the superior performance of HLGAN are discussed.
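
As an illustration only, the idea of adding a supervised loss at the middle hidden layer can be sketched as an extra term in a SEGAN-style generator objective. The abstract does not give the exact formulation, so everything specific here is an assumption: the function names `l1_mean` and `generator_loss`, the weight `mu`, the mean-normalised L1 distances, and the choice to compare the latent representation of the noisy input with that of the parallel clean input. Only the least-squares adversarial term and the L1 waveform term follow the published SEGAN objective.

```python
def l1_mean(a, b):
    """Mean absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def generator_loss(d_score, enhanced, clean, latent_noisy, latent_clean,
                   lam=100.0, mu=1.0):
    """Hypothetical generator objective with an added latent (high-level) term.

    d_score      -- discriminator output for the enhanced signal
    enhanced     -- generator output waveform samples
    clean        -- parallel clean waveform samples
    latent_noisy -- middle-layer representation of the noisy input
    latent_clean -- middle-layer representation of the clean input
    lam          -- weight of the waveform term (100 in SEGAN)
    mu           -- weight of the latent term (assumed, not from the paper)
    """
    # Least-squares adversarial term, as in SEGAN.
    adversarial = 0.5 * (d_score - 1.0) ** 2
    # L1 distance between enhanced and clean waveforms (mean-normalised here).
    waveform = l1_mean(enhanced, clean)
    # Assumed form of the high-level loss: pull the noisy latent toward the clean one.
    high_level = l1_mean(latent_noisy, latent_clean)
    return adversarial + lam * waveform + mu * high_level
```

In this sketch the latent term vanishes only when the two middle-layer representations coincide, which is one way the clean signal could supervise the generator's hidden representation during training.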

Full Text