Abstract
Generative adversarial networks (GANs) have shown their superiority for speech enhancement. Nevertheless, most previous attempts use convolutional layers as the backbone, which may obscure long-range dependencies across an input sequence because of the convolution operator's local receptive field. One popular solution is to substitute recurrent neural networks (RNNs) for convolutional neural networks, but RNNs are computationally inefficient because their temporal iterations cannot be parallelized. To circumvent this limitation, we propose an end-to-end system for speech enhancement that applies the self-attention mechanism to GANs. We aim for a system that is flexible in modeling both long-range and local interactions while remaining computationally efficient. Our work proceeds in three phases: first, we apply the stand-alone self-attention layer in speech enhancement GANs. Second, we employ locality modeling on the stand-alone self-attention layer. Last, we investigate the functionality of self-attention-augmented convolutional speech enhancement GANs. Systematic experimental results indicate that, equipped with the stand-alone self-attention layer, the system outperforms the baseline systems across classic evaluation criteria with up to 95% fewer parameters. Moreover, locality modeling can serve as a parameter-free approach for further performance improvement, and self-attention augmentation also surpasses all baseline systems with an acceptable increase in parameters.
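As a concrete illustration of the parameter-free locality modeling mentioned above, the following minimal sketch restricts each time step's attention to a fixed local window by applying a band mask to the attention scores before the softmax. The function name local_attention, the window size, and the (batch, time, channels) tensor layout are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def local_attention(q, k, v, window=16):
    # q, k, v: (batch, time, channels); window: half-width of the local band (assumed value)
    scores = torch.bmm(q, k.transpose(1, 2))          # pairwise similarities, (batch, time, time)
    t = scores.size(1)
    idx = torch.arange(t, device=scores.device)
    # mask out positions farther than `window` steps away; no trainable parameters are added
    mask = (idx[None, :] - idx[:, None]).abs() > window
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)               # each row sums to 1 inside its local band
    return torch.bmm(weights, v)                      # locally aggregated values, (batch, time, channels)

Because the mask is fixed rather than learned, this variant adds no parameters on top of the attention layer it modifies, which matches the "parameter-free" framing above.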
Highlights
Speech enhancement aims to improve speech intelligibility and quality in adverse environments by transforming interfered speech into its original clean version [1]
This paper presents a series of speech enhancement generative adversarial networks (SEGANs) equipped with a self-attention mechanism in three ways: first, we deploy the stand-alone self-attention layer in a SEGAN
A maximum of three layers of the SEGAN are equipped with the self-attention mechanism each time: one convolutional layer of the encoder, one deconvolutional layer of the decoder, and one convolutional layer of the discriminator. While prior work experimented with the performance of SASEGAN-all, i.e., coupling self-attention layers to all convolutional layers, we ask whether there are better-optimized coupling combinations. For example, can coupling the self-attention mechanism to the 10th and 11th convolutional layers outperform SASEGAN-all with even fewer parameters? In addition, inspired by [32, 33], we explore the feasibility of substituting stand-alone self-attention layers for convolutional layers entirely, namely a SEGAN with stand-alone self-attention layers, as sketched below
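The self-attention layer referred to above can be pictured as a SAGAN-style non-local block adapted to 1-D feature maps and attached to the output of a (de)convolutional layer. The sketch below is one plausible minimal implementation under that assumption; the class name SelfAttention1d, the channel-reduction factor, and the learnable gamma gate are illustrative choices, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention1d(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # 1x1 convolutions project the feature map into query/key/value spaces
        self.query = nn.Conv1d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv1d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        # learnable gate initialized to zero so the block starts as an identity mapping
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):                                  # x: (batch, channels, time)
        q = self.query(x).transpose(1, 2)                  # (batch, time, channels // reduction)
        k = self.key(x)                                    # (batch, channels // reduction, time)
        attn = F.softmax(torch.bmm(q, k), dim=-1)          # attention over all time steps
        v = self.value(x)                                  # (batch, channels, time)
        out = torch.bmm(v, attn.transpose(1, 2))           # aggregate values across the sequence
        return self.gamma * out + x                        # residual connection to the conv output

In a SEGAN-like encoder or decoder, such a block could simply be inserted after a chosen (de)convolutional layer, e.g. features = SelfAttention1d(channels)(conv_out), which is one plausible reading of "coupling" a self-attention layer to a convolutional layer.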
Summary
Speech enhancement aims to improve speech intelligibility and quality in adverse environments by transforming interfered speech into its original clean version [1]. Speech enhancement can serve as a front end for downstream speech-related tasks, e.g., speech recognition [2], speaker identification [3], and speech emotion recognition [4]. It is applied successfully in communication systems, e.g., hearing aids [5] and cochlear implants [6]. Most previous attempts use convolutional layers as the backbone, which limits the network's ability to capture long-range dependencies due to the convolution operator's local receptive field. To remedy this issue, one popular solution is to substitute RNNs for CNNs, but RNNs are computationally inefficient because their temporal iterations cannot be parallelized