Abstract

Despite recent progress on neural network architectures for speech separation, the balance among model size, model complexity, and model performance remains an important and challenging problem for deploying such models on low-resource platforms. In this paper, we propose two simple modules, group communication and context codec, that can be easily applied to a wide range of architectures to jointly decrease the model size and complexity without sacrificing performance. A group communication module splits a high-dimensional feature into groups of low-dimensional features and captures the inter-group dependency. A separation module with a significantly smaller model size can then be shared by all the groups. A context codec module, containing a context encoder and a context decoder, is designed as a learnable downsampling and upsampling pair that shortens the sequential feature processed by the separation module. The combination of the group communication and context codec modules is referred to as the GC3 design. Experimental results show that applying GC3 to multiple speech separation architectures achieves on-par or better performance with as little as 2.5% of the model size and 17.6% of the model complexity.
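As a rough illustration of the context codec idea, the sketch below pairs a learnable downsampler with a matching upsampler so that the separation module only processes a shortened sequence. The paper's codec is built from recurrent layers; the strided 1-D convolutions, feature width, and context size used here are purely illustrative assumptions.

```python
# Minimal sketch of the context codec idea: a learnable downsampler shortens the
# sequence before the (expensive) separation module, and a matching upsampler
# restores the original length afterwards. Strided 1-D convolutions are used here
# only as a stand-in for the paper's RNN-based codec; all shapes and
# hyperparameters (feature_dim, context_size) are illustrative assumptions.
import torch
import torch.nn as nn


class ContextCodec(nn.Module):
    def __init__(self, feature_dim=64, context_size=4):
        super().__init__()
        # Encoder: compress every `context_size` frames into one summary frame.
        self.encoder = nn.Conv1d(feature_dim, feature_dim,
                                 kernel_size=context_size, stride=context_size)
        # Decoder: expand each summary frame back into `context_size` frames.
        self.decoder = nn.ConvTranspose1d(feature_dim, feature_dim,
                                          kernel_size=context_size, stride=context_size)

    def encode(self, x):          # x: (batch, feature_dim, num_frames)
        return self.encoder(x)    # -> (batch, feature_dim, num_frames // context_size)

    def decode(self, z):
        return self.decoder(z)    # -> (batch, feature_dim, num_frames)


# Usage: the separation module only sees the shorter sequence.
codec = ContextCodec()
mixture_feature = torch.randn(1, 64, 400)      # 400 frames
compressed = codec.encode(mixture_feature)     # 100 frames
# ... run the separation module on `compressed` ...
restored = codec.decode(compressed)            # back to 400 frames
```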

Highlights

  • Recent developments in neural network architectures have significantly advanced the state of the art in source separation performance

  • We introduce a context codec module that helps GroupComm maintain performance while further decreasing the number of MAC operations, accelerating training, and reducing memory consumption at both training and inference time

  • A residual bidirectional long short-term memory (BLSTM) layer, identical to the one used in the context codec, was selected for the GroupComm module in [35], and we present experimental results comparing the original dual-path RNN (DPRNN) time-domain audio separation network (TasNet), the GroupComm-equipped DPRNN-TasNet, and the GC3-equipped DPRNN-TasNet; a minimal sketch of such a residual BLSTM block follows below
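The following is a hedged sketch of a residual BLSTM block of the kind the highlight refers to: a bidirectional LSTM whose output is projected back to the input width and added to the input. The layer sizes and the use of layer normalization are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of a residual BLSTM block: BLSTM -> projection back to the input
# width -> residual addition -> normalization. Dimensions are illustrative.
import torch
import torch.nn as nn


class ResidualBLSTM(nn.Module):
    def __init__(self, input_dim=16, hidden_dim=32):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.proj = nn.Linear(hidden_dim * 2, input_dim)   # back to input width
        self.norm = nn.LayerNorm(input_dim)

    def forward(self, x):                   # x: (batch, seq_len, input_dim)
        y, _ = self.blstm(x)                # (batch, seq_len, 2 * hidden_dim)
        return self.norm(x + self.proj(y))  # residual connection


# Usage: keeps the sequence length and feature width unchanged.
block = ResidualBLSTM()
out = block(torch.randn(2, 100, 16))        # -> (2, 100, 16)
```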

Summary

INTRODUCTION

Recent developments in neural network architectures have significantly advanced the state of the art in source separation performance. GroupComm splits a high-dimensional feature, such as a spectrum, into groups of low-dimensional features, such as subband spectra, and uses the same separation model across all the groups for weight sharing. Another inter-group module is applied to capture the dependencies across the groups, so that the processing of each group always has access to the global information. The low-dimensional features allow a smaller module, e.g., a CNN or RNN layer, than the original high-dimensional feature would require, and together with weight sharing the total model size can be significantly reduced.
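A minimal sketch of the GroupComm idea described above: the feature is split into K groups, an inter-group module (here a small BLSTM running across the group axis) shares information between groups, and one small module is reused by every group. The specific module choices and dimensions are assumptions for illustration only.

```python
# Minimal GroupComm sketch: split an N-dimensional feature into K groups of N/K
# dimensions, mix information across groups with an inter-group BLSTM, then apply
# one shared small module to every group (weight sharing). Sizes are illustrative.
import torch
import torch.nn as nn


class GroupComm(nn.Module):
    def __init__(self, feature_dim=128, num_groups=8):
        super().__init__()
        assert feature_dim % num_groups == 0
        self.num_groups = num_groups
        self.group_dim = feature_dim // num_groups
        # Inter-group module: captures dependency across the K groups.
        self.inter_group = nn.LSTM(self.group_dim, self.group_dim // 2,
                                   batch_first=True, bidirectional=True)
        # Shared per-group module: one small network reused by every group.
        self.shared = nn.Linear(self.group_dim, self.group_dim)

    def forward(self, x):                      # x: (batch, frames, feature_dim)
        b, t, _ = x.shape
        g = x.view(b, t, self.num_groups, self.group_dim)
        # Run the inter-group BLSTM over the group axis, independently per frame.
        g = g.reshape(b * t, self.num_groups, self.group_dim)
        g, _ = self.inter_group(g)             # (b*t, K, group_dim)
        # Apply the same small module to every group (weight sharing).
        g = self.shared(g)
        return g.reshape(b, t, -1)             # back to (batch, frames, feature_dim)


out = GroupComm()(torch.randn(2, 100, 128))    # -> (2, 100, 128)
```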

Standard Pipeline for Speech Separation
Group Communication
GroupComm With Context Codec
Discussions
Data Simulation
Model Configurations
Training Configurations
Evaluation Metrics
RESULTS AND ANALYSIS
Experimental Results on GC3-DPRNN
Effect of Model Architectures for GroupComm
Effect of Overlap Between Groups
CONCLUSION AND FUTURE WORKS
