The key to single-channel time-domain speech separation lies in encoding the mixed speech into latent feature representations with an encoder and accurately estimating the target speaker masks with a separation network. Although advanced separation networks help separate the target speech, the limitations of the time-domain encoder–decoder framework mean that these models commonly improve separation performance by using a small encoder convolution kernel to lengthen the encoded sequence, which increases the model's computational complexity and training cost. Therefore, in this paper, we propose an efficient time-domain speech separation model using a short-sequence encoder–decoder framework (ESEDNet). In this model, we construct a novel encoder–decoder framework that accommodates short encoded sequences: the encoder consists of multiple convolution and downsampling operations that reduce the length of the high-resolution sequence, while the decoder uses the encoded features to reconstruct the fine-detailed speech sequence of the target speaker. Because the encoder's output sequence is shorter, ESEDNet, combined with our proposed multi-temporal resolution Transformer separation network (MTRFormer), efficiently obtains separation masks for the short encoded feature sequence. Experiments show that, compared with previous state-of-the-art (SOTA) methods, ESEDNet is more efficient in computational complexity, training speed, and GPU memory usage while maintaining competitive separation performance.
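A minimal PyTorch sketch of the short-sequence encoder–decoder idea described above is given below. All layer counts, kernel sizes, strides, and channel widths are illustrative assumptions (the abstract does not specify them), and a plain `nn.TransformerEncoder` stands in for the proposed MTRFormer, which is not detailed here.

```python
import torch
import torch.nn as nn


class ShortSeqEncoder(nn.Module):
    """Stacked strided convolutions: each stage halves the sequence length,
    so the encoded sequence is far shorter than one produced by a single
    small-kernel, small-stride convolution."""

    def __init__(self, channels=64, stages=3):
        super().__init__()
        layers = [nn.Conv1d(1, channels, kernel_size=8, stride=2, padding=3), nn.ReLU()]
        for _ in range(stages - 1):
            layers += [nn.Conv1d(channels, channels, kernel_size=8, stride=2, padding=3), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, mix):               # mix: (batch, 1, time)
        return self.net(mix)              # (batch, channels, time / 2**stages)


class ShortSeqDecoder(nn.Module):
    """Mirror of the encoder: transposed convolutions upsample the masked
    features back to a time-domain waveform."""

    def __init__(self, channels=64, stages=3):
        super().__init__()
        layers = []
        for _ in range(stages - 1):
            layers += [nn.ConvTranspose1d(channels, channels, kernel_size=8, stride=2, padding=3), nn.ReLU()]
        layers += [nn.ConvTranspose1d(channels, 1, kernel_size=8, stride=2, padding=3)]
        self.net = nn.Sequential(*layers)

    def forward(self, feats):             # feats: (batch, channels, short_time)
        return self.net(feats)            # (batch, 1, time)


class ESEDNetSketch(nn.Module):
    """Encode -> separate on the short sequence -> mask -> decode per speaker."""

    def __init__(self, channels=64, stages=3, num_speakers=2):
        super().__init__()
        self.num_speakers = num_speakers
        self.encoder = ShortSeqEncoder(channels, stages)
        # Stand-in for MTRFormer: a plain Transformer over the short sequence.
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=4, batch_first=True)
        self.separator = nn.TransformerEncoder(layer, num_layers=2)
        self.mask = nn.Conv1d(channels, channels * num_speakers, kernel_size=1)
        self.decoder = ShortSeqDecoder(channels, stages)

    def forward(self, mix):                              # mix: (batch, 1, time)
        feats = self.encoder(mix)                        # (batch, C, T')
        sep = self.separator(feats.transpose(1, 2)).transpose(1, 2)
        masks = torch.sigmoid(self.mask(sep))            # (batch, C * S, T')
        masks = masks.view(mix.size(0), self.num_speakers, -1, masks.size(-1))
        # Apply each speaker's mask to the encoded features, then decode.
        outs = [self.decoder(feats * masks[:, s]) for s in range(self.num_speakers)]
        return torch.cat(outs, dim=1)                    # (batch, S, time)


model = ESEDNetSketch()
mix = torch.randn(2, 1, 16000)   # two 1-second mixtures at 16 kHz
est = model(mix)                 # (2, 2, 16000): two estimated speaker waveforms
```

The point of the sketch is the efficiency argument: the Transformer's quadratic-in-length attention runs over a sequence downsampled by a factor of 2 per stage (8x here), rather than over the long sequence a small-kernel encoder would produce.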