Speech super-resolution aims to predict a high-resolution speech signal from its low-resolution counterpart. Previous models typically perform this task at a fixed sampling rate, reconstructing only the high-frequency spectrogram components and merging them with the low-frequency ones under the assumption of noise-free input. These methods achieve high accuracy but are less effective in real-world settings, where ambient noise and variable sampling rates are present. To develop a robust model suited to practical applications, in this work we introduce Super Denoise Net (SDNet), a neural network for noise-robust super-resolution with flexible input sampling rates. To this end, SDNet employs gated and lattice convolution blocks for enhanced signal repair and temporal-spectral information capture, frequency transform blocks to model long-range frequency dependencies, and a multi-scale discriminator that facilitates training with a multi-adversarial loss. Experiments show that SDNet outperforms current state-of-the-art noise-robust speech super-resolution models on multiple test sets, indicating its robustness and effectiveness in real-world scenarios.
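The abstract mentions gated convolution blocks as one of SDNet's components. As a rough illustration of the general gating idea (not the paper's actual implementation, whose details are not given here), a gated 1-D convolution runs two parallel convolutions and uses a sigmoid of one branch to modulate the other elementwise; the class and parameter names below are hypothetical:

```python
import torch
import torch.nn as nn


class GatedConvBlock(nn.Module):
    """Minimal sketch of a gated 1-D convolution block (illustrative only).

    One convolution branch produces candidate features; a second branch
    produces a sigmoid gate that decides, per element, how much of each
    feature passes through.
    """

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2  # "same" padding keeps the time length fixed
        self.feature = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.gate = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gating: sigmoid(gate) in (0, 1) scales the features.
        return self.feature(x) * torch.sigmoid(self.gate(x))


if __name__ == "__main__":
    x = torch.randn(2, 16, 128)  # (batch, channels, time frames)
    y = GatedConvBlock(16)(x)
    print(y.shape)  # shape is preserved: (2, 16, 128)
```

Because the gate is data-dependent, such blocks can suppress noisy regions while passing clean ones, which is one common motivation for gating in enhancement-style networks.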