Abstract

In this paper, we study the time-domain neural network approach for speech bandwidth extension. We propose a network architecture, named multi-scale fusion neural network (MfNet), that gradually restores the low-frequency signal and predicts the high-frequency signal through the exchange of information across different scale representations. We propose a training scheme to optimize the network with a combination of perceptual loss and time-domain adversarial loss. Experiments show the proposed multi-scale fusion network consistently outperforms the competing methods in terms of perceptual evaluation of speech quality (PESQ), signal to distortion rate (SDR), signal to noise ratio (SNR), log-spectral distance (LSD) and word error rate (WER). More promisingly, the multi-scale fusion network requires only 10% of the parameters of the time-domain reference baseline.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call