Abstract

To further improve the quality of enhanced speech, it is appealing to introduce deeper articulatory and auditory knowledge into the speech enhancement model. In particular, harmonics strongly affect speech timbre and play a crucial role in speech intelligibility. In the frequency domain, harmonics appear as local energy maxima, which can serve as anchors for recovering distorted speech. In this paper, an explicit modeling method, harmonic attention, is presented, which patches the damaged harmonics with the help of the residual ones. To maintain the spectral structure of speech during processing and to enable the network to support harmonic modeling, a harmonic attention-based progressive enhancement network (HAPNet) is used, which gradually approaches clean speech through stacked harmonic attention modules. In addition, to make the enhanced speech more consistent with human hearing, a loss function based on loudness power compression (LC-SNR) is used, which measures both magnitude and phase values with appropriate auditory effects. Experimental visualization indicates that harmonic attention can capture and recover the harmonics of speech. Objective evaluations show that the presented HAPNet and LC-SNR outperform the referenced methods. Furthermore, the presented model trained on 100 hours of data achieves results competitive with the referenced models trained on 3000+ hours of data, and a model trained on 500 hours of data yields state-of-the-art performance.
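
The observation that harmonics show up as local energy maxima in the frequency domain can be illustrated with a small sketch. This is not the paper's harmonic attention or any part of HAPNet; it is only a toy peak picker applied to a synthetic signal. The function name `harmonic_peak_bins`, the relative threshold of 0.05, the 200 Hz fundamental, and the 1/k harmonic amplitudes are all assumptions chosen for illustration.

```python
import numpy as np


def harmonic_peak_bins(magnitude, rel_threshold=0.05):
    """Return indices of local maxima in a magnitude spectrum.

    Hypothetical helper: a bin counts as a peak if it exceeds its left
    neighbour, is at least as large as its right neighbour, and is above
    rel_threshold times the global maximum (to ignore the noise floor).
    """
    mag = np.asarray(magnitude, dtype=float)
    inner = mag[1:-1]
    is_peak = (
        (inner > mag[:-2])
        & (inner >= mag[2:])
        & (inner > rel_threshold * mag.max())
    )
    return np.flatnonzero(is_peak) + 1  # +1 restores the original bin index


# Synthetic "voiced" signal: assumed fundamental of 200 Hz plus four overtones.
sample_rate = 8000                      # Hz
n = 8000                                # one second, so bins are 1 Hz apart
t = np.arange(n) / sample_rate
f0 = 200.0
signal = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 6))

# Magnitude spectrum and the frequency of each bin.
magnitude = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)

# Frequencies at which the spectrum has a local energy maximum.
peak_freqs = freqs[harmonic_peak_bins(magnitude)]
```

For this idealized signal the detected peaks fall at integer multiples of the fundamental. Real speech is noisy and non-stationary, so a practical system would work on short-time spectra and could not rely on a fixed threshold, which is part of the motivation for a learned mechanism such as the one the abstract describes.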
