In multimedia intelligent systems, speech enhancement is commonly employed to improve the quality of speech signals, making them clearer and more natural. Current deep learning-based speech enhancement models typically treat noise as a unified entity and aim to separate it from the target speech. In this paper, inspired by how the human brain perceives noisy speech spectrograms, we decompose the spectral energy of noise into regular and random components. We propose an auxiliary-model-based speech enhancement framework that better suppresses noise components closely resembling speech features. Firstly, we introduce a voiceprint segmentation network (VSnet) that partitions noisy speech into voiceprint and non-voiceprint regions. Subsequently, we present a noise reconstruction network (NRnet) that uses noise information from the non-voiceprint regions to reconstruct and suppress the regular noise components within the voiceprint regions. Finally, we pair a model dedicated to suppressing random components (RANnet) with a speech enhancement model (SEnet) and train them synchronously. By sharing encoder parameters, SEnet is compelled to extract fewer regular noise features from the original noisy speech, which in turn improves the quality of the speech generated by the decoder. Experimental results on the public VoiceBank-DEMAND and DNS Challenge 2020 datasets demonstrate that our approach achieves state-of-the-art performance.
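A minimal sketch of the shared-encoder arrangement described above, assuming a PyTorch-style implementation; the module names (SharedEncoder, Decoder), layer sizes, and L1 training losses are illustrative assumptions rather than details taken from the paper:

```python
# Sketch (not the authors' code) of joint SEnet/RANnet training with a shared encoder.
# Both decoders back-propagate through the same encoder, so features that only help
# reconstruct noise are discouraged, mirroring the parameter-sharing idea in the abstract.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, in_ch=1, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
    def forward(self, spec):  # spec: (B, 1, F, T) noisy spectrogram
        return self.net(spec)

class Decoder(nn.Module):
    def __init__(self, hidden=32, out_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, out_ch, kernel_size=3, padding=1),
        )
    def forward(self, feat):
        return self.net(feat)

encoder = SharedEncoder()
se_decoder = Decoder()    # SEnet head: predicts the enhanced spectrogram
ran_decoder = Decoder()   # RANnet head: predicts the random noise component

params = list(encoder.parameters()) + list(se_decoder.parameters()) + list(ran_decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
l1 = nn.L1Loss()

def train_step(noisy, clean, random_noise_target):
    """One synchronous update of SEnet and RANnet through the shared encoder."""
    feat = encoder(noisy)
    loss = l1(se_decoder(feat), clean) + l1(ran_decoder(feat), random_noise_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The single optimizer over both heads is what enforces the coupling: any encoder feature that benefits only the auxiliary noise branch is traded off against the enhancement loss, which is the intuition (under our assumptions) behind the constraint the abstract describes.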