In real-world scenarios, dynamic ambient noise often degrades speech quality, highlighting the need for advanced speech enhancement techniques. Traditional methods, which rely on static embeddings as auxiliary features, struggle to address the complexities of varying noise conditions. To overcome this, we propose a Dual-stream Noise and Speech Information Perception (DNSIP) approach that dynamically detects and processes both noise and speech through innovative information extraction and suppression mechanisms. Initially, non-speech segments predominantly contain environmental noise, while speech segments carry information about the intended speaker. To handle this dynamic nature, real-time voice activity detection (VAD) is employed to accurately differentiate between speech and noise components. Building on VAD estimates, we propose an innovative information extraction framework that selectively extracts relevant noise and speech features from the noisy input, establishing a dual-stream network for concurrent noise and speech learning. To account for the temporal and spectral variability of noise and speech, a frequency-sequence attention mechanism is integrated, enhancing the model’s ability to learn contextual and spectral dependencies. Additionally, an information suppression module is introduced to minimize cross-stream interference by attenuating noise within the speech stream and suppressing speech content within the noise stream. The derived noise and speech spectrograms are then utilized to formulate a minimum mean square error log-spectral amplitude (MMSE-LSA) estimator for robust speech enhancement. Experimental evaluations on the WSJ0 and VCTK+DEMAND datasets demonstrate that our DNSIP approach surpasses existing state-of-the-art methods, underscoring its efficacy in challenging acoustic environments.
Read full abstract