Abstract

This paper presents a unified speech enhancement system that removes both background noise and interfering speech in severely noisy environments by jointly using a parabolic reflector model and a neural beamformer. First, the amplification property of the paraboloid, which significantly improves the Signal-to-Noise Ratio (SNR) of a desired signal, is discussed. An appropriate paraboloid channel is then analyzed and designed using the boundary element method. Next, a time-frequency masking approach and a mask-based beamforming approach are discussed and incorporated into the enhancement system. Notably, the signals provided by the paraboloid and the beamformer are complementary. Finally, these signals are combined in a learning-based fusion framework to further improve system performance in low-SNR environments. Experiments demonstrate that the system is effective and robust under five noise conditions (speech corrupted by factory, pink, destroyer engine, Volvo, and babble noise) and at different noise levels. Compared with the original noisy speech, the average improvements in objective metrics are about ΔSTOI = 0.28, ΔPESQ = 1.31, and ΔfwSegSNR = 11.9.

Highlights

  • Perceived quality and intelligibility of speech signals are degraded by pervasive noise

  • Traditional beamforming methods require a priori knowledge of the Direction of Arrival (DoA) or the transfer functions from an acoustic source to microphones [4]

  • ‘∗’ denotes convolution, and t indexes a time sample. y_i(t) denotes the signal at microphone i, and s_j(t) denotes the jth source signal. h_{i,j}(t) is the Room Impulse Response (RIR), which models sound propagation from source j to microphone i
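
The equation that the last highlight annotates is not reproduced on this page. Assuming it is the standard convolutive mixing model, y_i(t) = Σ_j h_{i,j}(t) ∗ s_j(t), the symbols can be illustrated with the minimal sketch below. The function names, the toy signals, and the omission of any additive noise term are assumptions made for illustration, not the authors' implementation.

```python
def convolve(h, s):
    """Full discrete linear convolution (h * s)(t) of two finite sequences."""
    out = [0.0] * (len(h) + len(s) - 1)
    for k, hk in enumerate(h):
        for n, sn in enumerate(s):
            out[k + n] += hk * sn
    return out


def mix_at_microphones(sources, rirs):
    """Return y_i(t) = sum_j h_{i,j}(t) * s_j(t) for every microphone i.

    sources: list of J source signals s_j(t), each a list of floats.
    rirs:    rirs[i][j] is the impulse response h_{i,j}(t) from source j
             to microphone i (hypothetical toy values here, not measured RIRs).
    """
    n_out = max(
        len(h_ij) + len(sources[j]) - 1
        for row in rirs
        for j, h_ij in enumerate(row)
    )
    mics = []
    for row in rirs:
        y_i = [0.0] * n_out
        for j, h_ij in enumerate(row):
            # Add the contribution of source j as heard at microphone i.
            for t, value in enumerate(convolve(h_ij, sources[j])):
                y_i[t] += value
        mics.append(y_i)
    return mics


# Toy example: two sources, two microphones, two-tap impulse responses.
s0 = [1.0, 0.0, 0.0, 0.0]
s1 = [0.0, 2.0, 0.0, 0.0]
rirs = [
    [[1.0, 0.5], [0.0, 1.0]],  # h_{0,0}, h_{0,1}
    [[0.0, 1.0], [1.0, 0.0]],  # h_{1,0}, h_{1,1}
]
y = mix_at_microphones([s0, s1], rirs)
```

In this toy case, microphone 0 receives s0 with a short echo plus a one-sample-delayed copy of s1, which shows how each h_{i,j}(t) shapes and delays a source before the contributions are summed at the microphone.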


Summary

Introduction

The perceived quality and intelligibility of speech signals are degraded by pervasive noise. This poses challenges for many applications, such as speech communication, hearing aids, and speech recognition. For these applications, speech enhancement is crucial for recovering the desired signal from noisy speech. Recent studies indicate that beamforming is beneficial for extracting a desired speech signal in noisy and reverberant environments, especially under high-level background noise [2,3]. Traditional beamforming methods require a priori knowledge of the Direction of Arrival (DoA) or of the transfer functions from an acoustic source to the microphones [4]. Based on the auditory masking effect, the time-frequency (T-F) masking technique applies a real-valued or binary mask to the signal's spectrum to filter out unwanted components, because the mask preserves speech-dominant

