Abstract

Nowadays, time-domain features are widely used in speech enhancement (SE) networks, as are frequency-domain features, to achieve excellent performance in removing noise from input utterances. This study primarily investigates how to extract information from time-domain utterances to create more effective features for SE. We extend our recent work by employing sub-signals that reside in multiple acoustic frequency bands in the time domain and integrating them into a unified time-domain feature set. The discrete wavelet transform (DWT) is applied to decompose each input frame signal into sub-band signals, and a projection fusion process is performed on these signals to create the final features. The fusion strategy is either bi-projection fusion (BPF) or multiple-projection fusion (MPF); in short, MPF replaces the sigmoid function used in BPF with the softmax function to create ratio masks for multiple feature sources. The concatenation of the fused DWT features and the time features serves as the encoder output of two celebrated SE frameworks, the fully convolutional time-domain audio separation network (Conv-TasNet) and the dual-path transformer network (DPTNet), which then estimate the mask and produce the enhanced time-domain utterances. Evaluation experiments are conducted on the VoiceBank-DEMAND and VoiceBank-QUT tasks, and the results show that the proposed method achieves higher speech quality and intelligibility than the original Conv-TasNet, which uses time features only, indicating that fusing DWT features derived from the input utterances helps time features learn a superior Conv-TasNet/DPTNet network for SE.
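
To make the fusion step concrete, below is a minimal PyTorch sketch of the MPF idea as described in the abstract: each feature source is projected, a softmax over the source axis yields ratio masks, and the masked sources are summed before being concatenated with the time features. All names (MultiProjectionFusion, feat_dim, n_sources) and layer choices here are illustrative assumptions, not the paper's implementation; with two sources and sigmoid gates in place of the softmax, the same scheme corresponds to the BPF variant.

```python
# Minimal sketch of multiple-projection fusion (MPF); hypothetical names and
# layer sizes, not the authors' code. In practice the sub-band streams would
# come from a DWT of each frame, e.g. pywt.dwt(frame, 'db1') -> (cA, cD),
# followed by a learned encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiProjectionFusion(nn.Module):
    """Fuse several sub-band feature streams with softmax ratio masks (MPF)."""

    def __init__(self, feat_dim: int, n_sources: int):
        super().__init__()
        # One linear projection per feature source (illustrative design choice).
        self.projections = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(n_sources)]
        )

    def forward(self, sources: torch.Tensor) -> torch.Tensor:
        # sources: (batch, n_sources, frames, feat_dim)
        logits = torch.stack(
            [proj(sources[:, i]) for i, proj in enumerate(self.projections)],
            dim=1,
        )
        # Softmax over the source axis yields ratio masks that sum to 1.
        masks = F.softmax(logits, dim=1)
        # Weighted sum over the sources gives the fused DWT feature.
        return (masks * sources).sum(dim=1)


# Toy usage: fuse 2 DWT sub-band streams, then concatenate with time features.
batch, frames, feat_dim = 4, 100, 64
subband_feats = torch.randn(batch, 2, frames, feat_dim)  # e.g. DWT cA/cD bands
time_feats = torch.randn(batch, frames, feat_dim)

fusion = MultiProjectionFusion(feat_dim, n_sources=2)
fused = fusion(subband_feats)                         # (batch, frames, feat_dim)
encoder_out = torch.cat([fused, time_feats], dim=-1)  # (batch, frames, 2*feat_dim)
print(encoder_out.shape)  # torch.Size([4, 100, 128])
```

One property of this formulation: because the softmax masks sum to one across sources, the fused feature is a convex combination of the sub-band streams, which is what allows the same mechanism to generalize BPF's two-source sigmoid gating to an arbitrary number of feature sources.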
