Abstract

Bone-conducted (BC) speech captures the speech signal from the vibrations of the speaker's skull. It is therefore largely unaffected by environmental noise sources and exhibits better noise resistance than air-conducted (AC) speech. Although the quality and intelligibility of BC speech are degraded by the nature of solid-body vibration, BC speech can serve as an auxiliary source to jointly improve speech enhancement performance. In this paper, we propose an end-to-end multi-modal model for time-domain speech enhancement at low signal-to-noise ratios. The model takes both noisy AC speech and synchronized BC speech as input. It adopts an encoder-decoder architecture in which an involution network estimates the mask of the clean speech component, and the mask is then applied to suppress the noise component. We compared the proposed method with several state-of-the-art multi-modal and single-modal methods on an air- and bone-conducted multi-modal corpus. Experimental results demonstrate that the proposed approach outperforms the comparison methods in the quality and intelligibility of the enhanced speech. When the enhanced speech is used for speech recognition, it significantly reduces the recognition error rate.
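The abstract only outlines the architecture, so the sketch below is an illustrative reading of it rather than the paper's configuration: two 1-D convolutional encoders for the noisy AC and BC waveforms, a fusion layer, a small involution-based stack that estimates a sigmoid mask, and a transposed-convolution decoder. All layer names, kernel sizes, and channel counts here are assumptions; the 1-D involution layer is an adaptation of the standard involution operator to time-domain features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Involution1d(nn.Module):
    """1-D involution: a position-specific, channel-shared kernel is generated
    from the input at each time step and applied to its local neighbourhood.
    (Illustrative adaptation; not necessarily the paper's exact layer.)"""

    def __init__(self, channels, kernel_size=7, groups=4, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        self.kernel_gen = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, kernel_size * groups, 1),
        )

    def forward(self, x):                      # x: (B, C, T)
        b, c, t = x.shape
        kernels = self.kernel_gen(x).view(b, self.g, 1, self.k, t)
        pad = self.k // 2
        patches = F.pad(x, (pad, pad)).unfold(2, self.k, 1)          # (B, C, T, K)
        patches = patches.view(b, self.g, c // self.g, t, self.k).permute(0, 1, 2, 4, 3)
        # weight each neighbouring sample with the generated kernel and sum over it
        return (kernels * patches).sum(dim=3).reshape(b, c, t)


class MaskingEnhancer(nn.Module):
    """Minimal sketch of the masking scheme described in the abstract: encode
    AC and BC waveforms, fuse them, estimate a mask of the clean component,
    apply it to the AC representation, and decode back to a waveform.
    Sizes below are placeholder guesses."""

    def __init__(self, feat=64, win=16):
        super().__init__()
        stride = win // 2
        self.enc_ac = nn.Conv1d(1, feat, win, stride=stride, padding=win // 2)
        self.enc_bc = nn.Conv1d(1, feat, win, stride=stride, padding=win // 2)
        self.fuse = nn.Conv1d(2 * feat, feat, 1)
        self.mask_net = nn.Sequential(
            Involution1d(feat), nn.ReLU(inplace=True),
            Involution1d(feat), nn.Conv1d(feat, feat, 1), nn.Sigmoid(),
        )
        self.dec = nn.ConvTranspose1d(feat, 1, win, stride=stride, padding=win // 2)

    def forward(self, ac_wav, bc_wav):         # both: (B, 1, samples)
        ac = F.relu(self.enc_ac(ac_wav))
        bc = F.relu(self.enc_bc(bc_wav))
        mask = self.mask_net(self.fuse(torch.cat([ac, bc], dim=1)))
        return self.dec(ac * mask)             # mask keeps the clean speech component


# Example usage with one second of 16 kHz audio per modality (batch of 2).
model = MaskingEnhancer()
ac = torch.randn(2, 1, 16000)                  # noisy air-conducted speech
bc = torch.randn(2, 1, 16000)                  # synchronized bone-conducted speech
enhanced = model(ac, bc)                       # (2, 1, 16000) enhanced waveform
```

The key design point reflected here is that the mask is estimated from the fused AC/BC features but applied only to the AC encoding, so the noise-robust BC signal guides the suppression without its band-limited content appearing directly in the output.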
