Abstract

Bone-conducted (BC) speech captures speech signals from the vibrations of a speaker's skull. It is therefore unaffected by environmental noise sources and is more noise-resistant than air-conducted (AC) speech. Although the quality and intelligibility of BC speech are degraded by the nature of solid vibration, BC speech can serve as an auxiliary source that, used jointly with AC speech, improves speech enhancement performance. In this paper, we propose an end-to-end multi-modal model for time-domain speech enhancement at low signal-to-noise ratios. The model takes both noisy AC speech and synchronized BC speech as input. It adopts an encoder-decoder architecture in which an involution network estimates a mask for the clean speech component, and the mask is then applied to remove the noise component. We compare the proposed method with several state-of-the-art multi-modal and single-modal methods on an air- and bone-conducted multi-modal corpus. Experimental results demonstrate that the proposed approach outperforms the comparison methods in terms of the quality and intelligibility of the enhanced speech. When applied to speech recognition, the enhanced speech significantly reduces the error rate.
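The mask-based encoder-decoder pipeline described above can be illustrated with a minimal NumPy sketch. This is not the proposed model. The random linear bases, the use of separate encoders for AC and BC speech, the window and hop sizes, and the `toy_mask` function (a hand-written stand-in for the involution network) are all assumptions made for illustration. The sketch shows only the data flow: encode both inputs, estimate a mask, apply it to the AC representation, and decode back to a waveform.

```python
import numpy as np

def frame(x, win, hop):
    """Split a 1-D signal into overlapping frames of shape (n_frames, win)."""
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def overlap_add(frames, hop):
    """Sum overlapping frames back into a 1-D signal."""
    n_frames, win = frames.shape
    out = np.zeros(hop * (n_frames - 1) + win)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + win] += f
    return out

def toy_mask(ac_feat, bc_feat):
    """Hypothetical mask estimator (NOT an involution network).

    Gives larger mask values where the BC features are strong relative to
    the AC/BC mismatch, squashed into (0, 1) with a sigmoid.
    """
    score = np.abs(bc_feat) - np.abs(ac_feat - bc_feat)
    return 1.0 / (1.0 + np.exp(-score))

def enhance(ac_noisy, bc, enc_ac, enc_bc, dec, mask_fn=toy_mask, win=16, hop=8):
    """Encode both inputs, mask the AC representation, decode to a waveform."""
    ac_feat = frame(ac_noisy, win, hop) @ enc_ac   # (n_frames, n_basis)
    bc_feat = frame(bc, win, hop) @ enc_bc         # (n_frames, n_basis)
    mask = mask_fn(ac_feat, bc_feat)               # values in (0, 1)
    est_frames = (mask * ac_feat) @ dec            # (n_frames, win)
    return overlap_add(est_frames, hop)

# Synthetic example: a tone in heavy noise plus a crude stand-in for BC speech.
rng = np.random.default_rng(0)
win, hop, n_basis = 16, 8, 32
clean = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)
ac_noisy = clean + rng.normal(scale=1.0, size=clean.shape)  # low-SNR AC input
bc = 0.5 * clean                                            # assumed BC signal

# Untrained random bases; in a real model these would be learned end to end.
enc_ac = rng.normal(size=(win, n_basis))
enc_bc = rng.normal(size=(win, n_basis))
dec = rng.normal(size=(n_basis, win))

enhanced = enhance(ac_noisy, bc, enc_ac, enc_bc, dec, win=win, hop=hop)
print(enhanced.shape)
```

Because the bases here are random and untrained, the output is not actually denoised. In an end-to-end model, the encoder, mask estimator, and decoder would be trained jointly against clean speech.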
