Abstract

As automatic speech recognition (ASR) evolves, the deployment of voice user interfaces (VUIs) has expanded rapidly. Since the COVID-19 pandemic in particular, VUIs have gained attention in online communication owing to their contact-free operation. However, VUIs remain difficult to deploy in public settings because the received audio signals are degraded by various ambient noises. In this paper, we propose Wavoice, the first noise-resistant multi-modal speech recognition system that fuses two distinct voice sensing modalities: millimeter-wave (mmWave) signals and audio signals from a microphone. A key contribution is modeling the inherent correlation between mmWave and audio signals, which enables Wavoice to perform real-time, noise-resistant voice activity detection and to target a specific user among multiple speakers. Additionally, we design two novel multi-modal fusion modules embedded in the neural network, yielding accurate speech recognition. Extensive experiments demonstrate the effectiveness of Wavoice under adverse conditions, achieving a character error rate (CER) below 1% at distances of up to 7 meters. In terms of robustness and accuracy, Wavoice considerably outperforms existing audio-only speech recognition methods, with lower CER and word error rate (WER).
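The abstract does not detail the internals of the two fusion modules. As a rough illustration only, the sketch below shows one common way to fuse per-frame features from two sensing modalities with cross-attention, where the noise-robust mmWave stream re-weights potentially corrupted audio features; the module name, dimensions, and design are hypothetical, not the paper's actual architecture.

```python
# Hypothetical cross-modal fusion sketch (PyTorch); the paper's two fusion
# modules are not specified in the abstract, so this is an assumed design.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Audio frames attend to mmWave frames, letting the noise-robust
        # mmWave stream re-weight corrupted audio features.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, mmwave_feats):
        # audio_feats, mmwave_feats: (batch, time, dim)
        fused, _ = self.attn(query=audio_feats, key=mmwave_feats,
                             value=mmwave_feats)
        # Residual connection keeps the original audio evidence.
        return self.norm(audio_feats + fused)

# Usage: fuse per-frame embeddings from the two modalities before the
# downstream ASR decoder.
audio = torch.randn(2, 100, 256)   # microphone features
mmwave = torch.randn(2, 100, 256)  # mmWave vibration features
out = CrossModalFusion()(audio, mmwave)  # shape: (2, 100, 256)
```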
