V-Speech

Hong Lu,Jonathan Huang,Lama Nachman,Willem Marco Beltman,Paulo Lopez-Meyer,Héctor A Cordourier Maruri

doi:10.1145/3287058

Abstract

Smart glasses are often used in public environments or industrial scenarios that are relatively noisy. Background noise and sound from competing speakers deteriorate voice communication or performance of automatic speech recognition (ASR). Typically, signal processing techniques are used to reduce noise and enhance voice quality, but they have limitations in performance, hardware and/or computing resources. Voice capturing techniques using bone conducting on the head have been proposed in some experimental and commercial devices, with good robustness against environmental noise, but limited by signal distortions inherent to the capturing method. We present V-Speech, a novel sensing and signal processing solution that enables speech recognition and human-to-human communication in very noisy environments. It captures the voice signal with a vibration sensor located in the nasal pads of smart glasses and performs a transformation to the sensor signal in order to mimic that of a regular microphone in low noise conditions. The signal transformation is key, as it eliminates the "nasal distortion" that is introduced for nasal phonemes in the speech induced vibrations of the nasal bone. The output of V-Speech has low noise, sounds natural, and can be used in voice communication or as input to an off-the-shelf ASR service. We evaluated V-Speech in noise-free and noisy conditions with 30 volunteer speakers uttering 145 phrases and validated its performance on ASR engines and with assessments of voice quality using the Perceptual Evaluation of Speech Quality (PESQ) metric. The results show in extreme noise conditions a mean improvement of 50% for Word Error Rate (WER), and 1.0 on a scale of 5.0 for PESQ. In addition, real life recordings were made under various representative noise conditions, some with sound pressure levels of 93 dBA, which require hearing protection. Subjective listening tests were conducted according to a modified ITU P.835 approach to determine intelligibility, naturalness and overall quality. Under these extreme conditions, where V-Speech achieved 30 dB SNR, subjective results show the speech is intelligible, and the naturalness of the speech is rated as fair to good. This enables clear voice communication in challenging work environments, for example in places with industrial, factory, mining and construction noise. With our proposed smart switching technique between a regular microphone signal and V-Speech, the optimal quality can be maintained from low to high noise conditions.

Full Text