Abstract

Voice activity detection (VAD) aims to detect the presence of speech in a given input signal, and is often the first step in voice-based applications such as speech communication systems. In the context of personal devices, own voice detection (OVD) is a sub-task of VAD: it targets speech detection of the person wearing the device, while ignoring other speakers and interference signals. This article first summarizes recent single- and multi-microphone, multi-sensor, and hearing-aid-related VAD techniques. Then, a wearable in-ear device equipped with multiple microphones and an accelerometer is investigated for the OVD task using a neural network with input embedding and long short-term memory (LSTM) layers. The device picks up the user's speech signal through air as well as vibrations through the body. However, besides external sounds, the device is also sensitive to the user's own non-speech vocal noises (e.g., coughing, yawning) and to movement noise caused by physical activities. A signal mixing model is proposed to produce databases of noisy observations used for training and testing the frame-by-frame OVD method. The best model's performance is further studied in the presence of different recorded interference, and an ablation study reports the model's performance on subsets of sensors. The results show that the OVD approach is robust to both user motion and user-generated vocal non-speech sounds in the presence of loud external interference. The approach is suitable for real-time operation and achieves 90–96% OVD accuracy in challenging use scenarios with a short 10 ms processing frame length.

Highlights

  • This work investigates the problem of voice activity detection (VAD), which can be modeled as a state machine with two discrete states of "speech" and "no speech" [1]

  • The audio data was processed at 16 kHz in 10 ms Hann-windowed frames (Nw = 160 samples) with 5 ms overlap (No = 80 samples) between adjacent frames

  • The own voice detection (OVD) features were extracted with either B = {50, 100} mel bands for each channel’s amplitude values and phase-difference pairs, resulting in a feature vector of length F = {500, 1000}
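The framing and feature parameters listed in the highlights can be sketched as follows. This is a minimal single-channel illustration assuming a standard triangular mel filterbank and amplitude features only; the multi-channel phase-difference features and the authors' exact filterbank implementation are not reproduced here.

```python
import numpy as np

FS, NW, NO = 16000, 160, 80  # 16 kHz; 10 ms frames (Nw=160); 5 ms overlap (No=80)
HOP = NW - NO                # 80-sample hop between adjacent frames

def hann_frames(x):
    """Split a 1-D signal into Hann-windowed 10 ms frames with 5 ms overlap."""
    n = 1 + (len(x) - NW) // HOP
    win = np.hanning(NW)
    return np.stack([x[i * HOP: i * HOP + NW] * win for i in range(n)])

def mel_filterbank(b=50, n_fft=NW):
    """Triangular mel filterbank with b bands over 0..fs/2 (illustrative)."""
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz2mel(0.0), hz2mel(FS / 2), b + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mels) / FS).astype(int)
    fb = np.zeros((b, n_fft // 2 + 1))
    for i in range(b):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):          # rising slope of triangle i
            fb[i, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):          # falling slope of triangle i
            fb[i, k] = (hi - k) / max(hi - c, 1)
    return fb

x = np.random.randn(FS)                      # 1 s of test signal
frames = hann_frames(x)                      # shape (199, 160)
spec = np.abs(np.fft.rfft(frames, axis=1))   # amplitude spectra, (199, 81)
feats = spec @ mel_filterbank(50).T          # (199, 50) mel-band amplitudes
```

With B = 50 mel bands per channel, stacking the per-channel amplitude and phase-difference features across all sensors would yield the feature vector of length F = 500 mentioned above (F = 1000 for B = 100).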



Introduction

This work investigates the problem of voice activity detection (VAD), which can be modeled as a state machine with two discrete states of "speech" and "no speech" [1]. Applications such as telephony [2], keyword detection [3], automatic speech recognition [4], acoustic source localization and tracking [5], and speech enhancement [6] benefit from VAD. Some of these applications, e.g., telephone conversation [2], require real-time operation with low latency. Traditional VAD algorithms were developed to detect speech in close-talk telephony with relatively high SNR, which decreases with increasing speaker distance, reverberation, and the presence of external sounds [5]. In recent years, deep neural networks (DNNs) have achieved state-of-the-art VAD accuracy in challenging conditions.

