Abstract

Voice activity detection (VAD) aims to detect the presence of speech in a given input signal, and is often the first step in voice-based applications such as speech communication systems. In the context of personal devices, own voice detection (OVD) is a sub-task of VAD: it targets speech detection of the person wearing the device, while ignoring other speakers and interference signals. This article first summarizes recent single- and multi-microphone, multi-sensor, and hearing-aid-related VAD techniques. Then, a wearable in-ear device equipped with multiple microphones and an accelerometer is investigated for the OVD task using a neural network with input embedding and long short-term memory (LSTM) layers. The device picks up the user's speech signal through air as well as vibrations through the body. However, besides external sounds, the device is also sensitive to the user's own non-speech vocal noises (e.g., coughing, yawning) and to movement noise caused by physical activities. A signal mixing model is proposed to produce databases of noisy observations used for training and testing the frame-by-frame OVD method. The best model's performance is further studied in the presence of different recorded interference, and an ablation study reports the model's performance on subsets of sensors. The results show that the OVD approach is robust to both user motion and user-generated vocal non-speech sounds in the presence of loud external interference. The approach is suitable for real-time operation and achieves 90–96% OVD accuracy in challenging use scenarios with a short 10 ms processing frame length.

Highlights

  • This work investigates the problem of voice activity detection (VAD), which can be modeled as a state machine with two discrete states of "speech" and "no speech" [1]

  • The audio data was processed at 16 kHz in 10 ms Hann-windowed frames (Nw = 160 samples) with 5 ms overlap (No = 80 samples) between adjacent frames

  • The own voice detection (OVD) features were extracted with either B = {50, 100} mel bands for each channel’s amplitude values and phase-difference pairs, resulting in a feature vector of length F = {500, 1000}
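The framing and feature parameters listed in the highlights can be sketched as follows. This is a minimal single-channel illustration assuming a standard triangular mel filterbank and amplitude features only; the multi-channel phase-difference features and the authors' exact filterbank implementation are not reproduced here.

```python
import numpy as np

FS, NW, NO = 16000, 160, 80  # 16 kHz; 10 ms frames (Nw=160); 5 ms overlap (No=80)
HOP = NW - NO                # 80-sample hop between adjacent frames

def hann_frames(x):
    """Split a 1-D signal into Hann-windowed 10 ms frames with 5 ms overlap."""
    n = 1 + (len(x) - NW) // HOP
    win = np.hanning(NW)
    return np.stack([x[i * HOP: i * HOP + NW] * win for i in range(n)])

def mel_filterbank(b=50, n_fft=NW):
    """Triangular mel filterbank with b bands over 0..fs/2 (illustrative)."""
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz2mel(0.0), hz2mel(FS / 2), b + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mels) / FS).astype(int)
    fb = np.zeros((b, n_fft // 2 + 1))
    for i in range(b):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):          # rising slope of triangle i
            fb[i, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):          # falling slope of triangle i
            fb[i, k] = (hi - k) / max(hi - c, 1)
    return fb

x = np.random.randn(FS)                      # 1 s of test signal
frames = hann_frames(x)                      # shape (199, 160)
spec = np.abs(np.fft.rfft(frames, axis=1))   # amplitude spectra, (199, 81)
feats = spec @ mel_filterbank(50).T          # (199, 50) mel-band amplitudes
```

With B = 50 mel bands per channel, stacking the per-channel amplitude and phase-difference features across all sensors would yield the feature vector of length F = 500 mentioned above (F = 1000 for B = 100).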



Introduction

This work investigates the problem of voice activity detection (VAD), which can be modeled as a state machine with two discrete states of "speech" and "no speech" [1]. Applications such as telephony [2], keyword detection [3], automatic speech recognition [4], acoustic source localization and tracking [5], and speech enhancement [6] benefit from VAD. Some of these applications, e.g., telephone conversation [2], require real-time operation with low latency. Traditional VAD algorithms were developed to detect speech in close-talk telephony with relatively high SNR, which decreases with increasing speaker distance, reverberation, and the presence of external sounds [5]. In recent years, deep neural networks (DNNs) have achieved state-of-the-art VAD accuracy in challenging conditions.

