Towards End-to-End Acoustic Localization Using Deep Learning: From Audio Signals to Source Position Coordinates.

Juan Vera-Diaz, Daniel Pizarro, Javier Macias-Guarasa

doi:10.3390/s18103418

Open Access

https://doi.org/10.3390/s18103418

Journal: Sensors (Basel, Switzerland)
Publication Date: Oct 12, 2018
Citations: 82
License type: CC BY 4.0

Affiliation: University of Alcalá

Abstract

This paper presents a novel approach to indoor acoustic source localization with microphone arrays, based on a Convolutional Neural Network (CNN). In the proposed solution, the CNN is designed to directly estimate the three-dimensional position of a single acoustic source, taking the raw audio signal as input and avoiding hand-crafted audio features. Given the limited amount of available localization data, we propose a two-step training strategy. We first train the network on semi-synthetic data generated from close-talk speech recordings, simulating the time delays and distortion that the signal suffers as it propagates from the source to the microphone array. We then fine-tune this network using a small amount of real data. Our experimental results, evaluated on a publicly available dataset recorded in a real room, show that this approach produces networks that significantly improve on existing localization methods based on SRP-PHAT strategies, and also on very recent proposals based on Convolutional Recurrent Neural Networks (CRNN). In addition, our experiments show that the performance of our CNN method has no relevant dependency on the speaker’s gender or on the size of the signal window used.
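The first training step described above, generating array signals from a close-talk recording, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a free-field model with no reverberation, a fixed speed of sound of 343 m/s, and a simple 1/distance attenuation standing in for the distortion mentioned in the abstract. The function name `simulate_array_signals` and its parameters are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def simulate_array_signals(clean, fs, source_pos, mic_positions):
    """Delay and attenuate a close-talk signal for each microphone.

    Free-field sketch: one direct path per microphone, no reflections.
    Returns (signals, delays) with signals of shape (n_mics, n_samples).
    """
    clean = np.asarray(clean, dtype=float)
    source_pos = np.asarray(source_pos, dtype=float)
    mic_positions = np.asarray(mic_positions, dtype=float)

    # Source-to-microphone distances in metres, shape (n_mics,).
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)
    delays = dists / SPEED_OF_SOUND  # propagation delays in seconds

    # Zero-pad so the frequency-domain (circular) shift cannot wrap around.
    pad = int(np.ceil(delays.max() * fs)) + 1
    n = len(clean) + pad

    spectrum = np.fft.rfft(clean, n=n)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)

    # A delay of tau seconds is a linear phase ramp exp(-j 2 pi f tau),
    # which also handles fractional-sample delays.
    phase = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    delayed = np.fft.irfft(spectrum[None, :] * phase, n=n, axis=1)

    # Assumed 1/d spherical-spreading attenuation (not the paper's model).
    signals = delayed / np.maximum(dists, 1e-3)[:, None]
    return signals, delays
```

Given a source position sampled anywhere in the room, each call yields one multichannel training example whose label is that position, which is what makes large amounts of semi-synthetic data cheap to produce.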

Highlights

Research on and development of advanced perceptual systems have grown notably over the last decades, and have risen sharply in recent years thanks to increasingly sophisticated sensors, computing nodes with ever higher computational power, and powerful algorithmic strategies based on deep learning.
We report the relative improvements of GMBF and of the Convolutional Neural Network (CNN) over Steered Response Power with Phase Transform (SRP-PHAT).
The average MOTP value for the standard SRP-PHAT algorithm was between 76 cm and 96 cm, and for GMBF it was between 59 cm and 78 cm.
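The highlights do not define MOTP. A minimal sketch of how such a figure could be computed follows, assuming MOTP here denotes the average Euclidean distance between estimated and ground-truth source positions. The function name `motp` and the sample coordinates are hypothetical and are not taken from the paper.

```python
import numpy as np


def motp(estimates, ground_truth):
    """Mean Euclidean distance between estimated and true positions.

    Both inputs have shape (n_frames, 3); the result is in the same
    unit as the coordinates (e.g. metres).
    """
    est = np.asarray(estimates, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    # Per-frame localization error, then the average over all frames.
    return float(np.mean(np.linalg.norm(est - gt, axis=1)))


# Illustrative values only: per-frame errors of 0.5 m and 1.0 m.
example = motp([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
               [[0.3, 0.4, 0.0], [1.0, 0.0, 1.0]])
```

Under this assumption, a lower value means the estimated positions lie closer to the true source on average, so the GMBF range quoted above would indicate better precision than the SRP-PHAT range.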

Summary

Introduction

Research on and development of advanced perceptual systems have grown notably over the last decades, and have risen sharply in recent years thanks to increasingly sophisticated sensors, computing nodes with ever higher computational power, and powerful algorithmic strategies based on deep learning (all of them entering the mass consumer market). Scientific work in this field covers research areas ranging from basic sensor technologies to signal processing and pattern recognition. It opens the way to systems that can analyze human activities and provide us with advanced interaction capabilities and services. In this context, the localization of humans (the most interesting element for perceptual systems) is a fundamental task that must be addressed before such systems can provide higher-level information on the activities being carried out. Further advanced interactions between humans and their physical environment cannot be fulfilled successfully