Abstract
This paper presents the first robotic system featuring audio–visual (AV) sensor fusion with neuromorphic sensors. We combine a pair of silicon cochleae and a silicon retina on a robotic platform, allowing the robot to learn sound localization through self-motion and visual feedback using an adaptive ITD-based sound localization algorithm. After training, the robot can localize sound sources (white or pink noise) in a reverberant environment with an RMS error of 4–5° in azimuth. We also investigate the AV source binding problem, conducting an experiment to test the effectiveness of matching an audio event with its corresponding visual event based on their onset times. Despite the simplicity of this method and a large number of false visual events in the background, a correct match was made 75% of the time during the experiment.
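The onset-time binding described above can be illustrated with a minimal sketch: given the onset time of an audio event, pick the visual event whose onset is closest, within some tolerance. The function name, the tolerance window, and the timestamps below are illustrative assumptions, not values from the paper.

```python
def match_onsets(audio_onset, visual_onsets, window=0.1):
    """Bind an audio event to a visual event by onset-time proximity.

    Returns the visual onset (seconds) nearest to `audio_onset` if it
    lies within `window` seconds; otherwise None. The 0.1 s window is
    an assumed tolerance, not the paper's value. False visual events
    outside the window are rejected automatically.
    """
    candidates = [t for t in visual_onsets if abs(t - audio_onset) <= window]
    return min(candidates, key=lambda t: abs(t - audio_onset), default=None)
```

With many spurious background events, this nearest-onset rule can still pick a wrong candidate inside the window, which is consistent with the 75% match rate reported in the abstract.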
Highlights
Neuromorphic engineering, introduced by Carver Mead in the late 1980s, is a multidisciplinary approach to artificial intelligence that builds bio-inspired sensory and processing systems by combining neuroscience, signal processing, and analog VLSI (Mead, 1989; Mead, 1990).
In a previous paper in this journal, we introduced and tested an adaptive ITD-based sound localization algorithm that employs a pair of silicon cochleae, the AER EAR, and supports online learning (Chan et al., 2010).
We investigate the possibility of using self-motion and visual feedback to train a robot to accurately localize a sound source in a reverberant environment.
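The ITD-based localization in the highlights above rests on a standard far-field relation: azimuth ≈ arcsin(ITD · c / d), where c is the speed of sound and d the microphone (cochlea) separation. The sketch below illustrates that mapping only; the 0.15 m separation and the clamping are illustrative assumptions, not parameters from the paper, and the paper's adaptive algorithm learns this mapping rather than using a fixed formula.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C


def itd_to_azimuth(itd_s, mic_separation_m=0.15):
    """Map an interaural time difference (seconds) to azimuth (degrees).

    Far-field approximation: azimuth = arcsin(ITD * c / d).
    mic_separation_m = 0.15 is an assumed sensor spacing.
    """
    x = itd_s * SPEED_OF_SOUND / mic_separation_m
    x = max(-1.0, min(1.0, x))  # clamp: noisy ITD estimates can overshoot ±1
    return math.degrees(math.asin(x))
```

In a reverberant room, echoes bias the measured ITD away from this ideal curve, which is why the paper replaces the closed-form mapping with one learned from self-motion and visual feedback.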
Summary
Neuromorphic engineering, introduced by Carver Mead in the late 1980s, is a multidisciplinary approach to artificial intelligence that builds bio-inspired sensory and processing systems by combining neuroscience, signal processing, and analog VLSI (Mead, 1989; Mead, 1990). Neuromorphic engineering follows several design paradigms taken from biology: (1) pre-processing at the sensor front-end to increase dynamic range; (2) adaptation over time to learn and minimize systematic errors; (3) efficient use of transistors for low-precision computation; (4) parallel processing; and (5) signal representation by discrete events (spikes) for efficient and robust communication. While audio–visual (AV) sensor fusion has long been studied in the field of robotics, with examples such as Bothe et al. (1999) and Wong et al. (2008), to our knowledge there are no neuromorphic systems that combine sensors of different modalities.