Abstract

At a cocktail party, we can broadly monitor the entire acoustic scene to detect important cues (e.g., our names being called, or the fire alarm going off), or selectively listen to a target sound source (e.g., a conversation partner). It has recently been observed that individual neurons in the avian field L (analogous to the mammalian auditory cortex) can display broad spatial tuning to single targets and selective tuning to a target embedded in spatially distributed sound mixtures. Here, we describe a model inspired by these experimental observations and apply it to process mixtures of human speech sentences. This processing is realized in the neural spiking domain. It converts binaural acoustic inputs into cortical spike trains using a multi-stage model composed of a cochlear filter-bank, a midbrain spatial-localization network, and a cortical network. The output spike trains of the cortical network are then converted back into an acoustic waveform using a stimulus reconstruction technique. The intelligibility of the reconstructed output is quantified using an objective measure of speech intelligibility. We apply the algorithm to single- and multi-talker speech to demonstrate that the physiologically inspired algorithm is able to achieve intelligible reconstruction of an “attended” target sentence embedded in two other non-attended masker sentences. The algorithm is also robust to masker level and displays performance trends comparable to humans. The ideas from this work may help improve the performance of hearing assistive devices (e.g., hearing aids and cochlear implants), speech-recognition technology, and computational algorithms for processing natural scenes cluttered with spatially distributed acoustic objects.
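As a rough illustration of the processing chain described above, the sketch below wires together stand-in stages for the cochlear filter-bank, the midbrain spatial-localization network, the cortical network, and the stimulus-reconstruction step. This is not the authors' implementation: the stage internals (FFT-based band splitting, ILD-only spatial weighting, threshold spiking, smoothing-based reconstruction), the sample rate FS, and all function names are simplifying assumptions made here only to show the data flow between stages.

```python
import numpy as np

FS = 16_000  # sample rate in Hz; an assumption of this sketch

def cochlear_filterbank(x, n_channels=16):
    """Stand-in cochlear stage: split the signal into band-limited envelopes.
    A physiological model would use a gammatone filter-bank; here we use a
    crude FFT-based band split purely for illustration."""
    spec = np.fft.rfft(x)
    edges = np.linspace(0, len(spec), n_channels + 1, dtype=int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_spec = np.zeros_like(spec)
        band_spec[lo:hi] = spec[lo:hi]
        bands.append(np.abs(np.fft.irfft(band_spec, n=len(x))))  # crude envelope
    return np.array(bands)                                       # (n_channels, n_samples)

def midbrain_spatial_network(left_bands, right_bands, n_azimuths=5):
    """Stand-in midbrain stage: build spatially tuned channels from binaural
    level differences (a real model would use ITD/ILD-sensitive neurons)."""
    ild = np.log(left_bands.mean(axis=1) + 1e-9) - np.log(right_bands.mean(axis=1) + 1e-9)
    preferred = np.linspace(-1.0, 1.0, n_azimuths)                # assumed spatial channels
    weights = np.exp(-(ild[None, :] - preferred[:, None]) ** 2)   # (n_azimuths, n_channels)
    return weights @ (left_bands + right_bands) / 2               # (n_azimuths, n_samples)

def cortical_network(spatial_channels, attended_idx, thresh=None):
    """Stand-in cortical stage: attend to one spatial channel and emit spikes
    by thresholding its envelope (a spiking network would be more realistic)."""
    drive = spatial_channels[attended_idx]
    thresh = np.median(drive) if thresh is None else thresh
    return (drive > thresh).astype(float)                         # binary spike train

def reconstruct_stimulus(spikes, smooth=64):
    """Stand-in reconstruction stage: smooth the spike train back toward a
    waveform (the paper uses an optimized stimulus-reconstruction filter)."""
    kernel = np.hanning(smooth)
    return np.convolve(spikes, kernel / kernel.sum(), mode="same")

# Toy binaural input: the "target" is louder on the left, the "masker" on the right.
t = np.arange(FS) / FS
target = np.sin(2 * np.pi * 220 * t) * (1 + np.sin(2 * np.pi * 3 * t))
masker = np.sin(2 * np.pi * 330 * t)
left, right = target + 0.3 * masker, 0.3 * target + masker

spatial = midbrain_spatial_network(cochlear_filterbank(left), cochlear_filterbank(right))
spikes = cortical_network(spatial, attended_idx=0)   # attend to the leftmost spatial channel
reconstruction = reconstruct_stimulus(spikes)
print(spatial.shape, int(spikes.sum()), reconstruction.shape)
```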

Highlights

  • Our sensory systems are constantly challenged with detecting, selecting and recognizing target objects in complex natural scenes

  • We modeled the spread of information across frequency channels with a Gaussian-shaped weighting function, centered around the center frequency (CF) of each frequency channel: $w_{i,j} = \exp\!\left(-\frac{(\mathrm{CF}_i - \mathrm{CF}_j)^2}{2\sigma^2}\right)$, where $i$ and $j$ are the indices of frequency channels and $\sigma$ is the standard deviation (see the sketch after this list)

  • We built upon the network model for cortical responses, as described above, to design a physiologically inspired algorithm (PA) to process human speech in a CPP-like setting (Fig. 2A)

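The Gaussian cross-frequency weighting in the highlight above can be written as a small weight-matrix computation. The sketch below is only illustrative: measuring the channel distance on a log-frequency axis, the example center frequencies, and the value of sigma are assumptions not given in the text.

```python
import numpy as np

def gaussian_channel_weights(center_freqs_hz, sigma=0.5):
    """Gaussian-shaped weights between frequency channels, centered on each
    channel's CF. The log-frequency distance and the sigma value are
    assumptions of this sketch."""
    cf = np.log2(np.asarray(center_freqs_hz, dtype=float))  # log-frequency axis
    d = cf[:, None] - cf[None, :]                           # CF_i - CF_j for all pairs
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))             # w[i, j]

# Example: 8 channels spaced logarithmically from 200 Hz to 8 kHz.
cfs = np.geomspace(200, 8000, num=8)
w = gaussian_channel_weights(cfs, sigma=0.5)
print(np.round(w, 2))
```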

Introduction

Our sensory systems are constantly challenged with detecting, selecting, and recognizing target objects in complex natural scenes. The problem of understanding a speaker in the midst of others (a.k.a. the “cocktail party problem,” or CPP) remains a focus of intensive research in a diverse range of fields (Cherry 1953; Haykin and Chen 2005; McDermott 2009; Lyon 2010). An impressive aspect of the CPP is the flexible spatial listening capability of normal-hearing listeners. A listener can broadly monitor (i.e., maintain awareness of) the entire auditory scene for important auditory cues, or select (i.e., attend to) a target speaker at a particular location. Selecting the most relevant target often requires careful monitoring of the entire acoustic scene, and the ability to flexibly switch between these two states is essential in effectively solving the CPP. Though the specific differences in encoding across species remain an active research area, a key open question applies to both birds and mammals: how do flexible modes of listening emerge from spatially localized representations in the midbrain?
