Abstract

The “cocktail party problem” refers to the ability of human listeners to separate the acoustic signal reaching their ears into its individual components, corresponding to individual sound sources in the environment. Although this task appears trivial for humans, solving the cocktail party problem computationally remains an ambitious challenge. The approach used in this paper takes inspiration from human strategies for separating an acoustic environment into distinct perceptual auditory streams. A series of time-frequency-based features, analogous to those thought to emerge at various stages of the human auditory processing pathway, are derived from binaural auditory inputs. These feature vectors serve as inputs to an unsupervised cluster analysis that groups feature values assumed to correspond to the same object. Reconstructed auditory streams are then correlated with the original components used to create the auditory scene. Our model is capable of reconstructing streams that correlate with the original components (r = 0.3–0.7) used to create the complex auditory scene. The success of the reconstructions depends largely on the signal-to-noise ratio of the components of the auditory scene.

Highlights

  • In everyday listening environments, we are often challenged with separating the myriad of sound signals that arrive at our ears into distinct sound sources

  • The features extracted from the source signal have a physiological and/or psychological basis for the formation of auditory streams [9]. Such features may be based on the frequency selectivity of the basilar membrane of the cochlea [10], the spectrotemporal receptive fields (STRFs) of the auditory cortex [4, 11, 12], measurements of pitch and timbre [13], and localization via the interaural time difference (ITD) [14]. These features are fed into a cluster analysis stage that attempts to form groupings of similar features within the feature space; the goal is to form groups that correspond to individual auditory objects

  • When comparing the separated piano stream to the original component, it is clear that a substantial portion of information is lost during separation from the mixture; nevertheless, enough information remains in the reconstruction to recognize the corresponding original component


Introduction

We are often challenged with separating the myriad of sound signals that arrive at our ears into distinct sound sources. The features extracted from the source signal have a physiological and/or psychological basis for the formation of auditory streams [9]. Such features may be based on the frequency selectivity of the basilar membrane of the cochlea [10], the spectrotemporal receptive fields (STRFs) of the auditory cortex [4, 11, 12], measurements of pitch and timbre [13], and localization via the interaural time difference (ITD) [14].
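The pipeline described above — extract time-frequency features, cluster them into groups, and carve out a stream per group — can be illustrated with a deliberately simplified toy sketch. This is not the paper's actual feature set: a plain STFT stands in for the cochlear filterbank, bin frequency is the only grouping cue (no STRF, pitch, timbre, or ITD features), and the two-tone "scene" is hypothetical.

```python
import numpy as np

def stft(x, win=256, hop=128):
    """Naive short-time Fourier transform with a Hann window."""
    w = np.hanning(win)
    frames = [np.fft.rfft(w * x[i:i + win])
              for i in range(0, len(x) - win + 1, hop)]
    return np.asarray(frames)               # (n_frames, win // 2 + 1)

def kmeans(X, k, iters=20):
    """Minimal k-means; centers start spread across the data range."""
    centers = np.quantile(X, np.linspace(0, 1, k), axis=0)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy "scene": two steady tones (hypothetical stand-ins for the
# components of a real auditory scene) mixed at equal level.
fs = 8000
t = np.arange(fs) / fs
mix = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 1800 * t)

mag = np.abs(stft(mix))
# Keep only energetic time-frequency bins and group them by frequency,
# the crudest analogue of clustering in a richer feature space.
idx_t, idx_b = np.nonzero(mag > 0.2 * mag.max())
labels = kmeans(idx_b[:, None].astype(float), k=2)

# Binary masks: each cluster claims its time-frequency bins, which is
# how a separated stream would be carved out before resynthesis.
masks = [np.zeros_like(mag, dtype=bool) for _ in range(2)]
for tt, bb, lab in zip(idx_t, idx_b, labels):
    masks[lab][tt, bb] = True
```

With well-separated tones, the clustering assigns the low-frequency bins to one group and the high-frequency bins to the other; applying each mask and inverting the transform would yield the reconstructed streams that the paper evaluates by correlation with the original components.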
