Abstract
A new approach for the segregation of monaural sound mixtures is presented based on the principle of temporal coherence and using auditory cortical representations. Temporal coherence is the notion that perceived sources emit coherently modulated features that evoke highly-coincident neural response patterns. By clustering the feature channels with coincident responses and reconstructing their input, one may segregate the underlying source from the simultaneously interfering signals that are uncorrelated with it. The proposed algorithm requires no prior information or training on the sources. It can, however, gracefully incorporate cognitive functions and influences such as memories of a target source or attention to a specific set of its attributes so as to segregate it from its background. Aside from its unusual structure and computational innovations, the proposed model provides testable hypotheses of the physiological mechanisms of this ubiquitous and remarkable perceptual ability, and of its psychophysical manifestations in navigating complex sensory environments.
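To make the clustering-and-reconstruction idea concrete, the following is a minimal sketch in Python. It is not the authors' cortical model: feature channels are approximated here by STFT magnitude bins, coherence by the correlation of each channel's slow envelope with an "attended" anchor channel, and attention by the hypothetical `anchor_bin` parameter that seeds the foreground cluster; only numpy and scipy are assumed.

```python
# Minimal sketch of the temporal-coherence idea, NOT the authors' cortical
# model. Assumptions: feature channels are STFT magnitude bins, coherence
# is the correlation of each channel's slow envelope with an "attended"
# anchor channel (the hypothetical `anchor_bin` parameter), and clustering
# is a simple correlation threshold `rho`.
import numpy as np
from scipy.signal import stft, istft

def coherence_segregate(x, fs, anchor_bin, rho=0.6):
    # 1. Feature analysis: complex spectrogram, channels x time frames.
    _, _, X = stft(x, fs=fs, nperseg=1024, noverlap=768)
    env = np.abs(X)  # slow temporal envelope of each channel

    # 2. Temporal coherence: correlate every channel's envelope with the
    #    anchor channel's (a fuller model would use sliding windows at
    #    several modulation rates rather than one global correlation).
    z = env - env.mean(axis=1, keepdims=True)
    z /= env.std(axis=1, keepdims=True) + 1e-12
    corr = z @ z[anchor_bin] / env.shape[1]

    # 3. Cluster: channels whose responses are coincident with the anchor
    #    form the foreground; everything else is background.
    mask = (corr > rho)[:, None].astype(float)

    # 4. Reconstruct the attended source by masking and inverting.
    _, y = istft(X * mask, fs=fs, nperseg=1024, noverlap=768)
    return y
```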
Highlights
Humans and animals can attend to a sound source and segregate it rapidly from a background of many other sources, with no learning or prior exposure to the specific sounds
Some algorithms rely on prior information to segregate a specific target source or voice, and can usually reconstruct it with excellent quality [7]
Other algorithms are constrained to a single microphone and instead compute the spectrogram of the mixture and decompose it into separate sources, relying on heuristics, training, mild constraints on matrix factorizations [9,10,11], spectrotemporal masks [12], and gestalt rules [1,13,14] (see the sketch following these highlights)
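As one illustration of the single-microphone strategy in the last highlight, the sketch below factorizes the magnitude spectrogram with NMF and resynthesizes two sources through spectrotemporal masks. This is a generic example of that family of methods, not any one of the cited algorithms; the component-to-source assignment `source_of` is a hypothetical placeholder for the heuristics or training those methods actually use.

```python
# Generic single-microphone separation sketch: NMF on the magnitude
# spectrogram [9,10,11] followed by Wiener-style spectrotemporal masks
# [12]. The `source_of` assignment is a hypothetical placeholder.
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

def nmf_separate(x, fs, n_components=8, source_of=lambda k: k % 2):
    _, _, X = stft(x, fs=fs, nperseg=1024)
    V = np.abs(X)

    nmf = NMF(n_components=n_components, init="nndsvd", max_iter=500)
    W = nmf.fit_transform(V)   # spectral templates: (freq, components)
    H = nmf.components_        # activations:        (components, time)

    sources = []
    for s in (0, 1):
        keep = [k for k in range(n_components) if source_of(k) == s]
        Vs = W[:, keep] @ H[keep, :]
        # Wiener-style spectrotemporal mask; reuse the mixture phase.
        mask = Vs / (W @ H + 1e-12)
        _, y = istft(X * mask, fs=fs, nperseg=1024)
        sources.append(y)
    return sources
```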
Summary
Humans and animals can attend to a sound source and segregate it rapidly from a background of many other sources, with no learning or prior exposure to the specific sounds. Some algorithms rely on prior information to segregate a specific target source or voice, and can usually reconstruct it with excellent quality [7]. Another class of algorithms relies on the availability of multiple microphones and on the statistical independence of the sources to separate them, using, for example, ICA approaches or beam-forming principles [8] (sketched below). A different class of approaches emphasizes the biological mechanisms underlying this process and assesses both their plausibility and their ability to replicate faithfully the psychoacoustics of stream segregation (with all their strengths and weaknesses). Examples of the latter approaches include models of the auditory periphery that explain how simple tone sequences may stream [15,16,17], how pitch modulations can be extracted and used to segregate sources of different pitch [18,19,20], and models that handle more elaborate sound sequences and bistable perceptual phenomena [10,21,22,23]. It is fair to say that the diversity of approaches and the continued strong interest in this problem suggest that no algorithm has yet achieved sufficient success to render the "cocktail party problem" solved from a theoretical, physiological, or applications point of view.
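The multi-microphone route mentioned above can be sketched in a few lines: with roughly as many microphones as sources and near-instantaneous mixing, statistical independence lets ICA unmix the channels. Real rooms mix convolutively, so this is the textbook case rather than a field-ready separator.

```python
# ICA unmixing sketch for the multi-microphone case [8]: assumes
# (approximately) instantaneous mixing and statistically independent
# sources, i.e. the textbook case, not a convolutive room.
import numpy as np
from sklearn.decomposition import FastICA

def ica_unmix(mics):
    """mics: float array of shape (n_samples, n_microphones)."""
    ica = FastICA(n_components=mics.shape[1])
    # Recovered sources come back in arbitrary order and scale (the
    # usual ICA permutation/scaling ambiguity).
    return ica.fit_transform(mics)
```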