Abstract
Our current understanding of how the brain segregates auditory scenes into meaningful objects is in line with a Gestalt framework. These Gestalt principles suggest a theory of how different attributes of the soundscape are extracted and then bound together into separate groups that reflect the different objects or streams present in the scene. These cues are thought to reflect the underlying statistical structure of natural sounds, much as the statistics of natural images are closely linked to the principles that guide figure-ground segregation and object segmentation in vision. In the present study, we leverage inference in stochastic neural networks to learn emergent grouping cues directly from natural soundscapes, including speech, music and sounds in nature. The model learns a hierarchy of local and global spectro-temporal attributes reminiscent of the simultaneous and sequential Gestalt cues that underlie the organization of auditory scenes. These mappings operate at multiple time scales to analyze an incoming complex scene and are then fused using a Hebbian network that binds coherent features together into perceptually segregated auditory objects. The proposed architecture successfully emulates a wide range of well-established auditory scene segregation phenomena and quantifies the complementary roles of segregation and binding cues in driving auditory scene segregation.
Highlights
We live in busy environments, and our surroundings continuously flood our sensory system with complex information that must be analyzed in order to make sense of the world around us.
A number of Gestalt principles have been posited as indispensable anchors used by the brain to guide the segregation of auditory scenes into perceptually meaningful objects [8, 47, 58].
These comprise a wide variety of cues: for instance, harmonicity, which couples harmonically-related frequency channels together; common fate, which favors sound elements that co-vary in amplitude; and common onset, which groups components that share a similar starting time and, to a lesser degree, a common ending time.
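As a concrete illustration of one such cue, the sketch below groups frequency channels by the common-fate principle: channels whose amplitude envelopes are strongly correlated over time are assigned to the same stream. The synthetic envelopes, the 0.9 correlation threshold, and the flood-fill grouping are illustrative assumptions, not the model described in the paper.

```python
import numpy as np

def common_fate_groups(envelopes, threshold=0.9):
    """Group channels whose envelope correlation exceeds `threshold`.

    envelopes: array of shape (n_channels, n_frames).
    Returns a list of sets of channel indices (connected components
    of the thresholded correlation graph).
    """
    n = envelopes.shape[0]
    corr = np.corrcoef(envelopes)
    groups, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        group, stack = {i}, [i]
        while stack:  # flood-fill over strongly correlated channels
            j = stack.pop()
            for k in range(n):
                if k not in group and corr[j, k] > threshold:
                    group.add(k)
                    stack.append(k)
        assigned |= group
        groups.append(group)
    return groups

# Toy example: channels 0 and 1 share a 4 Hz modulation ("common fate"),
# channel 2 is modulated at a different rate and should segregate.
t = np.linspace(0, 1, 200)
slow = 1 + np.sin(2 * np.pi * 4 * t)
fast = 1 + np.sin(2 * np.pi * 11 * t)
env = np.stack([slow, slow, fast])
print(common_fate_groups(env))  # → [{0, 1}, {2}]
```

Any clustering over a pairwise-coherence matrix would serve the same illustrative purpose; the flood-fill is used here only because it keeps the example short.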
Summary
We live in busy environments, and our surroundings continuously flood our sensory system with complex information that must be analyzed in order to make sense of the world around us. This process, labeled scene analysis, is common across all sensory modalities including vision, audition and olfaction [1]. Our brain relies on innate dispositions that aid this process and help guide the organization of patterns into perceived objects [2]. These dispositions, referred to as Gestalt principles, inform our current understanding of the perceptual organization of scenes [3, 4]. The sensory mixture is decomposed into feature elements, believed to be the building blocks of the scene. These features reflect the physical nature of sources in the scene, the state and structure of the environment itself, as well as perceptual mappings of these attributes as viewed by the sensory system. This segregation stage is modeled using feature analyses which map the sensory signal into its building blocks, ranging from simple components (e.g. frequency channels) to dimensionally-complex kernels [6, 7].
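The abstract mentions a Hebbian network that binds coherent features into segregated objects. The toy sketch below shows the general idea behind Hebbian binding: connections between feature channels strengthen in proportion to their co-activation, so features that fire together become strongly coupled. The learning rate, update rule, and activity patterns are assumptions chosen for illustration, not the paper's actual architecture.

```python
import numpy as np

def hebbian_binding(activity, lr=0.1, steps=50):
    """Learn a binding matrix with a plain Hebb rule.

    activity: (n_frames, n_features) array of feature co-activations.
    Returns a symmetric weight matrix w where w[i, j] grows with how
    often features i and j are active in the same frame.
    """
    n = activity.shape[1]
    w = np.zeros((n, n))
    for _ in range(steps):
        for x in activity:
            w += lr * np.outer(x, x)  # Hebb rule: delta w_ij ∝ x_i x_j
    np.fill_diagonal(w, 0.0)          # ignore self-connections
    return w

# Features 0 and 1 always fire together; feature 2 fires alone.
frames = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]], dtype=float)
w = hebbian_binding(frames)
print(w[0, 1] > w[0, 2])  # → True: coherent features bind strongly
```

After learning, thresholding the weight matrix yields groups of mutually bound features, which is the sense in which coherent features end up in the same perceptual object.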