Abstract
Deep neural networks have recently been shown to capture the intricate transformation of signals from sensory profiles to semantic representations that facilitates recognition or discrimination of complex stimuli. In this vein, convolutional neural networks (CNNs) have been used very successfully in image and audio classification. Designed to imitate the hierarchical structure of the nervous system, CNNs build activations of increasing complexity that transform the incoming signal into object-level representations. In this work, we employ a CNN trained for large-scale audio object classification to gain insights into the contribution of various audio representations that guide sound perception. The analysis contrasts activation of different layers of a CNN with acoustic features extracted directly from the scenes, perceptual salience obtained from behavioral responses of human listeners, and neural oscillations recorded by electroencephalography (EEG) in response to the same natural scenes. All three measures are tightly linked quantities believed to guide percepts of salience and object formation when listening to complex scenes. The results paint a picture of the intricate interplay between low-level and object-level representations in guiding auditory salience, one that is highly dependent on context and sound category.
Highlights
Over the past few years, convolutional neural networks (CNNs) have revolutionized machine perception in the domains of image understanding, speech and audio recognition, and multimedia analytics (Krizhevsky et al., 2012; Karpathy et al., 2014; Cai and Xia, 2015; Simonyan and Zisserman, 2015; He et al., 2016; Hershey et al., 2017; Poria et al., 2017).
A CNN is a form of deep neural network (DNN) in which most of the computation is done with trainable kernels that are slid over the entire input.
The current study leverages the complex hierarchy afforded by CNNs trained on audio classification to explore parallels between network activation and auditory salience in natural sounds measured through a variety of modalities.
Summary
Over the past few years, convolutional neural networks (CNNs) have revolutionized machine perception in the domains of image understanding, speech and audio recognition, and multimedia analytics (Krizhevsky et al., 2012; Karpathy et al., 2014; Cai and Xia, 2015; Simonyan and Zisserman, 2015; He et al., 2016; Hershey et al., 2017; Poria et al., 2017). A CNN is a form of deep neural network (DNN) in which most of the computation is done with trainable kernels that are slid over the entire input. These networks implement hierarchical architectures that mimic the biological structure of the human sensory system. They are organized in a series of processing layers that perform different transformations of the incoming signal, “learning” information in a distributed topology. By constraining the selectivity of units in these layers, nodes in the network have emergent “receptive fields,” allowing them to learn from local information in the input and structure processing in a distributed way; much like neurons in the brain have receptive fields with localized connectivity organized in topographic maps.
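The core operation described above, a trainable kernel slid over the input so that each output unit responds only to a local window, can be illustrated with a minimal sketch. This is not the network used in the study; the function name, toy signal, and kernel weights are purely illustrative (in a real CNN the kernel weights are learned from data, and stacking layers widens each unit's effective receptive field).

```python
import numpy as np

def conv1d(signal, kernel):
    """Slide a kernel over a 1-D input, taking a dot product at each position.

    Each output value depends only on a local window of the input,
    which is what gives a CNN unit its localized 'receptive field'.
    """
    k = len(kernel)
    n = len(signal) - k + 1
    return np.array([np.dot(signal[i:i + k], kernel) for i in range(n)])

# Toy input and a hand-picked difference kernel (illustrative only).
signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([0.5, -0.5])

layer1 = conv1d(signal, kernel)          # each unit sees 2 input samples
layer2 = conv1d(layer1, kernel)          # stacked layer: each unit now sees 3 input samples
print(layer1, layer2)
```

Stacking such layers is what produces the hierarchy the text refers to: units deeper in the network integrate over progressively larger stretches of the input signal.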