Because the information flow received by the human auditory system exceeds the processing capacity of the brain, neural mechanisms engage and guide attention toward prominent parts of the auditory scene. Several computational models for auditory saliency have recently been proposed. Most of these are concerned with speech recognition, and therefore apply high temporal and spectral precision to relatively short sound fragments. Here, a simplified model is described that specifically targets the long exposure times usually considered in soundscape research. The model trades temporal and spectral accuracy for computational speed, but nevertheless implements the key elements present in the calculation of complex auditory saliency maps. A simplified “cochleagram” is calculated from the 1/3-octave band spectrogram using the Zwicker model for specific loudness. Saliency is determined based on spectro-temporal irregularities, extracted in parallel at different feature scales, using a center-surround mechanism. Finally, conspicuous peaks are selected using within-feature and between-feature competition. The model is shown to behave as expected for a number of typical sounds. As an illustration, saliency calculation results for a set of recordings in urban parks are compared with other acoustical descriptors and with perceptual attribute scales from questionnaire studies.
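To make the processing chain concrete, the following Python sketch traces the three stages named above: a loudness-based cochleagram, multi-scale center-surround contrast, and a simple normalization-and-summation step standing in for the competition stage. All parameter values (the loudness exponent, threshold, and scale pairs), the function names, and the max-normalization scheme are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the saliency pipeline, assuming a precomputed
# 1/3-octave band spectrogram (bands x time frames, levels in dB).
# Exponent, threshold, and scales are placeholders, not the paper's values.
import numpy as np
from scipy.ndimage import uniform_filter

def specific_loudness(band_levels_db, threshold_db=3.0, exponent=0.23):
    """Crude stand-in for Zwicker specific loudness: a compressive
    power law applied to excitation above a hearing-threshold level."""
    excitation = 10.0 ** (band_levels_db / 10.0)
    threshold = 10.0 ** (threshold_db / 10.0)
    return np.maximum(excitation - threshold, 0.0) ** exponent

def center_surround(cochleagram, center_size, surround_size):
    """Center-surround contrast: difference between a narrow and a
    broader local average over the spectro-temporal plane."""
    center = uniform_filter(cochleagram, size=center_size)
    surround = uniform_filter(cochleagram, size=surround_size)
    return np.abs(center - surround)

def saliency_map(band_levels_db, scales=((1, 3), (3, 9), (9, 27))):
    """Combine center-surround maps extracted in parallel at several
    feature scales. Each per-scale map is normalized by its own peak
    (a much-simplified form of within-feature competition) before the
    maps are summed across scales."""
    coch = specific_loudness(band_levels_db)
    total = np.zeros_like(coch)
    for center_size, surround_size in scales:
        cs = center_surround(coch, center_size, surround_size)
        peak = cs.max()
        if peak > 0:
            total += cs / peak
    return total

# Example: a synthetic 1/3-octave spectrogram (31 bands x 600 frames).
rng = np.random.default_rng(0)
levels = 40 + 10 * rng.standard_normal((31, 600))
sal = saliency_map(levels)
```

In the full model, the competition stages are iterative and operate both within and between feature maps; the single max-normalization used here is only meant to show where that step sits in the pipeline.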