Abstract

Perception is thought to be shaped by the environments for which organisms are optimized. These influences are difficult to test in biological organisms but may be revealed by machine perceptual systems optimized under different conditions. We investigated environmental and physiological influences on pitch perception, whose properties are commonly linked to peripheral neural coding limits. We first trained artificial neural networks to estimate fundamental frequency from biologically faithful cochlear representations of natural sounds. The best-performing networks replicated many characteristics of human pitch judgments. To probe the origins of these characteristics, we then optimized networks given altered cochleae or sound statistics. Human-like behavior emerged only when cochleae had high temporal fidelity and when models were optimized for naturalistic sounds. The results suggest pitch perception is critically shaped by the constraints of natural environments in addition to those of the cochlea, illustrating the use of artificial neural networks to reveal underpinnings of behavior.

Highlights

  • We developed a model of pitch perception by optimizing artificial neural networks to estimate the fundamental frequency of their acoustic input

  • The networks were trained on simulated auditory nerve representations of speech and music embedded in background noise

Introduction

Through optimization for the training task, the DNNs should learn to use whichever peripheral cues best allow them to extract F0. To make the F0 estimation task more difficult and to simulate naturalistic listening conditions, each speech or instrument excerpt in the training dataset was embedded in natural background noise. The signal-to-noise ratio for each training example was drawn uniformly between −10 dB and +10 dB. Noise source clips were taken from a subset of the AudioSet corpus [78], screened to remove nonstationary sounds (e.g., speech or music). To ensure the F0 estimation task remained well defined for the noisy stimuli, background noise clips were screened for periodicity by computing their autocorrelation functions. Noise clips whose normalized autocorrelation function contained peaks greater than 0.8 at lags greater than 1 ms were excluded.
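The two preprocessing steps above (the autocorrelation-based periodicity screen and mixing at a sampled SNR) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`is_aperiodic`, `mix_at_snr`) and the use of `np.correlate` are assumptions; only the 0.8 peak threshold, the 1 ms lag cutoff, and the −10 to +10 dB SNR range come from the text.

```python
import numpy as np

def is_aperiodic(noise, sr, ac_threshold=0.8, min_lag_s=1e-3):
    """Return True if the clip passes the periodicity screen, i.e. its
    normalized autocorrelation has no peak above `ac_threshold` at lags
    greater than `min_lag_s` seconds (0.8 and 1 ms per the text)."""
    x = noise - np.mean(noise)
    # One-sided autocorrelation, normalized so the zero-lag value is 1.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / ac[0]
    min_lag = int(round(min_lag_s * sr))
    return np.max(ac[min_lag:]) < ac_threshold

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture signal + noise has the requested SNR
    in dB, then return the mixture."""
    ps = np.mean(signal ** 2)
    pn = np.mean(noise ** 2)
    scale = np.sqrt(ps / (pn * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

# Example: a pure tone fails the screen; white noise passes.
sr = 16000
t = np.arange(int(0.25 * sr)) / sr
tone = np.sin(2 * np.pi * 200.0 * t)
wn = np.random.default_rng(0).standard_normal(len(t))
mixture = mix_at_snr(tone, wn, np.random.default_rng(1).uniform(-10.0, 10.0))
```

In a training pipeline, `snr_db` would be drawn uniformly from [−10, +10] dB per example, and clips failing `is_aperiodic` would be discarded before mixing.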
