Abstract

The speech signal carries information about a talker's voice and about linguistic content along the same acoustic dimensions. Traditionally, the unraveling of talker and linguistic information has been characterized as a normalization process in which talker information is discarded in the listener's quest for the abstract, idealized linguistic units thought to underlie speech perception. Recent studies, however, have demonstrated that the processing of voice and the processing of linguistic content are not independent. Nygaard et al. [Psychol. Sci. 5, 42–46 (1994)] found that learning a talker's voice facilitates subsequent phonetic analysis, and suggested that familiarity with a talker's voice involves long-term changes in the speech perceptual system. To explain this phenomenon, a model is proposed in which a single set of representational elements is responsible for preserving information about both talker and phonetic content. This model was instantiated as a recurrent auto-associative network trained to reproduce cochleagrams [R. F. Lyon, Proc. IEEE ICASSP '82] of a restricted set of words, for a given set of talkers. At test, input representations included cochleagrams of novel words produced by both "familiar" and novel talkers. The model made fewer errors reproducing patterns from talkers encountered in training than when reproducing patterns from novel talkers. The behavior of the model was due to (1) the assumption of integrated representation, (2) significant change (perceptual learning) during training, and (3) the specifics of the neural network architecture.
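The abstract does not specify the network's implementation details, but the described setup (a recurrent auto-associator trained to reproduce cochleagram input, with reconstruction error compared across familiar and novel talkers at test) can be sketched as follows. This is a minimal, hypothetical illustration in PyTorch: the class name, layer sizes, channel count, and the synthetic stand-in data are all assumptions, not details from the paper.

import torch
import torch.nn as nn

N_CHANNELS = 64   # assumed number of cochleagram frequency channels
HIDDEN = 128      # assumed recurrent hidden-state size

class CochleagramAutoencoder(nn.Module):
    """Recurrent auto-associator: maps a cochleagram sequence onto itself,
    so one set of weights must encode both talker and phonetic information."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_CHANNELS, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, N_CHANNELS)

    def forward(self, x):                 # x: (batch, time, channels)
        h, _ = self.rnn(x)
        return self.out(h)                # reconstruction of the input

def reconstruction_error(model, x):
    """Mean squared reproduction error: the quantity compared across
    familiar vs. novel talkers at test."""
    with torch.no_grad():
        return nn.functional.mse_loss(model(x), x).item()

# Training: cochleagrams of a restricted word set from the "familiar" talkers.
# Random tensors stand in for real Lyon cochleagrams here.
model = CochleagramAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
train_batch = torch.rand(32, 100, N_CHANNELS)   # synthetic stand-in data
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(train_batch), train_batch)
    loss.backward()
    opt.step()

# Test: novel words from familiar vs. novel talkers. The paper's finding is
# that error is lower for talkers encountered in training; with the random
# stand-in data above, no such difference would be expected.
familiar_talkers = torch.rand(8, 100, N_CHANNELS)
novel_talkers = torch.rand(8, 100, N_CHANNELS)
print(reconstruction_error(model, familiar_talkers),
      reconstruction_error(model, novel_talkers))

Because the same recurrent weights serve every input, training on one talker's productions changes how all subsequent patterns are processed, which is one way to realize the paper's "integrated representation" and "perceptual learning" claims in a single mechanism.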
