Abstract

Understanding the lyrics of many songs, not just contemporary ones, is sometimes difficult. Watching the talker's face improves speech understanding when the speech is degraded by noise or by hearing difficulty. To explore whether the face can be similarly helpful in music, 34 phrases from the song Pressman by Primus (1993) were played to 13 college students. These phrases were aligned with Baldi, a computer-animated talking head. There were three presentation conditions: acoustic presentation of the lyrics, Baldi's mouthing of the lyrics, and the acoustic lyrics aligned with Baldi. In all three conditions, the students were asked to watch and listen and to immediately type the words they thought were being presented. Performance was significantly better in the bimodal condition than in the auditory condition, showing that visual information from the face contributes to the recognition of musical lyrics. Although the contribution of the face was significant, it was somewhat smaller than that found in speech.

A variety of analogies have been drawn between language and music (Besson & Schon, 2003; Bernstein, 1976; Jackendoff, 1992; Lerdahl & Jackendoff, 1983; Patel, 2003). Both domains might be considered forms of communication, although this characterization is somewhat limited in scope. If we distinguish the linguistic (the literal, word-based aspects of speech) from the paralinguistic (the nonliteral, nonword aspects: pitch, volume, stress, speed) dimensions of communication, we can claim that language emphasizes the linguistic relative to the paralinguistic, whereas musical pieces emphasize the paralinguistic relative to the linguistic. In this research, we test whether the visual modality influences music perception in the same way that it influences spoken language perception.

It has been repeatedly shown that many factors affect the intelligibility of verbal communication. In addition to the auditory and contextual information received by listening to a speaker, comprehension is aided by being able to see the speaker's face while he or she is talking. This phenomenon is most evident when the verbal information is degraded in some way, as by noise or hearing impairment (Erber, 1972; Kisor, 1990; Massaro, 1987; Summerfield, 1987). Although practitioners of both speech therapy and speech science were well aware of the potential richness of the speech information in the face, it was the McGurk illusion (hearing inappropriately because of watching the face) that captured the imagination of researchers. The McGurk effect, or some variant of it, has been replicated and studied across different languages (English, Japanese, Dutch, Spanish, French, German, Cantonese, Finnish, and Thai); across eight decades of the lifespan, from infancy onward; and from the perception of nonsense syllables to the understanding of prose. Emerging from this impressive body of work is the robustness of the phenomenon, which holds up independently of the perceiver's intentions and also exists in analogous fashion in other domains, such as perceiving emotion from the face and the voice (for a review, see Massaro, 1998).

Speech reading, or the ability to obtain speech information from the face, is robust in that perceivers are fairly good at speech reading even when they are not looking directly at the talker's lips.
Furthermore, accuracy is not dramatically reduced when the facial image is blurred (because of poor vision, for example), when the face is viewed from above, below, or in profile, or when there is a large distance between the talker and the viewer (Jordan & Sergeant, 2000; Massaro, 1998, Chapter 14). In addition, people naturally integrate visible speech with audible speech even when the temporal occurrence of the two sources is displaced by about one-fifth of a second (Massaro, 1998, Chapter 3). These findings indicate that speech reading is highly functional in a variety of nonoptimal situations. The complementarity of auditory and visual information simply means that one of the sources is most informative in precisely those cases in which the other is weakest. …
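The integration and complementarity claims above are formalized in Massaro's (1998) fuzzy logical model of perception (FLMP), which is cited but not spelled out in this excerpt; the equation below is therefore a sketch drawn from that cited work rather than part of the present text. Each modality supplies an independent degree of support for each response alternative, and the supports are combined multiplicatively and normalized:

% FLMP integration rule (after Massaro, 1998): a_i and v_i are the degrees
% of auditory and visual support (between 0 and 1) for alternative i.
\[
  P(r_i \mid A, V) \;=\; \frac{a_i \, v_i}{\sum_{k} a_k \, v_k}
\]

Under this multiplicative rule, complementarity falls out directly: when one source is ambiguous (roughly equal support for every alternative), it scales all the products about equally, so the decision is carried by the more informative source.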
