Abstract

Previous research with speech and non-speech stimuli suggests that, in audiovisual (AV) perception, visual information that begins before the onset of the corresponding sound can serve as a cue and generate a prediction about the upcoming sound. This prediction leads to AV interaction: auditory and visual perception interact, suppressing the amplitude and speeding up the latency of early auditory event-related potentials (ERPs) such as N1 and P2. To investigate AV interaction, previous research examined N1 and P2 amplitudes and latencies in response to audio-only (AO), video-only (VO), audiovisual (AV), and control (CO) stimuli, and compared AV with auditory perception using four AV interaction models (AV vs. AO+VO, AV-VO vs. AO, AV-VO vs. AO-CO, AV vs. AO). The current study addresses how these different models express N1 and P2 suppression in music perception. It also goes one step further and examines whether prior musical experience, which can lead to higher N1 and P2 amplitudes in auditory perception, influences AV interaction under the different models. Musicians and non-musicians were presented with recordings (AO, AV, VO) of a keyboard /C4/ key being played, as well as CO stimuli. Results showed that the AV interaction models differ in how they express N1 and P2 amplitude and latency suppression: the subtractions used in the (AV-VO vs. AO) and (AV-VO vs. AO-CO) models have consequences for the resulting N1 and P2 difference waves. Furthermore, while musicians showed higher N1 amplitude than non-musicians in auditory perception, suppression of N1 and P2 amplitudes and latencies was similar for the two groups across the AV models. Collectively, these results suggest that when visual cues from finger and hand movements predict the upcoming sound in AV music perception, suppression of early ERPs is similar for musicians and non-musicians. Notably, the computational differences across models do not produce the same pattern of results for N1 and P2, demonstrating that the four models are neither interchangeable nor directly comparable.
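To make the arithmetic behind the four models concrete, the sketch below computes the comparison waveforms from condition-averaged ERPs. It is a minimal illustration, assuming the per-condition ERPs (AO, VO, AV, CO) are NumPy arrays with one value per time sample; the variable names and placeholder data are hypothetical, not taken from the study.

```python
import numpy as np

# Hypothetical condition-averaged ERPs, one value per time sample.
# In a real analysis these would come from averaging epochs per condition.
n_times = 500
rng = np.random.default_rng(0)
AO = rng.standard_normal(n_times)  # audio only
VO = rng.standard_normal(n_times)  # video only
AV = rng.standard_normal(n_times)  # audiovisual
CO = rng.standard_normal(n_times)  # control

# The four AV interaction models, each a comparison of two waveforms.
models = {
    "AV vs. AO+VO":    (AV,      AO + VO),  # additive model
    "AV-VO vs. AO":    (AV - VO, AO),       # visual activity removed from AV
    "AV-VO vs. AO-CO": (AV - VO, AO - CO),  # control also removed from AO
    "AV vs. AO":       (AV,      AO),       # direct comparison
}

# Difference waves: N1/P2 suppression appears as reduced amplitude in the
# first waveform relative to the second within the component's time window.
for name, (left, right) in models.items():
    diff = left - right
    print(f"{name}: mean difference = {diff.mean():.3f}")
```

The sketch makes the abstract's point visible: the models differ in what is subtracted before the comparison. (AV-VO vs. AO) removes visual activity from the AV response only, whereas (AV-VO vs. AO-CO) additionally removes control activity from the AO response, so the two cannot yield identical difference waves.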

Highlights

  • Musicians and non-musicians were compared on their N1 and P2 amplitudes and latencies in non-target trials in the audio-only (AO) condition (Figure 2)

  • Previous research has yielded four AV models for examining the interaction of auditory and visual perception when visual cues predict an upcoming auditory signal and lead to N1 and P2 suppression


Introduction

In audiovisual (AV) perception, studies on speech have established that seeing a talker’s face can facilitate reaction time and intelligibility compared to unimodal auditory perception (Besle et al., 2004; Schwartz et al., 2004; Remez, 2005; Campbell, 2007; Paris et al., 2013; van Wassenhove, 2013; Karas et al., 2019). N1, a negative component occurring around 100 ms after stimulus onset, is sensitive to general attributes of the stimuli, such as the predictability of the upcoming sound based on visual cues (Arnal et al., 2009; Paris et al., 2016a, 2017), spatial information (Stekelenburg and Vroomen, 2012), and temporal information (e.g., Senkowski et al., 2007b; Paris et al., 2017). P2, a positive component occurring around 200 ms after stimulus onset, is sensitive to content congruency and the integration between the visual information and the perceived auditory signal (van Wassenhove et al., 2005; Arnal et al., 2009; Paris et al., 2016b). In AV perception, while both N1 and P2 show AV interaction, N1 is more sensitive to the predictiveness of the visual cues, whereas P2 is more sensitive to the integration of auditory and visual information (e.g., Paris et al., 2016a, 2017).
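As a concrete illustration of how N1 and P2 amplitudes and latencies are typically quantified, the sketch below extracts each component's peak inside a conventional post-onset window. The window boundaries (70–150 ms for N1, 150–250 ms for P2) and the toy waveform are illustrative assumptions, not parameters reported in the studies cited above.

```python
import numpy as np

def peak_in_window(erp, times, tmin, tmax, polarity):
    """Return (amplitude, latency) of the most extreme sample with the given
    polarity ('neg' for N1, 'pos' for P2) inside [tmin, tmax] seconds."""
    mask = (times >= tmin) & (times <= tmax)
    segment = erp[mask]
    idx = segment.argmin() if polarity == "neg" else segment.argmax()
    return segment[idx], times[mask][idx]

# Toy ERP sampled at 1000 Hz from -100 to 400 ms around sound onset:
# a negative Gaussian near 100 ms (N1) plus a positive one near 200 ms (P2).
times = np.arange(-0.1, 0.4, 0.001)
erp = (-2.0 * np.exp(-((times - 0.1) / 0.02) ** 2)
       + 3.0 * np.exp(-((times - 0.2) / 0.03) ** 2))

n1_amp, n1_lat = peak_in_window(erp, times, 0.07, 0.15, "neg")  # assumed N1 window
p2_amp, p2_lat = peak_in_window(erp, times, 0.15, 0.25, "pos")  # assumed P2 window
print(f"N1: {n1_amp:.2f} µV at {n1_lat * 1000:.0f} ms")
print(f"P2: {p2_amp:.2f} µV at {p2_lat * 1000:.0f} ms")
```

Under this scheme, N1 suppression would appear as a less negative N1 amplitude (and latency facilitation as an earlier N1 latency) in the AV-derived waveform than in the auditory comparison waveform.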
