The neural mechanisms underpinning auditory scene analysis and object formation have been of intense research interest over the past two decades. Fundamentally, however, we live in a multisensory environment. Even Cherry, in his original paper, posited "lip reading" as one way we solve the cocktail party problem. Yet how different aspects of visual cues (e.g., timing, linguistic information) help listeners follow a conversation in a complex acoustic scene is still not well understood. In this talk, we present a theoretical framework for studying audiovisual scene analysis, extrapolated from the unisensory object-based attention literature, and pose the following questions: How do we define a multimodal object? What are the predictions of unisensory object-based attention theory when applied to the audiovisual domain? What conceptual models can test the different neural mechanisms that underpin audiovisual scene analysis? Answering these questions would move us closer to addressing the cocktail party problem in real-world settings, as well as help us create, de novo, audiovisual scenes that are more engaging in augmented and virtual reality.