Abstract

Evidence from behavioral studies demonstrates that spoken language guides attention in a related visual scene and that attended scene information can influence the comprehension process. Here we model sentence comprehension within visual contexts. A recurrent neural network is trained to associate the linguistic input with the visual scene and to produce the interpretation of the described event, which is part of the visual scene. We also investigate a feedback mechanism that enables explicit utterance-mediated attention shifts to the relevant part of the scene. We compare four models - a simple recurrent network (SRN) and three models with specific types of additional feedback - in order to explore the role of the attention mechanism in the comprehension process. The results show that all networks not only learn to produce the interpretation at the end of the sentence, but also exhibit predictive behavior, reflected in their ability to anticipate upcoming constituents. The SRN performs very well, as expected, but the comparison shows that adding an explicit attentional mechanism does not degrade performance and even yields a slight improvement in one of the models.
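The abstract does not specify the architecture in detail, so the following is only a minimal sketch of the general idea, assuming a PyTorch implementation: an Elman-style SRN reads the sentence word by word, while a feedback loop lets the hidden state attend over scene-event vectors and feeds the attended scene representation back as extra input at the next step. The class name AttentionalSRN, all layer sizes, and the dot-product attention over scene events are illustrative assumptions, not the paper's specification.

```python
# Sketch only: an SRN with utterance-mediated attentional feedback over a
# visual scene. Names and dimensions are hypothetical, not from the paper.
import torch
import torch.nn as nn


class AttentionalSRN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, scene_dim, out_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Elman recurrence; the input is the word embedding concatenated
        # with the currently attended scene vector (the feedback loop).
        self.rnn = nn.RNNCell(embed_dim + scene_dim, hidden_dim)
        # Maps the hidden state to a query used to score scene events.
        self.attn = nn.Linear(hidden_dim, scene_dim)
        # Produces the interpretation of the described event.
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, words, scene):
        # words: (batch, seq_len) word ids
        # scene: (batch, n_events, scene_dim) one vector per depicted event
        batch, n_events, _ = scene.shape
        h = torch.zeros(batch, self.rnn.hidden_size, device=words.device)
        weights = torch.full((batch, n_events), 1.0 / n_events,
                             device=scene.device)
        attended = scene.mean(dim=1)  # uniform attention before any input
        for t in range(words.size(1)):
            x = torch.cat([self.embed(words[:, t]), attended], dim=-1)
            h = self.rnn(x, h)
            # Utterance-mediated attention shift: weight scene events by
            # their match with the evolving sentence representation.
            query = self.attn(h)                                   # (batch, scene_dim)
            scores = torch.bmm(scene, query.unsqueeze(-1)).squeeze(-1)
            weights = torch.softmax(scores, dim=-1)                # (batch, n_events)
            attended = torch.bmm(weights.unsqueeze(1), scene).squeeze(1)
        # Interpretation at sentence end, plus the final attention weights.
        return self.out(h), weights


model = AttentionalSRN(vocab_size=50, embed_dim=16, hidden_dim=64,
                       scene_dim=32, out_dim=10)
words = torch.randint(0, 50, (2, 7))   # two 7-word sentences
scene = torch.randn(2, 3, 32)          # three depicted events per scene
interpretation, attention = model(words, scene)
```

Reading the interpretation out at every word, rather than only at the sentence end, would be one way to probe the anticipatory behavior the abstract reports, since the network's intermediate outputs then reveal which upcoming constituents it expects.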


