Abstract

Human infants acquire natural language seemingly with ease at an early age. Their language learning appears to occur alongside the learning of other cognitive functions and through playful interaction with the environment and caregivers. From a neuroscientific perspective, natural language is embodied, grounded in most, if not all, sensory and sensorimotor modalities, and acquired by means of crossmodal integration. However, characterizing the underlying mechanisms in the brain is difficult, and explaining the grounding of language in crossmodal perception and action remains challenging. In this paper, we present a neurocognitive model for language grounding that reflects bio-inspired mechanisms such as an implicit adaptation of timescales as well as end-to-end multimodal abstraction. It addresses developmental robotic interaction and extends its learning capabilities using larger-scale knowledge-based data. In our scenario, we use the humanoid robot NICO to obtain the EMIL data collection, in which the cognitive robot interacts with objects in a children's playground environment while receiving linguistic labels from a caregiver. The model analysis shows that crossmodally integrated representations are sufficient for acquiring language merely from sensory input through interaction with objects in an environment. The representations self-organize hierarchically and embed temporal and spatial information through composition and decomposition. This model can also provide the basis for further crossmodal integration of perceptually grounded cognitive representations.

Highlights

  • While research in natural language processing has advanced in specific disciplines such as parsing and classifying large amounts of text, human-computer communication remains a major challenge for several reasons: speech recognition is limited to good signal-to-noise conditions or well-adapted models, dialogue systems depend on a well-defined context, and language elements are difficult to reconcile with the environmental situation

  • To analyze how our model grounds compositional language in multimodal sensations and how multimodal abstraction emerges through learning, we trained different variants of our model on different variants of the Embodied Multimodal Interaction in Language learning (EMIL) data sets

  • We started from a baseline model architecture of Continuous Time Recurrent Neural Networks (CTRNNs), configured with an equal timescale τ = 1 for all neurons (see the update sketch after this list)
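
As a minimal illustration of this baseline (not the authors' implementation), the sketch below shows a discrete-time CTRNN update in which a per-neuron timescale τ acts as a leak term: with τ = 1 for every neuron, the cell reduces to an ordinary recurrent update, while larger τ values make units integrate their input more slowly. The class and parameter names (CTRNNCell, tau, W_in, W_rec) are hypothetical choices for this example.

    import numpy as np

    class CTRNNCell:
        # Minimal discrete-time CTRNN (leaky-integrator) update, for illustration only:
        #   u_t = (1 - 1/tau) * u_{t-1} + (1/tau) * (W_rec @ y_{t-1} + W_in @ x_t + b)
        #   y_t = tanh(u_t)
        # tau is a per-neuron timescale; tau = 1 for all neurons gives the baseline,
        # i.e. an ordinary recurrent update without slow internal dynamics.

        def __init__(self, n_in, n_hidden, tau, seed=0):
            rng = np.random.default_rng(seed)
            self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))
            self.W_rec = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
            self.b = np.zeros(n_hidden)
            # Accept a single timescale or one value per neuron
            self.tau = np.full(n_hidden, float(tau)) if np.isscalar(tau) \
                else np.asarray(tau, dtype=float)

        def step(self, x, u_prev, y_prev):
            # Leaky integration of the pre-activation state u
            u = (1.0 - 1.0 / self.tau) * u_prev + (1.0 / self.tau) * (
                self.W_rec @ y_prev + self.W_in @ x + self.b)
            y = np.tanh(u)
            return u, y

    # Baseline configuration: equal timescales tau = 1 for all neurons
    cell = CTRNNCell(n_in=3, n_hidden=8, tau=1.0)
    u, y = np.zeros(8), np.zeros(8)
    for x in np.random.default_rng(1).normal(size=(5, 3)):  # toy 5-step input sequence
        u, y = cell.step(x, u, y)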

Introduction

While research in natural language processing has advanced in specific disciplines such as parsing and classifying large amounts of text, human-computer communication remains a major challenge for several reasons: speech recognition is limited to good signal-to-noise conditions or well-adapted models, dialogue systems depend on a well-defined context, and language elements are difficult to reconcile with the environmental situation. Interactive robots that match human communication performance are not yet available. One reason for this is that the crossmodal binding between language, actions, and visual events is not yet fully understood and has not yet been realized in technical systems that have to interact with humans (Hagoort, 2017). Imaging techniques such as Functional Magnetic Resonance Imaging (fMRI) have provided a better understanding of which areas in the cortex are involved in natural language processing, and have shown that these areas include somatosensory regions. This means that language is grounded in most sensory and sensorimotor modalities and that the human brain architecture favors the acquisition of language by means of crossmodal integration (Pulvermüller, 2018).
