Abstract

How pre-trained transformer-based language models perform grounded language acquisition through cross-situational learning (CSL) remains unclear. In particular, understanding how meaning concepts are captured from complex sentences, along with how language-based interactions are learned, could benefit the field of human-robot interaction and help explain how children learn and ground language. In this work, we study cross-situational learning to understand the mechanisms that enable children to rapidly map sequences of words to meaning concepts. Two sequence-based models perform this mapping: (i) Echo State Networks (i.e., Reservoir Computing) and (ii) Long Short-Term Memory networks (LSTM). We consider three different input representations: (i) One-Hot encoding, (ii) BERT fine-tuned on the Juven+GoLD corpus, and (iii) Google BERT. We investigate which of these three representations, in combination with the two models, better predicts the simulated vision as a function of sentences describing the scenes. Using this approach, we test two datasets, Juven and GoLD, and show how these models generalize after only a few hundred partially described scenes via cross-situational learning. We find that both the One-Hot encoding and the fine-tuned BERT representations (for both models) significantly improve the predictions. Moreover, we argue that these models are able to learn complex relations between the contexts in which a word appears and the corresponding meaning concepts, handling polysemous and synonymous words. This aspect could be incorporated into a human-robot interaction study that examines grounding language to objects in a physical world, and it challenges researchers to investigate the use of transformer models in robotics and HCI more thoroughly.
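To make the sentence-to-concept mapping concrete, the sketch below shows one way such a sequence-based readout could look. It is a minimal, hypothetical illustration in PyTorch (toy vocabulary size, random concept targets, a single-layer LSTM with a linear readout trained under a multi-label objective), not the actual models, hyperparameters, or datasets used in the paper; all names and sizes are placeholders.

```python
# Hypothetical sketch: an LSTM reads a sentence (a sequence of word vectors)
# and predicts a fixed-size vector of concept activations, trained with a
# multi-label (BCE) objective. Vocabulary, concept inventory and data are
# illustrative placeholders, not the paper's setup.
import torch
import torch.nn as nn

VOCAB_SIZE = 50        # placeholder one-hot vocabulary
NUM_CONCEPTS = 20      # placeholder number of meaning concepts
HIDDEN_SIZE = 64

class SentenceToConcepts(nn.Module):
    def __init__(self, input_size, hidden_size, num_concepts):
        super().__init__()
        # The same readout idea applies if the inputs are BERT embeddings
        # (input_size = 768) instead of one-hot word vectors.
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, num_concepts)

    def forward(self, word_vectors):
        # word_vectors: (batch, sentence_length, input_size)
        _, (h_last, _) = self.lstm(word_vectors)
        # h_last[-1] is the final hidden state summarizing the sentence
        return self.readout(h_last[-1])   # raw logits, one per concept

model = SentenceToConcepts(VOCAB_SIZE, HIDDEN_SIZE, NUM_CONCEPTS)
loss_fn = nn.BCEWithLogitsLoss()          # multi-label concept targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: 8 sentences of 6 one-hot encoded words, each paired with a
# partially observed set of active concepts (cross-situational supervision).
sentences = torch.eye(VOCAB_SIZE)[torch.randint(0, VOCAB_SIZE, (8, 6))]
concepts = (torch.rand(8, NUM_CONCEPTS) > 0.8).float()

logits = model(sentences)
loss = loss_fn(logits, concepts)
loss.backward()
optimizer.step()
print(f"toy training loss: {loss.item():.3f}")
```

An Echo State Network variant would replace the trained LSTM with a fixed random recurrent reservoir and train only the linear readout, which is the main computational difference between the two models compared in the abstract.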
