Humans form concepts from the multimodal information obtained through multiple sensory organs, and they understand their environment and language by using these concepts to predict unobserved information. Robots, too, can make better predictions by forming concepts from multimodal information in a bottom-up manner. Meanwhile, the language model BERT acquires semantic representations of sentences that are applicable to various language-understanding tasks by learning relationships between words through masked prediction. We believe that BERT can be extended to multiple modalities and can learn multimodal representations for understanding the world by learning relationships among data across modalities. We therefore propose a multimodal learning model that uses BERT and VQ-VAEs with multimodal information as input, and we evaluate the model in terms of its prediction performance and concept formation. To verify the usefulness of the proposed model, we conducted experiments on a multimodal dataset of objects acquired by a robot. The results showed that the proposed model achieves higher category-classification accuracy and word-prediction performance than multimodal latent Dirichlet allocation (MLDA), which is used as a concept-formation model. An analysis of the model's internal representations further showed that clusters corresponding to object categories form within the model.
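To illustrate the kind of architecture the abstract describes, the sketch below shows one plausible way to combine per-modality VQ-VAE tokenizers with a BERT-style masked-prediction objective: discrete tokens from each modality (and words) are concatenated into a single sequence, encoded by a Transformer, and the masked positions are predicted per modality. This is a minimal, hypothetical sketch, not the authors' implementation; the class and variable names, the choice of three modalities, and all hyperparameters are assumptions.

```python
# Illustrative sketch (not the paper's code): VQ-VAE encoders are assumed to have
# already discretized each modality into token indices; a BERT-style Transformer
# is then trained to reconstruct masked tokens across modalities.
import torch
import torch.nn as nn

class MultimodalMaskedModel(nn.Module):
    def __init__(self, vocab_sizes, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        # One embedding table per modality vocabulary (+1 index reserved for [MASK]).
        self.embeddings = nn.ModuleList(
            [nn.Embedding(v + 1, d_model) for v in vocab_sizes]
        )
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # One prediction head per modality to recover the original (unmasked) tokens.
        self.heads = nn.ModuleList([nn.Linear(d_model, v) for v in vocab_sizes])

    def forward(self, token_seqs):
        # token_seqs: list of LongTensors [batch, len_m], one per modality,
        # where some positions have been replaced by the [MASK] index.
        embedded = [emb(t) for emb, t in zip(self.embeddings, token_seqs)]
        h = self.encoder(torch.cat(embedded, dim=1))  # joint sequence over all modalities
        # Split the encoded sequence back per modality and predict token logits.
        outputs, start = [], 0
        for t, head in zip(token_seqs, self.heads):
            outputs.append(head(h[:, start:start + t.size(1)]))
            start += t.size(1)
        return outputs

# Example usage with assumed vocabularies for vision, haptic, and word tokens.
model = MultimodalMaskedModel(vocab_sizes=[512, 512, 1000])
vision = torch.randint(0, 512, (2, 16))
haptic = torch.randint(0, 512, (2, 8))
words = torch.randint(0, 1000, (2, 6))
logits_per_modality = model([vision, haptic, words])
```

Cross-modal prediction then amounts to masking the tokens of one modality (for example, the word tokens) and reading out the corresponding head, which mirrors the word-prediction evaluation mentioned in the abstract.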