Abstract

Multimodal deep learning is an emerging research topic in machine learning and involves the parallel processing of different modalities of data such as text, images, and audiovisual data. Well-known application areas are multimodal image and video processing as well as speech recognition. In this paper, we propose a multimodal neural network that measures the similarity of textual product descriptions and images, with applications in inventory reconciliation and search engine optimization. We develop two models. The first takes image and text data, each processed by convolutional neural networks, and combines the two modalities. The second is based on a bidirectional triplet loss function. We conduct experiments on the ABO dataset and on an industry-related dataset used for the inventory reconciliation of a mechanical engineering company. Our first model achieves an accuracy of 92.37% with ResNet152 on the ABO dataset and 99.11% with MobileNetV3_Large on our industry-related dataset. By extending this model to one with three inputs, two text inputs and one image input, we greatly improve the performance and achieve an accuracy of 97.57% on the ABO dataset and 99.83% on our industry-related inventory dataset. Our second model, based on the triplet loss, achieves an accuracy of only 73.85% on the ABO dataset. However, our experiments demonstrate that multimodal networks consistently perform better when measuring the similarity of products, even in situations where one modality lacks sufficient data, because it is complemented by the other modality. Our proposed approaches open up several possibilities for further optimization of search engines.
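
To illustrate the kind of architecture the abstract describes, the following is a minimal sketch, assuming PyTorch, of a two-branch network that embeds product images and text descriptions into a shared space and is trained with an in-batch bidirectional triplet loss. This is not the authors' implementation; all class names, dimensions, and the backbone choice are illustrative assumptions.

```python
# Hypothetical sketch (not the authors' code): image and text branches map
# into a shared embedding space; a bidirectional triplet loss pulls matching
# image/text pairs together and pushes non-matching pairs apart.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models


class ImageTextEmbedder(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=256):
        super().__init__()
        # Image branch: a pretrained CNN backbone (e.g. ResNet152 or MobileNetV3)
        backbone = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        backbone.fc = nn.Identity()            # drop the classification head
        self.image_branch = backbone
        self.image_proj = nn.Linear(2048, embed_dim)

        # Text branch: token embeddings followed by a 1D CNN over the sequence
        self.token_emb = nn.Embedding(vocab_size, 128, padding_idx=0)
        self.text_conv = nn.Conv1d(128, embed_dim, kernel_size=3, padding=1)

    def embed_image(self, images):             # images: (B, 3, H, W)
        x = self.image_proj(self.image_branch(images))
        return F.normalize(x, dim=-1)

    def embed_text(self, tokens):              # tokens: (B, L) token ids
        x = self.token_emb(tokens).transpose(1, 2)       # (B, 128, L)
        x = F.relu(self.text_conv(x)).max(dim=-1).values  # max-pool over sequence
        return F.normalize(x, dim=-1)


def bidirectional_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet loss applied in both directions (image->text and
    text->image), using the hardest in-batch negative for each anchor."""
    sim = img_emb @ txt_emb.t()                           # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                         # matching pairs on the diagonal
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_i2t = sim.masked_fill(mask, float("-inf")).max(dim=1, keepdim=True).values
    neg_t2i = sim.masked_fill(mask, float("-inf")).max(dim=0, keepdim=True).values.t()
    loss_i2t = F.relu(margin - pos + neg_i2t).mean()      # image anchors, text negatives
    loss_t2i = F.relu(margin - pos + neg_t2i).mean()      # text anchors, image negatives
    return loss_i2t + loss_t2i
```

At inference time, the similarity of a product description and an image would then simply be the cosine similarity of their two embeddings, which can be thresholded for reconciliation or ranked for search.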
