Abstract

A novel technology, the fashion intelligence system, which quantifies ambiguous expressions unique to fashion, such as “casual,” “adult-casual,” and “office-casual,” was previously proposed to support users in understanding fashion. However, the existing visual-semantic embedding (VSE) model on which the system is based does not support images composed of multiple parts, such as those containing hair, tops, trousers, skirts, and shoes. We therefore propose a partial VSE (PVSE) model, which enables fine-grained learning of each part of a fashion outfit. The proposed model learns embedded representations via angular-based contrastive learning. This retains the three practical functionalities of the existing system and further enables image-retrieval tasks in which only specified parts are changed, as well as image-reordering tasks that focus on specified parts. In other words, the proposed model supports five practical functionalities despite its simple structure. Through qualitative and quantitative experiments, we demonstrate that the proposed model outperforms conventional models without increasing computational complexity.
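The abstract does not specify the loss function, but an angular-based contrastive objective over visual-semantic embeddings can be sketched as below. This is a minimal illustration assuming PyTorch; the function name, the margin value, and the use of in-batch negatives are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def angular_contrastive_loss(image_emb: torch.Tensor,
                             tag_emb: torch.Tensor,
                             margin: float = 0.2) -> torch.Tensor:
    """Hinge-style contrastive loss on cosine (angular) similarity.

    image_emb, tag_emb: (batch, dim) embeddings where row i of each
    tensor forms a matching image/tag pair; all off-diagonal pairs in
    the batch are treated as negatives.
    """
    # Normalise so that dot products equal cosine similarities (angles).
    image_emb = F.normalize(image_emb, dim=1)
    tag_emb = F.normalize(tag_emb, dim=1)

    sim = image_emb @ tag_emb.t()      # (batch, batch) cosine similarities
    pos = sim.diag().unsqueeze(1)      # similarity of each positive pair

    # Penalise any negative whose similarity comes within `margin`
    # of the corresponding positive pair's similarity.
    loss = F.relu(margin + sim - pos)
    loss.fill_diagonal_(0)             # positives incur no penalty
    return loss.mean()
```

With perfectly separated pairs (e.g. orthogonal embeddings matched to themselves) the loss is zero; mismatched pairs yield a positive penalty, pushing each image embedding toward its own tag embedding and away from the others by at least the margin, in angle.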
