Abstract

In this paper, we recast the conventional image captioning task as a new paradigm in the fashion domain: fashion item captioning. This task requires associating a group of item views and generating a longer, more fine-grained description. To link several images more efficiently, we propose a local sparse self-attention mechanism (LSAM), which allows each region to interact only with regions in its adjacent area and thus supports a lighter architecture with fewer layers. To capture more subtle details of an item, an attribute-based re-sampling strategy (ARS) is introduced to strengthen learning on low-frequency but content-related attribute words. In addition, existing fashion datasets are limited by annotation quality and a single fashion style. To bridge this gap, we propose a novel dataset for fashion item captioning, termed the Fashion Item Captioning Dataset (FICD). FICD provides a meaningful complement to existing fashion datasets, comprising 294K images and 62K real-world product descriptions with diverse linguistic styles, rich attributes, and categories. Notably, we also annotate an attribute-level sentence for each item, which allows FICD to be used not only for fashion item captioning but also for other fashion-related tasks. Moreover, complex model structures further restrict application to real-world scenarios; without bells and whistles, our framework is designed in a simple end-to-end manner. Extensive experiments demonstrate the effectiveness of our LSAM-ARS. More remarkably, LSAM-ARS achieves state-of-the-art performance on the FACAD and FICD datasets, increasing the CIDEr-D score from 65.4% to 81.8% and from 69.8% to 77.4%, respectively.
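To make the locality constraint behind LSAM concrete, the following PyTorch sketch restricts each region's attention to a fixed-size neighborhood of adjacent regions. The abstract does not specify the exact formulation, so the single-head simplification, the function names, and the `window` parameter are assumptions for illustration only, not the paper's implementation.

```python
import torch

def local_attention_mask(num_regions: int, window: int = 1) -> torch.Tensor:
    """Boolean mask where region i may attend only to regions within
    `window` positions of itself (its adjacent area). True = allowed."""
    idx = torch.arange(num_regions)
    return (idx[None, :] - idx[:, None]).abs() <= window

def local_sparse_self_attention(x: torch.Tensor, window: int = 1) -> torch.Tensor:
    """Single-head scaled dot-product self-attention restricted to a local window.
    x: (num_regions, dim) region features pooled from the grouped item images."""
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5        # (N, N) attention logits
    mask = local_attention_mask(x.size(0), window)     # (N, N) locality mask
    scores = scores.masked_fill(~mask, float("-inf"))  # block non-adjacent regions
    return torch.softmax(scores, dim=-1) @ x           # aggregate local context only
```

Because every region attends to a constant-size neighborhood rather than all regions, the per-layer cost grows linearly with the number of regions, which is consistent with the claim that LSAM permits a lighter architecture with fewer layers.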
