Abstract

Image captioning aims to teach a computer to understand the visual content of a given image and produce one or more descriptive sentences. Recently, driven by the rapid development of deep learning, image captioning has become an active research area. The encoder-decoder architecture is commonly regarded as the baseline method for image captioning. This model focuses on recognizing objects and the relationships among them. However, for fashion images, the generated sentence should not only describe the items in the input image but also mention item attributes such as texture, fabric, shape, and style. This requirement cannot be met by image-captioning methods based on the traditional encoder-decoder architecture. Our study addresses this issue by proposing an attention-based image captioning model for fashion images that covers both the items and the relationships among their detailed attributes. We introduce an efficient framework for fashion image captioning that incorporates spatial attention into the traditional encoder-decoder architecture. Our model generates fashion image captions using a spatial attention mechanism that dynamically modulates the sentence-generation context over multi-layer feature maps. Experiments were conducted on Fashion-Gen, one of the state-of-the-art fashion image datasets, and achieved CIDEr/ROUGE-L/BLEU-4 scores of 0.913, 0.502, and 0.221, respectively. The experiments consistently show that our proposed model significantly improves performance on the fashion-image captioning task and even surpasses the baseline methods on the same fashion benchmark dataset.
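As a rough illustration of the spatial-attention decoding step the abstract refers to, the following is a minimal PyTorch sketch written under our own assumptions: the layer names (feat_proj, hidden_proj, score), tensor sizes, and the single-feature-map simplification are illustrative and do not reproduce the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """One decoding step of additive spatial attention over a CNN feature map.
    Hypothetical sketch: layer names and sizes are assumptions, not the paper's code."""

    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)     # project encoder features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim) # project decoder state
        self.score = nn.Linear(attn_dim, 1)                # scalar attention score per region

    def forward(self, features, hidden):
        # features: (batch, regions, feat_dim) -- flattened spatial grid, e.g. 7x7 -> 49 regions
        # hidden:   (batch, hidden_dim)        -- current decoder (LSTM) hidden state
        e = torch.tanh(self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1))
        alpha = F.softmax(self.score(e).squeeze(-1), dim=1)   # attention weights (batch, regions)
        context = (alpha.unsqueeze(-1) * features).sum(dim=1) # weighted context (batch, feat_dim)
        return context, alpha

# Usage: attend over a 7x7x512 feature map before predicting the next caption word.
attn = SpatialAttention()
features = torch.randn(2, 49, 512)  # encoder output for a batch of 2 images
hidden = torch.randn(2, 512)        # decoder hidden state at the current time step
context, alpha = attn(features, hidden)
print(context.shape, alpha.shape)   # torch.Size([2, 512]) torch.Size([2, 49])
```

In the multi-layer variant described in the abstract, a context vector of this kind would be computed from feature maps at several encoder depths and combined before the word-prediction step.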
