Abstract

In this paper, we explore how to exploit video context to facilitate fashion parsing. Instead of annotating a large number of fashion images, we present a general, affordable and scalable solution that harnesses the rich contexts in easily available fashion videos to boost any existing fashion parser. First, we crawl a large corpus of unlabelled fashion videos. Then, for each video, cross-frame contexts are exploited for human pose co-estimation and subsequent video co-parsing, yielding satisfactory fashion parsing results for all frames. More specifically, SIFT Flow and superpixel matching are used to build correspondences across frames, and these correspondences then contextualize the pose estimation and fashion parsing in individual frames. Finally, the parsed video frames serve as the reference corpus for the non-parametric fashion parsing component of the whole solution. Extensive experiments on two benchmark fashion datasets, as well as a newly collected and challenging Fashion Icon (FI) dataset, demonstrate the encouraging performance gains from our general fashion parsing pipeline.
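To make the cross-frame correspondence idea concrete, the sketch below shows a minimal superpixel-matching step in pure Python. It is an illustrative assumption, not the paper's implementation: each superpixel is reduced to a hypothetical descriptor (mean RGB colour plus centroid), and each superpixel in one frame is greedily matched to its nearest neighbour in the next frame under a weighted colour-plus-spatial cost. The function names, the descriptor, and the weights are all invented for illustration; the paper's actual pipeline also uses SIFT Flow, which is not reproduced here.

```python
import math

def match_superpixels(frame_a, frame_b, colour_weight=1.0, spatial_weight=0.05):
    """Greedily match each superpixel in frame_a to its nearest
    neighbour in frame_b (an illustrative stand-in for the paper's
    cross-frame correspondence step).

    Each frame is a list of (mean_rgb, centroid) tuples, where
    mean_rgb is (r, g, b) and centroid is (x, y).  Returns a list of
    (index_in_a, index_in_b, matching_cost) triples.
    """
    matches = []
    for i, (rgb_a, pos_a) in enumerate(frame_a):
        best_j, best_cost = None, float("inf")
        for j, (rgb_b, pos_b) in enumerate(frame_b):
            # Appearance term: Euclidean distance in RGB space.
            colour_d = math.dist(rgb_a, rgb_b)
            # Smoothness term: superpixels should not move far
            # between adjacent video frames.
            spatial_d = math.dist(pos_a, pos_b)
            cost = colour_weight * colour_d + spatial_weight * spatial_d
            if cost < best_cost:
                best_j, best_cost = j, cost
        matches.append((i, best_j, best_cost))
    return matches

# Toy usage: a red and a blue superpixel that swap list order
# between frames are still matched by appearance and position.
frame_a = [((200, 30, 30), (10, 10)), ((30, 30, 200), (50, 50))]
frame_b = [((35, 30, 195), (52, 48)), ((198, 28, 32), (12, 9))]
print(match_superpixels(frame_a, frame_b))
```

In the full pipeline such correspondences would then propagate pose and garment labels across frames, so that each frame's parse is constrained by its neighbours rather than estimated in isolation.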
