Vision-language joint representation learning for sketch less facial image retrieval

Dawei Dai,Shiyu Fu,Yingge Liu,Guoyin Wang

doi:10.1016/j.inffus.2024.102535

Journal: Information Fusion

Publication Date: Jun 26, 2024

Abstract

The traditional sketch-based facial image retrieval (SBFIR) framework assumes that a high-quality facial sketch has been prepared before the retrieval task begins. However, drawing such a sketch requires considerable skill and is time-consuming, which limits the framework's applicability. The sketch less facial image retrieval (SLFIR) framework aims to break these barriers through human–computer interaction during the sketching process. The primary challenge for SLFIR is that initial sketches (those from the early stages of sketching) contain only local details and differ significantly among users, resulting in poor performance at early stages and weak generalization in practical testing. In this study, we developed a vision–language pretraining model to align the representations of facial images and their associated semantics. Based on this framework, we proposed a method for learning joint representations by fusing sketches with prior semantics, thereby enriching the information in initial sketches. Specifically, (1) we developed a series of operations to improve the quality of facial image–text pairs in the LAION-Face dataset and trained a facial vision–language pretraining (FVLP) model to align each facial image with its semantics at the feature level; (2) using FVLP as the backbone, we designed a convolutional attention module to fuse the multiscale features extracted from the image encoder, which facilitated the learning of a multimodal representation crucial for the final retrieval process. In experiments, our proposed method achieved state-of-the-art performance at early stages on two public datasets, significantly outperforming other baselines in early retrieval, and exhibited good generalization during practical testing. Code is available at: https://github.com/ddw2AIGROUP2CQUPT/FVLP.
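
The abstract does not state which objective is used to align facial images with their semantics. As an illustration only, the sketch below assumes a CLIP-style symmetric contrastive loss, a common choice for vision–language pretraining. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper or its repository.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Assumed CLIP-style loss: matched image/text pairs share the same index."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Similarity matrix: rows are images, columns are texts.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Transpose to score each text against all images.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy 2-D embeddings (hypothetical): pairs that line up versus pairs that are swapped.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Under this assumed objective, the loss is low when each image embedding is closest to its own text embedding and high when the pairing is wrong, which is the sense in which the two modalities are aligned at the feature level.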
