Can we predict a person’s appearance solely from their voice? This article explores this question by generating a face from an unheard voice segment. Our proposed method, VoiceStyle, combines cross-modal representation learning with generative modeling, allowing voice semantic cues to be incorporated into the generated face. In the first stage, we introduce cross-modal prototype contrastive (CMPC) learning to establish the association between voice and face. Recognizing that real-world unlabeled data contain false negative and deviated positive instances, we not only use voice–face pairs from the same video but also construct additional semantic positive pairs through unsupervised clustering, and we recalibrate instances according to their similarity to cluster centers in the other modality. In the second stage, we harness the powerful generative capability of StyleGAN to produce faces, optimizing a latent code in StyleGAN’s latent space under the guidance of the learned voice–face alignment. Because the starting point strongly influences this optimization, we select it automatically using the face prototype derived from the input voice. The entire pipeline is self-supervised, requiring no manual annotations. Extensive experiments demonstrate the effectiveness of VoiceStyle in both cross-modal representation learning and voice-based face generation.
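To make the first-stage idea concrete, below is a minimal sketch of a prototype-based cross-modal contrastive loss with instance recalibration, in the spirit of CMPC. It is an illustration under stated assumptions, not the paper’s implementation: the function name `cmpc_loss`, the temperature `tau`, the tensor shapes, and the use of precomputed face cluster assignments (e.g., from k-means over face embeddings) are all hypothetical choices.

```python
# Sketch of the cross-modal prototype contrastive (CMPC) idea: a voice
# embedding is pulled toward the prototype (cluster center) of its paired
# face, with per-instance weights that down-weight pairs whose face sits
# far from its assigned prototype. Names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def cmpc_loss(voice_emb, face_emb, face_protos, face_assign, tau=0.07):
    """voice_emb:   (B, D) L2-normalized voice embeddings
       face_emb:    (B, D) L2-normalized embeddings of the paired faces
       face_protos: (K, D) L2-normalized face cluster centers (prototypes)
       face_assign: (B,)   cluster index assigned to each paired face
    """
    # Similarity of each voice embedding to every face prototype.
    logits = voice_emb @ face_protos.t() / tau                    # (B, K)

    # Instance recalibration: weight each pair by how well its paired face
    # agrees with its own cluster center (cosine similarity, clipped at 0).
    agree = (face_emb * face_protos[face_assign]).sum(dim=1)      # (B,)
    weights = agree.clamp(min=0.0)

    # The "positive" for a voice is the prototype of its paired face's
    # cluster, so clustering supplies semantic positives beyond the single
    # same-video face and mitigates false negatives in the batch.
    per_sample = F.cross_entropy(logits, face_assign, reduction="none")
    return (weights * per_sample).mean()

# Toy usage with random tensors standing in for encoder outputs and clustering.
B, D, K = 8, 128, 4
v = F.normalize(torch.randn(B, D), dim=1)
f = F.normalize(torch.randn(B, D), dim=1)
protos = F.normalize(torch.randn(K, D), dim=1)
assign = torch.randint(0, K, (B,))
print(cmpc_loss(v, f, protos, assign).item())
```

In the full method, a symmetric term (face-to-voice prototypes) and the second-stage StyleGAN latent optimization, initialized from the voice-derived face prototype, would build on representations trained with a loss of this kind.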