Abstract

Generating the data of an absent modality from available modal information is valuable for realizing audio-visual cross-modal information complementarity. However, existing audio-visual generation methods require strict temporal synchronization between the data of the two modalities, which is time-consuming and expensive to collect. In this paper, considering the extensive audio-visual semantic associations, we propose a semantic consistency audio-to-image generative adversarial network (SCAIGAN) that generates visual images with the corresponding semantics directly from audio spectrograms. Specifically, our model exploits three mechanisms. First, a self-attention mechanism is added to the encoder to better capture the global features and geometric structure of the high-dimensional data. Second, a projection mechanism is used in the discriminator to constrain the generator, so that a form of cross-modal self-supervision under semantic consistency is embedded. Finally, self-modulation batch normalization is applied to the generator to accelerate convergence and improve the quality of the generated images. Experiments demonstrate that our model generates clear and diverse visual images on both instrument and face datasets and achieves better classification accuracy than other state-of-the-art methods. Our code will be made publicly available at https://github.com/PengchengZhao1001/AV-Correlation.
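
To illustrate the third mechanism, the sketch below shows how self-modulation batch normalization can be implemented in PyTorch: the scale and shift of each batch-normalization layer in the generator are predicted from the latent (or audio-derived) vector z by small MLPs, rather than learned as fixed parameters. This is a minimal sketch of the general technique, not the authors' released code; the class name, layer sizes, and the assumption that z is the conditioning vector are illustrative.

```python
import torch
import torch.nn as nn

class SelfModBatchNorm2d(nn.Module):
    """Batch normalization whose gamma/beta are predicted from a latent vector z
    (self-modulation), as used in the generator described in the abstract.
    Hidden size and structure are illustrative assumptions."""

    def __init__(self, num_features, z_dim, hidden_dim=128):
        super().__init__()
        # affine=False: the scale and shift come from z instead of fixed parameters
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma_mlp = nn.Sequential(
            nn.Linear(z_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_features))
        self.beta_mlp = nn.Sequential(
            nn.Linear(z_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_features))

    def forward(self, x, z):
        # x: (B, C, H, W) feature map inside the generator; z: (B, z_dim) latent vector
        h = self.bn(x)
        gamma = self.gamma_mlp(z).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.beta_mlp(z).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        return (1.0 + gamma) * h + beta


# Usage sketch: modulate a generator feature map with the latent vector.
x = torch.randn(4, 64, 16, 16)   # intermediate generator features
z = torch.randn(4, 128)          # latent / audio-derived conditioning vector
out = SelfModBatchNorm2d(num_features=64, z_dim=128)(x, z)
```

Predicting the normalization parameters from z in this way lets the conditioning signal influence every layer of the generator, which is the usual motivation for self-modulation: faster convergence and higher-quality samples than plain batch normalization.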
