Arbitrary style transfer is attracting increasing attention in the computer vision community due to its application flexibility. Existing approaches either directly fuse deep style features with deep content features or adaptively normalize content features for global statistical matching. Although effective, these methods are prone to locally unnatural outputs and artifacts because they do not explore the global contextual semantic distribution of style image features. In this paper, a novel global context self-attentional network (GCSANet) is proposed to efficiently generate high-quality stylized results based on the global semantic spatial distributions of style images. First, a context modeling module aggregates the deep features of style images into global context features. Then, a feature transform module captures channel-wise interdependencies. Finally, the style features are appropriately aggregated at each location of the content image. In addition, novel external contrastive losses are proposed to balance the distributions of content and style features, ensuring reasonable texture patterns in the stylized images. Ablation studies validate the effectiveness of the proposed components, and extensive quantitative and qualitative experiments demonstrate the superiority of our method for real-time arbitrary image/video style transfer.
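The three-stage pipeline in the abstract (context modeling, channel-wise feature transform, spatial aggregation) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the spatial attention score, the bottleneck transform, and all weights below are illustrative assumptions, standing in for the learned modules of GCSANet.

```python
import numpy as np

def global_context_fuse(content, style, rng=None):
    """Hedged sketch of global-context style fusion (assumed design, not
    the paper's exact GCSANet): aggregate style features into one global
    context vector via spatial softmax attention, transform it
    channel-wise, and broadcast it to every content location.

    content: (C, H, W) content feature map
    style:   (C, Hs, Ws) style feature map
    """
    C, Hs, Ws = style.shape
    s = style.reshape(C, Hs * Ws)                    # (C, N) flattened style

    # Context modeling: softmax attention over style spatial positions
    # (a simple channel-sum score stands in for a learned 1x1 conv).
    attn_logits = s.sum(axis=0)                      # (N,)
    attn = np.exp(attn_logits - attn_logits.max())
    attn /= attn.sum()
    context = s @ attn                               # (C,) global context

    # Feature transform: capture channel interdependencies with a
    # bottleneck of two linear maps; random weights are placeholders
    # for the learned transform.
    rng = np.random.default_rng(0) if rng is None else rng
    W1 = rng.standard_normal((C // 2, C)) * 0.01
    W2 = rng.standard_normal((C, C // 2)) * 0.01
    transformed = W2 @ np.maximum(W1 @ context, 0.0)  # (C,) with ReLU

    # Aggregation: add the transformed style context at every
    # spatial location of the content features.
    return content + transformed[:, None, None]
```

In the actual network these steps would be learned convolutions trained end-to-end; the sketch only shows how a single global context vector, rather than per-location style statistics, is shared across all content positions.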
