Abstract

3D scene stylization aims to generate visually compelling stylized images from arbitrary novel views according to a given style reference. Existing image-driven 3D scene stylization methods require a specific reference style image and cannot produce diverse stylization results by combining style information from different aspects. In this paper, we propose a text-driven 3D scene stylization method based on semantic contrastive learning, which takes Neural Radiance Fields (NeRF) as the 3D scene representation and generates diverse 3D stylized scenes by leveraging the semantic capabilities of the Contrastive Language-Image Pre-Training (CLIP) model. To fully exploit this semantic knowledge and produce finely stylized results, we design a CLIP-based semantic contrast estimation loss, which avoids both the global stylistic inconsistency caused by NeRF ray sampling and the tendency toward neutral stylization caused by semantic averaging in the CLIP embedding space. In addition, to reduce the memory burden arising from NeRF ray sampling, we propose a novel ray sampling strategy with gradient accumulation that optimizes the NeRF rendering process. Experimental results show that our method generates high-quality and plausible results with cross-view consistency. Moreover, it enables the creation of new styles that match the target text by combining multiple style domains. The code will be available at .
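
To make the memory-reduction idea concrete, the following is a minimal sketch of gradient accumulation over NeRF ray chunks, assuming a PyTorch-style training loop; the toy nerf model, the sample_rays helper, and the placeholder loss are hypothetical stand-ins for illustration only, not the authors' actual implementation or stylization loss.

    # Minimal sketch (assumptions): gradient accumulation over NeRF ray chunks.
    # The model, ray sampler, and loss below are illustrative placeholders.
    import torch
    import torch.nn as nn

    nerf = nn.Sequential(nn.Linear(6, 256), nn.ReLU(), nn.Linear(256, 3))  # toy radiance field
    optimizer = torch.optim.Adam(nerf.parameters(), lr=5e-4)

    def sample_rays(n):
        # Placeholder: origins + directions for n rays (6-D input per ray).
        return torch.randn(n, 6)

    rays_per_step = 4096   # total rays contributing to one stylization update
    chunk_size = 512       # rays rendered per forward/backward pass (bounds memory)
    n_chunks = rays_per_step // chunk_size

    optimizer.zero_grad()
    for _ in range(n_chunks):
        rays = sample_rays(chunk_size)
        rgb = nerf(rays)                     # render one chunk of rays
        loss = rgb.pow(2).mean() / n_chunks  # placeholder loss, scaled so accumulated
                                             # gradients match the full-batch average
        loss.backward()                      # gradients accumulate across chunks
    optimizer.step()                         # single parameter update after all chunks

Because each backward pass only holds activations for one chunk, peak memory is bounded by the chunk size rather than the total number of rays per update.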
