Abstract

Convolutional Neural Networks (CNNs) have long been the dominant approach to scene sketch semantic segmentation, but their performance appears to have plateaued due to the limited local receptive fields of convolutions. To address this problem, we propose SketchSeger, a hierarchical Transformer-based model for scene sketch semantic segmentation. Accurate scene sketch segmentation relies on both high-level semantics and low-level details, so we design an MLP-based feature fusion module in the decoder that efficiently merges feature maps captured at different scales. Compared to CNN-based models, SketchSeger exhibits stronger contextual modeling and obtains global receptive fields even in its shallow layers. Beyond model architecture, the absence of large-scale pre-training datasets also presents a significant obstacle to advancing scene sketch semantic segmentation. To promote further research, we propose a novel hand-drawn-style scene sketch synthesis method and use it to synthesize a dataset of 300,000 annotated scene sketches. Extensive experiments and visual analyses validate the efficacy of the proposed SketchSeger model and dataset synthesis approach: SketchSeger significantly outperforms state-of-the-art models of similar parameter scale on three benchmark datasets (SketchyScene, SKY-Scene, and TUB-Scene). Code and datasets are available at https://github.com/jayangcs/SketchSeger.
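
To illustrate the idea of an MLP-based decoder that fuses hierarchical Transformer features, the sketch below shows one common realization of this pattern (per-stage linear projections, upsampling to a shared resolution, concatenation, fusion, per-pixel classification). It is a minimal illustration under assumed settings: the module name MLPFusionDecoder, the channel widths, and the class count are hypothetical and not taken from the SketchSeger codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPFusionDecoder(nn.Module):
    """Hypothetical sketch of an MLP-based multi-scale fusion decoder:
    project each encoder stage to a common width with a per-pixel linear
    layer, upsample to a shared resolution, concatenate, fuse, classify."""

    def __init__(self, in_channels=(64, 128, 320, 512), embed_dim=256,
                 num_classes=46):  # assumed dims; not SketchSeger's actual config
        super().__init__()
        # 1x1 convolutions act as per-pixel linear (MLP) projections.
        self.projs = nn.ModuleList(
            nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(embed_dim * len(in_channels), embed_dim, kernel_size=1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, feats):
        # feats: list of feature maps from the hierarchical Transformer
        # encoder, highest resolution first (e.g. strides 4, 8, 16, 32).
        target_size = feats[0].shape[2:]
        upsampled = [
            F.interpolate(proj(f), size=target_size,
                          mode="bilinear", align_corners=False)
            for proj, f in zip(self.projs, feats)
        ]
        fused = self.fuse(torch.cat(upsampled, dim=1))
        return self.classifier(fused)  # per-pixel class logits
```

Merging all scales at a single shared resolution lets the decoder combine global context from the deep, coarse stages with the fine stroke-level detail of the shallow stages, which is the trade-off the abstract highlights for scene sketches.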
