Sketch-based image retrieval (SBIR) is a cross-domain retrieval problem whose main challenge is that sketches and images differ in many respects, such as visual content and level of abstraction. Existing approaches bridge the cross-domain gap by using deep neural networks (DNNs) to learn a joint global feature representation for sketches and images. To improve retrieval performance with local features, some researchers have introduced attention mechanisms and multi-scale inputs. However, these methods cannot effectively locate the regions where sketch–image pairs differ in fine detail, and their discriminative ability is insufficient. Motivated by the human habit of observing from the global view to local details, this paper constructs a multiscale network with the help of attention modules. The network is stacked in a recurrent way and focuses step by step on the most discriminative regions, from coarse to fine, through region-specific attention models. A part-aware attention module is designed for each scale of the network to make it robust to local spatial misalignment. To address the cross-domain challenge of SBIR, we extend the network into a three-branch structure that integrates sketch, edge-map, and RGB-photo branches to acquire a more discriminative cross-domain feature representation. The proposed network is alternately optimized with a weighted regularization triplet loss, a classification loss, and a pairwise ranking loss. We conduct extensive experiments on several commonly used SBIR datasets; quantitative and qualitative results demonstrate that our network learns more fine-grained features and outperforms all baselines.
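Since the abstract describes the architecture only at a high level, the following is a minimal PyTorch sketch of one plausible instantiation of the three-branch structure with a combined metric and classification objective. All names (`ThreeBranchNet`, `triplet_step`), the shared-weight ResNet-18 backbone, and the hyperparameters are illustrative assumptions, and a standard triplet margin loss stands in for the paper's weighted regularization triplet loss; this is not the authors' implementation.

```python
# Minimal sketch of a three-branch SBIR network with a triplet objective.
# All module names and hyperparameters are illustrative assumptions;
# the paper's actual architecture and losses may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class ThreeBranchNet(nn.Module):
    """Sketch, edge-map, and RGB-photo branches sharing one backbone.

    Weight sharing across branches is an assumption made here so that
    all three domains are embedded into a single feature space."""
    def __init__(self, embed_dim=512, num_classes=100):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d pooled features
        self.backbone = backbone
        self.embed = nn.Linear(512, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        f = self.embed(self.backbone(x))
        # L2-normalized embedding for retrieval, logits for classification
        return F.normalize(f, dim=1), self.classifier(f)

def triplet_step(net, sketch, photo_pos, photo_neg, labels, margin=0.3):
    """One loss evaluation: sketch anchors against matching (positive) and
    non-matching (negative) photos, plus a classification term."""
    a, logits = net(sketch)
    p, _ = net(photo_pos)
    n, _ = net(photo_neg)
    trip = F.triplet_margin_loss(a, p, n, margin=margin)
    cls = F.cross_entropy(logits, labels)
    return trip + cls

if __name__ == "__main__":
    net = ThreeBranchNet()
    s  = torch.randn(4, 3, 224, 224)   # sketches (replicated to 3 channels)
    pp = torch.randn(4, 3, 224, 224)   # matching photos
    pn = torch.randn(4, 3, 224, 224)   # non-matching photos
    y  = torch.randint(0, 100, (4,))
    loss = triplet_step(net, s, pp, pn, y)
    loss.backward()
    print(loss.item())
```

Under the alternating scheme described in the abstract, a training loop would presumably switch among the triplet, classification, and pairwise ranking terms across iterations rather than summing all of them at every step, and the edge-map branch would be fed edge maps extracted from the photos.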