Many deep learning networks have demonstrated remarkable proficiency in general object detection tasks. However, detecting ships in synthetic aperture radar (SAR) imagery is more challenging because of the complex and varied nature of SAR scenes. Moreover, sophisticated large-scale models demand substantial computational resources and hardware expenses. To address these issues, we propose a new framework called the stepwise attention-guided multiscale feature fusion network (SAFN). Specifically, we introduce a stepwise attention mechanism designed to selectively emphasize relevant object information and filter out irrelevant details in a step-by-step manner. First, a novel LGA-FasterNet is proposed, which combines the lightweight FasterNet backbone with lightweight global attention (LGA) to realize expressive feature extraction while reducing the model's parameter count. To mitigate the effects of scale variation and complex backgrounds, a deformable attention bidirectional fusion network (DA-BFNet) is proposed, which introduces a novel deformable location attention (DLA) block and a novel deformable recognition attention (DRA) block, integrated through bidirectional connections to achieve enhanced feature fusion. Finally, we substantiate the robustness of the framework through extensive testing on the publicly available SAR datasets HRSID and SSDD. The experimental results demonstrate the competitive performance of our approach, showing a significant improvement in ship detection accuracy over several state-of-the-art methods.
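The bidirectional fusion idea underlying DA-BFNet can be illustrated with a generic top-down/bottom-up feature pyramid sketch. This is a minimal, illustrative simplification in NumPy, not the paper's actual DA-BFNet: it omits the deformable location/recognition attention blocks and learned weights, and all function names here are hypothetical.

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbor upsampling along the two spatial axes
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2x(x):
    # 2x2 average pooling along the two spatial axes
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def bidirectional_fuse(feats):
    """Fuse a feature pyramid (finest level first) with a top-down
    pass followed by a bottom-up pass, as in bidirectional FPNs."""
    # top-down: propagate coarse semantic context to finer levels
    td = list(feats)
    for i in range(len(td) - 2, -1, -1):
        td[i] = td[i] + upsample2x(td[i + 1])
    # bottom-up: propagate fine localization cues to coarser levels
    out = list(td)
    for i in range(1, len(out)):
        out[i] = out[i] + downsample2x(out[i - 1])
    return out

# three pyramid levels (16x16, 8x8, 4x4), each with 8 channels
pyramid = [np.random.rand(16, 16, 8),
           np.random.rand(8, 8, 8),
           np.random.rand(4, 4, 8)]
fused = bidirectional_fuse(pyramid)
```

In the full model, the plain additions above would be replaced by attention-guided fusion (the DLA/DRA blocks), but the two-pass bidirectional flow of multiscale features is the same structural idea.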