Side-scan sonar plays a crucial role in underwater exploration, and autonomous target detection in side-scan sonar images is vital for surveying unknown underwater environments. However, the complexity of the underwater environment, the small number of highlighted areas on targets, blurred feature details, and the difficulty of collecting side-scan sonar data make high-precision autonomous target recognition in these images challenging. This paper addresses the problem by improving the You Only Look Once v7 (YOLOv7) model to achieve high-precision object detection in side-scan sonar images. First, because side-scan sonar images contain large areas of irrelevant information, the Swin-Transformer is introduced for dynamic attention and global modeling, which strengthens the model’s focus on target regions. Second, the Convolutional Block Attention Module (CBAM) is used to further improve feature representation and the model’s accuracy. Finally, to address the uncertainty of the geometric features of side-scan sonar targets, a feature scaling factor is incorporated into the YOLOv7 model. Experiments on a public dataset first verify the necessity of the attention mechanisms. Subsequent experiments on our side-scan sonar (SSS) image dataset show that the improved YOLOv7 model achieves a mean average precision of 87.9% at mAP0.5 and 49.23% at mAP0.5:0.95, which are 9.28% and 8.41% higher, respectively, than those of the original YOLOv7 model. The improved YOLOv7 algorithm therefore shows great potential for object detection and recognition in side-scan sonar images.
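
The two reported metrics differ in how strictly a predicted box must overlap the ground truth. The following minimal sketch illustrates that difference. It assumes that mAP0.5:0.95 follows the common COCO-style convention of averaging over ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05, which the abstract does not state explicitly. The function names and example boxes are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the helper names and the example boxes are
# hypothetical, and the threshold grid assumes the COCO-style convention.

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so that disjoint boxes have no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Assumed threshold grid for mAP0.5:0.95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def thresholds_satisfied(pred_box, gt_box):
    """Return the IoU thresholds at which the prediction counts as a match."""
    overlap = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if overlap >= t]

pred = (0.0, 0.0, 10.0, 10.0)   # hypothetical predicted box
gt = (2.0, 0.0, 12.0, 10.0)     # hypothetical ground-truth box
print(iou(pred, gt))
print(thresholds_satisfied(pred, gt))
```

In this example the prediction counts as correct at the single threshold used by mAP0.5 but fails the stricter thresholds, which is why mAP0.5:0.95 is typically much lower than mAP0.5 for the same detector.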