This paper addresses the challenge of low detection accuracy for grape clusters caused by scale differences, illumination changes, and occlusion in realistic, complex scenes. We propose a multi-scale feature fusion and augmentation YOLOv7 network to enhance the detection accuracy of grape clusters across variable environments. First, we design a Multi-Scale Feature Extraction Module (MSFEM) to enhance feature extraction for small-scale targets. Second, we propose the Receptive Field Augmentation Module (RFAM), which uses dilated convolution to expand the receptive field and improve detection accuracy for objects of various scales. Third, we present the Spatial Pyramid Pooling Cross Stage Partial Concatenation Faster (SPPCSPCF) module to fuse multi-scale features, improving accuracy and accelerating model training. Finally, we integrate the Residual Global Attention Mechanism (ResGAM) into the network to better focus on crucial regions and features. Experimental results show that our proposed method achieves a mAP@0.5 of 93.29% on the GrappoliV2 dataset, an improvement of 5.39% over YOLOv7. Additionally, our method increases Precision, Recall, and F1 score by 2.83%, 3.49%, and 0.07, respectively. Compared to state-of-the-art detection methods, our approach demonstrates superior detection performance and adaptability to various environments for detecting grape clusters.
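To illustrate why dilated convolution expands the receptive field as claimed for RFAM, the sketch below applies the standard receptive-field recurrence r ← r + (k − 1)·d·j (with jump j multiplied by the stride at each layer) to a stack of 3×3 convolutions. The specific kernel sizes and dilation rates (1, 2, 3) are illustrative assumptions; the abstract does not give RFAM's actual configuration.

```python
def receptive_field(layers):
    """Effective receptive field of a stack of convolution layers.

    Each layer is a (kernel_size, stride, dilation) tuple. Uses the
    standard recurrence: r <- r + (k - 1) * d * j, then j <- j * s,
    starting from r = 1, j = 1 (a single input pixel).
    """
    r, j = 1, 1
    for k, s, d in layers:
        r += (k - 1) * d * j
        j *= s
    return r

# Hypothetical RFAM-style branch: three 3x3, stride-1 convs
# with increasing dilation rates (assumed values, not from the paper).
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 3)]
# Baseline: the same three 3x3 convs without dilation.
plain = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]

print(receptive_field(dilated))  # 13
print(receptive_field(plain))    # 7
```

With the same parameter count, the dilated stack nearly doubles the receptive field (13 vs. 7 pixels), which is the mechanism by which such a module can cover larger grape clusters without extra depth.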