With the increasing popularity of online fruit sales, accurately predicting fruit yields has become crucial for optimizing logistics and storage strategies. However, existing manual vision-based systems and sensor-based methods have proven inadequate for fruit yield counting, as they struggle with crop overlap and variable lighting conditions. Recently, CNN-based object detection models have emerged as a promising solution in computer vision, but their effectiveness in agricultural scenarios is limited by challenges such as occlusion and variation in appearance among fruits of the same kind. To address these issues, we propose a novel variant model that combines the self-attention mechanism of the Vision Transformer, a non-CNN network architecture, with YOLOv7, a state-of-the-art object detection model. Our model utilizes two attention mechanisms, CBAM and CA, and is trained and tested on a dataset of apple images. To enable fruit counting across video frames in complex environments, we incorporate two multi-object tracking methods based on Kalman filtering and motion trajectory prediction, namely SORT and Cascade-SORT. Our results show that the YOLOv7-CA model achieved a 91.3% mAP and a 0.85 F1 score, representing a 4% improvement in mAP and a 0.02 improvement in F1 score over YOLOv7 alone. Furthermore, the multi-object tracking methods demonstrated a significant improvement in MAE for inter-frame counting across all three test videos, with our multi-object tracking method achieving a 0.642 improvement over YOLOv7 alone. These findings suggest that our proposed model has the potential to improve fruit yield assessment methods and could have implications for decision-making in the fruit industry.
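To make the inter-frame counting metric concrete, the following is a minimal Python sketch of how an MAE over per-video fruit counts could be computed. It assumes that a tracker assigns each fruit a persistent integer ID, that the predicted count for a video is the number of distinct IDs seen, and that MAE is averaged over videos; the abstract does not specify these details. The function names, track IDs, and ground-truth counts are hypothetical and serve only as illustration, not as reported data.

```python
from typing import Iterable, Sequence


def count_unique_tracks(frames: Iterable[Iterable[int]]) -> int:
    """Count distinct track IDs seen across all frames of one video."""
    seen = set()
    for ids_in_frame in frames:
        seen.update(ids_in_frame)
    return len(seen)


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical track IDs per frame for three short videos (illustrative only).
videos = [
    [[1, 2], [1, 2, 3], [2, 3]],
    [[1], [1, 4], [4, 5]],
    [[7, 8], [8], [8, 9, 10]],
]
# Hypothetical manually counted fruit totals for the same videos.
ground_truth = [3, 4, 4]

predicted = [count_unique_tracks(video) for video in videos]
mae = mean_absolute_error(predicted, ground_truth)
print(predicted, round(mae, 3))
```

Counting distinct IDs rather than summing per-frame detections is what prevents the same fruit from being counted once in every frame in which it appears, which is the motivation for pairing a detector with a tracker.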