Training machine learning (ML) models for artificial intelligence (AI) and computer-vision-based object detection typically requires large, labeled datasets, a process often burdened by significant human effort and by the high costs of imaging systems and image acquisition. This research aimed to simplify image-data collection for object detection in orchards by avoiding traditional fieldwork with imaging sensors. Using OpenAI's DALL·E, a large language model (LLM)-based system for realistic image generation, we generated and annotated a cost-effective dataset. This dataset, generated exclusively by the LLM, was then used to train two state-of-the-art deep learning models: YOLOv10 and YOLO11. The YOLO11 model for apple detection was trained in its five configurations (YOLO11n, YOLO11s, YOLO11m, YOLO11l, and YOLO11x), and the YOLOv10 model in its six configurations (YOLOv10n, YOLOv10s, YOLOv10m, YOLOv10b, YOLOv10l, and YOLOv10x); the trained models were then tested on real-world (outdoor orchard) images captured by a digital camera (Nikon D5100) and a consumer RGB-D camera (Microsoft Azure Kinect). YOLO11 outperformed YOLOv10: YOLO11x and YOLO11n exhibited superior precision of 0.917 and 0.916, respectively; YOLO11l demonstrated the highest recall among its counterparts, 0.889; and YOLO11n excelled in mean average precision (mAP@50), achieving the highest value of 0.958. Validation against real images of the Scilate apple variety collected with the digital camera (Nikon D5100) in a commercial orchard environment showed a highest precision of 0.874 for YOLO11s, recall of 0.877 for YOLO11l, and mAP@50 of 0.91 for YOLO11x. Additionally, validation against real images collected with the Microsoft Azure Kinect camera over the same orchard showed highest precision, recall, and mAP@50 of 0.924, 0.781, and 0.855, respectively, all with YOLO11x.
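The precision and recall figures above are computed by matching predicted boxes to ground-truth boxes at an IoU threshold of 0.5. A minimal sketch of that matching logic is shown below; the corner-coordinate box format and the greedy one-to-one matching are assumptions for illustration, not the exact procedure of the Ultralytics evaluation toolchain used in the study.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def precision_recall(preds, gts, iou_thr=0.5):
    """Greedy one-to-one matching: each prediction claims the best
    unmatched ground-truth box with IoU >= iou_thr (illustrative only)."""
    matched, tp = set(), 0
    for p in preds:
        best, best_iou = None, iou_thr
        for i, g in enumerate(gts):
            if i in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            matched.add(best)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```

mAP@50 extends this idea by sweeping the confidence threshold and averaging precision over recall levels, which is why it can exceed both single-operating-point precision and recall values reported above.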
All variants of YOLO11 demonstrated a pre-processing time of just 0.2 milliseconds (ms), faster than any variant of YOLOv10. On the LLM-generated training dataset, YOLO11n achieved the fastest inference time of 3.2 ms, while YOLOv10n, the fastest among the YOLOv10 variants, had a longer inference time of 5.5 ms. Likewise, on the sensor-based images, YOLO11n achieved the fastest inference times of 7.1 ms (Nikon D5100 camera images) and 4.7 ms (Azure Kinect images). This study presents a pathway for generating large image datasets using LLMs for challenging agricultural fields with minimal or no labor-intensive field data collection, which could accelerate the development and deployment of computer vision and robotic technologies in orchard environments.
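Per-stage times such as the pre-processing and inference figures above are averages of wall-clock measurements over many images. A minimal sketch of such a timing harness is given below; the function name, warm-up scheme, and run counts are illustrative assumptions, not the instrumentation built into the YOLO frameworks.

```python
import time

def time_stage_ms(fn, inputs, warmup=3, runs=20):
    """Average wall-clock time of one pipeline stage (e.g. pre-processing
    or inference) in milliseconds per input, after a short warm-up.
    Illustrative sketch only."""
    for x in inputs[:warmup]:  # warm-up calls excluded from the average
        fn(x)
    start = time.perf_counter()
    n = 0
    for _ in range(runs):
        for x in inputs:
            fn(x)
            n += 1
    return (time.perf_counter() - start) / n * 1000.0
```

In practice the same harness would be applied separately to the pre-processing, inference, and post-processing stages, so that a figure like "0.2 ms pre-processing" is directly comparable across model variants.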