An improved YOLOv7 network using RGB-D multi-modal feature fusion for tea shoots detection

Yanxu Wu, Jianneng Chen, Shunkai Wu, Hui Li, Leiying He, Runmao Zhao, Chuanyu Wu

doi:10.1016/j.compag.2023.108541

Abstract

As tea pickers become increasingly scarce, intelligent harvesting of premium tea is a crucial prerequisite for the sustainable development of the premium tea industry. The first step towards intelligent and precise harvesting is the accurate detection of tender shoots, each consisting of one bud and one leaf. However, accurately identifying tea shoots is a challenging visual task because they are small, vary in shape, and are similar in color to the background. Existing models based on RGB images can detect only part of the targets. To address this issue and further improve the detection of tea buds, this study proposes using multi-modal features that combine red, green, blue, and depth (RGB-D) information. In addition, a unidirectional complementary multi-modal fusion method is introduced to minimize the adverse effects of low-quality depth information. First, an RGB-D dataset of high-quality tea leaves is constructed, and the samples are carefully calibrated. Next, an enhanced end-to-end RGB-D multi-modal object detection network, referred to as YOLO-RGBDtea, is developed based on You Only Look Once version 7 (YOLOv7). The model adds a parallel lightweight backbone network for depth image feature extraction and a self-attention mechanism that prioritizes contextual information. Finally, a cross-modal spatial attention fusion module (CSFM) is devised to integrate depth features with RGB features in a unidirectional manner. The experimental results show that YOLO-RGBDtea achieves an AP50 of 91.12% on tea shoots in complex outdoor scenes and improves significantly on YOLOv7, especially for small targets, overlapping target groups, and highly overexposed images. Notably, YOLO-RGBDtea has only 17.8% more parameters than the original YOLOv7 model, and the additional components can be readily transferred to other models. Overall, this study introduces a straightforward yet effective multi-modal fusion method of theoretical and practical significance for detecting high-quality tea shoots in complex outdoor environments.
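
To make the idea of unidirectional fusion concrete, the following is a minimal sketch and not the CSFM described in the paper. It assumes that the RGB and depth feature maps share the same shape, that a per-pixel gate is derived from the depth features by channel-wise mean and max pooling, and that fixed scalar weights stand in for a learned convolution. The names spatial_attention and fuse_unidirectional are hypothetical. Depth features are added into the RGB stream through the gate, while the depth stream receives nothing from RGB.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(depth, w_mean=1.0, w_max=1.0, bias=0.0):
    """Return an [H][W] gate in (0, 1) computed from depth features [C][H][W].

    Assumption: the fixed weights replace what would be a learned
    convolution over the pooled maps in a trained network.
    """
    channels = len(depth)
    height, width = len(depth[0]), len(depth[0][0])
    gate = []
    for i in range(height):
        row = []
        for j in range(width):
            values = [depth[c][i][j] for c in range(channels)]
            pooled = w_mean * sum(values) / channels + w_max * max(values) + bias
            row.append(sigmoid(pooled))
        gate.append(row)
    return gate


def fuse_unidirectional(rgb, depth):
    """Add gated depth features to RGB features; depth is left unchanged."""
    gate = spatial_attention(depth)
    return [
        [
            [rgb[c][i][j] + gate[i][j] * depth[c][i][j]
             for j in range(len(rgb[c][i]))]
            for i in range(len(rgb[c]))
        ]
        for c in range(len(rgb))
    ]


# One channel, one row, two pixels; the first pixel has no depth response.
rgb_features = [[[1.0, 2.0]]]
depth_features = [[[0.0, 2.0]]]
fused = fuse_unidirectional(rgb_features, depth_features)
```

In this sketch, a pixel whose depth response is zero (for example, a hole in the depth map) leaves the RGB feature untouched, which illustrates one way that low-quality depth can be kept from degrading the RGB stream.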

Full Text