Abstract
The latest advances in deep learning make it possible to recognize vegetable diseases from leaf images. Existing disease recognition methods based on computer vision have achieved impressive accuracy, stability, and portability. However, these methods cannot provide a decision-making basis for their final results and offer no textual evidence to support the user’s judgement. Disease diagnosis is a risky decision: if the detection method lacks transparency, users cannot fully trust the recognition results, which greatly limits the application of deep learning-based recognition methods. To address the low human–machine credibility caused by the inability of deep learning-based methods to provide a decision-making basis, this paper proposes a two-stage image dense captioning model named “DFYOLOv5m-M2Transformer”, which generates sentences describing the visible disease features within each recognized diseased area. Firstly, we established a target detection dataset and a dense captioning dataset containing leaf images of 10 diseases of 2 vegetables, cucumber and tomato. Secondly, we chose the DFYOLOv5m network as the disease detector to extract diseased areas from the image, and the M2-Transformer network as the decision-basis generator to produce sentences describing the disease features. Then, the Bi-Level Routing Attention module was introduced to extract fine-grained features under complex backgrounds and thereby resolve the poor feature extraction observed in cases of mixed diseases. Finally, we used atrous convolution to expand the model’s receptive field, and fused NWD with CIoU to improve the model’s performance in detecting small targets. The experimental results show that, under the joint IoU–Meteor evaluation metric, DFYOLOv5m-M2Transformer achieved a mean Average Precision (mAP) of 94.7% on the dense captioning dataset, 7.2% higher than that of Veg-DenseCap, the best-performing model in the control group. Moreover, the decision basis automatically generated by the proposed model is accurate, grammatically correct, and varied in sentence structure. The outcome of this study offers a new approach to improving the user experience of vegetable disease recognition models.
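To make the small-target improvement more concrete, the sketch below illustrates one way a Normalized Wasserstein Distance (NWD) term can be fused with the CIoU loss for bounding-box regression. The abstract does not specify the fusion scheme, so this is a minimal NumPy sketch assuming a weighted-sum fusion; the weighting `ratio`, the NWD constant `c`, and the function names are illustrative assumptions rather than the paper’s implementation.

```python
import numpy as np

def ciou_loss(box_a, box_b, eps=1e-9):
    """Complete-IoU loss between two boxes given as (cx, cy, w, h)."""
    # convert centre/size format to corner coordinates
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # intersection, union, and plain IoU
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter + eps
    iou = inter / union

    # squared centre distance over squared diagonal of the enclosing box
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = (box_a[0] - box_b[0]) ** 2 + (box_a[1] - box_b[1]) ** 2

    # aspect-ratio consistency term
    v = (4 / np.pi ** 2) * (np.arctan(box_b[2] / (box_b[3] + eps))
                            - np.arctan(box_a[2] / (box_a[3] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance: each box (cx, cy, w, h) is modelled as a
    2-D Gaussian N((cx, cy), diag(w^2/4, h^2/4)); c is a dataset-dependent
    constant (the value used here is illustrative)."""
    ga = np.array([box_a[0], box_a[1], box_a[2] / 2, box_a[3] / 2])
    gb = np.array([box_b[0], box_b[1], box_b[2] / 2, box_b[3] / 2])
    w2 = np.sum((ga - gb) ** 2)          # squared 2nd-order Wasserstein distance
    return np.exp(-np.sqrt(w2) / c)      # map distance to a (0, 1] similarity

def fused_box_loss(pred, target, ratio=0.5):
    """Hypothetical weighted fusion of the NWD loss and the CIoU loss."""
    return ratio * (1 - nwd(pred, target)) + (1 - ratio) * ciou_loss(pred, target)

# usage: a small predicted box slightly offset from a small ground-truth box
print(fused_box_loss((50, 50, 12, 10), (52, 51, 11, 10)))
```

The intuition behind such a fusion is that IoU-based losses vanish or become unstable when small predicted and ground-truth boxes barely overlap, whereas the Gaussian-based NWD term still yields a smooth, informative gradient; blending the two keeps CIoU’s behaviour on larger targets while improving sensitivity to small ones.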