Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have long been the main backbone networks for visual classification in deep learning. Although ViTs have recently received more attention than CNNs due to their excellent fitting ability, their scalability is largely limited by the quadratic complexity of attention computation. Diagnosing diabetic retinopathy relies on characterizing fundus lesions as well as the width, angle, and branching pattern of retinal blood vessels. Inspired by the ability of Mamba and VMamba to model long sequences efficiently, this paper proposes VMamba-m, a general-purpose visual backbone designed to reduce computational complexity to linear while retaining the advantageous features of ViTs. By modifying the cross-entropy loss function, we enhance the model's attention to rare categories, which is especially important in large-scale multi-category classification tasks. To improve the adaptability of VMamba-m to visual data, we introduce the SE (squeeze-and-excitation) channel attention mechanism, which lets the model learn features along the channel dimension and estimate the importance of each channel; the excitation part then assigns a different weight to each channel. In addition, we refine the implementation details and architectural design with a novel attention mechanism based on local windowing, which optimizes the model's handling of long-sequence data, improving both the performance and the inference speed of VMamba-m. Extensive experimental results show that VMamba-m performs well on the diabetic retinopathy classification task, with significant advantages in accuracy and computation time over existing benchmark models.
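The modified cross-entropy loss the abstract mentions is not specified in detail; a common way to raise the weight of rare categories is to scale each sample's loss by a per-class weight inversely proportional to class frequency. The sketch below is a minimal NumPy illustration of that idea, not the paper's actual loss (the function name, the weighting scheme, and the class counts are all illustrative assumptions).

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted cross-entropy: samples from rare classes get larger weights."""
    # Softmax with the usual max-subtraction for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # Per-sample negative log-likelihood of the true class.
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    # Scale each sample's loss by the weight of its true class,
    # then normalize by the total weight.
    w = class_weights[labels]
    return (w * nll).sum() / w.sum()

# Illustrative class counts: class 2 is rare, so it receives the largest weight.
counts = np.array([900, 80, 20])
weights = counts.sum() / (len(counts) * counts)
```

With such weights, a misclassified rare-class sample pulls the average loss up far more than a misclassified common-class sample, which is the behavior the abstract attributes to the modified loss.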
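The SE channel attention described above follows the standard squeeze-and-excitation pattern: global average pooling squeezes each channel to a scalar, a small bottleneck MLP produces a gate per channel, and the feature map is rescaled channel-wise. The following is a minimal NumPy sketch of that pattern under assumed shapes and randomly initialized projections, not the model's trained module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-excitation over a (N, C, H, W) feature map.

    w1: (C, C//r) and w2: (C//r, C) form the bottleneck MLP,
    where r is the reduction ratio (implied by the weight shapes).
    """
    # Squeeze: global average pooling collapses the spatial dims to (N, C).
    s = x.mean(axis=(2, 3))
    # Excitation: bottleneck MLP yields one gate in (0, 1) per channel.
    z = np.maximum(s @ w1, 0.0)       # ReLU
    gate = sigmoid(z @ w2)            # shape (N, C)
    # Reweight: scale every channel by its learned importance.
    return x * gate[:, :, None, None]
```

Because each gate lies strictly between 0 and 1, the block can only attenuate channels relative to one another; the relative scaling is what lets the network emphasize informative channels.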