Vision Transformers (ViTs) are widely used in traffic video detection because of their large receptive field and strong ability to capture global context. However, ViT models extract local detail features inadequately and have high computational complexity. Convolutional neural networks (CNNs) are also commonly used for video image detection. Although CNNs extract local details more flexibly and at lower computational cost, their fixed-size kernels make it difficult to exploit the full local context. As a result, tiny targets with an uneven scale distribution in a region are hard to extract, and small targets are prone to missed and false detections. To address this, this study proposes a new fully localized selective large kernel network (FL-SLKNet). The network captures specific targets and background information through a fully localized feature extraction method, thereby making the features of small targets in the region more distinctive. On this basis, the Adaptive Expansion Residual Module (AERM) is proposed. It addresses the loss of image resolution caused by receptive field expansion by combining pointwise convolution with dilated convolution and using residual connections to supplement the lost information. To identify tiny targets in local areas, this study also proposes the Multi-Scale Frequency Domain Encoder (MSFDEncoder). This encoder uses a high-frequency branch to capture local details and a low-frequency branch to focus on global structure, capturing fine-grained features of targets at various scales. Experimental results on the VisDrone2019-DET and BDD-100K datasets indicate that FL-SLKNet outperforms other advanced models in traffic video detection, with increases of 3.3% and 3.4% in mAP0.5, respectively, compared with the RT-DETR model.
Meanwhile, the detection performance of FL-SLKNet compares favorably with that of the YOLO series on the SZ Actual Traffic Video dataset. In addition, in simulated rainy and hazy scenarios, FL-SLKNet’s mAP0.5 improves by 5.5% and 2.6%, respectively, compared with the RT-DETR model, indicating better robustness to extreme weather.
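
To illustrate the idea behind combining pointwise convolution, dilated convolution, and a residual connection, the following is a minimal single-channel 1-D sketch in plain Python. It is not the authors' AERM: the function names (`effective_kernel`, `dilated_conv1d`, `residual_dilated_block`) and the `scale` parameter are hypothetical, and the pointwise convolution is reduced to a scalar multiply because there is only one channel. It only shows that a dilated kernel widens the receptive field while "same" padding keeps the output length unchanged, and that the residual path carries the original signal through.

```python
def effective_kernel(k, d):
    """Span covered by a k-tap kernel with dilation d: d * (k - 1) + 1."""
    return d * (k - 1) + 1


def dilated_conv1d(x, w, d):
    """'Same'-padded 1-D dilated cross-correlation for an odd-length kernel w."""
    k = len(w)
    pad = d * (k - 1) // 2  # zero padding on each side keeps len(output) == len(x)
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(w[j] * padded[i + j * d] for j in range(k))
        for i in range(len(x))
    ]


def residual_dilated_block(x, w, d, scale=1.0):
    """Pointwise step -> dilated convolution -> residual add (toy, single channel)."""
    # With one channel, a 1x1 (pointwise) convolution is just a scalar multiply.
    pointwise = [scale * v for v in x]
    y = dilated_conv1d(pointwise, w, d)
    # The residual connection adds the unmodified input back to the output.
    return [a + b for a, b in zip(x, y)]


if __name__ == "__main__":
    impulse = [0, 0, 0, 1, 0, 0, 0]
    print(effective_kernel(3, 2))
    print(residual_dilated_block(impulse, [1, 1, 1], d=2))
```

In a real detector the same pattern would be applied to multi-channel 2-D feature maps with learned weights; the toy version is meant only to make the receptive-field arithmetic concrete.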