Abstract

Aggregating features across levels or scales has been empirically shown to enhance feature representations in object detection. However, existing approaches tend to aggregate features or embed contextual information indiscriminately through simple concatenation or addition, disregarding the misalignment introduced by repeated sampling operations. This paper proposes a feature-aligned network based on YOLOv5, named AlignYOLO, to address these misalignment issues. The network consists of three primary modules: the self-attention convolution (SAC) module, the feature aggregation and alignment (FAA) module, and the multiscale aligned channel attention (MSACA) module. First, the SAC module extracts information comprehensively by employing convolution and self-attention simultaneously. Second, the FAA module aggregates features across layers and aligns them through a learnable interpolation strategy. Last, the MSACA module employs multiscale convolution to capture contextual information, aligns in-layer features with the same learnable interpolation strategy, and leverages channel attention to enhance feature representations. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed method, with AlignYOLO outperforming state-of-the-art detectors.
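The abstract does not detail how the learnable interpolation strategy works. As a rough illustration only, the sketch below (not the authors' implementation) shows one common way such cross-layer alignment is realized in PyTorch: a small convolution predicts a per-pixel 2D offset field that warps an upsampled coarse feature onto the fine feature's grid before fusion. All names here (FeatureAlign, offset_conv) and design choices are hypothetical.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureAlign(nn.Module):
        # Hypothetical sketch of learnable-interpolation alignment:
        # a conv predicts per-pixel (dx, dy) offsets that shift the
        # sampling grid used to resample the upsampled coarse feature.
        def __init__(self, channels):
            super().__init__()
            self.offset_conv = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

        def forward(self, fine, coarse):
            # Bilinearly upsample the coarse map to the fine resolution first.
            coarse_up = F.interpolate(coarse, size=fine.shape[-2:],
                                      mode='bilinear', align_corners=False)
            offset = self.offset_conv(torch.cat([fine, coarse_up], dim=1))

            n, _, h, w = fine.shape
            # Identity sampling grid in normalized [-1, 1] coordinates.
            ys, xs = torch.meshgrid(
                torch.linspace(-1, 1, h, device=fine.device),
                torch.linspace(-1, 1, w, device=fine.device),
                indexing='ij')
            grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)

            # Convert pixel offsets to normalized coordinates and shift the grid.
            shift = offset.permute(0, 2, 3, 1)
            shift = shift * torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)],
                                         device=fine.device)
            aligned = F.grid_sample(coarse_up, grid + shift,
                                    mode='bilinear', align_corners=True)
            return fine + aligned  # fuse only after alignment

With a zero offset field this reduces to plain bilinear upsampling plus addition, so the learned alignment branch can only refine the naive fusion the abstract criticizes; whether AlignYOLO predicts offsets this way or parameterizes interpolation weights directly is not specified in the abstract, and the sketch only illustrates the general "predict, then resample" pattern that learnable alignment modules share.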
