Target detection in satellite images is an essential topic in remote sensing and computer vision. Despite extensive research, accurate and efficient target detection in remote sensing images remains an open problem: targets span a wide range of scales and are densely distributed, while overhead imaging and complex backgrounds lead to high feature similarity between targets and serious occlusion. To address these issues comprehensively, we first propose a Centralised Visual Processing Center (CVPC), a visual processing center that runs a Transformer encoder and a CNN in parallel and employs a lightweight encoder to capture broad, long-range dependencies. A Pixel-level Learning Center (PLC) module establishes pixel-level correlations and improves the depiction of detailed features. CVPC effectively improves detection efficiency for remote sensing targets with high feature similarity and severe occlusion. Second, we propose a centralised cross-layer feature fusion pyramid structure that fuses the CVPC results in a top-down manner to enhance the detailed feature representation at each layer. Finally, we present a Context Enhanced Adaptive Sparse Convolutional Network (CEASC), which improves accuracy while maintaining detection efficiency. Based on these modules, we design and conduct a series of experiments on three challenging public datasets: DOTA-v1.0, DIOR, and RSOD. The results show that our proposed 3CNet achieves advanced detection accuracy while balancing detection speed (78.62% mAP on DOTA-v1.0, 79.12% mAP on DIOR, and 95.50% mAP on RSOD).