Abstract

The growing volume and complex content of aerial images mean that many recent deep-learning-based methods do not generalize well across different aerial image processing tasks: the coarse-grained feature representations they produce are not discriminative enough. Moreover, confounding factors in the datasets and the long-tailed distribution of the training data lead to biased and spurious associations among the objects in aerial images. This study proposes a confounder-free fusion network (CFF-NET) to address these challenges. Global and local feature extraction branches are designed to capture comprehensive and fine-grained deep features from the whole image. Specifically, to extract discriminative local features and exploit contextual information across regions, models based on gated recurrent units (GRUs) are constructed to extract the features of each image region and output an importance weight for each region. Further, a confounder-free object feature extraction branch is proposed to generate reasonable visual attention, provide additional multi-grained image information, and eliminate spurious and biased visual relationships at the object level. Finally, the outputs of the three branches are combined to obtain the fused feature representation. Extensive experiments are conducted on three popular aerial image processing tasks: image classification, image retrieval, and image captioning. The proposed CFF-NET achieves reasonable, state-of-the-art results, including on high-level tasks such as aerial image captioning.
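To make the three-branch design concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the module name `ThreeBranchFusion`, the heads `region_score` and `object_proj`, and all dimensions and pooling choices are illustrative assumptions. It shows a global branch that pools the whole feature map, a local branch in which a GRU reads a sequence of region features and scores each region's importance, a placeholder projection standing in for the confounder-free object branch, and concatenation of the three outputs into a fused representation.

```python
import torch
import torch.nn as nn

class ThreeBranchFusion(nn.Module):
    """Illustrative sketch of the three-branch fusion described in the
    abstract. Names, dimensions, and pooling are assumptions, not the
    paper's actual architecture."""

    def __init__(self, region_dim=512, hidden_dim=256, fused_dim=1024):
        super().__init__()
        # Global branch: pool the whole feature map into one vector.
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        # Local branch: a GRU reads the region sequence; a small head
        # outputs an importance weight for each region.
        self.gru = nn.GRU(region_dim, hidden_dim, batch_first=True)
        self.region_score = nn.Linear(hidden_dim, 1)
        # Placeholder for the confounder-free object branch.
        self.object_proj = nn.Linear(region_dim, hidden_dim)
        # Fusion layer over the concatenated branch outputs.
        self.fuse = nn.Linear(region_dim + 2 * hidden_dim, fused_dim)

    def forward(self, feat_map, region_feats, object_feats):
        # feat_map: (B, region_dim, H, W); region_feats: (B, R, region_dim);
        # object_feats: (B, O, region_dim)
        g = self.global_pool(feat_map).flatten(1)            # (B, region_dim)
        h, _ = self.gru(region_feats)                        # (B, R, hidden_dim)
        w = torch.softmax(self.region_score(h), dim=1)       # (B, R, 1) weights
        local = (w * h).sum(dim=1)                           # weighted local feature
        obj = self.object_proj(object_feats).mean(dim=1)     # pooled object feature
        return self.fuse(torch.cat([g, local, obj], dim=1))  # (B, fused_dim)

# Example: a 7x7 feature map with 512 channels, 49 region vectors,
# and 10 object vectors yields a (2, 1024) fused representation.
model = ThreeBranchFusion()
fused = model(torch.randn(2, 512, 7, 7),
              torch.randn(2, 49, 512),
              torch.randn(2, 10, 512))
print(fused.shape)  # torch.Size([2, 1024])
```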
