Abstract

Accurately interpreting image content plays a vital role in many Earth observation tasks. This letter constructs a novel cross-context and cross-scale capsule vision transformer (C<sup>2</sup>-CapsViT) architecture for remote sensing image scene classification. First, a multi-context patch embedding strategy greatly boosts token representation quality by encoding feature semantics from different contexts. Second, a multiscale transformer block concurrently exploits long-range global feature interactions at different granularities and self-attention over different feature types, improving feature encoding quality. Moreover, by combining convolutional and transformer structures, local and global feature semantics are effectively fused to drive accurate predictions. C<sup>2</sup>-CapsViT is thoroughly evaluated on three scene classification data sets; both quantitative evaluations and comparative analyses demonstrate its competitive capability and strong performance.
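The abstract does not detail how the multi-context patch embedding is implemented, so the following is only a minimal illustrative sketch of the general idea: tokenizing the same image at several patch granularities and concatenating the resulting token sequences, so that tokens carry feature semantics from different spatial contexts. All function names, patch sizes, and the random linear projection are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def patch_embed(img, patch, dim, rng):
    # img: (H, W, C). Split into non-overlapping patch x patch blocks,
    # flatten each block, and linearly project it to `dim` dimensions
    # (random weights stand in for learned ones in this sketch).
    H, W, C = img.shape
    ph, pw = H // patch, W // patch
    blocks = img[:ph * patch, :pw * patch].reshape(ph, patch, pw, patch, C)
    blocks = blocks.transpose(0, 2, 1, 3, 4).reshape(ph * pw, patch * patch * C)
    W_proj = rng.standard_normal((patch * patch * C, dim)) * 0.02
    return blocks @ W_proj  # (num_tokens, dim)

def multi_context_embed(img, patch_sizes=(8, 16), dim=64, seed=0):
    # Hypothetical multi-context embedding: tokenize the same image at
    # several patch granularities and concatenate the token sequences.
    rng = np.random.default_rng(seed)
    tokens = [patch_embed(img, p, dim, rng) for p in patch_sizes]
    return np.concatenate(tokens, axis=0)

img = np.random.default_rng(1).standard_normal((64, 64, 3))
toks = multi_context_embed(img)  # 64 tokens at patch=8, 16 tokens at patch=16
print(toks.shape)                # (80, 64)
```

Finer patches preserve local detail while coarser patches summarize larger context; concatenating both sequences lets a downstream transformer attend across granularities.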
