Traditional image captioning methods take only a local perspective at the dataset level: they explore the information dispersed within individual images. However, lacking a global perspective, they cannot capture the characteristics shared among similar images. To address this limitation, this paper introduces a novel Triple-stream Commonsense Circulating Transformer Network (TCCTN). It incorporates a contextual stream into the encoder and combines it with enhanced channel and spatial streams for comprehensive feature learning. The proposed commonsense-aware contextual attention (CCA) module queries commonsense contextual features from the dataset, obtaining global contextual association information by projecting grid features into the contextual space. The pure semantic channel attention (PSCA) module leverages a compressed spatial domain for channel pooling, focusing on the attention weights of pure channel features to capture inherent semantic features. The region spatial attention (RSA) module enhances spatial concepts in semantic learning by incorporating region position information. Furthermore, leveraging the complementary differences among the three features, TCCTN introduces a mixture-of-experts strategy to enhance the unique discriminative ability of each feature and promote their integration in textual feature learning. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of the contextual commonsense stream and the superior performance of TCCTN.
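The abstract does not specify how the mixture-of-experts strategy weights the three streams. As a rough illustration only, the sketch below shows one generic form of soft gating: a softmax gate computed from the concatenated stream features, followed by a weighted sum. The function names, the `gate_weights` parameter, and the gate design are assumptions made for this example and are not taken from TCCTN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(contextual, channel, spatial, gate_weights):
    """Fuse three equal-length feature vectors with a softmax gate.

    gate_weights is a hypothetical learned matrix with 3 rows (one per
    stream), each of length 3 * d, applied to the concatenated features.
    This is a generic soft mixture-of-experts gate, assumed for
    illustration; the paper's actual fusion may differ.
    """
    streams = [contextual, channel, spatial]
    d = len(contextual)
    if any(len(s) != d for s in streams):
        raise ValueError("all streams must have the same length")
    concat = contextual + channel + spatial
    # One gate logit per stream, computed from all three streams.
    logits = [sum(w * x for w, x in zip(row, concat)) for row in gate_weights]
    gates = softmax(logits)
    # Weighted sum of the streams, element by element.
    fused = [sum(g * s[i] for g, s in zip(gates, streams)) for i in range(d)]
    return fused, gates
```

With an all-zero gate matrix the gate is uniform and the fused vector is the plain average of the three streams; a trained gate would instead emphasize whichever stream is most informative for a given input.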