Multimodal tasks have become a prominent research direction in recent years. The emergence of large-scale models has steadily advanced a wide range of multimodal tasks, yielding remarkable results. However, how to effectively fuse features from multiple modalities remains an open problem. In tasks such as sentiment analysis over diverse social media content, relying solely on features derived from the [CLS] token may provide insufficient information. This paper proposes the BVA-Transformer, a model architecture for image-text multimodal classification and dialogue, which incorporates the EF-CaTrBERT method for feature fusion and introduces BLIP to map images into the textual space. This allows images and text to be fused within the same information space, avoiding the information redundancy and conflict that arise in traditional feature fusion methods. In addition, we propose a visual-attention-based Global Features Encoder (GFE) module in the BVA-Transformer, which provides more global and targeted auxiliary features for the [CLS] token. This enables the model to exploit richer feature information in classification tasks under this fusion scheme and to dynamically select the information to attend to. We also introduce the Trv structure from EVA-02 into the decoder of the BVA-Transformer and investigate its impact on model performance. Furthermore, we design a three-stage training strategy to further enhance the model's performance. Experimental results demonstrate that the BVA-Transformer achieves high-quality classification while generating dialogue sentences, and it achieves excellent performance on our validation dataset compared with existing multimodal classification models.
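For intuition only, the sketch below illustrates the general idea of augmenting the [CLS] token with attention-pooled global visual features before classification. It is a minimal, hypothetical example, not the paper's actual GFE implementation; the module name, dimensions, and fusion by concatenation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GlobalFeaturesEncoder(nn.Module):
    """Hypothetical sketch: attention-pool patch-level visual features into a
    single global auxiliary vector and fuse it with the text-side [CLS] embedding.
    Names and dimensions are illustrative, not the paper's exact design."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))      # learnable global query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)                    # fuse [CLS] + global visual feature

    def forward(self, cls_token: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # cls_token: (B, dim); visual_tokens: (B, N, dim) patch features
        q = self.query.expand(visual_tokens.size(0), -1, -1)
        global_feat, _ = self.attn(q, visual_tokens, visual_tokens)   # (B, 1, dim)
        fused = torch.cat([cls_token, global_feat.squeeze(1)], dim=-1)
        return self.proj(fused)                                       # (B, dim), fed to the classifier head
```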