Abstract

Visual dialog aims to accomplish multiple rounds of dialog by fusing information extracted from images, captions, and previous question–answer pairs. As a vision-language task, visual dialog encounters challenges related to language bias and vision bias. These biases create an imbalance in multi-modal fusion, resulting in shortcut learning and significantly compromising the model’s robustness. Moreover, existing multi-modal fusion methods in visual dialog exhibit a low data interaction frequency, leading to insufficient fusion. To overcome the balance and sufficiency issues in multi-modal fusion, we propose a novel Parallel Attention Fusion visual dialog model with Counterfactual Sample debiasing (CS-PAF). Specifically, CS-PAF consists of two core ingredients: (i) a counterfactual sample generation module for model debiasing; and (ii) a parallel attention fusion network that enhances the sufficiency of multi-modal data interaction. Notably, in contrast to other debiasing methods, our counterfactual sample generation applies contrastive learning to circumvent the high cost of manual annotations and to ensure seamless integration with other models. Extensive comparisons with state-of-the-art approaches, along with comprehensive ablation and transferability studies across multiple datasets, substantiate the superiority and effectiveness of our CS-PAF. Our implementation is available at https://github.com/chenyulu2000/cspaf.
