Abstract

This paper uses the representation (latent) space of neural architectures to develop an end-to-end explanation approach for Image Captioning (IC) models. By injecting Gaussian perturbations into the latent space of each component of the architecture, we first identify the parts of the model that are likely to be most decisive in caption generation. The results show that the visual part, composed mainly of the visual encoding and the attention mechanism, is more decisive than the language part, which could lead to more subtle explanations. We then follow this analysis with an in-depth explanation protocol that also uses the latent space and focuses on the visual modality to design and compare two explanation methods with different scopes: (1) a surrogate-based method using Local Interpretable Model-Agnostic Explanations (LIME), with local scope, and (2) a backpropagation-based method using Layer-wise Relevance Propagation (LRP), for global explanations. To assess the quality of the obtained explanations, we propose the new concept of Latent Ablation, which proves to be more consistent than classical Ablation; the latter usually leads to inconsistencies and truncated information. Extensive experiments show that both methods achieve comparable results and that their scope has no explicit impact on the quality of the explanations.
