Abstract

Despite their success, deep neural networks still lack interpretability and are regarded as black boxes. This hampers wider adoption in applications with societal, environmental or economic implications, and has motivated a variety of techniques for explaining their outputs. Such explanations, however, are typically produced after model training, so there is no guarantee that models learn faithful attributions, a goal they were not trained for. We evaluate the impact of different penalty terms in the loss function that promote explainable feature attributions and that can be learned during training in an unsupervised way. We show that explainability-constrained models produce better saliency maps according to multiple metrics and tests. Regularizers imposing locality, fidelity and symmetry properties lead to the best performance in terms of MoRF and ROAR scores.
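To make the general setup concrete, below is a minimal sketch, assuming a PyTorch-style training loop, of adding an unsupervised explainability penalty to the task loss. The specific penalty shown (a total-variation term on input-gradient saliency, used here as a stand-in for a locality-style property) and all function names are illustrative assumptions, not the paper's exact regularizers.

```python
import torch
import torch.nn.functional as F

def saliency(model, x):
    # Input-gradient saliency as a simple, differentiable attribution proxy
    # (illustrative choice; any differentiable attribution method could be used).
    x = x.clone().requires_grad_(True)
    logits = model(x)
    score = logits.max(dim=1).values.sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad.abs()

def locality_penalty(sal):
    # Hypothetical locality-style regularizer: total variation of the saliency
    # map, which favors spatially compact attributions.
    dh = (sal[..., 1:, :] - sal[..., :-1, :]).abs().mean()
    dw = (sal[..., :, 1:] - sal[..., :, :-1]).abs().mean()
    return dh + dw

def training_step(model, x, y, lam=0.1):
    # Combined objective: standard task loss plus an unsupervised
    # explainability penalty weighted by lam.
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)
    penalty = locality_penalty(saliency(model, x))
    return task_loss + lam * penalty
```

Because the penalty is computed from the model's own gradients, no attribution labels are required, which is what makes this kind of regularization learnable in an unsupervised way during training.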
