Abstract

Background
Deep learning methods, where models do not use explicit features and instead rely on implicit features estimated during model training, suffer from an explainability problem. In text classification, saliency maps that reflect the importance of words in prediction are one approach toward explainability. However, little is known about whether the salient words agree with those identified by humans as important.

Objectives
The current study examines in-line annotations from human annotators and saliency map annotations from a deep learning model (ELECTRA transformer) to understand how well both humans and machines provide evidence for their assigned label.

Methods
Data were responses to test items across a mix of United States subjects, states, and grades. Humans were trained to annotate responses to justify a crisis alert label, and two model interpretability methods (LIME, Integrated Gradients) were used to obtain engine annotations. Human inter-annotator agreement and engine agreement with the human annotators were computed and compared.

Results and Conclusions
Human annotators agreed with one another at similar rates to those observed in the literature on similar tasks. The annotations derived using Integrated Gradients (IG) agreed with human annotators at higher rates than LIME on most metrics; however, both methods underperformed relative to the human annotators.

Implications
Saliency map-based engine annotations show promise as a form of explanation, but do not reach human annotation agreement levels. Future work should examine the appropriate unit for annotation (e.g., word, sentence), other gradient-based methods, and approaches for mapping the continuous saliency values to Boolean annotations.
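To make the described pipeline concrete, below is a minimal sketch (not the authors' implementation) of how Integrated Gradients token attributions for an ELECTRA sequence classifier could be thresholded into Boolean word-level engine annotations, using the Hugging Face transformers and Captum libraries. The checkpoint name, positive-class index, baseline choice, example response, and threshold are all illustrative assumptions.

# Minimal sketch: Integrated Gradients saliency for an ELECTRA classifier,
# mapped to Boolean token-level annotations. Checkpoint, class index,
# baseline, example text, and threshold are assumptions, not the paper's setup.
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "google/electra-base-discriminator"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def forward_fn(input_ids, attention_mask):
    # Return the logit of the assumed "crisis alert" class (index 1 here).
    return model(input_ids=input_ids, attention_mask=attention_mask).logits[:, 1]

# Attribute the prediction to the embedding layer of the base ELECTRA model.
lig = LayerIntegratedGradients(forward_fn, model.electra.embeddings)

text = "Example student response text goes here."  # illustrative only
enc = tokenizer(text, return_tensors="pt")
baseline_ids = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

attributions = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline_ids,
    additional_forward_args=(enc["attention_mask"],),
    n_steps=50,
)
# Collapse the embedding dimension to one saliency score per token and normalize.
scores = attributions.sum(dim=-1).squeeze(0)
scores = scores / (scores.abs().max() + 1e-12)

# Map continuous saliency to Boolean annotations with an assumed threshold.
THRESHOLD = 0.2
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"].squeeze(0))
engine_annotation = [(tok, bool(s > THRESHOLD)) for tok, s in zip(tokens, scores.tolist())]
print(engine_annotation)

The final mapping step is where the continuous-to-Boolean question raised in the Implications arises: different thresholds (or top-k selections) yield different engine annotations and therefore different agreement rates with the human annotators.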
