Managing complex AI systems requires insight into a model's decision-making processes, and understanding how these systems arrive at their conclusions is essential for ensuring reliability. In the field of explainable natural language processing, many approaches have been developed and evaluated. However, experimental analyses of explainability for text classification have largely been constrained to short texts and binary classification. In this applied work, we study explainability for a real-world task in which the goal is to assess the technological suitability of standards. This prototypical use case is characterized by large documents, technical language, and a multi-label setting, making it a complex modeling challenge. We provide an analysis of approximately 1,000 documents with human-annotated evidence. We then present experimental results for two explanation methods, evaluating the plausibility and runtime of the explanations. We find that the average runtime for generating an explanation is at least five minutes and that the model explanations do not overlap with the ground truth. These findings reveal limitations of current explanation methods. In a detailed discussion, we identify possible reasons and ways to address them along three dimensions: task, model, and explanation method. We conclude with risks and recommendations for the use of feature attribution methods in similar settings.