Explainable Text Classification Techniques in Legal Document Review: Locating Rationales without Using Human Annotated Training Text Snippets

Christian Mahoney, Peter Gronvall, Jianping Zhang, Nathaniel Huber-Fliflet

doi:10.1109/bigdata55660.2022.10020626

Abstract

US corporations regularly spend millions of dollars reviewing electronically stored documents in legal matters. Attorneys increasingly apply text classification to efficiently cull massive volumes of data and identify responsive documents for use in these matters. While text classification is regularly used to reduce the discovery costs of legal matters, it also faces a perception challenge: among lawyers, the technology is sometimes regarded as a "black box." Put simply, attorneys receive no additional information that explains why a document is classified as responsive. In recent years, explainable machine learning has emerged as an active research area. In an explainable machine learning system, the predictions or decisions made by a machine learning model are human-understandable. In legal ‘document review’ scenarios, a document is responsive because one or more of its small text snippets are deemed responsive. If these responsive snippets can be located, attorneys can easily evaluate the model’s document classification decisions, which is especially important in the field of responsible AI. Our prior research found that predictive models trained on annotated text snippets achieved higher precision than models trained on the full text of the documents. While that result is interesting, manually annotating training text snippets is generally not practical during a legal document review. Even so, small increases in precision can drastically decrease the cost of large document reviews, so automating the identification of training text snippets without human review could make snippet-based models a practical approach. This paper proposes two simple machine learning methods to locate responsive text snippets within responsive documents without using human-annotated training text snippets.
The two methods were evaluated and compared with a document classification method using three datasets from actual legal matters. The results show that the two proposed methods outperform the classification method trained at the document level in identifying responsive text snippets in responsive documents. Additionally, the results suggest that the identification of training text snippets can be automated successfully, improving the precision of predictive models in legal document review and thereby helping to reduce the overall cost of review.

Full Text