Multi-label legal document classification: A deep learning-based approach with label-attention and domain-specific pre-training

Dezhao Song, Andrew Vold, Kanika Madan, Frank Schilder

doi:10.1016/j.is.2021.101718

Abstract

Multi-label document classification applies to a broad range of practical problems, such as news article topic tagging, sentiment analysis, and medical code classification. A variety of approaches (e.g., tree-based methods, neural networks, and deep learning systems based specifically on pre-trained language models) have been developed for multi-label document classification and have achieved satisfactory performance on different datasets. In the legal domain, however, multi-label classification tasks pose several key challenges. One critical challenge is the lack of high-quality human-labeled datasets, which prevents researchers and practitioners from achieving decent performance on the respective tasks. In addition, existing methods for multi-label classification typically focus on the majority classes, which results in unsatisfactory performance on other important classes that do not have sufficient training samples. To tackle these challenges, we first present POSTURE50K, a novel legal extreme multi-label classification dataset, which we will release to the research community. The dataset contains 50,000 legal opinions and their manually labeled legal procedural postures. Labels in this dataset follow a Zipfian distribution, leaving many of the classes with only a few samples. Furthermore, we propose a deep learning architecture that adopts domain-specific pre-training and a label-attention mechanism for multi-label document classification. We evaluate our proposed architecture on POSTURE50K and another legal multi-label dataset, EUROLEX57K, and show that our approach achieves better performance than two baseline systems and four other recent state-of-the-art methods on both datasets.
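
To make the idea of label attention concrete, the following is a minimal, illustrative sketch in plain Python. It is not the architecture proposed in this paper: it assumes one common formulation in which each label has its own query vector, attends over the token representations of a document, and receives its own pooled document vector, which is then scored independently with a sigmoid. The names `label_attention`, `predict`, `label_queries`, `label_out`, and `biases` are hypothetical and chosen only for illustration. In practice the token vectors would come from a pre-trained encoder and all parameters would be learned.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def label_attention(token_vecs, label_queries):
    """Return one attention-pooled document vector per label.

    token_vecs:    list of token representations, each of dimension d
    label_queries: list of per-label query vectors, each of dimension d
                   (hypothetical parameters; assumed learned elsewhere)
    """
    doc_vecs, weights = [], []
    for q in label_queries:
        # Attention weights of this label over all tokens.
        a = softmax([dot(h, q) for h in token_vecs])
        # Weighted sum of token vectors gives a label-specific document vector.
        pooled = [sum(w * h[j] for w, h in zip(a, token_vecs))
                  for j in range(len(q))]
        doc_vecs.append(pooled)
        weights.append(a)
    return doc_vecs, weights


def predict(token_vecs, label_queries, label_out, biases, threshold=0.5):
    """Score each label independently with a sigmoid (multi-label setting)."""
    doc_vecs, _ = label_attention(token_vecs, label_queries)
    probs = [1.0 / (1.0 + math.exp(-(dot(v, w) + b)))
             for v, w, b in zip(doc_vecs, label_out, biases)]
    return [p >= threshold for p in probs], probs
```

Because every label builds its own view of the document, a rare label can focus on the few tokens that matter for it rather than sharing a single pooled representation dominated by frequent labels. This is one plausible motivation for label attention under a Zipfian label distribution.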

Full Text