Abstract

Bidirectional Encoder Representations from Transformers (BERT) has achieved state-of-the-art performance on several text classification tasks, such as GLUE and sentiment analysis. Recent work in the legal domain has started to use BERT on tasks such as legal judgement prediction and violation prediction. A common practice when using BERT is to fine-tune a pre-trained model on a target task and truncate the input texts to the size of the BERT input (e.g. at most 512 tokens). However, due to the unique characteristics of legal documents, it is not clear how to effectively adapt BERT in the legal domain. In this work, we investigate how to deal with long documents and how important it is to pre-train on documents from the same domain as the target task. We conduct experiments on two recent datasets, the ECHR Violation Dataset and the Overruling Task Dataset, which are multi-label and binary classification tasks, respectively. Importantly, a document from the ECHR Violation Dataset contains more than 1,600 tokens on average, whereas the documents in the Overruling Task Dataset are shorter (at most 204 tokens). We thoroughly compare several techniques for adapting BERT to long documents and compare different models pre-trained on the legal and other domains. Our experimental results show that we need to explicitly adapt BERT to handle long documents, as truncation leads to less effective performance. We also find that pre-training on documents similar to those of the target task results in more effective performance in several scenarios.
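To make the truncation practice concrete, the sketch below shows how a long legal document is cut down to BERT's 512-token limit. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is prescribed by the abstract; it illustrates the common practice rather than this paper's exact pipeline.

```python
# Minimal sketch of the common truncation practice, assuming the Hugging Face
# `transformers` library and the `bert-base-uncased` checkpoint (illustrative
# choices, not the paper's exact configuration).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

document = "..."  # a (potentially very long) legal document, e.g. an ECHR case

encoded = tokenizer(
    document,
    max_length=512,       # BERT's maximum input length
    truncation=True,      # everything beyond 512 tokens is simply discarded
    padding="max_length",
    return_tensors="pt",
)
# For ECHR cases, which average more than 1,600 tokens, most of the text is
# lost at this step, which motivates the long-document adaptations studied here.
```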

Highlights

  • Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) has gained attention from the NLP community due to its effectiveness on several NLP tasks (Chalkidis et al., 2019, 2020; Zheng et al., 2021)

  • We focus on two legal document prediction tasks: the European Court of Human Rights (ECHR) Violation Dataset (Chalkidis et al., 2021) and the Overruling Task Dataset (Zheng et al., 2021)

  • We analyse the impact of pre-training on different types of documents for legal judgement prediction (Chalkidis et al., 2019), focusing on fine-tuned Bidirectional Encoder Representations from Transformers (BERT) models (see the sketch after this list)
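As an illustration of how pre-training domains can be compared, the hedged sketch below loads a general-domain and a legal-domain checkpoint under the same classification setup. The checkpoint names are assumptions: bert-base-uncased is the standard general-domain model, and nlpaueb/legal-bert-base-uncased is a publicly released legal-domain BERT (Chalkidis et al., 2020); the paper's own model variants may differ.

```python
# Hedged sketch: swap the pre-trained checkpoint while keeping the fine-tuning
# pipeline identical, so any performance gap can be attributed to the
# pre-training domain. Checkpoint names are illustrative assumptions.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = {
    "general": "bert-base-uncased",              # general-domain BERT
    "legal": "nlpaueb/legal-bert-base-uncased",  # legal-domain BERT (LEGAL-BERT)
}

def load_classifier(domain: str, num_labels: int):
    """Load a tokenizer and a sequence-classification model for the chosen domain."""
    name = CHECKPOINTS[domain]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=num_labels)
    return tokenizer, model

# e.g. a binary head for the Overruling task; for the multi-label ECHR task the
# head would instead predict one score per possible violated article.
tokenizer, model = load_classifier("legal", num_labels=2)
```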


Summary

Related Work

Legal documents, such as EU & UK legislation, European Court of Human Rights (ECHR) cases, and Case Holdings On Legal Decisions (CaseHOLD), are normally written in a descriptive language in a non-structured text format and have unique characteristics that differ from those of other domains. Bidirectional Encoder Representations from Transformers (BERT) is a language representation model that is optimized during pre-training with self-supervised objectives and follows the transfer-learning paradigm of pre-training on a large dataset before fine-tuning the model on a specific task. In this work, we investigate variances of pre-trained BERT-based models and compare several methods to handle long legal documents in legal text classification. Several attempts (Beltagy et al., 2020; Zaheer et al., 2020; Pappagari et al., 2019) have been made to enable BERT-like models to work on documents with more than 512 tokens, and we adapt these techniques to learn how to effectively use BERT on long legal documents. RQ1: For legal text classification, does pre-training on in-domain documents lead to more effective performance than pre-training on general documents? To answer this research question, we compare models pre-trained on legal and general documents, using a batch size of 16 and fine-tuning the models on individual tasks for 5 epochs.
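As a concrete illustration of one such adaptation, the hedged sketch below splits a long document into overlapping 512-token chunks, encodes each chunk with BERT, and mean-pools the chunk representations, in the spirit of hierarchical approaches such as Pappagari et al. (2019). The model name, chunk and stride sizes, and pooling choice are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of a chunk-then-pool strategy for documents longer than 512
# tokens (assumed configuration; not the paper's exact setup).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_long_document(text: str, chunk_size: int = 512, stride: int = 128) -> torch.Tensor:
    """Return a single pooled vector for an arbitrarily long document."""
    # The fast tokenizer can split the document into overlapping chunks itself.
    enc = tokenizer(
        text,
        max_length=chunk_size,
        stride=stride,
        truncation=True,
        return_overflowing_tokens=True,
        padding="max_length",
        return_tensors="pt",
    )
    with torch.no_grad():
        out = encoder(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
    cls_per_chunk = out.last_hidden_state[:, 0, :]  # [CLS] vector per chunk: (num_chunks, hidden)
    return cls_per_chunk.mean(dim=0)                # mean-pool over chunks: (hidden,)
```

A pooled document vector like this can then feed a task-specific classification head (multi-label for the ECHR Violation Dataset, binary for the Overruling Task Dataset); in the fine-tuning setup described above, models are trained with a batch size of 16 for 5 epochs on each task.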

The full summary also covers Datasets, Experimental Setup, the Overruling Task Dataset, Model Variances, the ECHR Violation Dataset, and Findings.