Abstract

The development of effective automatic summarization approaches for legal documents suffers from several challenges, such as extremely long document-summary pairs and the lack of large-scale training datasets with tractable document-summary token lengths. In this work, we address the problem of legal document summarization by building a modified abstractive dataset from the original dataset. This ensures that the length of each document-summary pair is manageable and can be processed by state-of-the-art summarization approaches (such as BART). Secondly, we address the data scarcity problem by creating additional training samples from each original document-summary pair. This is done by creating multiple extractive summaries from each sample in the original dataset, after which ground-truth summary sentences are assigned to each extractive summary to generate new training samples. The result is a larger training dataset that can be used for fine-tuning summarization models. The proposed approach has been evaluated on two legal datasets: BillSum and the Forum for Information Retrieval Evaluation (FIRE). With respect to ROUGE metrics, the proposed approach outperforms a pre-trained BART model fine-tuned on the original dataset by 3-8% on the FIRE test sets, and by 1-3% on the BillSum test sets. Considering BERTScore, the proposed approach obtains 1-2% improvements on the FIRE test sets, while 3-8% improvements are observed on the BillSum test sets. Such improvements suggest that the proposed dataset-building approach can help achieve improved abstractive summarization of lengthy legal documents.
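A minimal sketch of the augmentation idea described above, under assumptions not stated in the abstract: the document is split into several extractive "views" (here, simple consecutive sentence chunks, standing in for whatever extractive method the paper actually uses), and each ground-truth summary sentence is assigned to the view it overlaps most (here, unigram overlap, standing in for a ROUGE-style alignment). The function names and thresholds are hypothetical.

```python
def word_overlap(a, b):
    """Jaccard overlap between the word sets of two sentences (hypothetical scorer)."""
    wa = {w.strip(".,;:").lower() for w in a.split()}
    wb = {w.strip(".,;:").lower() for w in b.split()}
    return len(wa & wb) / (len(wa | wb) or 1)

def build_training_pairs(doc_sentences, summary_sentences, chunk_size=3):
    """Turn one long document-summary pair into several shorter training pairs."""
    # Step 1: form multiple extractive "summaries" of the document
    # (consecutive chunks here, as a stand-in for a real extractive model).
    chunks = [doc_sentences[i:i + chunk_size]
              for i in range(0, len(doc_sentences), chunk_size)]
    # Step 2: assign each ground-truth summary sentence to its best-matching chunk.
    assignments = [[] for _ in chunks]
    for sent in summary_sentences:
        scores = [max(word_overlap(sent, c) for c in chunk) for chunk in chunks]
        assignments[scores.index(max(scores))].append(sent)
    # Step 3: each chunk with at least one assigned sentence becomes a new pair.
    return [(" ".join(chunk), " ".join(assigned))
            for chunk, assigned in zip(chunks, assignments) if assigned]
```

Each returned pair is short enough to fit a standard encoder-decoder budget, so the augmented set can be fed directly to BART fine-tuning.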
