Abstract
Automated text classification is a fundamental research topic within the legal domain, as it is the foundation for building many intelligent legal solutions. Publicly available legal training data is scarce, and classification algorithms struggle to perform in low-data scenarios. Text augmentation techniques have been proposed to enhance classifiers through artificially synthesised training data. In this paper we present and evaluate a combination of rule-based and advanced generative text augmentation methods designed to create additional training data for the task of classifying legal contracts. We introduce a repurposed CUAD contract dataset, modified for the task of document-level classification, and compare a deep learning DistilBERT model with an optimised support vector machine baseline to contrast shallow and deep strategies. The deep learning model significantly outperformed the shallow model on the full training data (F1-score of 0.9738 compared to 0.599). We achieved promising improvements when evaluating the combined augmentation techniques on three reduced datasets. Augmentation increased the F1-score by 66.6%, 17.5% and 2.6% for the 25%, 50% and 75% reduced datasets respectively, compared to the non-augmented baseline. We discuss the benefits augmentation can bring to low-data regimes and the need to extend augmentation techniques to preserve key terms in specialised domains such as law.
Highlights
One of the fundamental applications of natural language processing (NLP) is text classification.
This low-data scenario has set an upper limit on deep learning model performance and hindered the big data revolution in legal machine learning compared to other fields of study [7].
We have demonstrated that sentence-level data augmentation can yield significant improvements for the domain of legal document classification.
Summary
One of the fundamental applications of natural language processing (NLP) is text classification. Deep learning systems have achieved state-of-the-art performance in many complex NLP tasks, including classification. However, these algorithms tend to underperform and exhibit strong variance when given insufficient data [5]. Concerns about safeguarding privacy have resulted in a lack of publicly available annotated legal data, which in turn has been an obstacle to the development of robust legal text classification systems. This low-data scenario has set an upper limit on deep learning model performance and hindered the big data revolution in legal machine learning compared to other fields of study [7].
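To make the idea of rule-based, sentence-level augmentation concrete, the following is a minimal sketch only. It is not the pipeline evaluated in the paper, whose exact operations are not described here. It assumes two generic rule-based operations (random word swap and random word deletion) and a hypothetical `protected` set of key legal terms that are never deleted, illustrating the point about preserving key terms in specialised domains. All function names and parameters are illustrative assumptions.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs exchanged."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_delete(tokens, p, rng, protected=frozenset()):
    """Drop each token with probability p, never dropping protected terms."""
    if not tokens:
        return []
    kept = [t for t in tokens if t.lower() in protected or rng.random() >= p]
    # Avoid returning an empty sentence if every token was dropped.
    return kept if kept else [rng.choice(tokens)]


def augment_sentence(sentence, n_aug=3, p_delete=0.1, n_swaps=1,
                     seed=None, protected=frozenset()):
    """Create n_aug noisy variants of a sentence (hypothetical helper)."""
    rng = random.Random(seed)
    protected = frozenset(term.lower() for term in protected)
    tokens = sentence.split()
    variants = []
    for _ in range(n_aug):
        noisy = random_swap(tokens, n_swaps, rng)
        noisy = random_delete(noisy, p_delete, rng, protected)
        variants.append(" ".join(noisy))
    return variants


if __name__ == "__main__":
    example = "The Licensee shall indemnify the Licensor against all claims"
    key_terms = {"licensee", "licensor", "indemnify"}  # assumed key terms
    for variant in augment_sentence(example, seed=0, protected=key_terms):
        print(variant)
```

Each variant would keep the original document's class label and be added to the reduced training set. Note that in this sketch protected terms can still change position through swapping; only deletion is blocked.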