A large reproducible benchmark on text classification for the legal domain based on the ECHR-OD repository

Alexandre Quemy, Robert Wrembel, Natalia Łopuszyńska, George Papadakis, Agustín D Delgado

doi:10.1016/j.is.2023.102258

Abstract

This work is a companion reproducible paper to the experiments and results reported in our previous work (Quemy and Wrembel, 2022), which introduced an open repository of legal documents, called ECHR-OD, together with a large benchmark of Machine Learning (ML) methods for text classification. ML algorithms are used in various domains, including banking, healthcare, manufacturing, energy management, security, trade, and insurance. However, building reliable ML models is challenging. First, building prediction models with ML algorithms requires massive amounts of pre-processed data, but in practice such datasets are scarce or take a tremendous amount of time to prepare. Second, once a model is built, its performance needs to be assessed. This requires benchmarks, whose availability is limited as well. Although ML algorithms are used in multiple domains, their application to the legal domain has so far received little attention from research communities. This motivated us to build and make available an open repository of judgment documents, called the European Court of Human Rights Open Data (ECHR-OD). In this paper, we describe a step-by-step Extract, Transform, and Load (ETL) process for building ECHR-OD, supported with code snippets, so that it can be easily reproduced. The process produces (almost) exhaustive datasets that have been transformed, homogenized, re-organized, and cleaned beforehand, and that are made available in a format suitable for ML algorithms. The ECHR-OD repository provides tabular descriptive features as well as features extracted from natural language documents, accessible via a web user interface. Moreover, we provide a self-contained and easily reproducible set of experiments assessing ML classification algorithms on the content of the ECHR-OD repository.
To the best of our knowledge, the ETL process and the set of experiments together form the first open, reproducible, and fully end-to-end benchmark on the prediction of European Court of Human Rights judgments, covering every step from ingesting and pre-processing legal documents to obtaining high-quality ML models. Both components, the ETL process and the experiments, leverage Docker for reproducibility. The content of this paper weakly reproduces the original results and provides a new, weakly reproducible set of experiments.

Full Text