Performance analysis of large language models in the domain of legal argument mining.

Abdullah Al Zubaer, Jelena Mitrović, Michael Granitzer

doi:10.3389/frai.2023.1278796

Abstract

Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. ChatGPT and the recently released GPT-4 model have shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining are yet to be realized and have not been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the models' performance under diverse prompt formulations and under example selection for the prompt via semantic search, using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our results statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT-3.5 and GPT-4 in the F1-score for the premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that the performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection. This suggests that local models are as semantically rich as the embeddings from the OpenAI model. Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.
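
The abstract mentions selecting in-prompt examples via semantic search over embeddings. The following is a minimal sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the example sentences are invented rather than taken from the corpus, the prompt wording and the two-label set are hypothetical, and `embed` is a toy bag-of-words stand-in for a real embedding model such as an OpenAI or sentence-transformers encoder.

```python
import math
from collections import Counter

# Hypothetical labelled pool. The sentences are invented for illustration
# and are NOT taken from the European Court of Human Rights corpus.
LABELLED_POOL = [
    ("The applicant complained that his detention had been unlawful.", "premise"),
    ("The Court therefore finds that there has been a violation of Article 5.", "conclusion"),
    ("The Government submitted that domestic remedies had not been exhausted.", "premise"),
    ("Accordingly, the complaint must be rejected as manifestly ill-founded.", "conclusion"),
]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    A real setup would call an embedding model here instead.
    """
    cleaned = text.lower().replace(".", "").replace(",", "")
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k=2):
    """Return the k labelled examples most similar to the query sentence."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Assemble a few-shot classification prompt (hypothetical wording)."""
    lines = ["Classify the sentence as 'premise' or 'conclusion'.", ""]
    for text, label in examples:
        lines += [f"Sentence: {text}", f"Label: {label}", ""]
    lines += [f"Sentence: {query}", "Label:"]
    return "\n".join(lines)


query = "The Court finds that there has been no violation of Article 5."
examples = select_examples(query, LABELLED_POOL)
prompt = build_prompt(query, examples)
print(prompt)
```

Swapping `embed` for a real encoder would leave `select_examples` and `build_prompt` unchanged, which is one way to compare embedding models for example selection while holding the prompt structure fixed.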

Journal: Frontiers in Artificial Intelligence
Publication Date: Nov 17, 2023
License type: CC BY 4.0