This study evaluates the effectiveness of generative artificial intelligence (GAI) in identifying and reconstructing legal arguments from judges’ reasons in court cases, focusing on the practical implications for law students and legal educators. By examining the performance of two versions each of two popular Large Language Models (LLMs) – ChatGPT and Claude – across five recent High Court of Australia decisions, the study makes a preliminary assessment of the accuracy of LLM systems in replicating a skill essential for lawyers: the identification of arguments and argument chains in judges’ reasons. The methodology involves marking LLM-generated outputs against both a sample answer and a detailed rubric. Key findings reveal significant variance in accuracy across LLMs, with Claude 3.5 markedly outperforming all others and achieving average marks of up to 90 per cent. In contrast, the ChatGPT versions demonstrated lower accuracy, with average marks not exceeding 50 per cent. These results highlight the critical importance of selecting the right GAI system for legal applications, as well as the need for users to engage critically with AI outputs rather than relying solely on automated tools. The study concludes that while LLMs hold potential benefits for the legal profession, including increased efficiency and enhanced access to justice, for the kinds of GAI use a law student might undertake, the technology cannot yet replace the nuanced human skill of legal argument analysis.