A Survey on Evaluation of Large Language Models

Yupeng Chang, Kaijie Zhu, Qiang Yang, Yidong Wang, Xu Wang, Linyi Yang, Jindong Wang, Wei Ye, Cunxiang Wang, Yue Zhang, Hao Chen, Xiaoyuan Yi, Xing Xie, Yuan Wu, Yi Chang, Philip S Yu

doi:10.1145/3641289

Abstract

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLM evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey

Journal: ACM Transactions on Intelligent Systems and Technology
Publication Date: Mar 29, 2024
Citations: 187

Similar Papers

How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
Galit Shmueli ... Bianca Maria Colosimo
INFORMS Journal on Data Science | VOL. 2
01 Apr 2023

Use of SNOMED CT in Large Language Models: Scoping Review.
Eunsuk Chang ... Sumi Sung
JMIR medical informatics | VOL. 12
07 Oct 2024

Evaluating large language models on medical, lay-language, and self-reported descriptions of genetic conditions
Kendall A Flaharty ... Benjamin D Solomon
The American Journal of Human Genetics | VOL. 111
14 Aug 2024

A survey on large language model (LLM) security and privacy: The Good, The Bad, and The Ugly
Yifan Yao ... Yue Zhang
High-Confidence Computing | VOL. 4
01 Mar 2024
