Can Large Language Models Understand Real-World Complex Instructions?

Qianyu He, Lina Chen, Yanghua Xiao, Xunzhe Zhou, Jiaqing Liang, Jin Xiao, Qianxi He, Wenhao Huang, Jie Zeng

doi:10.1609/aaai.v38i16.29777

Abstract

Large language models (LLMs) can understand human instructions, showing their potential for pragmatic applications beyond traditional NLP tasks. However, they still struggle with complex instructions, which can be either complex task descriptions that require multiple tasks and constraints, or complex input that contains long context, noise, heterogeneous information and multi-turn format. Due to these features, LLMs often ignore semantic constraints from task descriptions, generate incorrect formats, violate length or sample count constraints, and are unfaithful to the input text. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions, as they are close-ended and simple. To bridge this gap, we propose CELLO, a benchmark for systematically evaluating LLMs' ability to follow complex instructions. We design eight features for complex instructions and construct a comprehensive evaluation dataset from real-world scenarios. We also establish four criteria and develop corresponding metrics, as current ones are inadequate, biased, or too strict and coarse-grained. We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO.

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Mar 24, 2024
Citations: 1

Similar Papers

Performance of Large Language Models on a Neurology Board–Style Examination
Marc Cicero Schubert ... Varun Venkataramani
JAMA network open | VOL. 6 | 07 Dec 2023

Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
Ivan Civettini ... Carlo Gambacorti-Passerini
Blood | VOL. 142 | 02 Nov 2023

Zero- and few-shot prompting of generative large language models provides weak assessment of risk of bias in clinical trials.
Simon Šuster ... Timothy Baldwin
Research synthesis methods | VOL. 15 | 23 Aug 2024

Leveraging Large Language Models for Precision Monitoring of Chemotherapy-Induced Toxicities: A Pilot Study with Expert Comparisons and Future Directions.
Oskitz Ruiz Sarrias ... Covadonga Figaredo Berjano
Cancers | VOL. 16 | 12 Aug 2024
