Are Some Words Worth More than Others?

Shiran Dudy, Steven Bedrick

doi:10.18653/v1/2020.eval4nlp-1.13

Abstract

Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy only captures one aspect of a language model's behavior, and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics directly tied to prediction accuracy (including perplexity) may be confounded by the Zipfian nature of written language, as the majority of the prediction attempts will occur with frequently-occurring types. A model's performance may vary greatly between high- and low-frequency words, which in practice could lead to failure modes such as repetitive and dull generated text being produced by a downstream consumer of a language model. To address this, we propose two new intrinsic evaluation measures within the framework of a simple word prediction task that are designed to give a more holistic picture of a language model's performance. We evaluate several commonly-used large English language models using our proposed metrics, and demonstrate that our approach reveals functional differences in performance between the models that are obscured by more traditional metrics.
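The frequency confound described in the abstract can be illustrated with a small sketch. This is not one of the two measures the paper proposes (the abstract does not define them); it is only a toy comparison, under assumed data, between ordinary token-level accuracy and an accuracy averaged per word type. The function names `token_accuracy` and `type_macro_accuracy` and the toy token lists are hypothetical.

```python
from collections import Counter, defaultdict


def token_accuracy(refs, preds):
    """Fraction of positions where the predicted token equals the reference."""
    return sum(r == p for r, p in zip(refs, preds)) / len(refs)


def type_macro_accuracy(refs, preds):
    """Mean of per-type accuracies, so every word type counts once
    regardless of how often it occurs in the reference."""
    totals = Counter(refs)
    hits = defaultdict(int)
    for r, p in zip(refs, preds):
        hits[r] += int(r == p)
    return sum(hits[t] / totals[t] for t in totals) / len(totals)


# Toy, made-up data: a Zipf-like reference in which one frequent type dominates,
# and a degenerate "model" that always predicts that frequent type.
refs = ["the"] * 8 + ["cat", "zebra"]
preds = ["the"] * 10

micro = token_accuracy(refs, preds)       # dominated by the frequent type
macro = type_macro_accuracy(refs, preds)  # each type weighted equally
print(f"token-level accuracy: {micro:.2f}")
print(f"type-averaged accuracy: {macro:.2f}")
```

In this toy case the token-level score looks strong even though the model never predicts a rare word correctly, which is the kind of gap between high- and low-frequency words that the abstract warns aggregate metrics may hide.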

Journal: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing
Publication Date: Jan 1, 2020
Citations: 25

