Abstract
Large language models (LLMs) have demonstrated remarkable success in natural language processing (NLP) tasks. This study evaluated their performance on social media-based, health-related text classification tasks. We benchmarked 1 Support Vector Machine (SVM) classifier, 3 supervised pretrained language models (PLMs), and 2 LLM-based classifiers across 6 text classification tasks. We developed 3 approaches for leveraging LLMs: employing LLMs as zero-shot classifiers, using LLMs as data annotators, and using LLMs with few-shot examples for data augmentation. Across all tasks, the mean (SD) F1-score differences for RoBERTa, BERTweet, and SocBERT trained on human-annotated data, compared to those trained on data annotated using GPT-3.5, were 0.24 (±0.10), 0.25 (±0.11), and 0.23 (±0.11), respectively, and were 0.16 (±0.07), 0.16 (±0.08), and 0.14 (±0.08), respectively, using GPT-4. The GPT-3.5 and GPT-4 zero-shot classifiers outperformed SVMs in 1 task and in 5 of 6 tasks, respectively. When LLMs were used for data augmentation, RoBERTa models trained on GPT-4-augmented data performed comparably to or better than those trained on human-annotated data alone. The results revealed that training supervised classification models on LLM-annotated data alone was ineffective. However, employing an LLM as a zero-shot classifier showed the potential to outperform traditional SVM models and achieved higher recall than the advanced transformer-based model RoBERTa. Our results also indicated that data augmentation with GPT-3.5 could harm model performance, whereas augmentation with GPT-4 improved it, showcasing the potential of LLMs to reduce the need for extensive training data. By leveraging the data augmentation strategy, we can harness the power of LLMs to develop smaller, more effective domain-specific NLP models.
Using LLM-annotated data alone, without human guidance, to train lightweight supervised classification models is an ineffective strategy. However, an LLM used as a zero-shot classifier shows promise for excluding false negatives and may reduce the human effort required for data annotation.
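The zero-shot classifier setup described above can be sketched as follows. This is a minimal illustration, not the authors' exact prompts: the function names, example labels, and prompt wording are assumptions, and the constructed prompt would be sent to a chat-style LLM such as GPT-4.

```python
def build_zero_shot_prompt(post: str, labels: list[str], task: str) -> str:
    """Compose a zero-shot classification prompt: a task description and the
    post text only, with no labeled examples (hypothetical wording)."""
    label_list = ", ".join(labels)
    return (
        f"You are a classifier for {task}.\n"
        f"Assign exactly one label from: {label_list}.\n"
        f"Post: {post}\n"
        "Label:"
    )


def parse_label(response: str, labels: list[str]) -> str:
    """Map the model's free-text reply onto one of the allowed labels,
    falling back to 'unknown' when no label is recognized."""
    reply = response.strip().lower()
    for label in labels:
        if label.lower() in reply:
            return label
    return "unknown"


# Illustrative usage with assumed labels for a health-mention task:
labels = ["health_mention", "no_health_mention"]
prompt = build_zero_shot_prompt(
    "Been dizzy all week since the new meds.", labels, "health mention detection"
)
```

In this setup, the LLM never sees training data; the supervised alternatives (SVM, RoBERTa, BERTweet, SocBERT) instead learn from annotated examples, which is the trade-off the study quantifies.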
Journal of the American Medical Informatics Association (JAMIA)