Abstract
Objective
Recent advances in large language models (LLMs) offer opportunities to automate health coaching. With their zero-shot learning ability, LLMs could make health coaching more accessible, scalable, and customizable. The aim of this study is to compare the quality of responses to clients' sleep-related questions provided by health coaches and by an LLM.

Design, setting, and participants
From a de-identified dataset of coaching conversations from a pilot randomized controlled trial, we extracted 100 question-answer pairs comprising client questions and the corresponding health coach responses. These questions were entered into a retrieval-augmented generation (RAG)-enabled open-source LLM (LLaMa-2-7b-chat) to generate LLM responses. Of the 100 question-answer pairs, 90 were assigned to three groups of evaluators: experts, lay users, and GPT-4. Each group conducted two evaluation tasks: (Task 1) a single-response quality assessment spanning five criteria (accuracy, readability, helpfulness, empathy, and likelihood of harm) rated on a five-point Likert scale, and (Task 2) a pairwise comparison in which evaluators chose the better of the two responses. Inferential statistical methods, including paired and independent-samples t-tests, Pearson correlation, and chi-square tests, were used to address the study objective. To account for potential biases in human judgment, the remaining 10 question-answer pairs were used to assess inter-evaluator reliability among the human evaluators, quantified with the intraclass correlation coefficient and percentage agreement.

Results
After excluding incomplete data, the analysis included 178 single-response evaluations (Task 1) and 83 pairwise comparisons (Task 2). Expert and GPT-4 assessments revealed no significant differences between health coach and LLM responses on any of the five criteria. In contrast, lay users rated LLM responses as significantly more helpful than those of human coaches (p < 0.05). LLM responses were preferred in the majority (62.25%, n = 155) of the 249 aggregate pairwise assessments, with all three evaluator groups favoring the LLM over the health coaches. While GPT-4 rated both health coach and LLM responses significantly higher than the experts did on readability, helpfulness, and empathy, its ratings of accuracy and likelihood of harm aligned with the experts'. Response length correlated positively with accuracy and empathy scores but negatively with readability across all evaluator groups. Expert and lay-user evaluators demonstrated moderate to high inter-evaluator reliability.

Conclusion
Our findings are encouraging: the RAG-enabled LLM performed comparably to human health coaches in the domain tested. As an initial step toward more sophisticated, adaptive, round-the-clock automated health coaching systems, these results call for more extensive evaluation to guide model development toward potential clinical implementation.
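To make the pipeline described above concrete, below is a minimal sketch of a RAG-enabled question-answering setup in Python. The corpus, embedding model, prompt template, and `answer` helper are illustrative assumptions, not the study's implementation; the Llama-2 checkpoint shown is one common Hugging Face hosting of the model named in the methods and requires gated access.

```python
# Minimal RAG sketch (illustrative only, not the study's pipeline).
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Toy sleep-coaching corpus standing in for the real retrieval knowledge base.
corpus = [
    "Adults generally need 7-9 hours of sleep per night.",
    "Caffeine within 6 hours of bedtime can delay sleep onset.",
    "A consistent wake time helps stabilize the circadian rhythm.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
corpus_emb = embedder.encode(corpus, convert_to_tensor=True)

# The study used LLaMa-2-7b-chat; this model ID is an assumption.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def answer(question: str, top_k: int = 2) -> str:
    # Retrieve the top-k most similar passages by cosine similarity.
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
    context = "\n".join(corpus[h["corpus_id"]] for h in hits)

    # Ground the generation in the retrieved context.
    prompt = (f"Use the context to answer the client's question.\n"
              f"Context:\n{context}\nQuestion: {question}\nAnswer:")
    return generator(prompt, max_new_tokens=200)[0]["generated_text"]

print(answer("Why do I wake up at 3 am after drinking coffee at night?"))
```

The inferential tests named in the methods can likewise be sketched with SciPy. The rating arrays and preference counts below are invented placeholders (only the aggregate 155-of-249 LLM preference matches the abstract); they illustrate the test calls, not the study's data or results.

```python
# Sketch of the abstract's statistical comparisons (placeholder data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Task 1: five-point Likert ratings of the same 90 questions from both
# sources, a paired design (one coach and one LLM response per question).
coach = rng.integers(1, 6, size=90).astype(float)
llm = rng.integers(1, 6, size=90).astype(float)
t_paired, p_paired = stats.ttest_rel(coach, llm)  # paired t-test

# Independent-samples t-test, e.g., expert vs. GPT-4 rating levels.
expert = rng.integers(1, 6, size=90).astype(float)
gpt4 = rng.integers(1, 6, size=90).astype(float)
t_ind, p_ind = stats.ttest_ind(expert, gpt4)

# Pearson correlation, e.g., response length vs. a quality rating.
lengths = rng.integers(20, 400, size=90).astype(float)
r, p_corr = stats.pearsonr(lengths, llm)

# Task 2: chi-square test on pairwise-preference counts
# (rows: evaluator groups; columns: preferred coach vs. preferred LLM).
# Per-group splits are invented; columns sum to 94 and 155 (n = 249).
prefs = np.array([[31, 52], [27, 56], [36, 47]])
chi2, p_chi2, dof, _ = stats.chi2_contingency(prefs)

print(f"paired t p={p_paired:.3f}, independent t p={p_ind:.3f}")
print(f"Pearson r={r:.2f} (p={p_corr:.3f}), chi-square p={p_chi2:.3f}")
```

Inter-evaluator reliability on the 10 held-out pairs could be quantified analogously, e.g., with `pingouin.intraclass_corr` for the intraclass correlation coefficient and simple matching for percentage agreement.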