Abstract

e23198 Background: Labeling and identifying concepts from qualitative studies and patient interviews in oncology can be challenging and time-consuming, often requiring several iterations. Large language models (LLMs) may be tailored to serve as a 'co-pilot', reducing errors and time to completion. We therefore aimed to examine whether LLMs can pre-label concepts (i.e., how patients feel or function) from medical oncology literature text with accuracy comparable to human reviewers, and whether such LLM pre-labeling can dramatically reduce the time required for manual labor in conceptual development landscape reviews in oncology.

Methods: We developed GPT-4 (OpenAI) prompts to pre-label concepts reported in abstracts of qualitative study references. We further developed GPT-4 prompts to categorize each pre-labeled concept into one of the following high-level categories: clinician reported symptoms, cognitive impacts, symptoms due to treatment, symptoms due to diagnosis, emotional impacts, impact on caregiver, other impacts, physical appearance, physical function, and social impacts. The prompts were extended to also capture diagnoses, treatments, patient reported outcomes, and generic or irrelevant terms that would otherwise have been erroneously captured under the concept category, as observed in early iterations of the prompt engineering. The prompts were tested on 48, 37, and 38 eligible references (123 in total) from three conceptual development reviews covering colorectal cancer (CRC), non-small cell lung cancer (NSCLC), and acute myeloid leukemia (AML), respectively.

Results: The LLM pre-labeled a total of 153, 131, and 159 candidate concepts across the CRC, NSCLC, and AML reviews, respectively. In all three reviews, clinician reported symptoms, emotional impacts, and social impacts each accounted for 18-25% of the pre-labeled concepts, whereas physical function and social impacts accounted for 5-9% each. A total of 2 (1.3%), 3 (2.3%), and 4 (2.5%) pre-labels were predicted to belong to the wrong high-level category. Incorrectly categorized labels were dispersed across most categories and should have been categorized as 'other impacts'. In the NSCLC review, one pre-label categorized as impact on caregiver and two categorized as other impacts were erroneously extracted concepts.

Conclusions: LLMs are effective at capturing and pre-labeling concepts for further review, as evidenced by the high number of concept labels extracted. Further high-level categorization with LLMs to guide manual reviewers also appears highly accurate.
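To illustrate the two-stage approach described in the Methods (pre-labeling concepts from abstracts, then assigning each concept to a high-level category), a minimal sketch using the OpenAI Python client is shown below. The prompt wording, function names, and category handling are illustrative assumptions, not the prompts used in the study.

```python
# Minimal sketch of a two-stage concept pre-labeling and categorization pipeline.
# Prompt text and helper names are illustrative assumptions, not the study's prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = [
    "clinician reported symptoms", "cognitive impacts", "symptoms due to treatment",
    "symptoms due to diagnosis", "emotional impacts", "impact on caregiver",
    "other impacts", "physical appearance", "physical function", "social impacts",
]

def prelabel_concepts(abstract_text: str) -> list[str]:
    """Stage 1: extract candidate concepts (how patients feel or function) from an abstract."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Extract concepts describing how patients feel or function "
                        "from the abstract. Return one concept per line."},
            {"role": "user", "content": abstract_text},
        ],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]

def categorize_concept(concept: str) -> str:
    """Stage 2: assign a pre-labeled concept to exactly one high-level category."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Assign the concept to exactly one of these categories: "
                        + "; ".join(CATEGORIES)
                        + ". Answer with the category name only."},
            {"role": "user", "content": concept},
        ],
    )
    return response.choices[0].message.content.strip()
```

In this sketch, each pre-labeled concept would then be reviewed manually, with the predicted category serving only as a guide for the human reviewer.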
