Objective: To develop an advanced multi-task large language model (LLM) framework for extracting diverse types of information about dietary supplements (DSs) from clinical records.

Materials and Methods: We focused on 4 core DS information extraction tasks: named entity recognition (2949 clinical sentences), relation extraction (4892 sentences), triple extraction (2949 sentences), and usage classification (2460 sentences). To address these tasks, we introduced the retrieval-augmented multi-task information extraction (RAMIE) framework, which incorporates: (1) instruction fine-tuning with task-specific prompts; (2) multi-task training of LLMs to improve storage efficiency and reduce training costs; and (3) retrieval-augmented generation, which retrieves similar examples from the training set to improve task performance. We compared RAMIE's performance to that of LLMs with instruction fine-tuning alone and conducted an ablation study to evaluate the individual contributions of multi-task learning and retrieval-augmented generation to the overall improvements.

Results: Using the RAMIE framework, Llama2-13B achieved an F1 score of 87.39 on the named entity recognition task, a 3.51% improvement. It also led the relation extraction task with an F1 score of 93.74, a 1.15% improvement. On the triple extraction task, Llama2-7B achieved an F1 score of 79.45, a substantial 14.26% improvement. MedAlpaca-7B delivered the highest F1 score, 93.45, on the usage classification task, a 0.94% improvement. The ablation study showed that while multi-task learning improved efficiency with a minor trade-off in performance, retrieval-augmented generation significantly enhanced accuracy across tasks.

Conclusion: The RAMIE framework demonstrates substantial improvements in multi-task information extraction for DS-related data from clinical records.
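To illustrate the retrieval-augmented generation step described above, the sketch below builds a prompt by retrieving the training example most similar to the input sentence and prepending it as a demonstration. This is a minimal illustration only: the bag-of-words cosine retriever, the `build_rag_prompt` helper, and the prompt wording are all hypothetical stand-ins, not the paper's actual implementation.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_rag_prompt(query: str, train_examples: list[tuple[str, str]], k: int = 1) -> str:
    """Retrieve the k training examples most similar to `query` and
    prepend them as demonstrations to a task instruction (hypothetical
    wording, shown here for the named entity recognition task)."""
    qv = Counter(query.lower().split())
    ranked = sorted(
        train_examples,
        key=lambda ex: cosine(qv, Counter(ex[0].lower().split())),
        reverse=True,
    )
    demos = "\n".join(f"Sentence: {s}\nEntities: {y}" for s, y in ranked[:k])
    return (
        "Extract dietary supplement entities from the sentence.\n"
        f"{demos}\n"
        f"Sentence: {query}\nEntities:"
    )


if __name__ == "__main__":
    train = [
        ("Patient takes fish oil daily.", "fish oil"),
        ("No supplements reported at this visit.", "none"),
    ]
    print(build_rag_prompt("She takes fish oil capsules.", train))
```

In the actual framework, a learned embedding model would replace the toy similarity function, and the retrieved demonstrations would be combined with the task-specific instruction used during fine-tuning.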