Objectives: Determine whether a large language model (LLM, GPT-4) can label, consolidate, and analyze interventional radiology (IR) microwave ablation device safety event data into meaningful summaries comparable to those produced by human readers.

Methods: Microwave ablation safety data from January 1, 2011 to October 31, 2023 were collected, and the type of failure was categorized by human readers. The data were then classified with GPT-4 using iteratively developed prompts. Iterative summarization of the reports was performed with GPT-4 to generate a final summary of the large text corpus.

Results: Training (n = 25), validation (n = 639), and test (n = 79) data were split to reflect real-world deployment of an LLM for this task. GPT-4 demonstrated high accuracy on the multiclass classification problem of microwave ablation device data (accuracy [95% CI]: training 96.0% [79.7, 99.9], validation 86.4% [83.5, 89.0], test 87.3% [78.0, 93.8]). The text content was distilled with GPT-4 through iterative summarization prompts. The resulting summary captured the clinically relevant insights from the microwave ablation data, consistent with the human interpretation, but contained inaccurate event class counts.

Conclusion: The LLM emulated the human analysis, suggesting the feasibility of using LLMs to process large volumes of IR safety data as a tool for clinicians. It accurately labelled microwave ablation device event data by type of malfunction through few-shot learning, and content distillation of a large text corpus (>650 reports) generated an insightful summary comparable to the human interpretation.
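The abstract does not include the authors' prompts or code. The sketch below illustrates the two-stage workflow it describes, few-shot classification followed by iterative (map-reduce style) summarization, using the OpenAI Python client. The category names, exemplar reports, prompt wording, and batch size are illustrative assumptions, not the paper's actual materials.

```python
# Minimal sketch of the two-stage workflow described in the abstract.
# Assumptions (not from the paper): category names, few-shot exemplars,
# prompt wording, and batch size.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical failure categories; the paper's actual label set is not given here.
CATEGORIES = ["probe malfunction", "generator malfunction", "cable/connector failure", "other"]

# A few human-labelled reports serve as few-shot exemplars (training split).
FEW_SHOT = [
    ("Antenna tip fractured during ablation and was retrieved.", "probe malfunction"),
    ("Generator displayed an error code and aborted the ablation cycle.", "generator malfunction"),
]

def classify_report(report: str) -> str:
    """Label one safety report with a single category via few-shot prompting."""
    examples = "\n".join(f"Report: {r}\nLabel: {l}" for r, l in FEW_SHOT)
    prompt = (
        "Classify the microwave ablation device safety report into exactly one of: "
        + ", ".join(CATEGORIES) + ".\n\n"
        + examples
        + f"\n\nReport: {report}\nLabel:"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels for evaluation
    )
    return resp.choices[0].message.content.strip()

def iterative_summarize(reports: list[str], batch_size: int = 40) -> str:
    """Content distillation: summarize batches, then summarize the summaries."""
    summaries = []
    for i in range(0, len(reports), batch_size):
        chunk = "\n---\n".join(reports[i : i + batch_size])
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": "Summarize the key device failure patterns in these reports:\n" + chunk,
            }],
        )
        summaries.append(resp.choices[0].message.content)
    if len(summaries) == 1:
        return summaries[0]
    # Recurse over the intermediate summaries until one final summary remains.
    return iterative_summarize(summaries, batch_size)
```

The recursive reduction step is one way to fit >650 reports into a bounded context window; other chunking or refinement strategies would serve the same purpose.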
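The bracketed ranges in Results are 95% confidence intervals for a binomial proportion. The abstract does not name the interval method, but the reported bounds are consistent with exact (Clopper-Pearson) intervals; the check below assumes that method, with correct-answer counts inferred from the rounded accuracies.

```python
# Reproduce the reported 95% CIs assuming exact (Clopper-Pearson) intervals;
# the abstract does not state which interval method the authors used.
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided CI for a binomial proportion with k successes out of n."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# Counts inferred from the reported accuracies (assumption):
# training 24/25 = 96.0%, test 69/79 = 87.3%.
for k, n in [(24, 25), (69, 79)]:
    lo, hi = clopper_pearson(k, n)
    print(f"{k}/{n}: {100 * k / n:.1f}% [{100 * lo:.1f}, {100 * hi:.1f}]")
# Prints 96.0% [79.7, 99.9] and 87.3% [78.0, 93.8], matching the
# training and test intervals reported in Results.
```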