Investigating the clinical reasoning abilities of large language model GPT-4: an analysis of postoperative complications from renal surgeries

Jessica Y Hsueh, Daniel Nethala, Shiva Singh, W Marston Linehan, Mark W Ball

doi:10.1016/j.urolonc.2024.04.010

Abstract

Purpose: Large language models, a subset of artificial intelligence, have immense potential to support human tasks. Their role in science and medicine, fields that require strong critical thinking and analytical skills, remains unclear. The objective of our study was to evaluate GPT-4's ability to assess postoperative complications after renal surgeries.

Materials and methods: Discharge summaries were compiled, and patient information was deidentified using a Python-based program. Prompts were engineered for GPT-4 to assess for the presence of postoperative complications. GPT-4 was further asked to interpret each complication's Clavien-Dindo classification and institution-specific category. The GPT-4-generated database was compared with a human-curated database. Discrepancies were manually reviewed to calculate match and accuracy rates.

Results: Approximately 944 renal surgeries were conducted from August 2005 to March 2022. There was a 79.6% match rate between GPT-4 and human-curated data in detecting postoperative complications. Accuracy rates were 86.7% for GPT-4 and 92.9% for the human-curated data. A subgroup of 139 patients had a complication detected by both GPT-4 and the human-curated database, with Clavien-Dindo classification and category information available. In this subgroup, there was a 37.4% overall match rate for Clavien-Dindo grade and a 55.4% match rate for category.

Conclusions: GPT-4 accurately detected whether any postoperative complications were present. It struggled with the more complex task of further analyzing complications, especially Clavien-Dindo classification, which requires more critical thinking and interpretation. While GPT-4 is not yet ready for advanced postoperative complication analysis, it can still be used to support clinicians in this endeavor.
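The abstract does not give the exact formulas behind the match and accuracy rates. The following is a minimal sketch of one plausible reading: the match rate is the fraction of cases on which GPT-4 and the human-curated database agree, and each accuracy rate is the fraction of cases on which a source agrees with the label settled by manual review. The function names and the five-case toy data are hypothetical illustrations, not the study's code or data.

```python
from typing import Sequence


def match_rate(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of cases on which two label sources agree."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def accuracy(labels: Sequence[bool], reference: Sequence[bool]) -> float:
    """Fraction of cases on which a source agrees with the reviewed label.

    Assumption: the reference label comes from manual review of discrepancies.
    """
    return match_rate(labels, reference)


# Hypothetical toy data: True means "complication present".
gpt4 = [True, False, True, True, False]
human = [True, False, False, True, True]
# Assumed outcome of manual review; cases where both sources agree keep
# their shared label, and only the two discrepancies are adjudicated.
reviewed = [True, False, True, True, True]

print(f"match rate:     {match_rate(gpt4, human):.1%}")
print(f"GPT-4 accuracy: {accuracy(gpt4, reviewed):.1%}")
print(f"human accuracy: {accuracy(human, reviewed):.1%}")
```

Under this reading, a source can be wrong only on cases where the two databases disagree, which is why manual review of discrepancies alone would suffice to compute both accuracy rates.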

Full Text