Background: Several large language models (LLMs) can engage in human-level medical discussions, but their rhinoplasty knowledge has not been compared.

Objective: To compare leading LLMs in answering complex rhinoplasty consultation questions, as evaluated by plastic surgeons.

Methods: Ten open-ended rhinoplasty consultation questions were posed to four LLMs: ChatGPT-4o, Google Gemini, Claude, and Meta AI. The responses were randomized and ranked for quality by seven plastic surgeons specializing in rhinoplasty (1 = worst, 4 = best). Readability was analyzed via the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade (FKG) metrics.

Results: Claude provided the top-ranked answers for seven questions, while ChatGPT provided the top-ranked answers for the remaining three. In cumulative scoring, Claude ranked highest with 224 points, followed by ChatGPT (200), Meta (138), and Gemini (138). Claude (mean score per question: 3.20 ± 1.00) significantly outperformed all other models (p < 0.05), while ChatGPT (mean score per question: 2.86 ± 0.94) outperformed Meta and Gemini; Meta and Gemini performed similarly to each other. Meta had a significantly lower FKG than Claude and ChatGPT and a significantly lower FRE than ChatGPT.

Conclusion: According to ratings by seven plastic surgeons specializing in rhinoplasty, Claude provided the best answers to a set of complex rhinoplasty consultation questions, followed by ChatGPT. Future studies are warranted to continue comparing these models as they evolve.
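For reference, the standard Flesch formulas (the abstract does not specify the exact implementation used, so these are the conventional definitions) are, with W = total words, S = total sentences, and Y = total syllables:

\[ \mathrm{FRE} = 206.835 - 1.015\,\frac{W}{S} - 84.6\,\frac{Y}{W}, \qquad \mathrm{FKG} = 0.39\,\frac{W}{S} + 11.8\,\frac{Y}{W} - 15.59. \]

Higher FRE indicates easier-to-read text, while FKG approximates the U.S. school grade level required to understand it.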