Abstract

Background
Paediatric cardiology presents unique challenges: diverse and complex cases, a limited evidence base, and the need for multi-expert involvement in decision-making. In this context, generative pre-trained transformer (GPT) based large language models (LLMs) offer a potential avenue for providing complex information and clinical decision support.

Purpose
This study evaluates the quality of three different GPT LLMs in answering complex medical questions, including a state-of-the-art preview model that incorporates the German paediatric cardiology guidelines.

Methods
Seven paediatric cardiologists and paediatric cardiac surgeons generated 72 questions, including complex questions and medical cases with associated questions. The questions were categorized by difficulty and by the knowledge required (factual and experience-based, or mostly experience-based). We posed the questions to three LLMs: GPT 3.5, GPT 4 and a GPT 4 turbo preview. The GPT 4 turbo preview was customized by incorporating all guidelines of the German Society for Paediatric Cardiology via a retrieval function. Using one complex instruction for all questions, we prompted the LLMs to provide precise and detailed expert-level responses. Experts rated the responses of each model for relevance, factual accuracy, severity of possible harm, completeness, superfluous content, and age-related appropriateness on a scale from 0 (very bad) to 7 (very good). Differences were assessed with the Kruskal-Wallis test in SPSS Version 28.

Results
All models performed well across the dimensions tested. The figures show the average ratings (Figure 1, Figure 2A) and highlight significant differences after Bonferroni correction in bold (Figure 2B). The GPT 4 turbo preview with guideline retrieval provided significantly more relevant (average rating [AR] 5.94, i.e. mostly relevant), more accurate (AR 5.6, i.e. between somewhat and mostly accurate) and more complete (AR 5, i.e. fairly complete) answers than GPT 3.5 and GPT 4. Ratings did not differ significantly by difficulty level or question type. Relevance ratings were slightly better for factual questions (AR 5.7) than for those requiring more experience-based knowledge (AR 5.3). Although GPT 4 scored higher on average than GPT 3.5 in all dimensions except superfluous content, the differences were not statistically significant. All models had relevant difficulties with the age-related aspects of the questions (AR 4.06-4.45, p=0.455).

Conclusion
This study highlights the potential and limitations of AI language models in addressing complex medical questions in fields characterized by highly individualized decision-making. The findings advocate for the development of more specialized AI tools in medicine, tailored to specific medical fields and patient age groups.

Fig 1: Average ratings of LLMs
Fig 2: Rating differences between LLMs
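
The Methods describe customizing the GPT 4 turbo preview with a retrieval function over the guidelines and applying one fixed complex instruction to every question. The abstract does not specify the pipeline, so the following is only a minimal, self-contained Python sketch of that general pattern: the instruction wording, the guideline excerpts, and the keyword-overlap retriever are hypothetical stand-ins for whatever retrieval mechanism and prompt the study actually used.

```python
# Minimal sketch of guideline-retrieval prompting, assuming a naive keyword-overlap
# retriever. The guideline excerpts, the retriever, and the instruction text below are
# hypothetical; the study used a GPT 4 turbo preview with a retrieval function over the
# guidelines of the German Society for Paediatric Cardiology.

SYSTEM_INSTRUCTION = (
    "You are a paediatric cardiology expert. Answer precisely, in detail and at "
    "expert level, using the retrieved guideline excerpts where applicable."
)

# Placeholder guideline chunks; in practice these would be extracted from the
# guideline documents and indexed for retrieval.
GUIDELINE_CHUNKS = [
    "Excerpt on follow-up intervals after arterial switch operation ...",
    "Excerpt on indications for pulmonary valve replacement ...",
    "Excerpt on endocarditis prophylaxis in congenital heart disease ...",
]


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the question and return the top k."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:k]


def build_prompt(question: str) -> str:
    """Assemble one prompt: fixed instruction, retrieved guideline context, question."""
    context = "\n".join(retrieve(question, GUIDELINE_CHUNKS))
    return f"{SYSTEM_INSTRUCTION}\n\nGuideline context:\n{context}\n\nQuestion:\n{question}"


print(build_prompt("When is pulmonary valve replacement indicated after repair of tetralogy of Fallot?"))
```

The assembled prompt would then be sent to the model; the actual study relied on the model's own retrieval over the uploaded guideline documents rather than a hand-built retriever like this one.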
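
The ratings were compared with the Kruskal-Wallis test in SPSS Version 28, with Bonferroni correction applied to the comparisons highlighted in Figure 2B. As an illustration only, here is an equivalent analysis sketch in Python using scipy; the rating lists are invented placeholder values rather than the study's data, and the pairwise Mann-Whitney U tests with a Bonferroni-corrected threshold stand in for whatever post-hoc procedure SPSS applied.

```python
# Illustrative re-analysis sketch; the study itself used SPSS Version 28.
# The ratings below are invented placeholder values on the 0-7 scale, not study data.
from itertools import combinations

from scipy.stats import kruskal, mannwhitneyu

ratings = {
    "GPT 3.5": [5, 4, 6, 5, 3, 5, 4, 6],
    "GPT 4": [5, 6, 6, 5, 4, 6, 5, 6],
    "GPT 4 turbo preview": [6, 6, 7, 6, 5, 7, 6, 7],
}

# Omnibus comparison of the three models (non-parametric, suited to ordinal ratings).
h_stat, p_value = kruskal(*ratings.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_value:.4f}")

# Pairwise post-hoc comparisons; with three pairs the Bonferroni-corrected
# significance threshold is 0.05 / 3.
pairs = list(combinations(ratings, 2))
alpha_corrected = 0.05 / len(pairs)
for model_a, model_b in pairs:
    u_stat, p_pair = mannwhitneyu(ratings[model_a], ratings[model_b], alternative="two-sided")
    verdict = "significant" if p_pair < alpha_corrected else "not significant"
    print(f"{model_a} vs {model_b}: U = {u_stat:.1f}, p = {p_pair:.4f} ({verdict})")
```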