Abstract

Background and Aims
The generative pre-trained transformer (GPT), a type of large language model (LLM), is now playing a major role in driving innovation in medical education, diagnosis and treatment, and through continuous improvement it may itself become a beacon of sustainability. Medical professionals expect that LLMs may help them diagnose patients more accurately and efficiently, but it is unclear whether current LLMs are well trained and validated on real-world clinical data. In this study, we compared the diagnostic accuracy of ChatGPT, a representative LLM, MDCalc, an online medical calculator, and a human nephrologist in interpreting acid-base gas analysis in critically ill cases.

Method
This study included 130 patients admitted to the intensive care unit with varying medical conditions. All variables were obtained during the first 24 hours after admission.

Results
Fleiss' Kappa among the interpretations of acid-base disorders by the nephrologist, ChatGPT and MDCalc was −0.138 (95% CI −0.216 to −0.059), indicating no agreement among the human doctor, the LLM and the online medical calculator in the interpretation of acid-base status in critically ill patients. MDCalc classified all patients as having mixed acid-base disorders, whereas the nephrologist judged that 4 patients had a simple acid-base disorder and 1 patient had a normal acid-base balance. By contrast, ChatGPT reported that 51 patients had only a simple acid-base disorder. Furthermore, according to ChatGPT, a normal acid-base balance was found in 27 patients, all of whom were diagnosed as having acid-base disorders by MDCalc or the nephrologist.
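The agreement statistic reported above can be sketched in a few lines. The sketch below is a minimal, from-scratch implementation of Fleiss' Kappa for three raters (nephrologist, ChatGPT, MDCalc); the category labels and the toy rating counts in the usage example are illustrative assumptions, not the study data.

```python
def fleiss_kappa(table):
    """Fleiss' Kappa for a subjects-by-categories count table.

    table[i][j] = number of raters who assigned subject i to category j
    (e.g. columns could be: mixed disorder, simple disorder, normal balance).
    """
    n_subjects = len(table)
    n_raters = sum(table[0])  # raters per subject (3 in this study)

    # Per-subject observed agreement P_i
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in table]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement P_e from overall category proportions
    totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
    grand = n_subjects * n_raters
    p_e = sum((t / grand) ** 2 for t in totals)

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 4 subjects, 3 raters, 3 categories.
toy = [[3, 0, 0],   # all three raters agree: mixed
       [3, 0, 0],   # all agree: mixed
       [1, 1, 1],   # complete disagreement
       [0, 3, 0]]   # all agree: simple
print(round(fleiss_kappa(toy), 3))  # → 0.538
```

Values near or below zero, as in this study, indicate agreement no better than chance.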
Conclusion
We found that the current ChatGPT does not yet match the diagnostic performance of an existing online medical calculator or a nephrologist in interpreting acid-base balance in critically ill patients, who usually have mixed acid-base disorders.